Apache Lucene

Architecture and Implementation

Welcome

The purpose of this web site is to share the results of my thesis, whose subject was to study the architecture and implementation of Apache Lucene.

At the beginning of this project, D.k.d Internet Services, a Typo3 web development company in Frankfurt, Germany, had the idea of creating a better search engine extension for Typo3. In the process they came across Solr, a Lucene-based web search application. Together with my supervisor, Dr. B. Renz, we decided to study the internal architecture of Apache Lucene in order to make better use of its components.

I first developed a small search engine based on Lucene, named SeboL, to help me explore the Lucene components in depth. Delving into Lucene indexing was quite a difficult task because of the complexity of the library. In the long run it was possible to illustrate the internal architecture of the following Lucene components: Field, Lucene Document, Lucene Analysis, the index writing mechanism, the decorator pattern used by the Analyzer, the Lucene index file formats and the structure of a Lucene Query object.

On this website I'll point out the important schemas of those components and how they interact.

The SeboL search engine is also available for download in the Download section.

What is Apache Lucene?

Lucene was developed in 1998 by Doug Cutting and published on SourceForge as an open source project. Lucene is not an abbreviation but the middle name of Doug Cutting's wife. Since 2001, Lucene has been part of the Apache Software Foundation and is called Apache Lucene.

According to its founder, Doug Cutting, 'Apache Lucene is a software library for full-text search. It's not an application but rather a technology that can be incorporated in applications.'

Here is a short video of Doug Cutting talking about the future of search: Lucene, Nutch and Hadoop.

 

An application based on Lucene may be composed of the following components:

  • A data pool, which holds all kinds of documents such as PDF, HTML, XML, ...
  • Lucene Documents
  • An index
  • An index search implementation
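
To make the four components above more concrete, here is a minimal sketch of how they could fit together. It is a toy written in plain Python and does not use Lucene's Java API. All names in it (DATA_POOL, to_document, analyze, build_index, search) and the sample file contents are hypothetical and chosen only for illustration. It assumes the index is a simple inverted index that maps each term to the documents containing it, which is the general idea behind a full-text index and not a description of Lucene's actual file formats.

```python
import re
from collections import defaultdict

# Data pool: raw content keyed by a (hypothetical) file name.
DATA_POOL = {
    "a.html": "<p>Lucene is a search library</p>",
    "b.xml": "<note>Solr is built on Lucene</note>",
    "c.txt": "Typo3 is a content management system",
}


def to_document(name, raw):
    """Turn raw content into a flat 'document' made of named fields."""
    text = re.sub(r"<[^>]+>", " ", raw)  # strip markup tags
    return {"path": name, "contents": text}


def analyze(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for term in analyze(doc["contents"]):
            index[term].add(doc_id)
    return index


def search(index, documents, query):
    """Return the paths of the documents that contain every query term."""
    terms = analyze(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(documents[i]["path"] for i in hits)


# Data pool -> documents -> index -> search.
documents = [to_document(n, r) for n, r in sorted(DATA_POOL.items())]
index = build_index(documents)
print(search(index, documents, "lucene"))
```

Each function stands for one component of the list: to_document plays the role of building documents from the data pool, build_index the role of the index, and search the role of the index search implementation. A real Lucene application would delegate these steps to the library instead of implementing them by hand.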