Apache Lucene

Architecture and Implementation

Creating a Lucene Index: How-to

Creating an index involves two distinct processes. The first populates Lucene Documents with Fields. It is the responsibility of the search application to convert the original data (PDF, HTML, TXT, ...) into Lucene Documents made of (field, value) pairs, using an appropriate document parser (for example PDFBox, JTidy, or SAX).

Once the Lucene Documents are created, the second process is handled by the IndexWriter, which creates and maintains the index: the IndexWriter's addDocument(Document) method gathers the field values of a Lucene Document into the index.

This is one of the ways to create an IndexWriter:

 // indexdir is a java.io.File naming the directory that holds the index
 IndexWriter writer = new IndexWriter(FSDirectory.open(indexdir),
     new StandardAnalyzer(Version.LUCENE_30),
     IndexWriter.MaxFieldLength.UNLIMITED);

FSDirectory is an implementation of Directory that stores the index in a new or existing directory of the file system. Here the index files are stored in indexdir.

The Analyzer is the strategy the IndexWriter uses to analyze the fields of a Lucene Document before they are indexed. In this case we choose a general-purpose one, the StandardAnalyzer, configured for version 3.0 of Lucene. We also choose not to limit the length of a field (MaxFieldLength.UNLIMITED), so all the terms in a field are indexed.
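To make the idea of an analysis strategy concrete, here is a minimal sketch in plain Java. It does not use Lucene's Analyzer API: the class ToyAnalyzer and its stop-word list are hypothetical, and only imitate the kind of work a general-purpose analyzer does (splitting a field value into tokens, lowercasing them, dropping very common words).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy analyzer: NOT Lucene's Analyzer API, only an illustration
// of how a field value can be turned into index terms.
class ToyAnalyzer {
    // Assumed stop-word list, chosen for this example only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "to"));

    // Splits on anything that is not a letter or digit, lowercases each token,
    // and drops empty tokens and stop words.
    static List<String> analyze(String fieldValue) {
        List<String> terms = new ArrayList<>();
        for (String token : fieldValue.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Discover the Web")); // prints [discover, web]
    }
}
```

Swapping in a different analyze method changes which terms end up in the index, which is why the analyzer is passed to the IndexWriter as a separate strategy object.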

To prevent concurrent modification, a lock is used so that no other IndexWriter can open the same index directory at the same time.
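The locking idea can be sketched with a plain lock file. This is an illustration only, not Lucene's Lock or LockFactory API: the class ToyIndexLock and its methods are hypothetical. The file name write.lock is the one Lucene uses for its writer lock; everything else here is an assumption made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical lock-file scheme; NOT Lucene's Lock/LockFactory API.
class ToyIndexLock {
    private final Path lockFile;
    private boolean held = false;

    ToyIndexLock(Path indexDir) {
        this.lockFile = indexDir.resolve("write.lock");
    }

    // Tries to take the lock; returns false if another writer already holds it.
    boolean obtain() {
        try {
            Files.createFile(lockFile); // atomic: fails if the file already exists
            held = true;
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Releases the lock (only if this instance holds it).
    void release() {
        if (!held) {
            return;
        }
        try {
            Files.deleteIfExists(lockFile);
            held = false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Two "writers" compete for the same temporary index directory.
    static boolean[] demo() {
        try {
            Path indexDir = Files.createTempDirectory("toy-index");
            ToyIndexLock first = new ToyIndexLock(indexDir);
            ToyIndexLock second = new ToyIndexLock(indexDir);
            boolean firstGotIt = first.obtain();      // first writer takes the lock
            boolean secondRefused = second.obtain();  // second writer is refused
            first.release();
            boolean secondGotIt = second.obtain();    // lock is free again
            second.release();
            Files.deleteIfExists(indexDir);
            return new boolean[] {firstGotIt, secondRefused, secondGotIt};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints true false true
    }
}
```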

The next step is to delve into each component of this process. Let's start with the Lucene Document.

The Lucene Document

A Lucene Document is a set of Fields. A Field comprises a name and one or more values. The name is usually a word (of type String) describing the field; content, path, name, and date of creation are examples of field names. The value is the text of that field.

  • "discover the web" is an example of a value for the field content.
  • "/document and settings/2012/index" is an example of a value for the field path.
  • "2010-09-22T00:00:00Z" is an example of a date value.
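The structure described above (a document as a set of named fields, each with one or more values) can be modelled in a few lines of plain Java. ToyDocument is a hypothetical class written for this illustration and is not Lucene's Document class; the field name date is also an assumption, since the text does not name the date field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a document as a set of named fields;
// NOT Lucene's Document or Field classes.
class ToyDocument {
    // Field name -> values, kept in insertion order. A field may hold several values.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // Adds one value to the named field, creating the field if needed.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns all values of the named field, or an empty list if it is absent.
    List<String> getValues(String name) {
        return fields.getOrDefault(name, Collections.<String>emptyList());
    }

    // Builds a document from the example values used in the text.
    static ToyDocument example() {
        ToyDocument doc = new ToyDocument();
        doc.add("content", "discover the web");
        doc.add("path", "/document and settings/2012/index");
        doc.add("date", "2010-09-22T00:00:00Z"); // "date" is an assumed field name
        return doc;
    }

    public static void main(String[] args) {
        ToyDocument doc = example();
        System.out.println(doc.getValues("content")); // prints [discover the web]
    }
}
```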

A Lucene Document is used in three ways:

  • As a logical representation of the original documents (TXT, PDF, HTML, ...) provided by the document parser.
  • As a part of the index that holds the stored (Lucene) fields. The following components of Lucene Documents are usually stored in the index:
    • Terms of each field: these build the term dictionary.
    • Document identification numbers (DocIds): these numbers are automatically incremented when a new Lucene Document is added. They are used as pointers to the Lucene Documents they represent.
  • As a representation of the result of a query: the field values matching the query are gathered and displayed by the search application as the result.
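The three roles listed above can be seen together in a small in-memory sketch. ToyIndex is hypothetical and does not reflect Lucene's index format or API: it assumes whitespace tokenization, keeps stored fields in a list whose positions serve as DocIds, and keeps a term dictionary that maps each field:term pair to the DocIds containing it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory index; NOT Lucene's index format or API.
class ToyIndex {
    // Stored field values. A document's position in this list is its DocId.
    private final List<Map<String, String>> storedDocs = new ArrayList<>();
    // Term dictionary: "field:term" -> DocIds of the documents containing it.
    private final Map<String, List<Integer>> termDictionary = new TreeMap<>();

    // Adds a document (field name -> value) and returns its DocId.
    int addDocument(Map<String, String> fields) {
        int docId = storedDocs.size(); // DocIds increase by one per added document
        storedDocs.add(new HashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String text = field.getValue().toLowerCase(Locale.ROOT);
            for (String token : text.split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> postings = termDictionary.computeIfAbsent(
                        field.getKey() + ":" + token, k -> new ArrayList<>());
                // Record each DocId at most once per term.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return docId;
    }

    // Returns the stored value of resultField for every document whose
    // given field contains the term.
    List<String> search(String field, String term, String resultField) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        List<String> results = new ArrayList<>();
        for (int docId : termDictionary.getOrDefault(key, Collections.<Integer>emptyList())) {
            results.add(storedDocs.get(docId).get(resultField));
        }
        return results;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> doc = new HashMap<>();
        doc.put("content", "discover the web");
        doc.put("path", "/document and settings/2012/index");
        int docId = index.addDocument(doc);
        System.out.println(docId); // prints 0
        System.out.println(index.search("content", "web", "path"));
    }
}
```

The term dictionary answers "which DocIds contain this term", and the DocId then points back to the stored fields that the application displays as the result.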