Apache Lucene

Architecture and Implementation

Architecture overview

Any application that uses Apache Lucene must first transform its original data into Lucene Documents. This requires a Document Handler interface, which is provided by the Lucene contribution library. The Document Handler interface extracts information such as textual content, numbers and metadata from the original documents and provides it as Lucene Documents. These are used for further processing during indexing and search.

Each common document type, such as HTML, PDF or XML, needs a specific document parser to extract its contents. Document parsers are not part of the Apache Lucene core, but they are freely available on the web. Examples include JTidy (an HTML parser), PDFBox (a PDF parser) and SAX (an XML parsing API).
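The handler idea described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the Lucene contribution library's interface: the names ToyDocumentHandler and ToyHtmlHandler, the field names "title" and "contents", and the regex-based tag stripping are all hypothetical, and a real application would delegate to a proper parser such as JTidy.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical handler contract: turn raw content into named fields.
// This is NOT the Document Handler interface from the Lucene contribution library.
interface ToyDocumentHandler {
    Map<String, String> toDocument(String rawContent);
}

// Naive handler for very simple HTML, for illustration only.
final class ToyHtmlHandler implements ToyDocumentHandler {
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    @Override
    public Map<String, String> toDocument(String rawContent) {
        Map<String, String> fields = new LinkedHashMap<>();

        // Metadata: the text of the first <title> element, or "" if there is none.
        Matcher m = TITLE.matcher(rawContent);
        fields.put("title", m.find() ? m.group(1).trim() : "");

        // Contents: drop the title element, strip remaining tags, collapse whitespace.
        String body = TITLE.matcher(rawContent).replaceAll(" ");
        body = body.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
        fields.put("contents", body);
        return fields;
    }
}
```

Each document type would get its own implementation of the same interface, so the indexing code only ever sees the extracted fields.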

Once a Lucene Document has been created, the IndexWriter is the component in charge of analyzing Lucene Documents and storing them in the index. This is done according to particular attributes. The IndexWriter uses one or more Analyzers as a strategy for index writing.

The analyzer strips content that is useless for search from Lucene Documents, such as whitespace, hyphens and stop words, along with more depending on the chosen analyzer(s). At the end of the analysis process, a Lucene Document is broken into terms (also called tokens) that are used for search.
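To make the analysis step concrete, here is a minimal sketch in plain Java. It is a toy stand-in, not Lucene's Analyzer class: the class name ToyAnalyzer and the stop-word list are assumptions made for this example, and real analyzers apply considerably more rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer; NOT Lucene's Analyzer class.
final class ToyAnalyzer {
    // Hypothetical stop-word list chosen for this example only.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "is", "of", "the", "to");

    // Splits text on anything that is not a letter or digit (so whitespace and
    // hyphens disappear), lowercases each piece, and drops stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

For example, ToyAnalyzer.analyze("The full-text search of Lucene") yields the terms full, text, search and lucene.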

To search the index, the user provides a human-readable expression called a query string. The QueryParser transforms it into an object of type Query. The Query object is analyzed and then passed to the IndexSearcher.
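The step from query string to query object can be sketched as follows. This is a simplified illustration, not Lucene's QueryParser or Query: the class ToyQuery is hypothetical, and the only syntax it understands is a leading "+" for required terms and a leading "-" for prohibited terms, which is an assumption loosely modeled on common search query notation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy query object; NOT Lucene's Query class.
final class ToyQuery {
    final List<String> required = new ArrayList<>();   // terms prefixed with '+'
    final List<String> prohibited = new ArrayList<>(); // terms prefixed with '-'
    final List<String> optional = new ArrayList<>();   // terms without a prefix

    // Parses a whitespace-separated query string into a ToyQuery.
    static ToyQuery parse(String queryString) {
        ToyQuery query = new ToyQuery();
        for (String raw : queryString.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // happens only for an empty query string
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (token.startsWith("+") && token.length() > 1) {
                query.required.add(token.substring(1));
            } else if (token.startsWith("-") && token.length() > 1) {
                query.prohibited.add(token.substring(1));
            } else {
                query.optional.add(token);
            }
        }
        return query;
    }
}
```

The point of the intermediate object is that the searcher works on structured terms rather than on the raw text typed by the user.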

The IndexSearcher is the core of the search process. It uses the IndexReader to access the index and retrieve all the terms that match the terms in the Query object, and it returns the hits and TopDocs as the search result. A Filter can be used to permit or prohibit one or more terms in the search results.
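The interplay of indexing and searching can be illustrated with a small inverted index in plain Java. This is a sketch under stated assumptions, not Lucene's IndexWriter, IndexReader or IndexSearcher: the class ToyIndex is hypothetical, it keeps everything in memory, it requires every query term to match, and it does no scoring.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index; NOT Lucene's IndexWriter/IndexSearcher.
final class ToyIndex {
    // term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Indexing: split the text into lowercase terms and record the
    // document id under each term. Returns the id assigned to the document.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Searching: return the sorted ids of documents that contain every query term.
    List<Integer> search(List<String> queryTerms) {
        Set<Integer> result = null;
        for (String term : queryTerms) {
            Set<Integer> docs =
                    postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs); // first term: start from its postings
            } else {
                result.retainAll(docs);       // later terms: intersect
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }
}
```

Looking terms up in the postings map instead of scanning every document is what makes searching the index fast compared with searching the original files.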

 

What can Apache Lucene do?
  • Build an index from data of different types. This process is called indexing.
  • Parse user query strings
  • Search for occurrences of query terms within the index
  • Compute statistics over indexed documents
  • Process management: the application programmer has to select the components of the Apache Lucene library that suit the application's needs.
  • File selection: the programmer decides which files are to be indexed, for example PDF, HTML or XML files
  • Display of the search query

 
