Any application that uses Apache Lucene must first transform its original data into Lucene Documents. This requires a Document Handler interface, which is provided by the Lucene contribution library. The Document Handler interface extracts information such as textual content, numbers, and metadata from the original documents and provides it in the form of Lucene Documents. These are then used for further processing during indexing and search.
Each common document type, such as HTML, PDF, or XML, needs a specific document parser to extract its contents. Document parsers are not part of the Apache Lucene core, but they are freely available on the web. Examples include JTidy, an HTML parser; PDFBox, a parser for PDF documents; and SAX, an XML parsing API.
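The role of a document handler can be illustrated with a small sketch. It is written in Python with the standard library only and is not the Lucene API: the `Document` class, the `HtmlDocumentHandler` class, and the field names `title` and `contents` are all hypothetical, and Python's built-in `html.parser` stands in for a parser such as JTidy.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Document:
    """Toy stand-in for a Lucene Document: a bag of named fields."""
    fields: dict = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collects the <title> text and all other text separately."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace-only chunks
        target = self.title_parts if self.in_title else self.body_parts
        target.append(text)


class HtmlDocumentHandler:
    """Hypothetical handler: turns raw HTML into a Document."""

    def get_document(self, html: str) -> Document:
        extractor = _TextExtractor()
        extractor.feed(html)
        extractor.close()
        return Document(fields={
            "title": " ".join(extractor.title_parts),
            "contents": " ".join(extractor.body_parts),
        })


doc = HtmlDocumentHandler().get_document(
    "<html><head><title>Lucene intro</title></head>"
    "<body><p>Indexing and <b>search</b></p></body></html>"
)
print(doc.fields)
```

A handler for another format would expose the same `get_document` method and differ only in the parser it delegates to, which is why one handler per document type is enough.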
Once a Lucene Document has been created, the IndexWriter takes over: it is in charge of analyzing Lucene Documents and storing them in the index. This is done according to the attributes set on each field, such as whether the field is stored or tokenized. The IndexWriter uses one or more Analyzers as a Strategy for index writing.
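The idea of a writer that takes an analyzer as a pluggable strategy can be sketched as follows. This is a toy in-memory model in Python, not Lucene's IndexWriter: the names `ToyIndexWriter`, `simple_analyzer`, `postings`, and `stored` are assumptions made for illustration, and real field attributes are ignored.

```python
from collections import defaultdict


def simple_analyzer(text):
    """Assumed analysis strategy: lowercase and split on whitespace."""
    return text.lower().split()


class ToyIndexWriter:
    """Hypothetical writer: analyzes documents and keeps postings in memory."""

    def __init__(self, analyzer):
        self.analyzer = analyzer            # pluggable strategy
        self.postings = defaultdict(set)    # (field, term) -> set of doc ids
        self.stored = []                    # original field values per doc id

    def add_document(self, fields):
        doc_id = len(self.stored)
        self.stored.append(dict(fields))
        for name, value in fields.items():
            for term in self.analyzer(value):
                self.postings[(name, term)].add(doc_id)
        return doc_id


writer = ToyIndexWriter(simple_analyzer)
writer.add_document({"title": "Lucene in Action", "contents": "Indexing with Lucene"})
writer.add_document({"title": "Search basics", "contents": "Lucene search and filters"})
print(sorted(writer.postings[("contents", "lucene")]))  # [0, 1]
```

Passing the analyzer to the constructor means the same writer can index with a different analysis strategy without any change to `add_document`.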
The analyzer strips Lucene Documents of useless content such as whitespace, hyphens, stop words, and more, depending on the chosen analyzer(s). At the end of the analysis process, a Lucene Document has been broken into tokens (also called terms) that are used for search.
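A minimal sketch of such an analysis step is given here in Python. It only approximates what an analyzer does; the `analyze` function and the short `STOP_WORDS` set are assumptions chosen for the example, not the behavior of any particular Lucene analyzer.

```python
import re

# Assumed, illustrative stop-word list
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "to", "in"}


def analyze(text, stop_words=STOP_WORDS):
    """Lowercase, split on anything that is not a letter or digit, drop stop words."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if t and t not in stop_words]


print(analyze("The state-of-the-art search engine is Lucene."))
# ['state', 'art', 'search', 'engine', 'lucene']
```

Whitespace, hyphens, and punctuation disappear because they act as token separators, while stop words are removed in a second filtering pass.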
To search the index, the user has to provide a human-readable expression called a query string. The QueryParser transforms it into an object of type Query. The Query object has to be analyzed and is then passed to the IndexSearcher.
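The transformation from query string to query object can be sketched with a toy parser. The Python code below handles only a tiny assumed syntax (`word`, `field:word`, and `-word` for prohibited clauses); the classes `TermQuery` and `BooleanQuery` and the function `parse_query` are hypothetical stand-ins and do not reproduce the grammar of Lucene's QueryParser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:
    field: str
    term: str


@dataclass(frozen=True)
class BooleanQuery:
    must: tuple       # clauses that have to match
    must_not: tuple   # clauses that must not match


def parse_query(query_string, default_field="contents"):
    """Parse space-separated 'word', 'field:word' and '-word' clauses."""
    must, must_not = [], []
    for clause in query_string.split():
        prohibited = clause.startswith("-")
        clause = clause.lstrip("-")
        field_name, sep, term = clause.partition(":")
        if not sep:                       # no explicit field given
            field_name, term = default_field, clause
        term = term.lower()               # minimal analysis of the query term
        if not term:
            continue                      # ignore empty clauses such as "-"
        target = must_not if prohibited else must
        target.append(TermQuery(field_name, term))
    return BooleanQuery(tuple(must), tuple(must_not))


query = parse_query("title:Lucene search -deprecated")
print(query)
```

Lowercasing the term inside the parser mirrors the requirement that the query be analyzed in the same way as the indexed text, otherwise query terms and index terms would not line up.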
The IndexSearcher is the core of the search process. It uses the IndexReader to access the index and to retrieve all the terms that match the terms in the Query object, and it returns the hits and TopDocs as the search result. A Filter can be used to permit or prohibit one or more terms in the search results.
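The search step can be sketched in the same toy style. The Python code below assumes a pre-built in-memory index (`INDEX`) and a deliberately simple score, the summed term frequency, which is not Lucene's scoring formula. The filter is modeled as a predicate on document ids, which is also an assumption made for brevity.

```python
from collections import Counter

# Assumed pre-built index: term -> {doc id: term frequency}
INDEX = {
    "lucene": {0: 2, 1: 1, 2: 1},
    "search": {1: 3, 2: 1},
    "index": {0: 1},
}


def search(index, terms, top_n=10, doc_filter=None):
    """Return (doc_id, score) pairs, best first; score = summed term frequency."""
    scores = Counter()
    for term in terms:
        for doc_id, freq in index.get(term, {}).items():
            if doc_filter is None or doc_filter(doc_id):
                scores[doc_id] += freq
    # Sort by score descending, then by doc id ascending for a stable order
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


hits = search(INDEX, ["lucene", "search"])
filtered = search(INDEX, ["lucene", "search"], doc_filter=lambda d: d != 1)
print(hits)      # [(1, 4), (0, 2), (2, 2)]
print(filtered)  # [(0, 2), (2, 2)]
```

The `top_n` cut-off plays the role of the top documents returned to the caller, and the optional filter shows how results can be restricted without changing the query itself.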