Lucene Highlighter Tutorial with Example

This post explains how to implement search-term highlighting with Apache Lucene 5.1, along with example code. Users want to find what they are looking for in as little time as possible, so techniques that help them scan results quickly matter for a good search experience. Highlighting is one of those techniques.

The Highlighter performs two functions:
  1. It makes the terms that were part of the user's query bold in the search results, so that the user can identify them and review the results quickly.
  2. If the document text is long, the Highlighter also selects the best fragment of text that contains the search keywords, so that the user can read 2-3 lines of the document and decide whether exploring the link further would help.

If you use Google, you must have noticed that it highlights the query keywords in bold and also selects a particular fragment of text from the description it stores for each page.

For example, for a query with three terms (java, inheritance and bitspedia), Google highlights these terms in the URL and description of the result. In this article we want to achieve the same functionality using the Lucene search engine library.


Lucene Indexing Process

To search something using Apache Lucene, we first need to create an index of the data. We then run the search operation on that index. So let's first create an index of some data:

package com.bitspedia.lucene.highlighter;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.File;

public class Indexer {
    private IndexWriter indexWriter;

    public Indexer(String indexerDirectoryPath) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        indexWriterConfig.setOpenMode(OpenMode.CREATE); // Overwrite any existing index

        File indexFile = new File(indexerDirectoryPath);
        Directory directory = FSDirectory.open(indexFile.toPath());
        indexWriter = new IndexWriter(directory, indexWriterConfig);
    }

    public int createIndex() throws Exception {
        String[] titles = {"Lucene In Action", "Hibernate In Action", "Java In Action", "Action Script",
                "Action that Changed the World", "How To Java", "How To C++", "Anroid In Action"};

        for (String title : titles) {
            Document document = new Document();
            document.add(new TextField("title", title, Store.YES)); // Analyzed and stored
            indexWriter.addDocument(document);
        }
        indexWriter.commit();
        return indexWriter.numDocs();
    }

    public void close() throws Exception {
        indexWriter.close();
    }
}


The constructor instantiates the IndexWriter object that is used to create the index. The Analyzer turns the given text into the right tokens, or keywords. Without an Analyzer, IndexWriter can't create the index. For example, the index at the end of a book contains the keywords used in the book, so keyword identification is required before the indexing process.

Apache Lucene provides different types of Analyzers and a mechanism to plug in custom ones. StandardAnalyzer extracts tokens from the text, lowercases them, and eliminates common words (stop words) and punctuation. This makes StandardAnalyzer very helpful for common search cases.
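
To make the analysis step more concrete, here is a small plain-Java sketch of what an analyzer conceptually does: split text into tokens, lowercase them and drop stop words. This is not Lucene's StandardAnalyzer; the ToyAnalyzer class and its short stop-word list are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; it does not use any Lucene class.
final class ToyAnalyzer {

    // Assumed sample of stop words, much shorter than a real analyzer's list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "to", "that"));

    // Splits on anything that is not a letter or digit, lowercases, drops stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [action, changed, world]
        System.out.println(analyze("Action that Changed the World"));
    }
}
```

The same normalization has to be applied to the query string as well, which is why the search code later creates its QueryParser with the same kind of analyzer.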

The createIndex method actually creates the index using indexWriter and the data (given in the form of Document objects). Document is a class provided by Lucene; we create Document objects and pass them to the indexWriter object. Each Document consists of one or more fields. I have added only one TextField to keep the example simple. Later we will create the Indexer object and invoke its createIndex method to create the index. Here is how we do the indexing using the code created above:


package com.bitspedia.lucene.highlighter;

public class LuceneHighlighter {

    private static final String INDEX_DIRECTORY_PATH = "D:\\Lucene\\Lucene Highlighter";

    public void createIndex() throws Exception {
        Indexer indexer = new Indexer(INDEX_DIRECTORY_PATH);
        Integer maxDoc = indexer.createIndex(); // Returns total documents indexed
        System.out.println("Index Created, total documents indexed: " + maxDoc);
        indexer.close(); // Close index writer
    }
}
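
Conceptually, the index built by createIndex works like the index at the end of a book: it maps each keyword to the documents that contain it. The following plain-Java sketch shows that idea with a simple map. ToyInvertedIndex is a hypothetical class for illustration only and says nothing about how Lucene stores its index on disk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory index: term -> IDs of the documents containing it.
final class ToyInvertedIndex {

    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Adds a title and returns the ID assigned to it (0, 1, 2, ...).
    int add(String title) {
        int docId = nextDocId++;
        Set<String> seen = new HashSet<>();
        for (String raw : title.split("[^\\p{L}\\p{N}]+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            // Record each distinct term of the title once.
            if (!term.isEmpty() && seen.add(term)) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the IDs of all documents that contain the given term.
    List<Integer> lookup(String term) {
        List<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<Integer>emptyList() : Collections.unmodifiableList(ids);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Lucene In Action");
        index.add("Java In Action");
        index.add("How To Java");
        System.out.println("action -> " + index.lookup("action"));
        System.out.println("java   -> " + index.lookup("java"));
    }
}
```

Looking up a term is a single map access, which is why searching an index is much faster than scanning every title for every query.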

Lucene Search Process

Let's make a Search component that we can use to search keywords on the index created above. The primary class used to search the index is IndexSearcher. We instantiate it with an IndexReader opened on INDEX_DIRECTORY_PATH. Then we can search the information stored in that index using keywords. The code below creates the IndexSearcher and exposes two methods: search (to search) and getDocument (to retrieve a specific document by ID).

package com.bitspedia.lucene.highlighter;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.File;

public class Searcher { 

 private IndexSearcher indexSearcher;
 
 public Searcher(String indexerDirectoryPath) throws Exception {
  
        File indexFile = new File(indexerDirectoryPath);
        Directory directory = FSDirectory.open(indexFile.toPath());
        IndexReader indexReader = DirectoryReader.open(directory);
        indexSearcher = new IndexSearcher(indexReader);
 } 

 public TopDocs search(Query query, int n) throws Exception {
        return indexSearcher.search(query, n);
 }

 public Document getDocument(int docID) throws Exception {
        return indexSearcher.doc(docID); // Returns the stored document with the given ID
 }
}

The Searcher class constructor instantiates the IndexSearcher object on the index we created earlier. The search method receives a Query and an integer parameter that represents the maximum number of documents to retrieve. Document IDs are returned along with their relevance scores, but not the actual Documents. The "doc" method, which takes a Document ID, is used to retrieve the actual Document.
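
The two-step pattern described above (search returns IDs with scores, a second call fetches the document) can be sketched in plain Java as follows. ToySearch and its Hit class are hypothetical, and the score used here is simply the number of distinct query terms found in a title, which is an assumption for illustration and not Lucene's relevance formula.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: search returns only (docId, score) pairs, best first.
final class ToySearch {

    static final class Hit {
        final int docId;
        final int score;

        Hit(int docId, int score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Lowercased distinct terms of a text.
    private static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) {
                result.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    // Returns at most n hits, ordered by score (highest first).
    static List<Hit> search(List<String> docs, String query, int n) {
        Set<String> queryTerms = terms(query);
        List<Hit> hits = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            Set<String> matched = terms(docs.get(docId));
            matched.retainAll(queryTerms);
            if (!matched.isEmpty()) {
                hits.add(new Hit(docId, matched.size()));
            }
        }
        // The sort is stable, so equally scored hits keep their document order.
        hits.sort((a, b) -> Integer.compare(b.score, a.score));
        int limit = Math.min(Math.max(n, 0), hits.size());
        return new ArrayList<>(hits.subList(0, limit));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Lucene In Action", "Java In Action", "How To Java", "How To C++");
        for (Hit hit : search(docs, "java action", 2)) {
            // Second step: fetch the document itself by its ID.
            System.out.println(hit.score + "  " + docs.get(hit.docId));
        }
    }
}
```

Returning only IDs and scores keeps the search step cheap; the stored fields are loaded only for the hits that are actually displayed.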

Let's prepare the IndexSearcher by adding another method to the LuceneHighlighter class:



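package com.bitspedia.lucene.highlighter;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class LuceneHighlighter {

    // INDEX_DIRECTORY_PATH and createIndex() stay as shown earlier

    // Maximum number of hits to return; comfortably more than the 8 indexed titles
    private static final int MAX_HITS = 32;

    private Searcher searcher;

    public void searchIndex(String searchQuery) throws Exception {
        searcher = new Searcher(INDEX_DIRECTORY_PATH);
        Analyzer analyzer = new StandardAnalyzer();
        QueryParser queryParser = new QueryParser("title", analyzer);
        Query query = queryParser.parse(searchQuery);

        TopDocs topDocs = searcher.search(query, MAX_HITS);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;

        for (ScoreDoc scoreDoc : scoreDocs) {
            Document document = searcher.getDocument(scoreDoc.doc);
            String title = document.get("title");
            System.out.println(title);
        }
    }
}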

In the code above, I created the Searcher object by passing INDEX_DIRECTORY_PATH. The QueryParser converts the query string into a format Lucene understands. Three pieces of information are important from a querying perspective:

1. The Analyzer object, so that Lucene can analyze the query string
2. The name of the field to search, in our case "title"
3. The actual query, for example "java action"

We pass this query to the searcher's search method, which returns the TopDocs. By default, Apache Lucene sorts the returned results by relevance. We can change the sort order, which we will see in a different article. So far the code searches the titles but does not highlight the search keywords. Now we are ready to discuss the core objective, i.e. how to use the Lucene Highlighter.


Lucene Highlighter

Let's add another method, searchAndHighLightKeywords, to our LuceneHighlighter class. It uses scoreDocs, the query and other Lucene components that provide highlighting. Let's first look at the code sample; then I will explain how it works and the purpose of the different classes used:

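    // Additional imports needed in LuceneHighlighter for this method:
    // java.io.File, org.apache.lucene.analysis.TokenStream,
    // org.apache.lucene.index.DirectoryReader, org.apache.lucene.index.IndexReader,
    // org.apache.lucene.store.Directory, org.apache.lucene.store.FSDirectory,
    // and from org.apache.lucene.search.highlight: Fragmenter, Highlighter,
    // QueryScorer, SimpleSpanFragmenter, TokenSources
    public void searchAndHighLightKeywords(String searchQuery) throws Exception {

        // STEP A
        Analyzer analyzer = new StandardAnalyzer();
        QueryParser queryParser = new QueryParser("title", analyzer);
        Query query = queryParser.parse(searchQuery);
        QueryScorer queryScorer = new QueryScorer(query, "title");
        Fragmenter fragmenter = new SimpleSpanFragmenter(queryScorer);

        Highlighter highlighter = new Highlighter(queryScorer); // Scores fragments against the query terms
        highlighter.setTextFragmenter(fragmenter); // Splits the text into candidate fragments

        // STEP B
        File indexFile = new File(INDEX_DIRECTORY_PATH);
        Directory directory = FSDirectory.open(indexFile.toPath());
        IndexReader indexReader = DirectoryReader.open(directory);
        searcher = new Searcher(INDEX_DIRECTORY_PATH); // searchIndex may not have run, so create the searcher here

        // STEP C
        System.out.println("");
        int maxHits = Math.max(1, indexReader.maxDoc()); // search() requires a positive hit count
        ScoreDoc[] scoreDocs = searcher.search(query, maxHits).scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            Document document = searcher.getDocument(scoreDoc.doc);
            String title = document.get("title");
            TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader,
                    scoreDoc.doc, "title", document, analyzer);
            String fragment = highlighter.getBestFragment(tokenStream, title);
            System.out.println(fragment);
        }
        indexReader.close();
    }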

First, you must understand that the Highlighter not only highlights keywords but also selects the best text fragment if the field value (e.g. "title") is large. In STEP A, the query object is created as already discussed. A Scorer scores a stream of tokens; the QueryScorer implementation scores text fragments by the number of unique query terms found in them. Then I create a Fragmenter that breaks the text into multiple fragments for the Highlighter to consider; the Highlighter later chooses the best fragment to show. Then I create the Highlighter object from the scorer and set the Fragmenter on it. So the takeaway is that the Highlighter chooses the best text fragment to show and also highlights the keywords in that fragment.
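
The keyword-marking half of this job can be illustrated with a few lines of plain Java. The sketch below wraps every word that matches a query term in <B> tags, the same tags that appear in the output of this article. ToyHighlighter is a hypothetical class and does not reproduce how Lucene's Highlighter works internally (it uses no token stream and no scorer).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: wrap words that match a query term in <B>...</B>.
final class ToyHighlighter {

    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // queryTerms are expected in lowercase; matching ignores case.
    static String highlight(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = WORD.matcher(text);
        int last = 0;
        while (matcher.find()) {
            out.append(text, last, matcher.start()); // copy text between words unchanged
            String word = matcher.group();
            if (queryTerms.contains(word.toLowerCase(Locale.ROOT))) {
                out.append("<B>").append(word).append("</B>");
            } else {
                out.append(word);
            }
            last = matcher.end();
        }
        out.append(text, last, text.length()); // copy whatever follows the last word
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("java", "action"));
        // Prints <B>Java</B> In <B>Action</B>
        System.out.println(highlight("Java In Action", terms));
    }
}
```

Note that the original casing of the text is preserved; only the comparison is done on lowercased words, mirroring the fact that the analyzer lowercases tokens while the stored title stays untouched.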

In STEP B, I create the IndexReader, which is used to read the index, as explained in the Searcher section above.

In STEP C, we do the actual work. I use the indexReader, the document ID from scoreDoc, the field name, the document and the analyzer to create the token stream, which the Highlighter uses (in addition to the actual content) to identify the best fragment of text. Here is the code snippet that runs the whole process.



public static void main(String[] args) throws Exception {
        String searchQuery = "java action";
        LuceneHighlighter luceneHighlighter = new LuceneHighlighter();
        luceneHighlighter.createIndex();
        //luceneHighlighter.searchIndex(searchQuery); // without highlight functionality
        luceneHighlighter.searchAndHighLightKeywords(searchQuery);
    }
Here is the output:

<B>Java</B> In <B>Action</B>
How To <B>Java</B>
Lucene In <B>Action</B>
Hibernate In <B>Action</B>
<B>Action</B> Script
Anroid In <B>Action</B>
<B>Action</B> that Changed the World

So it has added <B> tags around the searched keywords. To see how fragments work, I changed the title "Lucene In Action" to "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, Lucene In Action quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat".

You will notice that it selects a specific fragment from the title that contains the searched keywords. The output of the highlighter looks like this:



<B>Java</B> In <B>Action</B>
How To <B>Java</B>
Hibernate In <B>Action</B>
<B>Action</B> Script
Anroid In <B>Action</B>
<B>Action</B> that Changed the World
 et dolore magna aliqua. Ut enim ad minim veniam, Lucene In <B>Action</B> quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat
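
The fragment selection visible in the last output line can be sketched in plain Java too. The version below cuts the text into windows of a fixed number of words and keeps the window that contains the most distinct query terms. This is a deliberately simplified assumption: ToyFragmenter and its wordsPerFragment parameter are hypothetical, and Lucene's own fragmenters decide fragment boundaries differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: pick the word window with the most distinct query terms.
final class ToyFragmenter {

    // queryTerms are expected in lowercase. On a tie, the earliest window wins.
    static String bestFragment(String text, Set<String> queryTerms, int wordsPerFragment) {
        if (wordsPerFragment < 1) {
            throw new IllegalArgumentException("wordsPerFragment must be positive");
        }
        String[] words = text.trim().split("\\s+");
        String best = "";
        int bestScore = -1;
        for (int start = 0; start < words.length; start += wordsPerFragment) {
            int end = Math.min(start + wordsPerFragment, words.length);
            Set<String> found = new HashSet<>();
            for (int i = start; i < end; i++) {
                // Strip punctuation and lowercase before comparing with the query terms.
                String term = words[i].replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase(Locale.ROOT);
                if (queryTerms.contains(term)) {
                    found.add(term);
                }
            }
            if (found.size() > bestScore) {
                bestScore = found.size();
                best = String.join(" ", Arrays.copyOfRange(words, start, end));
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("java", "action"));
        String text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
                + "Lucene In Action quis nostrud exercitation";
        System.out.println(bestFragment(text, terms, 4));
    }
}
```

A real highlighter would then mark the matching words inside the chosen fragment, which is the second half of the Highlighter's job described in this article.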

Comments

  1. The code throws a ClassNotFoundException in the searchAndHighLightKeywords() method.
