Apache Solr More Like This Query with Parameters

There are four ways to use MoreLikeThis in Apache Solr:
  1. As a search component
  2. As a request handler
  3. As a request handler with externally supplied text
  4. As a query parser

Here I use the fourth approach. Solr 5 includes a query parser named mlt, which can be combined with other queries or relevancy boosting more easily than the other options.

Before starting, there are a few terms whose basics we need to know.

qf = Specifies the fields to use for similarity. In my case I use qf=title (demonstrated below). NOTE: it is important to enable the termVectors option on each field you want to run the MoreLikeThis search against. (This option is available in Solr Admin among the options offered when you add a field.)

mintf = Specifies the Minimum Term Frequency: the frequency below which terms in the source document are ignored. In simple words, it is the minimum number of times a term must occur in the source document (across the field(s) listed in qf) to count as an interesting term. Here an "interesting term" is a term taken from the source document's qf field(s) that is then searched for in the other documents. The default is 2.

mindf = Specifies the Minimum Document Frequency: terms that do not occur in at least this many documents are ignored. In other words, it is the minimum number of documents a term must appear in to count as an interesting term. The default is 5.

mlt = The name of the query parser; it is used as {!mlt ...} to get the MoreLikeThis results.
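
To make the parameters above concrete, here is a minimal Python sketch that assembles an mlt query string and a request URL. The helper names build_mlt_query and build_select_url are hypothetical, and the host http://localhost:8983/solr is an assumption (a default local install); the core name "MoreLikeThis" is the one used in this post.

```python
from urllib.parse import urlencode


def build_mlt_query(doc_id, qf, mintf=2, mindf=5):
    """Return a local-params query string for the mlt query parser.

    The defaults mirror the documented defaults: mintf=2, mindf=5.
    """
    return "{!mlt qf=%s mintf=%d mindf=%d}%s" % (qf, mintf, mindf, doc_id)


def build_select_url(base_url, core, query, rows=50):
    """Return a /select URL with the query URL-encoded (hypothetical helper)."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return "%s/%s/select?%s" % (base_url.rstrip("/"), core, params)


# Assumed local Solr instance; adjust the host and core to your setup.
q = build_mlt_query("16", "title", mintf=1, mindf=1)
url = build_select_url("http://localhost:8983/solr", "MoreLikeThis", q)

print(q)    # {!mlt qf=title mintf=1 mindf=1}16
print(url)
```

The query string can be pasted into the q box of the Solr Admin query screen as it is; the URL-encoded form is only needed when calling Solr over HTTP.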

I created a new core named "MoreLikeThis" and added a few fields through Solr Admin (title, color, description, type). I then added a few documents using the document builder and ran "mlt" with different queries to test how it works.

The complete list of indexed documents is as follows:

{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "q":"*:*",
      "indent":"on",
      "rows":"50",
      "wt":"json",
      "_":"1488961760373"}},
  "response":{"numFound":16,"start":0,"docs":[
      {
        "id":"01",
        "title":"abc abc def xyz",
        "color":"white",
        "description":"my name is harry potter",
        "_version_":1561290064730783744},
      {
        "id":"03",
        "title":"i am xyz",
        "color":"brown",
        "description":"also harry potter",
        "_version_":1561289717104771072},
      {
        "id":"04",
        "title":"pc mobile laptop",
        "color":"no",
        "description":"blank",
        "_version_":1561289729029177344},
      {
        "id":"07",
        "type":"new",
        "title":"xyz xyz xyz xyz abc abc def mobile def",
        "_version_":1561289982509842432},
      {
        "id":"06",
        "type":"non",
        "title":"xyz xyz abc abc def mobile def",
        "_version_":1561289951273811968},
      {
        "id":"05",
        "title":"abc def xyz mobile",
        "_version_":1561289753301614592},
      {
        "id":"15",
        "title":"xyz abc",
        "_version_":1561292079187886080},
      {
        "id":"16",
        "title":"abc abc",
        "_version_":1561321357248036864},
      {
        "id":"14",
        "title":"xyz",
        "_version_":1561291979998887936},
      {
        "id":"02",
        "title":"abc xyz that",
        "color":"black",
        "description":"my name is also harry potter",
        "_version_":1561289696912343040},
      {
        "id":"10",
        "type":"my name is wasif hafeez and i am a student and a good guy",
        "_version_":1561199001380847616},
      {
        "id":"09",
        "type":"my name is wasif hafeez and i am a student",
        "_version_":1561198975604752384},
      {
        "id":"13",
        "title":"harry potter harry potter",
        "_version_":1561288932466884608},
      {
        "id":"12",
        "type":"older",
        "_version_":1561200548971020288},
      {
        "id":"11",
        "type":"older",
        "_version_":1561200640981467136},
      {
        "id":"08",
        "type":"old",
        "_version_":1561187573469020160}]
  }}

First we will test qf=title on the document with id=16, using the query:

{!mlt qf=title mintf=1 mindf=1}16

Here 16 is the id of the source document: the content of its title field ("abc abc") is used to find other documents with similar terms. Both mintf and mindf are set to 1. mintf=1 means that a term only has to occur once in the title of document 16 to be used as an interesting term, so "abc" qualifies. mindf=1 means that the term only has to appear in at least 1 document in the index. Running this query gives the following output:

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"{!mlt qf=title mintf=1 mindf=1}16",
      "indent":"on",
      "rows":"50",
      "wt":"json",
      "_":"1488992938045"}},
  "response":{"numFound":6,"start":0,"docs":[
      {
        "id":"01",
        "title":"abc abc def xyz",
        "color":"white",
        "description":"my name is harry potter",
        "_version_":1561290064730783744},
      {
        "id":"15",
        "title":"xyz abc",
        "_version_":1561292079187886080},
      {
        "id":"02",
        "title":"abc xyz that",
        "color":"black",
        "description":"my name is also harry potter",
        "_version_":1561289696912343040},
      {
        "id":"06",
        "type":"non",
        "title":"xyz xyz abc abc def mobile def",
        "_version_":1561289951273811968},
      {
        "id":"05",
        "title":"abc def xyz mobile",
        "_version_":1561289753301614592},
      {
        "id":"07",
        "type":"new",
        "title":"xyz xyz xyz xyz abc abc def mobile def",
        "_version_":1561289982509842432}]
  }}

This means that only documents whose title contains the interesting term "abc" at least once are shown. The source document (id=16) itself is not part of the results.
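
The result above can be reproduced with a small toy model in Python. This is only a sketch of the mintf/mindf filtering idea, not Solr's actual implementation: it assumes whitespace tokenization with lowercasing, ignores scoring and ranking, and the function names are hypothetical. The titles are copied from the indexed documents listed earlier.

```python
from collections import Counter

# Titles of the indexed documents that have a title field.
TITLES = {
    "01": "abc abc def xyz",
    "02": "abc xyz that",
    "03": "i am xyz",
    "04": "pc mobile laptop",
    "05": "abc def xyz mobile",
    "06": "xyz xyz abc abc def mobile def",
    "07": "xyz xyz xyz xyz abc abc def mobile def",
    "13": "harry potter harry potter",
    "14": "xyz",
    "15": "xyz abc",
    "16": "abc abc",
}


def tokens(text):
    # Assumption: simple lowercase + whitespace analysis.
    return text.lower().split()


def interesting_terms(titles, source_id, mintf=2, mindf=5):
    """Terms of the source title that pass both the mintf and mindf filters."""
    tf = Counter(tokens(titles[source_id]))
    df = Counter()
    for text in titles.values():
        df.update(set(tokens(text)))  # count each term once per document
    return {t for t, n in tf.items() if n >= mintf and df[t] >= mindf}


def more_like_this(titles, source_id, mintf=2, mindf=5):
    """Ids of other documents whose title shares an interesting term."""
    terms = interesting_terms(titles, source_id, mintf, mindf)
    return sorted(
        doc_id
        for doc_id, text in titles.items()
        if doc_id != source_id and terms & set(tokens(text))
    )


print(interesting_terms(TITLES, "16", mintf=1, mindf=1))  # {'abc'}
print(more_like_this(TITLES, "16", mintf=1, mindf=1))
# ['01', '02', '05', '06', '07', '15']
```

The toy model returns the same six document ids as the Solr response, although Solr orders them by relevance score rather than by id.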

A few more documents were added to test mintf values greater than 1.


Documents 17, 18 and 19 (shown in the original screenshot) were added. They share the same two terms, "Apache" and "Solr", but in document 17 each term occurs only once.

Now let's change the parameters and query document 17 with mintf=2 and mindf=1:

{!mlt qf=title mintf=2 mindf=1}17


mintf is 2. That means that only terms appearing at least 2 times in the qf field of the source document are used to build the query.
It does not matter whether those terms appear twice or only once in the target documents; the results are sorted by how often the terms appear in the target documents.
For document id=17 this returns nothing, because no term occurs at least 2 times in that document.

The same query run against document id=19, however, does return results.


Document id=19 has the title "Apache Solr Apache Solr", in which "Apache" and "Solr" each occur 2 times. That is why, with mintf=2, both become interesting terms, and the query returns the documents that contain them, whether a document contains them twice or only once. If we increase mintf to 3, the query again returns nothing, because no term occurs more than 2 times in document 19.
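
The effect of mintf on the source document can be sketched in a few lines of Python. This is an illustration only, assuming lowercase whitespace tokenization. The title of document 19 is taken from this post, while the title used for document 17 is a hypothetical stand-in, since the post only says that each term occurs once there.

```python
from collections import Counter


def terms_with_min_tf(title, mintf):
    """Terms that occur at least mintf times in the source title."""
    counts = Counter(title.lower().split())
    return {term for term, n in counts.items() if n >= mintf}


doc17_title = "Apache Solr"              # hypothetical: each term occurs once
doc19_title = "Apache Solr Apache Solr"  # title given in the post

print(terms_with_min_tf(doc17_title, 2))          # set()
print(sorted(terms_with_min_tf(doc19_title, 2)))  # ['apache', 'solr']
print(terms_with_min_tf(doc19_title, 3))          # set()
```

An empty set means there is nothing to search for, which is why the query returns no documents in those cases.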

Now let's move on to mindf. If we set mindf=4 for document 19, the query returns nothing, because only 3 documents in total contain the title terms of document id=19. With mindf=4, a term must appear in at least 4 documents to be used; if it does, results are shown, otherwise nothing is returned.

Query:
{!mlt qf=title mintf=2 mindf=4}19


If we set mindf to any value below 4 (i.e. 0-3), the query returns results: exactly 3 documents contain the terms, so any mindf less than or equal to 3 is satisfied.
Query:

{!mlt qf=title mintf=2 mindf=3}19
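
The mindf threshold can be illustrated the same way. The sketch below counts, for each term, how many documents contain it and keeps only the source terms that reach mindf. It assumes lowercase whitespace tokenization and counts the source document itself, which matches the total of 3 documents mentioned above. The titles of documents 17 and 18 are hypothetical stand-ins; only the title of document 19 is given in this post.

```python
from collections import Counter

TITLES = {
    "17": "Apache Solr",              # hypothetical title
    "18": "Apache Solr search",       # hypothetical title
    "19": "Apache Solr Apache Solr",  # title given in the post
}


def document_frequency(titles):
    """Number of documents in which each term occurs at least once."""
    df = Counter()
    for text in titles.values():
        df.update(set(text.lower().split()))
    return df


def terms_passing_mindf(titles, source_id, mindf):
    """Source terms that occur in at least mindf documents."""
    df = document_frequency(titles)
    source_terms = set(titles[source_id].lower().split())
    return sorted(t for t in source_terms if df[t] >= mindf)


for mindf in (3, 4):
    print(mindf, terms_passing_mindf(TITLES, "19", mindf))
# 3 ['apache', 'solr']
# 4 []
```

With mindf=4 no term survives the filter, so there is nothing left to query with.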


I hope this helps in understanding MoreLikeThis and its additional parameters mintf and mindf.
