Apache Solr Index Sample Data

Start an Apache Solr instance by running "solr start" from the bin directory of the Apache Solr home directory, or see the first article on how to set up Apache Solr and start it for the first time.

Create New Core 
Before we index sample data, we must create a core. A core is like a table: it contains many records and has its own configuration. Each record in Solr is called a document. So we create a core and add documents to it; Solr indexes those documents so that we can search them.

One Solr instance may host more than one core, each with a different set of documents and its own configuration. So we need to create a core before we add documents to it. Here is how we create one:

1. In the CLI, go to the bin directory, e.g. D:\solr-6.4.2\bin, and issue the following command:
solr create -c core1 -d basic_configs
-c specifies the core name
-d specifies the folder that contains the core configuration data.

A Solr installation already has a folder named basic_configs which contains default configuration information. The "solr" batch file (stored in the bin directory) creates a folder using the specified core name and copies the conf folder of basic_configs into the core1 directory. Here is the command with its output:

D:\solr-6.4.2\bin>solr create -c core1 -d basic_configs

Copying configuration to new core instance directory:

Creating new core 'core1' using command:


You see, it first copied the configuration folder from the basic_configs directory into core1. The new core, core1, is created at "D:\solr-6.4.2\server\solr\core1". The basic_configs folder is located at "D:\solr-6.4.2\server\solr\configsets\basic_configs".

After the folder is created and the configuration is copied, the command calls the Solr API to create core1. We can call this API from our own programs to create cores dynamically; we would see that in some other post. For now we just want to put some data into core1.
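For reference, a direct call to that API might look like the following sketch (assuming Solr is running on the default port 8983, and that the core1 instance directory with its conf folder already exists, since the solr script copies it before issuing the request):

```shell
# Sketch: create a core via the CoreAdmin API directly, instead of the solr script.
# Assumes server/solr/core1 (with a conf folder) already exists under Solr home.
curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=core1&instanceDir=core1"
```

The solr script essentially performs the folder copy and then issues a request like this on our behalf.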

Add Data into core1
Do you know the benefit of the index given at the end of a book? It makes it easy to search for terms in the book, because the index contains keywords sorted alphabetically, so a reader can find a particular keyword quickly. And against each term are listed the page numbers that contain information about it. The index created by Solr has the same semantics; it is much more complex, but its basic nature is the same.

In the same way, if we want to search documents using Solr, we must first add those documents to Solr, so that Solr can build an index of them to use at search time, just as we use a book's index to find where a particular topic is covered.

Now, we could add documents to core1 through the Solr Admin UI in JSON format. But a fresh Solr installation already contains several sample data files. Let's first import that data into Solr and run some basic search queries. Later we will also see how to add a document to Solr using the Admin UI and from our own programs. The sample data files are located in the "D:\solr-6.4.2\example\exampledocs" folder. Set your CLI to the exampledocs folder and issue the following command to import the data.

java -Dtype=text/csv -Durl=http://localhost:8983/solr/core1/update -jar post.jar  books.csv 

We are importing the data stored in the books.csv file into core1; note that the -Durl argument contains the core name. post.jar is located in the same folder; it contains code to parse data in common formats (e.g. XML, JSON, CSV) and send it over HTTP to the Solr API specified by the URL argument. (How to send data using an HTTP client ourselves will be explained in a separate post.) When you issue the command, you will see an error; among other details, it contains the line "unknown field 'cat'".

That is annoying, as we expected it to import the data. Here is the reason. Let's first have a look at the books.csv contents. The first line contains the field names; the other lines contain the records. post.jar uses these field names and data to draft the XML/JSON before passing it to Solr.

id,cat,name,price,inStock,author,series_t,sequence_i,genre_s
0553573403,book,A Game of Thrones,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy
0553579908,book,A Clash of Kings,7.99,true,George R.R. Martin,"A Song of Ice and Fire",2,fantasy
055357342X,book,A Storm of Swords,7.99,true,George R.R. Martin,"A Song of Ice and Fire",3,fantasy
0553293354,book,Foundation,7.99,true,Isaac Asimov,Foundation Novels,1,scifi
0812521390,book,The Black Company,6.99,false,Glen Cook,The Chronicles of The Black Company,1,fantasy
0812550706,book,Ender's Game,6.99,true,Orson Scott Card,Ender,1,scifi
0441385532,book,Jhereg,7.95,false,Steven Brust,Vlad Taltos,1,fantasy
0380014300,book,Nine Princes In Amber,6.99,true,Roger Zelazny,the Chronicles of Amber,1,fantasy
0805080481,book,The Book of Three,5.99,true,Lloyd Alexander,The Chronicles of Prydain,1,fantasy
080508049X,book,The Black Cauldron,5.99,true,Lloyd Alexander,The Chronicles of Prydain,2,fantasy

When Solr receives the data, it must know the type of each field to index it properly. There are three ways to specify the type of each field of a document:
  1. Send the data-type of each field when passing the data as XML/JSON
  2. Configure Solr with the field name and its data-type, so that whenever that field name appears in a document, Solr assigns it the data-type from its configuration
  3. Use Solr-defined field-name suffixes so that Solr can infer the type, e.g. in series_t, sequence_i, and genre_s, the suffixes _t, _i, and _s indicate that the field types are general text, integer and string.
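For example, if we controlled the file format, a header relying entirely on option 3 could give every column a type suffix (hypothetical field names, using Solr's default dynamic-field rules, where _d is double and _b is boolean):

```
id,cat_s,name_s,price_d,inStock_b,author_s,series_t,sequence_i,genre_s
```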
Option 1 is not valid in our case, because we are not creating the XML/JSON ourselves; it is created by post.jar. Option 3 covers only the last three fields. Solr's default configuration already treats the id field as a string, but Solr is unable to determine the data-type of the cat, name, price, inStock and author fields. So add the following lines to the managed-schema file located in core1's configuration folder, i.e. "D:\solr-6.4.2\server\solr\core1\conf", after the id field under the schema tag:

<field name="cat" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="name" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="price" type="double" indexed="true" stored="true" required="true" multiValued="false" />
<field name="inStock" type="boolean" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="author" type="string" indexed="true" stored="true" required="true" multiValued="false" />

These fields have several attributes. name and type give the field name and its data-type. indexed is set to true, meaning we want Solr to index this field. stored is set to true, meaning Solr should also store the value of the field so that it can later be returned in search results. multiValued is set to false, because we want to store only a single value for each of these fields per document; if you need to store multiple values against one field in the same document, it can be set to true. Let's restart the Solr instance and import the books.csv file again.

D:\solr-6.4.2\bin>solr stop -all
Stopping Solr process 4328 running on port 8983
Waiting for 4 seconds, press a key to continue ...

D:\solr-6.4.2\bin>solr start
Archiving 1 old GC log files to D:\solr-6.4.2\server\logs\archived
Archiving 1 console log files to D:\solr-6.4.2\server\logs\archived
Rotating solr logs, keeping a max of 9 generations
Waiting up to 30 to see Solr running on port 8983
Started Solr server on port 8983. Happy searching!

D:\solr-6.4.2\bin>cd ../example\exampledocs
D:\solr-6.4.2\example\exampledocs>java -Dtype=text/csv -Durl=http://localhost:8983/solr/core1/update
 -jar post.jar books.csv

SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/core1/update using content-type text/csv...
POSTing file books.csv to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/core1/update...
Time spent: 0:00:00.238
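As an aside, post.jar is just a convenience wrapper; the same import can be done with any HTTP client. A curl sketch (assuming the same URL as above and a CLI positioned in the exampledocs folder):

```shell
# Sketch: post the CSV file directly over HTTP, equivalent to the post.jar call above.
# commit=true makes the new documents immediately visible to searches.
curl "http://localhost:8983/solr/core1/update?commit=true" \
     -H "Content-Type: text/csv" --data-binary @books.csv
```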

Now the file is indexed successfully and we are ready to run some search queries against the Solr index.
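As a first taste, here are two query sketches against the core's select endpoint (assuming the defaults used throughout this post); the query *:* matches every document:

```shell
# Sketch: return all indexed books (q=*:* matches everything).
curl "http://localhost:8983/solr/core1/select?q=*:*"

# Sketch: return only the books whose genre_s field is fantasy.
curl "http://localhost:8983/solr/core1/select?q=genre_s:fantasy"
```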