Hi everyone, I am Swamitra. First, I will quickly show how to set up the VM for those who still don't have theirs running. Once you have the zip file, extract it, click New in VirtualBox, give it any name, select Type as Linux and Version as Red Hat (64-bit), and click Next. It will ask for memory; give as much RAM as possible, because whatever we will be doing is memory intensive. For 8 GB systems try giving at least 3 GB of RAM, and for 4 GB systems try giving at least about 1800 MB. Once you assign the RAM, select "Use an existing virtual hard drive file", browse to the VDI image you got after extracting the zip file, select it, and click Create. It will create the VM and you can click Start to start it. Any confusion in this part? Some of you, I guess, don't have the VM up and running because of some dependency or other issue, so I think the only option is to pair up.

What about the login? The username is solruser and the password is the same as the username, solruser; anywhere it asks for a password, it's solruser.

Someone asked me to briefly describe what is inside the image. First of all there are basic applications: you have a browser, an editor, a SQL client and so on. We will be importing data from MySQL and syncing it to Solr, and for that we will use the SQL client. All of the main work we do will be inside the work directory, and everything we do will be as solruser. Inside the solruser home you will see two folders: one is tools and the other is work. tools has the installers for everything; you need not worry about those, because they are already installed. Inside the work directory you will see, first of all, the Solr package, then ZooKeeper, then the sample data. Eclipse and the other things are not required for now. So basically there are three things. The sample-data folder has the instructions for all the hands-on activities plus the data required for each activity. If you look at its contents, it is organized by activity, and inside every activity folder there is the data required for that activity plus a readme file with the instructions.

Some of you might not be familiar with the command line, so you can open Sublime Text, and in the left-hand pane you will see all the readme files and everything else. So the first thing I want all of you to do is open the Sublime editor and open this sample-data folder.

(Someone reports a VM error saying a kernel module failed to load.) Yeah, there is some module missing there; I think it will be faster to pair up than to spend time fixing it. How many of you don't have the image running? Can you please sit with someone who has theirs running? Anyone who doesn't have it running, make sure you have a partner for the hands-on activities. Anyway, I will go through the theory first.
So the agenda of this session is to understand the out-of-the-box features of Solr and how we can implement a distributed search platform using Solr. We have a pretty ambitious agenda; I will try to cover most of it and we will skim through some things. Our aim is to at least understand what capabilities Solr has and how you can solve your own use case with it.

First, let's see what the basic requirements for a search application are. Whenever you want to build a search application, there are a few things you will always need. First, basic full-text search: you enter something and you see results for it. You want highlighting of your search terms in the documents wherever they matched. Then you want sorting capabilities: sorting your results on some fields, some criteria. You want pagination; you don't want to show the user all the results at once. Then you want something called faceting. You might have seen this on Flipkart and other e-commerce sites, where for every category you get a count: for this category, I have this many results. Solr provides faceting out of the box. Then there are basic charting capabilities and time-based filtering. Then there is something called clustering: whatever results you get, you might want to cluster them into groups so the user can skim through them more easily. All these features are provided out of the box by Solr, and we will see how to build them today.

Someone asked how clustering is different from faceting. In clustering you get groups: say I have 250 matches; out of those, Solr tries to create some groups and assigns documents to them. Faceting gives you counts per category; we will cover it in more detail as we go forward. Clustering is completely different from faceting.

So what is Solr, basically? Solr is a search platform, a search server. You put documents into Solr and then you can provide search features on top of them. In databases the smallest entity is the row; in Solr the smallest entity is the document. And what exactly is a document? A document consists of a number of fields. This is very important to understand, because people sometimes get confused at the very start about how Solr stores things; once the basic entity is clear, we can go on to discuss everything else. A document is nothing but some fields, and each field has a data type, can be multi-valued, and so on. For example, I have one document here with fields like st_tags, st_site and so on; that comprises one document, and there can be many such documents, each with many fields. And if you are familiar with NoSQL: in Solr, too, not every document needs to have all the fields. Some documents can have this st_tags field and some cannot.
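As a sketch, a document like the one on the slide might look like this in Solr's XML format (the st_* field names are from the slide; the values are made up for illustration):

    <doc>
      <field name="id">42</field>
      <field name="st_site">robotics</field>
      <field name="st_posttype">question</field>
      <field name="st_tags">sensors</field>
      <field name="st_tags">arduino</field>   <!-- multi-valued: just repeat the field -->
      <field name="st_title">How do I calibrate an IMU?</field>
    </doc>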
With this little introduction to documents, let's see how Solr is built. Solr is built on top of Apache Lucene. Lucene provides all the core search functionality: indexing, scoring and so on. When you use Lucene directly, you have to write Java code for everything. What Solr did was expose those features as configuration options: instead of writing code, you do some configuration and get the same result. For instance, if you want sorting on some field, you can declare that directly in your schema and the job is done.

Lucene is very scalable and super fast; you can get on the order of 150 GB per hour of indexing speed. In most cases Lucene and Solr will never be the bottleneck for your application, because they are very fast; the bottleneck becomes how fast you can feed data in to be indexed. Solr basically provides an API to access Lucene over HTTP, and it adds features like distributed search and replication on top of Lucene.

One thing people get confused about at the start is thinking that Solr can be used as a database. Since Solr 4, Solr has added functionality similar to a NoSQL database, but it should not be used as one; otherwise, over time, its performance will degrade very badly. We will see which performance factors should be kept in mind while creating the schema and while indexing documents. Features like faceting and more-like-this are added by Solr on top of Lucene. A good thing about Solr is that you can write your own plugins and plug in extra components, like the clustering component. Clustering is not built into Solr; it actually uses a library called Carrot2, so whenever you say clustering in Solr, it means Carrot2 clustering. We can plug extra software into Solr very easily.

Now, a quick overview of what indexing and querying mean. (To an earlier question: yes, if we have time left at the end we will definitely get to that.) We have seen that the basic entity in Solr is a document, and a document consists of many fields. For people coming from a relational database background: you can think of a document as a single row, but a row that need not have all the columns. In a traditional RDBMS table every row must have all the columns, or you have to assign default values; in a Solr document, some attributes can be present and some absent, just like in any NoSQL database. You can compare it to Cassandra or something similar.

Raw data goes into a parser and we create a document out of that raw data. What does that document consist of? Fields. But first, let me tell you what data we will index today. We will index data from Stack Exchange sites; you must be familiar with them, like Stack Overflow, Robotics and so on. They provide dumps of all their data: all the questions asked, all the answers received, the score of each post. That is the data we will try to index.
So in this document you can see fields like the site field, which tells you this data belongs to the Robotics site; then the post type, which here is "question"; then the actual post, the number of comments the post got, the title of the post, the favorite count, and so on. Note that this title field will be present only for questions: on Stack Overflow, if you look, only questions have titles; answers do not. So the title field is present only in documents where post type equals question.

So we have some raw data and we converted it into a document, meaning we placed that data into fields. Then, inside each field, the data is eventually converted into tokens, which you can simplify as terms: this title text, say, will be broken into tokens like "what", "is", and so on. So the document has fields, each field goes through an analyzer, the analyzer breaks the raw data into terms, and those terms are written into a data structure we call the inverted index. The inverted index is the basic data structure of search, and that is the flow for indexing.

Let's see what an inverted index is. Suppose we have the document "the bright blue butterfly hangs on the breeze". This document goes through the analysis phase, where we do many things. One simple thing is the stopword list: there are some words a user will never be interested in searching for, so during analysis we remove all those words. Then the text is broken into terms, and those terms go into the inverted index. So for "bright" we have a term; basically we have tokens for all the terms apart from the ones on the stopword list. That part is the analysis phase, and a lot of other things can happen during analysis too; at the end we get an inverted index.

An inverted index is basically a mapping between terms and the documents in which each term is present. Take the word "best": it is present only in document 2, so the index stores that "best" appears in document 2. The term "blue" is present in documents 1 and 3, so it maps to both. This data structure is called the inverted index, and it is the reason search is so fast: whenever a user searches for a term, we already have the mapping. If the user searches for "bright", we go directly into this structure, get the list of all documents containing "bright", and then perform further operations. There are certain use cases where this data structure doesn't work, so there is also an uninverted index, which stores, for document 1, which tokens it contains; but those use cases are few, and we will focus on the cases where the inverted index works. (Ranking? We will see ranking when we cover querying.)

Now some basic concepts that will keep coming up. Term frequency is the number of times a term has occurred in a particular document; for example, how many times a given term occurs in document 2. We store term frequency and take it into account while querying. Then there is inverse document frequency.
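These two quantities combine into the classic tf-idf weight. As a sketch of the standard formulation (Lucene's actual similarity adds length normalization and boosts on top of this):

    weight(t, d) = tf(t, d) x log( N / df(t) )

where tf(t, d) is the number of times term t occurs in document d, N is the total number of documents, and df(t) is the number of documents containing t. A term that appears often in one document but rarely across the whole collection gets a high weight.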
The goal is this: if we go only by term frequency, we will not always get good results. We might have one very large document in which some term repeats 10 times, and at the same time a very small document in which the same term repeats only twice. If we just go by term frequency we will not get good results, so we calculate the weight of the term in the document, normalizing the score based on the length of the document. We will cover this again when we cover querying.

This is the architecture of Solr. At the core of Solr is Lucene, and Lucene provides the index searcher and the query parsers. Query parsers mean you can specify different kinds of query language: there are multiple query parsers, and each has some additional advantages; some provide an easier way to specify boosting, for example. All of that comes from Apache Lucene. On top of that, Solr adds other things: the data import handler, which helps us import data from external sources (from a database, from external files, or even crawled from the web); Tika, which helps in indexing rich text documents like PDFs; index handlers, which let us specify how we want to control the index and what operations to perform on it; and index replication, so if we want multiple copies of our index for load balancing and high availability, we can replicate it. So Lucene provides the core single-node functionality, and Solr adds a lot on top of it.

Solr can run in any servlet container; by default it comes with Jetty, so whenever you start Solr it deploys the application on Jetty and starts on its port. It provides schema and metadata configuration: a lot of what you did through programming in Lucene, Solr exposes as configuration. And it provides clients and client APIs to access your index.

What we will do now is fire up one Solr instance, see what is included in the Solr package, and understand the directory layout, because there are so many examples bundled in the package that anyone using Solr for the first time finds it very difficult to figure out what each directory is for. (This is covered in more detail in another video.) So if you all can open activity 1 and its readme file: the first thing we have to do is start a single-node Solr instance.

Someone asked: can we control the tags, as in, can we decrease the number of tags in this document? Yes. This is your data, and this tags field is nothing but a multi-valued field, so whatever data you put in is under your control. st_tags is the field name and these are its values. Whenever you have a document from Stack Overflow, at the end it has some tags saying which categories it belongs to, so the tags come from the data itself; it's not some predefined thing. Another document can have tags like php, jquery, ajax. It's just a multi-valued field. All of these are field names for that document, and we need to define them in the schema, as sketched below.
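A sketch of how such a multi-valued tags field might be declared in schema.xml (the st_tags name is from the sample data; the rest is illustrative):

    <field name="st_tags" type="string" indexed="true" stored="true" multiValued="true"/>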
All these field definitions together are called the schema definition, and we will see in this example how to define the schema. Someone asked whether you can add fields later: yes, it's very dynamic. Today you have some requirements, so you define some fields; tomorrow new requirements come up, and you can add new fields and index into them on top of the existing data. You can just index into the new fields; if you want the old documents to get updated too, then you have to re-index the old documents as well.

Now open up the terminal and go inside the work directory, into the solr-4.8.1 directory. Here you will see one example folder, one example-final folder and one example-minimal folder. I will explain in a couple of minutes what all these mean, but for now just go to the example-minimal folder. I am at /home/solruser/work/solr-4.8.1/example-minimal, and once you are there just run: java -jar start.jar. It will start up Solr: it will fire up a Jetty instance, deploy the Solr war there, and start on port 8983. Any confusion, any questions so far? Is everyone on this side able to do that much?

If you didn't make any mistake, you will see something like "Registered new searcher", and it will tell you that it has started on port 8983. Now open your browser and go to localhost:8983; it will list the contexts available on the server. Just click that link, or you can directly go to localhost:8983/solr. Let me know if anyone has problems so far.

This is the basic admin UI for Solr. It provides a lot of information and you can do a lot of things from here. What we did just now is start our single-node Solr. Here you will see the core selector, and it will show something like collection1: this is the default example that comes bundled when you download Solr. A collection is equivalent to a table in an RDBMS: for a collection you define the schema, and then you index documents into that collection. Then there is something called the instance directory. Whatever you are running as one Solr process is called an instance in Solr vocabulary. I have just launched one Solr instance, and each instance can have multiple collections; you can also have multiple instances of Solr running on one machine, each with multiple collections.

Someone asked whether instances are indexed and queried separately, or whether they see each other. They do communicate with each other; instances are the way we distribute our search. Shards are one more level of abstraction down. It's like starting multiple nodes of, say, Cassandra if you have that background, or like setting up Hadoop on multiple nodes, where you have multiple Hadoop instances running. And just as with a pseudo-cluster, if you want, you can run multiple instances on the same machine. We will talk more about instances when we come to SolrCloud.

The dashboard also gives you system information, like how much memory you have, the heap size and so on, plus all the properties, and most importantly the list of collections you have. If you select a collection, it will tell you how many documents are present in it; documents, as I mentioned earlier, are equivalent to rows.
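By the way, to try the multiple-instances idea on one machine, a sketch (the second directory is a hypothetical copy; the jetty.port property is how the stock Solr 4 example picks its port):

    cd ~/work/solr-4.8.1/example-minimal
    java -jar start.jar                        # first instance, default port 8983

    cd ~/work/solr-4.8.1/example-minimal-2     # hypothetical copy of the server directory
    java -Djetty.port=8984 -jar start.jar      # second instance on port 8984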
So let's leave it at that; everyone has the admin UI up and running? (You are still at the virtual machine step? OK.)

First we will try to understand the directory structure of Solr. At the base level there are many things; just focus on the example directory. One thing to remember with Solr is that the names are very confusing, so always try not to let the names confuse you. "example" here actually means a server: whenever you download Solr, the example directory is one server directory. (There is a JIRA open for changing this terminology.) You can make multiple copies of the example directory, and that way you can run multiple instances on the same machine. What I have done is make two copies of this example folder, that is, two copies of the server folder. One is example-minimal, where I have removed all the extra stuff that makes it confusing the first time; the other is example-final, where I have kept all of it, so that you can take reference from it. Whatever we do, we will do inside the example-minimal directory, taking reference from what I have mentioned in the activity docs and from the example-final directory.

If we go one level deeper: apart from the server directory there are other directories, and if we have some other package, like the clustering component or the data import handler, we can put those jars inside the appropriate folder and they will be visible to Solr. Then, if you go inside the example folder, you will again see a lot of directories which might confuse you at first, but the main one is the directory called solr. This is the directory which stores your actual collections: your actual indexes, schema and all the configuration. Basically, all the configuration and all the actual data live inside the solr directory. There is another directory called scripts, which has scripts for doing things like uploading files to ZooKeeper. Then there is start.jar: start.jar starts your Solr; it fires up a Jetty instance, uncompresses the files and deploys the war file. So whenever we have to start Solr, we can just run java -jar start.jar.

Now if we go inside the solr directory, the only thing of concern to us is collection1. collection1, as I mentioned, is equivalent to a table, and all its data is stored inside the solr directory, along with the configuration. The way you create a collection is: first you define some configuration for it, just like in an RDBMS you first write a CREATE TABLE statement and that creates your table. You define what fields the collection will have and the other parameters. Then, when you index data, the data directory gets created, and it holds the actual data. (We can configure this data directory and put it somewhere else too.)
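Roughly, the layout we just walked through looks like this (a sketch based on the session; exact contents vary by Solr version):

    solr-4.8.1/
      example-minimal/          <- a copy of the stock example (server) directory
        start.jar               <- fires up Jetty and deploys Solr
        scripts/                <- helper scripts, e.g. for uploading configs to ZooKeeper
        solr/                   <- Solr home: all collections, configs and data
          collection1/
            conf/               <- schema.xml, solrconfig.xml, ...
            data/               <- created once you index: the actual index and tlog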
Now we will see what configuration is required to create a new collection. You can ignore all the other files and just focus on two: schema.xml and solrconfig.xml; the rest you can copy from the example. Whenever we want to create a new collection, the main thing we have to define is a schema: what fields will be there in the documents. To talk about the schema, let's open the schema file for collection1. Inside the example-minimal directory, go into solr/collection1/conf; here is my schema file. Let's open it and see what fields and what else is inside.

In a schema you basically define three things. First, the primary key for your documents. Second, the fields which will be in your documents; it's not necessary that every document has every field, but whatever fields you want to allow in documents, you define them here. Third, the types for your fields. For example, here I have defined only three fields, id, title and text, with the data types string, string and text_en. Solr comes bundled with a lot of data types, and you can create your own; in Solr vocabulary these are called field types rather than data types. string and text_en are bundled field types, and if we want, we can create our own.

The first thing was the unique key: you specify one field which will be treated as the unique key, and the unique key has to be present in every document. There are some cases where you can get away without one, but it's always good to have a unique key. Someone asked whether the unique key and a primary key are the same: effectively yes; the difference is that in Solr there are a few cases where you can avoid having an id field, though those cases are very few and most of the time people always have one. As you can see, I have defined a field named id with type string.

So those three things are there: the id/unique key, the fields, and the definitions of the field types you want to use. There are some default field types plus some complex ones. A field type defines what operations you want to perform on a field: if you have a stream of document text coming in, what analysis do you want to perform on it? You might want to remove stop words; you might want to convert upper case to lower case. The field type defines all those operations on the stream of raw text before it finally goes into your index. We can create our own field types too; for now, just understand that we have defined three fields: id, title and text.

Now we will index one document, see how it goes into Solr, and then discuss things further. If you refer to the readme file for activity 1, it tells you how to index some documents. Basically, you go to the activity 1 folder, and there is a utility called post.jar that will index documents into Solr; when I say "index", it basically means insert documents into Solr. Before posting, let me show you what is inside the docs folder: there are two documents, and this is what a document looks like. It is basically an XML document with three fields, id, title and text, and some value for each of the three fields.
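A reconstruction of what such a document file looks like (the id/title/text fields are from the session; the exact text values here are made up):

    <add>
      <doc>
        <field name="id">1</field>
        <field name="title">Apache Hadoop</field>
        <field name="text">Hadoop is an open source framework for distributed processing.</field>
      </doc>
    </add>

And a pared-down sketch of what the schema.xml we just read boils down to:

    <schema name="example" version="1.5">
      <fields>
        <field name="id"    type="string"  indexed="true" stored="true" required="true"/>
        <field name="title" type="string"  indexed="true" stored="true"/>
        <field name="text"  type="text_en" indexed="true" stored="true"/>
      </fields>
      <uniqueKey>id</uniqueKey>
      <!-- field type definitions (string, text_en, ...) follow here -->
    </schema>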
Any questions so far? Someone asked: do you have to create an XML file every time? No; you can use CSV, JSON, and a lot of other formats. In this example we are taking only XML. In fact, even naming the fields can be skipped: there is a schemaless mode where Solr will automatically try to understand the types and the fields of your documents. For simplicity's sake we are going with XML for now, where you name the fields yourself; Solr can detect those things for you too, but let's keep that aside.

So let's index this document. Just run the command mentioned in the readme.txt file inside activity 1. It will tell you that one file was indexed and show you the time it took. What "indexed" means is: it took the file, broke it into fields, ran analysis on those fields, broke them into tokens, and finally put those tokens into the inverted index. Let's see what it looks like in the admin UI: if you go into the admin UI, it will show you numDocs: 1, so it has indexed one document. Let me know if anyone has questions; is everyone able to index one document?

numDocs is the total number of documents that are visible to you. maxDocs is different: whenever you index documents, there is a slight delay before they become visible; they become visible only when you do a commit (we will cover that). So maxDocs tells you the actual number of documents present, while numDocs tells you the number of visible documents.

Someone asked whether we can see the data in the inverted index. That is in a binary format, but let me open the directory where the data lives. As I mentioned, all the data and all the configuration are inside the solr folder, and inside it every collection gets its own conf and data directories. For collection1 there is one conf directory and one data directory: in conf I defined the schema, and when I indexed the document it went inside data. Inside the data directory you will see two things: one folder is index and the other is tlog. index is your actual inverted index; it is written in an internal binary file format, so you won't be able to read it as text. Before writing to the index, Solr actually first writes to the transaction log, this tlog, and then writes to the index. So what you see as maxDocs is effectively the sum of what is in the index plus the tlog, and the index holds the actual documents that are visible to you.

I have indexed one document. Now, if I run this, it is a select-all query; you can think of it as SELECT *. The URL path goes: solr, then collection1, then select, and the query tells it to select everything from everything. The query syntax is field:value; suppose I only wanted to get results on the id field, I would query id:<value>. (This query? If you go into the readme file, you will find it there: after posting the documents, open the browser and do a select-all query.) Is everyone able to follow?

Let's try to understand the response I am getting. First there is a header showing the status of my query and the time it took: here it shows the query took 2 milliseconds, and status 0 means OK; non-zero means some problem.
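A sketch of what this looks like (the URL is from the session; the response is trimmed and the values are illustrative):

    http://localhost:8983/solr/collection1/select?q=*:*

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>            <!-- 0 = OK -->
        <int name="QTime">2</int>             <!-- milliseconds -->
        <lst name="params"><str name="q">*:*</str></lst>
      </lst>
      <result name="response" numFound="1" start="0">
        <doc>
          <str name="id">1</str>
          <str name="title">Apache Hadoop</str>
          ...
        </doc>
      </result>
    </response>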
Then it shows the parameters I passed to the query, and then the documents available for it: it shows that I have one document for the select-all query, and it shows the actual document, exactly what I indexed. There is also this _version_ field; ignore it for now, we will discuss later what it is used for. So the idea is: we defined a schema, we indexed a document, and we can see that document in the UI.

Now, if we want, we can search on a particular field. Say I want to search for the term "open": first specify which field to search on, say the text field, then give the value to search for after the colon. So my query is text:open, and it still gives me this result. Let's index the second document too. (Someone asked whether post.jar is a standard utility. Not at all; it is a bundled utility just to get started quickly with indexing. Otherwise you would use SolrJ and set up a Java client, which takes time, so for demo purposes we use post.jar; most of the time in real applications you will use Java.) So I index the second document, and now a select-all query shows there are two documents; both documents this time have the same fields.

One more thing you might see in a field definition: for the id field we have specified required="true". This forces Solr to check whether that field is present in the document. But for the title and text fields we have not specified anything like required="true". So let's index a document without a text field: we make a copy of the document, remove the text field, change the id, and post it. You can see that in this document there was no text field, and if we do a select-all query, you will see that our third document has no text field. So it's not necessary that every document has every field; the fields you always want to have, you can mark required="true" in your schema.

Is the id like a primary key? Yes, the unique key is effectively a primary key. The only difference is that there are some cases where you can live without it, but in most cases you will need it, especially in a distributed environment, for consistency and so on; in any production deployment you will have a primary key. Also, if you ever want to update your documents later, you need some way to identify which document to update.

If I search text:open now, it still shows me the two documents that have the text field. Can I search on other fields? Yes, you can search on the title field, but I will tell you why it might not show the result you expect: it is because of the analyzer. Note that you have to give the full value; you have to search for the complete "Apache Hadoop". That is the next part, analyzers; we will see why the text field returned results even when I gave a single token, while for the title field I have to give the complete value.
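To make the behavior concrete (the queries are from the session; the exact title value "Apache Hadoop" is assumed):

    /solr/collection1/select?q=text:open               -> matches: text_en is tokenized
    /solr/collection1/select?q=title:Apache            -> no match: string is one whole token
    /solr/collection1/select?q=title:"Apache Hadoop"   -> matches: the full exact value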
Someone asked: if the word "open" occurs multiple times in the same document, does the result show how many times? There is a different component with which you can find out how many times a word has occurred across your index, but your search result will not show you, for a particular document, how many times "open" occurred.

Then someone asked about searching a term on both the title and text fields and getting the result if it matches in either. For that we use something called a copyField; we will see it shortly. (A wildcard on the title does work, but the normal way, when you want to search on a field which consists of tokens from multiple fields, is a copyField. In the value, or in the field name? In the field name; there is no single field for this yet. We will see how to query on multiple fields when we get to copyField.)

Now comes the role of data types. Notice that the data type of all the fields except the text field was string, and that is why searching on a partial token was not giving results; but for the text_en field we got results even when we gave partial words. We will get to the reason, but first let's see what analyzers are; that will make things clearer.

How do you process the text coming into a field before putting it into the inverted index? That is what we define through analyzers. Suppose this is the raw text we are getting, an HTML document that we are analyzing. Before putting it into the index, the first thing we want to do is filter a few things out of the actual data, like all the HTML tags. So before data is put inside the inverted index it goes through an analyzer, and an analyzer has basically three kinds of pieces.

First is the character filter. A character filter applies on every character: it treats the whole input as a stream of characters, checks whatever rule you have defined against each character, and removes or transforms accordingly.

Second is the tokenizer. The tokenizer breaks your text into terms. We have seen that the inverted index is a term-to-document mapping, so before putting a document into the index we need to break it into terms. But how? Suppose, for an e-commerce site like Flipkart, there is a brand called Pepe Jeans. How do you tell Solr not to break "pepe jeans" into two words but to keep it as a single term? How to break your raw input text into tokens is what you define through the tokenizer.

Third is the token filter. Once you have the tokens, which of them do you want to keep in your index and which do you want to remove or transform? That you specify through token filters.

Let's take the example of this HTML document. These filters are available out of the box; there are a lot of them by default, and you can also create your own filters and plug them into Solr.
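Declared in schema.xml, the chain for this HTML example might look like this (the type name text_html is made up for illustration; the factories are the ones discussed next):

    <fieldType name="text_html" class="solr.TextField">
      <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>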
The first piece we defined is HTMLStripCharFilterFactory. It goes ahead and strips HTML tags like <html> and <body>, so after going through this char filter your raw text gets converted into the plain stream "this is a sample HTML document". You can specify multiple character filters for an input stream, and you could do other things here as well.

Once the text goes through the character filters, the next thing is the tokenizer, which tells Solr how to break the stream into tokens. I have defined a whitespace tokenizer here: break the text on whitespace; any time it sees whitespace it splits off a token. You can specify only one tokenizer per field type, because there can be only one way of breaking a text stream into tokens; there cannot be multiple ways.

After you have tokens, these are the terms that will actually go into the inverted index; but before putting them in, we filter out unnecessary terms using token filters. For example, we can use StopFilterFactory, which removes the stop words: "a", "and", "the", all those are stop words; plus you can give your own list of tokens that you do not want to put into Solr. There are other token filters, like pattern filters where you specify a pattern: suppose you are indexing credit card information, you can give a regex pattern saying do not index 16- or 14-digit numbers.

Finally, this document gets converted into three tokens: sample, html, document. These go inside the inverted index, with mappings that "sample" is present in document 1, "html" is present in document 1, and "document" is present in document 1.

Here I used a whitespace tokenizer, but there are cases where you do not want to tokenize at all: if I am indexing a brand name, I do not want to split it into multiple tokens. For that there is a tokenizer called the keyword tokenizer, which takes the complete input as a single token. Obviously, if I had used the keyword tokenizer instead of the whitespace tokenizer, then instead of going in as sample, html, document, the whole thing would have gone in as the single token "sample html document"; and when something is stored as one token in your data structure, you get results only when you search for the complete thing, "sample html document".

That is the string type: the difference between the string data type and the text_en data type is that string does not tokenize; whatever you put in goes into the index as a single token.

Someone asked: what if users sometimes want to search "pepe" individually and sometimes "pepe jeans" together? Again, it depends on your case. What we can do is, again, use a copyField: we take the same input data but create two fields in Solr, one of type string, so the data goes in non-tokenized, and the other of type text_en or some other tokenized type.
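A sketch of that dual-field pattern (the field names brand and brand_tokens are made up for illustration):

    <field name="brand"        type="string"  indexed="true" stored="true"/>
    <field name="brand_tokens" type="text_en" indexed="true" stored="false"/>
    <copyField source="brand" dest="brand_tokens"/>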
When you are searching, you search on the tokenized field; and when you are doing other things like faceting, which we will cover later, you can use the non-tokenized one. So it depends on the case. The idea I want to get across is that you have control over how and what you tokenize. If you want to tokenize something a certain way, there is a whole bunch of filter factories, and there is very good documentation on what each tokenizer and each filter does.

The LowerCaseFilterFactory converts all upper case to lower case. Token filters basically define the transformations you want to apply to tokens before they finally go into the inverted index. StopFilterFactory says remove stop words like "a", "and", "the"; LowerCaseFilterFactory says convert all characters to lower case. It is not that every token filter removes things: StopFilterFactory is an example that removes tokens, but there are cases like pattern replace where, say, you wanted to convert every "html" token to "xml": you can specify a pattern replace filter that matches all the "html" occurrences and transforms those tokens to "xml". It is basically a transformation.

A question: does the tokenizer maintain token order? Yes, it does; token order is very important. (You are basically asking how the "sample" token is related to the "document" token, right?) And if I want to calculate n-grams, can I do that in a token filter? Yes, you can use an n-gram token filter. Also, the order in which you specify things matters: whenever you define an analyzer, you can first define the character filters (it is not necessary to always have one), then you must define the tokenizer, and then, in whatever order you define the token filters, in that same order the transformations happen. If you had defined StopFilterFactory last, it would be applied last. (And that, by the way, was the reason searching the title field for a single token gave nothing: it is a non-tokenized field, so you have to search for the complete value for it to return a result.)

Is text_en custom? No, text_en is a default; it comes bundled. Whatever types we use in our schema have to be defined in the schema.xml itself: you can see there is a definition for the field type string and one for the field type text_en. For a field type we define the analyzer, and there can be two parts: one for index time and one for query time. You can have separate index and query analyzers for a field type: suppose while indexing documents you want to convert everything to lower case, but while searching you don't want the query converted to lower case; you can define different analysis for index time and query time.
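A sketch of separate index-time and query-time analyzers for a type (the type name and the exact chain are illustrative):

    <fieldType name="text_custom" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- no lower-casing at query time, per the example above -->
      </analyzer>
    </fieldType>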
A question: is there a way to treat quoted strings differently, to turn a filter on or off inside a quoted string? For example, lower-case everything unless it is inside a quoted string. I am not sure there is a default filter for exactly that use case, but it is very easy to implement and plug in: you could write a pattern matcher that lower-cases everything except inside double quotes. But the questioner wants it not only for lower-casing but for every filter, a general on/off switch for quoted strings. I don't know if there is a filter for that; there is a whole list of tokenizers and filters, so you can go through it and see, and you can also extend an existing filter, though that means writing code. I guess there is no default filter for this.

So, to recap: character filters apply on a per-character basis; token filters can add, change or remove tokens, basically transform them; and there is a link where you can get the list of all the tokenizers, character filters and analyzers.

The next thing is dynamic fields. Until now we had to pre-define in the schema every field we index. With dynamic fields you need not know the names of the fields beforehand: you can say that anything ending with _i will be of type int, indexed and stored. Using this you can add dynamic behavior to your schema without having every field definition spelled out. Suppose you have an analytics application and you know that all columns ending with a certain suffix belong to a particular class; you can define a dynamic field for that class. If I index a field abc_i, it gets matched by this dynamic field definition and stored as an int, and when searching you use the original name you used while indexing, abc_i. Is the name a regular expression? No, it just takes a star: in the schema definition you write *_i, and while indexing you use something specific like abc_i, anything in place of the star.
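A sketch of such declarations (the *_i-means-int convention is the one described above; the *_s line is an extra illustration):

    <dynamicField name="*_i" type="int"    indexed="true" stored="true"/>
    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>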
The next thing is copyField. There are cases where you want multiple versions of the same field: you are indexing a document with a title, but you want one field with the title processed one way, say upper case, and another field with it processed another way. You can index the field only once and create copies of it into other fields. Suppose I want a master search field that needs tokens from all the fields: I define a destination field called text, my master search field, and for each source field I define a copyField with that source and text as the destination. What happens then is that tokens from cat, name, manu, features, includes, all those fields, go inside the text field, so when I search on the text field, even if a token is present in only one of those source fields, it gives me the result.

Then there are certain parameters we can set while defining our schema. One is indexed, the other is stored. indexed="true" says: put this field into the inverted index. Whenever you search, Solr goes into the inverted index, looks up your term and gives you the list of documents containing it, so indexed="true" makes a field searchable. If you do not want to search on a particular field but just want to store it and later display it to the user, that is what stored="true" is for: it says I do not want to search on this field, I only want to store it. And that is where people abuse Solr: even though they don't need many fields to be searchable, they try to use Solr as a NoSQL store and make everything stored. That bloats your data structures badly and affects performance. So use it wisely: anything you can move outside, move outside, and don't make everything stored="true". Solr is for searching, not for storing things.

Someone asked whether an integer field performs better. It completely depends: the performance of a data type depends on the type of queries you want to run. Suppose you are doing a lot of range queries on some field: there are trie-based data types, tint and tlong, and range queries will be much faster on those. Your data type choice has to be based on your queries.

Anything defined with stored="false" will not show up in the returned documents: I can search on it, but when I retrieve the results it will not be shown. Another field parameter is multiValued: if you want to store an array of values, you set multiValued="true". Then there is docValues. There are use cases like sorting of results: by default Solr keeps the relevant mappings in the heap, so if you have a very large data set and you want to do things like sorting, you may want to move that out of the heap; you can set docValues="true" and Solr will build a kind of uninverted index for that field and keep it on disk.

Then there is omitNorms. Norms relate to length normalization: suppose you have one small document and one very large document, and a term occurs once in the small document and ten times in the large one; to say what the weight of the term really is, Solr keeps a normalized weight for the term in that field, so if a document is very large and the term occurred only once, its normalized score will be small. If you want length normalization on a field, keep norms on; if you don't need it, you can set omitNorms="true" and save that memory, because norms take a good amount of memory.
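Tying the copyField example and these field parameters together, a sketch modeled on the stock example schema (cat, name, manu, features and includes are the stock field names; the type of text may differ in your copy):

    <!-- master search field: searchable, not stored, fed from many sources -->
    <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>
    <copyField source="cat"      dest="text"/>
    <copyField source="name"     dest="text"/>
    <copyField source="manu"     dest="text"/>
    <copyField source="features" dest="text"/>
    <copyField source="includes" dest="text"/>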
Then there are a few cases for which you have to enable term vectors and positions: these keep mappings like the distance between terms, and certain features need that, so in those cases you turn them on or off. All of these are field parameters, and you define them in your schema.

What we can quickly do now is create our own custom data type, index some documents from Stack Exchange, and then see how to analyze those data types from the admin UI. If you go to the readme of activity 2, there is a schema.xml; we need to copy this schema over to collection1 of our example-minimal folder. This schema has a number of field definitions suited for Stack Exchange data: the creation date, which site the post is from, what the post type is (question), and so on. So copy this file and replace the original one in the collection.

What we did just now is change the configuration of collection1. In the admin UI, if you select collection1 and open the schema browser, you will see all the fields available for that schema. Right now I am not able to see any of the new fields, because I have only changed the configuration on disk; I need to either restart Solr or just reload that particular core from the admin UI itself. Go to Core Admin, select collection1 and hit Reload; it will reload the core, and now the schema browser shows all the new fields. Is everyone able to update the schema? (Where do we copy? We are doing everything inside the example-minimal/solr/collection1 folder. What does it look like? Select the schema browser from the UI; if you are not able to see it, go full screen, because at low resolutions things get hidden.) Any questions so far?
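For reference, the Reload button in Core Admin just calls the core admin API over HTTP; you can do the same thing with a URL like this:

    http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1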
So I have changed the configuration, but I still have the three documents I indexed earlier, and now I will try to index some new documents. Right now, if I query, you will see the older existing documents; and if the schema has changed to the extent that the earlier indexed fields are no longer there, then that is an error: Solr will throw an error even though those tokens are still somewhere in your inverted index; it fails before even reaching that data. That is why, whenever the schema changes substantially, people re-index all the documents.

If you refer to the readme of activity 2, we will now index one sample document from the Stack Exchange dump. Change to that directory; this is what the document looks like; note that there is no text or title field in it. I will again use the same post utility. (Question: the new file has been indexed using the new fields, but the old documents seem to be using the old schema? Yes; even though you won't be able to search the old things, they will still be there in your data directory. What people do is re-index the complete thing.) Right now I am not re-indexing; I am keeping the old stuff as it is, to show that you can have a somewhat dynamic schema. When the schema changes at this level, either we predict beforehand that it will change and choose our data types accordingly, or we re-index everything. So I have indexed the new document, and a query still shows the old documents. If you just want to delete those, you can use the delete query (also mentioned in the readme): it deletes all the documents, numFound becomes 0, and you can do a fresh insert. After that, you can see there is only one document indexed.

Now I will quickly show you how to use the field analysis page. There is an analysis component in the admin UI, and there you can see what your data will look like after being indexed: enter whatever value you have, click "Analyse Values", and it shows what happens at every stage. Basically, you can use this UI to debug your analysis.

For example, I have a custom data type. What I do with it is: first I keep the whole input as a single token, then I replace every 4-digit run with stars. I add this custom data type into my schema definition. Let me explain a little: I have defined three things. First a tokenizer, the keyword tokenizer, and then two filters. In the analysis UI you can see the stage after each of them: what you get after the keyword tokenizer, then what you get after the lower case filter (you can see it converting the input to lower case), and finally the pattern replace filter, where I replace all 4-digit runs with stars, and it is doing exactly that. After making such a change, you reload or restart Solr. That way you can chain multiple filters and create your own data type, and the UI shows all the stages. This was for index time; you can do the same for query time, and if you enter values for both, it shows both query- and index-time analysis. Using this you can tune your data types until they behave the way you want.
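A sketch of what that custom type might look like in schema.xml (the type name and the exact pattern are assumptions; the factories are the ones named above):

    <fieldType name="mask_digits" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="\d{4}" replacement="****" replace="all"/>
      </analyzer>
    </fieldType>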
One thing in Solr is that if you have indexed one document using some ID and you index again using the same ID, it will delete the previous one and index the new one. What happens in an RDBMS is that if you insert using the same primary key, it throws an error; in Solr, update actually means delete and re-insert, so whenever you index something with an existing ID, it will basically delete the old document and index the new one.

Now, if you want to update only parts of a document, there is something called atomic updates: instead of updating the whole document, you can update some of its fields. Right now there are three operations available — set, add and inc. If you want to change the value of some field you use set, if you want to add a new value to a multi-valued field you use add, and if you want to increment the value of a numeric field you use inc. It is a bit slower than normal indexing, but it is there, so you need not re-send the whole document every time.

There are clients available for Solr in a lot of languages. The most commonly used is the Java one, because it always keeps itself updated with the Solr versions — whenever they release a new Solr version, they package the matching SolrJ client with it — and also SolrJ is the only one which provides the cloud-aware client, which takes care of routing your requests both at query time and at indexing time. With other clients you have to put your own load balancing on top of your cluster.

If you open Eclipse, there is a project called solrdemo, and inside it there is a package with the SolrJ examples, including an example indexer. To connect to Solr, you basically create an instance of a Solr server and provide the URL of whatever collection you want to connect to; my collection is collection1 and my Solr is running at 8983, so I will specify that.

Just some clarity — this is probably for installing Solr in a distributed way; do we need to install ZooKeeper separately, or does it come as part of Solr? It comes bundled with the Solr package as an embedded ZooKeeper, but that is very unstable, so you will have to set up your own ZooKeeper separately and then point to it — while creating the client instance, you just need to give the ZooKeeper host.

This example is for a single node: whenever you are running on a single node, you use the HttpSolrServer, and when you are in cloud mode — I mean distributed mode, when you have multiple Solr instances running — you use the CloudSolrServer client. You will notice that here, when connecting, we have to give the exact IP and exact port, while when using SolrJ in cloud mode we just give the ZooKeeper IP; we will talk about this when we get to ZooKeeper.

After you create an HttpSolrServer instance, you can either index documents one by one or index them in bulk. To index a document, you basically create a SolrInputDocument and just add it to the server. Here what I am doing is indexing multiple documents in bulk — you can do that too, and indexing in bulk obviously improves the performance, so it is always advisable to find a suitable batch size for how many documents you send in one go, and index documents in bulk.
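A minimal sketch of what that indexer might look like with the Solr 4.x SolrJ API; the field names, URL and batch size are assumptions for illustration, not the exact example from the VM:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexerExample {
    public static void main(String[] args) throws Exception {
        // Single-node mode: point directly at one collection.
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Build a small batch of documents and send them in one call.
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("st_title", "sample document " + i);
            batch.add(doc);
        }
        server.add(batch);   // bulk add: one request for the whole batch
        server.commit();     // nothing is searchable until a commit happens

        // Cloud mode: only the ZooKeeper address is needed; the client
        // fetches the cluster state and routes requests itself.
        CloudSolrServer cloud = new CloudSolrServer("localhost:2181");
        cloud.setDefaultCollection("collection1");
        // cloud.add(batch); cloud.commit();  // same API as above
    }
}
```

The only change between single-node and cloud mode is the client class, which is exactly the abstraction being described here.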
So if I run it, it picks some documents from the sample data activity 3 docs folder and indexes them. It is very straightforward, nothing more to it — you can see it print "added document 0" and so on. The same thing when you are using the cloud indexer: you will notice everything else remains the same; we just create a SolrInputDocument and specify the fields and their values, exactly the same way. The only difference is that instead of HttpSolrServer you use CloudSolrServer. There are clients available in other languages also, if you want to use those.

Whenever you index something, it first goes into a transaction log, and from the transaction log, when you call commit — you will see here that I am calling commit — it gets written to the actual index. Unless and until I call commit, it does not get written to the actual index. We can define this commit interval in the configuration file also, but basically whenever you index something, it first gets written to a transaction log, and from that transaction log, whenever we later call commit, it gets written to the actual index.

There are two types of commits: hard commits and soft commits. The way Solr stores the index, it creates multiple inverted-index segments and later tries to merge them: whenever you index new data, it creates a new segment for that data, and later, when you call optimize — there is a thing called optimize — it tries to merge those segments. If you want near-real-time updates — if, as soon as data is indexed, you want users to see it — then you go with soft commits. What a soft commit does is serve reads out of the transaction log itself, so you can get those results without a full commit. The downside is that if the data has not been written to the actual index and your Solr crashes, whatever is inside the transaction log might get corrupted and you will lose it. And it is always good to index in batches, to save on the network.

Okay, now the next part is the data import handler. So far we have been indexing using flat files, flat XMLs. Instead of flat files, you can directly use other data sources; the most common use case is indexing documents from your database itself. You can use the data import handler, define your configuration, and directly import data from the database into Solr.

So we will try this activity. What we will do is first load data into MySQL. If you go to the readme of activity 4, the first thing we need to do is create a database. Inside this activity 4 folder there is a SQL DDL file where we create 3 tables: the first table stores the comments — whenever we have some post on Stack Exchange, there are comments on it — another table stores the actual posts, and the third stores the user id and display name. Using these 3 tables, we will first insert data into them from the Stack Exchange dumps, and then from MySQL we will index the data directly into Solr. First you need to start MySQL — you can use this command to start it — and after that you can go to the activity 4 folder, where you will find the MySQL schema file; you can create the database and tables using that file.
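As a rough idea of what that schema file contains — the table and column names here are assumptions reconstructed from the walkthrough, not the actual file:

```sql
-- Hypothetical reconstruction of the activity's MySQL schema
CREATE DATABASE IF NOT EXISTS stackexchange;
USE stackexchange;

-- the actual posts from the Stack Exchange dump
CREATE TABLE posts (
  id    INT PRIMARY KEY,
  title TEXT,
  body  TEXT,
  site  VARCHAR(64),
  score INT
);

-- comments attached to posts
CREATE TABLE comments (
  id      INT PRIMARY KEY,
  post_id INT,
  text    TEXT
);

-- users: id and display name
CREATE TABLE users (
  id           INT PRIMARY KEY,
  display_name VARCHAR(255)
);
```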
So what this did was create our database and three tables — comments, users and posts — to store the data. Now, inside the sample data folder we have this folder called stack exchange data, and it has data for two of the sites, Robotics and Windows Phone. We will first insert the Robotics site's data into MySQL: if you go to this stack exchange folder, you can use this utility to insert data into MySQL — it reads data from the Stack Exchange dump and inserts it into MySQL.

Why do we want to do that? Because we are going to see an example of how to import data directly from MySQL into Solr, without creating flat files or anything like that. This is basically what the data import handler is: you just need a source and you need a schema, and it will directly import data from the source — and the source can be other things too, any stream. The actual data is present inside sample data, stack exchange data; this is just a utility that reads data from those files and inserts it into MySQL.

So we have loaded the data into MySQL. Now we need to define the configuration for how we will import this data. If you go and look at the table definitions, you can see that the comments table has these columns, the posts table has these columns, and the users table has id and display name. Now, there are three tables, but in Solr everything goes in as a single document, so we need to define a mapping — select this from here, and it maps to this. For the mapping you have to define a config file, and inside the activity folder you will find that config file.

In this config file, I first define the data source my data will come from, so I have given the MySQL details — username and password. Then, to create the document, I define the mapping: select these columns from the posts table in MySQL. This query will be executed on MySQL, and when it executes I get all these columns; then I map those columns to the equivalent Solr fields. For example the site column, which comes from the posts table in MySQL, will go into Solr as st_site. Same thing for comments. As I mentioned, everything lands in a single document, so we can define nested queries as well: after selecting an id from the posts table, I can fetch the comments for that id from the comments table. So you can define multiple nested queries this way. Here I am reading data from MySQL, defining the mapping — how the site column from MySQL should go into Solr, being renamed to st_site on the way — selecting from the other tables too, and finally putting it all into Solr.

Once I have this config file ready, I just need to place it inside the collection's conf folder. The data import handler itself has to be configured in solrconfig.xml, and this is done per collection — one conf folder and one solrconfig per collection. In solrconfig.xml we add a request handler called data import, and in this request handler, inside the config field, I mention the location of the configuration file, so that Solr knows where to read the data import configuration from. (A sketch of both files follows.)
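For reference, a minimal sketch of the two pieces just described — a data import config file and the handler registration in solrconfig.xml. The JDBC URL, credentials and table/field names are illustrative assumptions:

```xml
<!-- data-config.xml: where the data comes from and how columns map to fields -->
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/stackexchange"
              user="root" password="root"/>
  <document>
    <entity name="post" query="SELECT id, title, site FROM posts">
      <!-- MySQL column "site" becomes Solr field "st_site" -->
      <field column="site" name="st_site"/>
      <field column="title" name="st_title"/>
      <!-- nested entity: pull the comments for each post id -->
      <entity name="comment"
              query="SELECT text FROM comments WHERE post_id = ${post.id}">
        <field column="text" name="st_comments"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

```xml
<!-- solrconfig.xml: register the handler and point it at the config file -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```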
Once I copy that file inside the conf folder and add this data import handler to solrconfig.xml, I can click on dataimport in the admin UI and I will see that exact same configuration file over here — this is telling Solr how to import the data from MySQL. Now I can just execute it directly, and it starts reading records from MySQL and indexing them straight into Solr. We can do the same thing for other sources too — we can do it for streams — so we need not worry about first dumping things onto some raw file system and all of that.

Yes, Hadoop integration — there are a couple of different components for Hadoop, where you can read and write your index to Hadoop. But the reason Hadoop plus Solr is not yet mainstream is that most of the features for which you would use Hadoop — the replication, the distributed storage and fault tolerance — are already provided to you by SolrCloud.

Yes, in that case you can do it while importing: you can define script transformers. I mean, you can specify custom functions which will be applied to your data before it is put into Solr. These script transformers you can define in multiple languages — JavaScript and a few others.

Now I will quickly come to the query part and then move on to SolrCloud. There are all sorts of queries you can do in Solr. Solr provides a lot of different query parsers, and different query parsers have different sets of features — some of them are better if you want to do boosting and things like that. You can set the query parser in solrconfig.xml, or choose it yourself at query time in the browser.

So there are different types of queries you can do. You can do a simple text search, where you just search for one keyword. You can change the number of rows — the number of records retrieved. You can have pagination: the start parameter tells Solr where to begin the result window. You can search on specific fields: if you only want to search on a particular field, you say field:value.

Then, while we are searching — you will notice that the data import handler has finished; if I go and look at the records, it has created about 2000 documents from MySQL. Whenever we query anything from Solr, you will see that it returns all the fields of the document. I can limit that by specifying the field list: suppose I am only interested in the title field, I can specify that field and then I will only get that field. Whenever you ask for a field, Solr has to go to disk and fetch it, so unless you need all the fields it is always better to specify exactly which fields you need.

There is a delete query you can use, and AND and OR boolean-level queries, and NOT queries. You can sort the results on some field: suppose you want to sort the results based on the score of a question, or how many favourites that question has received — you can sort on that, and it will give you the documents with the higher score first; you can change the order of sorting from descending to ascending. (A few example query URLs follow.)
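To make those parameters concrete, here are a few illustrative query URLs against the select handler; the collection and field names (st_title, st_score) are assumptions carried over from the activity:

```text
# simple keyword search, second page of 10 results
http://localhost:8983/solr/collection1/select?q=robot&rows=10&start=10

# search a specific field, return only the title field
http://localhost:8983/solr/collection1/select?q=st_title:motor&fl=st_title

# boolean operators and negation
http://localhost:8983/solr/collection1/select?q=st_title:(motor+AND+sensor+NOT+arduino)

# sort by a score field, highest first
http://localhost:8983/solr/collection1/select?q=*:*&sort=st_score+desc
```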
The next thing is faceting, and this is a very useful thing. Let's say you are interested to know how many documents have username equal to, say, "triple h" — this is what faceting gives you: for every field, the set of unique values, plus, for each unique value, the count of documents that have it. For that you use a facet query: whatever field you give it, it returns the set of unique values along with the document counts. For example, I have about 2000 documents, and across those documents there are multiple values for st_site; if I facet on it, I get the unique values present in st_site, and along with each unique value, the number of documents which have that particular value.

You can also change the facet method: while faceting, if the number of unique values for a field is very low, you can set facet.method=enum, which is much better for performance.

Then you can do a stats query. A stats query gives you all kinds of statistics on a particular field — if you have an analytics-type application this might be of use; it gives you the max and min values for that field and so on. Then there is the facet range query: in this example, I am building a chart showing how many documents are present for the year 2012 — basically I am asking Solr to give me an aggregated count of documents over a range of dates. This is achieved through facet range queries; the syntax is this — sorry, I am rushing through these examples, you can take them from the slides. And you can have plain range queries, where you specify that the value of a field should be between this and that.

Then you can do boosting on a field. Suppose you want to return all results from a query, but you want results which are questions to rank higher: you want to see all the results, but at the same time keep questions higher in your result set and answers lower. For that you can use a boost query, and here I am saying bq=st_post_type:question^5 — I am asking Solr to boost documents which have question in that field by a factor of 5. You can use boost queries in all sorts of ways to influence ranking.

Then there is fuzzy search. Fuzzy search is for approximate matching. Suppose you have multiple variants of a word in your documents, like electromagnetism, electromagnetic, electromagnets, and whenever a user searches for electromagnet you want to return results for all of them. In that case you can use fuzzy search and specify a fuzziness factor between 0 and 1: the nearer it is to 1, the less fuzzy — the stricter — the match. So if you specify it like this, it will also give you results matching electromagnetic and electromagnetism.

If the factor is low, does it still match electromagnet, or will it also fuzz the start of the word? If the factor is low, the matching is less strict, so more distant variations will match as well.

What is the difference from what we saw earlier, electromagnet with a star? electromagnet* matches electromagnet plus anything after it — it is a pure prefix match — whereas fuzzy matching works on the edit distance between characters, so the variation does not have to be at the end of the token. Does it have anything to do with the distance between words? Not necessarily — fuzziness is the distance between characters within a token. Distance between words, that is proximity search. (Examples of these query types are sketched below.)
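A few illustrative URLs for these query types; the field names and values are assumptions, not the session's exact queries:

```text
# facet on a field; enum method suits low-cardinality fields
.../select?q=*:*&facet=true&facet.field=st_site&facet.method=enum

# stats (min, max, mean, ...) on a numeric field
.../select?q=*:*&stats=true&stats.field=st_score

# facet over date ranges: one bucket per month of 2012
.../select?q=*:*&facet=true&facet.range=st_created&facet.range.start=2012-01-01T00:00:00Z&facet.range.end=2013-01-01T00:00:00Z&facet.range.gap=%2B1MONTH

# boost questions by a factor of 5 (edismax parser)
.../select?defType=edismax&q=robot&bq=st_post_type:question^5

# fuzzy match on a term
.../select?q=st_title:electromagnet~0.7
```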
Suppose one of your documents contains the phrase "calculating some coordinates". Now, if you search for "calculating coordinates", it will not match. But you can specify a proximity factor saying that even if the distance between "calculating" and "coordinates" is two words, it should still match — you give the allowed proximity between two tokens using proximity search.

Then there are function queries — you can use all sorts of built-in function queries, plus you can write your own function queries to operate on your fields. And there are term and group queries, for grouping results.

A basic use case is more-like-this. The more-like-this feature gives you documents which are similar to a certain document: you give the ID of some document and then ask Solr to give you documents which are "more like this". This you can do using the more-like-this component.

Is this closely related to clustering? No, it is not related to clustering. In clustering, you have some documents, and out of those documents you try to create clusters — you try to derive group labels and then assign each document to one or more of the clusters. Here you are just searching for the terms of one document, and whichever documents match the most of those terms, you return them.

The text inside Solr is already processed — with externally supplied text, can we also make it go through the same processing, and how does that happen? So here I am giving just an ID: since I give the ID, it will get the terms for that document — the document with id robotics_1 — and based on those terms it will try to find matching documents.

Now this part is clustering. In clustering, what happens is: I search for "data" and get these documents; now out of these documents it tries to create some clusters and then assign each document to one or more of those clusters. This is the clustering component, and it actually comes from Carrot2. There are a lot of configuration options for clustering.

Auto-complete — I do not think we have time for this, but it is actually very well documented in the activities. Let's say I first search for "data" and it gives me results, all well and good; but suppose I misspell the word and type "deta" instead of "data". What this spell check component can do is suggest what the user was actually trying to say, and it will provide collations also: it will tell you that if the query had been "data", there would have been, say, 250 hits. You can use this to auto-correct the user's query. How people mostly do this is they always keep spell check on, look at how many documents were found, and then run some analysis on the collations they get back. Suppose when I search for "deta" I get one result and the collation suggests some other term with, say, a thousand matches — then I can show somewhere in the UI, "did you mean this?" So you can use this to build did-you-mean functionality, and you can use it to auto-complete user queries too. I have mentioned over here how to configure this; it is just a matter of configuration — you have to specify which field to use. (A sketch of this configuration follows.)
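A minimal sketch of a spell check setup in solrconfig.xml — the component and parameter names are standard Solr, but the choice of st_title as the source field is my assumption:

```xml
<!-- solrconfig.xml: a spellchecker built from the st_title field -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">st_title</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>
```

The component then has to be attached to a request handler (via its last-components list), and at query time you pass something like spellcheck=true&spellcheck.collate=true to get the suggestions and collations back.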
It will take tokens from that field and do the spell checking and suggestions based on it. So this is the spell check component, and you can use the same functionality for auto-complete: you give some query, some term, and it shows you suggestions for that term. You can also have matching phrases — you can configure it to give you complete phrases, so if you type "for some" it gives you complete phrases instead of single terms. It is very straightforward.

So, the most important part now is SolrCloud. Actually, the reason Solr is popular is largely because of SolrCloud and the way it provides distributed search; before that, it was very difficult to build scalable and distributed search applications.

Let's try to understand why SolrCloud was needed. Earlier we used to have only a single index on a single machine. First of all, the problem was that whenever you were indexing, everything had to go into a single collection, and that affects query time very much: while you are indexing, your index is continuously changing, and you are querying against that same index, so performance was very low. The second problem was that when you have everything in a single place and it goes down, there is no fault tolerance.

SolrCloud solves all of that. It provides performance by splitting your data into multiple portions, which are called shards, and with routing you can query only some of the shards rather than all of them every time. It provides scalability — you can have multiple shards on the same machine or on different machines — and all the coordination is done through ZooKeeper. It is highly available: you can have multiple copies of the same data, so if one copy goes down, you still have a different copy of that data. And it is very simple, completely abstracted: for querying and indexing, you saw in SolrJ that there is no need to change anything apart from swapping the client class.

This is what a high-level SolrCloud setup looks like. By default, when you are using single-instance Solr, there is only one machine, one instance of Solr running, and inside that instance there can be multiple collections, but each of them has only one shard. What SolrCloud gives us is that the same collection — let's take collection2 as the example — can be split across multiple machines: on multiple machines we run multiple instances of Solr, some of the machines hold some portion of the data, and on top of that we can make copies of that data. All the coordination, when running in SolrCloud, goes through ZooKeeper.

So what does ZooKeeper do? First of all, we start a ZooKeeper cluster. ZooKeeper keeps the shared resources: up to now you have seen that we keep all the configuration inside the collection1 directory itself, and that is not a good way to do it. With ZooKeeper, we can put all the configuration in ZooKeeper, and then whenever we want to create a new collection, we can reuse the same configuration from ZooKeeper to create it. So there will be a cluster of ZooKeeper nodes, and ZooKeeper helps the communication between multiple shards on multiple machines. ZooKeeper is not specific to Solr — it is a standalone project, used in a lot of other distributed systems, and it provides a lot of distributed coordination facilities.
On the Solr side, we use ZooKeeper for two things: first, for communication between Solr nodes on different machines, and second, for keeping the shared resources. A portion of the data, in SolrCloud, is called a shard.

So what I can do is quickly bring up a cluster — do you guys want to try this exercise? Yes, we can do this. We have to start a ZooKeeper node, so let's open the activity. What we will do now is: start a ZooKeeper node, start two Solr nodes, create a distributed collection, index some documents into that collection, and see how that goes.

First we have to start ZooKeeper, so just go to the ZooKeeper folder. ZooKeeper uses its configuration file to know which machines are in the quorum, but since we are starting only a single-node ZooKeeper cluster, we can leave it at the defaults. The default data directory is /var/zookeeper — ZooKeeper stores its own data and the shared resources in that directory — and by default it runs on port 2181. So just run the zkServer start command; it will start a ZooKeeper node in the background. After starting, you can check the ZooKeeper status, and it will show the status — it is showing the mode as standalone.

Now that we have ZooKeeper up, we will create two Solr instances. As I mentioned earlier, the example folder is actually a server folder, so we will make two copies of this example folder and fire up Solr on both copies. I have created the two copies. If you go inside the solr folder in both copies, you will see that you still have collection1 there. The first thing I have to do is remove this collection1 from both nodes, because its configuration will instead live as a shared resource in ZooKeeper.

After removing collection1 from both nodes, I start my Solr and point it at ZooKeeper, and that way Solr starts in cloud mode. Starting SolrCloud is nothing but giving Solr the host of ZooKeeper: it will automatically understand that we are trying to start it in cloud mode, and it will take care of all the coordination. Now, since we are starting two nodes on the same machine, we have to change the port of one of them: for the first instance we set the port via the -Djetty.port property. The next property is -DzkHost=localhost:2181 — what you can give through zkHost is a comma-separated list of ZooKeeper nodes, and since we are running a single-node ZooKeeper locally, we give localhost:2181. Then the other property is -Dbootstrap_conf=false, which says that I do not want to upload any default configuration to ZooKeeper — otherwise you can have it upload a default collection's configuration — so we give false. Then run the start command; I am just copy-pasting directly from the readme. You will see over here that the client has connected to ZooKeeper and has updated the cluster state from ZooKeeper. (The commands are sketched below.)
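Roughly, the commands as run in the session — the paths and the two Jetty ports are assumptions (the readme in the VM image is the authoritative version):

```sh
# start a single-node ZooKeeper (default client port 2181)
cd work/zookeeper/bin
./zkServer.sh start
./zkServer.sh status          # should report Mode: standalone

# node 1: Solr in cloud mode, pointed at ZooKeeper
cd ../../node1/example
java -Djetty.port=8983 -DzkHost=localhost:2181 -Dbootstrap_conf=false -jar start.jar

# node 2 (in a second terminal): same command, different port
cd ../../node2/example
java -Djetty.port=7574 -DzkHost=localhost:2181 -Dbootstrap_conf=false -jar start.jar
```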
Now if you go to the browser: whenever Solr starts with a ZooKeeper host provided, it adds an additional tab over here called Cloud, and this Cloud tab shows you the data from ZooKeeper. If you go inside Cloud and then Tree, it shows you the list of live nodes; right now there is only one live node, running on the first port. So let's start one more node on the second port — the remaining command is the same, only we change -Djetty.port. This command I am also copy-pasting exactly from the readme. Now if I go back to the tree and look at the list of live nodes, it shows me two live nodes. So now we have two Solr nodes running and one ZooKeeper running.

Next we need to upload some configuration which will be used for creating a collection — again I will refer to the readme: uploading a config set to ZooKeeper. Instead of calling it a configuration, we call it a config set, because it is multiple configuration files together. Solr provides some cloud scripts to upload things to ZooKeeper — what you put inside ZooKeeper is completely independent of Solr, but you can use the scripts which come bundled with Solr for the upload. If you go inside the solr-4.x distribution, there is a cloud-scripts folder: I am here inside node 1's cloud-scripts folder, and it has the zkcli script. I can use this to upload a config set; the config set is nothing but the same configuration files which we had for collection1. What I am telling this script is: upload the config to ZooKeeper at this host, upload this directory, and name this config set collection1.

A quick word about ZooKeeper in production: what folks normally do is, instead of running a single ZooKeeper, we run a quorum — a cluster of ZooKeeper nodes. You can see over here that all the coordination happens through ZooKeeper, so if our only ZooKeeper goes down, it takes the whole cluster down with it; so we run multiple ZooKeeper nodes alongside multiple Solr nodes and they talk to each other.

So I am uploading this config set under the name collection1, and it has uploaded successfully. Now again, if I go into the tree view — Cloud, then Tree — it shows me a new folder called configs, and inside configs I can see the name of the config set I just uploaded. Inside this config set you will see all the configuration files: here is the schema, here is our solrconfig, and all the additional stuff that comes pre-packaged.

So now we have two Solr nodes, one ZooKeeper, and one config set, and we can use this config set to create a collection. While creating a collection, in SolrCloud I now have the power to decide how many partitions of the data I need — I can say how many shards I want — and additionally how many copies of each shard I need, the replication factor, plus how many shards I allow on a single node. Solr provides the collections API for all these operations: creating collections, creating shards and replicas, splitting a shard into multiple parts, making copies of shards, moving shards — there is a whole bunch of collections API actions for everything. So this is the command to create the collection (both the upload and the create commands are sketched below).
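Roughly, the two commands — the ZooKeeper host, paths and port are assumptions consistent with the setup above:

```sh
# upload the collection1 conf directory to ZooKeeper as a config set
cd node1/cloud-scripts
./zkcli.sh -zkhost localhost:2181 -cmd upconfig \
           -confdir /path/to/collection1/conf -confname collection1

# create a collection from that config set via the collections API
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=collection1'
```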
What it is saying is: the action is CREATE, the name of the collection should be collection1, and numShards=2, which means I will break my data into 2 parts. Then the replication factor — I set it to 2, which means I want 2 copies of each shard. Then I have to provide the config set which will be used for the collection: we have the config set named collection1, so I give that as the config name. And yes, the port in the URL should be one of the running nodes.

Plus one more thing: you will see that I am telling Solr to create 2 shards with 2 replicas, which means 4 portions of data in total, and I have a 2-node cluster, so each node will have to hold 2 of them. So I have to add one more parameter, maxShardsPerNode=2; that way each node may hold 2 shards, and in total I will have 4 — 2 actual shards and 2 copies, the replicas.

At this time you are not saying how to shard — what data goes to shard 1 and what goes to shard 2? Right, that is the thing called document routing; you can control that too, and I will explain it.

Is it necessary to specify max shards per node if you are creating more than the default? The default value is 1, so if you have only 2 Solr nodes running, you would only be able to create 1 shard with 1 replica, or 2 shards with 0 replicas — by default it accepts only 1 shard per node, and if you want more you need to change that. In production, most of the time we do not do this: we always keep 1 node hosting 1 portion of a shard. You can have multiple collections on a machine — suppose you have a 2-machine cluster, you can have say 20 collections across those machines — but for each collection we want only one part of that collection on any one machine, so that if the machine goes down, there is a replica on some different machine. If I allow 2 shards per node in production, it might put a shard and its replica on the same machine, and if that machine goes down I lose both the shard and the replica.

So now it has created 4 different cores, and it has given them names — you can provide names yourself if you want, but by default it names them like collection1_shard1_replica1, collection1_shard1_replica2, and so on. Now if I go into the Cloud view again, I see collection1 broken into 2 shards, and each of these shards has 2 copies; out of these copies, it automatically chooses a leader.

So now, how will documents get indexed? Before indexing, the document will first go to one node — obviously — and from that node it will be copied and replicated over to the other replicas. So which node is chosen as leader? For this leader election, while creating the collection, Solr first goes to ZooKeeper — it is the task of ZooKeeper to tell which nodes are alive — and out of those alive nodes it selects one as the leader, and the leader takes all the incoming indexing, read and write requests. So for collection1, before creating it, the Solr node went to ZooKeeper and asked for the number of alive nodes; ZooKeeper said there are 2 alive nodes, and then it selected one of them as leader.
For every shard a leader is chosen: say for shard1 this node is chosen leader, and by chance for shard2 that node is chosen leader. So whenever I index something destined for shard1, it will first go to shard1's leader and then be replicated to the shard1 copy on the other node. ZooKeeper is responsible for telling which nodes are alive, while the leader election code itself lives in Solr; what the ZooKeeper side holds is this cluster state file. If you look at this cluster state file, it has information about all the shards: first you see the state of each shard, then a range. This range actually decides which document goes where — it is basically a hash: before indexing, Solr computes a hash of the document's key, and based on this range it decides where to put the document, where meaning which shard. You can control this document routing; you can specify multi-level routing keys.

Okay, let's just try to index some documents in cloud mode. In this SolrJ project there is a cloud indexer file; if you just run it, it will index 4 documents into SolrCloud, and you can influence the hash through what you put in your ID. So now I went to the node and selected collection1_shard1_replica2, and let me do a select-all query. At the collection level you will see that we have 4 documents, but when I look at collection1_shard1_replica2 itself, it is showing numDocs equal to 1, because this shard is only hosting one of the documents and the other documents are in the other shard. With the automatic hashing it tries to balance the number of documents each shard receives, but sometimes it comes out a little lopsided.

So this is the high-level architecture: there is a ZooKeeper cluster, all the shards and replicas talk to ZooKeeper, and based on that they decide where to send what.

Your question is about sharding strategy. This is a very important part — how to shard, how many shards to have versus how many collections to have. The most basic practice: for a small data set it does not matter, put anything anywhere; as long as your data fits inside your RAM and you do not have very strict query latency requirements, it does not matter. But if you are running into scaling issues, the first thing you can do is divide your data into multiple collections. Suppose you are doing analytics and you are getting data every month: instead of putting all the months' data into a single collection, you can split that data into multiple collections, and then Solr provides a facility in the collections API called collection aliasing. Using aliasing, you can abstract over the split collections — you tell Solr that even though the data is split into multiple collections, there is an alias, a "master collection", and you direct all your reads to that alias; it will read data from all those collections and return the results to the client. (A sketch of the aliasing call follows.)

The other question that comes up is how many shards to have for my data. The more shards you have, the more you are distributing the work among your CPUs, so that comes down to what level of query latency you want.
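For instance, aliasing monthly collections behind one read alias might look like this — the collection and alias names are made up for illustration:

```sh
# point the alias "logs" at three monthly collections; reads against
# "logs" fan out across all three
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=logs&collections=logs_jan,logs_feb,logs_mar'
```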
In most cases, on one machine you can have something like 8 to 10 shards and it will perform well. So how many collections and how many shards to have depends on your use case and on exactly what latency you want, but Solr scales very well when creating large numbers of shards and cores across any number of machines. So one option is multiple collections; the other thing is that you may want to control which document goes to which shard, and you can have that too.

Before that, replication. The first thing is that in SolrCloud there is no master-slave, nothing like that — there is no master node and no slave node. Whenever a node goes down, Solr talks to ZooKeeper: if a replica goes down, no problem; if a leader goes down, there will be a new leader election. As long as one replica is alive for every shard, Solr keeps processing and gives you responses. If all the replicas for a shard go down, Solr will tell you, okay, all the replicas are down for some shard, I cannot give you a complete response — and in that case you can override that warning and say: whatever data you do have, give me that.

So yes, you can have custom routing also. By default it hashes the ID, but if you want your own routing, you do that by using a multi-part ID. The syntax is this: suppose you have data for three domains, and you know that at query time a user only ever queries one domain's data. Then you can put the domain name as the first part of the routing key, with an exclamation mark as the separator, and then the rest of the ID — and you can have up to three levels of routing this way: domain first, then another field, and route based on that. While indexing, you just have to construct this routing key; that is the only thing. (A sketch follows below.)

How does indexing work? CloudSolrServer is a smart client: whenever we index something — as we just did using SolrJ — it first goes to ZooKeeper and gets the cluster state, and based on the cluster state it finds the shard it should index to. The leader first writes the document to a transaction log — as soon as it is written to the transaction log you can see those results — and after adding it to the transaction log, the leader replicates it to whatever replicas exist for that shard; when the replication is complete, it returns saying the indexing is done. The same goes for querying: whenever we query for anything, the client first goes to ZooKeeper, gets the shards it should query, then gets the results and returns them.

One thing while querying in SolrCloud mode: as far as possible, if you don't need all the fields, don't ask for all the fields — whatever fields you need, fetch just those. In a lot of production use cases we only index documents in Solr and keep a mapping from the Solr ID to some external database: you get only the ID from Solr for the matching documents, and for that ID you fetch the actual record from the external database. There is a very good optimization when you are only querying for the ID field, so if you only need the ID, just set the field list to id.

And even after setting up the shards — suppose while creating the collection you said you need two shards — you can always go ahead and split into even more shards later; it is very dynamic, very flexible.
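Two quick sketches of what was just described — the composite routing key and the shard split call. The IDs, collection and shard names are assumptions, and the _route_ query parameter was called shard.keys in older 4.x releases:

```sh
# composite ID routing: everything with the "domain1!" prefix hashes to
# the same shard, e.g. a document indexed with  id = domain1!post42

# query only that domain's shard using the _route_ parameter
curl 'http://localhost:8983/solr/collection1/select?q=*:*&_route_=domain1!'

# later, split an existing shard into two via the collections API
curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1'
```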
So we have the collections API for creating collections and doing all sorts of things.

Quickly, I will explain the performance factors. Performance basically hinges on four things. The first is your schema design — that is the most important part, and this is where most performance issues come from when people are starting out, because Solr comes bundled with the example folder, which has everything turned on by default plus a lot of additional stuff we don't need. Most of the time, folks take the example folder as their reference and start building on it, so they end up with a lot of indexed fields they don't even use, copy fields, and lots of extras. So the first thing you should do, even before taking your product to production, is clean up the schema: understand what everything does, and remove anything you don't need. People have seen differences of 10x to 20x in performance just from cleaning up the schema — that is the level of harm a wrong schema can do.

The next thing is omitNorms — norms take a lot of JVM heap. As long as you don't care about length normalization in the scoring of a field, you can just turn norms off with omitNorms, because if norms are on, then for every such field in every document Solr stores one extra byte of data; when you are talking about billions of documents, that adds up to a very large amount of RAM and it really hurts your JVM performance. Term vectors and docValues: term vectors, as long as you don't need them, just turn them off; docValues are for when you are faceting or sorting on fields with very high cardinality — a very high number of unique values — and in that case you should definitely try them, they are very good for performance.

Caches — caches are the most important thing. Solr has caches for different types of things, so when you are designing your queries, try to make as much as possible cacheable and be very deliberate about what you put in the cache. Solr's caches are very smart, and they are one of the reasons Solr performs so well, so spend some time understanding what each cache does and shape your queries based on that.

On the indexing side, bulk updates are always good. Then there is the commit strategy: you need to be very smart about when you commit, because committing takes a lot of resources, so you need to keep a balance between how much you are indexing and at what intervals you are committing. Another thing is optimize: optimize is a very memory- and CPU-intensive operation — it improves query speed, but it takes a lot of time — and a lot of folks run optimize very frequently, and that truly kills performance.

On the querying side, be smart and use as many filters as possible. Solr has something called fq: whatever you put in fq goes into the filter cache, and most of the time you will be reusing your filters. Think of an e-commerce website: the user first selects a category, then some other refinements; if you put all of those into fq while querying, they get cached, and your subsequent queries become very fast. (A small sketch of these settings follows.)
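As a rough illustration of those knobs — the field and filter names here are assumptions, not the session's schema:

```xml
<!-- schema.xml: no norms or term vectors on a field we never score by
     length; docValues on a field we facet and sort on heavily -->
<field name="st_site" type="string" indexed="true" stored="true"
       omitNorms="true" termVectors="false" docValues="true"/>
```

```text
# cached filters (fq) narrow the result set and are reused across queries
.../select?q=laptop&fq=category:electronics&fq=in_stock:true
```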
So that's it, I guess. I am really sorry that we had to rush and couldn't cover a lot of the hands-on. Could you spend some time comparing Solr to — yes: Solr versus Elasticsearch. That is a very important question, because whenever you try to implement something, this is the first choice you have to make: should we go with Solr or with Elasticsearch? We also did some study on what to use. What I personally feel — and first of all, I am a little biased towards Solr — is this. The first reason is that it is an Apache project: tomorrow anything can happen to Elasticsearch, you don't know, but Solr is going to be there, Apache is going to be there, so nothing is going to happen to Solr. The other thing is the very strong developer community behind Solr compared to Elasticsearch. Elasticsearch is also good — especially since it has received huge funding recently, so you will see more activity there.

Where Elasticsearch is good: Solr was written a very long time back, and at that time it was not written with this distributed functionality in mind, whereas Elasticsearch shines in those areas because it was written from the ground up for distributed operation. Its APIs are a little nicer, you could say, so it is easier to get started. Sometimes people feel intimidated by having to set up the ZooKeeper cluster — if you go and look at comparisons online, people will say, "oh, I have to set up a ZooKeeper cluster for Solr, I don't have to do that in Elasticsearch, Elasticsearch forms its cluster automatically on its own." But that is not a reason to pick a technology: it might feel intimidating the first time, but later you will realize it is not that big a deal.

There are a few features that are available in Solr but not in Elasticsearch, and vice versa, so if you have specific requirements, weigh those. I would say go for Solr — it is very good, and at least for us, we have scaled it to billions of documents with very good response times and very reliable clusters; I personally never saw any problem with using Solr. There are some features, like the percolator feature in Elasticsearch, that are yet to come to Solr (it is available through an external plugin); it is useful for a lot of cases, and if you need it, you can go for Elasticsearch. At the end of the day, you have to decide what exactly your use case is and what features you need. Personally, with Solr I never saw any kind of problem — not with scaling, not with configuration, not with developer help.

What I will do next is send out a cheat sheet with a lot of useful links you can go and refer to. I will also write up a longer document about how to use these activities, because we had to rush through most of the hands-on session — I will provide a document for the activity folder and upload it somewhere, so that folks who couldn't get the VM running can set it up and configure it on their own machines, and just go through the activities and try them yourselves. Apart from that, if you have any questions, just drop me an e-mail — I will be there for you. Thank you.