So good afternoon everyone, I am Amrit Sarkar, working as a search engineer at Lucidworks, and for the next half an hour or so we will see how we can build analytics applications with streaming expressions in Apache Solr. A bit about Lucidworks: we are an enterprise search company based out of San Francisco, with offices all over the world, including Bangalore. We have a product, Lucidworks Fusion, which is built on top of Apache Solr and drives search engines, and we also provide consulting and support to organizations who are using Solr as their search or analytics technology.

So we will begin the discussion with the challenges one faces while building these applications on near real-time data. We will introduce streaming expressions and get a brief overview of the same. We will categorize the expressions into sources, decorators, and evaluators, and we will see some examples. This particular talk is heavily use-case based: we will discuss some real-life use cases, from simple to complex, and understand their performance complexity. We will then introduce statistical programming, some fairly new statistical functions added to Solr, before listing out the references.

So, for building an analytics application on offline data, we have a number of tools and technologies already available. You can pre-process offline data to create certain views, which can then become the input for creating meaningful dashboards. The challenge arises when you receive constant updates and you need to refresh your analytics applications at regular intervals. Executing complex correlations and functions on unstructured, non-pre-processed data is extremely time consuming. Also, to bring together an entire analytics application, you are dependent on multiple tools (a database, a tool for pre-processing the data, a data visualizer to build the dashboards), which leads to higher maintenance cost.

So before getting into what a streaming expression is, let's get a brief overview of what Apache Solr is. Apache Solr is an open-source search engine built on top of the Apache Lucene library. It is highly scalable, flexible, and provides rich search capabilities on text. There is a distributed mode in Solr called SolrCloud, where a number of Solr nodes can reside on different servers, managed by Apache ZooKeeper. There are other features like spell checking, autocompletion, and highlighting available in Solr itself. In this talk we will discuss streaming and aggregations, and there are advanced features like Learning to Rank which leverage machine learning algorithms to improve your search experience.

The features on the right-hand side comprise the parallel computing framework of Apache Solr, which is only available in SolrCloud mode. In this talk we will discuss the streaming API, streaming expressions, shuffling, and worker collections, while Parallel SQL is a wrapper on top of streaming expressions: whatever SQL query you provide to Solr gets converted into a streaming expression internally and executed implicitly.

Starting with the streaming API: it is a Java API available for parallel computation of MapReduce and relational algebra operations. You create streaming objects through a StreamFactory, and you have a number of APIs available to perform different operations. The open() function emits the search results as a stream of tuples. Each tuple comes from a TupleStream object; TupleStream is the base class of streaming in the Apache Solr source code.
And all the related streaming object APIs are available in the package org.apache.solr.client.solrj.io. To extend these streaming capabilities to non-Java folks who are building their applications in Ruby, Python, Perl, or any other programming language or script, streaming expressions were introduced. These are a stream query language and the serialized format for the streaming API. An expression can compile to a TupleStream object, and a TupleStream object can in turn be serialized back into a streaming expression. They can be executed directly via HTTP against the /stream handler, which is implicitly defined in Solr itself, or through SolrJ.

Now let's look at a common example. Here, we are performing a full index search on a Solr collection and retrieving some specific fields. If you look at the bottom, we are executing this query against the /stream API, and the expression itself is a search expression, which takes the input gettingstarted, which is the collection name. We have specified the ZooKeeper on which this collection is hosted, localhost:9983. In the q parameter we have "hatchbacks": we only want to retrieve those documents which have the keyword hatchbacks in any of their fields, and we are limiting the result set to the year 2014. We are retrieving id and model name, and we want the final result set to be in ascending order of id.

Looking at the result on the right side, in the top right corner we can see a representation of what is happening: we are fetching data from the collection gettingstarted and performing a search on top of it. The result set is emitted in the form of JSON, where we have ids 1, 2, 3 in ascending order, with their respective model names. The last tuple of a streaming expression is EOF (end of file): true, along with the response time. The response time is the execution time of the expression on the Solr server itself; in this example, it is 12 milliseconds.

These expressions can be divided into certain categories depending on their usability. We have stream sources, which are the origin of a tuple stream: you have search and facet streams, which can fetch data from a SolrCloud collection; you have jdbc, which can pull data from a relational database; and we have stats, topic, timeseries, and train, the last of which can build a machine learning model and extract certain features. These sources are then wrapped by stream decorators, which perform aggregations, operations, and functions on top of them. These operations are performed row-wise. Suppose you want to merge two result sets, perform an inner join, select only the top five, or only want to select unique tuples, unique rows: stream decorators are there for that. If you want to calculate or add new field values depending on the existing field values in each row, stream evaluators come into action, and they operate column-wise. You can execute straightforward mathematical calculations like division, multiplication, and summation, and you can also perform conditional statements like if-then-else.

So, as I said at the beginning of the talk, we will discuss some real-life use cases here. The use cases discussed have been adopted by our clients at Lucidworks in one way or another. So first look at the data set we have here. Suppose we have a collection where we have data for certain flights, with their source city and their destination city.
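Coming back to the search example for a moment: written out, the expression is roughly the following sketch. I am assuming the year filter lives in a field literally called year, and that the fields are id and model, as on the slide:

    search(gettingstarted,
           zkHost="localhost:9983",
           q="hatchbacks",
           fq="year:2014",
           fl="id,model",
           sort="id asc")

You would send this as the expr parameter to http://localhost:8983/solr/gettingstarted/stream and get back the JSON tuples shown on the slide, terminated by the EOF tuple.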
And we want to determine the destinations which are reachable within a single stop from New York, that is, with one stoppage or layover. There are obviously a number of solutions possible for this particular use case, but in this example we try to visualize the data in a graphical format. So the inner nodes streaming expression on the distances collection puts New York as the root node of that graph and gathers all the immediate destination cities from New York. The destinations we gather here then become the input sources for the outer nodes streaming expression, so that we get all the destination cities one level away from New York, as in the graphical representation. And if you look at the result set, New York is at level zero because it is the root node, and we have another node, Bangalore, which is at level two as desired, whose ancestors are Paris and New Delhi, such that you can fly from New York to New Delhi to Bangalore, or you can fly from New York to Paris to Bangalore.

And there is a very popular use case where you have a large amount of data in your collection, or in Cassandra, or in any other storage, and you want to retrieve the relevant keywords from it so that you can get a brief summary of what that data represents. The significantTerms expression in Solr leverages Apache Lucene's rich text search capabilities and helps us retrieve those terms. I am using the Enron emails data set here, which is a very popular training set for data science problems, and we are restricting the result set with the q parameter to Belden, that is, all the emails sent to Tim Belden only, plus some extra parameters: the minimum number of documents the terms should be present in is 10, not more than 20% of the documents should contain the keyword (so that you can rule out the very common English words used in communication: helping verbs, articles, prepositions, etc.), and we are also defining that the minimum length of the term should be 5. Looking at the result, we get this information in descending order of score, depending on relevance. We have the term "entities" and the term "John J. Larea", which were used more often when anyone was sending an email to Tim Belden.

Now, these were very straightforward use cases we discussed, where we just leveraged one single streaming expression and got our results. Let's try a bit more complex one. So we have two data sets here. On the right-hand side, we have data for certain organizations who have adopted certain campaigns, and we have weekly data of their impressions, clicks, and conversions. Looking at the data, for campaign o1, organization o1 received 134 impressions, 48 clicks, and 4 conversions, and we have this information in a weekly manner. On the left-hand side, we have the currency, the cost a company or an organization incurred while adopting these campaigns: campaign o1 cost 6600 units. And we want to calculate some real-time analytics, some useful metrics, on top of them. We want to calculate the conversion ratio, the CTR (click-through rate), and a cost ratio, conversions to currency cost. So first we need to visualize these two data sets in one place so that we can do those mathematical calculations: we have to join the cost data with the aggregated conversions, clicks, and impressions per campaign.
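For reference, the two-level graph traversal described above can be sketched with nested nodes expressions; the field names src_city and dest_city are my assumptions for how the flight data is modeled:

    nodes(distances,
          nodes(distances,
                walk="New York->src_city",
                gather="dest_city"),
          walk="node->src_city",
          gather="dest_city",
          trackTraversal="true",
          scatter="branches,leaves")

The inner nodes call gathers the direct destinations from New York; the outer call walks one more hop from those gathered nodes (exposed in the node field), scatter="branches,leaves" emits the root and intermediate cities along with the level-two destinations, and trackTraversal keeps the ancestors, which is how Paris and New Delhi show up as ancestors of Bangalore.

And the significantTerms query on the Enron data would look something like the following sketch; the exact query and the body field are assumptions on my side:

    significantTerms(enron_emails,
                     q="to:tim.belden@enron.com",
                     field="body",
                     minDocFreq="10",
                     maxDocFreq="0.2",
                     minTermLength="5")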
And for the simplicity of this talk, we will restrict the data set to organization o1. First we will execute a search stream expression; we already saw an example of the same. We are restricting the result set to organization o1. We wrap this search expression with a rollup. Rollup performs a straightforward group-by, rolling up over a field; we specify the field as campaign ID. And along with rolling up, we calculate some extra variables, the sums of conversions, impressions, and clicks, to get the aggregated data. Once we have this aggregated data, we wrap our rollup expression with a select, which just renames these variables to whatever names we want to give. In this example, we rename sum(conversions) to agrConv, and respectively for impressions and clicks. Then finally, we fetch data through a search expression from the currency cost collection, and we join the result set of this search expression with the top-level select: we inner join on the field campaign ID.

This looks like a big query, right? Let's look at the representation to understand what's happening here. If you look at the graphical representation in the top right corner, we are fetching data from the weekly data collection, performing a search on top of it, rolling up over the field campaign ID, and renaming the variables with a select. In parallel, we are fetching data from the currency cost collection, performing a search on top of it, and these result sets are inner joined. In the result set we have, for campaign o1, the respective conversions 41, currency cost 6600, and clicks 259, and we have these numbers for campaign o2 and campaign o3 as well.

Now, since we have the entire data in one place, we can visualize the data in one single frame; we just need to perform some mathematical calculations. So we wrap the entire query we discussed in the last slide with a select and perform divisions. To get our conversion ratio, we divide the aggregated conversions by the clicks. To get our CTR, click-through rate, we divide the aggregated clicks by the impressions. Similarly, for the cost ratio, we divide the currency cost by the aggregated conversions. Look at the graphical representation from left to right and you will understand how we formed this query from the beginning and how we have reached this far. Looking at the numbers (I hope they are visible, otherwise I will just summarize): for campaign o3, the CTR is 0.36, which is the best of the three, and the other two ratios, the conversion ratio and the cost ratio, are best for campaign o1 as compared to the other two.

So we are done with the first phase of discussing streaming expressions, where we discussed the use cases we can implement. Let's move on to discuss how complex they are when they get into execution. So, whatever we discussed in the last slide, we calculated some metrics. Now we want to store these metrics in a separate collection as part of a report. It can be a monthly, biannual, annual, or a weekly thing too. So you have an update expression. We wrap the entire expression we just created in the last slide, and we specify the collection we want to index into, which in this case is the collection named report. I have specified the batch size to be 500; we have only three rows there, for campaigns o1, o2, and o3.
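Written out in full, the composed query is something like this sketch; the collection and field names (weekly_data, currency_cost, org_id, campaign_id, and so on) are stand-ins for whatever your schema actually uses:

    select(
        innerJoin(
            select(
                rollup(
                    search(weekly_data,
                           q="org_id:o1",
                           fl="campaign_id,conversions,impressions,clicks",
                           sort="campaign_id asc"),
                    over="campaign_id",
                    sum(conversions),
                    sum(impressions),
                    sum(clicks)),
                campaign_id,
                sum(conversions) as agrConv,
                sum(impressions) as agrImp,
                sum(clicks) as agrClicks),
            search(currency_cost,
                   q="org_id:o1",
                   fl="campaign_id,currency_cost",
                   sort="campaign_id asc"),
            on="campaign_id"),
        campaign_id,
        currency_cost,
        div(agrConv, agrClicks) as conversionRatio,
        div(agrClicks, agrImp) as ctr,
        div(currency_cost, agrConv) as costRatio)

Note that both sides of the innerJoin are sorted by campaign_id ascending; the join decorator requires both streams to arrive sorted on the join key, and rollup likewise needs its underlying stream sorted on the over field.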
This batch size of 500 just signifies that if you have a result set with millions and millions of docs, it will index those documents in batches of 500. Looking at the representation from left to right, we are now up to the update expression. And if you look at the result set, we are indexing three documents, as we calculated, and we have something called a worker here. All the use cases discussed until now are being executed by a single node, a single worker, only; there is no parallelism introduced so far. Right? And assuming there are n rows involved in executing this expression, our performance complexity P is O(n).

Now, there is a concept called shuffling in streaming expressions. As I just said, in the use cases we discussed until now, only a single node was used while executing the expression. The request is sent from a client to a stream handler, which forwards it to the respective workers; we can define more than one worker for a streaming expression execution. Now, I hope everyone understands what a Solr collection is: it represents the entire data set. The collection can be divided into logical slices called shards, and each shard can have multiple copies called replicas. In this case, we have five shards, one to five, and two replicas each, such that we have a total of 10 cores representing the Solr collection. The workers which have received the streaming expression query will then fetch data randomly from these cores: worker one can fetch data from shard 2 replica 2, worker three can fetch data from shard 1 replica 1, worker five from yet another shard and replica, and respectively for the other workers. And every time you execute the same query, they can fetch data from a different shard, from a different core.

Now, we can have a use case where we want to send a correlated subset to a single worker only. Suppose you have a data set with a field called category. You want to send all the documents with category science to one of the workers, say worker one, and all the documents with category mathematics to another worker, say worker two, and correspondingly for the other workers, such that you have the correlated data in one place, and when you perform those mathematical calculations, the numbers come out right. This particular concept is called controlled shuffling, where we can specify a parameter that sends a particular subset to a particular worker, such that each worker will request each shard in the collection and only retrieve its particular documents.

Now, the workers we discussed are basically part of a worker collection. These worker collections are just regular Solr collections in a SolrCloud cluster. They can be part of the same cluster, or they can be part of an entirely independent cluster. The goal here is to separate data processing from data fetching. You can have a primary collection which hosts your primary data, and you can have another worker collection residing somewhere else, which will fetch the data from your primary collection and perform the processing, so that you can have multiple servers doing different things. So, as I said, worker collections perform streaming aggregations, they receive shuffled streams from the replicas, and they can be empty.
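The shape of that update wrapper is simple; here is a sketch with a plain search standing in for the full metrics query from the previous slide (same assumed names as before):

    update(report,
           batchSize=500,
           search(weekly_data,
                  q="org_id:o1",
                  fl="campaign_id,conversions,clicks",
                  sort="campaign_id asc"))

update streams the tuples of its inner expression into the report collection, indexing them in batches of 500.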
They can be created just in time, or they can be regular Solr collections hosting data of their own. So I hope everyone is still with me on the use case: two slides back we calculated some metrics and indexed them into a separate collection. Now we want to implement the same use case, this time in parallel, using n workers. Now, this is a big query. The update query we formed until now is wrapped by a parallel expression, whose first parameter is the worker collection name (here, a collection called workers), and we have defined the number of workers we want to leverage, which is three. This worker collection is hosted on the same ZooKeeper as the primary collections, and I want the final result set to be sorted in ascending order of campaign.

Now, I discussed something about controlled shuffling. For calculating the numbers in this use case, we want to send all the documents of campaign o1 to one of the workers, o2 to another worker, and o3 to any other available worker. So we have a parameter called partitionKeys, which you can define in the stream sources. In this case we have two stream sources: a search which is fetching data from the weekly data collection, and another search which is fetching data from the currency cost collection, and in both we specify partitionKeys as the field name we want to partition the data on, which is campaign ID.

Let's look at the result. On the left-hand side we have the representation, where we can see three parallel executions of that update expression being accumulated, or aggregated, on a parallel node. And on the right-hand side we have the result set: the worker core shard2_replica_n4 is responsible for indexing only one document, another shard replica is responsible for the next one, and finally shard4_replica_n12 is indexing the last document.

Now let's calculate the complexity of this expression. Obviously, some extra amount of work needs to be done to aggregate the respective results from each worker, which we denote as Z. Assuming a total of n rows are involved in executing this expression, the complexity now becomes P = Z + O(n/w), where w represents the number of workers we are using. Since the first term is almost negligible in comparison to the other in terms of complexity, we can safely ignore it, and we can state that the execution time of the streaming expression is improved by a factor of the number of workers we are using.

So now we discuss statistical programming in Solr; these are some fairly new statistical functions added to streaming expressions. Solr's rich text search capabilities can now be combined with in-depth statistical analysis. We can now do covariances, correlations, Euclidean distances, k-nearest-neighbor graphs, and you can represent your graphs in multiple formats through statistical programming, which is backed by the Apache Commons Math library. The statistical programming syntax is used to create arrays from lists of tuples, so that you can transform, manipulate, or analyze them.

Let's discuss a use case of how we can actually use this. These are some real stock market data from February 2013 to January 2017, with the company names abstracted. So the stock price for company A on 1st February 2013 closed at 30 points, stock B on the same date was at 168, and correspondingly we have this information for stock C, and we have this historical data for four years.
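Extending the earlier update sketch, the parallel version looks roughly like this; again the inner search is a simplified stand-in, and in the real query every stream source underneath carries the partitionKeys parameter:

    parallel(workers,
             update(report,
                    batchSize=500,
                    search(weekly_data,
                           q="org_id:o1",
                           fl="campaign_id,conversions,clicks",
                           sort="campaign_id asc",
                           partitionKeys="campaign_id")),
             workers="3",
             zkHost="localhost:9983",
             sort="campaign_id asc")

parallel sends the wrapped expression to three workers in the workers collection; partitionKeys="campaign_id" is the controlled shuffling part, guaranteeing that all the tuples for a given campaign land on the same worker.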
And we want to determine the correlation among the stocks from the historical data: for a given time frame, if the stock price for company A is going up and at the same time the stock price for company B is going up, then they are positively correlated, while if the opposite is happening, that is, the stock price for company B is going down, they are negatively correlated.

So let's determine the correlation between stocks A and B first. There is a let expression, which allows us to set variables within an expression itself, and it outputs a single tuple. Then we implement two different search expressions here. The first search expression fetches data from the historical stocks data collection, restricts the result set to stock A, and assigns it to a variable of its own name, stockA. The second search expression limits the data to stock B and assigns it to the variable of the same name, stockB. stockA and stockB then become input for a function called col: col takes all the closing points from the stockA result set, puts them into one single array, and assigns it to the variable pricesA. Similarly, it pulls all the closing points from stockB and assigns them to another variable called pricesB. And at the last, pricesA and pricesB become input variables for another statistical function, corr, which implements the Pearson product-moment correlation. The Pearson product-moment correlation plots an XY graph of two variables and analyzes how parallel they are to each other: the more parallel they are, the closer the value is to 1, otherwise to -1.

Now let's see the result. Stock A to stock B correlation is 0.999, and so on. So suppose a new CEO is going to be appointed for company A, or a new brand ambassador is going to be announced, and you anticipate a rise in the stock price of company A. You can make a very bold prediction that the stock price of company B will also go up. Meanwhile, the correlation between A and C is -0.18, and so on, which is mildly negative: the trend of the stock price of company A has nothing to do with the trend of the stock price of company C.

So the final takeaway from this talk is that streaming expressions in Apache Solr allow us to perform complex correlations, MapReduce-style operations, and statistical functions, which can be executed in parallel using n workers on dynamic subsets, leveraging Apache Solr's rich text search capabilities, and these subsets can be fetched from various data sources: Solr collections, databases, etc. And the entire expression can be executed in near real time, making near real-time analytics applications possible.

So these are the references and the knowledge-base articles listed for this particular talk. All the use cases and examples are uploaded on my GitHub handle, sarkaramrit2. Check out the official documentation of streaming expressions and statistical programming. This talk is heavily influenced by the blog of Mr. Joel Bernstein, who is the creator of streaming expressions in Apache Solr. And I have also listed some presentation links from the last four years relevant to this topic. That's it from me. Thank you so much for being here, and I think I have a decent amount of time left to answer some questions.

Thank you, Amrit. Who has questions? One. Anybody else? Two. Yeah, you can start. Yeah. Hi, Amrit. It was a fine talk. Hello. Hello, Amrit. Oh, hi, it was a fine talk.
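The whole correlation expression, then, looks something like this sketch; the collection name stocks and the fields ticker, trade_date, and close are my assumptions for the abstracted data set:

    let(stockA=search(stocks,
                      q="ticker:A",
                      fl="trade_date,close",
                      sort="trade_date asc"),
        stockB=search(stocks,
                      q="ticker:B",
                      fl="trade_date,close",
                      sort="trade_date asc"),
        pricesA=col(stockA, close),
        pricesB=col(stockB, close),
        tuple(correlation=corr(pricesA, pricesB)))

let evaluates the two searches and binds them to variables, col lifts the close column out of each result set into a numeric array, and the final tuple(...) emits the single output tuple carrying the Pearson correlation.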
Now I want to ask you: suppose we are working in a real-time situation. The real-time data is coming from some monitoring pipeline. Yes. Through Kafka. Yes. And we are just getting the information and putting it on the dashboard. Yes. And in that case, how does Apache Solr work? Suppose for some part of the day we want some streaming expression on a particular interval of data. Yeah. So how does Apache Solr fit in this scenario?

Yes. So you have the option of querying time series data. You can restrict your time frame: if you have a field called time, which is of a date field type, you can specify an entire time frame, from this time to this time. There will be a start and there will be an end, and because of Solr's rich search capabilities, it will fetch those particular documents from the given time frame fairly quickly, and then these expressions will work on top of them.

Hello. So you mentioned that this can fetch data in near real time. Yes. So let's say I have to join two data sets. If the data sets are pretty big, it is still going to take time. So, for example, if I run a Hive query, it will still take time because the data sets are big. So what's the difference between, let's say, running a query on Hive and running a query on Solr?

Yeah. So first of all, in the challenges part I stated that you want to fetch a dynamic subset from one big data set, and another dynamic subset from a second data set. First of all, Solr's capabilities help us to retrieve those documents. Second, the performance of streaming expression execution is linear in the number of workers you are using. I'm not going to go into very much detail about the physical cores, servers, and all, but for 25 million documents you can execute these joins and perform this analytics in less than a second; those are the benchmarks we have done. Now, Hive obviously has its own advantages and Solr has its own, but here you have the liberty to expand your Solr cluster to as many nodes as possible, which will linearly improve your performance.

Yeah. Amrit, I have a question. So it's kind of a very obvious question, because when you see Solr, the other tool that comes to mind is Elasticsearch. I worked with Elasticsearch two or three years back, so it has quite changed from that point of time. And both are very comparable: I was doing a Google search, and people don't have a clear opinion on which one to choose. So what do you think? All these use cases that you described are, I think, possible in Elasticsearch also. So which one to pick? Are there any performance differences, or some other feature differences?

Right. So yes, we can do all that in Elasticsearch, I think, while I will note that there are 70-plus statistical functions already available in Solr; I am not that confident whether they are available in Elasticsearch as of now. Also, see, both are open-source search engines, but the community development, how big a community has been built around a particular project, is also very helpful for solving certain problems. Right.
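To make the time-frame point concrete, the restriction is just a range filter on the date field; a sketch, with the collection and field names (metrics, time, value) assumed:

    search(metrics,
           q="*:*",
           fq="time:[2018-06-01T00:00:00Z TO 2018-06-01T01:00:00Z]",
           fl="time,value",
           sort="time asc")

Anything you wrap around this search (rollup, join, statistical functions) then operates only on that window; Solr date math such as [NOW-5MINUTES TO NOW] works in the same filter.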
By way of comparison, Elasticsearch is seen more as an analytics tool and Solr is seen more as a search tool to build search engines, while with the inclusion of streaming expressions in Solr, effective analytics applications are now also possible in Solr, on top of the search it already provides. There is a BoF session on Solr itself, and this is a very broad topic you brought up, because we would have to compare feature by feature what is possible and what is not. So, yeah.

Okay, just one quick question. So there is Kibana for Elasticsearch, right? Is there something equivalent to Kibana for Solr? Yeah, it's called Banana, and it's a fork of Kibana itself. And also, I work for Lucidworks, so we build these dashboards too: there is an App Studio which is built on top of Solr, or on the Fusion product we have, and it's kind of a competitor for Kibana, or Banana, as well.

Yeah, so the term near real time. For example, the earlier example that you gave, how a campaign is performing based on clicks data. Let's say I want the numbers up to the last five minutes. So the first question is: do we need to run that expression every five minutes? So, if you want to retrieve the information up to the last five minutes, yes. We will first provide the information that we want only documents of the last five minutes, we will execute the expression, and we will get our numbers. Okay, so you have to rerun your query every five minutes. Yes. Exactly.

And the second question goes on the same example. We are looking at click-to-conversion up to the last five minutes from the beginning, right? So you are writing into probably some target collection. Yes. So when that collection is being updated, do I still have access to the old data versus the new data? Because it's a simultaneous operation, right? Some rows might be updated, some might have steady data. How do we handle that?

So, there is a concept called time series collections being introduced in recent versions; I'm not sure whether you are familiar with that or not. Time series collections can now host data at particular regular intervals, so you don't really overwrite the data. Obviously, you need to create your unique IDs. I'm going into the details of Solr now, of how you avoid overwriting your existing collection while forming those queries. So let's quickly go there. Right, so in this case, we have just the campaign ID only. You can introduce a unique ID so that the documents don't get overwritten in your respective final collection.

So, when I was building this talk, we introduced very complex queries at first, but then feedback came and we wanted to, say, dumb down the queries so that we can understand them. When you are building this query, obviously, in the division part here, the aggregated clicks can be zero, and something divided by zero is undefined. So you need to make sure that those checks are in place (see the sketch below), and that you have a unique ID. Otherwise, check out time series collections in Solr, which will absolutely suffice for your use case here. Thanks. We have one question from the front.
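As a quick aside, here is a sketch of the divide-by-zero guard just mentioned, using the if and eq evaluators inside a select, with the same assumed names as the earlier sketches:

    select(
        select(
            rollup(
                search(weekly_data,
                       q="org_id:o1",
                       fl="campaign_id,conversions,clicks",
                       sort="campaign_id asc"),
                over="campaign_id",
                sum(conversions),
                sum(clicks)),
            campaign_id,
            sum(conversions) as agrConv,
            sum(clicks) as agrClicks),
        campaign_id,
        if(eq(agrClicks, 0), 0, div(agrConv, agrClicks)) as conversionRatio)

The inner select renames the rolled-up sums, and the outer select only performs the division when the aggregated clicks are non-zero, emitting 0 otherwise.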
So I just have one question. For unstructured data, like production logs, if you want to run analytics, Solr is fine, but you pointed out a few other use cases, like the stock market data, which is kind of structured. For that, Apache already has another project, Druid, which is really good for real-time analytics: you can roll up, you can build dashboards and time series. Yeah. So why prefer Solr over Druid?

So I'll be really honest: this is an alternative, right? It's an alternative solution. Talking about the advantages of each, like why Solr is better for an application, he mentioned Elasticsearch earlier; we would have to really get into the intrinsic features of each. Now, building analytics applications here is fine, but again, for that Apache project you mentioned, the question is whether it can retrieve effective dynamic subsets from a large data set, right? If it can, yeah, it can fetch, but for structured data you can look into that, you can look into Solr, do performance benchmarking, and then you can make the comparison. But if you have unstructured data, then you have Solr. So yeah, answering your question: this is an alternative, not something which will rule over all other projects. That would be the best answer. For unstructured data, yeah, I totally agree, Solr will dominate every other product, but for structured data, you have other options too.