Welcome to another edition of RCE. This is Brock Palen. You can find RCE online at rce-cast.com; there's an RSS feed there as well as a link to iTunes for your favorite podcatcher. I have again Jeff Squyres from Cisco Systems and Open MPI. Jeff, thanks again for your time.

Hey Brock, how's it going?

All right. So we also have, off the website, a link to your blog, and you've actually answered some of the questions I've had recently, being an MPI noob myself, and thank you very much for that. And I see it actually sparked a little bit of discussion with some of those questions.

I think that was good. That's all good for people. Yeah, I enjoy answering these kinds of questions; MPI can be quite the mysterious beast.

Yep. And you can also find all of our Twitter handles and usual contact information on there, as well as a list of all of our back shows and things that we're looking at getting a hold of. If you have contact information for any of the shows we haven't done yet, that's on that list; please let us know on the contact form. Today we've been going down a little bit of a path, right? We talked about Hadoop a while ago, and Hadoop has come up more recently, and today's show is kind of related to that, right?
Yeah. I kept finding these different pieces of software: Lucene, Hadoop, Nutch, and Solr. They're all kind of related, they all live at apache.org, and they're all Java-based, but they're also all designed to scale to very large scales. When we had Cloudera on earlier talking about Hadoop, we were talking about massive, you know, multi-petabyte clusters, and I'm sure those have only grown since then.

So I think the definition of HPC is kind of changing, and this is kind of a contentious topic. We know it's not just the same traditional HPC that we've seen. There's a lot of big-data number crunching going on that, you know, loosely could be considered HPC. It takes a lot of compute, a lot of scale, a lot of processors, disk, memory, all that stuff. As a matter of fact, in Open MPI we're just going to be adding Java MPI bindings to our development trunk in the near future, and the reason is that the Hadoop guys have become interested in using MPI as an IPC channel. In the MapReduce world, their reducers are becoming so computationally complex that they want to go parallel, and they need a good IPC mechanism for it. So some of these things are starting to have very interesting overlap; what you didn't traditionally consider as HPC is starting to get in there. And I think today's project kind of fits in that same category.

Yeah, anything bigger than what you can do in a normal, traditional programming and server environment, you've got to start doing at these massive scales. So let's go ahead and introduce our guest today.

Yeah, our guest today is Simon Willnauer; I believe he's over in the UK, and he's one of the people involved with Lucene. So Simon, why don't you take a moment to introduce yourself?

Hey guys.
Good to talk to you. Thanks for the short introduction. My name is Simon, and I'm an Apache Lucene committer, and I also happen to be the Lucene PMC chair, which is kind of an official position. I've been working with Lucene since 2005, and I've been a committer since 2006, and I spend a whole bunch of time working on this project, pushing it forward, you know, spreading the word. In my day-to-day job I work about 50% on Apache Lucene specific stuff, and the other 50% I usually spend on customer projects. I share a company with a couple of other people; it's called SearchWorkings, which is also related to searchworkings.org, a community portal about Lucene, Solr, and all this open source search stuff. And I run a conference in Berlin called Berlin Buzzwords; it happens annually, and this year it's in June.

So can you give us a rundown of what Lucene is, exactly?

Well, Lucene, in contrast to all the other things I've seen on your podcast, is just a library, right? It doesn't do anything on its own; it's not an application, you can't start it up. It's basically a Java library which solves the hard problems in information retrieval: building an index, retrieving documents, analyzing text, and making everything in there very, very efficient. So it's basically a high-performance search engine library entirely written in Java.

So does that mean it entails indexing as well? I mean, you mentioned accessing documents and searching, so presumably there's a bit more to it than that, right?
Right. So in Lucene we have a notion of a document, which basically corresponds to some kind of entity, but everything is somewhat schemaless, so you don't have to really define the different entities; you can just throw it in and it works. Each document has a set of fields. You can imagine it as a spreadsheet, or a database with one table, and Lucene takes care of building up this table and making it accessible.

So the point of this is to be able to search it really fast, right? So do you build a separate index for this? I mean, I actually know very little about search, embarrassingly little. So how does a typical search work, and how do you make it fast?

Well, in contrast to a database, in Lucene, or in any information retrieval library, you have something called a reverse index, or an inverted index. You basically index the unique terms in a document and then map those terms to the documents the term occurs in, and everything you do when you type a query is get back the list of documents those terms occur in.

And how do we make this fast? Well, there's a couple of things we do on a technical level. Basically, every data structure in Lucene is read-only, and we have some standard algorithms in place which write stuff to disk and make it persistent, and then people can load these indexes and fire searches against them. Once you have updates, you write a new little index to disk and then merge it in the background. And everything is basically read-only.
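The inverted-index idea Simon describes can be sketched in a few lines of plain Java. This is a toy illustration only, not Lucene's actual implementation: each unique term maps to the sorted set of document IDs containing it, so a term lookup is a single map access regardless of corpus size.

```java
import java.util.*;

// Toy inverted index: term -> sorted set of document IDs.
public class InvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Index a document: split its text into terms and record the docId
    // in each term's posting list.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // A single-term query is just one map lookup.
    public SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "the quick brown fox");
        idx.add(2, "the lazy brown dog");
        System.out.println(idx.search("brown"));
    }
}
```

Real engines add positional data, compression, and on-disk segment formats on top of this basic shape, but the term-to-documents mapping is the core idea.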
You don't change anything; all the data structures are highly efficient, and we make efficient use of the infrastructure: file system caches and all these kinds of things. In contrast to a SQL database or something like that, you don't need to update anything in place on disk, and I think that makes a big difference. Lucene is a library which is very much tailored to the purpose of full-text search; once you do something else with it you could easily get into problems, but we can elaborate on this later.

So Lucene would not be good for replacing a traditional database if your information is changing rapidly?

Well, that's basically two questions. One is: what if the data is changing rapidly? The other is: can you replace a traditional database with it? The answer to the second question is you can, but you have to denormalize your data model, right? You only have one table; you have to put everything in one table, you can't do more than one. There are some developments which implement joins based on Lucene, but they have a lot of limitations; they don't have the power of a SQL store. When documents update rapidly, Lucene still works. We have something called near-real-time search, but it works slightly differently, and it involves more disk I/O, and it's basically less efficient space-wise, because in Lucene you don't have an update procedure, you only have a delete and an add. So basically, if you want to update a document, you delete, or mark as deleted, all previous documents with the same key, and then add the new one on top.

So it sounds like it doesn't have all the features of a full database. Things like MySQL and PostgreSQL and these other databases where we've traditionally stored a lot of data have full-text search and table partitioning, and, you know, there's MySQL NDB Cluster. Why would I want to use Lucene over one of these traditional systems?
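The delete-and-add update model just described can be sketched in plain Java. This is an illustrative simplification (Lucene's real segment machinery is far more involved): indexed records are append-only, and an "update" marks the old versions as deleted via tombstones, then appends the new version; nothing already written is modified in place.

```java
import java.util.*;

// Sketch of the "update = delete + add" model: records are write-once,
// deletions are tombstones, and dead versions linger until a merge.
public class DeleteAndAdd {
    // Each entry simulates an immutable record: {key, value}.
    private final List<String[]> records = new ArrayList<>();
    // Tombstones: positions of records marked as deleted.
    private final Set<Integer> deleted = new HashSet<>();

    public void update(String key, String value) {
        // Mark every previous record with the same key as deleted...
        for (int i = 0; i < records.size(); i++) {
            if (records.get(i)[0].equals(key)) deleted.add(i);
        }
        // ...then append the new version; old data is never rewritten.
        records.add(new String[]{key, value});
    }

    public String get(String key) {
        for (int i = records.size() - 1; i >= 0; i--) {
            if (!deleted.contains(i) && records.get(i)[0].equals(key)) {
                return records.get(i)[1];
            }
        }
        return null;
    }

    // Live vs. total counts show the space cost of rapid updates.
    public int liveCount() { return records.size() - deleted.size(); }
    public int totalCount() { return records.size(); }

    public static void main(String[] args) {
        DeleteAndAdd idx = new DeleteAndAdd();
        idx.update("doc1", "version 1");
        idx.update("doc1", "version 2");
        System.out.println(idx.get("doc1") + " live=" + idx.liveCount());
    }
}
```

The gap between `totalCount` and `liveCount` is why rapidly changing data is less space-efficient here than in a database that updates in place.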
Well, the first problem is: how is this full-text search implemented? I'm not sure about every implementation, but full-text search is more than just putting something into a database and starting to index it, right? You could have a whole bunch of analysis in front of it. Like, you want to remove diacritics, you want to tokenize your text, you want to put synonyms in, remove stop words, do stemming, all these information-retrieval-related things; put some machine learning in front of it. Lucene supports all these kinds of things, and it is actually made for full-text-ish queries. To go one step further, say you have some positional information, like term A occurs next to term B. How would you do this in a NoSQL or a MySQL environment, right? You don't have those notions, and the basic reason is that those databases were not made for information retrieval.

So let me take a step back here. Why is text search important? What kinds of fields is this used in, what kinds of applications? I mean, the most obvious thing I can think of is, you know, you go to Google and you get the quick search and things like that, but what other areas is this useful in?

Yeah, you're absolutely right. If you go to Google and type a web search, you definitely at some point hit an inverted index. There's much more to it at Google; there's a lot of ranking stuff going on, and Lucene does ranking too, we can get to this later. There's a whole lot of applications, starting from your little mobile phone running Android, where you could implement some kind of on-device search, searching your emails. Or on your operating system:
You want to index your file system and find things quickly. You have a web shop and you want to find, you know, the things you sell, but you don't want to rely on the database, which only gives you exact matches; you want to have stuff like spell checking and query suggestions and all these kinds of neat features which Lucene can give you, like faceting. It can go even further: say you have an application related to geographic information and you want to find something within a certain geolocation, bounding-box searches. This is where Lucene can help a lot. All kinds of information; basically every website has a search engine behind it. There are a lot of content management systems integrating Lucene or Solr.

Yeah, there's almost no application these days without a search box, right? Okay, that makes perfect sense. So how did this get started at Apache, and what's the relationship between Lucene and Hadoop and Solr and the others?

Well, let's talk a little bit about the history. I think in 1999... well, that's a long time ago for computer science, right? In 1999 Doug Cutting came up with a little pet project; I think it was his first Java project. He wrote a search engine and called it Lucene. A couple of years later he donated the code for Lucene to the Apache Foundation, which was version 1.4 if I recall correctly; that was in 2001, so Lucene turned 10 last year. What happened is that this library was further developed and gained a lot of traction; a lot of people invested in it, also famous, well-known people in the Java community contributed stuff to Lucene. But at some point the problems we wanted to solve with Lucene became bigger. Like, you know, if you want to index the web, a little library doesn't really help you, right? You need to write so much code on top of this.
Like, you need a crawler, you need a database where you store the link information, et cetera, et cetera. And somebody came up with the idea of Nutch, and Nutch was built as a web-scale crawling search engine. It should be capable of indexing tons and tons and tons of data, basically the entire web. Well, using Lucene for this worked pretty well, but you had to deal with all the distribution between machines, and you also had to deal with storing the websites and the link information and running algorithms on it, like MapReduce or PageRank algorithms; MapReduce came later. And out of this, when Google published the papers about GFS and MapReduce, Doug Cutting founded yet another project as a sub-project of Lucene, called Hadoop, and out of this it became another top-level project. Nutch became a top-level project, and Lucene was left with the task of information retrieval. So those are the relations between those projects.

So then, does Lucene require Hadoop to actually use? Like, is it a library that assumes MapReduce functionality is available?

No. Lucene is actually one of the projects in the Apache Foundation which has zero dependencies. The core of Lucene doesn't need anything to run except a Java runtime environment. Yeah, there are a couple of modules we have that need third-party libraries, but the relationship is the other way around: if you want to do some kind of indexing and you're using Hadoop, and you have enough data where that makes sense, you probably want to use Lucene to actually build your indexes.

Another question, kind of going back a little bit here: does Lucene only index text, or does it index other things as well? I mean, you mentioned documents, but can those documents contain things other than text?
Yes and no; let me elaborate here. The latest released version of Lucene is somewhat bound to text, so it basically accepts a string for a value. A document consists of one or more fields, and each field can have a value, and those values are strings. But in the current trunk development we've moved from strings to bytes, and we index those bytes. So basically, if you can materialize your data to bytes, and you can make some kind of sense out of it, then you can index it with Lucene.

So once I've done that, how hard is it to make an index environment that's aware of, say, my collection of genomes, or my collection of images?

That's an interesting question. To begin with, Lucene is probably not the right tool to search your genome database; I'm not sure. A lot of people have come up with crazy ideas. Somebody came up lately, and we were talking about it, with the idea of building a chess computer based on Lucene. What you basically would do is index all the chess games published in a big chess database and then make good predictions out of this. But, you know, everything you need to do is transform your genome representation into bytes, and if your application needs to do something like "give me all genome strings starting with this string", or "similar to this string within a certain distance", then that would probably be straightforward. With Lucene 4, I would say if you break out of the common model of full-text search, it probably involves way more work than a classical application, but you could do it. The major purpose of moving to bytes is actually a different one; I can elaborate on this if you want. But to answer your question: it's way harder than just doing a normal full-text search engine.

So then, on a related question here, does Lucene do other languages?
Languages other than English, things that are multi-byte representations of characters, and the same, you know, phonetic and search types of contexts spanning languages?

Yes. So basically, internally to Lucene, everything is UTF-8 bytes, right? We represent everything in UTF-8. We moved away from the Java model, where Java basically uses 2 bytes for a character, or for a code point, and we moved down to a UTF-8 representation, because it can save us a lot of space on disk and in memory. But yeah, you can basically search everything which you can represent as UTF-8. You can also use some Unicode compression: if you search some Asian scripts, you can certainly go and, say, use some BOCU encoding to save space. The question of whether you can search another language doesn't really relate to how Lucene is implemented internally. Lucene offers a ton of language analysis tools; I don't know, I've never counted, but when I look at the source, probably for like 50 to 60 different languages, from Hindi to Polish, Japanese, English, Brazilian Portuguese. It's all there. Some of them are better, some of them are worse, but you can basically write your own if your language is not supported. It's all possible in Lucene.

So you keep mentioning this is all written in Java. What was the rationale behind using Java for such a system?

I think I mentioned this before: it was Doug Cutting's first project in Java, and it was just meant to be a little pet project. That it took off in the way it did was, I think, never planned. You know, somebody could argue you could do it in a language like C++, where you have more freedom with memory here and there, and you'd probably get more performance out of it. I say probably; I'm not sure if that's true.
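Returning to the UTF-8 point above, the space argument is easy to demonstrate with the JDK alone. Java's `char` is 2 bytes per (BMP) code point, while UTF-8 needs only 1 byte for ASCII, so mostly-Latin text roughly halves in size as UTF-8 bytes; conversely, CJK characters cost 3 bytes each in UTF-8, which is the motivation for the Unicode-compression remark about Asian scripts. The figures below are illustrative only.

```java
import java.nio.charset.StandardCharsets;

// Compare the footprint of text as UTF-8 bytes vs. Java's 2-byte chars.
public class Utf8Size {
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static int javaCharBytes(String s) {
        // Each char in the backing array occupies 2 bytes.
        return s.length() * 2;
    }

    public static void main(String[] args) {
        String ascii = "search engine library";
        // ASCII: UTF-8 takes half the space of a char array.
        System.out.println(utf8Bytes(ascii) + " vs " + javaCharBytes(ascii));
        // CJK: UTF-8 takes 3 bytes per character, chars take 2.
        System.out.println(utf8Bytes("日本語") + " vs " + javaCharBytes("日本語"));
    }
}
```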
You can actually write really efficient code in Java; sometimes it requires evil tricks to get there, and usually, every time you use Java, you're going to fight with garbage collection. And the JIT compilers, they mostly do the right thing; sometimes they don't, and you wonder why. But yeah, there wasn't a real decision, "hey, we're writing a search engine, let's write it in Java." It just happened by accident.

Yeah, I feel your pain there. You know, whenever you're writing optimized code, a lot of rules can go out the window, unfortunately, to extract really high performance. But Java has also come quite a long way since 1999. Compilers have gotten better, the JITs have gotten better, and the performance has gotten quite a bit better. It's still an interpreted language, but there are a lot of enterprise-class applications that depend on speed and optimization that run in Java today. By the same token, with all that as kind of a prelude to my question here: do you have some parts of it that are not written in Java? I know very little about Java, to be honest, but I know that some applications will actually branch out into C for their optimized parts.

Actually, we don't. Which is not entirely true; nothing that I know of people using in production is written in C. We have a couple of implementations of our lowest-level file system representation, for a couple of reasons. With Lucene there's a lot of I/O, especially when it merges indexes in the background, and when you write to an I/O stream or to a channel in Java, you basically always hit the file system cache, right? You invalidate your cache. And there are a couple of implementations which try to work around this and do direct writes, or do write-throughs, and actually talk to the operating system, tell the operating system, "hey, I'm going to read this sequentially," et cetera,
et cetera. Those kinds of I/O operations you don't have access to when you're in Java. But to be honest, our benchmarks show that those implementations are not faster than the Java stuff. The problem with calling into a native interface in Java is that the native interface call is pretty costly. So if you have a piece of code that's pretty hot, or a method that's called very, very often, you should rather rely on the JIT compiler to compile it to native code and keep it around than try to work around it and use the Java Native Interface. That's my opinion.

So what is Lucene's query model like? What does it look like when I'm passing something into Lucene? Does it get kind of munged into a SQL kind of thing, or is it completely its own deal?

No, it's completely different; it doesn't do any SQL kind of thing. Most applications, I would say, just pass in a user-typed query: you type some words you want to find, let's say "David Bowie", right? And Lucene internally runs a query parser on it. It understands a couple of boolean parameters, AND, OR, NOT, and it has a couple of things like a range syntax: you can search numeric ranges or text ranges, date ranges. You can do wildcard queries and prefix queries, phrases, and fuzzy queries. But that syntax is only bound to the Lucene standard query parser, and there are like five or six of them around with totally different syntax. It all boils down to the API and the kinds of queries you can instantiate, and I would say a lot of people really do this: build their own query parsers, come up with their own syntax, depending on what they really need. This is, in my opinion, one of the biggest advantages of Lucene: it gives you the freedom to do it; you're not bound to anything syntax-wise.

Well, I think you just hit a first there in RCE history.
I think that's the first time we've ever mentioned David Bowie on one of our podcasts.

Awesome. I just read something about Lucene; somebody had a problem on Stack Overflow with ranking "David Bowie" against some other David, and that's why I came up with it.

Hey, you know, great real-world examples. This is all good. All right, my next question is: what kind of back-end storage do you use? I mean, you mentioned native file system stuff; you mentioned one big-table-like interface. What is it? Is it an actual database, a customized kind of database, or what do you do?

Well, Lucene writes its data structures itself, so there's no database behind it; all the low-level code is contained in the core JAR. The actual persistent storage, or volatile storage, whatever you want to use, is hidden behind an abstraction we call a Directory, and you can build a Directory which operates exclusively on heap space, or you can do some memory mapping, or you can use Java's NIO classes. But it basically boils down to either a file system or some other kind of volatile storage. A couple of people have tried to build directories for Cassandra and other NoSQL storage solutions; from what I can tell, that doesn't work that well, or doesn't scale that well. But usually it boils down to: you go to the file system, your basic thing is your hard disk, and then all the data structures are written on top of this. Does this answer your question?

Yeah, it does. Let me ask you a further one. What happens if I use what's becoming popular these days, cloud-based storage? Right, if I've got a whole chunk of data and I'm a small startup, and I host it out on, you know, Amazon's cloud storage or Google's cloud storage or something like that, how does Lucene react to that? Is this I/O-intensive, such that the latency would kill me, or is it more tolerant of that? How does that go?
Well, that really depends. For the indexing side of things, it can be extremely I/O-bound. For the read part, for searching, it depends on what kind of file system implementation you use, what kind of Directory. I would probably recommend using memory mapping, so it basically pushes everything to memory and you don't suffer from really slow disks. But in general, Lucene doesn't contain any code for distribution; it's meant to be on a single box. If you want to do this in a distributed mode, and you have a massive amount of data or a super large number of documents, you'd probably use something on top, like Solr or ElasticSearch or Katta, some framework on top of this.

So can you go into that a little bit more? What exactly do these other frameworks that I've seen built on Lucene, like you mentioned Solr, which is the one that keeps coming up, what do they do to really make this scale to a massive level?

So Solr is basically a full-fledged application. You download it, you put it in a servlet container, you start it up, you put some documents in via a REST interface, and then you just search, right? And it uses Lucene under the hood, so it's basically built on top; it's Lucene's official search server. These other projects, like ElasticSearch, focus on real-time search and, you know, document replication, large scalability, and sharding; Solr adds these features too in version 4, which is the upcoming version. But basically it gives you everything Lucene doesn't offer out of the box; it integrates the API into an executable application. That's the major difference.

So what are the resources for actually running one of these things? Does it scale with the number of documents, or how many documents you're indexing today? What's the hardware to make this go well?
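The memory-mapping approach recommended above can be demonstrated with the JDK alone: map a file into the process address space so reads go through the OS page cache rather than explicit read() calls. This is a minimal sketch using `java.nio`; Lucene's `MMapDirectory` relies on the same underlying mechanism, but this code is not Lucene's.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// Read a file's contents through a memory mapping instead of a stream.
public class MmapRead {
    public static String readMapped(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map the whole file read-only; the OS pages it in on demand
            // and keeps hot pages cached in memory.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("segment", ".bin");
        Files.write(tmp, "hello from the page cache".getBytes(StandardCharsets.UTF_8));
        System.out.println(readMapped(tmp));
        Files.delete(tmp);
    }
}
```

Because mapped pages live in the file system cache, repeated reads of a hot index cost no extra heap, which matches the later point about relying on the page cache rather than caching in the Java heap.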
So I personally have experience with running Lucene on very, very low-power mobile phones, which works, right? You're certainly not going to index millions of documents on those, well, unless your battery runs out. But the hardware requirements are not really the issue here. It always depends on, you know, how much data you have, how fast you want to deliver your search results, and what kinds of searches you're executing. Some kinds of searches are very computationally expensive; some of the operations, like indexing, are very I/O-bound. But let's take an example. Say you have something around 20 to 25 million documents, all the size of, let's say, a Wikipedia page, which is roughly four kilobytes of text, and you want to serve this data at a very nice speed, let's say normal queries, like term or boolean queries, coming back in something between 5 and 20 milliseconds, scaling with the number of requests. You'd probably use some hardware like, you know, two CPUs, two times six cores, and you'd probably need something around 8 to 12 gigs of RAM. Your hard disk is not super important, but SSDs usually give you better performance here too.

So do you cache much in memory for subsequent queries? Because, I mean, you mentioned a pretty low amount of RAM there, actually, only one gig per core or so, or even less, right?

Well, we absolutely rely on the file system cache, right?
We don't try to cache a lot of stuff in the Java heap space. For the actual reverse index, there are some data structures held in memory, but those data structures are a tiny fraction of the actual size of the index. So we basically rely on the fact that the file system keeps hot pages in memory as long as memory is available, and you don't necessarily need tons and tons of memory to make this run really smoothly. But if your index starts to grow, your file system cache needs to get bigger and bigger and bigger. So basically, yeah, you can run this in a very restricted environment and still have reasonable response times. If you go much bigger, and I use the term "bigger" here because I can't just say more documents or bigger documents, it really depends what kind of stuff you do, and whether you store the entire document or only use it for indexing and then retrieve the IDs, things like that. You can do this kind of stuff in almost every environment; you could run it on your notebook, and probably, you know, for a web shop, nobody would really see much of a difference from a big machine.

So where does Hadoop come into this whole thing? I mean, Hadoop is this big-cluster, massive scale-out thing. How does Lucene play with that?

Huh, that's a tough question to answer in a couple of seconds. So let me give you an example. A lot of people have lots and lots of data, and to get all this data and index it, they usually fetch it from a database. But let's say you have so much that you have to put it on a file system like Hadoop's, where you need to process
tons and tons of data. You probably want to build your indexes on the fly in the process of a MapReduce job, and basically, when you do this, from what we've seen, there's nothing special about it. You probably have a couple of reducers in Hadoop which receive text, and out of this text you're going to build a Lucene index on a reducer. Once this is done, you copy the whole index to HDFS, and your search engine fetches the new index from HDFS. So Hadoop can be used to build Lucene indexes, but there isn't a real relation; there's nothing where you say, "hey, here's the component you use, then it works with Lucene." There's a framework called Katta, or it's actually an application, that uses Hadoop underneath and Lucene on top to do searching and index building, and uses Hadoop as persistent storage, but there's no close relation anymore.

Would a setup like that let you scale Lucene from, you mentioned, you know, a couple million documents on a machine with 12 gigs of RAM and 12 cores, to say 100x that, to get into the billions?

It also depends. I've seen indexes with 200 million small documents, and I've seen indexes with a couple of million documents where each document is massive, right, something around five megabytes of text per document. It always depends on what kind of stuff you index, but usually, as a rule of thumb, 100 million documents is probably the limit for a single box.

So one interesting thing is the Google book scanning project. U of M is involved with that, and the last time I talked to the guys involved, they were using Solr.

Interesting. I'm guessing that's Lucene underneath.

Yeah, absolutely. Absolutely. You can scale Lucene to a super large amount; it always depends on the distributed system you build on top. Lucene itself doesn't do any distribution; it doesn't have an RPC implementation; it doesn't talk to
It doesn't talk to Sockets or anything like that. It's completely agnostic Um, you put something on top like solar where you have replication and um sharding and distributed search And all this kind of things or elastic search the same kind of thing Then you can scale out massively. Absolutely So speaking of large indexes, what's the largest index that you've you've heard of that somebody's doing with lucine? the number of documents or How do you want to qualify it? huh I've seen a couple of people they're hitting actually the limits of a single index. So just for instance, we have a limit of um 2.14 billion unique terms per Per segment in the index and There's a couple of people actually hitting these limits When I say unique terms, right? So imagine how many unique words you need to have to hit 2.14 billion border This is a massive index in in terms of in terms of Size it really depends On average or not on average, but a common compression ratio is 30 percent of your original text When you build the index If you don't store the source data into this in lucine you can do that too You know this I've seen lots of indexes being more than 500 gigabytes per machine It really depends on what kind of documents how many documents So you can index 200 million documents and you end up with 25 gigabytes But you can index 2.5 million documents and you're gonna end up with 250 gigabytes of the index size So it really depends, but it scales out that way. So what's coming for the future in lucine? um The current version of lucine is lucine 3.5 and we've been working on lucine 4 since basically 2009 and the scene 4 is Almost I would say almost a rewrite a lucine. We changed tons of internal apis We made great improvements on the indexing side. 
There are something like 300 percent indexing-rate improvements, fuzzy search is 20,000 percent faster than what 3.x searches can do, and there are tons and tons of new features regarding different scoring models: we support language models and BM25 scoring, all those kinds of things researchers would like to use. There's a massive amount of performance improvements. And Solr comes with new cloud features: you know, it manages its instances automatically, shards can join and leave, and if machines go down it starts up new machines; that's replication. So yeah, it's going to be great. There are a lot of improvements coming up, and we hope we can release it within the next six months.

Okay, Simon, well, thank you very much for your time. We'll have this show out soon. Again, you can find us online at rce-cast.com. Thanks again, Simon.

Thanks, Simon.

Thank you.