It's pretty hard to present after Stefan because he showed real magic, and I cannot promise anything comparable, so please be accommodating. The first talk on this topic was given at EclipseCon North America by Stefan Xenos and me. Unfortunately Stefan could not come to this conference, so I am presenting for him; he is the primary developer of this faster index for Java that we are talking about.

So what is this thing for? There is a problem. Our team is responsible for supporting Google engineers who use Eclipse: making sure that they can use Eclipse successfully and be productive with it in their daily jobs. And the number one complaint we hear from our customers, Google engineers, is that Eclipse is slow, and not only is it slow, it sometimes completely freezes the UI and you don't even have an idea what's going on. We have an independent way to verify these claims: we use UI responsiveness monitoring to record all UI freezes. They are centrally logged, and we can analyze them, get statistics, and build dashboards showing how bad these freezes are, their cumulative time, the work time wasted across Google due to these freezes, and so forth. The damage, of course, is much higher than just the UI freeze time itself, because when your tool is not responsive it really disturbs you: you switch to reading email and waste ten times more time.

What we saw in the pattern of these freezes is that although many places in Eclipse contribute, JDT is the number one source. And among the freezes that can be attributed to JDT, the majority are related to reading of jar files. Of course, when there are libraries on the classpath, they have to be read so the IDE can understand the code that the source code depends upon. But reading a jar file is not particularly fast, and what is really bad is that JDT does that reading in response to user actions. For example, I can ask for content assist, and JDT decides: oh, I need to go read some jars to figure out what to show in the content assist pop-up. And the problem really gets worse when the number of jar files on the classpath is large, and in a typical Google environment it is very large: it is measured in thousands, and can be over 10,000 jars on the classpath.

Another problem that we want to solve is that the existing index used in JDT often gets corrupted, for various reasons: maybe JDT got shut down unexpectedly, in a disorderly fashion, by kill -9 or something else. What is really problematic is that when the index gets corrupted, the IDE starts behaving with degraded functionality without telling the user that functionality is actually degraded. You can, for example, look at a call hierarchy that is missing some call sites, or do a refactoring that does not change every place it is supposed to change, without being told that something is going wrong. The current behavior is that when the index is corrupted, it simply claims not to know about things, without raising any errors. So we want to solve two problems, or rather we happen to work on something that is capable of solving two problems: the performance problem and the corruption problem.

So is this problem really unique to Google? Is it just something that would benefit Google engineers and nobody else? The answer is no. This is from the automatic data collection about UI freezes that is enabled by default in Eclipse for committers.
And we see that, by the number of reporters, it is the number one source of UI freezes. This one is in content assist, a relatively short freeze, but it is in this particular zip file read method, and you can recognize that it's actually reading Java. So hopefully, when we solve these problems, it will benefit not just Google engineers but everybody else.

Let's now see the problem for ourselves. I'll demonstrate it on an example that does not require thousands of jar files. In fact, this is the simplest of all Java projects possible: a Hello World project with a single class, HelloWorld. What we'll do is look at the type hierarchy of Object. So we open the Object class and press F4. There is some progress, but what actually happens is that the UI is frozen. It showed a progress dialog, and the intention, I guess, was to show that the operation is lengthy, but that progress dialog covers only a small fraction of the operation, and the rest is just a UI freeze. You have to wait a few more seconds until it finally finishes. Once it does, we can look here in the error log, where we see stack traces captured during that UI freeze. This one is not particularly interesting, something UI related. But here you see a zip file open at the top of the stack. This is the same jar file reading, again, again, and again. This is, of course, just sampling, not even a profiler, but we get an idea that probably at least 80% of the time is spent reading jar files. In reality, the number is even higher.

So how do we solve this problem? The solution we are developing was inspired by the experience we've had with CDT. Over the years, CDT borrowed a lot of code from JDT: a lot of CDT functionality was developed by taking JDT code, copying and pasting, and modifying it to work with C++, and that accounts for a significant fraction of the code in CDT. So now it's time to pay our debt back. What we've noticed in CDT is that although C++ is a more complex language, which typically translates into more work for anything like content assist, most operations, and I guess the most important one in terms of response time is content assist, are faster in CDT than in JDT. In CDT, it is rare for content assist to take longer than 200 milliseconds.

How does CDT achieve that? It uses a comprehensive index that, unlike the existing JDT indexes, does not just contain information for lookups. It holds a complete model of the code: not quite an AST, but a pretty detailed model. And how is CDT capable of doing that? The code is huge, the model is even bigger, and it cannot fit in memory. The model is stored on disk in a database, and parts of it are paged from disk to RAM lazily. That technology was originally developed by Doug Schaefer, who was for many years the project lead on CDT; it has served us well for ten years, it's very mature, and we know it works. So we want to take that technology, adapt it to JDT's needs, and solve once and for all this problem of jar file reading in response to user actions.

So what are our requirements for this new index? We want to completely eliminate synchronous jar file reading for common operations. We want to scale well with a large number of jars, of course, because that's our primary problem at Google. We want the index to be updated incrementally: when something small changes, we don't want to rebuild the whole index, because that would take a long time.
We want the time taken by an update to be proportional to the size of the change. The index has to be usable during updates, because it will be updating from time to time as things change; updates of the index should not prevent whoever needs information from the index from using it, because otherwise we would get UI freezes from a different source again. Read latency: let me explain what I mean by read latency. Since the index is naturally going to use a single-writer, multiple-readers model of concurrency, when you need to read the index you have to acquire a read lock. What I mean by read latency is the time it takes you to acquire that lock. Then you can start actually reading, and of course you also need time to read the information, which is more or less proportional to the amount of information you have to read; I don't count that in read latency. Read latency, as I call it, is just the initial wait time that adds to read time. We want that read latency to be definitely no worse than 200 milliseconds. If it's under 16 milliseconds, which is the time between frames at 60 hertz, that would be ideal. For the new index, we want automatic corruption detection, so that problems don't go unnoticed, and ideally we also want the index to recover from corruption.

So how do we design that? We decided to use one big global database per workspace. This is different from CDT; in CDT there is actually a separate database per project. We decided to go with a single database because we want faster searches. Inside a single database you can make searches logarithmic, typically log n, but if you have multiple databases you then have to try them in order. With a single database, we get log n behavior overall. Another very important requirement is that we don't want to rebuild the database when the classpath changes. We may have to update it; for example, if the classpath changes in a way that adds new classes or new jar files, of course we need to index those. But when, let's say, one jar file is added on top of a thousand, we don't want to rebuild the whole thing. We want that kind of stability. This index is intended for jar files and class files, not really for source files.

The database we are using, like in CDT, is a network database. This is not a relational database; we are not using any SQL to access it. Network databases were popular maybe 25 years ago or so; then the technology kind of retreated into the shadows, but it's still useful, and it's actually a very simple one. What the database contains is a graph of interconnected objects. And in the way we build this graph, we want every reference to be bidirectional. The reason is that in CDT we actually don't have that, and we learned the lesson: when you don't have fully bidirectional references, deleting something that has a large number of incoming references becomes a very expensive operation. Of course, since we are dealing with a database on disk, to have fast access it's very important to have a cache. At the lowest level, the database is a virtual memory system: it is organized into pages, lives in a single file on disk, and there is a page cache, an LRU cache, where pages read from disk stay for a while so you can access them there without reading them from disk again.
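To make this concrete, here is a minimal sketch of such an LRU page cache in Java. The class and its API are hypothetical, not the actual code of the index; it just illustrates the idea of paging a single file into fixed-size pages and keeping recently used pages in memory.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: pages a single database file into memory with LRU eviction. */
class PageCache {
    static final int PAGE_SIZE = 4096;

    private final RandomAccessFile file;
    private final int maxPages;
    // accessOrder=true turns this LinkedHashMap into an LRU map.
    private final LinkedHashMap<Long, byte[]> pages =
        new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxPages; // evict the least recently used page
            }
        };

    PageCache(RandomAccessFile file, int maxPages) {
        this.file = file;
        this.maxPages = maxPages;
    }

    /** Returns the page containing the given offset, reading from disk only on a cache miss. */
    synchronized byte[] getPage(long offset) throws IOException {
        long pageNumber = offset / PAGE_SIZE;
        byte[] page = pages.get(pageNumber);
        if (page == null) {
            page = new byte[PAGE_SIZE];
            file.seek(pageNumber * PAGE_SIZE);
            file.readFully(page);
            pages.put(pageNumber, page);
        }
        return page;
    }
}
```

In the real database, writes also go through this cache, and dirty pages are flushed back to disk later; that flushing is exactly where the corruption concerns discussed below come from.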
To its callers, the database exposes a bunch of accessor objects, and these accessor objects are backed by the data in the database: basically they point to a page, or some piece of a page, and serve information from there. Some of these accessor objects are closely related to Java model objects. The type names IBinaryType, IBinaryMethod, IBinaryField probably don't tell you much unless you are a JDT committer; these are interfaces used by the model in JDT, and some of the accessor objects implement those same interfaces, so that they can be used in place of the existing model objects. The rest of JDT then sees these index-backed objects as, in a sense, the model.

How do these objects work? They typically contain only one instance field, a long holding a database offset, pointing to a location in the database; that's the only field on any of these objects. And, as the interfaces prescribe, they contain a bunch of getters and setters and maybe some other logic. The getters read from the database and the setters write to the database. These objects are lightweight, because they only have a single field, and they are disposable: if you know the database offset, you can always create a new object; you don't need to keep the old one around. A new object pointing to the same location in the database is equivalent to any other object pointing to that location.

To give you an idea what the classes for these objects look like: the major part of such a class is a bunch of static declarations. As an example that has nothing to do with JDT, suppose we create an object type that has a string name and an age. The statics define the layout of these objects in the database, and you can have relationships: the string, for example, is stored in a separate location in the database, so that creates a relationship. This is how such an object is constructed: you pass it the database and an offset. And this is how the getters and setters look: they write to and read from the database. Why does the getName method return IString instead of String? The setter takes a String, but the getter returns IString. This is for performance reasons. IString is a database string, backed by the database too, so not all of the information has to be in memory; actually, no information has to be in memory when it is first created. Suppose you have a long string and you want to compare two strings. You really don't need to read everything from the database to do the comparison: you start comparing from the beginning and retrieve only as much from disk as you need, and only if the strings stay equal to the very last piece do you have to retrieve all the pieces from the database. If not, you can stop early without retrieving all the information.
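As a rough illustration, a class along these lines might look like the following sketch. The Database and IString types here are hypothetical stand-ins for the real thing, and the layout constants are made up; the point is only the shape: static layout declarations, an instance that is little more than a long offset, and getters and setters that go straight to the database.

```java
/** Hypothetical minimal interfaces, standing in for the real database layer. */
interface IString extends CharSequence { }

interface Database {
    int getInt(long offset);
    void putInt(long offset, int value);
    IString getString(long offset);
    void putString(long offset, String value);
}

/** Hypothetical sketch of a database-backed accessor class. */
class Person {
    // Static layout declarations: each constant records a field's offset within the record.
    private static final int NAME = 0;  // pointer to a string stored elsewhere in the database
    private static final int AGE = 8;   // int stored inline
    static final int RECORD_SIZE = 12;

    private final Database db;   // the on-disk database this object reads from and writes to
    private final long address;  // the essential instance state: an offset into the database

    Person(Database db, long address) {
        this.db = db;
        this.address = address;
    }

    // The getter returns an IString backed by the database, so a comparison can read
    // only as many pieces as it needs instead of materializing the whole string.
    IString getName() {
        return db.getString(address + NAME);
    }

    void setName(String name) {
        db.putString(address + NAME, name);
    }

    int getAge() {
        return db.getInt(address + AGE);
    }

    void setAge(int age) {
        db.putInt(address + AGE, age);
    }
}
```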
Corruption detection. The primary danger of corruption arises because we modify pages in the in-memory cache and then have to flush that cache back to disk. The primary source of corruption is an incomplete flush: maybe Eclipse died in the middle of a flush, so not all pages were written back to disk, and that creates an inconsistent state; the next time we read those pages, there will be dangling references and other bad things. We don't want to learn about that late and in an unspecified manner; we want to know for sure, when we actually read things, whether all pages were written successfully or not. We do that with a flag on the first page: we flip that flag when we start flushing, and we flip it back when we finish flushing. It's a very simple, primitive mechanism. We tried using checksums on the pages, but unfortunately that introduces a slight but noticeable performance overhead that we don't want to incur. So we are not going to catch, for example, corruption due to bad disk sectors that goes undetected below us; we are only going to catch incomplete flushes.

We still need a mechanism to catch situations where the database got corrupted due to software bugs: maybe we screwed up and wrote something inconsistent. We don't want those errors to go unnoticed; we want to know about them and recover, or at least minimize the damage. The policy we decided to follow is to throw a runtime exception: when we detect some inconsistency, when we expected something and instead got something that we know cannot happen if everything goes right, we throw a runtime exception. That exception propagates and is visible to the user, so the current operation fails; that is how we notify the user. We also use that exception to start the process of rebuilding the index. Since we don't actually know what got corrupted and how, and it's too dangerous to make guesses, the safest thing is to trigger a full index rebuild. So after the initial unpleasantness, where the user tried to do something and it failed, at least once indexing finishes the user can continue working.
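Here is a minimal sketch of that flag mechanism, with hypothetical names and assuming a header at the start of the file. The flag is forced to disk before any page is written and cleared only after all pages have reached disk, so a crash mid-flush leaves the flag set, and the set flag is detected the next time the database is opened, triggering the rebuild described above.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical sketch of incomplete-flush detection via a dirty flag on the first page. */
class FlushGuard {
    private static final long DIRTY_FLAG_OFFSET = 0; // first byte of the first page

    private final RandomAccessFile file;

    FlushGuard(RandomAccessFile file) {
        this.file = file;
    }

    /** Called when the database is opened: a set flag means the last flush never completed. */
    boolean wasFlushInterrupted() throws IOException {
        file.seek(DIRTY_FLAG_OFFSET);
        return file.readByte() != 0; // if set, the safe reaction is a full index rebuild
    }

    void flushAll(Iterable<DirtyPage> dirtyPages) throws IOException {
        setFlag((byte) 1);          // durably mark the flush as in progress
        for (DirtyPage page : dirtyPages) {
            page.writeTo(file);     // write the modified pages back
        }
        file.getFD().sync();        // make sure all pages are physically on disk first
        setFlag((byte) 0);          // only then clear the flag: database is consistent again
    }

    private void setFlag(byte value) throws IOException {
        file.seek(DIRTY_FLAG_OFFSET);
        file.writeByte(value);
        file.getFD().sync();        // force the flag itself to disk
    }
}

interface DirtyPage {
    void writeTo(RandomAccessFile file) throws IOException;
}
```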
Relationships between objects, the edges of the object graph, are represented by smart pointers. What I mean by smart is that they enforce referential integrity: if, for example, you remove an object, it does not leave dangling incoming references behind; instead those edges are updated. And there are different styles of these pointers, different styles of edges, with different behaviors: it can be cascade delete, or it can be just removal of an edge plus adjusting a reference count on the other side. As for what bidirectional references mean: our main motivation was to make deletion cheap, but they also mean that the index contains a lot of knowledge that is simply there, without needing to search for anything. The fact that every variable knows its class also means that each class knows every single variable of that class. And the normal relationship is that a class knows what it extends and what interfaces it implements, but the opposite relationship is there too: every class effectively knows all of its subclasses.

We said that during updates the index should still be usable. To achieve that in a single-writer, multiple-readers concurrency model, we have to make sure that the periods of time when the write lock is held are very, very short, because those periods determine how long readers have to wait before they can start reading if they didn't already hold the lock. So we need to write to the index in very small increments. When we make an update, when we write new information, let's say because some jar changed on disk: the jar was in the index before, but it changed, so we update what is known about that jar. We keep all the old information about the jar while writing the new information, and we have markers so that readers won't see the new information until we flip a switch; until then they keep getting the old information. When we are done with the new version, we flip the switch, and the new information about the new version of the jar file becomes visible to the readers. And when we write the information for the new jar to the index, we actually do it class by class. Writing one big jar would take a long time, but for a single class it's pretty quick. Our measurements showed that the index is rarely blocked for more than two milliseconds at a time. Our ideal goal was 16 milliseconds, and we can actually do even better.

This is an illustration of how an index update happens, and when the index is or is not available for reading during the update. The first thing we do, we flip the switch that says we are going to update this jar, and the jar is marked as currently being indexed. There is some preparatory work we need to do in the index, and during that work we need to block it for reading. Then we write information about the classes in the jar, one by one; that is why there is so much green in between and only narrow orange strips here. Then we flip the switch to make the new version of the jar visible to the readers. After that, from the readers' point of view the update has finished, but the index still has some work to do: it has to delete the information related to the previous version of the jar, and it does that after flipping the switch. Again, we don't want to do it under a single write lock; we want to allow readers to get in and start reading in between. Currently we do it class by class, but maybe we can get slightly better performance by introducing some grouping there, say deleting ten classes under a single lock; we need to benchmark that. One thing I forgot to mention: in this single-writer, multiple-readers concurrency model, readers are given preference. Why? Because the readers are most likely code that is responding to user actions. So we give readers preference over the writer, and the writer is only ever the indexer, the code that updates the index.

This is what an update for a single file looks like. First we need to read the bytecode of a class from the jar; that, of course, is an I/O operation and takes time. Then we need to parse that bytecode and understand what's in there; that takes most of the time. Then we need to write it to the index; this is where we acquire the write lock, but that write operation is pretty quick compared to reading from the jar and interpreting it. When we write to the index, we actually write to the page cache, so no writing to the physical disk happens at that time. Then we release the write lock, and later the page cache is flushed to disk to make sure the results are stored and will not be lost.
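In code, that per-class pipeline might look roughly like the following sketch. The lock, parser, and index types are hypothetical stand-ins (a plain ReentrantReadWriteLock does not give readers preference the way the real index does); what matters is that the slow parts, reading and parsing, happen outside the write lock, and only the quick write into the page cache happens inside it.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch of indexing one jar, class by class, under short write locks. */
class JarIndexer {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Index index;

    JarIndexer(Index index) {
        this.index = index;
    }

    void indexJar(Jar jar) {
        for (String className : jar.classNames()) {
            byte[] bytecode = jar.readClass(className);       // slow I/O, done without any lock
            ParsedClass parsed = ClassParser.parse(bytecode); // slow parsing, also lock-free
            lock.writeLock().lock();
            try {
                index.write(parsed); // quick: writes only into the in-memory page cache
            } finally {
                lock.writeLock().unlock(); // held for a couple of milliseconds at most
            }
        }
        index.makeVisible(jar); // flip the switch: readers now see the new version of the jar
        index.flush();          // page cache is written to disk outside the write lock
    }
}

// Hypothetical collaborators, just enough to make the shape clear.
interface Jar { Iterable<String> classNames(); byte[] readClass(String name); }
interface Index { void write(ParsedClass c); void makeVisible(Jar jar); void flush(); }
interface ParsedClass { }
class ClassParser { static ParsedClass parse(byte[] bytecode) { return new ParsedClass() {}; } }
```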
Because the index objects implement the interfaces of the Java model objects, we hope to almost completely replace the existing cache in the Java model manager, the place that currently holds the model objects in JDT. But there are interesting problems there. A lot of code that interacts with the Java model expects the data there to be completely up to date with respect to the state of the files on disk, but the index is updated asynchronously. So if we want to replace those model objects with objects backed by the index, how do we deal with that? How do we translate the asynchronous behavior of the index and its objects into the synchronous expectations of the callers?

Consider the following scenario. A class file changes on disk, and some caller immediately asks the index for the binary type corresponding to the changed class. And we know, or at least strongly suspect, that the caller expects up-to-date results. We cannot return out-of-date information, because it would be incorrect. But we also cannot wait for the index to finish updating before we return the information. Do you know why? Because in order to update the index we need to acquire the write lock, but, as we decided earlier, readers take precedence over the writer, so the writer can be starved: the update may have to wait for other read operations, which can be lengthy. So we cannot wait, because that would introduce UI freezes again.

How do we solve it? We have not solved it yet, but this is our plan. When somebody asks the index for an object, we keep track of the timestamp of the information in the index, what it corresponded to when it was written, and we compare it with the actual timestamp of the file; let's say the class file came from a jar file, we compare the timestamps. We actually do I/O at that moment, but it's very cheap I/O. If we determine that we don't have up-to-date information, we read the data from the jar file synchronously and create an object backed by the bytes of the class that we just read. The indexer will then use that already-read information to update the index, but asynchronously with respect to returning the object to the caller. And once the indexer is done indexing, the object will internally morph from being backed by those class bytes into being backed by the index. So that's our plan.
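A sketch of that plan, with entirely hypothetical names, might look like this: check the timestamp first, fall back to a synchronous read when the index is stale, and let the indexer catch up asynchronously.

```java
/** Hypothetical sketch of returning an up-to-date binary type from a possibly stale index. */
class BinaryTypeFactory {
    private final Index index;
    private final Indexer indexer;

    BinaryTypeFactory(Index index, Indexer indexer) {
        this.index = index;
        this.indexer = indexer;
    }

    BinaryType getBinaryType(JarEntryFile classFile) {
        long indexedTimestamp = index.timestampOf(classFile);
        long actualTimestamp = classFile.lastModified(); // cheap I/O: just a timestamp check
        if (indexedTimestamp == actualTimestamp) {
            return index.getBinaryType(classFile); // fast path: backed directly by the index
        }
        // Slow path: the index is stale, so read the class bytes synchronously...
        byte[] bytes = classFile.readBytes();
        BinaryType type = BinaryType.fromClassBytes(bytes);
        // ...and hand them to the indexer, which updates the index asynchronously.
        // Once indexing finishes, the object morphs into being backed by the index.
        indexer.scheduleUpdate(classFile, bytes, type);
        return type;
    }
}

// Hypothetical collaborators.
interface JarEntryFile { long lastModified(); byte[] readBytes(); }
interface Index { long timestampOf(JarEntryFile f); BinaryType getBinaryType(JarEntryFile f); }
interface Indexer { void scheduleUpdate(JarEntryFile f, byte[] bytes, BinaryType t); }
interface BinaryType {
    static BinaryType fromClassBytes(byte[] bytes) { return new BinaryType() {}; }
}
```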
So how is the index organized, and what does it consist of? There are three sections in the index. One is singleton data that is always present. We have the type name B-trees; actually there are two of them, one for fully qualified names and one for simple, unqualified names. There is some global state, like the timestamp of the last update, the version, and so on. There is a B-tree containing references to all files that are indexed; that can be class files, jar files and, in the future, also Java source files. And then there is the hierarchical data: a jar contains classes, classes contain methods, fields and so forth.

In order to be classpath independent, when we connect two classes, let's say one class extends another class, we don't connect those objects directly to each other, because it's possible that there are multiple versions of the same class in different jars. Of course only the first one wins at type resolution time, whoever is first on the classpath, but on the classpath there can be multiple versions. To get classpath independence, we delay that decision until resolution time, when we actually apply the classpath as a filter. To make sure the index information has that classpath independence, we introduce intermediate objects, here TypeId and MethodId. Basically, from a class we point to a TypeId, which is effectively the fully qualified name of that class with a reference count, and that is connected to the type name B-tree. So if, for example, you look for a class, you start from the type name B-tree and do the search there, then you go to the TypeId. The TypeId contains references to every class that matches that fully qualified name; they can be in different jars, so you get multiple results at that point. Then you filter those results based on the classpath and take the one that is first on the classpath. This is how resolution works, using the classpath at read time.
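As a sketch, again with hypothetical names, resolution through the intermediate TypeId looks something like this:

```java
import java.util.List;

/** Hypothetical sketch of classpath-independent type resolution through TypeId. */
class TypeResolver {
    private final BTree<String, TypeId> typeNameIndex; // the fully-qualified-name B-tree

    TypeResolver(BTree<String, TypeId> typeNameIndex) {
        this.typeNameIndex = typeNameIndex;
    }

    /** Resolves a fully qualified name against a classpath, picking the first match. */
    IndexedClass resolve(String fullyQualifiedName, List<Jar> classpath) {
        TypeId typeId = typeNameIndex.find(fullyQualifiedName); // search the B-tree
        if (typeId == null) {
            return null;
        }
        // The TypeId knows every class with this name, possibly in several jars.
        // The classpath is applied only now, as a filter, which is why the index
        // itself does not need to be rebuilt when the classpath changes.
        for (Jar jar : classpath) {
            for (IndexedClass candidate : typeId.classes()) {
                if (candidate.containingJar().equals(jar)) {
                    return candidate; // the first jar on the classpath wins
                }
            }
        }
        return null;
    }
}

// Hypothetical collaborators.
interface BTree<K, V> { V find(K key); }
interface TypeId { Iterable<IndexedClass> classes(); }
interface IndexedClass { Jar containingJar(); }
interface Jar { }
```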
Now let's take a look at how the type hierarchy we saw the problem with works with the new index. This is the same dialog, with the new index. There is actually still some UI freeze, but as we can see, that freeze is only two seconds instead of, how long was it last time, thirty-something, right? Yes, 33 last time. So from 33 down to 2. And if we look inside, I guess only the last sample relates to anything index related; the first four samples were all UI stuff. Most of the time is spent updating the tree widget, because it's a massive tree widget, and it takes around two seconds to create all the nodes in it. Of course we cannot judge based on a single sample, but the index itself here takes maybe 300 milliseconds. So in total, for displaying the type hierarchy of Object, we are down from 37 seconds to 4, and of the remaining four seconds the index is only responsible for about half, and that half is before the UI freeze, when progress was properly shown.

These are some stats we have for how long indexing takes. These numbers may change as we add more information to the index. The way the index is designed, it's very easy to change the format: you can always add more fields to these objects, and there is a versioning scheme that detects that the format has changed, discards the previously created index, and re-indexes everything. That's why we are not putting information in there that we don't yet use; we add more information as we go, as whoever consumes information from the index needs it. For the time being, we are going to run the new and the old index in parallel, because a lot of functionality in JDT has not been ported to the new index and still relies on the old one; we have a mechanism so both can work together. Eventually the old index should go because, for example, even in situations where the old index is not critical in terms of performance, we want to solve the corruption detection problem, and to achieve that we need to get rid of the old index. But that is pretty far away. The code for the new index, the type hierarchy based on it, and the few places in JDT where we have adopted it so far currently lives in a GitHub repository. The reason it hasn't been merged into the primary JDT repository is that we still have a few regression tests that fail when the new index is enabled. We are working on them, so we expect that to be done pretty soon, in the next two to three weeks.

Then the real work starts: the gigantic, massive work of replacing an engine on a plane in flight, adapting pretty much every place in JDT to the index. That means both the places that used the old index and places that didn't use the old index before, because the new index contains much more information than the old one and can therefore be used to improve performance in a lot more places. There are interesting problems with that adaptation. The Java model manager, which currently caches, holds, the model objects, actually gets in the way of efficiency when using the new index, because performance with the new index comes in large degree from lazily reading data from disk. The Java model manager assumes that reading is very slow and optimizes for that slow reading by reading in bulk, eagerly. But if you eagerly read from the new index, you lose almost all of the advantage. So we really need to change how the model works in the Java model manager so that it also works in a lazy manner.

Delayed... oh, are we on time? We are good, I think. We have plenty of time. Great. Delayed model delta events. We want to eliminate situations where something changes on disk, that creates an event, and whoever listens to that event immediately goes and asks the index: what changed, give me the new object. That would mean synchronous reading, which would be just as bad as before. So we don't really want changes on disk to be announced to the model delta listeners right away. In most cases we want to update the index first and then produce the event; we are not noticing the changes on disk instantly anyway, so adding a delay there should not be a problem, at least not a correctness problem. But whenever you change event ordering it's tricky, and this is no exception. We made one attempt, it showed some problems, and we decided that we first need to merge into the primary git repository, so that we can move to incremental development, before sorting out these problems, which are not that easy. Of course we will need Java 9 support, but I expect that to be almost a no-brainer, and of course there is adapting the huge bulk of JDT code to take advantage of the new index.

Now I'm switching to the really far future: there will hopefully be a day when we can start thinking about adding new functionality to JDT that was not even possible before for performance reasons. Content assist with spelling corrections, for example: we can easily add more information to the index to make that kind of fuzzy content assist fast. The opportunities are numerous; they are limited only by our imagination of what can be done. These are some references that you can use. Development is done in the open. There is a master bug, referenced here; the individual problems and pieces of the puzzle are all described in separate bugs, and this trio of bugs is interconnected. There is a design doc referenced from the master bug, so the only link you really need, if you are interested, is the master bug number; everything is referenced from it. Questions? Dany, you have something? The question is where the flag that indicates that we are changing the index is stored. It's just transient state in memory; there is no need to store it persistently.
The only thing that has to be stored persistently is the flag that indicates that a flush of pages to disk is in progress, because we want to detect, if the program dies, that the flush did not actually finish successfully. Other transient state, like the fact that we are in the middle of updating a jar file, does not have to be stored persistently, because if the application dies then, in the worst case, we lose some space in the index. That's not critical, and it's actually not that hard to write code to recover that space. Okay, great.