Daryl, we are going to discuss our work on an efficient approach to storing and accessing Wikipedia's revision history. Another problem with the Wikipedia dataset is the accumulation of a lot of redundant data, and I'll illustrate it with the evolution of the India article. Suppose this is the India article; it started in March 2002. Whenever a new article starts, many people come and begin contributing to it, and whatever they contribute gets recorded in the backend. The way it is recorded in the backend is not a simple process: the edits performed by editors and contributors on the article accumulate in the form of redundant data. The problem is that once you try to extract a particular edit from an author or editor, it is very difficult, because even though the current article contains only a small amount of data, the full revision history at the backend contains a huge amount, since it is the accumulation of all the revisions. Hence it becomes very difficult to analyze the dataset of such portals.

More precisely, suppose on the left side we have the India article, the current snapshot that we can see, and on the right side we have the backend data, the revision history stored for the India article. Suppose a user comes, writes three lines in the India article, and saves it. In the backend, these three lines get recorded as revision number one. Now suppose a different user comes, writes a single line in the India article, and saves it. What happens in the backend? The entire current snapshot, whatever is in the current version of the India article, gets recorded as revision number two. That is the redundancy I was talking about. Now let's say user number three comes, sees that the newly added line is not required, deletes that line, and saves it. Again, the current snapshot of the India article gets recorded as a new revision, and this is called revision number three. As you can see, the current version of the India article contains only three lines, but in the backend the data contains many lines across three revisions. This is why, even though the current version contains very few lines, the dataset in the backend can be huge, and that makes it very difficult to extract, process, and analyze.

So what's the solution? One way to solve this problem is to store just the difference rather than the whole revision. For example, rather than storing the whole of revision number two, I store only the difference between revision number two and revision number one. Similarly, rather than storing the whole of revision number three, I store only the difference between revision number three and revision number two, and so on. But there is a problem with this solution. Although it reduces the size of the whole revision history by a huge margin, there is one major drawback: if I want to extract revision number three, or the nth revision, I first have to extract revision number two, and to get revision number two I first have to extract revision number one. Using the decompression mechanism I get revision number two, and after I have revision number two, using the decompression mechanism again I get revision number three, and so on. So you can see that to extract the nth revision I have to go through a recursive process, which takes a lot of time. Hence this process is time consuming and not an efficient algorithm.
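To make the delta-only scheme and its drawback concrete, here is a minimal Python sketch under my own assumptions; it is an illustration, not our actual implementation. The names DeltaOnlyStore, make_delta, and apply_delta are hypothetical, revisions are treated as lists of lines, and Python's difflib stands in for whatever real diff and compression machinery a production system would use.

```python
import difflib

def make_delta(old, new):
    """Encode the `new` revision relative to `old` as a compact list of ops."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse old[i1:i2] unchanged
        else:
            ops.append(("insert", new[j1:j2]))  # lines taken from the new revision
    return ops

def apply_delta(old, delta):
    """Rebuild the newer revision from `old` plus its delta."""
    out = []
    for op in delta:
        out.extend(old[op[1]:op[2]] if op[0] == "copy" else op[1])
    return out

class DeltaOnlyStore:
    """Delta-only history: revision 1 is kept in full, every later revision
    is stored only as a delta against its immediate predecessor."""

    def __init__(self, first_revision):
        self.base = list(first_revision)
        self.deltas = []                       # deltas[i] turns revision i+1 into i+2

    def add_revision(self, new_revision):
        latest = self.get_revision(len(self.deltas) + 1)
        self.deltas.append(make_delta(latest, list(new_revision)))

    def get_revision(self, n):
        # The drawback: reaching revision n means replaying every earlier
        # delta, so extraction cost grows linearly with n.
        text = list(self.base)
        for delta in self.deltas[: n - 1]:
            text = apply_delta(text, delta)
        return text
```

The storage savings are real, but get_revision(n) has to replay every earlier delta, which is exactly the recursive, time-consuming extraction just described.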
What we proposed is this: rather than storing only the difference between every two consecutive revisions, we work with a block size of k. Within a block we still store the difference between every two consecutive revisions, but at every kth position we store the full revision, and we repeat this pattern. So in this example, if you want to extract any revision, say revision number three, you only have to go back to the nearest kth block; here the block size k is two, so you have to extract at most two revisions recursively, and in a constant, bounded amount of time you can get any revision.

But two questions arise here. First, what should the size of this block be, that is, what should k be? And second, does k have to be fixed, or can we make k a variable? We solved both of these problems. First, using a fixed value of k, we showed that we can actually come up with an optimal value of k, with which, if we compress the revision history of any article, say a Wikipedia or Wikia article, we can extract any revision in very little time while achieving the maximum compression ratio. Moreover, we showed that with variable-length blocks we can also come up with an algorithm: rather than fixing the size k of each block, we made it a variable, and we still achieved optimal compression and extraction time. We also compared our results with the benchmark, the baseline result proposed by Francesca, and we showed that our method performs better in terms of both extraction time and compression ratio.
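As a rough sketch of the block idea, here is an illustrative CheckpointedStore that keeps a full copy (a checkpoint) at every kth revision and deltas in between. It reuses the hypothetical make_delta and apply_delta helpers from the earlier sketch, and again it only illustrates the technique under my own assumptions; it is not our actual implementation, which also compresses what it stores.

```python
class CheckpointedStore:
    """Block-scheme sketch: every k-th revision is a full checkpoint, the
    revisions in between are deltas against their immediate predecessor.
    Relies on make_delta / apply_delta defined in the earlier sketch."""

    def __init__(self, k, first_revision):
        self.k = k
        self.entries = [("full", list(first_revision))]  # revision 1 is a checkpoint

    def add_revision(self, new_revision):
        index = len(self.entries)               # 0-based index of the incoming revision
        if index % self.k == 0:
            self.entries.append(("full", list(new_revision)))
        else:
            previous = self.get_revision(index)  # revision numbers are 1-based
            self.entries.append(("delta", make_delta(previous, list(new_revision))))

    def get_revision(self, n):
        # Jump to the nearest checkpoint at or before revision n, then replay
        # at most k - 1 deltas; extraction cost is bounded by k, not by n.
        start = ((n - 1) // self.k) * self.k
        text = list(self.entries[start][1])
        for _, delta in self.entries[start + 1 : n]:
            text = apply_delta(text, delta)
        return text

# The running India example with block size k = 2.
store = CheckpointedStore(k=2, first_revision=["line one", "line two", "line three"])
store.add_revision(["line one", "line two", "line three", "one more line"])  # revision 2
store.add_revision(["line one", "line two", "line three"])                   # revision 3
print(store.get_revision(3))   # rebuilt from the nearest checkpoint, not from revision 1
```

The trade-off is visible in the sketch: a larger k means fewer full copies and better compression but more deltas to replay, while a smaller k means the opposite, which is precisely why choosing k, whether fixed or variable, is the interesting part.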