Okay, we are live sharp at six. Let's wait a couple of minutes for more folks to join before we jump into the paper. Okay, we are at nine people; as soon as we hit 12 or 15... yeah, I think we have about 17 people now, and there are folks waiting on YouTube, so let's start.

Hello, everyone. Thanks so much for joining us for the June session of Papers We Love, Bangalore. Today we will be discussing a paper from Facebook, published in 2013 at the USENIX ATC conference, which is very well known for distributed systems research. It describes the very large-scale graph data store that Facebook developed to support the billions and billions of users they have on their platform. And we have Rohit with us today to present it; he's a principal architect, and I'll let him introduce himself.

Yeah, so that's about me: I'm Rohit, I work at Capital Technologies as a principal architect, and these are my contact details; you can hit me up.

We have about 45 minutes to an hour for Rohit to present the paper, and we will try to keep it engaging and interactive. The paper is fairly detailed, in the sense that the first few sections talk about the original problem statement, the construct of the problem they were trying to solve, the assumptions they made, and the API characteristics, while the subsequent sections talk about the architecture and the implementation. So the way I'm planning to do it is that Rohit will present the first few sections up to the API details, and we will run through them; you can keep posting your questions in the Q&A section on Zoom or on YouTube. We will take a five-minute logical break once we finish the API section, so that you can absorb whatever was covered and we can have one round of Q&A. After that we will jump into the architecture and implementation sections, so that you can build on whatever was discussed so far, run through to the end of the paper, and then open it up for further discussion. Please feel free to share your comments in the chat on either YouTube or Zoom, and keep posting your questions in the Q&A sections; I'll keep making a note of them and bring them up during the logical break and towards the end of the paper. All right, over to you, Rohit. Let's jump in.

Yeah. So the paper at hand is TAO: Facebook's Distributed Data Store for the Social Graph. TAO is a bit of wordplay; it's an abbreviation of "The Associations and Objects". It's safe to assume that a lot of you are Facebook users, and whenever you open up your timeline, you hit the TAO system several hundreds of times; for more popular users, much more than that. So TAO is a core subsystem for serving the timeline and pages in Facebook. What the social graph is, is something we'll address first, and then we'll see the evolution of the solution at Facebook for rendering these graphs to the user.

So first, the definition of the social graph itself. The social graph concept has been at Facebook since even before TAO itself; it's been around since, say, 2007 or so. It's a flexible representation that directly models real-life objects.
You could argue that some of them, like Facebook posts, are not real-life objects, but at least in social-media life they are real. Now, this is essentially a directed graph. The nodes of the graph are objects and the edges are called associations; this is the nomenclature throughout the paper. The nodes are typed, abbreviated as otype further in the paper, and associations are typed as well, so there are atypes. The objects model people, places, posts, and repeatable actions; this is fairly obvious. I think the only point which needs a little stressing is that even repeatable actions are modeled as objects, not as associations. Associations model relationships, non-repeatable actions, and state transitions. The graph itself is directed, so associations are directed, but a large number of the associations have a tight coupling with an inverse edge.

With this definition in mind, let's traverse the example shown in the paper: Alice has made a check-in at the Golden Gate Bridge with Bob, Cathy has commented on it, and David has liked the comment. Right away from the definition, people and places are going to be objects: the users are objects and the location itself is an object. These are all typed; as you can see, the otype is "user" for all the users, the location has its own type, and each has its own attributes as well, like a name for the user and the coordinates for the location. Now, check-in and comment are two repeatable actions, which are modeled as objects. Why are these repeatable? The same user can comment multiple times on the same post, and a user can check in to the same location again and again. So these are repeatable actions, which according to the definition have to be objects. A like, on the other hand, is an example of a non-repeatable action, which is modeled as an association: David has liked the comment, and he can like it only once; either the association exists or it does not.

Again, most of the associations here are tightly coupled with an inverse edge. If someone is a friend of someone, the vice versa is true as well, so the inverse of a friend edge is always a friend association itself; meanwhile, the others have distinct inverse types like authored/authored_by and tagged/tagged_at.

Reactions would also be like a like, right? Non-repeatable, essentially; as a user you can give only one reaction to a post.

Correct, yeah, reactions would also be non-repeatable.

Now, let's just note that the only association here which does not have an inverse edge is the comment one. Why not is a mental exercise all of you can perform: defend that particular choice or counter it, and also think about what would be a general rule of thumb for deciding whether you need an inverse edge in the model.
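To make the modeling concrete, here is a minimal sketch in Python of how the objects and associations in the paper's example might be represented; the class names and fields are our own illustration, not the paper's actual storage format.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Object:
    id: int
    otype: str                  # e.g. "user", "location", "checkin", "comment"
    data: dict = field(default_factory=dict)   # schema-defined key-value pairs

@dataclass
class Assoc:
    id1: int                    # source object id
    atype: str                  # e.g. "friend", "authored", "liked_by"
    id2: int                    # destination object id
    time: float = field(default_factory=time.time)
    data: dict = field(default_factory=dict)

# The paper's example, roughly:
alice = Object(1, "user", {"name": "Alice"})
ggb = Object(2, "location",
             {"name": "Golden Gate Bridge", "lat": 37.82, "lon": -122.48})
checkin = Object(3, "checkin")                 # repeatable action -> object
authored = Assoc(alice.id, "authored", checkin.id)
authored_by = Assoc(checkin.id, "authored_by", alice.id)  # tightly coupled inverse
```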
I think this is more of a pop quiz that Rohit has set for us: think about why Facebook made this design choice, on which kinds of association they want an inverse edge and on which they don't. If you come up with answers, post them in the comments; we will also discuss it in a subsequent slide.

Yeah. So I hope the modeling itself is clear, how the objects are modeled in the social graph. Now let's look at some characteristics of the graph itself. As we saw, the user type has a name associated with it, and the location type has coordinates, longitude and latitude, associated with it. These data are encoded as key-value pairs in the object, and the keys are predefined by a schema.

The graph itself is extremely read-heavy. We'll come to the production workload that Facebook experiences, and it will probably surprise you how lopsided the reads are compared to the writes. You can probably guess why: a page view involves several hundreds of reads from the graph. Recent items are read the most; this is another fairly obvious characteristic, because recent posts and comments are read often and the older ones are very unlikely to be read often.

Now, regarding the consistency levels: read-after-write consistency is critical. In our example, if Alice makes this check-in at the Golden Gate Bridge and it does not show up in her own timeline or her own profile, that makes for a poor user experience. But for Bob, who is perhaps halfway around the globe from Alice, if the post shows up in his feed with, let's say, a few seconds of delay, that's fairly acceptable, as long as the page-view experience itself is good, which means the latencies are very low. So this is a consistency choice which is fairly natural for the social graph: read-after-write consistency is critical for the user. Read-after-write consistency means that the process that made the write should be able to read consistent data after its write; all the data it fetches should be consistent from the point of the write onwards.

I think this is a fairly important characteristic that their use case entails, and it's something the paper goes on to leverage very, very beautifully to drive the design choices: for instance, how read-after-write consistency is made available to the immediate user performing an action, while eventual consistency is leveraged for the recipients of that action. So folks, do bear these characteristics in mind; it's worth spending a few more seconds on them because they have a very strong bearing on the design choices. Recent items being the most often read items has a bearing on the listing operations and the way they expose the API; and then there are read-after-write consistency and eventual consistency. Do keep these three points in mind as we proceed to the subsequent sections.
Coming to the pre-TAO solution: as I said, the social graph has been the data structure at Facebook since 2007, and the social graph data itself was stored in MySQL. Given the limitations of MySQL, it's fairly obvious that you cannot serve the whole production workload out of MySQL alone, so you need caching, with master and slave instances spread across the globe; this was already the case even before TAO.

Memcache acted as a simple key-value store, which they used as a look-aside cache. What is a look-aside cache? It is one where the client bears the responsibility of setting values into the cache and invalidating them, because memcache can only operate that way: it has no understanding of the underlying storage layer and just functions as a general-purpose key-value store. Although there is "cache" in the name, memcached does not work as a cache that understands the underlying storage layer; it's up to the clients to use it as one.

Then a PHP abstraction was provided to the developers, the web application developers who query the graph. This abstracted away any direct access to the MySQL layer, so the programmers did not have to worry about the storage layer and instead worked through a graph-like abstraction provided by the PHP SDK, basically the client library. The client library took on several functions: data mapping, that is, reading from MySQL and mapping it into graph-style objects and putting them into the cache; invalidating the cache; and some amount of control logic.

Why is there control logic? We'll get to that in the next slide. There are some inherent complexities that you come across while using a look-aside cache at scale and spread across geographies, and the clients have to implement the logic that handles them, which accounts for the complexity. Let's look at these cases.

First, how does a look-aside cache counter thundering herds? This is described in detail in another paper, Scaling Memcache at Facebook; we'll just hover over it with one slide to understand the challenge and the pre-TAO architecture. It is handled using the concept of leases presented in that paper, which addresses the issues of thundering herds and stale sets. A thundering herd occurs when a specific key undergoes heavy read and write activity: because of the repeated writes, the key gets repeatedly invalidated, and the heavy reads then manifest as heavy load on the storage layer, which is the MySQL database itself. If Sachin Tendulkar makes a post on his page, people rush to comment on it, and that would be like a thundering herd on the underlying data store.

And stale sets are the other problem leases address; in fact, stale sets are the fundamental problem leases solve, and solving them addresses thundering herds as well. A stale set occurs when a client sets an older value into the cache due to concurrent writes: two different clients write concurrently, there is a sort of race condition, and it's possible that the value set into the cache is not the current value in the source of truth, the MySQL database, which is a huge no-no.
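To ground the look-aside pattern just described, here is a minimal sketch, assuming a memcached-style store with only get/set/delete; the `cache` and `db` handles are hypothetical stand-ins.

```python
import json

def lookaside_get(cache, db, key):
    """Client-side look-aside read: the cache knows nothing about the DB."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    row = db.query(key)                 # fall through to the source of truth
    # Race: another client may update the DB between our query and this set,
    # leaving an older value in the cache -- the "stale set" problem.
    cache.set(key, json.dumps(row))
    return row

def lookaside_write(cache, db, key, value):
    """Client-side write: update the DB, then invalidate the cache entry."""
    db.update(key, value)
    cache.delete(key)   # a hot key deleted this way invites a thundering herd
```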
So Facebook modified memcache for this purpose: it issues a lease to a client that initiates a set request. One fundamental thing to understand here is that memcache is a very simple key-value store with a very simple API, just set, get, and delete; it's fairly clear what each of those does. Whenever a client gets a miss, memcache issues a lease; the client is identified from the lease token and vice versa, and the cache server invalidates the lease when the key is deleted — delete or update, but because the API is only set/get/delete, an update is essentially a delete plus a set. This helps ensure that only one client is actually performing the fill at a time: only the holder of the lease can set a value into the cache, and a set request is validated against the issued lease to prevent stale sets.

A small enhancement to this logic counters thundering herds as well: for a short duration after a lease has been issued, memcache answers any further reads of that key with a custom "hold off and retry" message. What this does is give the client making the write the small window required to set the value back into the cache, so that further reads can be served from the cache itself without falling through to the heavier read path.

Another complexity is that an entire edge list is stored as a single value. This means that every time, say, somebody adds a new friend, the whole friend list has to be invalidated and read again from the database.

There is another issue addressed by the same paper we discussed earlier, which is that read-after-write consistency is quite hard to achieve in this model. We'll see why. The solution used here is remote markers. The problem it addresses is that a write can originate from any region, and the master database might not be in that region, so the client has to contact the master in the master region, which is a cross-region call. It sets the data there, but replication back can still take some time, so there is a window during which you cannot trust your local storage layer. So when a client, that is, the client library, writes a key, it additionally sets a marker rk in the local cluster, and a deletion query for rk is embedded along with the write query itself. When that deletion query gets replicated into the slave region, rk gets deleted. So the presence of rk indicates that right now is not a good time to read this key from the slave region, and during that period the requests for the key are routed to the master region. It should be clear that these inter-region calls are expensive, and this is why read-after-write consistency is expensive in this old solution.

Right, and maybe we can walk through this slowly in the subsequent section when we go over the architecture.

Yeah. And one more challenge: all of this logic, a non-trivial amount of logic, was written in the PHP client library, and non-PHP services could not access it in a straightforward manner. So these are the challenges with the pre-TAO architecture.
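A rough sketch of the lease mechanism as just described, seen from the cache-server side; the token format and the hold-off window are simplified assumptions, not the real memcached modification.

```python
import time
import uuid

class LeasingCache:
    HOLD_OFF = 2.0   # seconds a writer gets to fill the key after a miss

    def __init__(self):
        self.store = {}    # key -> value
        self.leases = {}   # key -> (token, issued_at)

    def get(self, key):
        if key in self.store:
            return ("HIT", self.store[key])
        token, issued = self.leases.get(key, (None, 0))
        if token and time.time() - issued < self.HOLD_OFF:
            return ("WAIT", None)        # someone else is filling; retry shortly
        token = uuid.uuid4().hex         # miss: issue a lease to this client
        self.leases[key] = (token, time.time())
        return ("MISS", token)

    def set(self, key, value, token):
        current = self.leases.get(key)
        if current is None or current[0] != token:
            return False                 # lease invalidated -> reject the stale set
        self.store[key] = value
        del self.leases[key]
        return True

    def delete(self, key):
        self.store.pop(key, None)
        self.leases.pop(key, None)       # a delete invalidates any outstanding lease
```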
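And a sketch of the remote-marker dance for cross-region writes; the `rk` naming follows the talk, while `embedded_deletes` and the handles are invented for illustration.

```python
def write_key(local_marks, master_db, key, value):
    rk = "remote:" + key
    local_marks[rk] = True                     # mark: the local replica is suspect
    # The master write embeds a delete of rk alongside the SQL statement; when
    # that statement replicates to this slave region, the marker disappears.
    master_db.write(key, value, embedded_deletes=[rk])

def read_key(local_marks, local_db, master_db, key):
    rk = "remote:" + key
    if local_marks.get(rk):
        return master_db.read(key)   # replication lagging: pay the cross-region cost
    return local_db.read(key)        # marker gone: the local replica is safe to read
```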
So now comes TAO and what it envisions itself to be. It is, essentially, geographically distributed and read-efficient: it needs to serve read queries from the local region, meaning whichever region the web server serving the request is in. And what is a region? A region is a collection of data centers co-located in such a fashion that the ping latencies between them are extremely low.

Second, it follows a read-through and write-through pattern. This avoids the client-side complexity: the TAO system itself is aware of the storage layer and can access it directly, so all requests from the web applications go directly to TAO, and it will definitely get you the data, whether from the cache or from the storage layer.

Another goal is to support logic specific to the social graph, for example favoring efficiency and availability over strong consistency. This is a key point: as we discussed earlier, the social graph can be flexible about consistency. As long as read-after-write consistency is ensured, eventual consistency works, and that should be exploited to provide efficiency and availability to the clients.

Yeah, it allows them to serve stale data as long as they get high availability and good performance. Strong consistency is not a primary objective for them; as long as they have read-after-write consistency for the same user, they are okay to make some sacrifices, essentially.

Yeah. The read-after-write consistency that TAO aims for is that, within the same region, you should get read-after-write consistency, especially for the clients that made the write request. That is the guarantee TAO aims at.

Meaning it is entirely possible that if I go and like a comment, I will see that I liked your comment, but if you made that post, you might actually see my like much later, right?

Yeah, exactly. So again, coming to the API characteristics: as we discussed, these are specific graph use cases, not the complete set of graph queries supported by a system like Neo4j; TAO does not intend to serve all those purposes, it is very specific to the social graph use cases. One of them is the creation-time locality of the social graph: TAO makes the up-front decision that association lists can always be returned ordered descending by the time field. Usually things like the block cache in MySQL work off spatial locality of the data, whereas they have opted for creation-time locality, as we'll see later on. TAO also enforces a maximum number of associations returned per association query. This helps with some of the caching strategies: you can cache certain ranges and not worry about inconsistent access patterns. Typically they have a maximum limit of around 6,000 associations to be fetched, so queries are always bounded and bad queries are prevented from being fired into the storage layer.

And I think this limit of enforcing maximum page sizes of 6,000 records is fairly well backed by their data; in some later sections they do present data on how many read queries return payloads of what sizes, and I think that's what they used to arrive at this limit.

Yeah. So the API itself is fairly straightforward: it supports create, read, update, and delete operations on objects, and then a restricted set of association operations.
You can add an association and delete an association. assoc_add takes the from-object id, the to-object id, the association type, the time, and the attributes. assoc_delete deletes an association, and assoc_change_type updates the type of an existing association. assoc_get is probably going to be heavily used: given a source object, an association type (atype), and a set of destination objects, it returns the matching associations, and high and low are time bounds so that you can query based on time. assoc_count just returns how many associations of a particular type exist for a particular from-id. assoc_range is basically a pagination operator, so you can use it along with assoc_count and paginate over the values. assoc_time_range is a specific API where you can query the association list within a specific time window for a particular from-id. We will see the distribution of these API calls later, and the data itself defends the API choices.

Do you want to quickly walk through what assoc_add actually means on the graph representation we have on the slide? Maybe that will make it clear.

Yeah, so it essentially creates these edges. assoc_add has an additional responsibility: if the atype has an associated inverse type, the inverse edge is also written to the storage layer, and the association list of the affected id gets updated in the cache. That's essentially what the write operation does; the CRUD operations on objects just create, delete, and update those objects.

I think here we can take a logical break and look at any questions people have. We have one from Shree Prajwal, which I believe you already answered, but we can run through it again. He asked: as far as I know, Neo4j is eventually consistent and can be read-after-write consistent; what was it missing that Facebook decided to build something from scratch? Basically Neo4j versus TAO.

Yeah, so Neo4j does support distribution, and it has its own ACID compliance of sorts, which is not required for the TAO system; TAO can exploit those concessions to be more efficient. That is where the difference comes in. Also, in talks on the same topic, Facebook have defended how they wanted to retain MySQL: it is a fairly standard database, and the operational knowledge around it, for backups and failure recovery, is much more widespread than for Neo4j. Neo4j was probably also a nascent project around the time TAO, or even the original graph store, was being built. And MySQL is very vigorously defended by the developers, at least in some of the talks linked in the references; the same question came up there, and they defended their desire to retain MySQL.

I think Facebook is one of the few companies operating MySQL at that scale, and I don't think they have moved away from it for persistence of some of the key entities; I'm sure they use other data stores too, but at least for TAO and the social graph it's MySQL. And for Neo4j, the consideration of data being spread across regions would need to be specifically addressed: within the same region it's fairly straightforward, but across regions, that is, across data centers spread around the globe, Neo4j probably still needed work.
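For reference, a compact sketch of the TAO API surface as presented, written as Python stubs; the signatures follow the paper's description, but the bodies are omitted and the exact spellings are not from any real client library.

```python
from typing import Optional

# Objects: simple CRUD keyed by a 64-bit id.
def obj_add(otype: str, data: dict) -> int: ...
def obj_get(oid: int) -> Optional[dict]: ...
def obj_update(oid: int, data: dict) -> None: ...
def obj_delete(oid: int) -> None: ...

# Associations: all queries hang off (id1, atype).
def assoc_add(id1: int, atype: str, id2: int, time: int, data: dict) -> None: ...
def assoc_delete(id1: int, atype: str, id2: int) -> None: ...
def assoc_change_type(id1: int, atype: str, id2: int, new_type: str) -> None: ...
def assoc_get(id1: int, atype: str, id2set: set[int],
              high: Optional[int] = None, low: Optional[int] = None) -> list: ...
def assoc_count(id1: int, atype: str) -> int: ...
def assoc_range(id1: int, atype: str, pos: int, limit: int) -> list: ...
def assoc_time_range(id1: int, atype: str, high: int, low: int,
                     limit: int) -> list: ...

# E.g. the 50 most recent comments on a checkin (lists come back newest-first):
# assoc_range(checkin_id, "comment", 0, 50)
```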
Correct. And also, I think if you look at the API, they don't have a use case for complex graph traversal methods: they're not doing any directed search, they're not doing breadth-first search, and they're not running shortest-path algorithms either. Their API is fairly straightforward, so Neo4j's use cases are, I would say, a lot more advanced than what they needed from an API standpoint. Great; Shree Prajwal, hope that answers your question. Any other questions? Leave a comment and I can read it out. Okay, let's move forward; in case something comes in, we'll take it up later.

Sure. So now we come to the implementation itself. As we already touched upon, they've retained MySQL as the persistent storage database, and the MySQL databases are logically sharded. Objects are split into shards, and wherever the source object id1 resides, the associations originating from it are stored in the same shard. This is a decision made for read efficiency: as we saw from the API, queries can then be served from a single shard. So you shard the objects and keep all the associations on the same shard as their id1.

Each MySQL server is responsible for one or more shards; a shard does not directly imply a physical machine. A particular server can house a large number of logical shards, and these logical shards manifest as logical databases within the MySQL server. What these two points — one or more shards per server, and shards as logical databases — give you is that the physical servers are decoupled from the sharding scheme itself: the shard mappings are modifiable, and there is an operational layer on top that lets you move shards between physical servers. This is often done at Facebook to rebalance load across the databases. The shard id is embedded in the object id itself, so no separate lookup service is needed: given an object id, you straight away know which shard to query.

The object attributes, the typed fields, are all serialized into a single data column, and that's how objects are stored. Why? Because they do not have a use case of filtering or searching by the attributes, and it obviously helps with caching: you just serialize one column and store it.

The database servers in one region together have a full copy of the complete data, each shard in either a master or a slave role. Let's assume there is an India Facebook region, which involves multiple data centers with very low latency between them; it is assured that the complete data is stored in that region. It does not mean that all writes can go to that region, because the master databases for all the shards are not in one region; they are distributed across the globe. This is a concept you need to keep in mind while going through the paper: the consistency model needs to address this particular problem of a write originating in a region where only the shard's slave is present, not the master.
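As a rough illustration of the shard mapping, assuming the shard id is packed into the low bits of the object id; the paper only says the shard id is embedded in the id, so the exact encoding and the constants here are our guess.

```python
SHARD_BITS = 16
NUM_SHARDS = 1 << SHARD_BITS              # e.g. 65,536 logical shards

def make_object_id(sequence: int, shard: int) -> int:
    return (sequence << SHARD_BITS) | shard

def shard_of(object_id: int) -> int:
    return object_id & (NUM_SHARDS - 1)   # no lookup service needed

# Logical shards map onto physical MySQL servers via a movable table,
# so shards can be rebalanced without ever changing an object id.
shard_to_server = {s: "mysql-%03d" % (s % 128) for s in range(NUM_SHARDS)}

def server_for(object_id: int) -> str:
    return shard_to_server[shard_of(object_id)]
```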
Do you want to give an example?

Yeah. Let's say I registered as a Facebook user; my user object gets created in some data center, but the same data is also replicated to a local database in India. My further requests could hit an India web server, or they could hit the North America region; these are two different cases for where a write request goes. Say I updated my profile: the write has to propagate to the master region, say North Virginia, and for me to get read-after-write consistency it needs to be reflected at the API layer in my local region, which is the India region right now. And people keep traveling, right? I registered in India, so my primary data shard could be there, but the point from where I'm accessing will change.

So the paper's position is that the sharding is essentially random, so that no two objects are usually on the same shard, but it would obviously make sense to have some sort of geolocation affinity, so that if I am an Indian user, the data probably stays in an India region.

Now let's come to the caching layer, which is what actually implements the TAO API. The general structure is that every database is blanketed by a caching layer, and a caching layer is formed of tiers, where each tier is a set of cache servers. This is just the architecture diagram, and essentially the same structure is repeated in the slave region. The diagram visualizes a particular shard: one side is the master region for that shard and the other is a slave region for it, with replication between the two MySQL servers.

How these interact will become clear after we define the roles, so let's go over a brief overview. As I said, the caching layer is what implements the TAO API, essentially the functions we mentioned earlier. Each caching layer consists of multiple cache tiers, and each tier in turn consists of multiple servers; the storage layer shards are mapped onto the cache servers, much as in a system like Amazon's Dynamo. Among all the cache tiers present, one is the designated leader tier: it is responsible for all interactions with the storage layer under normal circumstances. Then there are multiple follower tiers; these are the cache servers a client makes requests to. On cache misses, the followers fall back to the leader, and the leader in turn falls back to the storage layer. The leader is responsible for issuing asynchronous invalidation and refill messages to its followers. One advantage here is that a single leader fronts the storage layer, so you can avoid things like stale sets; the only thing to keep in mind is that changes need to make their way back to the followers in the cluster, and this is done through the asynchronous invalidation and refill messages.

What do we mean by invalidation and refill messages? Invalidation happens in the case of objects: if an object is stale, it is evicted from the followers' caches. For association lists, though, doing that would be inefficient, because it could force a heavier read path for client requests: you would invalidate the followers only to have them refill the data from the leader. A refill instead means that the specific range of the association list that is stale is read again from the leader and re-cached on the follower servers.
Do you want to give an example of what an invalidation message and a refill message would be? I think they play an important role.

Sure. Let's say I updated my profile and changed my display name: that would involve an invalidation of my user object's id from the cache, and it gets repopulated when my page gets accessed, when the next read comes through. Whereas, say a new friend got added: this means the association list of atype "friend" has changed for my id. To invalidate the association list outright would mean the whole range of the list gets deleted, and whenever a read comes through we would have to perform a heavier fetch of the association list from the leader, or perhaps even from the storage layer. So basically what we are saying is: refill the list rather than evicting the whole entry. Since the ranges of an association list are stored together, you can refill a certain range. This is the advantage of refill, and it essentially exploits the fact that this is a read-through, write-through cache anyway.

Correct. And for me, when I was reading this paper, this is the beauty of it: how the characteristics of your data model can be leveraged when making these kinds of design choices. The nature of friend lists and association lists versus the nature of objects is what allows choices like this.

Yeah. Moving on: the eviction policy for the cache is least-recently-used. This is again a straightforward choice that arises from the nature of the social graph: recently created or frequently accessed objects are the recent objects anyway, and older objects are not accessed as often, so your cache is always going to be brimming with recent data.

Writes create the inverse association, as we already discussed; the cache itself is type-aware, so caches know which association types have inverses, but this happens without any atomicity guarantees. Why is there no atomicity guarantee? Because if you take any two object ids, it is highly unlikely that they live on the same shard; it is more probable that they are on two different shards, and quite possible that those two shards have their masters in two different regions. Across those, what they call dangling associations can happen: say I appear in your friend list, but the inverse edge back to me is not yet written. This is possible in this architecture, but there is an asynchronous process that goes around correcting inverse relationships, any dangling associations.

And you know, it's a very natural instance of it: I am sitting in India and I add someone in the US to my friend list. It's quite possible the master shard for my user is in an Indian data center and the master shard for the friend I added is in a US data center, so having cross-data-center atomic writes would have a strong bearing on performance and availability. It's a fairly obvious design choice they have made.

So let's look at the responsibilities of the leader here. Leaders perform all the reads from and writes to the storage layer.
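Before moving on to the rest of the leader's responsibilities, a toy sketch of the two maintenance messages a leader might send its followers; the message shapes are invented for illustration.

```python
def on_object_write(leader, oid):
    # Objects are small and fully replaced, so followers just drop them
    # and re-read lazily on the next request.
    for follower in leader.followers:
        follower.send({"op": "invalidate", "key": ("obj", oid)})

def on_assoc_write(leader, id1, atype):
    # Association lists are range-cached; evicting the whole list would force
    # a heavy re-read, so followers are told to refresh the affected list from
    # the leader's already-updated copy instead.
    for follower in leader.followers:
        follower.send({"op": "refill", "key": ("assoc", id1, atype)})

def on_follower_message(follower, leader, msg):
    if msg["op"] == "invalidate":
        follower.cache.pop(msg["key"], None)
    elif msg["op"] == "refill" and msg["key"] in follower.cache:
        follower.cache[msg["key"]] = leader.fetch(msg["key"])  # refresh only held entries
```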
And let me emphasize: leaders in the slave regions of a particular shard cannot handle writes themselves; those have to be forwarded to the master region. The leader also issues the invalidation and refill messages, since that is its responsibility. Leaders in master regions additionally handle cross-region write requests and return changesets to the requesting leader. This is to ensure read-after-write consistency for cross-region requests. For that to be achieved, what happens is: say a follower in a slave region receives a write request; it forwards the request to its leader, and that leader forwards it to the leader in the master region. Once the changeset, that is, the diff, is returned back along the same path — from the leader in the master region to the leader in the slave region and then further to its follower — the caches can be updated, and this ensures read-after-write consistency. How? Because all further requests from the same user hit the same web servers; that affinity is present, so you are assured the consistency you desire from the point of the write onwards.

We will walk this write path through the diagram step by step shortly, all the paths that cross regions, and see how each gets resolved in the TAO system.

Now, the leaders in the master region also need to handle the other regions somehow; what we have handled so far is read-after-write in the cluster where the request originated. This is where embedded invalidation comes in, via a blackhole table, essentially a construct where the query is fired but the data is not really stored; it's just a technique to embed messages into the binlogs, from which the invalidation and refill messages are then published. Leaders have the responsibility of embedding these into the queries, so that when an update, creation, or delete is happening, the corresponding invalidation and refill messages are also embedded into the binlogs of MySQL itself, which propagate asynchronously. Leaders in a slave region of a storage shard delegate write requests to the leader in the master region, as we already discussed. Leaders in the slave region subscribe to the embedded invalidation and refill messages from the MySQL binlogs and in turn deliver them to their followers; in case a follower is down, the messages are queued on disk and re-delivered as and when it comes back up.

The leader acts as a central cache coordinator. One of the advantages here is that a single entity accesses the storage layer, so preventing a thundering herd is essentially just serializing the concurrent writes in that layer. This was more difficult to achieve in the pre-TAO era, because the clients were distributed all over and there was no central coordinator; requests did not pass through one place, which is why they had to come up with the lease solution — leases are essentially a construct to serialize the requests anyway.

The leader also optimizes cross-region communication by bundling multiple calls into a single RPC call. This is a technique discussed in the Scaling Memcache paper as well; there they had an even more frequent case of inter-region communication, and the suggested solution was bundling RPC calls within a small time window to save on the per-call overhead.
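A sketch of what embedding invalidations in the replication stream could look like, assuming the blackhole-table trick mentioned above; the table name, message format, and database handles are made up.

```python
import json

def leader_write(db, sql_write, params, cache_messages):
    """Run the real write plus a no-op insert that smuggles cache messages
    into the binlog (the inserted data itself is discarded, BLACKHOLE-style)."""
    with db.transaction() as tx:
        tx.execute(sql_write, params)
        tx.execute("INSERT INTO cache_msgs_blackhole (payload) VALUES (%s)",
                   [json.dumps(cache_messages)])

def slave_region_tailer(binlog_events, slave_leader):
    """In a slave region, tail the replicated binlog and replay the messages."""
    for event in binlog_events:
        if event.table == "cache_msgs_blackhole":
            for msg in json.loads(event.payload):
                slave_leader.deliver_to_followers(msg)   # invalidate / refill
```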
This is again, I think, a fairly obvious choice: if you're doing cross-region network transfer of data, you want to batch as much as you can into a single call, essentially.

Yeah. So now, the follower tier. The follower tier has a fairly straightforward role to play: it serves clients with data, and on cache misses it delegates to the leader in the corresponding region. In the case of writes, the followers apply changesets: if the region does not have the master for a shard and a write has happened, a changeset is returned and has to be applied in the follower's cache, so this is also a responsibility of the follower tier. Followers process the invalidation and refill messages, as we already discussed. And they take part in shard cloning. This is a common technique: when a popular personality comes online, or is doing a Facebook Live or something, it is quite possible that the shard holding their object id receives a heavy amount of traffic. Shard cloning is a technique by which the same shard is placed again into the consistent hash ring, which allows multiple follower servers to serve the same shard.

I think shard cloning and consistent hashing are very frequently used techniques in a lot of distributed file systems and databases; I think HBase also uses them to some extent, and the Dynamo paper touches upon this as well. So shard cloning plus consistent hashing for placing the shards is a fairly common technique.

Yeah. So, the read path; the read path is the easiest to understand. The PHP servers, which are basically the web servers at Facebook, contact a follower tier. They are configured with a primary follower tier and a backup follower tier, and they can contact either one. If the follower has the data, it returns it directly. If it does not, it delegates to the leader, and if the leader also does not have the data, it reads it from MySQL and updates both its own cache and the follower's cache. This is how the read path works; it should be fairly straightforward.

Now, the write path in the master region. When a write comes from a PHP server in the master region, all that happens is the follower sends it to the leader, and the leader writes it to MySQL. One question that could arise here: why follower to leader and then leader to MySQL? Again, this preserves the property that there is only a single cache coordinator that can write and invalidate the caches in this particular cluster. Once the write is complete, two things happen: the leader has embedded the invalidation messages into the SQL queries, which propagate via the binlogs to the slave regions; and the leader issues cache invalidation messages to all the follower tiers — there can be multiple follower tiers, not just the one where the write originated — so the caches get invalidated in the region, while the changes also propagate to the MySQL databases in the other regions.
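Putting those two paths together, a minimal sketch of the read path and the master-region write path; `follower`, `leader`, and `db` are hypothetical objects, not TAO's actual interfaces.

```python
def read(follower, key):
    """Read path: PHP web server -> follower -> leader -> MySQL,
    filling each cache on the way back."""
    value = follower.cache.get(key)
    if value is not None:
        return value                      # served entirely from the follower
    leader = follower.leader
    value = leader.cache.get(key)
    if value is None:
        value = leader.db.read(key)       # only the leader touches storage
        leader.cache[key] = value
    follower.cache[key] = value           # the leader refills the follower
    return value

def write_in_master_region(follower, key, value):
    """Write path when the shard's master DB is local: follower -> leader ->
    MySQL, then asynchronous invalidations to the other follower tiers."""
    leader = follower.leader
    leader.db.write(key, value)
    leader.cache[key] = value
    follower.cache[key] = value           # the writer gets read-after-write
    for other in leader.followers:
        if other is not follower:
            other.send({"op": "invalidate", "key": key})
```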
Now, the write path originating from a slave region. A remote region makes a write request: the PHP server in the slave region sends it to a follower, the follower forwards it to its leader, and that leader knows it does not have the shard's master database in the same region, so it needs to forward the request to the leader in the master region. This is where the buffering and the RPC optimization happen. The master-region leader commits the data into the MySQL database and then returns a changeset over the same connection. The changeset that is returned is used to update the slave-region leader as well as the follower that received the request; this ensures read-after-write consistency for the writer.

For all the other regions, what happens is MySQL replication: as a slave applies the replicated statements, its binlog records that this particular query has been executed, and there is a service known as Wormhole — not discussed in the paper, but it's generally known — which is a pub-sub mechanism used to tail the binlogs at Facebook. Wormhole reads the binlogs, the leader subscribes to any invalidation and refill events originating from it, and based on those it invalidates its followers' caches as and when they arrive.

Let's reiterate: imagine this is just a slave region where the write did not originate. Any data replicated asynchronously over MySQL is picked up by the leader via the subscription, and it invalidates all the follower caches; that is the eventual-consistency path. And in case the write did originate in this slave region, the changeset returned over the wire, all the way up to the follower, is what provides read-after-write consistency.

There are some caveats here. For example, say the follower tier that issued the request goes down for whatever reason, and further requests hit the backup follower tier: that tier could still have older data, so a go-back-in-time effect could be observed. But even that is a trade-off that's acceptable as long as it's a rare occurrence, and it usually is.

How would this going back in time manifest in the user experience?

Alice makes her check-in and her own timeline does not reflect it; or she saw a comment she had made, and all of a sudden it has disappeared. So this is technically possible, but consistent hashing and the usual speed of the asynchronous MySQL replication ensure that it is rarely noticeable.

Right, and I think they have published the typical SLAs for the delays they see in replication. We'll come to that in a couple of minutes.

Sure. So, the consistency model: as we discussed, TAO is eventually consistent, owing to the asynchronous nature of replication, and read-after-write consistency is ensured for the writer. The read-after-write guarantee has another caveat: the changeset is relayed cross-region, which takes significant transfer time, and while it is being propagated over the wire to the follower tier, it is quite possible that another write has happened via another region. It's unlikely, because there is usually some region affinity between users and web servers, but it is technically possible. That is when a follower needs to make a call on whether to apply the changeset or not, and this is where the concept of versioning is used: every object has a version associated with it, and a changeset carries the from-version and the to-version.
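A sketch of how a follower might decide whether to apply a returned changeset, using the from/to versions just described; the structure is our guess at the mechanism, not the paper's wire format.

```python
def apply_changeset(cache, key, changeset):
    """changeset = {"from_version": ..., "to_version": ..., "value": ...},
    as returned by the master-region leader after a remote write."""
    entry = cache.get(key)
    if entry is not None and entry["version"] >= changeset["to_version"]:
        return False  # replication already delivered this write (or a later one)
    if entry is not None and entry["version"] > changeset["from_version"]:
        return False  # our base is newer: a concurrent write won the race
    cache[key] = {"version": changeset["to_version"],
                  "value": changeset["value"]}
    return True
```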
If the from-version is already marked obsolete, because asynchronous replication has delivered something newer, the changes are not applied. There is still the risk of the replication itself being slow; in that case there is definitely a chance of a go-back-in-time effect, where a remote write does not yet reflect in the writer's timeline. This is highly unlikely, but it is discussed in the paper as one of the places where read-after-write consistency could be compromised. Inconsistency might also arise from partial leader failures and follower failures: as we discussed, a follower failure means a request might land on a follower tier that never had the changeset applied, and a partial leader failure can likewise lead to inconsistent results, because the data travels through the leader tier as well. And, as we discussed earlier, there is the case where clients might observe a go-back-in-time experience because the changeset version indicates it is older than the current data while replication has not yet caught up.

Now, TAO's failure handling. A storage-layer master failure is handled by auto-promoting an existing slave, a fairly common technique when running MySQL servers replicated across regions. A storage-layer slave failure is handled by redirecting all the reads in that region to the master region. On slave promotion there is a risk of data inconsistencies, which is why the invalidation and refill messages from, say, the last 10 minutes before the failure are replayed across regions, to ensure there are no bad values in the caches.

That is essentially just replaying the binlog, like you would with any data store, to cover the time gap between the master going down and the slave catching up.

Yeah. Cache failure handling: on a leader cache failure, the followers handle cache misses by reading directly off the storage layer; this is the only case where a follower hits the storage layer directly. On a leader cache server failure, writes are sent to a replacement leader chosen at random, which queues up the embedded invalidation and refill messages on behalf of the original leader until it is back in service. Refill and invalidation delivery failures are handled by queuing the messages on disk and delivering them later. And follower failures are handled by routing requests to a backup follower tier; it is a per-client configuration that there is a primary and a fallback follower tier for each client.

Now we'll discuss the workload and performance. As we said, it is a read-heavy system, but it is so lopsided that 99.8% of the requests are reads and only 0.2% of them are writes. Coming to the split across calls: assoc_time_range is barely used, because assoc_range suffices for most use cases, given that association lists are already returned ordered descending by time; and obj_get and assoc_get are among the more popular APIs. Association lists, surprisingly, have the characteristic that they are often empty or small. This too defends the design choice of enforcing pagination.
It also defends enforcing a fixed maximum limit, because most use cases are well within the 6,000 count; very large association lists are very rare. Essentially, the number of people on Facebook with on the order of 6,000 friends, or posts with on the order of 6,000 comments, is fairly small.

Which seems obvious; not all of us are as popular as the celebrities we would like to be. But one thing that struck me here: I used to work at Yahoo, and we were trying to build a social network there, and for the relationship graphs we had implemented, we had planned for a split of about 97% reads to 3% writes. What this shows is 99.8% reads and 0.2% writes. I think that kind of distribution gives them a lot of leeway in terms of how much caching and what kind of performance they can extract out of their single boxes, or their tiers of boxes. Having fewer writes always makes life a little easier.

Yeah. So again, to discuss throughput, we need to see it as a function of the hit rate itself; when the cache is not warm, it is not fair to judge the throughput of the system. At the maximum hit rate, 90-plus percent, about half a million requests per second are served by a single cache server. On availability, the failed-query percentage is extremely low in the sample data presented in the paper, but this has to be taken with a pinch of salt, because there are dependencies between queries: the client would have issued a large number of queries, and once one fails, the rest are affected. For example, if listing my friends has failed, then the subsequent requests that would arise out of it are moot.

One other point to highlight, related to empty or small association lists: the cache server has specific entries for the assoc counts. assoc_count is not answered by actually fetching the associations and counting them; it is an integer value stored in the cache itself. Very often the stored value is zero, and in those cases, when an assoc query comes in, the server can straight away respond with an empty result. This is a small performance improvement, but very specific to the social-graph use case.

That also makes sense, right? If I'm looking at a relational store, I wouldn't want to fire count queries if each count query would end up being a range scan. Assume I have to render a post and show how many friends have liked it, or how many friends are in my list; I wouldn't want to run count queries all the time. Given that only 0.2% of operations are writes, I would rather pre-compute the counts at write time, store them as is, and fetch them at read time. So it makes sense, yeah.

Yeah. And in the sample set presented in the paper, they had an overall hit rate of 96.4%. Now coming to the read latencies, they are typically of the order of a few milliseconds. The write latency is an interesting one: it needs to be measured in two different contexts, once in the master region and once from a remote region. In the master region, commits mostly take around 10 milliseconds. The dotted line in the chart indicates the network latency between the master region and the remote region that was profiled.
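A tiny sketch of the cached-count shortcut mentioned a moment ago; the key scheme and handles are hypothetical.

```python
def assoc_count_cached(cache, id1, atype):
    # Counts are plain integers in the cache, maintained at write time
    # (cheap, since only ~0.2% of operations are writes).
    return cache.get(("count", id1, atype), 0)

def assoc_range_cached(cache, id1, atype, pos, limit, fetch_fn):
    if assoc_count_cached(cache, id1, atype) == 0:
        return []          # very common case: answer without touching the list
    return fetch_fn(id1, atype, pos, limit)

def on_assoc_add(cache, id1, atype):
    key = ("count", id1, atype)
    cache[key] = cache.get(key, 0) + 1   # keep the integer in step with writes
```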
As we can see, it is around 60 milliseconds, and consistently the median commit latencies are in the 60-70 millisecond range when the write is cross-region. Then there is the storage server replication lag, which is another thing to keep in mind, because it basically defines the SLA for eventual consistency: in 85% of cases it is under one second, and the highlight is that replication is done within 10 seconds 99.8% of the time, which is very acceptable for a network like Facebook.

Yeah, so only in 0.2% of cases would a friend of mine see a like from me later than 10 seconds, in some form. I can live with that.

Yeah. And many of us have experience operating MySQL, and at least in my experience numbers like these are not the norm. Most probably the reason is the simplistic modeling of the graph in the MySQL data store, which is just a simple objects table and an associations table, and you're done; there are no complex queries being fired, essentially because no complex APIs are exposed from the layer anyway.

The paper also discusses how intersections are handled: they are not handled in TAO at all. Intersections are a common use case for something like a Facebook profile — we would want to know who our common friends are, for example — but they are offloaded to the client library, which issues the raw TAO queries and computes the intersections itself.

We'll quickly go over the related work discussed in the paper. On eventual consistency and read-after-write consistency, formal definitions building on the CAP theorem: Terry et al. and the Vogels paper discuss this. For geographically distributed data stores, they discuss Google Spanner and how it employs the Paxos algorithm to ensure there is only one leader at a time. Distributed hash tables and key-value systems like Amazon's Dynamo: this is for the consistent hashing part. Hierarchical caching systems like Akamai's content cache: this is a very relatable model; with it you can envision how a CDN should ideally be structured, and Akamai — and presumably most CDNs — already does this. Structured storage systems like SimpleDB: this covers how, instead of a columnar or relational data structure, you can store the complete data serialized in one column. And modern graph databases like Neo4j, which has its own consistency levels and a very flexible query language.

So, to summarize: the social graph data model that pre-existed TAO is widely discussed in this paper. TAO introduces hierarchical cache layers to ensure that your requests are served as much as possible from the local region, and it introduces an asynchronous model to provide consistency across data centers spread around the globe. That's it for the presentation; I think we can take questions or discuss anything.

Thanks, Rohit, I really enjoyed the presentation. Let's go to the Q&A.
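As an aside on the intersection point from the presentation, a sketch of how a client library might compute mutual friends on top of raw TAO calls; `fetch_ids` is a hypothetical wrapper over assoc_range, and 6,000 is the paper's cap.

```python
def mutual_friends(fetch_ids, alice_id, bob_id):
    """fetch_ids(id1, atype, pos, limit) -> list of destination ids;
    think of it as a thin wrapper over assoc_range."""
    alice_friends = set(fetch_ids(alice_id, "friend", 0, 6000))
    bob_friends = set(fetch_ids(bob_id, "friend", 0, 6000))
    return alice_friends & bob_friends   # the intersection happens client-side
```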
So, Shree Prajwal's question we answered in the first break. The next question is from Abhilash: since all queries are on atypes, which are the association types, and otypes, do they internally maintain separate type-specific graphs for faster queries?

Not separate graphs as such; they essentially maintain association lists. But there is a small optimization in the cache server implementation that we didn't touch upon: they partition the memory into arenas, and arenas are grouped by type. This is to ensure that better-behaved types are not impacted by the poor cache performance or abusive access patterns of badly behaving types. And association lists, which hang off an id1 and an association type, are stored in contiguous memory locations, as ranges, in the cache servers.

Great. Abhilash, I hope that answers your question; please feel free to raise your hand or leave a comment if you want to discuss it further. In the meantime, let's move on to the next question, from Ashwin Alaparthi. He asks: how do you generally decide how many shards are on the same physical machine, and what ensures there isn't too much swapping, since all the databases need memory?

Yeah, so what the paper presents is that the number of shards has to be large — significantly large; there is no order of magnitude given in the paper. But I would assume the situation of high load on a particular database is addressed by the resharding mechanism they have: you can move the logical shards between physical servers, and Facebook actively does this. I'm pretty sure there is manual, or at least semi-manual, intervention involved, based on alerts.

And if you go into the details of the paper, they also give fairly clear estimates of the size in bytes for each of the data types. So with some back-of-the-envelope calculation it's easy to predict, if you have X keys on a shard, the potential size in gigabytes that shard might occupy, and from that figure out how much RAM and disk you need on a physical server, and map the number of shards accordingly.

So basically, what if Obama and Trump end up on the same shard? That's exactly why they would need the resharding technique. I'm sure the sharding mechanics are fairly smart and factor in a lot of runtime characteristics of the overall system: which shards are getting hot, which shards are getting heavier in terms of reads, and so on. That dynamic feedback loop must have been built into the shard allocation service, I'm sure; I don't know if they specifically mention it, but that would be my guess.

And the shard count has to be sufficiently high to confidently say that it is very unlikely that any two object ids you pick from the Facebook system are on the same shard. That itself gives you an idea of how many logical shards are mapped onto the physical servers.
And I read this paper many, many years ago, and when I came across the point that the shard ID is derived from the object ID, my immediate reaction was to go and look at what the IDs actually are. If you folks have noticed, object IDs are long numeric strings; the paper says every object gets a 64-bit id with the shard id embedded in it. So they have a fairly big bit space in which to allocate a shard to an object. And since shard discovery can happen with simple bit-manipulation operations (sketched below), they don't have to build a shard lookup service, which at Facebook scale would probably need to serve billions or trillions of requests a second.

Great. There is a question from Sri Prajwal again: are there any noteworthy open-source TAO implementations we can check out? There are no open-source TAO implementations, but in the deck that I shared there are two talks by the developers themselves. They do go into some of the nitty-gritty that is not discussed in the paper, things like Wormhole and how it helps. But there is no direct open-source implementation, no.

Also, what I feel, Sri Prajwal, is that if you look at the way they have explained the concept, their API patterns, their access patterns, and their product use cases are fairly niche, right? Not every database, not every graph kind of problem, entails these kinds of constraints. So an open-source implementation may or may not be beneficial to a lot of people. That's why generic implementations like Neo4j, AWS's Neptune, or the Apache-licensed TitanDB are the common open-source or managed options: they address broad, horizontal use cases rather than the specific ones presented by Facebook in this paper.

Yeah, and some of these techniques can probably only be managed by a company of Facebook's scale. Just look at the failure handling part of it: it is so much tooling and so much operational capability just to pull this off. Also, a 10-second replication lag bound at the 99.8th percentile; for mere mortals like us, that kind of infra is very painful to build. I've been running MySQL servers for many, many years, and these are mind-boggling numbers, especially when you look at cross-region API calls and cross-region latency. You can imagine the kind of network Facebook manages, even the transatlantic and cross-continental lines they are maintaining. I don't think many companies would be able to operate this kind of infrastructure unless they have the power to build network operations that can support it. Yeah.

Great. Any other questions, folks? Please feel free to raise your hands or leave a comment; I can also unmute you if you want to speak out. I think we are towards the end of the paper. We have about another 10 minutes, so we can just hang around and have casual conversations. By the way, I must thank Hasgeek for building the beautiful platform that allows the community of developers in India to come together. So do subscribe to their Twitter handles and do watch out for their posts about all the events they are conducting. Thanks, Hasgeek has helped us a lot.
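The bit-manipulation shard discovery mentioned above, as a minimal sketch: the paper says the shard id is embedded in the 64-bit object id, but the exact layout (here, the low 16 bits) is my own assumption for illustration.

```python
SHARD_BITS = 16                      # assumed: low 16 bits hold the shard id
SHARD_MASK = (1 << SHARD_BITS) - 1

def make_id(sequence: int, shard_id: int) -> int:
    """Pack a 64-bit object id with the shard id embedded in the low bits."""
    return (sequence << SHARD_BITS) | (shard_id & SHARD_MASK)

def shard_of(object_id: int) -> int:
    """Shard discovery is pure bit manipulation: no lookup service needed."""
    return object_id & SHARD_MASK

oid = make_id(sequence=123_456_789, shard_id=42)
assert shard_of(oid) == 42
print(oid, shard_of(oid))
```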
Folks, also, Papers We Love is, I mean, we don't have a formal group, but this is an initiative we would like to promote a lot more within Bangalore and India. We are a tech innovation hub in the world right now, so as a community of software engineers and programmers, the more conversations we have around these architectures, seminal papers, and theoretical concepts, the more the whole community benefits. If you want to present a paper, feel free to reach out to any of us, like me, Swanand, anyone in the Hasgeek team; tag us on Twitter, and I'll share my Twitter handle later on. In fact, if you want to present a paper, do let us know, we'd be more than happy to host you folks. We have a bunch of people who have shown interest in presenting a paper in the July edition, and in case we get more volunteers, we're more than happy to have these sessions at a higher frequency.

So yeah, it's an open session, folks. If anybody wants to add comments or has any questions, do let us know; I can unmute you folks and we can have a casual chat.

Since you mentioned the Google Spanner paper, I remember reading it, right? In it they actually implement atomic transactions across geographical regions, and what they mention is that this can only happen because they put an upper bound on the network latency across regions. Again, this makes me wonder what kind of network infra Google might be running; they are able to ensure an upper bound on the latency between, say, America and Europe. This is mind-boggling.

Yeah, so the Scaling Memcache at Facebook paper is also very good to read. The client-side optimizations are something we did not touch upon a lot; even this paper just delegates them to that older paper. So that paper is also a good read. Also, I think while they were building the memcache scaling at Facebook, they also wrote a library called mcrouter. I had explored it back then: mcrouter acts like a proxy client for a cluster of memcached servers, and you can implement sharding and routing logic in it. I believe Twitter also built a similar proxy library, Twemproxy if I remember right, and I think both came out within a matter of a few months of each other. So yeah, Scaling Memcache at Facebook is again a very beautiful paper, and I would strongly encourage people to read it if you want more.

And one more thing to comment on: most of their solutions are pragmatic rather than comprehensive, I would say. Even the things presented in the scaling memcache paper, like the remote markers, are crude. It's a crude solution, but it's a working solution and it works well for them.

Great. All right, folks, we are just about at time. I've shared the Twitter handles where you can tag any one of us if you want to present a paper or have suggestions on a paper you want discussed, right? It could be systems-related or AI/ML-related, or anything around theoretical concepts. We are happy to arrange a speaker, but we would want people to come forward as volunteers, and we'll be happy to arrange sessions in the future as well. So, all right. Thanks everyone for joining. Lovely having this session today. Have a nice evening, everyone. Thank you.