Alfred Fuller, Matt Wilder
For the first three years of App Engine, the health of the datastore was tied to the health of a single data center. Users had low latency and strong consistency, but also transient data unavailability and planned read-only periods. The High Replication Datastore trades small amounts of latency and consistency for significantly higher availability. In this talk we discuss user-facing and operational issues of the original Master/Slave Datastore, and how the High Replication Datastore addresses these issues.
Google has been the best internet company :) who agrees with me :)
GreenyLiveshow 4 months ago
Not that it matters much to app developers, but which parts are keeping App Engine from offering a stronger SLA? Looks like Datastore is not the weakest link ;)
allyourcode 9 months ago
@allyourcode Yep. The coordinators each contain a cache of entity groups that are up to date, which must be invalidated in any datacenter that fails to accept a write. If a read does not find an entity group in a coordinator, it must read the state from a majority of replicas to figure out if it has the current version (and much more interesting things happen when a coordinator is unavailable :-)).
alfnoodlez 9 months ago
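The coordinator behaviour described in the comment above can be sketched as a toy model. This is only an illustration under stated assumptions, not the datastore's actual implementation: the names `Replica`, `Coordinator`, `write`, and `read` are hypothetical, versions are plain integers standing in for log positions, and the real protocol (see the Megastore paper mentioned elsewhere in this thread) handles many cases this sketch ignores, including unavailable coordinators.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica: maps entity group -> latest applied version."""
    name: str
    versions: dict = field(default_factory=dict)


@dataclass
class Coordinator:
    """Hypothetical per-datacenter cache of entity groups known to be current."""
    up_to_date: set = field(default_factory=set)

    def validate(self, group):
        self.up_to_date.add(group)

    def invalidate(self, group):
        self.up_to_date.discard(group)


def write(group, replicas, coordinators, reachable):
    """Apply a write to the reachable replicas (a majority is required).

    Coordinators of replicas that missed the write are invalidated, so a
    later local read there cannot silently return stale data.
    """
    if len(reachable) * 2 <= len(replicas):
        raise RuntimeError("write needs a majority of replicas")
    new_version = max(r.versions.get(group, 0) for r in replicas) + 1
    for replica in replicas:
        if replica.name in reachable:
            replica.versions[group] = new_version
            coordinators[replica.name].validate(group)
        else:
            coordinators[replica.name].invalidate(group)
    return new_version


def read(group, local, replicas, coordinators):
    """Read locally if the coordinator vouches for the group, else ask a majority."""
    coordinator = coordinators[local.name]
    if group in coordinator.up_to_date:
        return local.versions.get(group, 0), "local"
    # Any majority overlaps the majority that accepted the last write,
    # so the highest version seen here is the current one.
    majority = len(replicas) // 2 + 1
    others = [r for r in replicas if r is not local]
    polled = [local] + others[: majority - 1]
    latest = max(r.versions.get(group, 0) for r in polled)
    local.versions[group] = latest  # catch up "on demand" (simplified)
    coordinator.validate(group)
    return latest, "majority"
```

In this model, a replica that missed a write answers its first read through the majority path and subsequent reads locally, which matches the "knows when it is behind" behaviour described in the thread.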
@alfnoodlez Thanks. I was actually hoping this video would help me understand that paper :P. After reading it again, it makes more sense. To answer my own question, there is a coordinator at each data center that knows whether the replica has the most current version of an entity group, and the coordinator knows, because it gets updated when a write happens at another replica.
For anyone else who comes across this, it would be helpful to have read the BigTable paper beforehand.
allyourcode 9 months ago
@allyourcode These error rates just apply to the datastore, which in all cases will result in some user-facing exception. Total unavailability of the datastore will result in a timeout. Of course there are other parts of the App Engine stack that need to be functional before you can even get to the datastore. The availability of these other components is not included in the numbers. This is why we are planning to offer a 99.95% SLA (which covers the entire stack) instead of 99.999%.
alfnoodlez 9 months ago
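To put the two percentages from the comment above in perspective, here is a small arithmetic sketch. The helper name `allowed_downtime_minutes` is hypothetical, and the calculation simply assumes a 365-day year with downtime spread over the whole period; it says nothing about how the actual SLA measures availability.

```python
def allowed_downtime_minutes(availability_percent, period_days=365):
    """Minutes of unavailability permitted by a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# 99.95% allows roughly 262.8 minutes (about 4.4 hours) per year;
# 99.999% allows roughly 5.3 minutes per year.
for target in (99.95, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/year")
```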
@alfnoodlez Thanks. Those are interesting details, but I think I need to rephrase my question. Usually, when I think about a service being "unavailable", I'm thinking about the amount of time that it's not even operating, meaning that it can't even produce an error code.[1] Do those error rates include the amount of time that Datastore is just not operational?
[1] I guess at that point, there is some set of front-end servers that produces error responses on behalf of the back-end service.
allyourcode 9 months ago
@allyourcode This number includes any error sent back by the datastore that indicates unavailability (such as timeout or internal errors, but not bad request or concurrency errors). The numbers shown are global estimates and may not reflect a single app's actual experience. (It also doesn't include planned maintenance periods.)
alfnoodlez 9 months ago
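The definition in the comment above amounts to counting only the failures that signal unavailability. A minimal sketch, assuming each request outcome is recorded as a string label; the labels and the `error_rate` function are hypothetical and are not the datastore's real error codes:

```python
# Outcomes that count as "unavailability" per the comment above (labels are
# hypothetical). Bad requests and concurrency errors are deliberately excluded.
UNAVAILABILITY_ERRORS = {"timeout", "internal_error"}


def error_rate(outcomes):
    """Fraction of requests that failed with an unavailability error."""
    if not outcomes:
        return 0.0
    failures = sum(1 for outcome in outcomes if outcome in UNAVAILABILITY_ERRORS)
    return failures / len(outcomes)
```

Under this definition, a request rejected as malformed or aborted by a concurrent update still counts toward the denominator but not toward the error count.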
@allyourcode The High Replication Datastore is able to know when a replica is behind and replicate the write 'on demand' using the metadata associated with an entity group. For the exact details on how this works I recommend you check out the paper "Megastore: Providing Scalable, Highly Available Storage for Interactive Services" (which can be found using your favorite search engine).
alfnoodlez 9 months ago
@13:01 re "Average Error Rate": What's considered an error?
allyourcode 9 months ago
If not all replicas get written to synchronously, then what happens when you try to read from a replica that hasn't gotten the most recent writes? How does it even know that it's missing some writes?
allyourcode 9 months ago