Google has been the best internet company :) who agrees with me :)
GreenyLiveshow 4 months ago
@13:01 re "Average Error Rate": What's considered an error?
allyourcode 9 months ago
@allyourcode This number includes any error sent back by the datastore that indicates unavailability (such as timeouts or internal errors, but not bad request or concurrency errors). The numbers shown are global estimates and may not reflect a single app's actual experience. (It also doesn't include planned maintenance periods.)
alfnoodlez 9 months ago
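To make the definition above concrete, here is a minimal sketch of how such a rate could be computed. The outcome labels, the function name, and the choice of "all requests" as the denominator are assumptions made for illustration; the thread does not say how the real number is calculated.

```python
# Hypothetical outcome labels; only the first two count as "unavailability"
# under the definition given in the comment above.
UNAVAILABILITY_ERRORS = {"timeout", "internal_error"}


def unavailability_error_rate(outcomes):
    """Fraction of requests that failed with an unavailability-type error.

    Bad request and concurrency errors are deliberately not counted.
    Assumes the denominator is every request in the sample.
    """
    if not outcomes:
        return 0.0
    failures = sum(1 for outcome in outcomes if outcome in UNAVAILABILITY_ERRORS)
    return failures / len(outcomes)


sample = ["ok"] * 96 + ["timeout", "internal_error", "bad_request", "concurrency_error"]
rate = unavailability_error_rate(sample)
print(f"{rate:.2%}")  # prints 2.00%
```

Only two of the four failures in the sample count, so the rate is 2 out of 100 requests.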
@alfnoodlez Thanks. Those are interesting details, but I think I need to rephrase my question. Usually, when I think about a service being "unavailable", I'm thinking about the amount of time that it's not even operating, meaning that it can't even produce an error code.[1] Do those error rates include the amount of time that Datastore is just not operational?
[1] I guess at that point, there is some set of front-end servers that produces error responses on behalf of the back-end service.
allyourcode 9 months ago
@allyourcode These error rates apply only to the datastore, and a datastore error will in all cases result in some user-facing exception. Total unavailability of the datastore will result in a timeout. Of course, there are other parts of the App Engine stack that need to be functional before you can even get to the datastore. The availability of these other components is not included in the numbers. This is why we are planning to offer a 99.95% SLA (which covers the entire stack) instead of 99.999%.
alfnoodlez 9 months ago
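For a sense of scale, the two figures mentioned above can be turned into a downtime budget. This is plain arithmetic, not anything from the actual SLA terms; the 30-day window and the function name are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_percent, days=30):
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for sla in (99.95, 99.999):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% over 30 days: {minutes:.2f} minutes")
# 99.95% over 30 days: 21.60 minutes
# 99.999% over 30 days: 0.43 minutes
```

So the gap between the two numbers is roughly 21.6 minutes versus 26 seconds of allowed downtime in a 30-day period.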
Not that it matters much to app developers, but which parts are keeping App Engine from offering a stronger SLA? Looks like Datastore is not the weakest link ;)
allyourcode 9 months ago
If not all replicas get written to synchronously, then what happens when you try to read from a replica that hasn't gotten the most recent writes? How does it even know that it's missing some writes?
allyourcode 9 months ago
@allyourcode The High Replication Datastore is able to know when a replica is behind and replicate the write 'on demand' using the metadata associated with an entity group. For the exact details on how this works, I recommend you check out the paper "Megastore: Providing Scalable, Highly Available Storage for Interactive Services" (which can be found using your favorite search engine).
alfnoodlez 9 months ago
@alfnoodlez Thanks. I was actually hoping this video would help me understand that paper :P. After reading it again, it makes more sense. To answer my own question, there is a coordinator at each data center that knows whether the replica has the most current version of an entity group, and the coordinator knows, because it gets updated when a write happens at another replica.
For anyone else who comes across this, it would be helpful to have read the BigTable paper beforehand.
allyourcode 9 months ago
@allyourcode Yep. Each coordinator contains a cache of the entity groups that are up to date at its datacenter, and that cache entry must be invalidated in any datacenter that fails to accept a write. If a read does not find an entity group in a coordinator, it must read the state from a majority of replicas to figure out whether it has the current version (and much more interesting things happen when a coordinator is unavailable :-)).
alfnoodlez 9 months ago
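The read path described in the last few comments can be sketched as a toy model. This is a heavily simplified illustration, not the Megastore algorithm itself: the class and function names are made up, the "log position" is just a counter per entity group, and coordinator failures (the "much more interesting" case) are not modeled at all.

```python
class Coordinator:
    """Per-datacenter cache of entity groups known to be up to date locally."""

    def __init__(self):
        self.up_to_date = set()

    def is_current(self, group):
        return group in self.up_to_date

    def validate(self, group):
        self.up_to_date.add(group)

    def invalidate(self, group):
        self.up_to_date.discard(group)


class Replica:
    def __init__(self, name):
        self.name = name
        self.coordinator = Coordinator()
        self.log_position = {}  # entity group -> highest write applied here


def write(replicas, group, reachable):
    """Apply a write at the reachable replicas (must be a majority) and
    invalidate the coordinator at every replica that did not accept it."""
    if len(reachable) <= len(replicas) // 2:
        raise RuntimeError("write needs a majority of replicas")
    new_position = max(r.log_position.get(group, 0) for r in replicas) + 1
    for replica in replicas:
        if replica in reachable:
            replica.log_position[group] = new_position
            replica.coordinator.validate(group)
        else:
            replica.coordinator.invalidate(group)
    return new_position


def read_position(local, replicas, group):
    """Return (log position, path taken) for a read that starts at `local`."""
    if local.coordinator.is_current(group):
        return local.log_position.get(group, 0), "local"
    # The coordinator cannot vouch for the local replica, so ask a majority.
    # Any majority overlaps the majority that accepted the latest write.
    majority = len(replicas) // 2 + 1
    latest = max(r.log_position.get(group, 0) for r in replicas[:majority])
    # Catch up "on demand", then mark the group as current again.
    local.log_position[group] = latest
    local.coordinator.validate(group)
    return latest, "majority"


a, b, c = Replica("a"), Replica("b"), Replica("c")
replicas = [a, b, c]
write(replicas, "group-1", reachable=[a, b])  # c misses the write
print(read_position(a, replicas, "group-1"))  # (1, 'local')
print(read_position(c, replicas, "group-1"))  # (1, 'majority')
print(read_position(c, replicas, "group-1"))  # (1, 'local')
```

The point of the sketch is the last three lines: a replica that missed the write cannot serve the read locally until it has checked with a majority and caught up, after which its coordinator can vouch for it again.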
cool video :D
fmpjs 9 months ago
These nerds get $25,000 per month to do stuff that only makes the net more complicated
F*ck.
Listen2Colors 9 months ago
@Listen2Colors complication to you is speed, consistency, and reliability to these "nerds".
TrendCycles 9 months ago
The only thing we are interested in... Is how to get more traffic to websites...
Films4You 9 months ago
this guy talking like hes in a bar bating a girl
b3nrules 9 months ago
@b3nrules don't talk to girls about these subjects ;)
PrettyAwesomeBaby 9 months ago