Published on Jul 13, 2016
Production-quality Mesos frameworks must be able to continue managing tasks despite unreliable networks and faulty computers. Mesos provides tools to help developers implement fault-tolerant task management, but putting these tools together effectively remains something of a black art. This talk will offer practical guidance to current and prospective framework developers to help them understand how Mesos deals with failures and the tools it provides to enable fault-tolerant frameworks. Mesos operators will also benefit from a discussion of exactly how Mesos behaves during network partitions and other failure scenarios.
This talk will cover the following specific topics:
* fault tolerance in Mesos itself: how Mesos masters and agents behave in the face of process crashes and network partitions
* the tools that Mesos provides to help framework authors write reliable systems (e.g., task state reconciliation, the state abstraction, and the MasterDetector interface)
* the lifecycle of a Mesos task
* a collection of recommendations for how framework developers should build highly available framework schedulers and executors
About Neil Conway
Neil Conway is a Distributed Systems Engineer at Mesosphere, where he works on the Apache Mesos project. Before Mesosphere, Neil built automated trading systems at a quantitative hedge fund, completed a PhD in distributed systems at UC Berkeley, was a principal engineer at a stream processing startup, and was a committer and major developer of the PostgreSQL relational database system.