Well, how do I say it, Ross? How do I say it? Okay. Okay. Oscar works. Oscar? Okay, so Oscar is... I feel like that. My name is like three letters. It's super easy to say. I don't know if we understand the hardships. Okay, so he's going to be talking about how we can use open source databases rather than the ones that Google or Amazon or Azure provide in our cloud platforms. It's going to be a super interesting talk because he's trying to avoid vendor lock-in, which is kind of how we operate sometimes. So, super excited to see this.
All right, thanks. So yeah, I'm Oscar, and we're going to be talking about the different open source database services in the cloud and what alternatives you have there. I'll start with a little bit of an introduction on what we do and who I am, and then we're going to talk about managed cloud services and the different kinds of databases that are available in the cloud, from relational databases to distributed key-value stores, message streaming and such. Finally, I'll talk about a typical data pipeline scenario that we see being used in the cloud with open source software.
I'm one of the founders of Aiven, which is a cloud database-as-a-service company. Previously I was working as a database consultant, mostly focusing on Postgres and other open source databases, as well as a software architect designing large scalable systems for various enterprises. Open source software is something I've worked with for a long time, since the nineties, and I've created and maintained a bunch of different open source tools, mostly around databases and data management. A lot of them are related to Postgres, but I've recently also been working with other technologies.
When I started working on Linux, and I just looked up the first Linux distribution I actually used, it was Yggdrasil's Plug-and-Play Linux. That was in, I think, '94 or something, and it really wasn't as plug and play as you would hope. Unfortunately, in a way, I think open source tools are still in that phase: there's really good technology there, but it's more a set of tools to build something cool out of than a fully polished, complete solution. That's one of the reasons why we founded Aiven, where we try to make it as easy as possible for everybody to benefit from open source databases and all the work and innovation going into different open source technologies.
We run open source databases on five different public clouds: AWS, Google Cloud, Azure, DigitalOcean and UpCloud, and we make sure that our databases are available and as easy as possible to use. Our team is based in Helsinki, Finland, and Boston, US, but we're also going to be launching a presence here in APAC later this year. Right now we run and operate seven different open source databases in seventy different regions around the world. That includes 23 in APAC and quite a few in Singapore.
When we look at managed cloud services, the term sometimes means different things to different people, but what we're talking about here is the space between the infrastructure, so the hardware and the virtual machines, and the actual applications. We're focusing on a layer that's sometimes called platform as a service or database as a service, where databases are made available to you over an API request or something similar that just provides you with an access point to the database, without you having to install any database software or necessarily tune it at all. There are quite a few vendors in this space.
There are the big cloud service providers, AWS, Google and Microsoft, who have complete solutions that cover all parts of this. We also have the somewhat smaller cloud infrastructure providers such as DigitalOcean and UpCloud, and I guess Rackspace would fall somewhere there, who give you easy API access to VMs and infrastructure resources but don't necessarily provide other higher-level services. Then there's a bunch of players such as us, Aiven, and there's Compose, which is an IBM company nowadays, and Heroku, providing more of a development platform and database management as a service. This is of course the space I'm working in, but I think it's an interesting space to work in, and we're also able to provide open source services for a large number of end users. I didn't really touch on the SaaS layer and applications because I'm working on the layers below that, but a lot of these SaaS applications are built on top of the PaaS and IaaS systems.
What we've seen, of course with our customers and the people we talk to, but also more generally in the industry, is that pretty much nobody is spinning up new data centers for running databases or other services on premises anymore. There are of course a lot of them still in existence, but the de facto deployment target for new systems is some kind of a cloud. I think AWS EC2 was the first one that started moving here, and that's where a lot of things are being deployed. But in addition to just running VMs in the cloud, we see more and more users now starting to use services like AWS RDS, Google's Cloud SQL or Cloud Datastore, or similar data management systems in the cloud.
A lot of these services are based on open source. Postgres, for example, is available from quite a few different vendors. But the cloud services are also introducing more and more new types of distributed databases that are proprietary and only available from one of these vendors, so we'll look into the options available in the different paradigms.
When we look at open source databases, and also proprietary databases, you have a couple of options when operating in the cloud. You can just get the VM and compute resources from the cloud and install everything there yourself. That gives you full control of the system, and it's probably something you need to do if you just want to take an existing database from on premises and move it to the cloud, but you still have to worry about maintenance and management of those systems, so it's pretty resource-intensive and requires personnel to operate them. Sometimes that's not what you want to do. You want to have your developers work on actual applications, not on the infrastructure, and that's why you would consider using a managed database provider, which typically makes it a lot faster to spin up new instances. One of the benefits of having your database managed by somebody else is that you also get access to the cool proprietary databases developed by companies like AWS, Google and Microsoft, where you get huge scalable systems. But since we're talking about open source services, that's not necessarily a plus in this case: it also has the possibility of locking you into a single provider.
On the relational database side we have a number of open source databases, like Postgres, which I'm most familiar with, and of course MySQL and MariaDB. All of these are available as managed services from a large number of vendors: the big cloud providers as well as smaller players such as Aiven, Compose, Heroku, ClearDB and other vendors.
So if you're working with relational databases, and as we just saw in the previous talk 20 minutes ago, traditional relational databases such as Postgres can get you pretty far. You probably don't need to consider going for proprietary databases, or even for things like Hadoop and Cassandra, unless you have really tons and tons of data. If you're working with single-digit terabytes of data, Postgres is probably going to be just fine for most use cases, and it's going to be a lot simpler than many of the other alternatives. But if you want to, or need to, go to distributed key-value types of databases and scale to hundreds of nodes, you probably don't want to do that with Postgres anymore. I guess it's possible. I don't know what the largest Postgres cluster out there is, probably something that Skype ran ages ago, but nowadays it looks like things like Cassandra are most commonly used in the open source space.
Cassandra is a distributed key-value database that's based on the Dynamo paper that was published by Amazon almost 10 years ago, I believe. That paper is also the basis for the AWS DynamoDB service, which is a really good database system, but it's also totally proprietary to Amazon, and if you put your data there you'll have complete vendor lock-in with Amazon, with no way out. That's where our recommendation would be to look into Cassandra, or Scylla, which is a re-implementation of Cassandra by a different team. There are also a number of companies providing Cassandra and Scylla as fully managed services, so if you're just worried about the complexity of setting all of this up, you don't have to go to proprietary services to make it easier. You can also get it as a managed service from companies like Compose and Instaclustr.
Another interesting type of database is the distributed relational database, which for a long time was only available internally at Google with their Spanner system, which I think was a pretty
unique system, as it allowed you to have a globally consistent relational database with pretty high availability guarantees. That required a lot of specialized hardware and software and a huge cluster, which typically didn't make sense to run in-house even if you had access to that kind of system. Nowadays there are also open source projects and companies working on implementing similar systems, such as CockroachDB, which I think is partly based on Postgres. Their servers are compatible with Postgres, so you can use some of the familiar tooling with their software. It's an interesting project that's been around for a couple of years now. So if you're considering Spanner or other global relational systems, I suggest having a look at the open source options here. Many of these are still in the works and require quite a bit of effort to run and manage, but things are moving ahead, and it's worth having a look at these to make sure that you're not just picking proprietary services as the default solution.
For time series use cases we have a bunch of cool open source projects nowadays. I think the most interesting ones for me are InfluxDB and TimescaleDB. InfluxDB is a standalone NoSQL database system that allows you to efficiently store time series data in a pretty compact form and also run efficient queries on it. It's been used a lot in things like system metrics monitoring, looking at, for example, your CPU graphs and all kinds of traditional operating system monitoring, but nowadays it's also finding more use cases with other kinds of data. Timescale is an extension to Postgres, again going back to what Christopher talked about, the extensibility of Postgres and how you can add new data types and new data models to it. So Timescale is an extension that allows you to benefit from the replication, HA and tooling around Postgres while running compact time series data storage inside the database. These are also available as managed services from a couple of vendors, and I think they're finding more and more use in different environments. On the proprietary cloud services side, I'm not quite sure there's anything that would exactly match Timescale or InfluxDB. There's of course Amazon's Redshift and DynamoDB, which can be used in similar settings, as well as Bigtable on Google, but it's worth taking a look at the time series databases on the open source side before you select a solution there.
Now, again as was mentioned in the previous talk, Kafka is a robust and popular system for message streaming. So we're no longer talking about just databases, but about how you get your data to the database in the first place. Kafka originated at LinkedIn, where they used it to ingest a lot of events and data from different sources, processing I think billions of events daily with this system. Nowadays Kafka is an open source project governed by the Apache foundation, so it's a real community project. Kafka is used in a lot of different analytics scenarios where you get telemetry or event data from multiple sources and have to collect it in a single place before processing it with different types of tools.
Kafka differs from message-queue types of systems in that it doesn't really queue messages for delivery. Instead it's a shared log: producers are appending to the log, and you have consumers reading from the log, and you can have multiple instances and multiple types of consumers consuming from the same log. That allows you to create really efficient systems where you don't have to know the consumers of your data at the point where you start producing new data. Kafka is a pretty good alternative to some of the proprietary services, such as Kinesis or Pub/Sub, that are commonly used in some of today's systems. Kafka also has a bunch of advantages over Kinesis and Pub/Sub if you have requirements around the ordering of your data, making sure that events come out in the same order as
you put them into the stream in the first place. Kafka is also a pretty fast-moving system right now, with more and more tooling being introduced every week or month. So now we have things like KSQL, which allows you to do SQL access directly to your Kafka stream, and there's also a bunch of tools for connecting Kafka directly to Postgres or Elasticsearch or Cassandra and other clusters, pretty similar to what was mentioned in the previous talk.
As mentioned, I want to give you an overview of how different kinds of data pipelines are being developed nowadays with these open source tools. This is a graph that maps pretty directly to some of the use cases we've seen with the companies we work with. There are a number of different devices out there producing telemetry and different kinds of events for processing. All of them are sent directly to a Kafka cluster, which can be thousands of brokers, so it's a pretty scalable system that allows you to handle billions of events daily. I think the largest ones we work with are handling something like 7 billion events every day. With Kafka you can create consumers written in any language you like that just consume from Kafka, for example reading metrics and pushing them to an InfluxDB instance for visualization of the data. You can also use systems like Apache Beam or Flink or similar to do some transformation on the data before you push it to, for example, a Cassandra cluster for long-term storage and analytics use cases. Cassandra could just as easily be replaced by Postgres in some of the use cases, but this is a pretty typical scenario we see when you have billions of events coming in. You can also, as mentioned, use Kafka Connect, which is an interface inside Kafka that allows you to directly hook up Kafka to push events to external systems such as Elasticsearch, so you don't even have to run an application server or anything like that that would read from Kafka to make
sure that the latest record for a key is automatically updated in Elasticsearch or other systems. This is something that allows you to process billions of events using just open source systems.
I think I have a couple of minutes available for questions before the next talk. Any questions? Thank you.
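The shared-log model described in the talk, where producers append to a log and each kind of consumer reads it independently, can be illustrated with a minimal Python sketch. This is a toy in-memory model and not Kafka's actual API: the `SharedLog` class, its `append` and `poll` methods, the group names and the sample device readings are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class SharedLog:
    """Toy append-only log with one read offset per consumer group.

    Hypothetical illustration of the shared-log idea; not Kafka's API.
    """

    def __init__(self):
        self._records = []                # records in append order
        self._offsets = defaultdict(int)  # next offset to read, per group

    def append(self, key, value):
        """Producer side: add a record at the end and return its offset."""
        self._records.append((key, value))
        return len(self._records) - 1

    def poll(self, group, max_records=100):
        """Consumer side: return unread records for this group, in order."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = SharedLog()
for key, value in [("dev-1", 20.5), ("dev-2", 18.0), ("dev-1", 21.0)]:
    log.append(key, value)

# One consumer group sees every record in the order it was appended,
# for example to forward metrics to a time series store.
metrics_view = log.poll("metrics")

# A second, independent group reads the same log and keeps only the
# latest value per key, similar in spirit to a keyed update in a search index.
latest_by_key = {}
for key, value in log.poll("search-index"):
    latest_by_key[key] = value
```

Because each group tracks its own offset, the producer never needs to know which consumers exist, and adding a new consumer type does not change what the other groups read.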