Welcome to Big Things Conference 2021. Isn't it great to be back? And this year with a fantastic hybrid version. So really looking forward to today. Welcome to Track 2. We've got some really exciting content lined up for you later on today. Kai Verna will be here making some predictions, which should be fun. We're going to be talking quantum computing with Ismael Faro. Álvaro Barbero will be trying to make sense of what lawyers say, which is always something of a mystery. And then towards the end of the day, Juan Jormilla from Jeff will be here to tell us how they are democratizing entrepreneurship using AI.

But I know you are keen to meet our first guest. Before we do that, I'd just like to remind you of two things. First of all, don't forget our hashtag on Twitter, which is BigTH21; get tweeting. And second of all, this is really important: please make the most of this opportunity to get your questions to our speakers. But don't leave it to the last minute. Get your questions in early, otherwise we possibly won't have time.

Now, most of us, I think even the non-techies amongst us, are getting used to having lots and lots of data at our fingertips. I open the fitness app on my smartphone, for example, and I can check all manner of information: time series data about my activities during the day, my heart rate at specific moments during exercise, even details about what type of stroke I was using when I was swimming in the pool this morning. It's just incredible how much data we have at the individual level. But then we start multiplying this data across a large user base, and that's when performance issues can occur. And that's why AWS have created Timestream. Our first guest today is here to talk about these issues and tell us a little bit more about how Timestream resolves them. Our guest is Javier Ramirez. Javier, great pleasure to have you. Thank you. Thank you very much, Nicholas. Javier, welcome. It's a great pleasure to have you here. I hope you had your coffee this morning. Yeah, I had actually two, so this will be quick. Okay, but not too fast then. I'll try my best, I cannot promise. You're all ready with your presentation? Yeah, sure. Great, so let's line it up for you.

Hello, and thank you very much for choosing Track 2, which as we all know is much better than Track 1. My name is Javier Ramirez. I'm a developer advocate at AWS and I specialize in everything data and analytics. As Nicholas has said, we are going to have some time, not much, for questions, but if you have any question or any suggestion at all, feel free to contact me on LinkedIn or Twitter. You have the details on the screen: ramirez on LinkedIn, supercoco9 on Twitter.

But I'm here to speak about time series. And why do I want to be speaking today about time series and time series databases? And even before that, what is time series? Time series data is basically any kind of data sequence with a metric that you want to monitor, track, analyze, and possibly delete over time. So pretty much anything with a timestamp is time series data. Of course, you've heard the term before, but very likely, if you are thinking of business use cases, you have always associated time series with things like meteorological or environmental scientific measurements, or with financial information, where you have things like, I don't know, stock trading or cryptocurrency or risk analytics, things like that.
But the truth is, in the past few years we've seen more and more use cases for time series. As Nicholas was saying at the beginning, if you are into health and you have a wristband, or into sports, you can know exactly what you are doing. I bet if you are a runner, you have one of these applications that tells you where you were at any specific time, if you are faster than your past self, if you are faster than your friends, all kinds of things.

Of course, there's industrial IoT, where you have a lot of sensors. You have a factory, you want to make sure your machines are working fine, so you want to monitor humidity, vibration, noise, speed, whatnot. And that's been a very important use case historically for time series in the power and utility industries. But other than that, there are many use cases today. For example, smart cities: if you want to have a city in which you coordinate all means of transport and traffic lights and garbage collection and lighting on the streets, you need to have a huge amount of data flowing, and you need to be able to forecast with that data and plan accordingly. Or if we are thinking about DevOps, you have a large fleet of servers and applications to monitor; you want to make sure the health of the applications and servers is okay, but you also want to do capacity planning. That's another place you have probably heard the time series term before. And if we are thinking about new ways of working, like, I don't know, the gig economy and all those things, if you want to have a variable workforce or if you want to plan any kind of demand, you also need to be storing and analyzing time series data. Telemetry for any smart vehicle, from Formula One to your family car to any plane: more and more, we are tracking what's happening over time to make sure everything is working fine, to make recommendations, to plan when you have to do the next service at the workshop, whatever you need.

But it's not only for things around hardware, because so far I've been speaking about use cases that are tied to sensors. If you have any kind of shop or e-commerce, you want to know how much and when and what you are selling. You also want to know how you can plan your stock and inventory, so you have the right amount of things exactly when you need them, not before and not after. And it's not only about e-commerce: in any kind of application, you want to analyze the clickstream to know what your users are doing, if they are happy, where they are spending more time, what could be better. And if we are thinking about this kind of analytics, of course, an important use case is in-game analytics. You want to know exactly where players are in the game, when they are stopping; if you have any in-game purchases, you want to know exactly how some actions are affecting those purchases, and for that, having data with a timestamp, with the sequence of events that led to that conversion, is super important. A special case, of course, is media streaming: you want to see how people are engaging with your content, and so on and so forth.

So as you can see, time series data is not only for banking, it's not only for science, it's not only for sensors and manufacturing. It covers a huge number of use cases, and it's present in pretty much any company. And all the use cases have some things in common. For example, in most cases you need very, very, very fast ingestion. You're going to have data coming from many different places all at the same time.
That's going to mean, in many cases, that you're going to be getting duplicates, and you're going to be getting out-of-order data. If you are solving this with some kind of append-only storage, that might be a problem. The other thing is that this data is not going to be constant; the flow is going to be all over the place. At some point you're going to have a burst of data. Then maybe at night, or outside office hours, things are going to be much quieter. And then there's rush hour, and everything is up again. How do you plan for that? It's really not easy to plan, at scale, a system that can adapt constantly to the flow of data you are getting.

Since much of this data is going to come from physical devices, you're going to experience every kind of connectivity or hardware issue. It might be the case that for a while a device goes silent, and you're going to have a gap in the data. How are you going to represent that? Are you going to interpolate data? What are you doing with the missing data? And what if one of the sensors regains connectivity and starts sending you data which is late, which already happened? How are you going to deal with that? Again, it's not easy to reason about those things. And when you are working with time series data, the recent data, the hot data, is very important, but you also want to keep the historical data, at least for a while. And it would be nice if you could work with both the cold and the hot data in the same query.

In summary, you need to have a system that allows you to work with time series semantics. You cannot just use the same databases you've been using in the past for this kind of workload, in which you need to do sophisticated period-over-period comparisons, trends, patterns, and so on and so forth. As the Amazon CTO says, modern applications need to collect, store, and work with huge amounts of data at a scale that cannot really be supported efficiently by traditional relational databases. And don't get me wrong, I really love relational databases. They are great. They are just not designed for the specific use case of time series data.

So we are lucky, because at AWS we have millions of active customers, and many of them have been working with time series for a long time, so I can tell you how they have historically worked with time series data. The first approach is very nice: just use and abuse your traditional relational database. I know they are not designed for time series, but so what? You can make a relational database do anything you want. It's not ideal; they are going to be missing some things, but you can use a relational database for time series. And actually, modern databases like Postgres, like Aurora, are pretty good for some time series work. We even have some articles on our blog about how you can use Postgres or Aurora to deal with time series. Many customers historically, when they were dealing with the challenge of ingesting data very fast, moved away from relational databases and went to NoSQL databases. Things like MongoDB or Elasticsearch can ingest huge amounts of data very, very, very quickly, but they are not great for doing powerful queries, for doing powerful analytics. Other customers started using the Hadoop ecosystem, specifically HBase, as the database for storing time series. And as you know, Hadoop is not really easy to work with. Many moving pieces, many different things. So it's not ideal.
So in summary, customers got creative about how to use the tools they knew for this new type of processing, for time series. It was the case even at AWS: a long time ago, and you can see there that the slide is still using the old logo, which we haven't used for the past five years, we even defined a reference architecture on AWS for working with time series data. In this solution, we were suggesting to customers: for ingesting data, use DynamoDB, a fully managed NoSQL database; then, for actually processing the data, use EMR, a managed Hadoop cluster; and then, for doing the analytics, use Redshift, a powerful cloud data warehouse. And this worked. It was great, because customers could actually do time series analytics without having to manage infrastructure, but it was not really ideal. It was not really fast. It was not ideal for the kind of analytics we want to do today.

Of course, it was not only AWS that noticed this trend. If you go to db-engines.com, a wonderful site with a lot of information about different trends and patterns in database usage, you will notice how time series databases are actually getting quite popular. Out of the more than 300 databases they are tracking, 38, the blue slices you can see on the chart, are classified as time series databases. Actually, if you take a look at the trend for the past eight years, you see time series databases are getting the most attention, only behind graph databases. And if you take a look at the past two years, you can see time series databases are actually trending more than any other kind of database. So basically, it's kind of a new thing; until a few years ago we didn't really have specific databases for time series, but it's something that is growing quite fast because, as we saw earlier, it's very interesting for many, many different use cases.

With this, of course, came a lot of new databases, both commercial and open source: InfluxDB, kdb+, MemSQL, OpenTSDB, even Prometheus. So when customers saw there were these new databases that were specifically designed for time series, they started to run them on top of AWS, because in the end, since we offer virtual machines and containers, you can run pretty much anything on top of AWS. And they were quite happy. The first customers we saw adopting these were mostly coming from the financial sector, and they were quite happy: they could work with time series data without having to manage the infrastructure, and they were reporting some interesting savings, and that was okay. We even started defining some reference architectures to help them with this, you know, how to build with time series on AWS. It was something like: okay, maybe you need to have a message queue in front, something like managed Apache Kafka, to deal with the variable flow of data; then your time series database, commercial or open source, I don't care about that; and then maybe you want to integrate with a data lake to do more complex analytics. And this is already looking more sophisticated; this is looking more modern than the previous picture. But if you notice, right there at point three on the chart, you can see the time series DB, which is a black box. It looks blue here, but it's actually a black box. It's something that is not managed by AWS. It's something you need to maintain.
So if your time series are small, if you are not getting a lot of data or it's not coming too fast, maybe you can get away with a single server. But the moment you want to scale, the moment you want reliability and speed, the moment you want to distribute your system, it's not going to be easy to maintain those databases. And that's why we started thinking about doing something about that. We started thinking how we could help customers that want to deal with time series, with databases that were actually designed for the cloud, that could take advantage of all the things you take for granted when you are talking about cloud: security, scalability, reliability, flexibility, paying only for what you use, those kinds of things. And we did this because, as we saw, building with time series is not easy. You can use a relational database; it's not going to adapt well. You can use a time series database; it's not really designed for the cloud, so it's not really going to be flexible enough, and it's not going to be easy to work with.

There is a specific kind of time series database, an open source project called Prometheus, and Prometheus is a very interesting database. Apart from giving you time series storage and analytics, it gives you monitoring. So Prometheus is both a monitoring tool and a time series database. We saw a lot of customers, specifically customers running containers, that were using Prometheus to monitor their workloads. So something we did was start offering managed Prometheus. You get exactly the same Prometheus that you have in open source, just managed by AWS, so you don't have to worry much about scaling, and you don't have to worry much about integrating with the rest of the services. So if your use case is about DevOps time series, if you are already using Prometheus, just so you know, we offer managed Prometheus. But I'm not here today to speak about Prometheus in itself. I want to speak about a more generic type of time series analytics, not only for DevOps.

And of course, if we are talking about time series, it's not only about analyzing the data; it's very important to visualize the data, to have some way of seeing those trends, configuring alerts, seeing what's the status of the system. And there is an open source tool called Grafana. You're probably familiar with it. With Grafana, you can build dashboards like this one. What I have on screen is a dashboard which is getting time series data from two different tables. One is just raw data from sensors, reporting temperature. And the other table is doing some aggregations to detect some complex events and put some interesting points on the screen. Of course, with this kind of dashboard, when you have it live, you can see it working, getting a lot of events per second, and you can change the range, the scope of what you are searching.

But I want to show you how you configure one of these individual charts, one of these visualizations. If you see here, to the right, you have the usual Grafana configuration: colors, legend, and so on. But at the bottom is the important thing. Let me zoom in. Here you have a query, a SQL query, in which we are configuring what is going to appear on that chart. And I'm doing a very simple select. I'm doing something like: give me the sensor ID, then aggregate time in intervals of 10 seconds, and for each of those 10-second intervals, and for that particular sensor, give me the maximum temperature. That's about it.
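In Grafana you type the SQL straight into the panel editor. As a minimal sketch, and not the exact query from the slide, here is roughly what such a panel query might look like, wrapped in Python with boto3 so it can also run outside Grafana; the database, table, and dimension names (sensors_db, raw_data, sensor_id) are hypothetical:

import boto3

# Timestream uses a separate query client/endpoint from the write client.
query_client = boto3.client("timestream-query", region_name="eu-west-1")

QUERY = """
SELECT sensor_id,
       bin(time, 10s) AS binned_time,
       max(measure_value::double) AS max_temperature
FROM "sensors_db"."raw_data"
WHERE measure_name = 'temperature'
  AND time > ago(15m)
GROUP BY sensor_id, bin(time, 10s)
ORDER BY binned_time ASC
"""

response = query_client.query(QueryString=QUERY)
for row in response["Rows"]:
    # Each row is a list of typed values; scalar columns carry ScalarValue.
    print([datum.get("ScalarValue") for datum in row["Data"]])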
You can see a couple of fancy things in there: we have an ago() function and a bin() function for working with time. But other than that, this looks like regular SQL. If you are familiar with any time series database, you probably know they don't use SQL; they use their own query language that you need to learn if you want to go with them. So which database is this Grafana dashboard speaking to, that accepts SQL and still works as a time series database? That's what I want to cover in the second part of this talk.

Now that I've already told you what time series is, why it is interesting, what the challenge of dealing with time series is today, and which solutions you could use, I'm going to tell you about the new kid on the block: Amazon Timestream, a purpose-built time series database which is designed for the cloud. That means it's fully serverless: you don't have to worry about any servers. You pay for three things, basically. How much data you are ingesting: the more data you are sending in, the more you pay. How much data you are storing: if you are storing data for a year, you pay more than if you are storing data only for two weeks. And how much data you are querying: if you are running a thousand queries per minute, you pay more than if you are running a couple of queries every hour. But other than that, you don't have to pay for any servers. If you are not sending or storing or querying data, you pay zero. There is no minimum, there is no maximum; it scales as much as you want. Of course, this runs on the cloud, so security you cannot entirely take for granted, because security is always a shared responsibility, but it's integrated with the same security mechanisms you have on AWS for the other services. You can encrypt the data, and you can use your own customer-managed keys if you want for encrypting your data. And it's specifically designed for time series. So apart from the standard SQL that you can use, and I'll talk more about SQL later, we have extensions for working specifically with time series and for getting time series analytics. Sounds good? I hope so.

So what does it look like behind the scenes? How are we building this database? We announced Timestream about two years ago. For the first year, we were in private preview; we were testing with a lot of reference customers to make sure we got it right, to make sure the use case was what most customers were demanding. And then last year, we actually opened it to the public. And guess what? In all this time, customers that were using Timestream didn't have to change anything. We've been adding new capabilities, we've been improving the service, but you don't really have to worry about that. Since it's a serverless product, you don't have to worry about updates, about upgrades, about patching, about anything like that.

The way you work with data in Timestream is, first, you define some tables, as you would in any other database. And those tables can live both in memory and on magnetic storage. Actually, we're working now on a third layer, which is SSD, so you can choose to have the data in memory, on SSD, or on magnetic storage. Every time you ingest data, it gets ingested into memory, and you choose for how long it's going to be there. You can set a minimum of one hour and a maximum of one year. After that period, the data is going to be only on the magnetic storage or in the SSD layer. And on magnetic storage, you can choose to store for as little as one day or for up to 200 years. After that maximum time you have defined for your table, the data gets deleted forever. That's basically it.
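As a minimal sketch, assuming hypothetical database and table names, this is roughly how those retention tiers might be configured with boto3; the two retention values are just the per-table settings described above:

import boto3

write_client = boto3.client("timestream-write", region_name="eu-west-1")

write_client.create_database(DatabaseName="factory")
write_client.create_table(
    DatabaseName="factory",
    TableName="sensor_data",
    RetentionProperties={
        # Keep recent data in the fast in-memory store for 24 hours...
        "MemoryStoreRetentionPeriodInHours": 24,
        # ...then on cheaper magnetic storage for a year before deletion.
        "MagneticStoreRetentionPeriodInDays": 365,
    },
)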
So you have full power to choose how long your data stays in memory, which is more expensive to store but faster to query, and how long your data stays on magnetic storage, which is way cheaper but also slower. And you can always query both of them seamlessly. You don't have to do anything special; you can include hot data and cold data in the same query, and it works. It's just that the response time might be different, but that's about it. You don't have to worry about anything else.

So this is what we built. We have a multi-layer architecture. At the top, we have the ingestion layer. The ingestion layer scales up and down with the data you are sending. You don't have to worry about that, you don't have to pre-warm, you don't have to do anything. You can go from zero to millions of events per second without any warming, and it will ingest all the data. And when we ingest the data, we replicate it across multiple availability zones, that means different data centers in different locations, and onto different hard drives. So the data is always within the region, but it's automatically replicated to multiple data centers in that region, so you can be sure you never lose any data due to a hardware malfunction. Once the data is there, it's stored in memory, and it also goes to the magnetic store; after a while it will be deleted from memory, and you define that time. And then we have the query layer, which is at the bottom. The query layer is independent from the rest; it grows as much as you run queries, and you don't have to do anything about that. So you can forget about managing servers, adding more capacity, removing capacity when you don't use it. This is elastic, each layer is totally independent from the others, and you don't have to worry about any of this.

If you actually look behind the scenes, it gets a bit more complex, because this is not one big installation with one big cluster. What we actually have is what we call a cellular architecture. Basically, we have multiple copies of this architecture, and when your application is sending data, we are automatically redirecting it to one or to the other to minimize the impact of any failure that could happen in the system. But again, that's not your problem; that's just how we manage things behind the scenes, because for you it's as easy as sending data and retrieving data.

If we were talking about a conventional database, or even about a wide-column database like Apache Cassandra, you would probably be thinking that when you store time series, you store the different metrics on a single row. Imagine you have a machine, and that machine has different metrics. In this example, I have a server, and the server has three metrics: the CPU utilization, the memory utilization, and the number of bytes it's sending over the network. On a conventional database, on a conventional architecture, you would store that as a single row. In Timestream, we do this a bit differently, to have more flexibility. Each metric is stored as a different record. So in one record, we have the metric itself: here you can see one for the CPU utilization, one for the memory, and one for the network. And then we have the dimensions. The dimensions are the things that are common to the whole row.
In this case, since all the metrics are for the same server, the things that describe the server are the dimensions: the region, the availability zone, the VPC, and the hostname. Those things are dimensions; the other things are measures. You can have over 100 dimensions, and you can have up to 8,000 different measure names, although that limit is per table rather than per record. So it's not too bad.

And when we have these individual records, we can combine them to create what we call a time series. You can see that better in this example. Here I have nine different rows, but even if there are nine different rows, you can see only three different timestamps: second zero, second one, and second two. Each timestamp appears three times, once for each metric: one for the CPU, one for the memory, one for the network. So basically, when I have these individual records and I want to convert them into a time series, what I do is tell Timestream: I want to aggregate by one or two or three, or however many dimensions I want. For example, if I aggregate these metrics by region and AZ and VPC and measure name... sorry about that, I didn't tell you about that, I was just testing my superpowers on you. So if I aggregate by region and AZ and VPC and hostname, what I get, basically, is the time series. That's how it works when you are working with Amazon Timestream.

When you are ingesting data, there are many different ways in which you can do it. Of course, you can use the AWS SDK for your favorite language, but we also have adapters for working with IoT, for working with third-party tools like Telegraf or Apache Flink, and for working with many different solutions. This is just a simple Python example in which I'm composing a JSON record and writing records individually. If I were to run this example, imagine I have 100 different metrics, so I'm writing 100 individual times, and imagine I get a new value every five seconds. Eventually, I'm going to have almost 52 million writes a month. I told you before, you pay for how much data you are ingesting; in particular, you pay $0.50 per million writes of up to one kilobyte each. So that means you would be paying about $26 for ingesting 52 million writes a month, which is not too bad, but you can do better. If instead of doing individual writes I batch the writes and put multiple metrics for the same dimensions in one go, I can actually write much less. Rather than writing 100 times, I can go down to only 11 writes, so my cost is way cheaper; it's not even $3. But I can do even better. If I group the writes by the common attributes of the records, staying within the maximum kilobyte per write, I can actually pack all these metrics into only three writes, which means that for the same workload I'm going to be paying less than $1.

The exact numbers are not really important here. It's just for you to know that in the cloud it's always interesting to understand how you are going to be billed, because for the same workload, if you just go with the naive approach of writing individually, you are paying $26; if you are batching, you are paying $2.85; and if you are batching and grouping by the common attributes of the records, you can go just below $1. So it's always interesting to see how you can save money.
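As a minimal sketch of that batching idea, and not the exact code from the slide, here is roughly what batched ingestion with shared dimensions might look like in Python with boto3; the database, table, dimension, and measure names are made up for illustration:

import time

import boto3

write_client = boto3.client("timestream-write", region_name="eu-west-1")

# The three metrics share the same dimensions and timestamp, so those go
# once into CommonAttributes instead of being repeated on every record.
write_client.write_records(
    DatabaseName="devops",
    TableName="host_metrics",
    CommonAttributes={
        "Dimensions": [
            {"Name": "region", "Value": "eu-west-1"},
            {"Name": "az", "Value": "eu-west-1a"},
            {"Name": "hostname", "Value": "host-24"},
        ],
        "Time": str(int(time.time() * 1000)),
        "TimeUnit": "MILLISECONDS",
    },
    Records=[
        {"MeasureName": "cpu_utilization", "MeasureValue": "13.5", "MeasureValueType": "DOUBLE"},
        {"MeasureName": "memory_utilization", "MeasureValue": "40.2", "MeasureValueType": "DOUBLE"},
        {"MeasureName": "network_bytes_out", "MeasureValue": "1520", "MeasureValueType": "BIGINT"},
    ],
)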
But before I finish, I know I have about five minutes, so that will be plenty of time. I want to speak about a couple of things. First, I already told you that you can choose between memory and magnetic storage depending on how much recent data you want to be able to query faster. But I want to speak about how you actually query the data. It's just SQL with extensions. If you know SQL, you only have to learn a few functions to start using Timestream. That's it. And actually, we have a JDBC driver if you want to play with that. And you pay for how much data you are scanning in your queries. If you go to the web console, you can actually run interactive queries directly from there, but of course you can also use your programming language of choice to run those queries automatically.

So when it comes to SQL, the SQL we accept is the standard, no more and no less: select, from, where, group by, union, blah, blah, blah, whatever you want. But we are adding specific things for time series. The SQL you see here is basically something you could run anywhere else, but we have things like the percentiles, which are interesting for this kind of metric. You have things like bin, to bin time in intervals of 15 seconds, so if you have metrics at different time points you can normalize them to 15-second intervals. You have functions like ago, for example. Small things that are interesting.

But the most important bit is that you can actually convert individual records into time series. This is what it looks like. If I have this table here, which is the same one I was showing you before, with multiple dimensions and timestamps and some metrics, I can do this. What I'm doing in this query is saying: for the dimensions of region, AZ, VPC, and instance, I want to have a time series of the CPU utilization metric. And this is going to fold the table, so it's going to look like this: for those particular dimensions, I get the time series data. And now that I have all that data over time in a single structure, I can apply time series functions like interpolate, to fill the missing gaps with fixed data or with average data or with whatever I want; derivatives, if I want to know the rate of change of a metric over time; integrals; correlations; all kinds of interesting things you want to do with time series. I'll show a small sketch of one of these queries just before the customer examples at the end.

And of course, Timestream doesn't live in a vacuum. We have integrations with Apache Flink, if you want to do some preprocessing before writing into Timestream; with Telegraf, if you want to collect operational metrics from servers; with SageMaker, for machine learning on AWS; and the JDBC driver integrates with different services. But there are two I want to show you very quickly before I finish. One is QuickSight, a business dashboard. I showed you before a monitoring dashboard with Grafana; we also have a business dashboarding service on AWS called Amazon QuickSight. So if you want to present business metrics on a pretty dashboard, you can integrate Timestream directly with QuickSight. And okay, this is what it looks like; these charts feel a bit less geeky and a bit more business friendly. Other than that, if at any point you want to automate the data workflow, you probably want to do some kind of orchestration, and what's better for orchestrating than Airflow? You can integrate Apache Airflow with Timestream, either the open source version or the version managed by AWS.

So, as I told you at the beginning, there are many different use cases for time series.
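Here is the small sketch I promised of one of those time series queries: a hedged example, again with hypothetical database, table, and dimension names, that turns individual CPU records into a time series with CREATE_TIME_SERIES and then fills the gaps by linear interpolation, following the pattern the Timestream documentation uses:

import boto3

query_client = boto3.client("timestream-query", region_name="eu-west-1")

QUERY = """
SELECT region, az, vpc, hostname,
       INTERPOLATE_LINEAR(
           CREATE_TIME_SERIES(time, measure_value::double),
           SEQUENCE(min(time), max(time), 15s)
       ) AS interpolated_cpu
FROM "devops"."host_metrics"
WHERE measure_name = 'cpu_utilization'
  AND time > ago(1h)
GROUP BY region, az, vpc, hostname
"""

# Each result row carries the dimensions plus a time series value:
# an array of (timestamp, value) pairs at regular 15-second intervals.
response = query_client.query(QueryString=QUERY)
print(response["Rows"])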
I want to highlight just a couple of customers. One is Amazon.com itself, the Amazon group. They are using Timestream to help improve the production of renewable energy for supporting the whole business. The other is a German company, SeatGeek. They sell tickets for events, and they wanted to analyze the average time users were waiting in the queue before actually getting to buy their tickets, all those things. And for that, they are also using Timestream. So as you can see, there are different use cases in which you can be using Timestream. The last thing I have is just to tell you that, other than doing analytics, there are many use cases for time series when it comes to doing machine learning. And if you are using AWS, with or without Timestream, you have multiple solutions to do forecasting with machine learning directly from your data. That's all I have. Thank you very much for having me here. If you want to learn a bit more about Timestream, you have some resources on the screen.

Javier, thank you so much. That was a fascinating talk. I think the two espressos worked wonders. We don't have much time, unfortunately, but we do have a couple of questions. One technical, one not so technical. Let's start with the technical, shall we? Yeah. So Miguel Angel Monjas asks: is there an ingestion plugin for MSK, for Kafka? So, hi, Miguel Angel, thank you very much. I'm not aware that we have a direct plugin for Kafka, but what you can do, if you have Kafka and you already have an application written in Java with Kafka, is just use the SDK to send the data. Another thing you can do is use Kafka Connect. So we have MSK, and we now have a managed Connect service for Kafka on AWS, so you could write a connector that just uses the SDK to write the data coming from Kafka. So there is no direct connector, but there is the SDK that you can use for writing your data. Thank you for the question.

Okay, Javier, finally, a not so technical question. Okay. But coming back to what we were talking about at the beginning. Yeah. We're becoming more and more used to time series data; is this a trend that you think is going to continue, or are we perhaps going to get tired of this obsession with time series data? I mean, I'm talking as a typical user. And perhaps you could answer: how is this changing us as human beings? Are we becoming our own little data scientists? I mean, it's a very philosophical question. Of course, I don't have the answer here, but I can tell you that with technology we are able to track things now that we couldn't in the past, so we can predict some things, we can prepare for things. For example, in the past, you would go to the workshop only when the car was already broken. And that was too bad, because then it's like, oh, I don't know how long I'm going to be without the car. But with time series, you can get there before the trouble gets bigger, so maybe it's a quick fix. So I hope that trend of using time series for good is here to stay. We now have the technology to apply that to more and more scenarios, so I see the trend actually getting bigger. I don't see it as something that is conditioning us as humans; it's more like a kind of superpower. Now you can use those things to be more efficient. Where we put the limit, that's not for me; that's for a philosopher. Maybe we'll leave that for next year then. Maybe, yes. Javier, thank you so much for your time.