And here we go. Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager of Dataversity. We would like to thank you for joining the latest in the Monthly Webinar Series, Advanced Analytics with William McKnight. Today, William will be discussing embedded data science trends and databases at the edge. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag #ADVAnalytics. And if you'd like to chat with us or with each other, we certainly encourage you to do so. Just click the chat icon in the bottom middle of your screen for that feature. And if you'd like to continue the conversation after the webinar, you can follow William and each other at community.dataversity.net. And as always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now, let me introduce to you our series speaker for today, William McKnight. William is the president of McKnight Consulting Group. He takes corporate information and turns it into a bottom-line-producing asset. He's worked at major companies worldwide, 15 of the Global 2000, and many others. McKnight Consulting Group focuses on delivering business value and solving business problems, utilizing proven and streamlined approaches in information management. His teams have won several best practice competitions for their implementations, and he has been helping companies adopt big data solutions. And with that, I will give the floor to William to get today's webinar started. Hello and welcome. Hello. Hello, Shannon and everyone. Thank you, Shannon, for that introduction.
And welcome, everybody, to Embedded Data Science. That is the topic for this month's Advanced Analytics Series. Before we begin, I want to share my excitement that Advanced Analytics will be continuing in 2020. And it will be at the same time every month as right now. So keep on coming back for those of you that just tuned into a webinar and wondering what I'm talking about. This is actually a monthly series. And so far, it's been me giving presentations, and I've certainly enjoyed that. We might shake the format up a bit, though, and occasionally bring in some outside experts to provide some more information to you on the subjects of advanced analytics. Speaking of that, in the next couple of weeks, I owe Shannon my list of topics for 2020. So if you have ideas, things in my wheelhouse, of course, feel free to send them my way. And maybe you are the expert that wants to be on the show here with me. Let me know that as well. Also, I want to let you know that Dataversity is sponsoring enterprise analytics. Yes, that's right. Online, October 23rd. So come on out to that from your desk, from your home, wherever you may be. I will be speaking on data platforms. You can find more at Dataversity's website on that. So with that, let's launch into the topic of embedded data science. I've been talking about databases, big databases, small databases, databases outside the organization, the many different forms of databases inside the organization all in this series. And I got to round that out by talking about a very important emerging trend. And that is databases at the edge, databases on small devices, so tiny databases. And you might think, well, William, that's a software industry topic. Yes, it is. But many enterprises today are building applications. Many are building mobile applications and many are embracing this idea of embedded data science, data at the edge. 
And many enterprises today are claiming to be software companies no matter what they do, you know, insurance or finance or healthcare, what have you, and seeking those valuations too along the way. So definitely I've been engaged on this topic with many of my enterprise clients. So if you're just curious about how some software products are architected or if you're actively working on IoT or other forms of projects that do embedded data science, hopefully I've got some information for you here today. Here are some example applications of data science at the edge. How about a mobile airline application? That's enterprise, right? Today you can get full features online. I know I enjoy this quite a bit, checking in, retrieving boarding passes, checking flight status and so on. We all use these, right? Okay, that's an example. Socially connected mobile game. So all the latest leaderboard stuff, positions in the game for all the gamers out there. Something back to more enterprise oriented is aggregated IoT sensor data. So embedded databases now are supporting these types of applications. They're built into the software, transparent to the application's end user, and require little or no ongoing maintenance. Embedded databases are growing with the rise of mobile applications and IoT, giving innumerable devices robust capabilities via their own local database management system. So developers can create sophisticated applications right on the remote device and they don't have to use a file system or some other means of trying to manage data on the device. And this architecture is commonly preferred these days over client-server approaches, which rely on database servers accessed by client applications through interfaces. So a theme throughout this presentation is going to be what do you do on the edge versus what do you do in some other database that maybe this rolls up into. No easy answer on that, but I hope to give you some of the boundaries of what is possible.
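To make the idea of a local database on the device concrete, here is a minimal Python sketch. It assumes SQLite (through Python's standard sqlite3 module) as a stand-in for an embedded database; the talk does not prescribe a product for this, and the table, function names, and sample values below are hypothetical, loosely modeled on the airline app example. The database runs inside the application process, so there is no server to install or administer.

```python
import sqlite3

def open_local_store(path=":memory:"):
    """Open an in-process database; pass a file path to persist on the device."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS boarding_pass ("
        "confirmation TEXT PRIMARY KEY, flight TEXT, seat TEXT)"
    )
    return conn

def save_pass(conn, confirmation, flight, seat):
    # "with conn" wraps the write in a transaction and commits on success
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO boarding_pass VALUES (?, ?, ?)",
            (confirmation, flight, seat),
        )

def find_pass(conn, confirmation):
    # Returns (flight, seat), or None if the pass is not cached locally
    return conn.execute(
        "SELECT flight, seat FROM boarding_pass WHERE confirmation = ?",
        (confirmation,),
    ).fetchone()

conn = open_local_store()
save_pass(conn, "ABC123", "XX100", "12A")
print(find_pass(conn, "ABC123"))
```

The app gets transactions and keyed lookups without hand-rolling them on top of a file, and the lookup still works when the device is offline.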
Data is the weak link in embedded development. It's the weak link in a lot of things right data science in general but it's also true when it comes to embedded development. Now here are some facts about the embedded world. There's over a million iOS and Android apps. Just think about that for a minute over a million. Wow they must be pretty easy to do yes and they are and they come in all sorts of levels of quality of course and depth but there's over a million of them. There's eight million developers around the world and it's growing. Three widely adopted platforms Android, iOS and Linux. When I sit back and think about that I think it's pretty amazing that there's only three. But there's limited cross-platform support. Basically there's a lot of development that has to go on for each one of these with a single app. There are security mandates and requirements in the backdrop of all the development today. All of these apps require data though. They all require data and they can't all access it at some centralized location. So there are a couple types of embedded databases traditionally, maybe going back five years plus. The database was installed with the software. There was a silent install. Users really never knew about it. Fine tuned for small footprint and a range of devices. And you had multiple databases that could talk to that database should the need arise. And we all have experienced this type of embedded database of course in our walks of life. But what I'm going to be focusing on here is the deeply embedded embedded database. Application runs the database inside it by a library. This has become more popular with the move to mobile. And the client driver can talk to a traditional embedded database. By the way, I'm not talking about the sort of light data centers, the mobile data centers that may be on a ship or in a truck outside, something like that. Those are also kind of called embedded databases. 
But that is not at all what I'm talking about here today, although they may have a play in the architecture. Why an embedded database? Okay, well, it's low cost to do so. It is simple. There is no DBA, administration is free. All the reasons really why we go with a database for everything else these days. It's embedded in the installer. It speeds up development. It has disaster recovery and various other things that we need like encryption and security built in. And databases tend to work across multiple platforms. So this is all great. The idea is that in the past embedded systems was something of a black art that required in-depth knowledge of electronics. Now, maybe this is where your mindset is about embedded databases. But you don't have to know about interrupt processing, internal hardware architecture elements, assemblers, protocols, hardware-assisted debugging, et cetera. Those things will certainly rule me out of ever being able to do this type of development. But fortunately, the database is there these days, and we can get moving at a much more rapid pace. Modern IoT development tools have elevated the development process and made it more accessible. All right, next slide. Don't use flat files for embedded data. It's tempting, I suppose, but the extra step of engaging a database vendor in this process is well worth it. With flat files, you have a lack of data portability. There's no single API portability across programming languages. All the usual problems that you have with flat files: data integrity, much of the support has to be built. Security has to be added. You need to add a lot of things to flat files. So the management, really, and the performance, et cetera, all these things you're going to get mostly better in a database. It's certainly well worth it from everything that I have seen. So what are some of the benefits? With an embedded database, there's a sophisticated SDK. I'll talk about that in a little while.
Customized packages, simplified operations, native integration, data protection built in, a flexible business model usually from the vendors. They understand that these could go out into thousands of devices in your IoT architecture. And finally, performance. So what kind of performance? Everybody focuses on read performance, but we benchmarked a number of these embedded databases. Got our hands dirty. Got our hands on these databases. Put them through the paces that are really required in IoT today. And one of the things that you'll want to focus on is write speed. Write speed, the speed at which data can be written to the database. That's an essential performance metric for IoT data. So here are some of the things that you want to look at. You want to look at some of these measures with and without synchronization to another database in the architecture. In your architecture, there very well may be levels of databases. Usually a couple, but sometimes more. A couple levels of databases above, if you will, in the architecture, in the conceptual architecture, above the edge database for the edge data to be synchronized back into. And that synchronization, that's a big deal by the way, because that often happens after the fact because of connectivity issues with edge clients. It's obviously much more of a problem with an edge device than it is with a central database in a data center. But some of the things you want to look for, let's say without synchronization: you want to look at inserts and queries. You want to query, say, 10,000 documents without an index. Deletes: you want to delete, let's say, 10,000 documents on an index key and delete a bunch on a non-index key. These are some of the things you want to test. You want to do updates as well, on the index key and on the non-index key. And you want to do this first without synchronization, just at the edge, and then with synchronization of data back to a server.
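The no-synchronization half of that test list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses SQLite in memory as a stand-in for whichever embedded database is under evaluation, the schema and the `run_benchmark` name are made up, and it is not the benchmark harness referred to in the talk. The with-synchronization pass would repeat the same steps with the vendor's sync feature enabled, which this sketch does not model.

```python
import sqlite3
import time

def run_benchmark(n=10_000):
    """Time basic edge-side operations on n rows; returns (timings, rows_left)."""
    conn = sqlite3.connect(":memory:")
    # id is the indexed key (primary key); device has no index on purpose
    conn.execute(
        "CREATE TABLE doc (id INTEGER PRIMARY KEY, device TEXT, reading REAL)"
    )
    rows = [(i, f"dev-{i % 100}", i * 0.5) for i in range(n)]
    even_ids = [(i,) for i in range(0, n, 2)]
    timings = {}

    def timed(label, sql, params=None):
        start = time.perf_counter()
        with conn:  # one transaction per test, committed on exit
            if params is None:
                conn.execute(sql).fetchall()
            else:
                conn.executemany(sql, params)
        timings[label] = time.perf_counter() - start

    timed("insert", "INSERT INTO doc VALUES (?, ?, ?)", rows)
    timed("query, no index", "SELECT * FROM doc WHERE device = 'dev-7'")
    timed("update, indexed key",
          "UPDATE doc SET reading = reading + 1 WHERE id = ?", even_ids)
    timed("update, non-indexed key",
          "UPDATE doc SET reading = 0 WHERE device = 'dev-7'")
    timed("delete, indexed key", "DELETE FROM doc WHERE id = ?", even_ids)
    timed("delete, non-indexed key", "DELETE FROM doc WHERE device = 'dev-7'")

    rows_left = conn.execute("SELECT COUNT(*) FROM doc").fetchone()[0]
    conn.close()
    return timings, rows_left

timings, rows_left = run_benchmark()
for label, seconds in timings.items():
    print(f"{label:>24}: {seconds * 1000:.2f} ms")
```

Absolute numbers from a laptop mean little; the point is to run the same script on the target device and compare candidates, with writes getting as much attention as reads.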
However, the solution that you pick needs to be able to automatically synchronize in real time and do this without ETL. And again, you could have hundreds to thousands of devices in your architecture around the embedded database and all that information may need to funnel into a core database on a server, a core database or two. Mobile databases. Now, you can use a lot of different databases in your embedded architecture, but you're going to probably want to use one that's really focused on being a mobile database, okay? And usually these go by the name of Lite or something like that, or like Mongo Mobile, okay? And so you may be wondering, well, is that Mongo? Am I getting everything from the original? No, and it's by design, but here are some of the things, for example, that you do not get in Mongo Mobile that you would get in your ordinary Mongo. Sharding and replication. MongoDB Mobile operates as a standalone instance with a single node present on the mobile device. MongoDB Mobile uses SQLite as a simple key-value store behind the scenes due to its stability and prevalence on devices. As such, MongoDB Mobile does not provide the ability to configure the underlying SQLite deployment or use other storage engines. Database authentication. Yeah, that's going to be more limited. Between MongoDB Mobile and an external component, you have to use a combination of authentication and security rules on the server side. And finally, I'll mention, MongoDB Mobile does not provide encryption at rest, does not support the creation of change streams, does not support server-side JavaScript execution, and does not support transactions. Now, this was all not to ding on MongoDB Mobile. The point of this is to say that, no, you're not getting the same database in the mobile as you are getting in the regular server. And I'm just using Mongo as an example. They're all that way.
And so if you're used to enterprise development, you'll want to know what these limitations are when you step into this world. Now, selecting the embedded database, and I'll tell you about some of the choices here as we go along, but you're getting a separate database that has a focus on embedded. This is for software developers. I should say it's for the software development community, which again, I'll say could very well reside inside a corporate enterprise, a finance organization, a healthcare organization, a retail organization, and so on. Everybody's building mobile apps and many are building IoT apps and so on. And I'll touch on that a little bit more as we go along here, but the embedded database is not really geared to end users. I just shared with you some of the limitations. It's used by almost every industry, not just software, and you'll want to do performance testing on your selection. And I just rattled off a list of some of the performance tests that you're going to want to do. Now, some other requirements you're going to want to have: do they support CRUD operations? Do they support ACID? Are they capable of running on multiple platforms and with multiple languages? Because in this world, a lot of the data is going to be NoSQL-based. It's going to be tagged-up data as opposed to relational data. So you're going to want those capabilities. Can they move data to more centralized databases seamlessly? Yeah, that's a big part of the architecture. What do you do at the edge? What do you do in centralized databases? And by the way, another constant theme of this type of development is, what data do we put at the edge? What data do we put at the edge? My snarky comment about it is, the edge is not the place to store your history data. Okay, well, let's start from there. You just want to put data out there that you can use.
And in many cases, this is data that is highly summarized in some form of centralized database and pushed out as an actionable piece of information. Because you don't want to do a lot of processing at the edge. You don't want to be doing summaries to try to understand, say for example, customer lifetime value or propensity to buy across a category, something like that. That should be pre-calculated. And that should be based on a whole lot more data than what can be contained in an edge database. So there's this constant back and forth between centralized and edge. A key to any successful application, whether it's processing bank transactions or gaming or health monitoring, is processing the data within a specified timeframe. Latency is especially a problem with read- and write-intensive applications. And we have a lot of that with sensors and mobile devices using this type of streaming data. So we can lose some of the advantages that centralization provides, such as scale and maintenance. Additionally, you can end up with hundreds or thousands of devices, and all those have databases, by the way. And there's a greater need to engineer maintenance and fault tolerance into the solution. So while these embedded databases need to act independently at the device level, it is crucial to recognize they are not just silos of data. They are part of a network of devices. At least some of the data is transmitted and integrated into higher-level databases for further action. And I don't believe I read all of these to you, but you can see some of the other embedded database requirements. And one of them is support for multiple platforms. Now, this is a list of platforms I got from the Actian Core database website. These are the platforms they support. It's pretty good. It's pretty current. And Actian Core is a NoSQL, embeddable, zero-DBA, self-tuning database for smartphones and other IoT devices requiring a small footprint.
And they actually give you some guidance. And this goes along with what I was saying about you don't put your history data out on the edge. Their guidance is 2 megabytes minimum. It supports all the platforms that you see here. And a common set of data that you're going to put at the edge is called time series data. Some of us are familiar with time series data. Others are not as familiar with it. Here are some examples. Oil and gas in a remote location. And that indicates connectivity may be an issue, as it often is. Agriculture, black box equipment like airplanes which have a sliding window of telemetry where you are constantly moving that window forward in terms of analyzing what's going on to determine next best action predictive maintenance and the like. Self-driving cars. It's a big obvious one. Trading algorithms, smart homes, transportation networks, and law enforcement. So I think of time series data as a sequence of data points that measures the same thing. It just keeps measuring it over and over again. Time is on the axis and the data workloads are usually append only. You're not getting rid of data because you want to see the trend. And so you're constantly appending data. I shouldn't say you're not getting rid of data. Yes, you do get rid of some of that detail, older data. But it actually has relevance for quite a while, especially if you're into the analytics of your app. So simply put, time series data sets track changes to the overall system as inserts and not updates. Time series data is data that in the aggregate represents how a system, a process, a thing, whatever that thing may be, how it changes over the course of time. So self-driving cars, for example, continuously collect data about how their environment is changing. Autonomous trading algorithms are continuously collecting data on how the markets are changing. We know about smart homes and what they're monitoring. Retail is monitoring how their assets are moving. 
And think about that and how we can get stuff to our home, no matter what it is, practically in a day. And sometimes in a more localized environment, like when it comes to food, within an hour. It's amazing these assets move with such precision and efficiency. And a lot of it's due to the effective manipulation of time series data. All of these self-driving Teslas, autonomous Wall Street trading algorithms and so on that I've talked about have some things in common. They are continuously collecting data. And these time series databases that I speak of have steadily remained one of the fastest growing categories of databases, due in no small part to the fact that they are popular as embedded and mobile databases. So time series databases. Why do people use a separate time series database for this? Why do they specialize? Well, there's a couple reasons: scale and usability. So a single connected car, for example, will collect 4,000 gigabytes of data per day. And that's quite a lot to actually be able to do anything with. And furthermore, time series databases have functions and operations common to time series data analysis. They have functions that are specific to time series data, like data retention policies. I've mentioned maybe getting rid of some older data. It can be actually pretty sophisticated about how it does that. Continuous queries, flexible time aggregations, and so on. So what are some of these time series databases? Well, there's quite a few, and there's no dominant player. InfluxDB time series is pretty popular. kdb+ time series, Prometheus time series, Graphite time series, and RRDtool are some of the ones I'll mention in case your data is dominated by time series data at the edge. Now, I mentioned the need for SQL and beyond a little bit ago. A lot of the data is JSON or it's some other form of optimized binary format. And we need to actually allow non-SQL access to the data at the edge. So that's a feature you're going to want to look at.
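Three of the time series ideas above (append-only writes, a data retention policy, and time-bucketed aggregation) can be sketched in Python. The sketch assumes SQLite as a generic stand-in; it is not how InfluxDB or the other products named above implement these features, and the table, function names, and sample readings are hypothetical.

```python
import sqlite3

def make_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reading ("
        "ts INTEGER NOT NULL, sensor TEXT NOT NULL, value REAL NOT NULL)"
    )
    conn.execute("CREATE INDEX reading_ts ON reading (ts)")
    return conn

def append(conn, ts, sensor, value):
    # Time series workloads are inserts, not updates
    with conn:
        conn.execute("INSERT INTO reading VALUES (?, ?, ?)", (ts, sensor, value))

def apply_retention(conn, now, keep_seconds):
    """Drop detail older than the retention window; returns rows removed."""
    with conn:
        cur = conn.execute(
            "DELETE FROM reading WHERE ts < ?", (now - keep_seconds,)
        )
    return cur.rowcount

def averages_by_bucket(conn, sensor, bucket_seconds):
    """Average value per time bucket, as (bucket_start, avg) pairs."""
    return conn.execute(
        "SELECT (ts / ?) * ? AS bucket, AVG(value) FROM reading "
        "WHERE sensor = ? GROUP BY bucket ORDER BY bucket",
        (bucket_seconds, bucket_seconds, sensor),
    ).fetchall()

conn = make_store()
for ts, value in [(0, 10.0), (30, 20.0), (60, 30.0), (90, 40.0), (120, 50.0)]:
    append(conn, ts, "temp", value)

print(averages_by_bucket(conn, "temp", 60))  # one-minute buckets
print(apply_retention(conn, now=120, keep_seconds=60))
print(averages_by_bucket(conn, "temp", 60))
```

A purpose-built time series database typically runs retention and rollups automatically and on a schedule; here they are explicit calls so the mechanics are visible.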
The data is often quite unstructured and may not fit well into a simple relational database format. I shouldn't say simple; it's actually more structured than a NoSQL database, if you think about it. Now, security. Yes, now I'm going to be the first to say I'm a data guy, not a security guy. But these are some things that I do know about it. You'll want to check it out deeper, of course. But backend database servers are normally run behind sophisticated firewalls and avail themselves of a full network security apparatus. Server security and authentication are overseen by security professionals, regulated by strict rules, and so on. Okay, that's not the database at the edge. The database at the edge is out there in the open, often running unattended, and more often than not has limited resources, without well-established firewall-like security measures. So you're going to want to be careful about the data that's out there and adhere to some of the things you see here. Existing Internet SSL/TLS technologies do a good job. In addition to secure communications and channel authentication, the integrity of data must be ensured by the database management system. You want to be sure that your DBMS is tight around this. And the best database vendors enable some of the things you see there. FIPS 140-2, which is a U.S. government standard that defines cryptographic module security requirements. These databases come in packages, which are a combination of the database and a development SDK that allows for a lot of customization. It's a little bit of a different world than enterprise database deployment. I shouldn't say a little bit. It can be a lot. It's really more developer-centric than it is DBA-centric. As a matter of fact, there's no concept really of the DBA so much in this world. Build a user-defined function that runs server-side, build the database install into the application install. Yes, the packages do all that and more. The operation of the embedded database.
No DBA required. I've mentioned this. It's hidden from the user. Backup, restore, automatic recovery, things like that via CLI. There are listeners on TCP/IP ports that do remote management, like vendor management services. Architecture integration. There is the client-server mode it can work in, but that's suboptimal. And usually we see embedded server mode, where the database is a DLL and is dynamically loaded by the application at execution. The user is none the wiser about it. Or it might be linked as a library and embedded in the application. The business model. The software vendors, and I've mentioned some of them here, license to market in various forms: perpetual, software as a service. They have to be more creative. You have to be more creative, frankly, when you're licensing these databases because it's different. There can be 1,000 footprints of the database. The embedded database model is designed for OEM and I will say it's flexible, but be on your guard. Make sure you get into the right type of model. Here's an example architecture with the Actian Zen product working along with, you know, there's the core database. It has the core database. It has the Zen Edge database. And here you see that, for example, on one mobile device, there may be a couple of apps, right? There may be many more than a couple of apps on a device. But anyway, here we see that there is the Zen Core data table. Really it's a database. And the applications are there on the device as well. They are connected to a mobile IoT gateway, one of the levels of the architecture, where they have a specialized database called Zen Edge. Very good architecture. What is the edge? I keep talking about it. What is the edge? The edge is where your things are. Whatever those things may be. Remember connected cars. Remember smart homes. Remember the algorithms and so on. It's wherever your things are.
It's where highly available processors enable real-time analytics for applications that can't wait long for decisions. Got the customer standing there. You want to make the right decision. Can't wait for data to go back, get analyzed on the back end, experience all the communication, and finally come back with a result. The embedded database has to have the information on it that it can use to make a decision. It's always a trade-off. Let me repeat that. It's always a trade-off. And you'll want to be very smart about the data and the processing that you do at the edge versus elsewhere. Now, storing data at the edge, the purpose of collecting data on the edge. Remember, the collection is pretty important here as well. I encourage you to test the inserts on the edge. The overall purpose of collecting data on the edge has shifted from purely device control and monitoring to improving various service capabilities through real-time analysis. So the capabilities have come up while not becoming overly ill-performant. Raw data, timestamped, coming from devices to be stored in a central database, yeah? This is important when connectivity is limited, that you have data at the edge you can process with. And many times, data is intentionally not real-time connected to the central database. And there's a couple of different mechanisms of storage that you'll want to take advantage of. Store and forward is a technique in which information is sent to an intermediate station where it is kept and sent at a later time to the final destination. That is that other centralized database. A digital twin is just real-time data. So it's constantly, in more or less real time, I should say, copying that data, replicating that data from the edge to a centralized database. So you think this is a passing fancy? Think again. Here we are with some 30 billion-odd IoT-connected devices right now, projected to go to 75 billion by 2025. And we've seen various estimates, right?
But they're more or less in this range of 50 to 100 billion or more connected devices in the next few years. So with all those connected devices, many of them are going to have an embedded database and want to be smart about what they do. So this is very important to pay attention to, again, whether you're in the software business or you're in the enterprise business. IoT device differences from traditional data requirements: high volume, new data sources, the speed at which the data is collected, modern transponder frequencies that are higher than before, sensor precision. Edge devices may or may not be able to do everything that you want to fulfill all the capabilities that you have for your application. Edge nodes' physical connectivity is often unpredictable due to various reasons: bandwidth availability, the wide range of protocol stacks, like Wi-Fi, Ethernet, cellular, Bluetooth, what have you. Very unpredictable. And sometimes you have edge devices that are battery-controlled. And when they are battery-controlled, the edge might be saving on the battery by going to sleep for some time period and coming back on. And therefore, you have the delayed connectivity to the centralized database. That has to be okay. Again, you have to be very efficient in what you store at the edge: not history data. Be very smart about it: summary data that you can actually use for the intended functions of the edge. IoT, and again, I'm calling that out as a very important, very typical use of embedded databases. And this IoT device management, kind of in summary here: the database must be seamlessly integrated into the application facilities. It doesn't stand alone. It stands alone in terms of everything is encompassed there that it needs to do certain things, but not the entire application. There must be a well-balanced choice of database management features available at the edge.
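The store-and-forward technique described earlier, combined with a device that sleeps or loses its link, can be sketched as a local outbox that is drained whenever the uplink works. This is a minimal Python illustration under assumptions: SQLite stands in for the edge database, and the `StoreAndForward` class and the `uplink` callable are hypothetical, not any vendor's synchronization API.

```python
import sqlite3

class StoreAndForward:
    """Keep readings locally; forward them in order when the link is up."""

    def __init__(self, send, path=":memory:"):
        self.send = send  # hypothetical uplink: returns True on success
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )

    def record(self, payload):
        with self.conn:
            self.conn.execute(
                "INSERT INTO outbox (payload) VALUES (?)", (payload,)
            )

    def pending(self):
        return self.conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def flush(self):
        """Try to forward everything; stop at the first failure."""
        sent = 0
        rows = self.conn.execute(
            "SELECT seq, payload FROM outbox ORDER BY seq"
        ).fetchall()
        for seq, payload in rows:
            if not self.send(payload):
                break  # link is down; keep the rest for the next wake-up
            with self.conn:
                self.conn.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
            sent += 1
        return sent

# Simulated central database and an unreliable link
link = {"up": False}
received = []

def uplink(payload):
    if link["up"]:
        received.append(payload)
    return link["up"]

queue = StoreAndForward(uplink)
queue.record("t=1 temp=20.5")
queue.record("t=2 temp=20.7")
print(queue.flush(), queue.pending())  # nothing sent while offline
link["up"] = True
print(queue.flush(), queue.pending())  # backlog drains once connected
```

A row is deleted only after the send succeeds, so a crash between the two steps can resend a reading. That is at-least-once delivery, and the central side would need to tolerate duplicates.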
Resource consumption must remain low, yet still allow for sufficient analytics to reduce the data flow to and from the cloud or the server. There is a need for advanced data management, and don't forget that you're going to have a number of layers in the architecture and you're constantly balancing data collection with aggregation and advanced processing. This has been, Shannon, my part of the presentation. Are there any questions? William, thank you so much for another great presentation. If you have questions for William, feel free to submit them in the Q&A in the bottom right-hand corner. And just to answer the most commonly asked questions, just a reminder, I will send a follow-up email to all registrants for this webinar by end of day Monday with links to the slides and links to the recording. So far, everyone's quiet, William. It's a quiet August day. It's eerie. You're really quiet. It happens when we already brought up everyone. Yeah, everyone scrambled, myself included. But, hey, if I got to give people back 20 minutes of their day, I'll do that. Let's give everyone a second. Yeah, I don't see any questions coming in. We did have a comment, you know, that as data people we need to be very well aware of security, and it's true. So many people say it's so important, but everyone wants to point the finger at somebody else for the responsibility of security. All right. Everyone's so quiet. Then we will give people time back in their day. William, again, thank you so much for this fantastic presentation, as always. And we'll get that follow-up email out on Monday. Thank you. Thanks, everybody.