Well, folks, thank you so much for joining me at this advanced stage of the conference. I think we're really kicking it into gear by now, and from this point on, it's just going to get better. I do appreciate that this is, of course, the last day, on a beautiful weekend. For everybody who's coming in from the East Coast, you're welcome; it's a beautiful day. It's the last session on the last day, and I'm really happy to see that you came to the talk. Since this is probably the last session you'll attend, unless you have a time machine or I've read the schedule completely wrong, I want to end on a very high note and make it something you really like, appreciate, and have fun with. My name is Bryan Reinero. I'm an engineer at MongoDB, a developer advocate, which is kind of an interesting space to be in. Essentially, I'm a consulting engineer, an integrations engineer, a jack of all trades who's been let off the leash. Any time anybody is using MongoDB, whether they're debugging it, debugging their implementation of it, preparing to scale out, or creating their data model, anything that touches MongoDB I can help them with, and often do. But today I'm here to talk about the Internet of Things and using it with MongoDB, and thank you very much for joining me. So the first question I have for you: I tweeted a couple of days ago, and I wonder if anybody saw this tweet. There was a little bit of a challenge; I was trying to get people to come in, because I knew this was going to be the last session on Sunday. Does anybody know what this thing is? Can anybody answer this question? What is this mysterious Egyptian device? Is it Egyptian? Is it otherworldly? Is it an extraterrestrial piece of technology? Any guesses? What does it look like? Could be a water pump, but it's not. Good guess, I like that one. Points for trying.
What does it look like? A grain-mill windmill thing? Grain mill, windmill. What you're looking at is what I would characterize as the first institutionalized, codified, organized, permanent wireless network. This is a semaphore station, from 1796, invented by a gentleman named Claude Chappe. There were a number of these stations across France, stretched out at intervals of about 20 miles, so pretty far apart. And it works exactly like the semaphore you've seen on old naval vessels, the guys with the flags: the arms would be moved into different positions to indicate different letters of the alphabet. In 1796 there was a network of these stations around France, and some of them still exist. They could send messages from one side of France to the other in a matter of minutes, as opposed to days or weeks. In fact, there was famously a battle fought in Alsace, as the French and the Alsatians are wont to do, and they got a message from Lille to Paris within an hour using the semaphore stations. That's a distance of 140 miles, in 1796. And in San Francisco there's a very famous hill called Telegraph Hill, and it's not called Telegraph Hill because of an electric telegraph. They actually had one of these on top of Telegraph Hill, where Coit Tower is now. The idea was that a guy would be up there watching the ships come in through the Golden Gate, signaling that a ship's arrival was imminent, within a couple of hours depending on the tide and so on. And if they could identify the ship, they could identify the cargo it was carrying.
So if you were down on the other side of Telegraph Hill in the old city of San Francisco, Yerba Buena, you would look up and the semaphore line would tell you what cargo was about to come into port: potatoes, eggs, what have you. Eggs were a rarity in San Francisco at that time. The price of that commodity would drop, because you knew a resupply was coming in, so you didn't overpay. Anyway, the gentleman who created this was Claude Chappe, who invented the semaphore line in 1796 and revolutionized communications. Interestingly enough, the semaphore line stayed in place and operative; it wasn't replaced by the electric telegraph until 1846. His brother would say: why would you want an electric telegraph, which is totally unreliable? Anybody could come along and snip the wire, as opposed to a cloud bank obscuring your network and causing connectivity problems. But this was a very successful system, and it's an example of Internet-of-Things-style devices deployed out in the environment, exchanging data over wireless networks, that have to deal with environmental factors, loss of connectivity, and all sorts of Byzantine faults, if you will. Interestingly, and sadly, Claude Chappe didn't live to see 1846. He threw himself down the well of a Parisian hotel in 1805, just a few years after he invented the semaphore line. Why? I don't know, but I would just say that engineers never make money on this kind of stuff. Okay, so how are we going to make you successful? All right. We're talking, obviously, about the Internet of Things, and the question is: what is the Internet of Things? Earlier there was a discussion here where we were talking about what the Internet of Things is, and why it has suddenly gotten so hot as a topic.
Because these are concepts that obviously go back to the 18th century, if not earlier. So why is there so much interest in the IoT space now? I think people are starting to realize what the potential is here, what's coming. It's dawning on us that all of these systems are starting to come together. I'm going to go through a couple of examples of IoT here, because it's not just personal devices we're dealing with. The way we'll define IoT, we're dealing with any kind of system that has transducers of some sort, that interacts with the physical environment in some way to collect data that's going to be transmitted back to a central system. And that's where the trouble comes in: getting the data back to the central system and dealing with it. So here is one example of IoT that we've worked with. This is an embedded induction loop. When a car goes over this sensor, it induces a voltage in the wire and trips the sensor; the transducer goes off, and we know there's a car present at this stoplight. Not the sexiest implementation of IoT, but you will notice it in its absence, because without it you're always waiting at a stoplight for the timer to run out before you're allowed to get the green light. That's one example. With these sensors deployed across networks, we did a little bit of data collection for a presentation I did a couple of years ago that was really interesting: the state of New York publishes its traffic metering data. Actually, a lot of states do that by now. You can go through their API and pull out vehicle speeds for different segments of the road, and they include weather conditions as well.
So you can actually do some pretty cool analysis off of this published, maintained data set, to see how weather affects traffic conditions on certain kinds of roads, and characterize those roads by their average times and how long it takes to get from interval to interval along the roadway. Another area we've done a lot of work in is with power companies. The Internet of Things also includes smart meters; there's a big field here. These, of course, are now fully wireless systems. They have their own NICs in them. There's no more meter reader who comes to physically read the meter; they just drive a truck through your neighborhood and pick up readings off of these meters, if the readings aren't already transmitted through the electric transformer at the top of your pole. So we're constantly drawing information off of these meters. It's a fascinating space, because what kind of data are we getting off these systems? We're getting voltage, temperature, and amperage off of these things. So the deal is, we're trying to figure out not just basic metering of how much power a household is using, but maybe how they're using the power. For instance, I recently got an electric car, and I'm completely pretentious about it, so I called up PG&E and said, can I have a special rate? Growing up down here it was SDG&E, when I lived in San Diego; I don't know what to call it now, but PG&E is the northern California power company. I said, I plug in my car during off-peak hours, can you give me a discount? And they said, yeah, we can see that you do. I'd already gotten the car, and they could tell from my meter readings that I was plugging it in at certain times. So part of the work we've done is to analyze some of the data that comes off of smart meters. And then we say, hey, guess what?
The goal for this particular company was sending real-time notifications to someone's account that say: if you're in a peak usage period and you turn off your washing machine, which seems to be running right now, you will save X amount of money, so please do. Now, there's all sorts of fraud detection with these things as well, because it turns out there are grow houses out there. There are some malefactors out there; not that anybody is necessarily analyzing this data to find grow houses, but people are trying to evade paying for electricity all the time, so they'll rip these meters out and turn them upside down to get them to run in reverse, literally. They'll wrap them in tinfoil to block the transmission of their data. They'll take a meter off of a derelict house and put it on their own house, so the billing goes to a totally different house. So there's actually a location component in the data: where these things physically are. Wild, wacky times. Okay, and then I think there was a gentleman talking about home automation, which is another big field. I'm speaking about it now because I didn't think to put it in the slides, but home automation is a big field as well: HVAC costs, trying to keep those down. There are now green mandates to lower the energy consumption of large managed buildings and office spaces; that's a big component of what we're dealing with. I think there's also another component here. While Fitbits are not necessarily all that advanced, they point at this idea that there's connectivity between the devices themselves, which plays a part in the way we want to deal with these systems. That is to say, you have a network of devices that might be communicating with each other before they come back and communicate with the centralized system.
And there's also the idea that the data we're getting off of these new instruments, these new devices, is data about ourselves, which plays an extremely important part in the kind of processing we need to do on it and the way we architect our systems. What we're dealing with today is almost assuredly going to be something different in the next six to eighteen months. We're talking about, among other things, embedded medical devices, which are evolving as we speak, and what kind of data we'll get from them. Probably the most meaningful thing I've heard people say about IoT in general: a lot of people talk about data volume, and that's indeed a big part of what's going to be happening on these systems, just a huge amount of data coming in. The volume, the need to retain it, the bandwidth required for the ingress of all of this data, the ingestion of it. But another aspect that I think is really overlooked at this point, though it won't be for long: the data you're collecting on all these devices really doesn't mean anything unless you can process it in some way. You have to gain insight. In this case, we're creating a social network, a social graph, off of data we can derive from the way people are interacting, their proximity to other people, what they're doing at what time and in what context. In marketing there's a term I actually kind of like called a system of engagement. A system of engagement takes all of this data, accrued from all of these different places, devices, APIs, personal accounts, coordinates it, and tries to create an experience that gets the user engaged. It doesn't fight you. If you have a flight booked to another city, you want to know that there's a traffic alert along your route. These systems are context-aware.
So, to borrow a little from that idea of systems of engagement, I think an important part of IoT is remembering what kind of meaning you're trying to derive from these systems, what kind of insights you're gaining, what you're doing with the data that's actually important. Okay, so let's talk about high-level architecture, and this is just setting up the bare bones. Most people think of system architecture, services architecture, in this basic way: we've got clients, a data services layer, and a back-end database. Of course, there's a lot missing in this kind of diagram, but I wanted to start with it so we can build out from there and discuss what's absent. We don't have monitoring systems in there. There could be several layers of data services; there could be microservices that the actual application clients talk to before you get to the database. It wouldn't be just a single layer: you have a data services layer in front of the database, and your application clients ahead of that, actually talking to the back-end systems. But there's also a need for analytics, archiving, and monitoring of these systems, which we're going to go into in a little bit. Keep that visual, because we're going to start building out from there. Okay, so to start off, we need a data strategy. If we're going to be architecting these systems, we need to figure out how we're going to handle our data, and not just physically, but the actual data model as well. Now, some of the things I want to talk about today are buzz terms, and one of them is this: how many people have heard of event sourcing, or CQRS? Yeah, we'll go into detail about what that is. It's a very interesting pattern; some would call CQRS a pseudo-architecture.
That's why I find the term "data strategy" a little more applicable, because these ideas involve not just which database systems you're using, how they're arranged, and the data model itself, but also a bit of the processing and how the nodes of the database system are arranged. So when we talk about data strategy, let's talk first about the way we're going to model our data and the tools we're going to use. This is a little bit of paying the rent, because I'm obviously going to be using MongoDB with these systems, but there are several advantages to this approach. First: how many people are familiar with MongoDB? Great. How many people have used it before? Okay, so really fast, I just want to go through some ideas about a flexible schema. I hope that doesn't appear too small on the screen; if it does, you'll have to move forward. Let's do it: everybody stop what you're doing and move forward, because my slides are too small. Anyway, we talk often about flexible schemas, which are an important aspect of this domain. The reason they're important here is that we're dealing with a lot of different types of things on the Internet. We're going to be dealing with a lot of different devices that have different characteristics. They could be metering very simple attributes, or very complex ones, but there's going to be a myriad of these devices. Some are older versions of a device that don't have the same features the newer versions do, but they're still out there in the field, active and collecting data. So we need a system that allows us to easily ingest that data into our database. Maybe we'll become more strict later on in how we classify the data and the schema, but for the time being, we want to maintain the flexibility to ingest new kinds of data.
Now, the reason I make note of this flexibility and how it applies to MongoDB is that MongoDB doesn't enforce a schema. You can use a flexible schema, which is to say that records put into the database will be accepted as long as they're well-formed JSON; there's no need to check the presence or type of a field. That gives you a lot of flexibility. For instance, here are two records in the same collection, or table, if you will. They're two different types of vehicle. What I like about this particular example is that they're subclasses of a parent class: a sport bike is a subclass of vehicle, just as a helicopter is, but they're two fundamentally different vehicles. We may want to put them in the same collection because, even though they're fundamentally different things, they share a common ancestor, a common understanding, and we want to access them with the same kinds of query patterns. You'll see in this example that I'm using a discriminator field to tell these two objects apart, but they share common attributes. One of them is that this company, Agusta, makes helicopters as well as sport bikes. So maybe I want to search for all the vehicles made by Agusta for some reason, or all the transducers or devices made by Siemens or General Electric, or Arduino, if I so choose. But the interesting part, where the flexibility comes in, is what I'm referring to as polymorphic attributes. Rake and trail are attributes of the motorcycle that have no relevance to a helicopter, and therefore don't need to be in that record.
By the same token, a motorcycle with blades, a four-bladed motorcycle, would be interesting, crowd-pleasing, but irresponsibly dangerous for the rider. So motorcycles aren't built with blades, and blades therefore aren't necessary in this document. We have flexibility about the way we store our data. Okay, so that's a little bit of level-setting about the document model. Let's talk about the data strategy, how we're actually going to deal with our data. These sensors, these Internet of Things devices, are in most cases sending in their state: what their state is at a certain point, some condition, some reading, some measurement, what is happening to them at that moment, and that's being sent back into the system. Those instantaneous snapshots in time really don't have a whole lot of meaning for us until we start thinking about them in aggregate. Now, one way we can deal with this is to say we have an object in the database that represents the current state of a device, and every incoming message updates that state. It seems reasonable: I have a smart meter, I want to know its current voltage, so I have one record that corresponds to that one smart meter. A new reading comes in from the meter and I update the state of that meter. Its voltage, the last time a message came in, was 220 volts; it was at a temperature of 70 degrees Celsius; there was an amperage of 210 amps, or something like that. It comes into the system, and that's the current state. The problem is that this is an extremely limiting way to deal with our data. We're losing insight, we're losing capability and robustness, because if we're persisting state only, we've lost the record of how we got into that state.
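To make the polymorphic-documents idea concrete, here is a minimal sketch. The field names and values (the "type" discriminator, "rake", "trail", "rotorBlades") are my own illustration, not the exact documents from the slides, and the query is simulated in plain Python; in MongoDB it would be a `find()` on the shared attribute.

```python
# Two documents that could live in the same "vehicles" collection.
# A discriminator field ("type") tells the subtypes apart; each
# document carries only the attributes that make sense for it.
sport_bike = {
    "_id": 1,
    "type": "sportBike",           # discriminator field
    "manufacturer": "Agusta",
    "rake": 24.5,                  # motorcycle-only geometry attributes
    "trail": 103,
}
helicopter = {
    "_id": 2,
    "type": "helicopter",          # discriminator field
    "manufacturer": "Agusta",
    "rotorBlades": 4,              # helicopter-only attribute
}

def by_manufacturer(docs, name):
    """Query both subtypes by a shared attribute; with pymongo this
    would be collection.find({"manufacturer": name})."""
    return [d for d in docs if d.get("manufacturer") == name]

# One query pattern returns both fundamentally different vehicles.
print([d["type"] for d in by_manufacturer([sport_bike, helicopter], "Agusta")])
```

The point is that no schema migration is needed to store both shapes side by side; the discriminator field is just a convention in the application, not something the database enforces.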
So the fundamental idea behind event sourcing, a buzzy term, is that we don't want to think in terms of state on the system. We do still have an idea of state that we're going to derive, but we don't want to immediately think of a record persisting the current state of our objects. What we do instead is log events into the system: an event log of domain events. If our domain is smart meters, the events will involve current, power, temperature, all that kind of stuff. If we're dealing with a mobile system, a component of the data will be where the object physically is on the planet at that time. The point is that we're not updating a state; we're recording, persisting to the database, deltas, changes, logs. This is akin to how I handle expenses when I take a trip: when I go home and submit an expense report, I collect my receipts and derive how much I'm owed. It's like recording the receipts, as opposed to constantly updating my account balance every time I buy a coffee. So we replay this log to derive a state. At time 12, tick 12, the state of this system, the application state, is 9. Now, there are a lot of advantages to doing this. If I lose the application state, if the system crashes, what do I do? I replay the log and regain my state. No big deal; it's very robust that way. But there's another advantage: I can see what the state used to be in the past. I can see how I got there, or what the conditions were at time 5 or time 10: I was up, I was down, all that kind of stuff. I can go to any time in the past, provided I haven't deleted the logs, and regenerate the state at that point. So maybe there's something I've learned about processing this data.
For example, say I found users that have associations to one another, and I want to know if they were ever physically in proximity to one another: I can go back and search for that in the past. I can change my data processing model, my stochastic models, and apply them to the way the world looked in the past. If I update bidding models or anything like that, I can go back and see how well they would have performed. So event sourcing is pretty powerful, and in this domain it's especially powerful, because we want to keep track of how we got into a state. We might want to find out from our sensors: if a sensor is about to fail, were there variations in its readings? Does that indicate some kind of failure in the system? In the case of the smart meters we were dealing with, they wanted to determine if a group of smart meters in one location was starting to act wobbly. Maybe it's not an individual meter that's the problem; maybe it's the transformer that's about to fail, and they need to get out there really fast, otherwise the whole neighborhood is about to lose power. So another way to think about this is that the application state is derived as the first derivative of the log. I just like saying "first derivative of the log." I say that loosely, because we're treating it as if the log were a continuous function of the system, as if you could take an integral over it. Typically with event sourcing, to find your true state you need to replay from the beginning of time and move forward to get to your application state. You can't just start in the middle, because that would obviously corrupt the state, so that's a little bit of a difference. But we'll handle that in a second.
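The replay idea above can be sketched in a few lines. This is a minimal illustration, not the talk's actual code: the events are hypothetical numeric deltas, and state is a single number, matching the "at tick 12 the state is 9" example.

```python
# Event-sourcing sketch: persist deltas, derive state by replaying the log.
event_log = [
    {"tick": 1, "delta": +3},
    {"tick": 5, "delta": -1},
    {"tick": 12, "delta": +7},
]

def replay(log, up_to_tick=None):
    """Fold the event log into an application state. Passing an earlier
    tick regenerates what the state looked like at that point in time."""
    state = 0
    for event in log:
        if up_to_tick is not None and event["tick"] > up_to_tick:
            break
        state += event["delta"]
    return state

print(replay(event_log))                 # current state at tick 12: 9
print(replay(event_log, up_to_tick=5))   # historical state at tick 5: 2
```

Because the log is the source of truth, a crash, a corrupted state, or a changed processing model all have the same remedy: run `replay` again.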
The idea here is this particular problem: I have to go back into the past to derive my current state. If I lose my state or it gets corrupted, if somebody attacked me and changed my application state, maybe by inserting erroneous records into the system, or maybe I just got it wrong, maybe I released a version of my software that was making logical errors in its interpretation of the log, that's okay, I can go back and regenerate my state. But if every time I need my application state I have to replay my entire event log, that could be prohibitively expensive in computation, in disk, in network; I could have terabytes of this data. The obvious solution is to create snapshots at regular intervals. When I need to go back in time to a certain known state, I just go back to the last snapshot. Say I want to go to time interval 5: I go back to the snapshot I stored at time interval 4, load that, and then play the event log forward up to the actual point in time I want to recover to. These snapshots are important. They are very much a view of the data at those times, and they themselves are generated by replaying the event log. And as I said, I can replay the event log with new information or with changes to the way I process it. So not only do I have an opportunity with event sourcing to create snapshots of one type of state, I can make different types of these views. I can have multiple views, each interpreting the event log in a different way for some reason. Very important. Okay, so let's talk about some examples. I just came down from San Francisco, and there's a published data set on CRAWDAD, which I'll link to in the last slide so you can check it out and play with it if you like. It's the last 30 days of taxi cab movements around the city.
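The snapshot-plus-replay recovery described above can be sketched like this. The snapshot table and events are hypothetical values of my own; the point is only the mechanism: restore the latest snapshot at or before the target time, then replay just the tail of the log.

```python
# Snapshot sketch: instead of replaying from the beginning of time,
# restore the most recent snapshot and replay only the events after it.
snapshots = {4: 2}   # hypothetical: derived state was 2 at tick 4
event_log = [
    {"tick": 2, "delta": +3}, {"tick": 3, "delta": -1},
    {"tick": 6, "delta": +4}, {"tick": 9, "delta": +2},
]

def restore(target_tick):
    # find the latest snapshot at or before the target time
    base_tick = max(t for t in snapshots if t <= target_tick)
    state = snapshots[base_tick]
    # replay only the events after the snapshot, up to the target
    for e in event_log:
        if base_tick < e["tick"] <= target_tick:
            state += e["delta"]
    return state

print(restore(6))   # snapshot at tick 4 (state 2) + event at tick 6 (+4) = 6
```

Multiple snapshot streams, each interpreting the same log differently, are just multiple `restore`-style functions over the one shared event log.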
I thought that was really cool. They're tracking taxi cabs and whether they have somebody inside them, and it's a lot of data. For the last month, it's over 11 million records, 11 million position locations of these cabs. Every minute or so, maybe a little less frequently than that, each cab updates where it currently is. So there's a lot of data to play with. Before Uber and Lyft, there might have been a lot more data to deal with, but that's a different issue. Okay, so here's what one of the records comes in as. The data they provide is just a flat file, a space-delimited list, to be ingested. Here's how I formatted it and ingested it into MongoDB. The document model is pretty simple. First, each of these records has an individual key, a primary key, that identifies the cab, the taxi whose reading, whose GPS position, we're recording, and I've concatenated it with a pipe to include the timestamp. So each record is a snapshot in time of where this cab actually is. In addition to that, I have extra fields about the cab itself that I might want to play with without having to parse the key apart. But the real sauce on this one is the GeoJSON format, which is basically the longitude and latitude coordinates of where this car actually is at this point; you can see it's longitude first, latitude second. In MongoDB we have a geospatial index that allows us to index these elements and search on them quite fast, which is a nice feature that's going to make everything we're doing in this example much easier. Second is the occupancy of the car: zero means nobody's in it, one means there's a fare going.
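A record shaped roughly the way the talk describes might look like this. The cab id, coordinates, and timestamp here are placeholders I made up, not values from the actual data set, but the structure follows the description: a pipe-concatenated compound key, a GeoJSON point with longitude first, the fare flag, and a real date type.

```python
from datetime import datetime, timezone

ts = 1213084687                              # hypothetical epoch timestamp
reading = {
    "_id": "cab42|{}".format(ts),            # cabId|timestamp compound key
    "cab": "cab42",                          # duplicated for easy querying
    "loc": {                                 # GeoJSON point; in MongoDB this
        "type": "Point",                     # is indexable with a 2dsphere
        "coordinates": [-122.4194, 37.7749], # index: [longitude, latitude]
    },
    "fare": 1,                               # 0 = empty, 1 = fare on board
    "ts": datetime.fromtimestamp(ts, tz=timezone.utc),  # real date type
}
print(reading["_id"])
```

Storing the timestamp as a native date rather than a string is what makes time-ordered index scans and range queries cheap later on.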
Playing around with this kind of data set, in New York they actually show how many people are in the cab, along with the fare price and the tip amount, which I thought was really cool; I was going to play with that a little too, to see if people tip more for longer distances, that kind of processing. The San Francisco data set is much, much simpler. And then the last thing is the timestamp: MongoDB has an actual date type, so I take the timestamp and store it as an ISO date, an intelligent date type that I can index and search in order as well. So I know where the cab is, when it was there, and whether there was a fare going on or not. Now, what I did for the processing on this system, as the stream of these records comes in, is keep track of transitions, like an edge trigger in an electronics system, where you change state on the rising edge of a voltage. If I'm in a state of fare zero, not currently on a ride, then when the fare flag comes up, the first record that comes in as one means a trip is starting. That's one way I can interpret these records. For every subsequent record that comes in while the fare flag is up at one, I know we're on a trip, going from somewhere to someplace with a rider. And when it goes back down to zero, the trip is over; the next record that comes in as zero marks the end. I loaded these records into MongoDB and ran the code over them, and I can send the code out so you guys can play with it along with the data set. And then, of course, it started spitting out not where the car was driving around when it was empty, but where it went when it had a fare. So here's one trip overlaid on a map. Really kind of cool.
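The edge-trigger logic just described can be sketched in a few lines. This is my own minimal reconstruction, not the speaker's actual code: readings are simplified to (timestamp, fare) pairs, where a rising edge on the fare flag (0 to 1) starts a trip and a falling edge (1 to 0) ends it.

```python
def detect_trips(readings):
    """Split a stream of (timestamp, fare) readings into trips using
    edge triggering on the fare flag."""
    trips, current, prev_fare = [], None, 0
    for ts, fare in readings:
        if prev_fare == 0 and fare == 1:      # rising edge: trip starts
            current = {"start": ts, "points": [ts]}
        elif prev_fare == 1 and fare == 1:    # flag still up: on the trip
            current["points"].append(ts)
        elif prev_fare == 1 and fare == 0:    # falling edge: trip ends
            current["end"] = ts
            trips.append(current)
            current = None
        prev_fare = fare
    return trips

# Hypothetical stream: two trips, one from t=2 to t=4 and one from t=6 to t=7.
readings = [(1, 0), (2, 1), (3, 1), (4, 0), (5, 0), (6, 1), (7, 0)]
trips = detect_trips(readings)
print(len(trips))                            # 2
print(trips[0]["start"], trips[0]["end"])    # 2 4
```

In the real pipeline each reading would also carry its GeoJSON point, so each trip accumulates a route rather than just timestamps.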
We see that I'm going from somewhere around the opera house over here down to my old neighborhood. That's one cab ride. Pretty cool. Can anybody see anything interesting about this particular route? Yeah. A little-known fact: there's a tunnel between Oak and Fell. Now, actually, this is one of the issues we'll be dealing with. Why do you think this happened? Why is it that they teleported this car? Right, you can't. There's a loss of connectivity in the system, and we can expect that, of course. A couple of things can happen. Maybe the record's gone; it just didn't make it. Depending on the system we're dealing with, if it doesn't make it through the network, we'll never recover it. Other systems, like smart meters, will say: hey, I'm not able to transmit my data, so I'm going to hang on to it for up to 40 days until somebody takes it off of me, because I don't want to lose this data. This system doesn't have that. So the way we handle that kind of failure mode is going to differ on a device-by-device basis. Another company I worked with didn't want to deal with the logic of whether a transmission got through or not, so they had their sensors transmit everything they had ever recorded, every time. The baud rate required of the network would just increase over time, because these systems would keep growing their backlogs. And they were relying on the unique key constraint in MongoDB, which says that if you've inserted a record with this identifier before, you'll get an error that you can't reinsert something with the same unique key. That was their data management platform. A little unfortunate. Easy for them to code, though. Anyway, that brings up a fact about how we can deal with this kind of system. Here's what that record looks like. This is the aggregate view. My raw data stream was those records of where the cab is at a point in time.
From that, using event sourcing, I go through the log and derive this view, which is that trip you see on the map. Those are the geolocation coordinates of where it was. You see I have it defined as a route. I see the start and end time, where the cab was, how long it took to get there. If I want to get really sophisticated, I can use a haversine function to determine the distance between these geolocation points, divide by the time, and get how fast the driver was going, too. I haven't thrown that in here yet, even though it's kind of cool, because I can always regenerate it. I can add that functionality in later, because I've stored the records in my database, and I can go back over these domain events and reproduce and enrich this data. That's the advantage of these views. Actually, I started really playing with this data — it was a lot of fun. I said, let's see everywhere this cab has taken a fare in the last month. That's what that looks like. I'm not the only one who's played with this data set. It's actually pretty cool — you'll see people more sophisticated than myself who put this in an animation, and it's really cool. You see the cabs going back and forth with the city map on it, and the city becomes apparent to you as the cabs move through it. Now, when I zoomed in on this, I was playing with it a little bit. You see that this particular cab is on every street downtown. He nails them all — he or she nails them all. The thing I liked about this one is that he takes somebody across the Golden Gate, of course, and he's got someone he took to Oakland, to the airport. He has a lot of trips on the map here, and I thought that was pretty cool. Now, there are 11 million records, and I wanted to condense all of these trips onto the map and really see how they're going. That'll be for a subsequent follow-up. Yeah, there's a question. How did they do it? I don't know. I'm not exactly sure.
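As an aside — the haversine step mentioned a moment ago is simple enough to sketch in plain Python: great-circle distance between consecutive GPS fixes, divided by elapsed time, gives an estimated speed. This is an approximation that treats the Earth as a sphere of radius 6371 km, and the fix format `(lat, lon, unix_seconds)` is just illustrative.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def speed_kmh(fix_a, fix_b):
    """fix = (lat, lon, unix_seconds); average speed between two fixes."""
    d = haversine_km(fix_a[0], fix_a[1], fix_b[0], fix_b[1])
    hours = (fix_b[2] - fix_a[2]) / 3600.0
    return d / hours
```

Because the raw domain events are all still in the database, a derived field like speed can be recomputed or added to the trip view at any time — which is the regeneration point being made above.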
If I knew that, I would have done it myself, because it's a crowd-pleaser. That's for this presentation 2.0 — I'll come up with some interesting stuff for that. What I wanted to do was a heat map: like I said, use the haversine function and find out which streets are the fastest and which are the slowest, and then really put on the propeller head — and this is really going down the rabbit hole — at what times. Like, what times are the streets fast? Is there a cab that's a speed demon with a lead foot, and another one that drives really slowly? And which one do you want to hail, depending on where you need to get? Yeah. I'm not using MapReduce in this particular case. I'm using — how many people took electronics, or basic electronics, in school? Maybe we're programmers and maybe not so much, but you know, like an edge trigger? You know what an edge trigger is? Yeah, yeah. That's what I'm doing. It's edge triggering. If I'm in a state where there's not currently a fare — I've got a flag, or like a latch — and I say, okay, the fare flag went up: start. When it goes back down, I realize that last record was the end. Cool. There was one more question. Oh, thank you so much. Yeah. So the question was how do I know when a ride is starting and when it has ended — it's using an edge-triggering function. And there was another question — oh, what animation do they use? I don't know. I wish I knew, because I would have been totally impressed. More than you are currently. Okay, so let's move on. Geolocation. With this data, there are lots of things we can do. One of them is defining an area of downtown Manhattan, for instance, which I've done here as well. This is a type Polygon, which is supported by this kind of industry format, GeoJSON. It's also supported by MongoDB, so I can actually define polygons in the database and index on them. Why would I want to do such a thing?
Maybe this is a service area. Maybe this is a toll zone: as a government entity, when I see your cab go into that zone, I can bill you automatically, just for your presence there. Maybe I can decide that inside that zone my rates go up — on the consumer side, not the government side. And the neat thing, of course, is that this could be a floating zone that I move around. I can find out whether a person calling in is inside a service area, for instance. Someone requests a service — are they in the zone that I service? Right? There are different variants of this that are really kind of cool. Is this device being activated or turned on in a place I wouldn't expect it to be, and can I attribute that to fraud? Right? This would be geofencing, as they say. All sorts of interesting things I can do with this kind of data. But the point, of course, is that this zone, or the fact that a person appeared there at that time — these are different views off of the data stream we're getting from the system. So it's a pretty important part that we're recording these domain events. We're recording these event logs so we can replay them, either retroactively or on the fly, to create an idea of what we're looking at in the world. Now, that's event sourcing. Event sourcing is closely associated with CQRS. The guy who coined event sourcing, Greg Young, says that event sourcing relies on something called CQRS — Command Query Responsibility Segregation — and I pretty much go along with that myself. The idea here is that since we're generating these different views of the data based on the event log, we can think of it like this: when we access the database, we have this basic pattern of ingesting the data as a log into the system, but then we have these different kinds of views stacked up against it, different ways we look at the data.
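The geofencing idea above can be sketched without a database at all. Here a GeoJSON polygon describes a hypothetical service zone (coordinates are made up and in GeoJSON `[longitude, latitude]` order), and a plain ray-casting point-in-polygon test stands in for what a MongoDB 2dsphere index would do server-side with a `$geoWithin` query.

```python
# A hypothetical rectangular service zone as a GeoJSON Polygon: one outer
# ring, with the first point repeated at the end, per the GeoJSON spec.
zone = {
    "type": "Polygon",
    "coordinates": [[
        [-74.02, 40.70], [-73.97, 40.70],
        [-73.97, 40.75], [-74.02, 40.75],
        [-74.02, 40.70],
    ]],
}

def in_zone(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon's outer ring?"""
    ring = polygon["coordinates"][0]
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > lat) != (y2 > lat):               # edge crosses the ray
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Server-side, the equivalent (sketched) MongoDB query document would be:
# db.positions.find({"loc": {"$geoWithin": {"$geometry": zone}}})
```

With the polygon indexed in the database, the billing/fraud/service-area checks become queries rather than application code — which is exactly the "different views off the same data stream" point.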
It's hard to see, but this is a couple of shades of gray — each shade is supposed to mean a different type of view on the data. Okay, this has architectural impact that you'll see in just a moment. The idea here is that we have variant read models and views. These views are fundamentally different from the data coming in. They're built off of the data coming into the system, but they are aggregates, or interpretations, or averages — something we've done with the data to interpret it. That has an implication: because the data going into the system is different from the data we're extracting, maybe we shouldn't use CRUD. How many people use a CRUD object — create, read, update, delete? From the application side, when we think of interfacing with the database, we have an object that represents the domain object. We perform these CRUD operations back and forth with the database through this one object. We use getters to retrieve data from the database as this object, or setters to set the object and push it back into the database in one way or another. But because the data stream is different — we have these domain events coming in on one channel, but the ways we interpret them, these variant models — these pieces of our code should be separate, because they perform separate functions. They have separate responsibilities. Hence this idea of command query responsibility segregation. And the idea here is that your reads are handled by different components of the code than your writes are. In this case, command means writing — writing to the database. We're going to mutate the state of the database. Even if it's only an insertion of immutable log records, we're changing the state of the database, and that's interpreted as a command. We're changing something on the database, but we don't expect anything back.
When we're inserting into the database, we don't expect data back — except for metadata that says yes, the insertion was successful. On the other hand, the query doesn't change the state of the database system; it doesn't change the state of the data store. But we do expect something back — we expect something back from our query. So: fundamentally different pattern, fundamentally different objects. The reason this is interesting, why this is a cool idea, is that on these systems, with lots of heavy analytics off of the incoming event stream, maybe I have a disproportionately, asymmetrically high read load in comparison to my write load. Possibly. And if I have an asymmetrical read load relative to my write load, I can scale these objects asymmetrically. I can have lots of readers, as opposed to very few, or one-to-one reader-writer objects. Now you can start to see: if these components of the code are different, and they're responsible for different views of the data — at least the queries are — this lends itself quite aptly to another buzz term that you've probably been hearing a lot about. Can anybody guess where this is going? If, instead of one big reader-writer, I use lots of small readers and writers, each a small service... micro service — circle gets the square. So: microservices. This is a pattern that lends itself to microservices. Now, an interesting side note about microservices. A lot is said about them, right? And when you ask somebody, well, how do I define a microservice — what are the edges of a microservice — what I hear right now is a lot of talk about separation of concerns, separation of code for maintainability. But when you ask, well, how big should my microservice be, the answer is "it depends." Here's a way to define a microservice: what is the view of the data that you want to see out there?
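The command/query split just described can be sketched with two small objects — one that only appends immutable domain events (the command side, returning nothing but an ack), and one that only derives read models (the query side, mutating nothing). All the names here are illustrative, and an in-memory list stands in for the database; the point is that the two sides can now be scaled independently.

```python
class CommandHandler:
    """Command side: mutates state (appends events), returns only an ack."""
    def __init__(self, event_log):
        self._log = event_log

    def record_position(self, cab_id, lat, lon):
        self._log.append({"cab": cab_id, "lat": lat, "lon": lon})
        return {"ok": True}   # metadata only, no data back

class QueryHandler:
    """Query side: mutates nothing, returns derived views of the log."""
    def __init__(self, event_log):
        self._log = event_log

    def positions_per_cab(self):
        view = {}
        for ev in self._log:
            view[ev["cab"]] = view.get(ev["cab"], 0) + 1
        return view

log = []
writer = CommandHandler(log)
reader = QueryHandler(log)
writer.record_position("cab-1", 37.77, -122.42)
writer.record_position("cab-1", 37.78, -122.41)
writer.record_position("cab-2", 37.70, -122.40)
```

Because a read-heavy workload only needs more `QueryHandler` instances, not more writers, this is the asymmetric scaling argument in miniature.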
One of these little read objects, or query objects, constitutes a microservice — as does the writer. And it has a single responsibility: ingestion of the log stream, ingestion of the domain events. And then the interpretation of that domain event log stream, whatever you want to call it, is right over here. So now we're getting into microservices. I think we have, at current count, four buzz terms going in this talk. Doing really well. Number one, IoT is itself buzzy, but I've also talked about event sourcing, and CQRS — command query responsibility segregation, which you have to say ten times before you can invoke it in your own code — and then microservices. But indeed, that's fitting, because a lot of this stuff is bound up in the complexity of these systems. This is the fascinating part to me: this is where these terms start to gel and have meaning to one another, relevancy to one another, context to one another. The impact that CQRS has on your system architecturally — because ostensibly that's what this talk is about, the architecture of a system — is that if we have an asymmetrical read or write load, that asymmetrical load can be serviced asymmetrically by our data store. So in MongoDB we have the secondaries. All the writes go to the primary, and then we have a system of replication where the data is replicated out to the secondary nodes, which you can send reads to. That's called eventual consistency. Now, eventual consistency is a tool, it's a pattern — it's not perfect for everything. There are some caveats you need to be careful about when you're using this kind of system. But in this kind of pattern, if we're regenerating these views on the data, eventual consistency might be okay for me to use, because I may not need to know exactly what the current state of the event stream is.
I'm regenerating this for analytics, for instance, or something like that that doesn't require the most up-to-date data. One needs to be careful about that. Okay, so let's talk about how these systems get tricky. Let me just do a quick time check. Okay, cool. Great. These talks about buzz terms are always about how it's going to be transformative, how it's going to change everything in the industry. That's great, and I think anybody who actually tries to implement these things will find they're good solutions — these ideas, there's real substance behind them. But I have to think pessimistically. I've maintained servers, I've been on pager duty, and the one thing I've learned is to be cautiously optimistic and strenuously pessimistic. So what's the tricky part of IoT in these systems? Well, mishaps. For instance — this guy almost got clobbered by the camera drone, right? These systems are out in the environment, and as we saw with the cab example, they don't always work. They may themselves be reliable, but we can expect service interruptions of some type. So we need mechanisms to identify this and take action. And as we go into the future, there are all sorts of problems we can encounter with these kinds of systems, because they're dealing with environments where we've not put sensors before. I mean, if you think about embedded medical devices, there could be any number of reasons you lose connectivity to the system. Okay, so how do we deal with these systems? This is another emerging idea: a service management system. This is actually kind of an interesting thing to me, and I think it ties into a lot of the discussion we see about containers, about infrastructure as code — this idea that I can code my infrastructure, or infrastructure as a service. How does that work?
I think as these systems become more complex, we're going to need systems that handle that complexity automatically for us — especially since these systems are expected to be up and reliable. So a service management system is a component of the architecture that's responsible for detecting failures and taking action to resolve them. Now, we've seen examples of this in components of our overall architecture. Indeed, in MongoDB, the replica set is there as a redundant system: if one server goes down, the remaining nodes elect a new primary and carry on. We've seen this in sub-components. In this system, a service management system says: your architecture is not only the back end, it's also the front end — these deployed systems, these objects. What does it mean if the transformer, or the central node that I'm collecting all this data from — the smart meters — stops collecting data, or something like that? Can I deal with that problem and maintain service? So part of the idea here is that the service management system has these agents, and they get their configuration — their understanding of what the world should be — from the federated configuration management database.
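A sketch of how an agent might use such a federated configuration record: the CMDB entry describes a sensor type, a failure threshold, a recovery action, and who to notify, and the agent compares incoming readings against it. Every field name and value here is hypothetical — just the kind of thing such a database could hold — and the ticket callback is a stand-in for a real ticketing or monitoring API like the Nagios-style systems mentioned below.

```python
# Hypothetical federated CMDB: per sensor type, what "broken" looks like,
# what the recovery protocol is, and who gets notified.
cmdb = {
    "smart-meter": {
        "temp_threshold_c": 70.0,
        "recovery": "dispatch field technician",
        "notify": "ops-team",
    },
}

def check_reading(reading, cmdb, open_ticket):
    """Compare one sensor reading against its CMDB entry; alert if needed."""
    cfg = cmdb[reading["sensor_type"]]
    if reading["temp_c"] > cfg["temp_threshold_c"]:
        open_ticket(cfg["notify"], cfg["recovery"], reading)
        return True
    return False

tickets = []
alerted = check_reading(
    {"sensor_type": "smart-meter", "device_id": "m-42", "temp_c": 75.5},
    cmdb,
    lambda who, action, r: tickets.append((who, action, r["device_id"])),
)
# one over-threshold reading -> one ticket for ops-team, device m-42
```

The useful property is that the thresholds and recovery steps live in data, not code, so the agents can be updated by editing the configuration database rather than redeploying anything.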
The idea is that the database stores information about what types of sensors are out there, where they are, what the recovery protocol is, and who to notify if they go down. The agents use that data to update things automatically — maybe a ticketing system. Say we've detected a failure through monitoring systems like Nagios and whatnot: we enter a trouble ticket for it, and once we have a resolution of the problem, we update the trouble ticket automatically — it's resolved, the ticket is logged, all that kind of fun stuff. This could get quite sophisticated for these systems, but if you consider the number of these devices distributed out there, this could be an interesting part of architecture in the future. Maybe it's possible to completely automate this based on what's inside a configuration management database. I think, with all the maturity that's coming with containerized systems and orchestration systems, it's possible. So, like I said, the idea here is that you know where these devices are. This is a data set I imported for you, and you can see that there are three wifi hotspots in New York — if you need connectivity, just tweet at me and I'll send you this map. But yeah: if one of these fails, where are they? Where do I send my service technician? The other idea behind the service management system is that it's necessary to know the state of these deployed systems — if you don't, things start munging up. If we're performing analytics off the data we're getting from the sensors, the standard deviation of our results, our averages, is going to go wild, because maybe a lot of these devices are failing and adversely affecting the accuracy of our metrics. We don't want to get into this negative feedback loop where we're losing sensors, which affects our analytics, which affects the way we deploy and use our sensors. So monitoring — just like you monitor
your backend infrastructure — is going to be important for the state of these deployed sensors. And then, of course, it's very important that if you're monitoring these systems, you have an accurate system for alerting, right? That's the Jira ticket; that's letting people know there's a problem going on in the system. I love human, non-technological analogies — and well, what's non-technological about Batman? — the most famous alerting system of all. In the case of smart meters, again, where we've seen a lot of work, one of the records we'd see here would be something along the lines of: this component has a threshold. We have alerts we want to configure for this specific type of meter. What are the threshold readings? Does the temperature go out of whack? That was what they really saw as the leading indicator of an imminent failure of the system: the temperature in these sensors goes really high, and it's about to die. I thought it would be something different, like the voltage going up or down, or the amperage going up or down. So using this configuration data, I can go back and set my monitoring thresholds. As I'm replaying the sensor log that comes in off this system, I'm comparing it to that threshold, and when the threshold is exceeded, I go into this alert state and trigger off a trouble ticket — in Bugzilla, through the API, for the sake of the example. Now, how do we actually do this? This is another part of the trickiness with event sourcing. When we replay the log and exceed this threshold, it's pretty simple: we send a notification to the subscriber service, which takes an action — the SMS, starting the trouble ticket, raising the alarms, everybody wake up, all that kind of good stuff. One thing you need to be careful with, using event sourcing with these external notifications, is: if you've lost application state and you replay the log, does your
system know to differentiate between the first time you passed that threshold, or is replaying the log going to re-trigger the subscriber service? Because the subscriber service doesn't understand the data model the same way your event sourcing system does, it's possible to re-trigger these actions by replaying the log. In fact, to use another human analogy, there's a very famous example of this. How many people are of a certain age and know who these people are, what show this is, and what he's saying? Has anybody seen this? This is from 1977: Dan Aykroyd, John Belushi, and, very young in the back, Bill Murray. In the sketch, there's this cafe where all they serve is cheeseburgers, and this guy, Robert Klein, is saying, I don't want a cheeseburger — it's too early in the morning for cheeseburgers, I want eggs. And John Belushi says, look, he's having a cheeseburger, he's having a cheeseburger. Every time he says "cheeseburger," his brother interprets it as an order for another cheeseburger and starts putting one on the grill. He's replaying the event log, and his subscriber service can't make the differentiation. That's the classic case: cheeseburger, cheeseburger, cheeseburger. And then Robert Klein finally says, okay, fine, I'll have a cheeseburger. So John Belushi turns around and says "cheeseburger," and his brother says, no more cheeseburger. It was funny in 1977. Anyway — what do they need to solve this problem? What would you do? Any ideas, from whoever didn't put themselves through school working in a restaurant? Quite frankly, they need a circular buffer. You've seen this: it's quite literally a circular buffer. You put the ticket on the buffer — it's a queuing system — and who's responsible for taking the ticket off? The chef, the cook. It's a consumer-driven queuing system. It's not a push system, it's a pull-driven system, meaning that the subscriber in the system
maintains its own state about whether it has seen a message or not. You could implement such a thing on a push system, but it's much harder. Coming in under the wire — I think we're almost out of time. So in this case, the way we do these external updates is that when the notification comes through, we push it onto a queuing system like Kafka or something like that, and then the subscriber consumes it off of the system. That's how it works. So, one last thing about gaining insight from these systems, and I'll try to wrap this up — ooh, I'm running out of time. This is the good part, so give me a few seconds; I'm going to run over a little bit. This is the Gulf Stream, first charted in 1770. How did they do that? How did they know the Gulf Stream was there? Very ingeniously: a guy came up with the idea of putting a card inside a bottle and throwing it into the ocean. On that card it says, please write down where you found it — your latitude and longitude — and the date you found it, and if you'd like to tell us what the ocean temperature and weather conditions were, even better. Then mail this card back to me so I can aggregate that data. This is a deployed sensor system, and they figured out where the Gulf Stream was. Who was this guy that thought of this? Well — yuck yuck yuck — it's Ben Franklin. He actually cut his scientific teeth doing this. Everybody loved him. Famously, it clipped two weeks off an Atlantic voyage if you figured out where it was. These merchant vessels kind of knew there was something out there, and they were outrunning the ships to get to England and things like that, so he had to figure out where this Gulf Stream was. So the point of this particular foray into the past is: you're collecting data, and now you have to derive insights from it. In these systems, we need to integrate our data systems with processing. How many people have heard of Hadoop?
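Backing up one beat: the pull-driven pattern that solves the cheeseburger problem can be sketched in a few lines. The producer only appends to a log; each consumer keeps its own offset — the way a Kafka consumer tracks its own position — so re-reading or replaying never re-serves a ticket it has already handled. The in-memory list is, of course, a stand-in for a real broker.

```python
class Log:
    """Append-only log: the producer's only job is to append."""
    def __init__(self):
        self._entries = []

    def append(self, msg):
        self._entries.append(msg)

    def read_from(self, offset):
        return self._entries[offset:]

class Consumer:
    """Pull-driven subscriber: it owns its offset, the broker does not."""
    def __init__(self, log):
        self._log = log
        self._offset = 0

    def poll(self):
        batch = self._log.read_from(self._offset)
        self._offset += len(batch)   # remember what we've already seen
        return batch

orders = Log()
cook = Consumer(orders)
orders.append("cheeseburger")
orders.append("cheeseburger")
first = cook.poll()    # the cook pulls two tickets off the wheel
orders.append("eggs")
second = cook.poll()   # only the new ticket; nothing is re-served
```

Because the consumer's offset survives a replay of the log, the external side effect (the SMS, the Bugzilla ticket) fires once per event instead of once per replay — which is exactly why the subscriber, not the producer, should own that state.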
Okay, how many people have heard of Spark? Okay, yeah. Spark is pretty buzzy; Hadoop is very buzzy, or had been — still kind of is. These are related technologies, and the idea here is that they're going to be doing processing of this data beyond what you'd process in MongoDB alone. MongoDB is really good for grouping and averaging and aggregate functions with the aggregation framework, but when it comes to machine learning libraries, you don't really want to have to roll your own. That's where these systems come in. They're there to handle huge data sets in processing, but they also come with a lot of machine learning libraries that help you do some sophisticated stuff. I'm going to zip through this, but the way these technologies work is that there are these concentric rings of abstraction. At the base is the distributed file system, against which these systems perform the processing across a distributed set of nodes, and then the Spark or Hadoop layer in the middle actually performs the processing of this data in a distributed way across all those nodes. This gets particularly sophisticated and complicated for people, because this is a whole architecture in itself, and a lot of people get frustrated with these systems because they're so complex. In fact, you can see that typically the way people integrate their systems with this back-end analysis is that they have their data services up here, where the data is coming into the application database, and they have to ETL it over into these HDFS nodes, to then be processed by Spark or Hadoop. This looks crazy, right? A lot of people deploy that way, but there's a lot of overhead associated with the maintenance of these servers, as well as the seemingly trivial but deceptively difficult ETL from one system to another. If that ETL system breaks, your analytics break, and if you're trying
to do a really fast closed-loop analytics system for real-time or near-real-time analytics, if your ETL breaks, you start losing money, because you're not able to process this data. Where MongoDB comes into such a system is that you can replace the HDFS layer with MongoDB, so that you're performing these analytics against the data in the database, and you significantly reduce the complexity of your architecture down to something more manageable. The idea here, now that our architecture has kind of morphed, is that we have these data services, themselves composed of microservices, and what the views on that data are — what those microservices do with the inbound log stream — is influenced by the analytics occurring on these back-end offline systems you're integrated with. That's how you get value out of the data and the analytics in systems like this. So, in overtime, just one last thing I wanted to say: the connector is available on GitHub. I'll put a link in there later on — well, here's the link, actually, it's in my slides. There you go, and you guys can play with that as well if you like. I have some examples up there, and I want to cook up some more that you can run off of the taxi cab stuff and play with. One last thing: who can tell me who this person is? This is Emmy Hennings. She is an inscrutable Dada artist from World War I. She bears no relevance to this talk; I just threw in another stumper. Thank you very much for your time. Enjoy the rest of your weekend. If you guys have questions, I'm perfectly happy to answer them. I'm five minutes over — I don't know if that means I've run into higher rates for this room. If anybody asks who the speaker was, I'll point to somebody else and get out the back door.