Today I'm going to be talking about Apache HTrace, which is a new big data project that's been proposed to the Apache Incubator. Just let me know if you can't hear me or if you can't see something. Let me say a few words about myself. I work for Cloudera on HDFS, which is the Hadoop distributed file system. Previously I worked on the Ceph distributed file system, and I've worked on a few other projects as well. So today I'm going to talk about some of the motivations for HTrace, why we felt HTrace was important to create, and then I'm going to give a brief overview of the architecture and some of the ideas behind it. After that I'll talk about using HTrace from two points of view: a more developer-focused point of view, how do you add HTrace instrumentation to your project, and a more user-focused point of view. Hopefully we can cover both of those. We'll talk about the HTrace community, which is a pretty vibrant community. And I have a demo today, and hopefully we'll have some time at the end for question and answer. But if you have questions along the way, feel free to ask them then.

So, big data in 2016. The last few years have been really huge for big data. The volume of data continues to grow: we used to talk about petabytes, now we talk about exabytes. We used to talk mostly about MapReduce; now we're talking about projects like Spark, Impala, and other SQL engines. We have things like RecordService and Kudu, lots of new projects on the horizon that really expand the capabilities of the big data stack. But at the same time, we've seen a lot of challenges. As clusters get larger, managing them gets more difficult. We've seen people with thousands of nodes, multiple thousands; it's not just Yahoo anymore, it's others as well. We've seen a lot more density: what happens when you have 20 hard disks per node rather than 10? And these hard disks are much larger than they ever were before; we will soon see things like 10 terabytes per drive on the market. We've seen latency targets get lower. An overnight MapReduce job used to be pretty acceptable, but now people would like to see latencies in minutes, sometimes even seconds, so that becomes even more of a challenge. Manageability and monitoring continue to be things we need to work on, and as the number of projects and the number of moving parts grow, we have to work on them more. And we're seeing more heterogeneous clusters: we're seeing flash, we're seeing shingled magnetic hard drives, lots of different hardware. So you can no longer assume that every node looks the same, or that every node just has a pool of disks which all look alike.

And of course the stack is getting more complex. Just to give you a flavor of what a big data stack in 2016 might look like, here are two example stacks, and these are just examples; many, many more are possible. You might have something like Impala, which is a SQL engine, running on top of HBase, which is a key-value store. HBase of course uses HDFS as its underlying data store, and all of that is on top of Linux. Or, alternately, you might see something like the Hive SQL engine on top of Spark, which is also an execution engine, on top of something like RecordService, which is basically a security enforcement service, which in turn would be on top of HDFS and Linux. And all of this is Hadoop.
And I'm showing this just to give you a flavor of the fact that Hadoop is not just MapReduce and HDFS anymore; it hasn't really been that for many years, and the number of projects in the ecosystem continues to grow. So diagnosing these distributed systems is pretty tough, and there are a lot of different reasons for that. One of them is that because we have timeouts and fallbacks in the system, we hide errors as much as we can. Burying an error is normally a good thing, because failure is bad. But because we take so much effort to bury errors, it's sometimes hard to diagnose performance problems: let's say one node is misbehaving, the cluster will try to recover from it, and everything will just run slower. So unless you're really paying attention and really monitoring closely, you may not know that your performance is suffering as a result. And of course, performance problems are often not repeatable. They can result from hardware failures that are intermittent, from configuration problems that are intermittent, from load. All of these things combine to make performance problems some of the most challenging problems to debug when you're dealing with these systems. And when we have many different projects and many different nodes, we have to be able to follow what's going on across all of them. Just because you're seeing slowness in Hive, or in Spark, or Impala, doesn't mean those are the projects at fault; it just means that's where it's manifesting itself.

So here's a really small example of all the different components that might be involved. Let's say you have something like an HBase client which is writing to HDFS; that's all this is. You already have the HBase client involved, you have the HDFS client code involved, you have the name node involved, and three data nodes. So we're already looking at basically five different processes, and potentially three or four different nodes. If we have a problem with one of those data nodes, we're not necessarily going to see it there; we're going to see it at the client, and then we're going to have to find a way to trace it back to where it came from.

So there are a lot of different approaches that we have today and that we've historically had. Metrics are obviously one. Everyone's familiar with tools like top, vmstat, and iostat; they give you an idea of what's going on on a particular node. Is the CPU pegged on this node? Is the disk pegged? What's the distribution of disk requests? And metrics can be pretty sophisticated: they can include a max, min, variance, median, all of that. We've also developed sophisticated systems to store and aggregate these metrics. Something like Cloudera Manager will store them for a long time, and it will downsample the older ones so that you use your bounded storage space for metrics in an intelligent way, and there are other systems too that will manage Hadoop. JMX, of course, is very important for Java processes, which most Hadoop daemons are, though not all. So metrics are really good for getting an overall view of throughput. And metrics can even be application-defined things, like how many transactions per second am I doing, how many files per second am I creating; it doesn't have to be just disk or CPU.
Metrics are usually not very good at identifying latency problems, though, especially for particular requests. Averages can often hide significant outliers. Min/max is one way of coping with that, and so is variance, but even if you have those numbers, it's hard to figure out why you're seeing a particular metric. For example, you might see low disk I/O, but you don't necessarily know why. It could be because you're experiencing errors that are bottlenecking the writes or reads. It could just be because there's not much to do at that moment. It could be because of some higher-level thing, like we're not balancing where we write across the cluster very intelligently. So metrics tell you what, and they give a good idea of aggregates, but they don't really tell you why; you have to figure that out in other ways.

And that's where things like log files often come in. Daemons pretty much all generate log files, and increasingly they generate multiple log files. You have things like the HDFS audit log, which is a record of what operations have been done. You have log4j files, which can be tuned: you can turn the logging up or down for particular services or facilities within the daemon. Clients have log files too; the HDFS client is also logging. The important thing about log files is that they're usually stored on the nodes that generated them, pretty much always. And this is another place where you only have a bounded amount of space to store them, which translates into: you only see logs back to a certain time. With log files, it's a little harder to downsample than with something like metrics, so that's often an issue; we often don't have the logs we would like. Log files are really good for getting detailed information about what's going on at a particular point in time. But they're not necessarily good for getting a holistic view of a single request, because for performance reasons you can't afford to log everything. And even if you do, you have to have some way of correlating what's happening in these different log files across the whole cluster. As I showed you earlier, even for a single write you might already have five servers involved, so you're doing a lot of manual work to correlate these logs. And increasingly we're seeing logs split into more and more files, because a single log file can be a big performance bottleneck. So we're not only sharding by host and by service, we're also sharding by what the log is: is it an audit log, is it a block log, is it a log4j log, and so forth.

So the HTrace approach is a little different from either of those, and I think it's complementary to them, sort of filling in the gaps. The HTrace approach is to follow a specific request, or set of requests, across the entire cluster. The idea is you can follow the request across network boundaries, but also across project boundaries. It doesn't just stop at the border of HBase or the border of HDFS; it can go all the way down. And we would like tracing to be end to end. So what does end to end really mean in this context?
It means multiple cluster nodes, multiple projects, and if necessary, multiple languages. We're seeing C and C++ get more and more popular as big data implementation languages, so we need an HTrace client for those languages as well, and integration for those as well. HTrace also tries to use the available compute and storage stack as much as possible, because we can't make assumptions about what is and isn't available; or rather we can, but that would really limit our user base. So our goals are supporting multiple backends and not being tied to any one RPC framework or language. Just within Hadoop itself, we already have Hadoop RPC, we have people using Thrift, people using KRPC, people using Akka, people using tons of RPC frameworks. So if you start by saying tracing is tightly tied to our one RPC framework, it's already a non-starter for Hadoop and for big data. We want a stable, well-supported client API that can easily be integrated with different projects, and we want zero impact when not in use, so that we can actually use this on production clusters. And it's also very important to get integration into the upstream projects so that people don't have to do a one-off to trace their workloads. You'd be surprised how many people do write a one-off to do tracing in Hadoop, and that's something we'd like to avoid when it's possible to avoid.

So spans are basically an interval of time, and one way of looking at them is that you annotate your program with information about when trace spans begin and end. There's a parent-child relationship between trace spans. Sorry, the PDF export kind of messed this slide up, but the idea is this is a copyFromLocal operation, and in the course of doing this copyFromLocal operation, we did a createFileSystem operation. That ended, then we did a globber operation, and the globber operation in turn did its own thing. So it's a little like a flame graph, if you're familiar with that, or a little like viewing the stack over time. A trace span, as I said, represents a length of time, and trace spans have a lot of different attributes. They have a description, which tells you what it is, what's going on. They have a start time, which is stored as milliseconds since the epoch, and an end time as well. Trace spans have parents, or at least they can have parents if they're not the top-level one. To keep trace spans unique across the whole network, we have a unique ID, which is 128 bits and randomly generated. We also store the process ID and IP address of each trace span, and this is configurable; if you want to store the hostname instead, for example, you can. Trace spans also allow you to add arbitrary annotations, if you want to attach extra information that's specific to your system.

So sampling is an important concept in HTrace, because tracing every single request would generate too much data in most cases. Tracing every single request could be useful for testing, for example, but it's not really going to be useful in production, because it's just going to be too much. So normally we do sampling at less than 1%, and the exact rate will depend on a few things: how verbose the tracing is in the system you're tracing, and how much bandwidth you have to store and process traces. But in general, less than 1% is the ballpark.
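To make that concrete, here's a minimal sketch of wiring up roughly 1% sampling with the HTrace 4 Java core API. The configuration key names ("sampler.classes", "sampler.fraction") follow the pattern used in the HTrace and Hadoop tracing docs of this era, but treat them as illustrative and double-check them against your version:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.htrace.core.HTraceConfiguration;
import org.apache.htrace.core.Tracer;

public class SamplingExample {
  public static Tracer buildTracer() {
    // Configure a probability sampler that traces roughly 1% of requests.
    // Key names are illustrative; check the HTrace docs for your version.
    Map<String, String> conf = new HashMap<>();
    conf.put("sampler.classes", "ProbabilitySampler");
    conf.put("sampler.fraction", "0.01");
    return new Tracer.Builder("FsShell")
        .conf(HTraceConfiguration.fromMap(conf))
        .build();
  }
}
```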
We also like to trace things at the whole-request level. The intuition here is that if I want to see a request, I want to see the whole thing; I don't want gaps in the request, holes I can't look into, which is what would happen if I sampled at the level of each individual span. So we want to trace an entire request or not trace it at all; we don't want to go halfway.

So we have a pluggable architecture for HTrace. The basic idea is that the HTrace core API is the one you integrate with, and span receivers are things that plug into the core API and process the spans. We also have a web interface that can be used to query the span receiver. The idea behind this separation is that if you're a developer, all you have to care about is the core API; you don't have to care about the span receiver or the GUI. It also tends to minimize the set of dependencies we're pulling in, which is pretty important, especially in a big project like Hadoop. We have a few different span receivers; the two most important ones, which I'll talk about today, are the local file span receiver and the HTraceD span receiver. Those are the ones we've gotten the most mileage out of. The local file span receiver, as the name implies, stores spans in files on the local file system. The idea is that you can post-process the files later with whatever tools you like, and it's also useful just to test things out and see what kind of spans you're getting. We have a well-defined JSON format for our spans, and I think this makes it a lot easier to write tools that go on top of HTrace. Our span IDs are 128 bits, stored as hex; our begin and end times are stored as numbers; and we also have the process ID in there. And there can be even more than that. This is what you would see in the files; the files would not be pretty-printed like this, but this is the basic idea. (An illustrative span record appears at the end of this section.)

We also have the HTraceD span receiver, which is a little easier to use because it stores the spans in a central daemon rather than putting them in files on each host. The idea is to have indexing, a web UI, and aggregation all in one place, in the HTraceD daemon. The HTraceD daemon is written in Go. As far as RPC goes, it serializes its spans via MessagePack for greater efficiency; we also find that MessagePack and JSON are really easy to convert to one another, sort of isomorphic in a way. HTraceD exposes a REST API so that you can have command-line tools to query it; the web apps can query it via the REST API as well. And it handles overload pretty gracefully. As far as storage goes, HTraceD stores its spans in LevelDB, which is sort of analogous to HFile in HBase: it's an LSM tree, so it's optimized for very high write throughput. We use multiple LevelDB instances so that we can take advantage of multiple disks, which nearly every big data cluster will have. We also index certain fields, such as begin time, end time, duration, and span ID, which makes queries on them a lot faster. And of course LevelDB persists to disk, so we don't have to worry about being limited by the size of memory or anything like that. So the third component is the HTraceD graphical interface.
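Before moving on to the GUI: here is the illustrative span record promised above. The single-letter field names (span ID, begin, end, description, tracer/process ID, parents) are how I recall HTrace 4 abbreviating them; treat the exact key names, and of course the values, as a sketch rather than a reference:

```json
{
  "a": "349f5ecb1c46d1a7a93c61b9a3f0d2c4",
  "b": 1453494462423,
  "e": 1453494462517,
  "d": "copyFromLocal",
  "r": "FsShell/10.20.212.10",
  "p": ["42cbb1f2d8a94cfa9b7e3a1c55f07d19"]
}
```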
I'm going to give you a demo of the GUI later today. But the basic idea is that we have the ability to run queries over all the spans stored in HTraceD. You can add multiple predicates to these queries, so you can select, for example, a range of time by setting both a begin and an end time, or select only spans that have a certain description or that come from a certain node. And once you have those spans, you can follow them, so you can see where the parents and children were.

So let's talk a little about using HTrace. I'm going to start with how you add HTrace support to an application. After that, I'll talk about configuring HTrace, which is more of a user-level concern, and finally I'll talk about using the HTrace web interface. Adding HTrace support to code is really easy. Basically, you link against the HTrace core jar, or if you're using C or C++, you link against the libhtrace.so library. And you have to have some way of hooking your configuration up to HTrace. This is a system-specific thing: however your system is configured, HTrace needs to be able to access some of that configuration information. In the case of Hadoop, we usually configure things via XML files, so you would need to give HTrace a way of looking at those, and it's actually really simple to do. After that, you add HTrace spans to measure the important events. Basically, for each request in your system, you'd like to create a span, and you'd like to have spans underneath that. I think the most difficult thing is probably just making sure you're not tracing too big a chunk at a time: you want each span to be small enough to be reasonable to look at, but big enough to be interesting. That's the more challenging part. And there are annotations you can add for system-specific information, which can be very helpful. A lot of applications will also need to pass parent IDs over the network. Let's say you start an operation in the HDFS client and you want to be able to trace that operation on the data node: you'll have to pass the span ID you're currently using over the network, so that the data node can create a new span which has that span as its parent.

So I'll talk a little about the core API here, which has a few different concepts. Probably the most important one, or the one you would encounter first, is the tracer. The tracer creates trace scopes. Tracers also have their own sampling configurations, so you can do sampling on a per-tracer basis if you like. Tracers are thread safe, so you can have multiple threads using the same tracer at the same time; it doesn't matter. Trace scopes manage the trace span for a particular thread, and they're created by tracers. So, for example, you would call tracer.newScope to create a "calculatePi" scope right before you call calculatePi, and then you would have something like a finally block that closes out that scope. That way you know exactly how much time the calculatePi operation took. And because there's thread-local data involved here, if you create a new scope inside the calculatePi function, it will automatically point back to that outer trace scope as its parent.
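Here's a sketch of what that looks like in Java with the HTrace 4 core API; calculatePi is a stand-in for whatever work you want to measure:

```java
import org.apache.htrace.core.TraceScope;
import org.apache.htrace.core.Tracer;

public class PiService {
  private final Tracer tracer;

  public PiService(Tracer tracer) {
    this.tracer = tracer;
  }

  public double tracedCalculatePi() {
    // TraceScope is Closeable, so try-with-resources plays the role of the
    // finally block described above: the span closes when the scope exits.
    try (TraceScope scope = tracer.newScope("calculatePi")) {
      // Any scope created inside calculatePi() on this thread automatically
      // becomes a child of the "calculatePi" span, via thread-local data.
      return calculatePi();
    }
  }

  private double calculatePi() {
    return Math.PI; // stand-in for real work
  }
}
```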
So of course we have the concept of spans that we've been talking about, and there's an object that represents those as well. You typically don't have to deal with spans directly, but you can sometimes: for example, you can add a key-value annotation to a span, you can find the span ID of a span, and so on. HTrace core also has various functions that are basically wrappers. For example, the TraceRunnable object wraps a Runnable so that when you call the run method, it starts the trace scope, and when the run method exits, it closes the trace scope. It's really just a convenience, if you'd like to write a little less code. There's also a TraceCallable, which is very similar, and a TraceExecutorService, which wraps everything it executes in its span. There are a few other internal classes that you usually don't have to deal with, but just for completeness, I'll describe them here. We have a Sampler object, which determines which spans to sample. The SpanId is the 128-bit ID, which is unique for each span across the network. The TracerPool manages a group of tracers, and it's really only useful if you want to do certain things with resource management. These classes are all related: the tracer pool owns tracers; a tracer owns a sampler, potentially more than one sampler; tracers create the runnables and scopes; and spans own span IDs. So this is pretty much the complete API, and it's actually a pretty small API.

So I'm going to shift gears a little and talk about how you would configure HTrace in an application that supports it. I'm going to talk about Hadoop specifically in a few cases, just because that's the first application we're considering, but a lot of this also applies to any application. The first thing you want to do is determine which span receiver to use: do you want the local file span receiver, do you want HTraceD? There are even other ones, like an Accumulo one and an HBase one. Then you run whichever daemons are necessary for that span receiver, and finally you set up the configuration. The configuration is a bit system dependent, because we would like HTrace to be integrated very tightly with the systems it's tracing: if we're integrated with Hadoop, we want to use the Hadoop configuration; if we're integrated with Kudu, we want to use the Kudu configuration, et cetera. In the case of Hadoop, what you need to do is set hadoop.htrace.span.receiver.classes. It's important to note that for any configuration key that starts with hadoop.htrace, we basically cut off the hadoop.htrace prefix and pass the rest on to HTrace. So it's like saying everything under hadoop.htrace is an HTrace configuration key that we just pass through. Hadoop also has the ability to configure things on a per-tracer basis: if I want to trace just the data node, I can do that by setting a specific configuration key prefix. These prefixes determine which service the configuration is going to; hadoop.htrace is the global prefix that says everything in Hadoop gets it. There's a link, which is cut off on this slide but I'll show later, that provides more information about this. Another thing you need to do is add a few span receiver jars to your class path.
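As a concrete sketch of the configuration side, here's roughly what that might look like in a Hadoop XML config file such as core-site.xml. The key and class names follow Hadoop's tracing documentation from the HTrace 4 era, but double-check them for your versions:

```xml
<!-- Everything under the hadoop.htrace prefix is passed through to HTrace. -->
<property>
  <name>hadoop.htrace.span.receiver.classes</name>
  <value>org.apache.htrace.core.LocalFileSpanReceiver</value>
</property>
<property>
  <name>hadoop.htrace.sampler.classes</name>
  <value>ProbabilitySampler</value>
</property>
<property>
  <name>hadoop.htrace.sampler.fraction</name>
  <value>0.01</value>
</property>
<property>
  <!-- Only needed for LocalFileSpanReceiver: where to write the span files. -->
  <name>hadoop.htrace.local.file.span.receiver.path</name>
  <value>/tmp/htrace-spans.json</value>
</property>
```

Whichever receiver you name here also has to be on the class path, which is the point that comes next.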
So for example, if you're using the HTraceD span receiver, you would need to add its jar to your class path. There are a few different ways to add things to the class path in Hadoop, which I'm not going to go into here, but the basic idea is that the core API is always on the class path, while the span receiver needs to be put on the class path depending on which one you choose.

So HTrace has a really great community, and it's not just Cloudera; it's also companies like NTT Data, Hortonworks, Facebook, and a bunch of other people from the Apache community. Of course, it's a true open source project, so anyone is welcome to contribute, and we'd love to get more contributors. It's also licensed under the Apache license, obviously, so it's very compatible with the rest of what's going on in the ecosystem. We've done a few different releases. The first Apache release was 3.1, and then we had 3.2 after that. In the last few months, we've done a 4.0 release and a 4.0.1 release. The 4.0 release had a bunch of API cleanups that really made things a lot better, especially when you're tracing a library; 4.0 is much better at that. It makes many fewer assumptions about globals and things like that; we basically got rid of all the globals we possibly could, to make things easy for library authors. We're really big on sharing ideas with other big data projects like Hadoop and HBase. There are other projects in this ecosystem too, like OpenTracing, and X-Trace, which is an academic project. Twitter has a project called Zipkin, which is sort of similar to HTrace in some ways; we have the ability to send spans to Zipkin, so we have that kind of integration. And we'd really like to get more ideas and more use cases from everywhere. I think that's one of the best things about open source: you really easily can.

Some of the recent work we've done: we added more effective error checking in the HTrace client. This was a really big sticking point for some people, because they were forgetting to close spans, and there needed to be a better way of detecting that they had done that. We have an optimized RPC format for HTraceD now, which is nice for reducing the volume of data over the network. We have better integration with HDFS; our HDFS integration is probably the best integration we have in any project, and we're very proud of that. We have a new GUI for visualizing spans; the user interface is actually one of the newest components, probably the newest, and I think one of the most important, because being able to see what's going on adds a lot of the value. We added the ability to tag trace spans with an IP address or hostname; as I said before, this is configurable, but knowing exactly what IP address things came from is another thing I think is super important. And we extended the span ID to 128 bits. The original span ID in the 3.x series was 64 bits, but the problem is that due to the birthday paradox, collisions occur a lot more frequently than you intuitively think they should. They only really tended to happen once you had an enormous volume of spans, but with a 128-bit ID space, collisions are basically impossible now, so we've really future-proofed ourselves in that regard.

Cloudera has also made HTrace available as a Cloudera Labs project. Labs is basically our way of having stuff in beta that people can try out.
It's analogous to some other programs that companies have. Right now we support HDFS tracing, and we're hoping to add HBase very soon. We've made RPMs and debs available for HTraceD, to make it easier for people to actually run this stuff. Because, believe it or not, most people don't actually like to compile things from source and upload them to their cluster. Maybe a few people do; maybe some people here do. But it's certainly nice to have the packages. We also have some minimal integration with Cloudera Manager. We'd like to have better integration, but we have the basics done, which is nice.

So we have a bunch of things planned. We'd like to improve the HTrace integration in HBase and bring it up to HTrace 4; right now we only have some old stuff from the previous HTrace version. We'd like to add a few more annotations to the Hadoop span data; as we use HTrace more, we keep finding more things we'd like to know and don't. Supporting more span receivers would be nice; supporting an HBase span receiver would be pretty cool. We'd like better integration with cluster management systems, certainly CM and potentially other ones too. We'd like to improve the C and C++ support more; the basics are all there, but testing is something we still need to do there. And it would be really nice to have an aggregate view of spans. That's sort of an open design question: what kind of aggregates do we want? It's certainly easier to do a streaming aggregate, like counting the number of spans in the last hour, doing a moving window of some kind. Doing a completely free-form, configurable aggregate is more difficult without tighter integration with one of these SQL or execution engines. So we're probably going to take a look at how we might do that and where we find the most value.

So I have a demo here that I'm going to show. This is a demo of tracing HDFS writes, and it's a little similar to what I showed you in the first few slides: we have a write to HDFS; how do we trace what's going on? Before I start, I'll give you some really basic background. HDFS is a distributed file system, and HDFS by its nature will distribute data across the network. The goal is that we don't want data to be all on one node, as in a traditional file system, because if that node went down, we would lose that data; we also wouldn't gain the parallelism of having the ability to process the data on multiple nodes, which we would like to get. So the goal behind HDFS is certainly to distribute out the data. In this case, we're going to use the FsShell process to write to HDFS. When we do that, there are a few different steps. When FsShell creates a file, it asks the name node, which is a single node that stores HDFS's metadata. After it's asked the name node, it talks to the data nodes. Basically, HDFS sets up a pipeline: each data node writes the data in parallel with passing it along to the next data node. The analogy we like to use is a bucket brigade: people passing water down to the end of the pipeline, then passing the empty buckets back. It's not a perfect analogy, because of course we're not just passing the data along at each data node; we're also writing the data to that local data node. But hopefully it gives a little of the flavor of what's going on.
The specific case we're going to test here is where one of the data nodes is slow, and how we can determine that that's the case. So here's me running hadoop fs -ls. It might be a little hard to see, but this is my shell command up here, and here's a bunch of files that are in the directory. And here's HTraceD; you can see that HTraceD is running, and you can see things like the git hash here, the server release version, all that. So we're going to do a query in the web UI to see if we can find the trace spans we just created. We'll query by time, and we'll also add a predicate that the description is "ls". So here we found the trace span, and we can take a look at what's going on. We can see that the top-level span here is an ls; the whole ls is here. We can see that a large portion of that time is taken by createFileSystem, which is setting up the Hadoop file system objects, and that after that, running the globber takes up some time. The globber ultimately ends up talking to the name node here, as you can see. In this particular case, the name node is on the same machine as the client. That normally wouldn't be the case, but it is in this test cluster; they're all on 10.20.212.10. You can see that as I move the cursor, it changes the current time shown here, not the begin and end. You can double-click on a span to get more span details: the precise begin and end times, the precise duration, and also the arguments here, which is a system-specific annotation we've added. So what arguments was the ls running with? Not very interesting here, but it will be more interesting in most cases, where we're not just doing an ls of slash.

So now we're going to do something a little more interesting, which is running a copyFromLocal operation, which copies some local files into HDFS. This little snippet of shell here is just generating a unique ID, so I don't have to worry about overwriting a file that already exists. You can see that this one took about 4.4 seconds, oh, sorry, 5.5 seconds maybe, whereas this one is definitely taking longer; it took at least an extra second. So we'd like to figure out why one of these requests took longer, because after all, we did the exact same thing. In this cluster, we have four different data nodes as well as a name node, and we'd like to see whose performance is slow, why it's slow, and figure out things like that. The query we're doing here is looking for tracer IDs that contain "DataNode", so this should give us all the trace spans from the data nodes, and we're again filtering by time so we don't see things that happened a long time ago. In this case, we're looking at an opWriteBlockProto span, and you can see that this one had a very high max write-to-disk milliseconds. What that means is that one of our writes to disk took a very long time, and 227 milliseconds is a very long time for a write to a disk to take. So 10.20.212.32 seems to be having some disk problems here. Whereas if we look at some of the other trace spans, we see they're only seeing something like one millisecond of disk write time; probably 1.5 or something that got rounded down, but certainly nowhere near 200 or 250. So the idea here is that we want to find out why this particular request was slow. And here's an example.
So here are all the data nodes we have in the cluster. It's a little hard to see, the font's kind of small, but we have basically 14, 12, 16, and 32. So if we log into 10.20.212.32, we see that things are actually not going very well; we have a really high disk load on this one. So here's an example of using metrics, right? This is a very primitive example; normally you'd use something like CM. But this is a really basic way of getting metrics, just looking at iostat. You can see there's almost nothing going on on A2404, but there's a ton going on on A2424. So that really is the explanation for why you're seeing such a high maximum write latency: A2424 is just really swamped. And I can tell you I did that deliberately; I had a tool generating tons of I/O load on that one data node. So that's what you're seeing, and that's what's reflected here in the fact that this is taking eight seconds and this was taking, not quite half that, but much less time. I'd also like to say that every time you run a shell command in Hadoop, it gives a misleading idea of Hadoop's performance, because you're spinning up the JVM, and that's a substantial amount of time. But because we're paying that cost in both of these cases, we can compare the two cases to each other without any bias there.

And here's another example of me looking at the data node spans, where we're going to try to find the top-level requests they came from. Here's a copyFromLocal that took just about six seconds, and you can see that 10.20.212.32 was involved in this one; it was the last data node in the pipeline. And again, it has that really high write latency. So this corresponds to the second request we were looking at. Now, if we look at a request that took less time, we see that this request only took about 2.5 seconds. If we take a closer look at it, we see that all the write latencies are really low, for one thing, and we also see that the 10.20.212.32 data node is not involved here. So having that data node involved clearly blows up the time. And this is exactly what we're looking at here, right? The last node in the pipeline was really slow, and it slowed down the whole request with its long tail of latency. We can see this clearly in the web UI, whereas if we're just looking at metrics, it's a little bit harder to see, and if we're looking at logs, it can be harder still, because we would need to somehow correlate the logs for that copyFromLocal across multiple nodes. We could do that, and I've done it in the past, but it's easier to see everything on one screen than to look up each log file individually. So these things are complementary, right? Once you determine that a request is slow, you can start looking at metrics, you can start looking at logs, and you can start examining the rest of the system to dig further into why the request was slow.

So, cool. Any questions? This is our question and answer time. Yeah, yeah, Consul. I guess I'm not completely familiar with Consul; are you saying Consul is a particular configuration management system, or what? Okay, oh, Consul. Oh, okay, I thought you said "the console." I'm like, well, yeah, I guess you can use it from the console. I think the smartest thing for us to do is to piggyback on the configuration of whatever system we're tracing.
So if that system uses Consul, then absolutely. If it doesn't, then I don't think we would want to require Consul just to do that. I mean, it's not something I would veto or anything, but if I'm an admin, I just want to do the config I'm already doing. For example, we're working on integrating this into YCSB, which started life as the Yahoo Cloud Serving Benchmark, something like that; it's not just Yahoo anymore, but it's a very popular big data benchmark. And its configuration management is different from Hadoop's, because it's not a Hadoop project. So we want to integrate into that project's config, which seems to be mostly based on Java properties. So again, if your project likes using Java properties for config, we'll use that; if you like using XML files, we'll use that. It's just a layer of glue you have to put in, to use whatever you want.

Yeah, no, not currently. We have some that are planned, but currently we don't have any projects using it in the C++ world. We'd like to, and we wrote the client. We spent a lot of time thinking about the API, by the way; I'm an ex-C++ programmer, so I sort of know my way around library APIs in C++, and hopefully we've done everything we need to do to make that all work. We have some plans for next projects, but I can't say anything about them yet. Correct, yeah. We would love to have more people kicking the tires on that.

That's a good question. So, accounting is an interesting way of putting it. HTrace is really about requests at the moment: the idea is we annotate requests, and we follow requests across the network. And requests are slow for a variety of reasons: maybe they're slow because of the network, maybe they're slow because of disk. I think maybe this gets to the heart of your question. If you experience a really long latency, you can annotate the span with that latency. One of the approaches we took in HDFS was to annotate each trace span with the maximum disk latency we saw during that span. Now, we could take the same approach with networking too, right? We can annotate with the maximum network latency we saw. Typically, one way we see network latencies is, how do I put this, if a span on one node starts a lot later than a span on another node, you start to ask questions about why that happened. If there's normally almost no delay between starting them, but then there's a large delay, you have to wonder why. I think your question also touches on the aggregation question: what should we be aggregating? I have a friend who's working on running MapReduce jobs over some HTrace output so he can try to find patterns and things, which I think is a really interesting idea. If you have this uniform sampling going on, you can even use it to answer questions like: this job, on average, makes writes of what size, exactly? So I wouldn't say HTrace is just about latency, even though latency is probably the most important use case; there's other stuff we can find out once we have these annotations.

That's a really good question. So X-Trace, I think, is more of an academic project. I haven't heard of X-Trace being deployed, although maybe I just missed it and it was deployed somewhere. It's really hard for me to say never, because everything's deployed somewhere, right?
But I haven't seen a lot of activity there. I get the impression X-Trace was a more academic endeavor, which hasn't had many contributions recently and is older than many of these things. Well, Zipkin is an interesting project which is still going on, still somewhat alive, and I actually talk to the people behind Zipkin a lot; Adrian Cole is the main guy behind it at the moment. Zipkin started from a little bit of a different point of view than HTrace: it started out very tightly integrated with the Finagle RPC system at Twitter. I think they've been trying to move away from that, but that's where they started. Zipkin also started from the perspective of a very fixed tech stack, as in, you will use this stack, period. Again, that's something they're trying to migrate away from. So one way of looking at it is that we're trying to gain their features, like a web UI, and they're trying to gain our flexibility, and maybe we'll meet somewhere in the middle. Some of the more important distinctions between HTrace and Zipkin at the moment: HTrace supports the concept of multiple parents for each trace span, whereas Zipkin does not. There's some debate about whether we could use different ways of representing these relationships, but currently that's our way of representing them. There are a few other distinctions, such as the data model being a little different: because Zipkin enforces a tree model, they have a trace span for the whole tree, whereas we don't really have that. I'm trying to remember what the other distinctions are. Zipkin obviously doesn't have Hadoop integration of any kind, so a lot of the people using it with Hadoop are just hacking things in and keeping that patch on top; they don't necessarily contribute it back to the community. Integration is one reason why we added the ability for HTrace to talk to Zipkin, so that people using Hadoop could also use Zipkin if they want to. I think people should have the freedom to choose what they want to do. But yeah, they're similar as projects, and they have similar-ish focuses. We're trying to do a little bit of standardization, well, not really standardization, but there's another effort called OpenTracing, where people are trying to create a tracing API that can sit on top of other systems, and that's a work in progress, I would say. I think it's a little like POSIX in a way. Personally, I'm really happy with a lot of the decisions we made in HTrace, but on the other hand, I also think people should have the ability to use what they want.

Good question. Well, I would love to see it in Impala, but we'd have to spend some time doing that. One of the things we're thinking about is how we can effectively correlate what's going on in a higher-level system with what's going on in HDFS and HBase. For example, if we have a MapReduce job or an Impala query, we don't want the whole thing to be one trace span, because that would just be enormous. But on the other hand, we don't want it to be untraced. So we have to think about the best way to do that, and probably we'll end up with some way of correlating things by an ID that we can match up; that's probably the best way to go about it. The annotations in HDFS are upstream; they're in the 2.8 release of Apache Hadoop.
So it's not just a vendor thing, although I would argue the support is best in our distribution, but that's just the way it is. It's upstream. We strongly believe in doing things upstream at Cloudera, so similar to how Red Hat does things, we actually get the patch into the upstream release before we put it into our release. On the other hand, we also do take things from, let's say, an unstable branch at times. Just to give one example, which isn't related to HTrace: the native MapReduce task stuff is in trunk right now, but it's also in CDH5, just because we feel it's stable, even though we didn't manage to convince the upstream community that it is. But we do have an upstream-first policy that's very robust.

That's a good question. No, HTrace is not instrumented with HTrace. Circularity is kind of a problem in a lot of tracing systems. It's especially a problem when your tracing system is implemented on top of your normal storage backend. It's not an insurmountable problem. I've heard people argue that the best systems are the ones that use your existing storage and your existing compute, and there's a lot to be said for that, but it also does create circularity problems. And I would argue that tracing HTrace would certainly create circularity problems like that, because you don't want to create an infinite amount of traces based on one input, right? All right. Other questions? Oh my gosh, we're gonna finish. Oh, wow, almost on time. All right, thanks a lot, everyone. Thanks for coming by today, and if you have any more questions, just send me an email: cmccabe at apache. We'd love to see more people in the community, more people asking questions and trying it out. It'd be great.

Hi everyone. How was everybody's lunch? Great. I see we have that little dip there right after our meal here. I wanted to take a little survey. How many people here have used Krita before? Okay, about three or four. And how many folks here are current users of GIMP? Okay, good, good; about half of the room. And commercial applications like, say, Photoshop? Okay, about three or four. Well, I hope I can share with you some of the things I've learned over the last few months about Krita. I'm very glad to see you here. So, let me tell you a little about myself. First off, I'm currently a STEM teacher, a science, technology, engineering, math teacher, over at Sylvan Learning, providing training to elementary school students from kindergartners all the way to sixth graders. A lot of times, what we do is show students at these ages how to use painting apps. And because I work with elementary schools, because of funding issues and things, Krita is a way to introduce kids to how to paint. So, we'll talk about what Krita is, and maybe the pronunciation as well, the history behind it, some installation tips, basic concepts, and possibly, if we have some additional time, a demo. So let's get going. This workshop is at an introductory level for Krita.
But we're trying to cover various topics that might be of interest to both beginning users and advanced users. And I just want to point out that I am an engineer with basic art skills, so you might see a little bit of that. I'm not exactly a Van Gogh, but I'll definitely show you, if you are a top-notch artist, what types of techniques people at that level are currently using. So, the pronunciation. A lot of it has been anecdotal; I've been trying to Google to see what I could find. I've seen "KREE-ta," I've seen "KRIH-ta," so you can kind of pick your own. It looks like the very top one is the way the people who set up the project pronounce it, but even that looked a bit anecdotal; I didn't see an exact source for it.

So Krita is a digital painting program. You can use it for pixel art, you can edit photos, you can create web banners, you can do some retouching of photos. Let's say someone is frowning; you can change their frown into a smile. You can do some color corrections, things like that. A brief history of Krita: back in 1998, someone created a fork of GIMP, which proved a little controversial, and that effort spun off into KOffice as KImageShop. In case you're wondering why the K is on my slides here, Krita was originally part of the KDE project. But don't worry, because Krita does run on GNOME systems, and I've tested that a little. There are a few wrinkles, though, so I do forewarn you: if you find some things not quite working right, maybe try a parallel installation of the KDE workspace. From that, around 2002, it was renamed to Krita to avoid potential trademark infringements. Development ramped up, and originally there was talk of Krita being sort of Photoshop-like, sort of GIMP-like, but over time it developed into an app for digital painting. However, you will find there is a lot of overlap among this app and the other apps we've discussed.

Some of the comparisons you'll see on the board here: Krita is open source; Photoshop is commercial, so you have to pay for a subscription in the cloud. You can do digital painting in Krita, and image editing; in Photoshop you can do image editing and painting as well. There are many types of brush engines. You have the color-to-alpha filter, which allows you to recolor. Various blend modes; I was pretty surprised, we'll only be covering a subset of those, because there's so much experimentation you can do on a day-to-day basis. Drawing assistants and multibrushes, those are features I haven't found yet in Photoshop, and Layer Styles, which are called Layer Effects in Photoshop if you're used to that. Comparing with GIMP: both programs are open source. Krita covers drawing, painting, and photo editing; in GIMP you can do some photo editing and a little bit of painting here and there. Krita's text tool is a little quirky, and GIMP's text tool is a little quirky too; I found the operation could be improved over there. Brushes are identifiable, or previewable, if you mouse over them in Krita, whereas in GIMP, a lot of times you're moving your mouse to the paint brush and trying to figure out, okay, what is this? And there's a thing called the pop-up palette in Krita; in GIMP it's either not implemented, or I haven't seen anything quite like it yet. So a frequently asked question for me is: which of these applications should you use? I say try them all.
There is quite a bit of difference among these apps, such that you may find one app more usable than another, and vice versa. As you'll see in some upcoming slides, there are things like multibrushes which are very useful for folks. In terms of installing, what I generally use is the Fedora Design Suite spin, which is at this link, and the slides are online; they'll also be posted after this presentation. You can also perform a dnf install of calligra-krita to add just Krita. There's a distribution called Chakra which ships Krita out of the box, but the distribution is a little tricky to set up. If you have a different distribution, like Ubuntu or various others, you can install using either yum or dnf or whatever flavor of installer you have. And the latest and greatest, if you like hacking away, is you can try building it from scratch. For Mac, what's recommended is to turn OpenGL support off, because it's a little tenuous at this time; people are currently working on it, and the hope is that if there are other Mac developers out there, they can assist with that issue. For PCs, you install as you normally do, using that operating system's installer, and for Linux, try testing with different workspaces, including KDE. So, a walkthrough on Fedora, which I've outlined here (and which is sketched as commands after this section): you basically perform a dnf update, you install calligra-krita, or calligra if you want the entire suite, and then, as a root-privileged user, if you would like to include KDE, you perform the additional steps; then you run Krita as a regular user.

So here's the user interface. At the very top, the link, excuse me, the link below details the various things we're about to discuss, but I wanted to include it there so that when you try this out at home, you have a chance to survey the Krita interface. What you can do is right-click, and you then have access to the pop-up palette. If you notice, these swatches here give you a gamut of colors, and these swatches here are swatches that you are presently using in your picture, so you can go back to colors you previously used. And these tools over here are part of your defaults, but you can configure them, and we'll be showing in an upcoming slide how to set that up. If you want to view the canvas only, you hit the Tab key. A lot of times when people are painting or drawing, they don't want to look at any of the menus; they just want to look at the picture. I wanted to include this slide because sometimes you just want to look at the canvas and focus on painting and painting only. What people tend to do when they're working this way is use the right click of the mouse to access the tools, so they don't have to worry about those additional menus and can just focus on drawing. To zoom in and out, it's similar to other applications you've used: you use the scroll wheel on your mouse. The menus at the top of the screen are the image menu, which affects the entire image; the layer menu, which affects individual layers; the select menu, which affects selections, and we'll be discussing that in a few minutes; and the filter menu, which applies the various filters at your disposal, very similar to other apps you may have used, GIMP or Photoshop. The tools menu is for macros, if you want to record operations, and the settings menu allows you to configure keyboard shortcuts.
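Going back to the Fedora walkthrough above, the steps amount to something like the following. The package names (calligra-krita, calligra) match the Fedora/Calligra packaging of this era; the KDE group name is my assumption, so check dnf group list on your system:

```
sudo dnf update
sudo dnf install calligra-krita        # or "calligra" for the entire suite
# Optional: add a KDE workspace so you can test Krita under KDE, as suggested.
sudo dnf group install "KDE Plasma Workspaces"
krita                                  # then run Krita as a regular user
```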
The Krita tools on the left side: you have the vector tools, the drawing tools, the manipulation tools, the color tools, and the selection tools. The vector tools include the pointer and text tool, and you have various shapes. You have the paint brush, which is very, very useful. The manipulation tools allow you to crop your photo and move things around. You also have the coloring tools, so you can fill; this is pretty similar to GIMP. And you have the eyedropper tool, in case you need to sample things off the web and would like to do some color matching. And then various selection tools; as you notice, there are different shapes.

So one introductory term for Krita is something called a brush preset. By default, it's in the right-side docker, in the second tab. When you click this button here, called brush presets, it gives you an assortment of brushes, such as markers, sketching pens, airbrushes, and things like that. There's such an assortment of these that we can't cover it all in this hour, so I leave that to you as a way of experimenting further. To adjust brush size, you press the left bracket key or the right bracket key, or you can hold Shift and drag with the left mouse button to change the size. And there are these two buttons here for symmetry mode, which let you create axes for reflection. So you can choose a brush size and paint and make snowflakes and jewelry items. For example, what I did was I took a blue and a purple brush, and we have axes of reflection over here as well as over here, and just the simple act of drawing this would create this on the other four sides. There are these things called themes, and you may have seen this in other applications: if you like to work in the dark, there is Krita Dark, and there's also Krita Bright if you like bright workspaces. We also have, on the settings menu, the ability to configure these toolbars to your exact specifications. Do we have any questions so far before I continue?

So, the dockers. You can click these in and out by simply dragging these boxes on the left-hand and right-hand sides of your screen, and there are buttons on the individual dockers to lock them in place. When you're happy with the way you've set things up, you can click on this button here to create the workspace icon, then type in a new name, and this will save your settings. That way, if things get moved around and look strange and you'd like to reset back to your workspace, you can go ahead and restore it. Wrap-around mode: if you like working with tiled backgrounds for web pages, you hit the W key, and it allows you to look at what you're drawing as it would appear in a tiled environment. It's very useful for folks who work that way. So, for example, this tile I created here is then shown like this, so this could be part of some sort of background, let's say. To pan the canvas, you hold down the wheel button of your mouse, so you can move around; you can also hold the space bar and mouse around to move the canvas. To rotate the canvas, you press Ctrl+left bracket or Ctrl+right bracket, or you can use the 4, 5, and 6 keys: 4 rotates counterclockwise, 6 rotates clockwise, and 5 resets back to the regular orientation. And to lighten and darken colors, you can press the L or K keys.
For exporting, you hit the File menu, then Export, and export to either PDFs or regular JPEGs, TIFFs, things like that. So let's talk a little about layers. How many folks here have used layers before, in either GIMP or Photoshop? Okay, so it looks like about half of you. Let's talk a little bit more, then: think of layers as clear sheets of plastic stacked atop a background. The bottom layer is the background, which is either transparent (for those who work on web banners, you definitely don't want a white background showing up, you just want the picture) or opaque, if you want a nice canvas behind it. You can group these layers, such as creating a face layer group which could have eyes, nose, hair, and so on. As for the various buttons in the Layers dialog: you have a visibility icon, which shows whether the layer is displayed or not. You have a lock icon here; a lot of times people will lock their layers down once things look perfect, because as you're moving around here, if you start accidentally tweaking a layer, it could turn disastrous, so a lot of people will lock that down. Then you also have the alpha button, which is for transparency; if you work with web-type banners, this is definitely a button to use, as is alpha inheritance. You have these buttons here, which are layer operations, very similar to Photoshop and GIMP if you've used them before. So, just a survey, since we have about 50% here: you have adding of a layer. This little button here doesn't just add a layer; you have various choices of layers, so you have transparency layers, filter layers, filter masks, so on and so forth. I wanted to point that out. You also have this button here, which allows you to duplicate a layer, move the layer down, move the layer up, pull it out of a group and put it into a group; layer properties; and finally the trash button here, which is to remove the layer. You can also add a layer using the Insert key; that's a shortcut that people coming from an Adobe-type product tend to use a lot. To remove a layer, you can also right-click and remove the layer when you're in the layer docker. And you can press Ctrl+J, which is a very useful shortcut to duplicate a layer. So the case study for the sunny day here is that we have our background here, which is a gray canvas; we have a ground layer which I created; a vegetation layer which I created using the various brushes there; a sky layer, which is just blue paint; and finally the sun layer. So that's an example of layers if you haven't seen them before. They do add extra complexity, so be sure to save your work first and then flatten the layers if you no longer need the sublayers: right-click that layer and choose flatten image or flatten layer. I do caution you to save first, because sometimes you want to go back to a previous version and see what the unflattened image layers looked like. Do you have a question? Mm-hmm, yes it does, because what's happening is each layer takes up quite a bit of memory, as well as space on your hard drive, so as you flatten it, there's less complexity there. Each individual layer you're applying could potentially have a lot of filters attached, and it's extra stuff for Krita to keep track of. Thank you for asking that. So, how would you create the following three layers? You have a bottom canvas, in the middle you have a sketch layer, and then on top you have an ink layer.
And a lot of times, what artists will typically do is create a basic sketch and then, on top of that, start inking that sketch. So I wanted to ask: what would you do first? Does anyone know? Who's created layers before? Correct. You would start lower: when you create a new project, you can have it fill in the canvas color, then you can create an additional layer called a sketch layer using this button here, the plus button, or the Insert key, and then last, create another layer called an ink layer. To undo and redo, you press Ctrl+Z, or Ctrl+Shift+Z to redo what you have just undone. And to go into erase mode, if you're not quite happy with how something turned out, you hit the E key. Any brush or drawing tool can be turned into an eraser; that sounds trivial, but it's actually not, so it's really handy to have. To mirror: a lot of times, if you need a mirror image of something, you press the M key, and that gives you a flipped image. So if you need to get something reoriented correctly, that's how you would do it. There's a mode called stabilizer mode that I wanted to point out to folks using a mouse, because a lot of times when you're drawing on the screen, it looks very jittery. So if you haven't purchased a tablet yet (Wacom tablets definitely work with this), but you're just starting out with a mouse and want a nicer feel to it, you have the option of changing the smoothing dropdown from basic smoothing to stabilizer. So definitely look out for that. So let's talk a little about foreground and background colors. For folks who've used an Adobe product, you should be very familiar with this. This little box here is the foreground color, and this box is the background color, and at any time you can press either box and change the color of the brush that's currently painting. You have this button here, which allows you to swap the foreground color with the background color, and this button here resets things to the default black foreground and white background. This is useful to keep track of, because as you are painting, a lot of times you'll start drawing and then all of a sudden: why is everything showing up white, or why can't I see what I'm painting? Well, you may find that you've accidentally swapped these and you're painting white over your canvas, which you kind of don't want to do. So definitely, when you're starting out, watch for that. To fill a layer with color, as we were populating those three layers earlier, you hit the Backspace key to fill with the background color, or Shift+Backspace to fill with the foreground color; this is great for masked photos, which we may see later. There's also this feature called color history. Basically, you have a docker here called the advanced color selector. It should be shown by default, but if it isn't, you can go into the Settings menu and display it. What it does is, as you are painting, it keeps track of the various colors you've used in your picture. So if you liked this green color that you've been using, you can go back here: it's recorded, so you don't have to find it again. Now let's talk a little about brush settings. The brush settings, or the brush presets, are the ways for you to change what your painting looks like. As we've discussed previously, you have pencils, brushes, sketch brushes, airbrushes, clone tools, things like that at your disposal. So many, so many.
I've counted probably about 80 to 100 for you to experiment with, so definitely try those out. And you click on the Tool Options docker to play around with the various settings. From the top, you have gradients and fill patterns, which you've perhaps seen in other graphical applications. You have the brush engines and how to erase, preserving the alpha; you have the brush sizes, opacity, and flow (opacity determines whether or not you can see through that layer down into the next layers); and also the mirror buttons that we discussed earlier. For scaling, you want to scale to a new size under the Image menu, and you change to the larger or smaller size accordingly. So if you have a photo that seems just a little bit too large, that's the way you would do it. You have blending modes, which are algorithms that describe how colors are mixed with the other layers, and here is a page that I'll leave for you to experiment with later about working with these blending modes. The things I'd like to leave with you in terms of blending modes are the popular ones. You have normal. You have multiply, which allows you to darken layers: a lot of times, if I'm handed a photo which is pretty light, say a photo from a relative that's been fading over time, multiply is a very useful mode for darkening your photos. You can also screen, which is the opposite operation, if you've received a photo which is way too dark and you need to lighten it up. You also have overlay and soft light, which we won't go over now, but I just wanted to leave you with those; they're two of the more popular ones. You also have a color blending mode, and this is useful for, say, wedding photos or grisaille grayscale underpaintings. So rather than discuss this slide further, let's just go into this picture as an example. Say you have a happy bride here, and she wanted to show her flower vase here. What I've done is I've created a new layer (I've duplicated it), and then I just changed the blending mode to color, and then carefully painted purple over this vase. You've seen that technique used a lot in wedding photos, where the bride is in color and everything else is black and white. This is the way you would do it: you take a photo, make it black and white, and then just color this right in using the color blending mode. For shapes, you have various shapes: squares, circles, polygons, and polylines, which are similar to polygons but not connected at the end. You can also transform, which allows you to distort or adjust your pictures. A lot of times, if you want to squash things, or if you want to make them bigger or smaller but preserve the aspect ratio, you hold the Shift key as you're dragging your selection. And then you have the perspective transform, which you get by pressing Ctrl+T: you go to the Tool Options docker, which is on the right-hand side, and choose the perspective tool, the second tool here, and then you can drag the handles of the object. So say you have a photo that you would like to apply along the side of a wall, but it's in three-dimensional space, and the wall is perpendicular or off at an angle.
This is how you would apply the photo onto that, and you can see the applications for it, because a lot of times if you have, say, jewelry or something where you would like to apply a photo at an odd angle, this is the tool to work with. The Filter menu works like the filter menu in GIMP or in Adobe products, and there are so many different filters available. The ones I wanted to point out that you might use a lot, starting out, are the Gaussian blur, which allows you to create a smooth blurring on an image. You can also use the Gaussian blur if, say, you're working with a photo and you want to redact certain things that shouldn't be there; that's one way you can blur them out. You also have something called G'MIC, which is GREYC's Magic for Image Computing. There seem to be a hundred or more of these various filters, and you might not be familiar with them, so the ones I at least want to point out for you to experiment with are the Frames Polaroid and Frames Fuzzy ones. So let's take a look at that. Remember our regular photo with the bride here. What we've done is create a polaroid, and one of the options there is for adding an angle, because a lot of times polaroids might not be set at a regular angle. This one here is called Frames Fuzzy; as you can see, the photo on the edges is a little bit fuzzy here. It wasn't like that before, as you can see over here. But there are so many more of those, countless, so I encourage you to experiment more and take a look at G'MIC. Finally, you have the text tool here, which allows you to add your signature to either your paintings or your photos, and you can apply layer styles to these; we'll be covering layer styles in about a minute or two. Note that you can interleave these vector layers with the raster layers. Raster layers are the paint layers that we've been discussing, and you can have multiple layers of each. To add to your favorite presets, or that pop-up palette that I was discussing earlier: go to the brush presets docker, right-click on a desired brush preset, and choose assign to tag; then you choose favorite preset, or any other group. When you're ready to use that brush preset, right-click on the canvas and you'll see that your preset has been added. So, returning to our example of jewelry or a snowflake: you have symmetry mode. There are two triangular buttons at the top of your screen. You choose a brush, you choose a color, you choose a size, and then you draw. You can also use the multibrush tool, which will be on your left-hand side, and repeat. One nice thing about the multibrush is that you can add additional axes, so you can have, say, 13 leaves. Over here, for example, I've changed this to 13 by dragging and moving this up or down accordingly. Our next case study is the frown to a smile. Our friend here, when he had his picture taken, wasn't smiling very much, and maybe this is the type of photo you wouldn't exactly like to share with folks, so we'd like to cheer him up a little. What we've done, in terms of procedure, is use the transform tool (hit Ctrl+T), but you then want to press this button here, which looks like a teardrop. This is called the liquify tool. If you've used that tool in other applications, Adobe ones for example, it allows you to essentially nip and tuck on a photo.
And the mode you want to choose is build up. It should be selected by default, but I wanted to point that out to you. And for a size, you would choose a good size; in terms of a good size, I would probably make it around the size of his cheek over here, so maybe about here. Because what the tuck is doing is you basically want to grab this section and then just tuck a little bit, but not too much, because if you tuck too much, it starts to have this thing that we like to call the Joker effect, where he ends up smiling very crazily. You don't want that to happen unless that's your intended effect. And then finally, another way to do this is to go into the tool options and choose warp. You can create anchor points, lock them, and then start dragging those points. It's not as clean as, say, liquify, but I've included it here in case you find an application where it works better for you. So you do have a warp tool, similar to other graphics products. Our next case is the penguin looking up. In this particular picture, we have this penguin here, a great auk. We'd like to basically make him look up a little bit more, and maybe also adjust how his arm, or I guess wing, sits. What we've done is use the transform tool, Ctrl+T once again, and you pick this tool here, which is known as the cage tool, which you may or may not find in other apps. You draw a cage around the object: what I've done is start here and then just keep clicking, so on and so forth, and then return back to my first point in order to close this cage here. When I was done, I took this point over here, let's say, and just tweaked it up a little bit, and maybe this point here, tweaking it just a little. And what ends up happening is ever so slight, but as you can see, a different look from before and after. So, for folks who like to work on vision boards or collages, where you take photos and arrange them: vision boards I'd highly recommend, if you don't have one, because they're a way to take a look at what matters to you and is important in your life. Basically, what you do is start a new document and choose a custom size. I typically do a letter size, but what some folks like to do is change the size to suit, let's say, a background for the desktop; just find out what the pixel dimensions are and use those. Then, under the Layer menu, choose Import/Export, then Import, and choose each picture one by one. Then, in the Layer tab, you highlight each layer, hit Ctrl+T, then Shift-click and drag the picture's corner handles, and move the layers to an ideal location. You can optionally rotate these pictures. So as an example, we have these four photos here, and after turning them into a collage, with some tweaking, I ended up making this collage. I could optionally move this picture to the left here so that I cut this out, but you get the idea. One important thing is blemish or debris removal. A lot of times, someone has a blemish on their face that you'd like to remove, or perhaps there's some type of dust on a photo, like a dark speck, that you'd like to remove. What you do is go to your edit brush settings, which is on the top toolbar that we've discussed previously.
You click on the Clone Brush Engine, which is on the left-hand side, hold down the Control key, and click on an ideal source: a clean surface with no visible defects. Release the Control key and click on the blemish to remove it. So, for example, for this case study, we've taken this wrinkle here, and what I would do is sample from her forehead here using the Control key, and then start painting over on her face here to remove that wrinkle, in case that's something you'd like to do. Or if the person has a blemish, you can do that too. For a transparency mask, you can use this to create vignettes, and I've left a procedure here for you to try out and experiment with as well; the instructions are all there. As for layer styles: a lot of times, what people would like to do is create a banner for their web page, or maybe apply drop shadows to a headline. What you do is create your text box using the text tool, then on the Layer menu apply a layer style, and you use the FX button to view your applied style. So, what's currently missing from Krita that I've been looking for is snapping to grids or snapping to the rulers. I'd like some feature that allows me to lock my selections to the rulers and some guidelines, and also nudging of layers. But this is scheduled for version three, so look out for it. And if you like working with the app, one thing you can do to help is develop for Krita at krita.org; get involved. You can learn and use Krita, teach it and tell others about it, and fund Krita development. So, just to wrap up: use Krita to paint, edit or retouch photos, make web banners, and much more. Are there any questions? Yeah, I mean, you could select the entire photo and move it, but a lot of times, like say for a collage, you just want to nudge that one photo. You don't want to move everything; just like if you have photos on a light board or on a canvas, you want to move the individual photos around. Are there any other questions? Well, okay, so your question, to repeat for the audience, is: is there a way to take that raster and then export it out as a vector or something? I'm not sure about that, but I can research that for you and let you know, so maybe we'll talk about it a little later. Good question, though. By the way, the online evaluation for this presentation is located here, but this is also part of the slide deck, so you don't have to write it down now; you can take a look at it later. And please feel free to stop by the Fedora table and say hello; I'd like to chat with you. So thank you for attending, and here's some fun stuff. We have keyboard shortcuts: if you go to this link here, you can see a graphical representation of all the shortcuts that I've shown you and much more. David Revoy has a bunch of artistic brushes here that he's created, and I'd highly recommend this bundle. Just go to this website and import it into your workflow here; you may need to restart Krita. What you can then do is click on these new brushes that show up, and you'll be amazed at what you can do with them. So, for an aging document, let's say, we'll go to Krita. Or actually, let's go here. Okay, so for the cage example here: we have this penguin here, and we would like to tweak his beak. What you would do is hit Ctrl+T, and then you go into the tool options here, which is on the right-hand side.
And here's that liquify tool that I showed you earlier, but here's also that cage tool; this is the tool we would like to work with. If you notice, the background is a little bit off-white, so we may need some cleanup later, but just to give you the idea of how the cage tool works: I'll just click around here, and then, to start tweaking, you can move these up a little more, like that, and maybe make him look a little more over here. As you can see, there is a little bit of cleanup that we might need to do later, but we can use the fill tool to color-match here and clean that up. So there's one example. Another one is the jewelry or snowflake case. So let's go ahead and get rid of that. You have these two symmetry tools here; what they do is create these axes for you to draw around. Then I want to choose a paintbrush, so I click on the paintbrush tool, and I'm using the bracket keys on my keyboard to choose a brush size. On the right-hand side over here, if you follow my mouse, I'm just choosing the color blue. A lot of times when you're first starting out, you'll see that the pointer might be over here on white, and no matter what you do, if you start clicking around, it's like: I can't see anything. Well, you're drawing white on top of a white canvas, so just change your color and you're all set. And if you notice, when I click here and start drawing here, there is symmetry applied in this picture. And as I pick different colors (remember that color history in the upper right-hand corner), it's keeping track, so I can always go back to a previous color and just continue painting. Similarly, you can use something called a multibrush, which is over here on the left-hand side. If I click here, and let's get rid of our symmetry axes now because we don't need them, and if I choose, say, I don't know, 10 axes, you can do something like this. So it's like a kaleidoscope of 10 different angles, I guess, as the case may be. Then you can just pick different colors. Any questions on that before we go on? So that's a feature that I haven't seen yet in some of the other apps. So if I go to File, Open, and go to, how about blemish removal? Because I'm sure folks here would like to see that. What I'm doing is using the mouse wheel to zoom in on the affected area. And basically, what you would do is, watch me move the mouse over here to edit brush settings: what I want is the clone tool here. And then, when I'm done here, hit Escape. If you notice, the brush is pretty huge, really, really large, and so I want to bring the brush size down; I'm using the left bracket to reduce the size of the brush. Notice also that if I press the Shift key and use the left mouse button, I can change the size that way. But I want to make it about the size of the blemish here, or in this case, the wrinkle. And what I would like to do is sample from this section here. So what I do is press the Control key. And due to an OpenGL fun thing right now, that weird pink box is showing up on my screen, but it'll hopefully not do that on your screen. So you press the Control key and then just click once in this area. Let's see, maybe make it a little smaller here. Hit Control. And so this section here will be the sample area. Then, if I press here, it'll clone a portion of this but mix it in with the layer.
So that's an example of the clone tool. Similarly, you can go here, maybe, and clean that up a little, things like that: clone here, over here. Let's see, do we have some more questions, or shall I keep going? Yes, so if you search on YouTube, there is actually a pixel art demonstration. There's an initial process in setting it up; I think it's about 12 to 15 minutes. So it's a bit beyond the scope of an intro class, but if you work with pixel art a lot, it's a very good video. What I can do is help you find that link if you see me at the Fedora table. All right, so any other questions? Yes, the slides will be available either on SCALE's website, or if you write this down quickly, you can get them today, or stop by the Fedora table later and I can get you this link. So, the last set of bonus slides, in case you're interested. Let's see. With Krita, I've left the instructions for you on how to make this aging document perhaps look a little better. Right now it's kind of yellowing, so what I've done is darken the document using multiply mode, and I've also replaced the color of this with blue. Obviously, I picked blue arbitrarily; you can pick whatever color you'd like. I did the blue just for effect. I've left you a few resources and brushes, studies on why people may be switching from different applications, some DVDs that you can purchase, like the Muses training DVD, and a copy of the Krita manual. If you like living on the edge, there is an alpha that just came out recently, about a week ago, so there's a link for that as well. And in terms of the photo citations, there you go. So if you have any other questions, I will be at the Fedora table. I'll be here for probably about five or ten minutes while I close up shop here, if you'd like me to quickly demo something for you. Thank you very much again for attending our presentation here at SCALE 14X, and I look forward to seeing you again. Take care. In terms of what? It's different. I mean, again, there are different tools at your disposal. Well, as I mentioned earlier in the presentation, try the tool out; see if it works for you or if it doesn't. It's not a question of what's better or not; I know some people who will use both GIMP and Krita. But it's an alternative. Thank you for asking. Sure, I will go back to slide number one. Yeah, so go ahead, if you want to take a picture of that now. Geez. Thank you. Thank you. Thank you for coming. Yeah, what I've found with the text tool is that it's a little quirky. What I've found is that when I first set up text, it'll show up really, really large, and then I have to dial down the font size. So it could use a little work. Oh, like how to change the layer style or something like that? No, just set the font size. Oh, okay. You can't do anything other than change the layer. Okay, so what I do in that particular case: say I click on the text tool here, then I click inside here and start changing the size over here in the tool options, which is on the right-hand side here. Yeah. I would get that. Yeah, so you have to highlight the text and then you start. How do you highlight the text? Oh, so what I did was I clicked between the R and the T in "artist" and then just started dialing it down. Oh, I hit Ctrl+A. Oh, okay. Yeah, for select all. You don't have to select all, because the thing with Krita is it allows you to select just individual characters.
Well, I'm going to figure out how to do it with a mouse. Oh, okay. Okay. Ctrl+A. Okay, that'll work. Yeah, so you have to click in to get your cursor in there, and then you hit Ctrl+A to highlight everything, and then it starts to jump down. Just select a couple letters, then. Yeah. Okay, so to get to the clone tool: click on Brush to activate the brushes, then you click here on edit brush settings, and here's the clone tool. However, Deevad here: if you go into the brush presets, notice the stock brushes are everything except for Deevad's. If you click on Deevad here, his brush is one of the, oh, interesting, I'm wondering why that worked. Okay. One of his tools is a clone tool that you can also use. So let's see. Oh, here we go. And if you click there, notice how it has a clone tool: if you hit the Control button, it'll sample, and then you can start cloning from that. Of course, I'm sampling from nothing right now, so that's kind of pointless, but yeah. Yeah. Yeah, so. Have you ever played with the paint tool? No, I haven't. Okay. Now that one. Okay. You ever used that? Paint Tool, SAI? SAI. Okay, I'll look for it. Ah, okay. Great, thank you. Do you have a question before I, hey. Yeah, see ya. Or do you have time for lunch, or are you? Hey guys, it is three. You guys are in the right place. We had a slight delay, but we'll get this going for you as quick as we can, so just hang out. It's in 1024. Feel free to do that. You can go. Yeah. The application track is here; it's a big group. And you're here to hear about home automation; I think this is like ten years in the making. So I will let Bruce get right to it. It is. Thank you very much, everybody. Apologies for the three-minute delay. I am super excited to be giving this talk, because while I do Postgres for a living, this is probably one of the more interesting things that I do when I'm not working. My name is Bruce Momjian. I'm one of the Postgres core team members. I gave a couple of talks on Thursday and Friday at the sort of Postgres mini-conference we had before SCALE started, and I really enjoy it. I like the new venue; I think it's really nice. Pasadena is a wonderful, wonderful town, and walking around the restaurants and everything is really nice. So again, normally I do stuff with Postgres; I work for EnterpriseDB. But this is actually about home automation, and it's something that's near and dear to my heart. I hope that comes across to you. We do have 73 slides, so I'm going to chug along a little bit. But the slides are actually available to you right now: if you go to that website right there and you pull down the slides, it's under presentations, I think the general subcategory. You can read the slides whenever you like. In fact, there are probably 30 or 40 presentations on that website, all as PDFs, so feel free to take a look. A lot of them have videos as well; you can actually watch videos of me doing the talks. But again, we're here to talk about home automation, and I think it's really interesting. What I'm hoping to do today is give you a background on some of the technical aspects of home automation, but the technical aspects are actually not the hardest part. Personally, I think the hardest part is figuring out how to integrate technology with your home life, with the people who are in your household, and doing that in a successful way.
I know there's a lot of talk about the Internet of Things and all these new light bulbs that are Wi-Fi enabled and such, and it's interesting, but one of the points I'm going to make along the way is that technology in the home automation space has been around for a long time. It hasn't been as good as it is now, but it's been around for a long time, and we've still found it challenging to make home automation a general-purpose thing that families and households use. I'm going to give some data on that; I'll show you some articles that have been published recently on this issue to give you a framework. But if you're approaching the home automation problem thinking it's a technology problem, you're probably going to want to think about that again when you're done with this talk, because I'm going to highlight what I think are the real barriers to home automation, and in many cases they are not technological. So, what are we going to talk about? We're going to talk about: what is computerized automation? What actually is that? We're going to evaluate some technologies, specifically the technology that I've used in my home for over 10 years; but again, it's just a sample technology. We'll look at a sample deployment, particularly my household, which has grown steadily over the years. We'll talk a little bit about device programming and what's basically involved there. And then we're going to talk about what success is; that's actually where we start to ask: well, what are we trying to do here? What are our goals? And then finally (sorry, I'm just not used to the microphone being way up there; everybody hear me in the back, are you good? Great, okay, just checking), we're going to talk about 12 home automation applications, particularly cases where I wanted to do home automation and how I solved them. Do I think you're going to use any of the applications I specify? Probably not, but the goal is that you'll have gone through the cases of somebody who's been through it, and then you can apply those as you go. So be prepared, once you've heard this talk, to go home, spend a day or a couple of days, and be sitting in your backyard thinking: hmm, I heard that home automation talk a couple of days ago; I wonder if I could X. That's usually the way it happens. I've gone to home automation talks before, and I get ideas that percolate in my head, and then I grasp onto them when I see a use case right in front of me in my household, okay? I'm going to take questions as we go, because I think it's hard for you to remember them, and I think it's more interesting to answer questions as we chug along, so feel free to raise your hand; we'll be doing that, okay? So let's get started. Let me do that, and that, there we go. So, what is computerized home automation? First off, we have to talk about what it isn't. What isn't computerized home automation? Timers: the timer that has the dial, where you stick the plug in, right? It's automation, but it's not really computerized; I think we can all agree on that. The clapper: maybe I'm showing my age here, or maybe I'm not, because everyone's seen the hilarious YouTube video about it. Anyway, it's not really computerized. Dusk-to-dawn sensors, similar: a lot of people think of them as automation, but they're not computerized.
Motion sensors, again, are fairly widely deployed, particularly in outdoor applications, and again, they're not computerized. What is computerized? Cases where device behavior can be combined. Think of programmatic inputs, okay? And the idea of action at a distance. In a lot of the cases you're going to see here, the sensor is in one place and the action you want to happen is somewhere else, okay? That's not true of a dusk-to-dawn sensor; that's not true of a motion sensor. In those cases, the sensor and the action are right next to each other. When they're separated, then you have a challenge: you've got to get something controlling all that; you've got to get one sensor to be sensed and then send information to some other remote device. That's where it gets interesting. Activity detection is another thing; again, usually at a distance. It should be fully programmable and scriptable. At least I would like that; it's not necessarily a requirement. And it should have access to external data. I'm going to show you some examples of that, where you can pull data off the internet and have activities behave based on data you've gotten from the internet, okay? Any questions? Okay, so let's talk about home technologies. Again, we've got seven sections, or six sections, right? So let's talk; I alluded to this just a minute ago: the idea of having the sensor in one place and the action somewhere else, okay? That's a challenge. Again, a motion sensor: you put the motion sensing into the bulb; somebody moves, the bulb goes on. Not really programmable, not really controllable. It works, assuming the sensor and the activity are right next to each other, but a lot of times they aren't, and if they aren't, you have to have some type of way for these things to communicate. So I think in good home automation systems, you have some type of network that can actually control things and set them up harmoniously. Now, this is a little awkward: you're used to thinking of networks as Ethernet and maybe wireless, 802.11, Wi-Fi, but there are some other networks available in your home, and you should be aware of them. Don't get stuck on the idea that you only want to put 802.11 Wi-Fi devices in your home. There are a lot of limitations to those in terms of cost, in terms of what they can do, in terms of sophistication and how hard they are to configure. So don't get stuck on "I only want to have 802.11 devices," because it'll limit what you can do, okay? So what networks do we have in our homes? Well, a lot of people aren't willing to rewire their entire home to get started with home automation; I'll explain why that's unrealistic in a minute. But effectively, you already have some networks. First, you've got the telephone. Is the telephone a network in your house? Yes, it is: if I pick up one phone and my wife picks up another one, we can talk, right? Yeah, it's kind of awkward; I'll hear the dial tone while I'm talking to her. But it is a network, okay? In fact, you might want to think of it more as a network that reaches out into the rest of the world; it's not necessarily one phone to another within your house, but it is a connection to a data network, and I'm going to show you some examples of that. Cordless telephone frequencies: 900 megahertz, 2.4 gigahertz, 5.8, 1.9. These are all networks that are available in your home, and some of the Wi-Fi protocols, in fact, do use these frequencies.
They may not be the same frequencies as 802.11, or they may not use the same protocol as 802.11, but they are networks. So realize that you have wireless frequencies all over your house that can also be used for communication. You may have wired internet in your house; good for you, I have a little bit of it. I actually run Ethernet over some of my coax that used to be used for cable television. So there's that option: coax or some type of Ethernet in your home. And of course 802.11, Wi-Fi, is available; that's a network. Your electrical system is actually a network; I'll show you an example of that. You might not think that's true. And of course we have new wireless networks available, okay? So let's take a look at some of the standard ones out there. This is not an exhaustive list; if you go to some of these URLs here at the bottom, they'll give you a better list of all of the networks that are available in your house. One of them, which I happen to use, is power line control. It makes your electrical system a network. There are some advantages to that and a large number of disadvantages, so you may not want to go that direction. There's Z-Wave, which runs on the 900 megahertz band. That's very similar to the 900 megahertz I've listed right here for cordless phones; in fact, it uses the same band. ZigBee, same thing: runs on 900, runs on 2.4, and it is an IEEE standard. The nice thing about ZigBee, you know, they both start with Z, Z-Wave, ZigBee, but the nice thing about ZigBee is that it is more of an open standard in terms of allowing devices to communicate with each other, okay? And then there are some hybrid ones. There's Insteon. That's actually too loud, isn't it? So I'll put that back there. Sorry about that; I was just worried. So again, just think about the networks that you have in your house and how you're going to use them. You might want to standardize on one network type. There are some very creative hubs now being produced which actually communicate with several different networks. So you'll get a hub from somebody (I know Staples sells one, and there are a bunch of others) where it's basically a home automation center, and it does 802.11, it does Z-Wave, it does ZigBee, it may do X10, it may do Insteon, okay? So the idea is that you might decide you want a hybrid network in your house: you can't buy all the devices you want using the same protocol, so you buy some kind of hub that synchronizes all of that together. Just keep aware of this, because there are a lot of devices that are only available in one protocol. For example, door locks: I think they're only ZigBee or Z-Wave, right? And there are other things that are only X10, or some things that are only Insteon. So if you're in that type of environment, either you have to have some kind of central server that understands all of those at once, or you're going to have to buy a hub that understands all those protocols, and then program everything within that hub and let the hub do the communicating. Any questions so far? Okay, choosing a home network technology. Again, I've listed a bunch of them; there are a bunch more, and they're coming out all the time. Some of them are proprietary; they don't even really have household names. But you have to think of a couple of things. You have to think: do you have open source computer control of this protocol?
Okay, if it's a closed source protocol, you may not have a way of communicating from Linux, or from BSD, or from an open source operating system into that protocol. And if you don't, that means that effectively you're locked into using the software that comes from that vendor, and you're also locked out of interoperating that particular protocol with another protocol that's available. So be aware of that kind of trap; you want to try to find something that is open. Pretty much everything MisterHouse supports (it's written in Perl) is probably going to be good for you. If you go to MisterHouse, which has been around for probably 15 years, they have a whole bunch of protocols that they understand, and it's actually an example of a tool that can integrate a whole bunch of different technologies together and give you a single platform on which you can program. You have to think also about what type of devices you want. I already mentioned door locks as something you might want, and you might find that those are only supported on certain protocols. You may find that you want to control, say, lamps; that's a pretty basic one: turn the lamp on; turn the lamp off, damn it. You might find that you want chimes as part of your use case. You might find you need wireless remotes that allow you to turn lights on and off without actually going to the light. You might want sensors. You might want HVAC thermostat types of controls. You might want 220-volt controls. So you want to try to get an idea of what devices you think you're going to need, then look at the protocol you choose and try to get one that hopefully supports most of the types of devices you're going to use. And if not, you have to realize that you're going to have to bridge networks; you're going to have to get some kind of central hub to bring all those networks together. It would be nice if we had one home automation system, one protocol, but there are a whole bunch of reasons that we don't. X10, years ago, probably 30 years ago, was hoping to be that; it's been around since the mid-80s. Again, as I was saying, home automation is not brand new. Home automation is always five years away, and in five years, it's going to be five years away again. That's effectively how a lot of us who have been doing home automation see it, because the technological problems were surmounted long ago. It's really more the sociological and psychological problems you have to deal with, and that's what I'm going to go into in a minute. Also think about simplicity and device replacement. I know a lot of new homes sometimes get a home automation system that's built into the home. Does anyone have one of those? Does anyone want to admit to having one of those? The reason I ask is that I have talked to a number of people who had pre-built home automation systems in their homes, and the problem has been one of reliability, the ability to expand as technology expands, the complexity of programming these devices, and the cost. So unfortunately, a lot of pre-built systems, which were basically in the home when they purchased it, have often fallen into disuse within a couple of years, because they had a very rigid structure, they weren't able to change technologies, they were very complicated, and they were often very proprietary; not off the shelf, I mean, very customized.
But again, the more customized, the less testing that's going on, and if you had to change the way anything behaved, you had to bring in somebody to reprogram it. And obviously, that was a big detriment to people doing things with home automation. Just to step back a minute: home automation seems to be fine in, say, automobiles. You drive an automobile, it has a whole bunch of technology in it, right? Why can't I have the same technology in my house? Well, the reason is that the car comes as a closed system. You buy that car, you drive it off the lot, and every single part in that car is made to work with every other single part in that car. You probably have the car for three years, five years, maybe ten years, okay? And you don't live in the car, right? So you're not worried about it locking the doors so you can't get out, or something. You're not assuming your car has the same requirements of safety and predictability that a house maybe does. For a house, you figure it's going to live 70 years, 100 years, and you know technology is going to change all over that time. A house is not necessarily a technological vehicle, so you don't think of technology as being part of it. So it's kind of interesting, when you think of automobiles, that they've been able to move very quickly in terms of automation, because they are a closed system that comes pretested. Home automation, by definition, is always a one-off, and that's made it very, very hard for people to effectively get any kind of hold or momentum in this industry. Cost, I talked about; that's a big factor. If you have to go to your spouse every time you want to add a light going on and off and it's going to cost you $80, you're going to have a conversation, okay? I don't know about you, but with my wife, I probably have, I don't know, 30, 40, 50, 70 devices in the house at different places, okay? If those were all $80, you know, it'd be like, eh. My wife kind of jokes that I've got a little X10 warehouse in the basement. The reason for that is that when I buy things, I tend to buy a couple and leave them in the box, and then when somebody needs something, I can go down to the box, pull it out, and install it right away. I don't have to wait for shipping and pay for shipping and so forth. So I have a tendency to have a little extra stuff around, and that's how it got to be a joke. But the items are not that expensive, so it's really kind of easy to have a couple extra around. Again, once you get into the hundreds of dollars per device, then you're really going to be thinking. And again, there's a lot of buzz about this. Home automation is really a subset of the Internet of Things; it's the Internet of Things brought into the home. And I think the jury's still out on what it's going to look like five or ten years from now. I would have thought that home automation would have been here five years ago. So, yeah, my prediction is completely useless, because I assumed that once the technology was available, it would be used very widely. And in fact, when I give presentations (I'm not going to ask for a show of hands), there's usually a very small number of people in the audience who are actually actively doing home automation. And there are a number of stumbling blocks, and I'm going to cover those, okay?
Let's just take a look at a sample deployment. I'm going to go through these slides really quickly. I started my home automation probably in 2003, 2004, so maybe 11, 12 years ago. At that time, most of the devices that are available today for home automation were not available. The only real basic home automation thing was something called X10. It was very inexpensive. It was also fairly unreliable. But I figured I had to use it, effectively. What it does is make your electrical system into a network. Sounds crazy. But effectively, the way they do it is, in a standard AC house, they send a little spike of signal at the point where the phase crosses the zero line, and they can basically send a one or a zero that way. Sounds really crazy. You can only do 120 bits a second, because the power cycles at 60 hertz and crosses zero twice per cycle. So, really slow in terms of capability. But if you've only got 30, 40 devices, and all you have to do is say on or off, it's not that bad. You're not pushing a lot of data through there to basically say "that device I want to turn on; that device I want to turn off." This is a little bigger example, and again, some of the codes. X10 will support up to 256 devices: 16 house codes times 16 unit codes. I'm not suggesting X10; I'm just giving you an example. There's a sine wave on the oscilloscope; that's what it looks like when it sends a one, basically. So, the X10 standard. This is very sobering for me: designed in 1975. I can tell you, when they designed this, they thought it was going to take over the world. And you know how many people really use it today? Probably like 0.001%, if that. Again, there's a packet and a protocol and so forth. There are all sorts of problems with phases, and couplers that allow you to bring the phases together. You don't have to be an electrical engineer to do this, but you have to have some kind of engineering smarts to set it up in a reliable way, particularly in a large home. But the nice thing is the devices are fairly cheap; they're roughly 10 bucks, maybe $15 a piece. They look pretty normal here. For example, on the right is a normal switch; on the left is an X10 switch. And you'll notice it's a button, because there's no fixed on or off position; it can be turned on and off electronically. So instead of flipping it up and down, you just press it. This happens to be for an outside light. This is a flat switch, a Decora switch they call it. These are three-way switches. This is a wireless switch; it's kind of weird. This is actually how I got started with home automation: I had an electrician in the house, and I said, my wife wants a switch right here to control the light over there. He's like, whoa. Well, okay, the problem is you want the switch on the second floor, but the light is in the ceiling of the first floor. If I have to run the wire down and over to that thing in the middle of the ceiling, I'm going to make a whole bunch of holes; I don't think you want to do that. He said, well, you can try X10. I'm like, what's that? I'd heard of it, but I didn't know; I remembered the obnoxious X10 marketing, that's all I knew. Yeah, so I hope you remember that. I think it was the first pop-up ad, believe it or not. But in fact, that's what it was. That is a switch that will send a 900 megahertz signal to a receiver, which then will send a signal across the electrical network to the switch that actually controls the light. So it allows you to basically stick a switch anywhere. Kind of interesting.
That's what the receiver looks like. So this sends a 900 megahertz signal, which is picked up by the receiver. The receiver sends it through the electrical plug, through the electrical network in the house. The switch which actually controls that light is also an X10 switch, like that one, like that one, and it will turn the light on. Kind of cool. You can have a remote; that's kind of cool, and I'll show you an example of why I use that. You can have multi-key remotes; that's a big button on the bottom there. I won't go into that. This is an interesting case. When I bought my house, we had one of those Intermatic dials. Engineers will probably know what it is: a big yellow dial with little screws on it. Anybody getting this? Yeah, okay. Little screws you move around, right? That's for the pool. If you need to turn the pool pump on, you have to open this little box and you have to flick a switch to turn the pump on. So picture this: you get out of the pool, somebody wants the pool pump turned on for some reason, so you've got wet hands, okay? Wet hands, and there's this metal Intermatic dial, right, with some yellow paint on it, and a little plastic flap down here, and then you're supposed to touch this. And I'm like, this is just an accident waiting to happen. 220 volts, okay? I'm like, this is not good. So one of the things I did here was to actually put an X10 switch here, which for safety reasons alone is cool. And I'm going to show you an example later of how that works. I actually use an open source program called heyu to control X10. Again, if you have some other protocol, you'd have to find a different application. But it has a whole bunch of things it can do. Here's an example of a crontab, for those of you who know what that is, that effectively turns different lights on and off depending on whether it's getting dark, turns the pool pump on and off, and turns all the lights off at night; there's a sketch of what that looks like a little further on. To me, this is really where home automation takes off, because it's one thing to have a motion sensor; it's one thing to have a cutesy little thing. I know there are some home automation things where, every time you get a Twitter mention, the light changes color, right? That's cute; I think it's great to show off to people, but practically, it doesn't really turn me on. Not that this turns me on either, but it actually is useful. And the reason it's useful is that I effectively now have programmatic control of my entire house. From a command line, from a crontab, from a shell script, I can do virtually anything, and that's really where home automation, to me, takes off. Now, I'm an engineer, and I'm not expecting every homeowner to have a Unix server, connect it to X10, and start running cron jobs; but I do expect you people to do it. I don't think I'm asking too much here, okay? You do it for work, but you come home and there's no home automation there. Now, I'm just being facetious. I found it interesting, and it is kind of cool, I think, and my family actually does like it; I'll explain why that is in a minute. You can get an idea of what's going on here: we have the ability, for example, to dim things right here, okay? And this is kind of interesting: it actually goes out and finds out the visibility at the airport, how much cloud cover there is at the airport, and what time sunset is, and it computes when the lights should go on in the house.
Because if you have a normal timer go on at 5 p.m., or go on at 6 p.m. every day, hey, it doesn't account for dusk. Maybe you could use a sensor, but the cool thing is that this actually looks at how dark it is and computes it that way. There's some other cool stuff in here we'll talk about in a minute, but that's just a sample of what I would call a master console that lets you do things programmatically, okay? How does the computer communicate with X10? That might be the really interesting question: how does it get onto the wired power network? Remember, I'm using a power line network; if you were using ZigBee, you would have a ZigBee controller connected to your computer. But effectively, what you've got here is: there's the server, there's effectively a telephone wire, and this is a serial port. To the server, it looks like a serial port. It basically plugs in. I used a USB-to-serial adapter, and this is just a little serial cable that comes out; it goes into here, and this plugs into the wall, into the power. And that's how it can not only send controls out, but also detect activity coming in. That's the other issue: not only do you want to control what's going out onto your network and turn lights on and off, but if you have any sensors in your network, you want to be able to detect those and take appropriate action on your server. So it's both in and out that you want to be able to do. This is actually what "in" looks like to me. This is just a log of heyu, or X10, activity, and you can see various things happening. Here's a motion sensor; here is somebody turning a light on with a remote. So you can record this, and then you can create a shell script, or some type of Perl script or something, that is always reading that activity and taking actions based on it. Again, you can kind of see what I mean about computerized home automation when you can actually computerize it, okay? You can control it from a shell script, from Unix, from Linux, and create your own world of control. You can do all sorts of things. So, for example, this script sits in a loop, and all it's doing is reading the output of the log. Every time it sees somebody hit the bedroom remote (it's a button on a remote control), it turns some lights off; or it does the dusk thing; or it does the kitchen one. Here's an example: we hit a button, a wireless button, and it actually sends a signal out to all of the terminals in the house, including the slideshow, that shows up and says, hey, we're ready to eat. And it also sends a chime to different parts of the house to tell everyone we're ready to eat. The house is fairly large. We tried one big bell in the kitchen. You know what happened? Everyone in the kitchen became deaf, and nobody else in the house could hear it, okay? I actually used to have my server make the sound of a bell; everyone in the office where the server is would have a heart attack every time it went off, and nobody else in the house could hear it. I'm like, this is not working well. So we actually got little X10 plugs that make a little chime sound, and you put them in different parts of the house, and you don't hear this huge sound in one place; you hear a ding-ding in different parts of the house, and everyone kind of hears it. They still may not come to dinner, but at least you can say you've tried.
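To make the crontab idea concrete, here's a minimal sketch of what such a crontab might look like. The heyu on/off subcommands are real, but the device codes (a1, b3) and the dusk-checking script are hypothetical stand-ins for whatever your own setup uses:

    # turn the outside lights off at 11 p.m. every night
    0 23 * * *   /usr/local/bin/heyu off a1
    # run the pool pump from 9 a.m. to 1 p.m.
    0 9  * * *   /usr/local/bin/heyu on  b3
    0 13 * * *   /usr/local/bin/heyu off b3
    # every 10 minutes, let a (hypothetical) script decide whether it's
    # dark enough, e.g. by checking sunset time and airport cloud cover
    */10 * * * * /usr/local/bin/dusk-check.sh

The nice thing about driving everything through cron and small scripts is exactly the point above: any program that can run a command can now control the house.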
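And here's a hedged sketch of that log-watching loop. heyu really does have a monitor subcommand that prints received X10 events as they arrive, but the exact log text varies by version and configuration, so the patterns and device codes below are illustrative placeholders, not the script from the talk:

    #!/bin/sh
    # React to incoming X10 events -- a minimal sketch.
    heyu monitor | while read line; do
        case "$line" in
            *"func On"*B2*)            # hypothetical: bedroom remote pressed
                heyu off a1            # turn the downstairs lights off
                heyu off a2
                ;;
            *"func On"*C5*)            # hypothetical: the dinner button
                heyu on d1             # sound the X10 chimes around the house
                heyu on d2
                ;;
        esac
    done

A real version has to match whatever your heyu log lines actually look like, but the shape is the same: one long-running reader, and a case statement mapping events to actions.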
So my point is that, moving into step four, we want to talk about what we're actually doing. What things do we need to look for? When you're thinking in terms of programming: what are your inputs, what kinds of things are you looking to happen, and what is going to be your output? I'm going to keep coming back to this input/output theme pretty regularly. Let me take some questions before we move to the next section. Yes sir. I'm sorry? Do UPSs interfere with this signal? They absolutely interfere with the signal, yes. I didn't show it, but there is actually a specific X10 isolator designed to be plugged in between something like a UPS and the power system, to prevent the dampening of signals, which is what the UPS is designed to do, from interfering with the rest of the network. So yes, in fact I do have a filter that sits there, and you have to have it. Yes sir. So that's a great question: if I was starting out today, what would I actually do? And I've racked my brain on that one. I'm not sure what I would do, and the reason is that a lot of my requirements, going back to that earlier slide, come down to: what's your cost profile, what type of devices do you need to interface to, and what type of actions do you need to take? For me, I needed 220-volt control for the pool. I needed some type of chimes. I needed some type of remote that would send a signal into the X10 network. I didn't need light bulbs that change colors. I didn't need a lock that unlocked for me. I just didn't need that. So I've looked at some of the technologies, and I would probably have a hybrid system at this point. I'd probably still be using some X10, because a lot of the devices, like an X10 chime, are around $8, and I don't know anyone who makes anything like that at that price point. With other systems, you'd have to take a device and somehow add a chime on top of it, whereas right now it's just a little block that sticks in the wall. The other cool thing is that if a device breaks or you want to debug something, you just unplug it. It's not that hard. I had a friend who had a home automation system, and every month he'd be replacing a switch or an outlet in his house, because the automation was wired into the outlets. They were very sophisticated, they broke, and he had to rewire these things. I've only had maybe two devices fail in 10 years, and I don't feel like rewiring my house. If I decide I don't want X10, I just pull it out of the outlet. A couple of the wall outlets I'd have to deal with, but not the plug-in switches. In general, I would probably be doing some ZigBee stuff, although it has a tendency to be a lot more sophisticated than I even care about. In most cases, what I want to do is pretty simple: turn the light on, turn the light off, dim the light, tell me somebody hit a remote. A lot of the home automation today is targeted at a specific use case, like a lock. But there are very few systems I've seen that take a toolkit approach, where you've got a whole family of devices that all talk to each other and all use the same protocol. So I don't know; I've looked at them, I've talked to some people.
There's a PLC, power line control thing, that's supposed to be very sophisticated. But honestly, no protocol has really taken over. X10 is on its way out, but I haven't seen any single protocol become the de facto standard at this point. And I would like to avoid a hub, where I've got one hub and I have to interface with that hub all the time, because then, in a certain way, I don't have computer control of it anymore. Other questions? Yes. The benefits of power line versus Wi-Fi? The benefits of power line over Wi-Fi are simplicity of the device and low cost of the device. Once you do Wi-Fi, you've got to have something fairly sophisticated; I don't know how you would make a little tiny remote control that did Wi-Fi, because it would drain too much battery just being on the network all the time. So a lot of it is simplicity, I think. I'm going to hold questions at this point; I want to get through the applications, and we'll come back to them. Again, I'll be up here for half an hour after. I knew that was going to be a big question: where do you go from here? And I'm basically telling you, I don't know where I'd go. I know where I've been. I know how hard X10 is to get working reliably, but I don't know where the new technology is going, because I feel a lot of the new technology is very siloed. When we get to the applications section, you'll see what I mean in terms of the breadth of what I wanted to do. I'll close with a funny story. I was reading a Wall Street Journal article about home automation; they've had a number of home automation articles if you want to take a look. Somebody was talking about one of these smart light bulbs, and he was explaining that to turn the light bulb on, you take your smartphone, you open the app, and you hit a button; the light goes on and off. And he said: if we had started out with smartphone control of our light bulbs, and somebody had come along with the idea of putting a switch on the wall, we'd all think he was a genius. That's very sobering to me, and it gets back to what I'm going to talk about in a minute: what kind of problem are you trying to solve? If "it's cool that I'm using my smartphone to turn on the light" is the problem, then you can solve that. But if it's ease of use, a smartphone might not be the best way to turn your light on, right? And I think that's unfortunately emblematic of an industry that still has to figure out where it's going. So, inputs: what type of inputs do I need? Again, we're talking about things that I want. I want to be able to type commands at a command line. I want clock time. You might want dawn/dusk sensors, wireless remotes, caller ID, telephone dialing, websites, pulling information in from the internet. And then you've got outputs: turning lights on, turning motors on, appliances (I'm going to talk about a coffee maker in a minute), sounds, like the chime telling everyone it's time to eat, broadcasting network information to your family, a slideshow. All useful things you could do. This is an example of the first floor of my house. The circles are actual lamps or lights. You might need to look at this slide later and pick it apart, but effectively what we have here are different things that are happening.
So, for example, we're walking through a case where, right here, we indicate that it's sunset, and it sends a wireless signal to a device. It then goes through the circuit breaker to a different phase. It arrives at the computer's X10 interface. It goes into the computer, comes out of the computer, and it turns on this light right there, and this light right here. So I'm walking through, one to eight, how a dawn/dusk sensor effectively goes through the computer and then schedules an action to happen later. That's the dusk sensor example. So let's talk about number five: what is success? When I made this slide, I was almost going to give you an empty slide; I was just going to put up the question and not even answer it, because there's what you think success is, and there's what success actually is. You really have to define this. Again, back to: can I turn my light on from my smartphone? Yeah, I guess, if you want to, I don't know, right? So this is what success is to me. For me, success was improving my family's home environment. That was the goal. The other suggestion is to start slow and make incremental changes. I'll give you a story here. I bought some X10 stuff, and I'd done something with X10, I don't remember what it was; the first thing was some particular case I needed X10 for, very simple. And probably three months went by, and my wife comes to me and says, you know, it would be nice if this X10 thing could turn on the light in the family room when it gets dark. I'm like, yeah, I could do that. Three weeks later, I did it. Why did I wait three weeks? Because effectively, they had had X10 in a very limited, siloed situation that they understood and got used to, and then the family says, you know, I could use this other thing, could you make it do this other thing? And I'm like, yeah, I could do that. And three weeks go by, because if I did it right away, I'm being overeager, I'm showing my hand, like I'm expecting we're going to have a hundred devices by the end. No, no, of course not. So I waited a couple of weeks, I put it in, it worked, and I waited another three months. And then my wife comes to me and says, oh, you know, I'd really like the outside lights to come on when it gets dark. Because if you've ever seen the non-X10 controls for that, they're always very awkward, and every time you lose power you've got to reset them; they're kind of yucky. And we have a whole bunch of lights outside, so it would take a lot of work. I said, oh boy, I'm going to have to bring in an electrician, because it's a three-way switch, so I have to know which side is hot. There's a whole bunch of design involved in putting X10 into a three-way switch situation. It probably took me two years to do the outside lights. And finally we got it done, and we're like, okay, good. And then we did the pool pump, and then a whole bunch of other things. So my point is that when I did it, I did it in a very slow way. I also accepted that some home automation tasks are just impossible. Somebody asks, could you make it do this? And I'm like, I don't have any inputs I can really use, there's no output, there's no logic that would tell me when to do that.
And you have to say, I guess we can't do it. You just have to accept that; don't feel bad. There's a whole bunch of things you can't do. But you know you've succeeded when somebody asks for more. And in fact, this Wall Street Journal article at the bottom, "smart home gadgets still hard to sell": that's not from two years ago or five years ago, that's from like two weeks ago. And if you read that article, you will find some very interesting statistics indicating the difficulty of really getting home automation into the home. Only about 9% of households are interested in home automation, and that number has not changed in the past two years. They do the survey every year, and it stays way, way low. There are a lot of people who would love to sell it to you; there are not a lot of people who want to buy it. There's a whole bunch of reasons for that. What are your challenges? Change: people don't like change. I'll give you a classic example. In Philadelphia, we had a smokestack next to the train station. It was used for the power station for the Pennsylvania Railroad, and it had been there since 1930. And it was not used; the railroad has not been generating its own electricity for a long time. But the smokestack was still there. They got rid of the smokestack about 18 months ago, and I'm coming in on the train, and the smokestack's gone. I'm like, where's the smokestack? A lot of people don't like change; even though the smokestack probably looked hideous when it went up, after seven or eight decades, everyone expected that smokestack to be there. When it's not there, everyone is like, something happened. So don't expect your family to just rush to home automation because you think it's important. Reliable operation: that's their home, that's their castle; they do not want unreliability. That switch on the wall works every time, folks. If you can't match that, most people aren't interested. And let's face it, they don't want complexity either. Device longevity: if you put something in your home, particularly if it's hardwired, is it going to last for the life of your home? If it isn't, are you going to be able to change it easily? You've got to think about that. Maintenance: how much are you going to have to maintain this thing, and what's the cost? Do you have to have a server running all the time? I have a server running all the time in my home that does a bazillion things, so adding home automation to it is not a big deal. But if you don't have a server running all the time, how are you going to do this? You could make one of those little, not Arduino, the other one... Raspberry Pi, yeah, you could. You've got to make sure you get a serial port on there and a whole bunch of other things. You could probably do it; there's not a whole lot of power involved. But again, you have to think of that. And also security and privacy. This item here at the bottom, again from a week ago. Actually, I made a mistake I'll have to fix: the URL here and the URL here are the same, so something's wrong. Anyway, you can just search for these titles, you don't even need my links. "Nest thermostat bug leaves users cold." Anybody ever see this article? Yeah, okay. It turns out they did a firmware upgrade, and all of a sudden "some" users experienced a problem. I have a feeling "some" means all, okay?
Because if they're not giving numbers, it usually means all. Everybody got that upgrade, and now they're worried about houses that maybe are not occupied, vacation residences; they have no heat anymore, and they've got to drive 450 miles to figure out what's going on. A nine-step process to reboot it and get it back online. That's scary. People don't like that. So you have to offer something that makes it worth these changes and these disadvantages. So let's get to the meat: actual applications. I'm going to run through 12 of them, but again, yours are going to be completely different. First one: the telephone interface. I told you about the idea that the telephone effectively is a network. So what I do, and this is unrelated to X10 entirely, is I just take a modem. This is a U.S. Robotics modem; believe it or not, you can still buy these things. As you can see, it's got a serial port. Here's where the telephone call comes in; here's where it goes out to the computer. And we have a little loop that's always watching for activity. And this is the activity: this is what caller ID looks like in the raw to a modem. You basically turn on caller ID on the modem; you tell the modem, I want you to listen for caller ID. And all of a sudden, you start to get this information: the date, the time, the telephone number, and the name. You can then take the telephone number and look it up in your address book, and bingo, you get "Christine Momjian from Bruce's cell phone." And you have my telephone number right there, which happens to be on my website, so it's not secret at all. If you call it, it'll buzz in my pocket right now. So the point is, yeah, I've got 20 people calling me. The point is that I've done a couple of things here. First, I've gotten information from a network. Second, I've added smarts to it by looking up the telephone number in my address book; my description in my address book is going to be better than what I'm getting from caller ID. And it comes up with this beautiful message that shows up on everyone's terminal, on the slideshow, and on a laptop in the kitchen via xmessage; the whole screen becomes "telephone call from" whoever. And everyone loves it. This was my daughter's idea, actually, the idea of it coming up on the screen, because a lot of times when we're eating dinner or lunch, somebody will call and we'll be like, who is it? Some systems will speak the caller's name to you, but that doesn't help, because it doesn't check my address book. So we'll be eating, the phone will ring, and we look over and say, oh, it's this person, and right away we know exactly who it is. We know if they're calling from their cell phone or from home, and we can call them back or whatever we want. In fact, I even have the system set up so I can dial out from my address book. I never dial the phone to call anybody. I just run rolo with the person's name, and I say: dial, call the house, call their cell phone, call their work, and you just pick up the phone and it's dialing for you, because the modem is happy to dial for you; you just need to tell it to do it.
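A minimal sketch of that caller-ID loop follows, assuming a US Robotics modem on /dev/ttyUSB0 where AT#CID=1 enables caller-ID reporting (the command varies by modem vendor, so check the manual). The ~/.rolodex lookup file and the chime unit code are invented for illustration.

    #!/bin/sh
    stty -F /dev/ttyUSB0 57600 raw
    printf 'AT#CID=1\r' > /dev/ttyUSB0           # ask the modem to report caller ID
    while read line; do
      case "$line" in
        NMBR*)                                   # e.g. "NMBR = 8565551234"
          num=${line##*= }
          name=$(grep "$num" ~/.rolodex | head -1)   # look it up in the address book
          heyu on c7                                 # ring the X10 chimes
          echo "Call from ${name:-$num}" | wall      # broadcast to all terminals
          ;;
      esac
    done < /dev/ttyUSB0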
And I also have a system where, if I call the house, it makes a little ding-dong sound, so everyone knows this is not an ordinary call, this is your husband calling, or actually any family member; it'll ding-dong to distinguish it from an ordinary call from somebody who isn't in the family, which you're going to answer, but maybe not as quickly. So it kind of announces who's calling. It's a combination of the caller ID on the modem, the address book, X10 to do the chime, and xmessage to put the message on the screen. It's kind of hilarious: if I call my house from a hotel and I have a terminal open, I can effectively see myself calling. The terminal says Bruce is calling, and I'm like, yeah, I know I'm calling, I'm here. So it's kind of funny. You can also dial out, and again, it's happy to do that, automated. This is how it works: the caller ID comes in, it checks the Rolodex, then sends a chime, sends a broadcast message, and logs it as well. I have all the calls that have come into the house for the last 12 years; I can pull them up. Outgoing calls, again: I can have the system dial directly from the address book. The second application is home automation on the first floor. This is basically a list of all the lights we have, all the wireless remotes, the coffee maker (I'll talk about that in a minute), the computer. As you can see, this is a pretty target-rich environment. This is the second floor; again, it has lights. The pool pump is on the outside, obviously, and there are different remotes in different places. Again, this didn't all start at once; it grew over time. You've seen pictures of these. As the needs came up, a family member would come to you and say: I would like a remote that does X, I would like a switch that does Y, I would like this light to go on in this circumstance, and so forth, and then you work with that individual to add it. And this is a list of some of the devices. With X10 you have a config file where you name the devices, so you can say "x10 off family_room" and all the family room lights go off. I have a button that turns off all the lights if we're going to watch a movie, these types of things. This is "x10 on couch": turn the couch light on, turn these on. This is the one where, if you're going to watch a video, it turns off all the lights in the area of the video. Again, the idea of using cron to do stuff; we talked about that already. This is that same screen where we talked about different things happening at different times. Dawn/dusk activity: what I do here is go out on the internet, find out when sunset is, figure out how overcast it is, and determine when the lights are going to go on. As I said before, X10 also supports remotes, so there are a bunch of remote applications.
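By the way, the device-naming config file just mentioned might look roughly like this in heyu's x10.conf; the names and house/unit codes are examples, and the CM11A-style serial interface path is an assumption.

    TTY    /dev/ttyUSB0          # the X10 serial interface
    ALIAS  couch        A1  StdLM
    ALIAS  family_room  A2  StdLM
    ALIAS  pool_pump    B3  StdAM
    # then: "heyu off family_room", "heyu on couch", and so on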
This is one of the cooler applications I've done: the coffee maker. My aunt lives with us, and when she gets up, she sets up the coffee maker; then when my wife comes out of the shower, she presses that button, the coffee maker comes on, and by the time she comes downstairs, the coffee's ready. I'm not a big coffee guy, I drink tea, but she loves it, and my aunt loves it. In fact, it's funny, because my aunt will get up, set up the coffee, put the water in and everything, and then plug it in, but there's no power to it, right? Then she'll go back to her room, and when my wife presses this button, it turns the coffee maker on. But the problem was that my aunt didn't always know when the coffee maker went on. So there's a separate X10 chime in my aunt's room. When my wife presses this button, it turns the coffee maker on and makes that special chime in my aunt's room, so she knows the coffee has now been turned on, and she goes and pours it and does whatever she does, and that's her thing, and she loves it, okay? And this is how it works: you hit the button, it turns on for 15 minutes (the slide says 30, but I think it's 15) and then turns off again, because you don't want to burn the coffee, right? And that's usually enough time. If that's not crazy enough: my wife has an icon on her smartphone. It's connected to VX ConnectBot, which is an Android application that does SSH with secure keys and so forth, and effectively, when you press that button, it makes an SSH connection to the house, runs the coffee command from the command line, and logs out. So my aunt gets her chime, the coffee maker goes on, my wife arrives, and the coffee is ready. Yes, I know, my wife's a princess, whatever. It's the kind of thing that wasn't that hard to do, it makes things easier for them, so sure, why not? This is something you do in year 10, okay? This is not something you do in year one, but as your family gets more used to it, you start to get into these types of applications, and the house starts to do cool things for you that you get used to. So it is pretty cool. So this is how it works: the smartphone uses VX ConnectBot, runs SSH (it doesn't need a password, because you're using an SSH key), runs the coffee command from the shell prompt, and then logs out. Not that hard to do. Once again, all these pieces together suddenly work really well. The pool pump; I'll cover this a little bit. The problem with the pool pump is that it uses a lot of electricity, and you don't have to run it as much when it's cold out; when it's hot out, you've got to run it a lot. So hey, we'll just go out on the internet, find the temperature outside, and based on that temperature, run the pool pump for two hours or four hours or six hours, automatically (a little sketch of that follows below). And you can control it yourself; you have shell scripts you can run if you're doing something special. But again, I'm not sticking my hand in that box with the dial, and I don't have to move those little knobs to get it to the right time; it's all controlled. Next, the activity screen: we have a screen that shows everything that's going on, like a slideshow, plus the calendar and all sorts of activity. When something comes in from the weather, if somebody calls on the phone, if there's an event, you have all sorts of stuff together in one place. We have an eat button, which is basically a button you hit, and then a broadcast (I talked about this) tells everyone it's time to eat.
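Here's the promised sketch of the temperature-driven pool pump. The get_outside_temp command stands in for whatever scrapes the temperature from the internet, and the thresholds and hours are made up.

    #!/bin/sh
    temp=$(get_outside_temp)            # hypothetical weather scraper
    if   [ "$temp" -ge 90 ]; then hours=6
    elif [ "$temp" -ge 75 ]; then hours=4
    else                          hours=2
    fi
    heyu on pool_pump                   # alias from the x10 config file
    sleep $((hours * 3600))
    heyu off pool_pump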
I'm going to give you the last one, which was actually my hardest one; this is one I stewed about for a long time. My wife said: when somebody comes home, can you tell me when they open the garage, so I know they're coming home, instead of them just walking in the door? Why can't I know when a car is arriving in the garage? This is really hard. It took me a couple of years to get there, and I think just walking through this will give you some understanding of how bizarrely complex it is. Again, I'm running out of time, so give me two minutes to wrap this up. You have a bunch of options. You could do it with lights, motion, distance. You could do it with activation of the garage door, or garage door position. But you have to take into account that somebody could be taking the trash out; that's not a car arriving. And you also have to realize somebody could open the garage door to leave, or they could open it to arrive. It's the same action, okay? So one thing I thought about was maybe using a detector. Here's what I ended up doing. I put a switch up near the top of the garage door, and an X10 controller, and this allowed me to identify when the garage door was open. So this is the garage. Yes, it really is that clean, believe it or not; my son Peter put up those shelves, which are fantastic. Basically, you have three garage bays here. And right up here, you can see there's a little switch. I'll blow it up; actually, this is probably a better shot. Okay, so this is the garage, and here's the switch. And right here is a little bar; it's hard to see, but there's a little bar there. That connects to a switch, which connects to an X10 control, which connects to the power. So when the garage door goes up, it trips the switch, the X10 identifies that the switch has been closed, and it sends a signal to the X10 monitor that says somebody opened the garage. The problem is, you don't know if they opened the door to come or go. How do you fix that? Well, you could use some kind of sensor, maybe an Arduino distance sensor. And I was like, this is going to be really complicated, really expensive. So I went super simple. See that up there? That's a door sensor. See that up there? That's a magnet. When somebody opens the house door, there's a wire that goes from here along the wall into an X10 sensor, so I know when that door has been opened. If they open the house door and then open the garage door, they're leaving; they're gone. If the garage door opens and nobody has opened the house door for the five minutes prior, they're arriving, okay? So it's the kind of case where it took me two years to think of it. I didn't want to get into a whole complicated thing, but it actually works. To put icing on the cake, somebody said: it would be neat if we knew who was in the car when they came home. Well, I'm like, everyone's got a cell phone, and every cell phone has a MAC address. And the way I set up my network, the MAC address of every family member's cell phone is registered to a particular host name, where I put a little DNS comment saying whose cell phone it is. So when somebody leaves, five minutes later I ask: I had ten MAC addresses on the network ten minutes ago; have any MAC addresses disappeared in the past five or ten minutes? And when I identify that somebody's left and a MAC address has disappeared, they were in the car. And when they come back, I know who's in the car, because I know who was in it when they left, and I basically say: the van has returned with Christine.
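That arrival logic, plus the MAC-address trick, might sketch out like this. The timestamp file, snapshot files, and the use of arp-scan to list MAC addresses are all assumptions for illustration, not the actual implementation.

    #!/bin/sh
    # Run when the garage door opens (X10 switch event): arriving or leaving?
    now=$(date +%s)
    last_door=$(cat /var/log/x10/house_door_epoch)   # hypothetical timestamp file
    if [ $((now - last_door)) -gt 300 ]; then        # house door untouched for 5 min
      echo "Car arriving" | wall
      # Who returned?  Compare current Wi-Fi MACs against the snapshot
      # taken when the car left (arp-scan is one way to list them).
      arp-scan --localnet | awk '{print $2}' | sort > /tmp/macs.now
      comm -13 /tmp/macs.at_departure /tmp/macs.now  # MACs that reappeared
    else
      echo "Car leaving" | wall
      arp-scan --localnet | awk '{print $2}' | sort > /tmp/macs.at_departure
    fi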
And that's the newest thing that I've worked in. So, conclusion. Yeah, I know I'm four minutes over, but anyway, this slide is sort of the scary side, the reasons you might not want home automation. But as you can see, if it's done right, and done in a balanced way, I think it can be really powerful. And I think you as engineers certainly have the potential to do it, if you want to, and if you feel that your family would appreciate it. I know mine certainly has. I've enjoyed implementing it, and I know they enjoy using it. So thank you very much. I'll be up here for questions. That's it. Oh, perfect. Okay. Okay, so. Yeah, in a few seconds, let me check something here. Yeah, let's go. Welcome. Thanks. Well, first of all, I'm very excited: this is my first SCALE, and I got my first badge, so I expect to be coming every year. It's been an exciting event, and people are very nice. A lot of DevOps, sysadmins, and developers. Great. And this topic is very important, I think, for everybody: not just DevOps, not just sysadmins, but also developers, because we as developers generate many problems for DevOps, right? You know that. Sometimes I'm creating my own application, the application is logging some information, and the next day I change the format; at some point the DevOps person has to create a new script, and we have 10 applications, 20, all of them with different formats. It's a mess. So in this talk, we're going to talk about our experience developing a solution to fix this problem: to make things easier for developers, for DevOps, and of course for the real business. Fluentd was born a long time ago, like four years ago; well, it's not that long. But before I continue, I'd like to take a moment: if you want to send some comments, "this presentation sucks," it's good or not, please use both hashtags, okay? We have some t-shirts for people who give us feedback during the session, so we would really appreciate it. Well, let's get started. My name is Eduardo Silva. I work for Treasure Data; we are the company behind Fluentd, but Fluentd is fully open source, as are many other projects that we have. We do cloud analytics. When you do cloud analytics, you say, okay, we are going to manage data for a bunch of people; but how do we collect the data? That's where Fluentd was born. Fluentd, like our other projects, is fully open source under the Apache license. These slides will be available later, so you can grab my Twitter, blog, and the projects I'm involved in. We're going to start by talking about logging. Logging is pretty important; it has many advantages. For example, if you can get the application status, you can perform debugging. When you hit some issue, the first thing you're told is: please review the logs, if you have them. And you can find anomalies. Logging can be done locally or remotely; when things start scaling, in the end it's mostly remote. But from a business point of view, logging is important because it helps you take better decisions, right? Sometimes your application is not working well for you as a developer or for you as a DevOps person, but from a business perspective, it could be affecting customers. So things get serious. Logging is not optional; nowadays, logging is a must.
And there are a few assumptions that we need to examine before we start. When we start doing logging, we usually do it on the file system. And everybody says, ah, let's do some logging here, I have enough space, right? Your hard disk will never be full. Ah, you got it; it's happened to you. And sometimes you say: I'm writing this to the disk, but the message that I'm writing is so short that it will never block. What does blocking mean? An application is running, and at this point I'm writing to the disk, and I'm supposed to wait for that function to return. But sometimes it takes one second, two seconds, three seconds, okay? That means blocking. At another point you say the log messages that I'm writing are human-readable, meaning everybody can understand them: false. It's very hard to understand log messages unless you're interacting with them on a daily basis; then you understand what is going on. And on top of that, the next day maybe you have an intern, maybe next week a new developer, and everybody's writing new log messages. And at some point you want to filter that information, and that becomes a problem. And of course, you assume that your logging system will scale. "It doesn't matter; we as a company built this great app, and if we have 100 users, we'll be fine." Nobody cares. Then your application scales, it goes viral, and you get 1,000, 2,000, 3,000 users. And guess what's happening with your logging mechanism? It starts to get stuck. It starts to get blocked. Well, it should work, but sometimes it doesn't. So let's talk about the concerns that we have. If the logging information increases, it means that our data is increasing. If we have different message formats, it's more complex to parse the data. Even if you invoke a system call and ask the operating system to please write this message to this file, and the call returns, that does not mean your data was fully stored on the hard disk or the SSD or whatever, right? Because the kernel needs to flush that data from the kernel buffers. Now, if your application is running multiple threads at the same time, every one of those threads is trying to write to the file system. You need to put some mutual exclusion, some locking, between them so you can write the messages. But you are locking, and locking is not good. It solves the problem, but it does not help you scale. Now, what happens when you have multiple applications? Multiple applications means multiple logs. If we put those applications on multiple hosts, it's a mess. There's a point where you cannot manage the information; it gets very complex. So logging matters. It's really beneficial, but it needs to be done right. There are many solutions; I'm not saying that we have the best solution of all, right? We're talking here about our experience with the community plus the company's customers. So when you think about logging, you have many input sources. If you have some web application, maybe your Apache web server is generating logs. But in a common environment nowadays, you don't just have one application. Maybe you have Apache, you have Nginx as a front end doing caching, and behind that you have PHP using FastCGI. So we already have three things, and each one is generating logs, logs, logs. Plus, you have your custom application, written in whatever language or scripting, generating its own logs.
So, from an administration point of view, how do you manage to look at this information and get some result out of it, some statistics? What happens when you get an error? For example, the customer or user with the mobile app is using the app and gets an error: where did that error happen? On the front end with Nginx, on the back end with Apache, on the PHP side, or in my custom application? I think you're getting the point. So, not too far from here, this is what we do, right? We have a bunch of scripts in Python, Ruby, Perl, all of them running through crontab or something similar, because we want to get that data and push it to some kind of service. For example, maybe we want to do some archiving, so we use Amazon S3. Maybe we want to do some big data stuff with Hadoop, so we want to push the data to the Hadoop file system. Or to a relational database like MySQL, or NoSQL like MongoDB, Redis, or whatever. This is the current scenario, and it's a problem for most companies. I was at the Fluentd booth, and we talked with many people. Some of them said, oh yeah, I've heard about this topic; and others said, yeah, this is what we do, and it's a problem, because I waste hours on it. And that shouldn't be happening these days. Another thing: okay, you can get the data, but you have to parse the data, and that is quite expensive. One of the most expensive things when working with services and data is parsing strings, because everything is a string here, right? Only a few people write logs in binary format, trust me. So how do we solve this with high performance? You've got the problems. The first problem is: okay, we have different inputs or sources of data, each one in a different format. Okay, I can collect them; now, how do I parse the data? That's where Fluentd was born. Fluentd was born in 2011 to solve all of these kinds of problems, with performance, with flexibility; not trying to compete with everybody, but just to integrate with everybody. Well, as you know, Fluentd is an open source data collector. That means it collects data, it allows you to unify this data at some point, and it pushes that data to any database or cloud service. This is what we had before. With Fluentd, we aim to have this: something cleaner, something you can use in a real way. And it is real: we have more than 1,000 users. Well, what about the features of Fluentd? It's high performance. I could be lying; try it yourself. Built-in reliability: we don't want you to lose data. There are some solutions in the market that collect the data, but if they cannot push the data to some database, if the node is down or you have some outage on your network, the data is lost. That cannot happen; that's why Fluentd has mechanisms for it. We manage structured logs. That means, for example, if you have a web server log where you get the IP address, the date, the method, the URL, and all those fields from the HTTP request, what Fluentd does is take that information and compose a kind of message in JSON format for you. It structures the information; it's like making key-value pairs from each row of logs that you have. It has a pluggable architecture: the community has built more than 300 plugins, plugins to read from syslog, from syslog-ng, from Apache, Nginx, everything. And what does the internal architecture look like, in a global overview?
Starting from the left to the right: of course, we have the inputs. The input plugins get the data in. Then we have the parsers, because we need to parse the information. At some point, maybe you want to filter that information: you don't want all of it. Maybe you just want to keep the HTTP requests coming from the States; maybe you want to discard everything coming from China or India, for statistics reasons. Then, once the data is filtered, it's buffered. That means that when Fluentd is collecting information, it starts buffering it, for a few seconds; you specify the time. Once the information is buffered and the timeout expires, it flushes the data out to an output plugin. So the basic idea is that it works with inputs and outputs. And of course, it supports formatters; formatters are a way to reformat the information that's going out. So, the internals, simplified: when an input plugin gets data in and it's filtered, it goes to the buffer. And the data is in good shape, because Fluentd adds a timestamp: the time when I received this message. This is no longer just a line in a log file; it's an event. An event has properties, right? It has a time, it has a tag. What's the tag for? It's so I can identify, internally in Fluentd, where this data is coming from. And then you have your records in a structured way, and then it's flushed out, of course. As an input, Fluentd doesn't just parse log files; it can also behave as a web server. So I can start Fluentd with an HTTP server input, make all my apps push logs to my Fluentd, and then Fluentd inserts the records into a database. Or you can listen for syslog or syslog-ng; we have plugins for everything. And for output, of course, you can send the data to a file, to Amazon S3 if you want to do some archiving, or maybe to your local instance of MongoDB, because in the end, what you want is to query that information. And when we do buffering, we can do it on the file system or in memory. Nobody uses memory, trust me. Everybody in production uses the file system, because if something happens, nobody wants to lose data. You could try memory, but it's up to you. And when the data is going out to the output plugins, if you have, for example, a meg of data, one meg, what Fluentd does is split it into small chunks, because in the end, everything is a record with a time and a tag, right? Small chunks that are flushed out in parallel if you want. So it tries to optimize and increase the throughput over the network. Well, this is the more complex part. Here we have the record: the time, the record information. It goes to the internal router. Do you remember that we have a tag here? With the tag I can say, for example, if I have one Fluentd running: Fluentd, please listen for log files from Apache on this path, and also listen for Nginx logs on this other path, right? But what's the difference? I say: for everything that is coming from Apache, add the tag apache; for everything that's coming from Nginx, add the tag nginx. And then, when buffering and splitting the data before it goes to the output, the router will say: okay, everything that came from Apache, maybe I want to push to Elasticsearch, and everything that came from Nginx, I want to send to Amazon S3. The configuration for that looks roughly like the sketch below.
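A rough sketch of that tag-based routing; paths and bucket names are invented, and the elasticsearch and s3 outputs are community plugins (fluent-plugin-elasticsearch, fluent-plugin-s3) installed separately, with credentials omitted here.

    <source>
      type tail
      path /var/log/apache2/access.log
      format apache2
      tag apache
    </source>
    <source>
      type tail
      path /var/log/nginx/access.log
      format nginx
      tag nginx
    </source>
    <match apache>
      type elasticsearch        # push Apache events to Elasticsearch
    </match>
    <match nginx>
      type s3                   # archive Nginx events in S3
      s3_bucket my-log-archive
    </match>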
So you can split and copy the data inside one Fluentd instance however you want. And of course, we try to solve this common pattern: you have many inputs and many outputs, and it always goes through buffering, filtering, and routing. It also supports simple forwarding: receive the data and send it along. That's the basic stuff of Fluentd, right? And now we're going to walk through a sample configuration file. "source" means where the data comes from. The type means the plugin: the tail plugin watches files and starts reading new data coming into the file. It's like tail in bash, but through the config. I'm giving a path where the data comes from, and I'm specifying the format of the data, so you don't need to tell Fluentd how to parse it; just use the right format that already exists for you. And a tag: backend.apache. Then we have a match. A match means that for all data whose tag matches backend.* (backend dot anything), we are going to insert it into a MongoDB database called fluentd, in the collection named test (a collection is a MongoDB-specific term). And this other piece here means forward: that means we can listen for events coming from another Fluentd, because maybe your architecture is quite big, and what you can do is make the Fluentd instances talk to each other. So here we have a bigger example, where we have many sources, each running their own Fluentd. Here forwarding is at work, because we are forwarding the records to a central, aggregator Fluentd, which later can insert the data anywhere: Treasure Data, Amazon, Google, anything. And we also have the common case of a Lambda architecture. Are you familiar with Lambda architecture? Okay, a few people. Sometimes when you manage data and you want to query that data, you have two paths, right? One is real-time queries, where you have a bunch of records and you perform real-time queries over a set of data that is maybe three or four weeks old. But when you have data from a year or two ago, queries over that are more complex, so you need a different mechanism to process that data. That's the Lambda architecture: for some things, you distribute the same data to a real-time engine and to a big data engine. Here's an example: we have Elasticsearch, which is pretty good for real time, and we have Hadoop, which is really good for MapReduce and that kind of query. And how can that be implemented in the configuration? If we look at the source, it's pretty much the same as the previous one. Now, look at the match we have here. The type of output is copy. Copy means that for each record, we make a copy for two kinds of stores. The first one is Elasticsearch, using the Logstash format; and each copy is also sent to the Hadoop file system, located on this host, this port, and this path. What I'm trying to explain here is that it does not matter what kind of data you get; you can store it however you want, and that is quite good.
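Reconstructed, the two configurations just described would look something like this (host names and paths are placeholders, and the webhdfs output is the fluent-plugin-webhdfs community plugin):

    # Basic: tail an Apache log into MongoDB, and accept forwarded events
    <source>
      type tail
      path /var/log/apache2/access.log
      format apache2
      tag backend.apache
    </source>
    <source>
      type forward              # listen for events from other Fluentd instances
      port 24224
    </source>
    <match backend.*>
      type mongo
      database fluentd
      collection test
    </match>

    # Lambda architecture: copy each record to both stores
    <match backend.*>
      type copy
      <store>
        type elasticsearch
        logstash_format true
      </store>
      <store>
        type webhdfs
        host namenode.example.org
        port 50070
        path /log/access.%Y%m%d.log
      </store>
    </match>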
Now, who runs Fluentd in production? Well, LINE, GREE, SlideShare... and this one was not intentional, I don't know what happened here, trust me. Really. And this is off the record: we have been talking with the Microsoft people, and they are using Fluentd in a project. I just added the image before coming here, and I don't know what happened. I really apologize to the Microsoft people; it was not intentional, and I will fix it before I post the slides. This is life; things happen. Well, Google Kubernetes is using Fluentd too. So Fluentd is not just about collecting data and pushing data; it aims to implement a unified logging layer. Across my whole architecture, in my company, where I have different kinds of data, how can I unify everything? That's where Fluentd comes in. And of course, we use it a lot at the company. We do cloud analytics, which pays the bills, and we make everything open source, and Fluentd is one of those things. With Fluentd, we collect around 1 million events per second; that is something like 100 times the tweets sent around the world per second. Of course, we have multiple Fluentd instances. I cannot share more details, because I'm restricted on that, but it's quite good, and it works. Now, we've been talking about servers and mobile applications, but there are different scenarios. What about the Internet of Things? Devices connected to each other, or what most of us know as embedded devices. IoT is like the new market name, and the IoT space is growing a lot, to the order of billions of devices. We care about different things, like connectivity, but in the end, all of these devices are doing the same thing: sharing information, sharing data. And they're generating logs. The difference is that in most cases, the logs are not stored on the file system, because it's very restricted; they're dispatched somewhere. So they need logging. When we talk about the Internet of Things nowadays, there are two consortiums, very big ones, with different companies behind them. Companies say: okay, I have my IoT devices, but I need to partner with other companies. Everybody's trying to merge into a complete ecosystem. On one side, we have the AllSeen Alliance, and on the other, the Open Interconnect Consortium, each with their own implementation: one is creating IoTivity, and the other, AllJoyn. That is how the devices communicate with each other. But it needs logging. So, how do you collect the data properly from these devices? Fluentd is not suitable for that, because Fluentd needs at least 40 megs of memory to run properly, and you cannot waste that kind of memory on a really small device. It doesn't matter how cheap memory is; you can't. That's where a new project was born, called Fluent Bit. We just made a big release this week; I don't know if you read the linux.com website, but there's a full post about Fluent Bit. Fluent Bit is a solution based on the experience of Fluentd, rewritten from scratch in C; Fluentd is made in C plus Ruby, and this is made fully in C. And of course, it's fully open source. There's the Twitter handle if you want to follow us, great. Fluent Bit allows you to collect data in many ways, and if you're a developer of embedded applications, it allows you to dispatch the data to different outputs, same as Fluentd. It's made for all kinds of services: to collect data from sensors, signals, radios (we support XBee), and operating-system information. When you have your embedded system running, you sometimes want to measure how much CPU is being consumed, because in embedded, power consumption is an issue; you want to see that maybe you're consuming half of the CPU time, which is really bad, and drive that down. And of course, it can run on telematics or automotive systems.
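For instance, collecting those CPU metrics with Fluent Bit can be as simple as this; the flag syntax is from the early releases and may differ in your version.

    # print CPU usage events to stdout, tagged cpu_usage
    fluent-bit -i cpu -t cpu_usage -o stdout
    # (recent versions can forward the same stream to a Fluentd with
    #  something like: -o forward://192.168.1.10:24224)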
So when we thought about Fluent Bit at the beginning, we asked: how should it be developed? We said: in C; it must support plugins; and it must have an integration with Fluentd. Because maybe you have a full architecture, a lot of IoT devices generating data, but you want to merge that information at some central point. That's why it supports Fluentd. So it's a generic solution, right? This is how Fluent Bit works. In this case, Fluent Bit resides on the embedded device; this is just an example. The data source can be an IoT framework like the ones described previously, an XBee device (an XBee is a small radio device on the market), or some other Linux source. Once it gets the data in, it flushes the data to Fluentd, and of course, after that, you can flush it anywhere. This was really good. People were very happy, but we got a lot of feedback from embedded people: hey, I don't want to use Fluentd. Embedded people are very strict about some things. So we started supporting direct output to different kinds of services. What you see here are the actual output plugins supported by Fluent Bit. The thing is, of course, we don't do buffering on the file system yet, for obvious reasons, but some customers are asking for that at the moment. They say: it doesn't matter that I'm using a really small board; each one has a gigabyte of file system, so please use 100 megabytes of that. Recently, we added support for Elasticsearch. When you collect metrics from embedded systems or any kind of system, you'd like to have some visualization of that data. Elasticsearch is very good for real-time queries, and it also provides a very good tool called Kibana. You know Kibana, right? Cool. Kibana allows you to make graphics out of your data to make some sense of it. So we have support for that, and you can make graphics of, say, the CPU usage of your system. Now, we're going to jump to another context: containers. Who here is using containers? Who's using Docker? Ah, yeah; well, Docker works on top of Linux containers. It's pretty much the same; not the same, but it provides the right interface. So when you run containers, it means you're deploying applications, and not just once, maybe multiple times. We hear about use cases where people have a hundred containers or more; cloud services maybe have a thousand. So how do you collect the logs of those applications? We made a deal, because Docker, in version 1.6, implemented its logging layer; they knew that logging was very, very important. We went to them and said: okay, we can build the Fluentd driver natively for Docker. One of our colleagues wrote the driver, and after a lot of iteration and work, the contribution was merged. What that means is that starting from Docker 1.8, the Fluentd support is there. We wrote a Go driver for Docker; if you get a new version of Docker, it will be there. So when you deploy your application, you can use the logging driver "fluentd": you specify the tag, and it will flush the data to a Fluentd service automatically.
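Using it looks roughly like this; the address, tag, and image name are placeholders, and in Docker 1.8 the tag option was spelled fluentd-tag.

    docker run --log-driver=fluentd \
               --log-opt fluentd-address=localhost:24224 \
               --log-opt fluentd-tag=docker.{{.ID}} \
               my-app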
And who's using this heavily right now? OpenShift. The people from Red Hat: in the new versions of OpenShift, they deploy everything with Docker and Fluentd, because Docker solves the problem of containers, how to manage them, and they use Fluentd to solve the problem of logging, how to manage the logs. And the Docker output is pretty good, because when the container is running, it spits out some information. You know each container has its own ID, so you get the container ID, the container name, the time, the source the data comes from, and the log message. And each container often generates many of these. So if you're writing an application that's going to be deployed on Docker, just stream your messages to standard output and you're set; you no longer need to take care of logging. And in the end, you get many Docker instances flushing their data to Fluentd. And for the people who love Node.js: I added this because people often ask about it. They say: okay, I'm deploying my really cool Node.js application, but how do I handle the logging? So we implemented a package called fluent-logger. It's already available on the NPM package server, so you can get it and start using it. You just create the instance; you configure the tag prefix, where your Fluentd is, and the timeout (how often it pushes data out); and then it's just one line to send a message. So if you have an environment where people are creating Node.js applications, try this, because you are going to unify the logging of all of your applications.
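A minimal sketch with the fluent-logger package; the tag prefix, label, and record fields are arbitrary examples.

    // npm install fluent-logger
    var logger = require('fluent-logger');

    // point it at the local Fluentd's forward input
    logger.configure('myapp', {
      host: 'localhost',
      port: 24224,
      timeout: 3.0
    });

    // one line to send a message; it arrives tagged myapp.access
    logger.emit('access', {userId: 42, action: 'login'});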
Well, that was my presentation; I hope you enjoyed it. I'd like to know if you have questions about it. To modify? Okay, you're asking: when I get the data, can I modify it? Well, there are two ways. When you get the data in, you can specify one format; if you want to change the format, you can change it inside Fluentd. You can say: I modify these fields. And for each record or event that you get in Fluentd, you can modify it in the filter stage. You can say: please append this key-value pair to the message, which is a pretty common use case. And if you have a custom kind of message for which a plugin does not exist, you can write your own regular expression for it. So you can always deal with the data, just not binary data. But we can listen on TCP, UDP, or for logs, log files. Oh, sorry, please continue. Yeah, exactly: people create their regular expression, and you can assign key names to each value that matches the pattern. And people use it when, for example, they have Nginx as a front end, and behind that many addresses for different applications, backend microservices, whatever. They want to filter what is going to which site, and not aggregate everything in one place, but in several. Yeah, that's in version 0.14, which is coming out in a few weeks. We are testing it a lot, because changing the protocol internals like that affects thousands of users. So the solution is there, we are testing it; it's not failing yet, but wait for 0.14. What? Clustering. Yeah, you can make a match with a copy to different Fluentd outputs, because they talk the forward protocol, as it's called. And that's the big difference: other solutions in the market let you get the data and insert the data somewhere, but if that fails, they cannot do anything. Fluentd, on the other hand, allows you to balance the connections and have a failover mechanism; you can send to many Fluentds, and that Fluentd will take the record and flush it to the right place. Sorry, I didn't catch your question there. Here. Okay. They are not reassembled; okay, let me explain. When you get each record, it becomes a unit, an event of data, right? No, no: what we split is a number of rows, say a number of events. If I have 100, maybe I split them into groups of 10. And why is that? Because we need to store chunks. When something happens, we send a chunk; if the chunk fails, we need to retry. Or maybe we have some parallelization, different threads pushing the chunks through. Thanks. Okay, you have a question. Anybody else? Okay, your question is about performance and throughput. The performance, and this is a generic answer, always depends on your data. Receiving data in a fixed format over TCP is not the same as getting data from the file system, because each step adds a small overhead. We do our best to collect data as fast as possible and to make a reliable system that delivers that data as fast as possible, but if something happens, there will be some delay. It's near real time, but it's not real time. So I cannot say this will always work really, really fast, because it depends on your data and on the configuration. Maybe you're inserting the records into your Hadoop file system, but if there's a problem in your network, some TCP packets need to be retried. It's complex to answer, but our experience says that we get good results. That's why it's used in production, and it is replacing other solutions in the market. And the very good thing is that it's fully open source, so if you get some performance problem, we can fix it. Each one gets the address when it starts, so it can balance, and the setup allows you to define which kind of balancing method you're going to use. It's pretty flexible on that; we're not just sending to one and getting married to it, but it will balance as you set up the configuration. You can set up round robin, and there are other methods. Yeah, the parsing happens on the input, okay? For example, I didn't mention this, but internally we use a data format called MessagePack. Have you heard about it? Well, everybody knows JSON, right? Okay, JSON is a string format. Parsing a string format is quite expensive, and JSON is expensive because you don't know where each key starts and where each key ends; the same for the values, and what kind of values they are. Internally, and for forwarding, we use MessagePack, which was created by the Fluentd creator; it's a binary version of Fluentd, sorry, of JSON. In this binary version, you can say: okay, this field starts this way, it's a map, it has 10 keys, that's it. And you can jump into it. So when messages go between Fluentd and Fluentd, they use the forward protocol, and the forward protocol uses MessagePack. So it doesn't have to do the same parsing it did at the beginning; it's optimized for performance. And Fluent Bit does the same: internally, Fluent Bit is just MessagePack.
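To see the JSON-versus-MessagePack point in code, here is a tiny sketch using msgpack-lite, one of several MessagePack bindings for Node.js; the package choice and record fields are mine, not the speaker's.

    // npm install msgpack-lite
    var msgpack = require('msgpack-lite');

    var record = {host: '10.0.0.1', method: 'GET', code: 200};
    var json = JSON.stringify(record);   // text: must be scanned character by character
    var bin  = msgpack.encode(record);   // binary: field sizes are self-describing

    console.log(json.length, bin.length);  // the binary form is also smaller
    console.log(msgpack.decode(bin));      // round-trips without string scanning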
Scribe? Scribe is a commercial solution, right? Scribe provides you everything to solve this. Most people switch to Fluentd because of cost and because of flexibility. There comes a point, when you're using such a tool, where you say, okay, we need some extension, and improving it is a bit hard. Fluentd, as an open source project with full documentation and a lot of contributions, is easy to extend: if you know a bit of Ruby, you can easily write a plugin. So I think it depends on the person. I don't want to say that one is many times better, because it depends on the use case; I'm just talking about what we have seen with customers and with people around the community.

Okay, he asked what needs to be on the input to use Fluentd. The input can be anything that generates some kind of log messages. For example, the Apache web server generates logs, nginx generates logs, MySQL generates logs. Or you can hook up your own system to syslog and let syslog, or syslog-ng, push the data to Fluentd. On that server you need at least 40 megs, and all the dependencies come from Ruby gems written in C plus Ruby. So if you go to the Fluentd website, you get the source code of Fluentd, and you also get some instructions. For example, if you're using Ubuntu long-term support, or Debian, you may not want to build Fluentd and pull in all the dependencies; instead, you get what is called td-agent. td-agent is like a packaged version of Fluentd for enterprise servers. It's pretty much the same, but td-agent is a long-term support version, and it's free too. Serious deployments use td-agent.

Okay, thank you so much. Thank you. If you have some questions, there's a booth, booth number five, tonight. I have some stickers here if you want, and some business cards. And for the people who asked questions, please come here, I have some t-shirts.

Test, one, two, three. Crate is cool. Okay. Hey, everybody. There's only a few of you. Cool, thank you very much for coming at this late hour. I just have something to show which I hope you will enjoy and think is cool; I think it's actually cool. Sorry. Let's do it like this. I just can't put them on my ears; my glasses are in the way.

Okay, so what is Crate? Crate is a database which is developed in Austria. And as you can hear from my accent, I'm one of the few people in the team who are not Austrian; I'm German. The other guys are really Austrians, which basically means that if they talk Austrian dialect, I can't understand anything. So we are supposed to speak the same language, but that's sometimes not the case. Crate is not very well known in the US, and I guess that, together with the late hour, accounts for the small crowd here. But I think it's actually a really, really cool open source database project. It actually won the TechCrunch Disrupt startup battlefield contest. So let me give you a bit of an overview. The title of the slide is Big Data from the big Austrian mountains. So, let's go into it. So what is Crate?
Crate is a database which was built to scale massively. It is not an in-memory database, which are very popular right now; it has caches in memory, for sure, but the main idea behind it is to have a database with persistent storage. It has very powerful search capabilities, and it can do very powerful data analysis in a large cluster. About the cluster sizes we have deployed: this is still a very young product, so there are a couple of projects we know about, and a lot of projects we don't really know about, because they simply use the open source version of our product and we don't know what kind of scale they run at. Of the production systems we know, one of the bigger ones is about 120 nodes running on hardware, with about 3 billion inserts per day. So that's the kind of scale you can get to; it's a really scalable database.

By the way, I always mix this up, or have difficulties with it: when we talk about Crate, there's a slightly strange thing. I basically say Crate is built on NoSQL technology, on the NoSQL architecture, but the database understands SQL. So this is a little special case: we have a lot of features you will recognize from NoSQL databases, but it understands SQL, and that's why it's a NewSQL database.

One of the targets of Crate is to be extremely simple to operate. It's a database which is specifically built for easy operation, and especially to attract DevOps people and developers, because you can simply install the database by yourself, and up to a certain degree, probably many dozens of nodes, you don't really need a dedicated database person for it; you can more or less run it by yourself. Crate scales almost linearly when you add new nodes, and it has the search capabilities, so the queries scale with it. You can run Crate on commodity hardware if you want to. It will certainly scale better on server hardware, with SSDs, a lot of CPUs and so on, but it's actually doing very well at scaling, and you can scale out horizontally on commodity hardware; there is no centralized storage needed. It's a shared-nothing architecture, and storage is all local.

And Crate is extremely elastic: adding new nodes or taking nodes away doesn't really require a system administrator. You just start a new node and Crate will figure it out by itself: the nodes in the cluster will detect the new node, they will start to stream data to it and give it something to work on. It's very resilient. In Crate we have all the good stuff like sharding, partitioning, and replication, so as long as you choose a replication factor which is big enough, the cluster will be extremely resilient. Even if, for example, you have a replication factor of two and you take two nodes down, say because you accidentally rebooted them, then some of the data will be unavailable for a while, and when the nodes come up again the data will be available again. Even that just doesn't kill the cluster.

It also has read-after-write consistency. What that means: it's a typical NoSQL architecture, we don't really have transactions, but you have atomic consistency, and when you write a row to the database in Crate and immediately read after that, you can be 100% sure that you get the changed row back. That's read-after-write consistency.
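As a hedged sketch of that write-then-read pattern with the Python client (pip install crate); the table, the column names, and the assumption that id is the primary key are invented for illustration:

```python
from crate import client

conn = client.connect('http://localhost:4200')  # assumed local node
cursor = conn.cursor()

# Write a row, then read it back immediately.
cursor.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, 'alice'))
cursor.execute("SELECT name FROM users WHERE id = ?", (1,))
print(cursor.fetchone())  # ('alice',) -- the changed row is visible
```

As far as I know, the real-time guarantee applies to key lookups like this one; search-style queries only see new rows after the periodic index refresh.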
So when you look at the database solution space: I don't know if you've noticed, but there are a couple of reports coming out every year about new databases. There are right now probably a couple of hundred different databases in existence, a wide mix of databases with different kinds of storage technologies and different kinds of data models behind them. There's a huge variety of databases you can choose from, and a lot of them are open source, like Crate. To give you an idea of where Crate is located: Crate is based on Elasticsearch, and I will show you some slides later; we didn't reinvent the wheel, we started by reusing some other open source projects. So even though Elasticsearch sits here with the document stores, we see Crate positioned specifically for cloud usage. I'll show you later why that's the case.

Our problem right now is that we are really drowning in data. There has never been as much data generated as in our time today, and at the same time, what do we really do with the data? A lot of data is generated, but we are just not able to analyze it in a proper fashion. There's an estimation, and by the way I have to give credit to Cassandra, to DataStax, for this, these numbers are from one of their slides: the estimation is that the data in our world today amounts to roughly 135 gigabytes for every person on this planet. So what are the challenges in this kind of environment?
If you look at traditional data and compare the properties we have here: with traditional databases, database sizes range from gigabytes to terabytes, while in big data we are talking about petabytes or exabytes. With traditional databases I'm talking about a centralized database, while with petabytes of data that is clearly not possible anymore; you need a distributed database, and that's what Crate is. Another challenge: in the traditional data space we have a lot of structured data, with table structures which don't change very often, while in the big data space we very often have unstructured or semi-structured data, which can change from one day to the other, or from one record to the other. That's also something Crate can manage. In the traditional data space we have stable data models, which can be very deep; in the big data space that's a problem, so we try to have flat schemas, because otherwise it's not really manageable. Very often the relations we had in the traditional sense are resolved as relations embedded in the data itself. In the traditional data space I have a lot of complex relationships; in the big data space I have fewer interrelationships, because a lot of them are embedded. And in the big data space I have a lot of real-time data, I have analytics I want to do, and I have search.

In general, one of the patterns we have noticed in the last couple of years, over and over again, is that a lot of people in internet-scale projects use a mix of technologies for data storage. Very often they use a relational database in combination with a document database, in combination with some search functionality they need, and additionally a blob storage, a store for their web assets, for example. So people build different stacks here: Riak, Solr and RADOS, or MongoDB, Elasticsearch and GridFS, or CouchDB, Elasticsearch and HDFS with Hadoop. This is something which makes projects really complicated, because you have all these different data stores. And that's what Crate is trying to address: the target for Crate is that you can combine these technologies and have a database with NoSQL capabilities, with a flexible data model like in a document-oriented database, with blob storage and with search, in one single open source product.

So, very quickly, I want to show you how easy it is; I hope you can actually read this. I wanted to show you how easy it is to set up Crate, because this is another target we have: this should be the easiest database you have ever set up. These instructions are for starting an instance on Amazon EC2. This here, from the EC2 command line, starts, I think, an Ubuntu image on Amazon. You log in to your server, and then this command here loads a shell script from the Crate server which detects which Linux distribution it's going to install on. You can also do that manually, if you don't want to execute a script you haven't
reviewed; there are instructions on the website for adding the repositories to your Linux server yourself. And then it's just going to install Crate. At this point, sorry for the color here, Crate is already installed and you can access it from the dashboard. When you want to start to create a couple of tables, you need to install a command line tool which has the nice name Crash; this is the Crate shell, called Crash. It requires Python, actually Python and pip, the Python package manager. You install Crash with pip, then you can call it; and as I notice, the command is missing on the slide, it's simply the crash command. This is the Crash command line: at this point you're connected to your database, you can create a table, you can insert data into your table, and you're good to go. So this is basically all that's required to install Crate and get it working.

When you want to install a second server, you repeat exactly the same steps. If you are not in the cloud, but locally in a subnet which accepts multicast traffic, the second node will automatically find the first node, and the table you have created at this point will sync up with the second server automatically. They find each other and together they build a cluster. On Amazon there's one more step required if you want to set up a cluster with multiple nodes, because Amazon doesn't allow multicast messages. So we have a plugin which either lets you use unicast, or, even more elegant: at the time the EC2 instance is created you can give the instance a tag, and then you can define in the Crate configuration that all the servers running in your account which have the same tag are part of your cluster. What happens is that every couple of seconds your server scans, using the AWS interface, for other new servers with this tag, and then it automatically connects them together.

Now, an impression of the SQL syntax for Crate; there are a couple of additions. One of the things I mentioned before is that Crate allows for semi-structured data. You can either define this explicitly, for example something like an object which has nested attributes in it, or you can simply create a very basic database schema and then just start to insert data into it. At the time you start to insert data, Crate will notice the columns it doesn't know yet, modify the internal structure of the table, and the table will basically learn the data you're inserting: it adapts the metadata of the table structure. This is the default behavior. If you don't want that, for example because you don't want false or faulty data to create new columns, you can switch this behavior off; but by default, if you insert data, Crate is going to adapt the table metadata. You also see here a couple of keywords concerning sharding and partitioning: I'm defining here how many shards I want to use, and on which column I want to do the partitioning. At this point I'm also defining how many replicas I want in my Crate cluster. Replicas here means the number of copies in addition to the primary: two means that in the complete cluster I have three copies of my data in total. (A sketch of such a table definition follows below.)
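A sketch of what such a table definition can look like, run through the Python client; the table name, columns, and numbers are invented, not the ones on the slide:

```python
from crate import client

conn = client.connect('http://localhost:4200')
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE events (
        day TIMESTAMP,
        name STRING,
        payload OBJECT (DYNAMIC)   -- unknown keys extend the schema on insert
    )
    CLUSTERED INTO 6 SHARDS        -- how the table is split across the cluster
    PARTITIONED BY (day)           -- one partition per distinct day value
    WITH (number_of_replicas = 2)  -- two replicas: three copies in total
""")
```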
So, for example, you can simply insert data, and you can do a mass insert. One data source you can use is JSON: if you have this kind of JSON file, you can simply do a COPY FROM and it will insert it into the database.

Crate actually tries to figure that out by itself, including things like rebalancing based on the size of the partitions. But you can also, for example, define ranges: you can explicitly define what is part of a partition and how the partitions are laid out. The partitions you define here apply across all the shards; a partition basically defines the range of the values, the bucket where the data lives. So it defines the partition, that's correct. Let me actually think about it and I'll come back to you at the end. It actually maps, because Crate is sitting on top of it, this actually maps to Elasticsearch; it's an Elasticsearch term, correct.

Crate is written in Java, as Elasticsearch is written in Java, so Java is going to be installed as part of the process. One little piece of information here, something I noticed yesterday while installing the latest version of Crate: the latest version now requires Java 8. So when you install it, you need to make sure that your Linux distribution has Java 8, otherwise you will first need to install it manually from somewhere else, because the installation process will not be able to get the right Java version by itself.

There are different drivers available for programming with Crate in different languages; I'll talk a little bit later about that. These drivers talk with Crate either through the REST interface or through a binary interface, and I think the binary interface is mainly for Java clients.

There is a UI with Crate to see the cluster status. It will give you an overview of all the tables in the system, an overview of where the system is regarding load, and it will also tell you if there's any kind of problem in the system, for example if a node is down or if tables are under-replicated; in that case it will show it, along with the action it's taking.

There is a blob storage which you can use, with an interface for uploading and managing blobs; a sketch follows below. These blobs are also managed with the replication settings of the database, so they're going to be stored according to the replication settings you have. By the way, they are not stored in Elasticsearch; they are stored in the file system of the nodes, and Crate will look at the size and the available space and distribute them across the cluster.

There is a plugin infrastructure which you can use to extend the SQL queries. It's not really like stored procedures, but you can have something like user-defined functions, and that's something you can implement with plugins. And there are many different clients available for all the main languages, like Java, Python, Ruby, PHP, Scala, Node.js, Erlang, and a couple of other ones.
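Returning to the blob interface mentioned above: blobs are addressed by the SHA-1 of their content over HTTP. A hedged sketch, assuming a blob table named assets has already been created (CREATE BLOB TABLE assets CLUSTERED INTO 3 SHARDS) and a node on localhost:

```python
import hashlib
import requests

data = open('logo.png', 'rb').read()
digest = hashlib.sha1(data).hexdigest()

# Upload: HTTP PUT to /_blobs/<blob-table>/<sha1-of-content>
requests.put('http://localhost:4200/_blobs/assets/' + digest, data=data)

# Download again by digest
blob = requests.get('http://localhost:4200/_blobs/assets/' + digest).content
assert blob == data
```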
Also, Crate runs really, really well in the cloud, as well as in containerized environments. The main reason it runs so well there is the whole automatic discovery, also when nodes go up or down: Crate takes care of the replication and of distributing the data onto the remaining nodes.

So here are a couple of features of Crate. It supports, I would say, about 95 percent of SQL, I think it's SQL-92. Additionally it allows arrays and nested objects; that's something I showed you with the CREATE TABLE statement, and by the way this is directly mapped onto nested objects in Elasticsearch. Internally it has an information schema, a meta schema, which you can query and where you can find out more information about your database schemas, like in other SQL-based databases. Cluster and node states are also in this meta information, so you can build your own little client which queries the cluster and finds out what state the cluster is in, in case that's important for your application; there's a small sketch of this below.

And one really big feature is that in many cases you can reuse the O/R mappers you're already using, because Crate accepts most standard SQL. You can use O/R mappers like Hibernate, SQLAlchemy, ActiveRecord, PHP PDO and so on directly. And if, for example, you have a project which is already running, let's say on Postgres or MySQL, and you don't really use stored procedures, then in most cases it's relatively easy to move it over to Crate.

This is a brief overview of the different components on a Crate node. All the nodes in a Crate cluster are exactly identical; it's a shared-nothing architecture, and all the nodes are equal. There is, though, one node in the cluster which normally has a special role, the so-called master node. The master node is the node which holds all the cluster's metadata, but it is not a single point of failure: if the master node goes away, the rest of the cluster holds an election and elects a new master node. The metadata from the master node is also replicated in the cluster, so a new node can take over this responsibility.

The clients in general have the choice of either using the REST interface or the binary interface. Then we're using a single SQL parser, I will show it on the next slide; that's Presto, which comes from Facebook. The piece with the analyzer and the distributed planner is again a piece from Crate; it does the distributed execution, then merges and collects the data and gives it back to the client. For the case of the blob storage it works a little differently: that bypasses most of this and retrieves the blob directly from the nodes.

This is an overview of the different layers, and also of the technologies used here. The base storage layer is Lucene and Elasticsearch, so everything is stored there, and on the side we have the blob storage, which is outside of Lucene and Elasticsearch. Then there's the network layer, where we are using Netty for transferring the data.
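The information and sys schemas mentioned above can be queried like ordinary tables, and the client can be handed several nodes at once. A small sketch with the Python client; the hostnames are invented:

```python
from crate import client

# Giving the client several servers lets it balance requests
# and fail over if one node goes away.
conn = client.connect(['crate1.example.com:4200',
                       'crate2.example.com:4200'])
cursor = conn.cursor()

# The sys tables expose cluster and node state.
cursor.execute("SELECT name, load['1'] FROM sys.nodes")
print(cursor.fetchall())
```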
Then we have the aggregation level. You can see from the color coding here where the pieces come from: this is what we have developed completely by ourselves, the distributed SQL, the distributed reduce and the data transformation. For the querying layer we are using Facebook's Presto, which then calls the query planner, which does the execution. On this layer there's also the module for importing and exporting data. And as clients we have the Crate dashboard, the Crate shell, which is called Crash, and a lot of different client libraries, with Java being kind of a first-class citizen here, because it uses the binary protocol, which talks directly to the engine.

Yes, so the current version, the latest Crate version published in the last sprint, is using Elasticsearch 2.0, I think; that's basically brand new. Then the second question: I know there are some brand new functions regarding geo location, a geo data type, in there, but I have not actually played around with it, so I'm not really sure what kind of queries are supported. It's a feature which came in with the latest version, so I would need to have a look and see what exactly is supported.

So these are the technical highlights of Crate. It's a very easy database to scale and to manage, it's based on a shared-nothing architecture, you can use real-time SQL, it supports environments with high availability requirements, and you have a data store that developers have a very easy time developing against, because it doesn't give them a lot of restrictions: you can very freely insert data into the database and manage it. Regarding scalability: in our bigger setups we have had up to 300,000 records inserted per second into a Crate cluster, and if I remember correctly this was not even on bare hardware; I think it was on an EC2 cloud deployment. Shards can be moved manually or automatically; the standard setting is that Crate will try to optimize that by itself.

In general, data can be semi-structured, and you can use either strong SQL schemas or schemaless SQL schemas; that's up to you, whether you define a table schema yourself or want Crate to auto-detect the schema. You can have nested documents, which is basically an Elasticsearch property, and you can use these nested fields in SQL queries exactly like you would use normal fields: if you have nested fields, you address the field name, then the nested field, and if there's another nested field below that, you go one more level down, and so on. And the same if you have an array: you can select array elements out of it. This is directly supported in the SQL; it's an extension to normal SQL that you can have those nested elements there. (A small sketch of this follows below.)

Then the Crate planner is the piece in Crate which tries to optimize and has an internal strategy for running optimized queries on the Crate cluster. The collect, shuffle and reduce phases are used for data aggregation across the cluster. And of course Crate uses a cache internally for speeding up queries, so that, for example, paging gets faster access.
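A sketch of the nested-field and array access just described, reusing the invented events table from the earlier sketch and an assumed array column named tags; note that Crate spells the nesting with subscripts in SQL:

```python
from crate import client

conn = client.connect('http://localhost:4200')
cursor = conn.cursor()

# Nested object fields are addressed level by level.
cursor.execute("SELECT payload['user']['name'] FROM events")

# Array elements can be selected by index (1-based),
# assuming `tags` is an array column on the table.
cursor.execute("SELECT tags[1] FROM events")
```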
And Crate is developed with the Java NIO interface, so it uses the internal asynchronous capabilities of the NIO library in Java to process the I/O.

A single table? You can't... so, there is no constraint support across tables. But there are joins, I haven't mentioned that so far: in one of the latest versions there are now inner joins as well as cross joins; outer joins, for example, are not supported right now. So you can have some dependencies between tables, not via constraints, but you can do joins, which was a pretty big step; I think it took about a year to develop, so it was quite a big subject. (There's a small example after this answer.)

Yeah, you can; that's happening automatically. No, no, I meant... no, it's happening on the servers. It's happening on the Crate level, exactly; that's the secret sauce of Crate. It's relatively new, so I haven't seen really comparable numbers. I've played around with it, but I don't have a test environment where you can see a comparison between, let's say, a MySQL database joining locally versus Crate doing it. Anyway, it's a question of what a fair comparison even is, and how you can really test it well, because to be fair and to get comparable numbers you would need to choose a scenario where MySQL really would have problems because of the size. So I haven't seen that yet; I hope we can publish something on the blog soon about the performance there. Actually, sorry, I think there are two blog entries right now on the Crate website which talk about optimizing the performance of joins; if you want, you can have a look there. But like I said, I don't have a direct comparison between relational and NoSQL joins.

Sorry, no, we don't. I was actually thinking about something else; when I was thinking about real-time data: there's currently some work invested into making Crate a MySQL slave, so that concerns streaming data, but in the Spark world, I don't know. If you give me your email address I can ask; my colleagues are actually meeting tomorrow for some skiing in the Austrian mountains, and they have a week of planning ahead and so on. Unfortunately, or fortunately, because I really hate the cold, I'm not there. So I don't know right now what is planned, but I will be able to find out at the end of the week.

Yeah, but the problem, exactly, you got it: in general, Elasticsearch is packaged into Crate, and you can query Crate through the Elasticsearch interface, but you should not insert or update data that way, because that's going to confuse the metadata layer sitting on top of it, the schema information and the internal information Crate keeps about it. So the Elasticsearch connection to Crate is read-only.

Crate is licensed under the Apache license. If you want to have a look into the source code, you can do that; all the source code is published on the GitHub account. Also, we tried to reuse as many open source projects as we could, to not reinvent the wheel, so if you're interested in what we are using, you can find that there.
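Picking up the join support mentioned earlier in this answer: a minimal sketch of an inner join through the Python client, with invented users and orders tables:

```python
from crate import client

conn = client.connect('http://localhost:4200')
cursor = conn.cursor()

# Inner joins (and cross joins) are supported; outer joins were not yet.
cursor.execute("""
    SELECT u.name, o.total
    FROM users AS u
    INNER JOIN orders AS o ON u.id = o.user_id
""")
print(cursor.fetchall())
```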
That said, just a couple of words regarding database scaling with Crate; that's one of the features we are pretty proud of. You have a shared-nothing architecture, so all the nodes are the same. The app connectors, by the way, the drivers we have built, are intelligent enough that when they talk to a node, the node will tell them whether to talk to it or to another node. There is also a mechanism built into the drivers so that if one of the nodes goes away, they can talk to another node; in this case the intelligence is built into the drivers.

Crate runs really well in different container-based environments. One of the really good integrations, or tests, we have done is together with CoreOS: you can run Crate containers there with fleet, push the Crate containers out with Docker and fleet, manage them there, and scale up and down. This works really well. In general, in all these different environments, with Docker directly or with Tutum, we have built Docker containers, we have tested in these environments, and it works really well, like I said, because of Crate's capability of auto-detecting the cluster and handling scaling up and down automatically.

Yes, if that's the case, then we have a problem. Yeah, sure, I think I mentioned that before: if multicast doesn't work, there are a couple of other possibilities to configure it. What is the default right now? Unicast, yes. Like I mentioned, on Amazon for example you have different ways of doing it: one way, I think, is with DNS, but the most elegant way is tagging the instance. And I'm not sure, I don't think the Docker container has been tested with ECS. There's actually a Mesos scheduler, I think, which was written about three quarters of a year ago, so it runs on Mesosphere as well. I'm not sure about Kubernetes; I think something was built there as well, but I'm not sure. The easiest way is to simply search on our website and you will find it. Two or three weeks ago we had a problem with the blog, and part of the content was down because of a little mishap, but right now all the content should be up again, and as soon as there's a new feature out, you will find it on the blog for sure.

There's also quite a lot of documentation out there; it's actually pretty well documented. The only thing is that a couple of the third-party drivers, for example, might not be as well documented, but whatever is maintained by Crate has quite good documentation. And if not, in the bottom corner of the website there's a little chat box with a real engineer sitting behind it, so you can ask them these kinds of questions. The only thing you probably shouldn't do is try it super late at night, because they're Austrian, on Central European time, so they're not going to be in the office; but if you chat with them in the morning, they will respond to you directly.
Crate is available as a Docker container, so you can directly use that, and you can use it with Swarm. It's also on Kitematic, that's another place where you can get the container; that's this one here, I think the projector is not really sharp. And there are a lot of different cloud platforms you can run Crate on. On AWS there's an AMI you can use directly. There is also a website, and if you're interested you can come to me and I'll give you the address, with a CloudFormation generator: you type in, I want an x-large Crate cluster with 10 instances, I want to use EBS storage, two availability zones, and so on, and at the end it spits out a complete CloudFormation template. You load that template into Amazon, and the only thing you need to have ready at that point is your SSH key in Amazon; everything else, security groups and all the other pieces, will be orchestrated, created, and configured for you. For Azure an image is available, for GCE an image is available, and for SoftLayer an image is available too; that one is a little different, it's basically an installation procedure there, and it's one of the things I haven't tried yet, so I can't tell you exactly how it works.

Crate is, as you can see, mainly popular in Europe; we are very present there at user group meetings, conferences and so on. In the US we are just getting started; from our download numbers we see more and more people looking into Crate, but right now the most people looking into Crate are in Europe.

If you want to play around with it, I also recommend a couple of nice examples you can try, to get a feeling for how you can develop with Crate and how the drivers work. There is an example on GitHub: I don't know if you know this, but GitHub has opened up quite a big amount of data about the commits of their users to public repositories, and that's data you can mine. GitHub now has a yearly data mining competition around finding interesting analyses of that data, and we have built such an analysis on Crate. There's an example you can download from GitHub which analyzes languages, request latencies, sentiments and so on. That's how the example looks; you can reuse it for your apps, and you can see how the analytics and the aggregation perform. If you download the complete GitHub archive, that's at least 100 to 200 gigs of data, so you can see how Crate performs with big data.

What I also want to mention: because there's Elasticsearch under it, Crate works very well with Elasticsearch tools. For example, you can use Kibana: in this case these are some web logs, the web logs are fed into Crate as a SQL table, and then you can use Kibana on the Elasticsearch interface to analyze them as a dashboard. Not to my knowledge; it might be possible that you can do that through the plugin interface, but I'm not 100% sure. Sorry, what is it?
The writes are consistent, yes. There is a performance penalty for doing that, because you need to wait until the last of your replicas has actually written before you can return. I'm not sure exactly how this is implemented; I think this is also a Crate thing, as far as I know. I only know that you can tune it, and you can for example disable read-after-write consistency, but it's heavily recommended to use it.

Okay, that's all I had as slides. Questions? Correct, so when you download a blob, Crate handles that automatically; you don't need to know on which node it's located, this is handled internally by Crate. It's not sharded, that's correct; it's basically a file, but it is replicated. And, since this is probably the next question: even with a replication factor, it will only download from one node; it will not go to multiple nodes and stream from them at the same time.

Okay, then thank you very much. I'm also in the exhibition hall, I forget right now which number the booth is, something like 300-something. Whenever you have time or want to ask some more questions, come over; I'll be there tomorrow during the normal hours. Cool, thank you very much.