Hi everyone, I'm Emmanuel, and I think you have my name and email at the bottom. I'm a data scientist at a startup that just launched in Singapore, called Omnistream. We're not super famous and you'll probably never hear of us because we're very B2B, so don't worry about that. But I used to work for another company before, and this talk relates that experience. I won't disclose the name, because it was not a really good experience, but it was a company very much focused on web stuff: we were trying to acquire a lot of traffic from the web, convert that into sales for someone else, sell the lead on, and take a commission. It was a bit of a cash machine, which matters for this presentation because we had a really large budget for Amazon, so as a data scientist it was kind of interesting to work there.

Today we're going to talk about data wrangling: getting all the data that comes from all those providers and pushing it into a consistent "single source of truth", as we like to call it. That's a bit fancy; we really just want to put everything into a data warehouse, so that we can actually start working on it and do some data mining, reporting, BI and so on. Our sources were Google Analytics Premium (the same as Google Analytics except a lot more expensive, but you get your own data points; with regular Google Analytics you only get aggregated information, with Premium you get every single little data point), plus AdWords, DoubleClick, Facebook, and some ad hoc stuff here and there, like CRM and call center data. The task was to put everything into a single data warehouse.

In a bit more detail, for each provider we split the work into four tasks. The first one is to fetch whatever data the provider gives us and put it, as-is, in our data lake, so that if something goes wrong, say the provider shuts down, we still have the data exactly as they sent it at the time, and if we improve our process later on we can always replay the whole thing. We just pushed everything to S3. I only have good things to say about S3; it's a really excellent product. I mention that because I don't have so many good things to say about other products, but we'll get to that. The next step is to normalize the data: take everything that's in the data lake and make it fit our own purpose, transforming it into a data format that we designed and like to read. Then we push everything to the data warehouse, Redshift in our case, again a really good product. Inside Redshift we then do a bunch of ETL, but that's not really the topic today, because once the data is in the warehouse we're all set. That's the easy part.

All right. Pulling data from the providers is by far the most annoying task ever. If people tell you data science is Glamour 1.0, don't believe them: it's mostly dealing with really crappy data sets. Every publisher feels the need to have their own data format. For example, in comma-separated value files you usually use a comma or a tab as the delimiter; DoubleClick uses the thorn sign (þ), a letter from Old English. Handling that looks roughly like the sketch below.
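Just to give a flavour of that fetch-and-normalize step for a delimiter like that, here is a minimal Python sketch; the file names are made up for the example, and all it does is rewrite a thorn-delimited export as a gzipped comma-separated file.

```python
import csv
import gzip

# Hypothetical sketch: convert a thorn-delimited DoubleClick-style export
# into a gzipped, comma-separated file. Paths are made up for the example.
with open("doubleclick_report.txt", encoding="utf-8") as src, \
        gzip.open("doubleclick_report.csv.gz", "wt", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    for line in src:
        # The export uses the Old English thorn sign as its field delimiter.
        writer.writerow(line.rstrip("\r\n").split("þ"))
```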
Even the good ones, and you'd expect Google to be pretty great at this, change the schema every other day, which makes it quite hard to follow. To be fair, that also means they improve the product every other day, but it still makes it hard for us to keep up. DoubleClick even managed to mix encodings within a single file, which is brilliant: you get some Latin-1 in a file, and then some Big5, the traditional Chinese encoding used in Hong Kong, in the very same file. It's a horrible mess to parse. On top of that we had a bunch of custom data sources, and some of them don't provide an API, so we managed to get the data out through SQL injection. They were all partners, so we could do that, it was not illegal, but still, that's how we got the data out of them. Anyway, that's the really annoying part. It has little to do with AWS; it's just to show the kind of thing we were dealing with. We put everything in S3 and we go from there.

Normalizing the data is an embarrassingly parallel problem, because you can look at every row independently and normalize row by row. Normalizing a million rows or one row is just the same; it really doesn't matter. So MapReduce, and I don't know how familiar you are with it but you've surely heard of it, fits quite well, even though we only need the map part, not the reduce part: mapping is basically going row by row and transforming each one.

The good news is that we had a more or less unlimited budget, because the company was poorly managed, so we decided, okay, let's go for the expensive stuff. For the anecdote, we had a $10,000-a-month Elasticsearch cluster just to pull in the leads coming from the internet, which is a really bad use of Elasticsearch and a really bad use of that $10,000. Whatever. So we went with Elastic MapReduce (EMR), a fancy AWS tool that lets you spin up a cluster to run big data jobs. I put the description on the slide; you don't have to read it. The point is that it throws in a lot of buzzwords, notably Spark, which we decided to use. If you really have to use Spark, Elastic MapReduce is a really good product. But my point today is that you probably don't need Spark.

That said, S3 behaves as your Hadoop file system (if you're not familiar with Hadoop, HDFS is basically a distributed file system), and EMR runs the Spark instances that scan that data, parallelize everything, do the bit of data transformation, and spit the result back out. So for our normalization task, EMR automatically splits the data set across however many cluster instances we ask for, does the little bit of processing for the normalization, and writes the result back to S3. Really cool use case. EMR is good, and Spark is good for computation, with a lot of caveats, but it's good for computation. In our case the computation is quite limited, because we're really just reading rows and normalizing them.
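To make the map-only idea concrete, here is roughly what that kind of job looks like, as a minimal PySpark sketch. The bucket names, the delimiter and the normalize function are made up for the example, and our actual job was written in Scala (and, as you'll hear, we dropped it in the end).

```python
from pyspark import SparkContext

# Rough sketch of the map-only normalization as a Spark job, just to show
# the shape of it; paths and the normalization logic are hypothetical.
sc = SparkContext(appName="normalize-provider-export")

def normalize(line):
    # Per-row cleanup: split on the provider's delimiter (a tab here),
    # trim the fields, and re-emit the row as plain comma-separated values.
    return ",".join(field.strip() for field in line.split("\t"))

(sc.textFile("s3://my-data-lake/doubleclick/2016-05-01/")
   .map(normalize)              # one row at a time: embarrassingly parallel
   .saveAsTextFile("s3://my-normalized-data/doubleclick/2016-05-01/"))
```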
The downside is that EMR is quite expensive. Actually, I looked at it yesterday and it's not that expensive anymore, but at the time it was quite pricey. Mostly, though, it was buggy. If you wanted to run Spark on EMR, you actually had to monkey-patch the instances: you spawn the cluster, and the first thing you do is run a batch script on all the nodes to fix a botched Java environment variable and replace a binary. It's quite funny, because that was pushed to production; I can probably still point you to the AWS developer forum question asking why it doesn't work, and to the GitHub issue on a totally unrelated project that resolves it. But the main problem is that it was crazy slow. The bottleneck in our application was mostly data transfer, because we get gigabytes of data every day across all the providers. The computation was fine; it was just overkill, computation-wise, because you have to launch a master node and then some slave nodes that do the actual computation, and the master node just splits the work up. Completely overkill. The other problem, and it's not really a problem, is that Spark is JVM-based. I have nothing against Java. Okay, I do have things against Java. We used Scala, and Scala is cool; I really enjoy Scala. It's not for everyone, and I'm not advocating for it, but I have fun using it. It does mean, though, that development is quite slow.

Anyway, we trashed all of that. We had spent three months on it and it was just not cool. Instead, we completely rewrote the whole thing: instead of hard-to-write Scala tasks, we wrote small Python scripts that run on small EC2 instances, and I'm talking about the smallest ones, the micro instances. We defined a data exchange format. If you use Spark and those tools, you'll be pushed towards the Apache formats, like Avro and Parquet; they're probably good, but they didn't really fit our needs. We just used a stupid CSV file, the most basic CSV file there is. It has a header at the top with the names of the columns, plus a schema line for the columns, so it says this is an integer, this is a string, this is a string no longer than so many characters that can be null, and so on. Really super simple. Then you gzip it, which has the advantage of being streamable: if you only have half of the file, it's fine, you can resume afterwards and keep going. It is a binary format, but it's streamable, and you can read just the beginning of the file to get the header and make sure that everything is fine, so you can do all your consistency checks really quickly. It's also quite efficient, both in processing power, since you don't have to decode everything, and in volume of data. With JSON, every time you have a field you have to name the field and then give the value, name the field, give the value; with CSV you name the fields once at the top and off you go. I'm not even talking about XML. So that's what we do: we read the data, normalize it, and push it to Redshift. It is crazy cheaper and crazy faster. And because it's built out of small Python scripts, it's also really easy to scale, because you don't really have to parallelize that much: you say, okay, you focus on that task, you focus on that task, you focus on that task, and it works almost transparently. In fact, we did the parallelization layer in bash. And it's a very reliable infrastructure: S3 is an excellent product, EC2 is an excellent product, and Redshift is a really good product.
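Here is a minimal sketch of that read-just-the-beginning consistency check, assuming the convention of column names on the first line and declared types on the second; the file name and the expected column are made up for the example.

```python
import csv
import gzip
import itertools

# Hypothetical consistency check: because gzip is a stream, we can read just
# the first couple of lines of a multi-gigabyte file to validate the header
# and the schema line without decompressing the whole thing.
with gzip.open("adwords_2016-05-01.csv.gz", "rt", encoding="utf-8", newline="") as f:
    names, types = itertools.islice(csv.reader(f), 2)  # line 1: column names, line 2: declared types
    assert len(names) == len(types), "header and schema rows disagree"
    assert "campaign_id" in names, "an expected column is missing"

print("header looks consistent, safe to load")
```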
Now, you may be wondering why we were redoing all of that; surely there's a product out there that already does something like this. Not really. We searched for one, but the data inputs were so specific and, to be honest, so crappy that you can't find a provider that handles them all and will push them for you, especially the ad hoc custom call center stuff. So we had to write all of that ourselves. And because everything sucks quite a lot, you want to develop quickly, and Python is great for that: you write the normalization step as, okay, this is the input, you do your Python magic, this is the output, boom, it works. If you had to do that in Java, because everything is typed, it would take you forever, absolutely forever. And again, computing power is not really an issue, so even though Python is slow, it's quite all right. That was the main choice, and it really accelerated our development speed.

So at the end of the day, this is what we used. S3 for the data lake: again, a really excellent product. EC2: an excellent product. AWS Data Pipeline, which, if you're not familiar with it, basically allows you to connect tasks together; it's okay, notably because it can spawn the EC2 instances we want and then run the scripts we want on them at a given time, so that worked quite well. Beyond that, it's a bit of a meh product: at the time we got weird error messages that weren't documented anywhere, and we couldn't ask anyone about them because we didn't have, how do you call that, AWS Support, the help desk thing, because the company refused to pay for it. So AWS Data Pipeline works, but I wouldn't expect too much of it. We used SNS for notifications and reporting, and that's pretty brilliant as well. And Redshift: I can't emphasize enough what a good product it is. Nowadays people want to use NoSQL databases, they want to use Hadoop and Spark, but I guess I've hammered that nail enough. Redshift is based on PostgreSQL, which is a great database engine, and it has a few quirks. Redshift is a really good analytic database, so it's really good at doing simple computations over really large amounts of data. It's not really good at what an application database does, connecting things, doing joins, handling complicated relationships between your tables and extracting data, because the indexing is built in a completely different way. In fact, you don't really have indexes; you just define the way your data is ordered and distributed in the tables. Python for any scripting and programming, and some bash. If you don't use Python, you really should; it's the best language ever, by far. And gzipped CSV: again, don't believe the hype. There are a lot of data formats out there. Sure, we all use JSON, and it's quite a good format. Hopefully we've stopped using XML, because XML was not good. But CSV is old-school computing and it still holds up really well.
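To make the push-it-to-Redshift step concrete, here is a minimal sketch of a bulk load using psycopg2 (Redshift speaks the PostgreSQL protocol) and Redshift's COPY command, which pulls gzipped CSVs straight from S3. The cluster endpoint, table, bucket and IAM role are placeholders, not our actual setup.

```python
import psycopg2

# Hypothetical bulk load: Redshift's COPY reads the gzipped CSVs directly
# from S3, in parallel, which is far faster than inserting row by row.
conn = psycopg2.connect(
    host="my-cluster.abc123.ap-southeast-1.redshift.amazonaws.com",
    port=5439, dbname="warehouse", user="loader", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY analytics.doubleclick_daily
        FROM 's3://my-normalized-data/doubleclick/2016-05-01/'
        CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/redshift-loader'
        CSV GZIP IGNOREHEADER 2;
    """)
conn.close()
```

The IGNOREHEADER 2 option would skip the two leading lines (column names and declared types) of the exchange format described above.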
What didn't make the cut is Elastic MapReduce, for the reasons I mentioned, along with Spark and Hadoop. And I guess this is one of the take-home messages (I think I have a slide called take-home message): you probably don't need EMR, and you almost surely don't need Hadoop and Spark. Okay, just today, literally today, I picked up a hard drive with 600 gigabytes of pictures. I will need Hadoop to process that, because it's unstructured data, it's all over the place. I'll upload it all to S3, which is going to take me days just by itself, but then everything will run smoothly with Hadoop and Spark. But if you just want to do normal data transformation, if you have structured data coming in, you're wasting your time and your money using Hadoop and Spark, unless you've already bought a Hadoop and Spark cluster and you're just trying to get your money back.

Data normalization is quite a hostile environment. If you work with external providers, every provider you have will have their own different crappy way of giving you data, so you need something agile to process that data efficiently and push it into something that looks like all the other data you have. That's something you can't get away from; you have to do it, and it has to be custom development. And you don't need any fancy Apache data format. Python is the best, Postgres is the best, and that's it.

As for us, we're quite new. We still do a lot of data wrangling, but it's fun now, because I'm in charge. We're looking for people, so if you don't like Hadoop, if you really like Python (we have some former Jane Street people, we're moving to OCaml, we really like that), or if you hate building web apps, just send me a note. I'll be happy to have a chat with you. And that's it for me.

So, you're quite new here in Singapore, you just came from Hong Kong, right? Yes. Welcome, Emmanuel; it's really great to have some Hong Kong refugees. You're free now. Yes. But on a serious note: who does data mining in this group? Just raise your hands. Analytics, data, come on, someone must do data mining here. So it's just me? I do. Well, anyway, the AWS platform is pretty awesome for big data and all the rest of it. It's certainly better than my USB 3 hard drive, which has given me a lot of hassle. Any questions for Emmanuel? I think they all have a Hadoop cluster and they're like, no, no. If I have massively, massively, stupidly parallel jobs, what do you use, if not MapReduce? Oh, it's so easy to parallelize that you just have to split your data set, which takes no time at all. You mean you just script the parallelization or something? Well, you just look at all the files you have in your data lake and you say, I want to spawn, sorry, not 20 nodes, 20 scripts that will run in parallel, and you instruct every script to take a part of the files. You don't have any kind of concurrency issue or anything like that. Any other questions? Three, two, one. Well, you can always ask them later. Thanks again, Emmanuel. Thank you.
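A minimal sketch of that splitting idea: it uses Python's multiprocessing rather than the bash layer mentioned in the talk, and the paths and worker count are made up.

```python
import glob
from multiprocessing import Pool

def normalize_file(path):
    # Placeholder for the per-file normalization script: read one raw export
    # and rewrite it as a gzipped CSV following the header/schema convention.
    print(f"normalizing {path}")
    return path

if __name__ == "__main__":
    files = sorted(glob.glob("/data-lake/doubleclick/2016-05-01/*.txt"))
    # Each file is independent, so there is no shared state and no concurrency
    # issue: hand the list to N worker processes and let them chew through it.
    with Pool(processes=20) as pool:
        pool.map(normalize_file, files)
```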