Yeah, welcome everyone. Thank you for attending and thank you for the introduction. I hope I will meet the expectations.

So yes, I'm here to present Bonobo, which is an ETL project for Python 3.5 and later versions. It is about six months old. I worked on it for much longer, but this version is a brand new rewrite from the beginning of the year.

Real quick: I'm Romain Dorgueil. I have a French name and probably a French accent too. I have worked in a lot of different companies in different contexts, but I've been around web development and software engineering for about the last ten years. I have seen a lot of ETLs on the market, in different contexts, and I didn't find what I really wanted. That's the main reason why Bonobo exists today and why I put so much energy into it.

Real quick also: I'm currently working as an advisor in the startup accelerators of BNP Paribas. We are a team of former entrepreneurs building WAI Boost and the FinTech and corporate accelerators, where we basically give business and technical advice to founders.

So, back to the beginning. My plan for this talk is to take maybe 10 minutes to go into what exists in various languages, not only Python, because most of the ETLs I used were not in Python; what exists in Python as well; and why I decided it was not meeting my needs. Then I will try to use most of the time of the talk to show you real cases and real-world usage. Well, not really real world, but example usage, so you can understand a bit more, and also dive into the few things you have to know to start using it, and there is really not much. And then we have a conclusion with some pointers on where you can go from there.

So, ETLs. As many of you probably already know, it means extract, transform, load. According to Wikipedia, it was already popular in the 1970s.
So definitely not something new. And it's basically everywhere you have more than one data store talking about the same data. If you have some master/slave data, if you have some stock system connected to an e-commerce website, for example, you will probably use some kind of ETL to connect all that together.

For those who don't know about it, the simplest schema I could come up with about ETLs is this: you have a stack of data, here it's foo, bar and baz. You have an ordered list of transformations you want to apply on each line of data. Extract, here, can transform foo into something else; when it has finished, the result goes into transform, and while transform is taking care of the result of extracting foo, extract can start handling bar, et cetera. As the lines are completely independent, you can handle each transformation in parallel, step by step.

In the real world it usually looks like that. There are databases, there are mails that are sent, there is logging, maybe not mails, anything. But the general concept is exactly the same; it's just not as linear as it is here.

There are a lot of tools I have seen on the market, mostly Java-based, that look like that. It's usually an IDE first, and probably you can code in it. Here it's Talend Open Studio. Probably you can use code, but mostly it's a graphical interface where you configure everything from dialogs. It's very handy, but when you're a programmer you feel very limited very fast. There is CloverETL, which I never used, but it looks exactly the same, just a bit different. This one is Pentaho Data Integration, also called Kettle. I used this one a lot, but it's the same concept. So in this world we have mostly GUI first, eventually code, and mostly Java-based.

In the Python world there are a few libraries; I'm not at all exhaustive here. Bubbles, I think, is now marked as unmaintained. petl is much more of a fluent interface. There are a lot more. Some people at the last conference told me about mETL.
I think there is also a Python ETL, but none of these, according to my analysis, were doing the same thing as the Java tools, which is simply connecting independent boxes together using your data flow. So I started to create Bonobo. In fact, I started by creating another library, which was for Python 2.7, but it was just badly written. So the best thing to do, in my opinion, was to start again and completely drop Python 2 support. I'll explain more later.

There are also related tools that you must know, I guess: joblib, Dask, pandas. You may know at least some of them, maybe pandas. They are amazing tools, but ETL is not really their main focus. For example, pandas is really good at transforming one data set into another data set, and I'm using it maybe every day, but when I want to do more engineering on data, like taking one item at a time and transforming it in step A, step B, step C, step D, that's not really its topic at all.

There are other scales of tools to transform data. Real quick: you may know IFTTT or Zapier, which are cloud-based software-as-a-service tools to do small automations. Obviously these won't run on your laptop. And there are huge data tools like Spark and Hadoop; I just listed a few here. But either you need a big infrastructure, or at least a decent infrastructure, to start doing things, or you're using a cloud-based thing and you have the same problem: how do you work without the cloud, and how do you avoid locking yourself into one vendor? As said in the description of the talk, there is no big data in this room.

But yeah, I want to tell you a bit of a story about how I came to discover ETLs while I was co-founding a company. When we started, we had a few different partners. We were doing a marketplace about retail.
It was clothes, mostly for women, and we needed to work with different partners to integrate the stocks, catalogs, colors, pictures, et cetera on our marketplace, which was multi-brand. The first partner went very well. We just coded it in Kettle, Pentaho. Well, we moved things around again, things everywhere, and after a while it was working. So we were really happy.

But then we got a few more deals, and the best idea you get to integrate the other partners is to copy-paste the code, right, for the second and third partner, because it's about the same, just a bit different. Of course, this is not a good engineering practice, but when you're used to subclassing things and instantiating things with different parameters, and now you have a GUI, you're just lost, because you can't really do that. So you don't know how not to repeat yourself. And of course the time comes when you need to fix a bug and you just go crazy. And maybe it's not a bug, it's a new feature, because you didn't support colors until now, and now you have a different model of colors and you need to update everything.

So really, what I needed was something cheap that I could install on my laptop, use on servers too, and that uses code as configuration, preferably Python code. That last one is not for any good reason except the fact that I prefer coding in Python to anything else. But mostly I needed something that uses code as configuration to do ETL, just like Pentaho was doing ETL.

And yeah, that's Bonobo. It's a framework to write ETL jobs in Python using code, and eventually, someday, some kind of GUI may come to visualize things. But first it's code, so you can write classes, you can subclass things and adapt them, just like when you're coding for the web or any other engineering. I'll go very fast on that, but it's very different from all the existing Python tools that I know of. Maybe there are tools
I don't know, so I'd be very happy if you told me about them.

And Hacker News told me: "Bit stupid, a bonobo is not a monkey, it's an ape." The French language apparently doesn't make the difference, so I did not even know that those were two different words.

So, let's see. Just one second. I will try to show you first how to bootstrap a project. Then I will show you all the different concepts that I used without explaining them before. Then I will go back to the demo and examples to apply those concepts to different demos.

So the basics: it's pip install, and you have a generator using cookiecutter, which is just bonobo init something. And you can run something with bonobo run. So, how do I switch to a terminal? Not like this, obviously.

Okay, so I already ran the init, but because I'm sure you don't believe me, or maybe you believe me but I will show it anyway: you can run bonobo init foo, for example, and it will just create a foo directory with a main.py file. And if I diff main.py with foo/main.py, yeah, there are a few differences, but mostly end of lines.
So it's the same file. I removed foo because I have other files that I will use after that. And then I can bonobo run main.py, for example. That's the default transformation that is bundled with the generator. Nothing really fancy: it just generates numbers and keeps only the odd numbers. I can also run on a directory, because the main.py file is considered the entry point, so running bonobo run . will do exactly the same. Okay, that's not really interesting, but that's really the basics.

So now I want to show you what was actually run. I won't show you the 1 to 42, or 0 to 41, but I wrote a simple one here, which is basically the definition of three different functions: one yielding EuroPython 2017, one just applying title to a string given as input, and one just printing the thing. Once I have defined all that, I can create a graph instance. Here it's a linear graph, which is the default thing we can do, but there is also an API method to add other chains, forking from some point in the graph. I will show it in an example. And because there is a graph instance here, I can use bonobo run on this file and it will just add some plugins for the display and run it.

So here I should have the first.py file. Maybe it's a bit big, I don't know. I have the first.py file, which is not at all what I wanted to show. Okay, I need to check out example one. Yes, it's this one. I replaced the code in main.py, so it's exactly what I showed, and I can bonobo run main.py. I see the output of the load, which prints the thing, so EuroPython 2017, which is titled, and I see some statistics.
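For readers following along without the slides, the three-function example described above can be sketched in plain Python. This is only an illustration under stated assumptions: `run_chain` is a hypothetical, sequential stand-in for a linear graph and is not part of Bonobo; in Bonobo itself the three functions would be handed to a graph instance and executed with `bonobo run`.

```python
def extract():
    """Source node: takes no input and yields the raw rows."""
    yield "europython 2017"


def transform(text):
    """Apply title case to one input string."""
    return text.title()


def load(text):
    """Sink node: just print the row."""
    print(text)


def run_chain(source, *steps):
    """Hypothetical stand-in for a linear graph (not Bonobo's API).

    Feeds every row produced by `source` through each step in order and
    returns what came out of the last step.
    """
    results = []
    for row in source():
        for step in steps:
            row = step(row)
        results.append(row)
    return results


if __name__ == "__main__":
    run_chain(extract, transform, load)
```

Unlike the real executor, this sketch runs the steps one after another in a single thread; it only shows how a row travels along the chain.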
It's very fast here, so it's already grey, but you would see the statistics move while it's running on a longer transformation. Okay, that was this one.

The second one is a bit more complete. I made a europython.txt file that I extracted from the EuroPython Society website, the organization behind EuroPython, about all the EuroPython conferences. It's two or three lines each time that say where it was and maybe how many attendees there were. Sometimes we don't have the information about the number of attendees. We have the dates every time, but not really formatted the same way.

So I took this data and said, okay, I will extract all the paragraphs about each conference and yield them to the next step. The detail of this code is not really important. Then I transform that using a few regexes to find things like the location and the number of attendees, if it's there, and create a dictionary from that. And then I use a little helper function called arg0_to_kwargs that changes the formatting of the input and output, but that's not really important yet.

So here I create a graph the same way, and I'm using a built-in, PrettyPrinter, which is a bit better than print, to print. And if I run that (yes, I need to actually run it), I see the name, venue, dates and attendees, the attendees only if available. For example, it's not yet available for EuroPython 2017.

There are a few changes I can make to this transformation to make it a bit more useful, but just before that I want to explain what's happening under the hood. And maybe full screen is better.

So what's happening here? We created a graph instance, and the graph is really a list of edges and nodes. Nothing fancy.
It can be represented graphically like this, but it's just two lists, in fact. And to prove that, I removed all the code that is in use but not really useful to show, and yes, the graph definition is that. On the first call you have a shortcut to add_chain that adds the first chain you pass to the constructor of the graph. But then you can add_chain any time you want, and you can specify a different input, because you don't want every first node to be at the beginning of the graph, but maybe to fork an existing chain.

Once you have defined the graph, you either run it using the bonobo.run method, or you can run it using the CLI, like we did before in a shell. What happens is that it takes the graph we defined before, and there is an executor strategy that adds a few things. It adds a global context, a context for each node and a thread around each node. It creates FIFO queues, which are thread-safe queues from Python's standard library (not built-ins), that are used to buffer input and output between the different nodes.

The context here is only used to keep the transformations contextless and stateless, because if we need to keep, for example, statistics, or maybe an instance or a session of something we need during the execution, we don't want to modify the objects you provided to the graph.

So it looks like that now: the global context, and a context for each node, just as I said before. The strategy relies on the ThreadPoolExecutor of Python, in concurrent.futures, I think. And we just create a runner, which is something that, every time it gets something in the input queue, will run the node and push the output to the next queue. Otherwise it does nothing, and when it's finished it shuts down. That's implementation details.
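The mechanism just described (one thread per node, thread-safe FIFO queues between nodes, a runner loop per node) can be sketched with the standard library alone. This is not Bonobo's actual executor: the `END` sentinel, the `runner` and `execute` names and the shutdown logic are assumptions made for the illustration, and error handling is left out entirely.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

END = object()  # sentinel marking the end of the stream (an assumption of this sketch)


def runner(node, inbox, outbox):
    """Run `node` on every item read from `inbox`, pushing results to `outbox`."""
    while True:
        item = inbox.get()
        if item is END:
            outbox.put(END)  # propagate shutdown to the next node
            return
        outbox.put(node(item))


def execute(source, *nodes):
    """Wire source -> nodes with FIFO queues, one worker thread per node."""
    queues = [queue.Queue() for _ in range(len(nodes) + 1)]
    # One worker per node, so every runner can wait on its queue concurrently.
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        for node, inbox, outbox in zip(nodes, queues, queues[1:]):
            pool.submit(runner, node, inbox, outbox)
        for item in source:
            queues[0].put(item)
        queues[0].put(END)
        # Drain the last queue until the sentinel arrives.
        results = []
        while True:
            item = queues[-1].get()
            if item is END:
                break
            results.append(item)
    return results
```

Because each node has exactly one thread and the queues are first in, first out, rows leave the chain in the order they entered it, while different nodes work on different rows at the same time.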
You definitely don't need to know that to use Bonobo; it's just to show that what happens under the hood is not that complex.

What you can use as transformations in Bonobo is various things. You can use functions, like we did before, mostly if for each line of input you have one line of output. You can use generators if for each line of input you have zero, one or more output lines. For example, it's very useful to implement joins, Cartesian products, or even something that either yields or not. You can use iterators, which are not really callable, but it's handy to say: okay, I can have a transformation that has no input. That's why it's called extract here: it has no input and yields a bunch of outputs. And of course you can use everything that is callable in Python. I just try to call it; if it's callable then, yeah, it's probably a duck.

So that's the handmade way to do it. You can implement the __call__ dunder method and it will work, but there is a handier way to do that.
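The kinds of callables listed above can be illustrated in plain Python. The node names and the `outputs_of` helper below are hypothetical and only show how a runner could treat a function, a generator and a callable object uniformly; they are not taken from Bonobo.

```python
import inspect


def double(x):
    """Plain function: exactly one output per input."""
    return x * 2


def only_odd(x):
    """Generator used as a filter: zero or one output per input."""
    if x % 2:
        yield x


def repeat_twice(x):
    """Generator: several outputs per input."""
    yield x
    yield x


class AddSuffix:
    """Any object defining __call__ works too; its attributes act as configuration."""

    def __init__(self, suffix):
        self.suffix = suffix

    def __call__(self, x):
        return f"{x}{self.suffix}"


def outputs_of(node, item):
    """Hypothetical helper: normalize whatever a node produces to a list of rows."""
    result = node(item)
    if inspect.isgenerator(result):
        return list(result)
    return [result]
```

The "duck" remark corresponds to `outputs_of` never checking the node's type: it simply calls it and looks at what comes back.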
There is bonobo.config and its Configurable class, which allows you to use a few descriptors to specify what kind of options and dependencies you will have in your transformation. Of course, for a simple transformation you won't use that, but if you need to configure the transformation, it's probably easier to use it.

So here we define an option called table_name that we will use to query a database. It has a default, but you can override it, as we'll see after. We define a service that we call database, which defaults to database.default. It's a symbolic name that will point to something we will also see later. Whenever you instantiate this QueryDatabase class you can override the different values, or not, and there can be validation, but that's a detail.

Much more interesting, there are services. A service, like the database service we declared here, is basically saying: okay, my transformation will rely on the database, but I don't want to tell you yet which implementation I will use. I just say it's probably called database.default, so at run time, provide me something called database.default and I'll try to use it like I thought it would work. And at run time you can provide, via a get_services function, a simple dictionary that provides the implementations. That allows you, for example, to provide a different dictionary for tests, so you will be able to provide a test database implementation or a mock implementation instead of the real one, to test the transformation without testing the external dependencies, PostgreSQL, et cetera.

There are bananas in Bonobo. Not a lot for now. It's kind of a standard library. It allows you to read files and write files; we will use that just after. Nothing fancy here. There are a few tools to work with the life cycle or to debug things; we used PrettyPrinter before. And there are a few extensions and plugins, or plugins and extensions, in the order of presentation. Oh, this projector is really good quality, because we can see the thing.
It was not the case last time. The console plugin shows in real time the input and output of each transformation. I'm apparently not able to draw an ASCII art tree in the console, to show the graph like git log would do. If some of you know how to do that, I would be really interested, because it would be the same feature but nicer. And of course I am using Python logging to do that. There is a Jupyter plugin; I should have time to show you that after.

Everything that relies on bigger libraries or big dependencies is bundled as extensions. There is, for example, the SQLAlchemy extension that I'm starting to use, which allows you to work with SQL databases. There is a Docker extension that adds a runc command to the bonobo CLI, which tries to do the exact same thing as bonobo run, but within a container. There is a separate repository called bonobo-devkit that allows you to work on different forks of the project at the same time. It's probably more something for me, or for anyone who wants to contribute, but it's useful too.

And yes, we have time for more examples, so I will show you a lot of things. I'll start again from the demo I did before and show you how to use a service instead of directly opening a file, how to write the result to a CSV and how to write to JSON.

So what did we have? It was this one. Okay, so we were reading europython.txt, transforming, et cetera. What I want, instead of opening this file directly, is to use something that I will be able to switch from the local file system to S3, et cetera. There is a very good library we are relying on in Bonobo, which is called fs, PyFilesystem2, that does exactly that. Bonobo depends on it, so it will be installed. So I can do something like @requires('fs'), for example. I will import requires. Yeah, so that's Python.
Okay. So it will be provided as a parameter by the decorator just above, and I can use fs like this, doing fs.open instead of just opening the file. So I need an fs service. Maybe it's already defined by default; it's the only one that is defined by default. But in fact all this is not really needed, so we'll just say, okay, fs is bonobo.open_fs, which by default will use the current directory as the root of the file system object. So bonobo run first.py should do the exact same thing it did before, but we are no longer directly opening the file.

The next step was writing to a CSV. That's a good occasion for me to show you how we fork the graph and make something that is not just linear. I will use the add_chain method to add a bonobo.CsvWriter with a file name, which will also use the file system service, the same service. So we'll write to europython.csv. I don't want it to be a new chain that just takes an empty input at the beginning of the transformation; I need to say: okay, the input of this chain, the node before the first node of this chain, is arg0_to_kwargs. About arg0_to_kwargs:
In fact, the previous transformation, the transform function, was returning one argument containing a dictionary, and arg0_to_kwargs just transforms this first argument, which is a dictionary, into keyword arguments. By default (it can be overridden) all writers take keyword arguments as input, so that's also why it's here, and it works better with PrettyPrinter too.

So that should work. It's here that I would really like the ASCII art tree. But there is a CsvWriter that runs at the same level as PrettyPrinter. It didn't take the output of PrettyPrinter, because PrettyPrinter doesn't have any output. It took the output of arg0_to_kwargs, and this output went to PrettyPrinter and to CsvWriter at the same time.

Because I don't know if I removed the file before the presentation, I will remove it and run it again, so you're sure I'm not lying. And if I open it, it should contain all the data, formatted as CSV, with the number of attendees where we have it, like for last year, or without it if we don't know it from the EuroPython Society side.

Okay. The next task I had was to write to JSON, which will be really easy because the syntax is exactly the same. If I wanted to change the formatting a bit, there are advanced options that are obviously not the same for CSV and JSON, but if I just want to write to a file, it's easy. If I can use my keyboard. Okay, so yeah, it's exactly the same, but with europython.json. I don't have any europython.json file here. I run the thing, and now I should have the file containing the same thing, but formatted as JSON.

Okay, so that's very basic. I tried to find another example to show. I looked yesterday, I think, for Rimini open data, and yeah, Rimini has open data, in fact. I don't speak very good Italian, so I understood absolutely nothing of what it was about. But I understand JSON, so I could extract things and just play a bit.
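Before moving on to the Rimini example, the fork used in the CSV and JSON demo can be sketched as a toy model: a graph that is only a list of nodes and a list of edges, where a second chain is attached after an existing node so that both writers receive the same rows. `MiniGraph`, its `after` parameter and `run` are hypothetical names for this illustration, not Bonobo's API, and the keyword-argument conversion done by arg0_to_kwargs is not modeled.

```python
import csv
import io
import json


class MiniGraph:
    """Hypothetical miniature graph: a list of nodes and a list of (source, target) edges."""

    def __init__(self, *chain):
        self.nodes, self.edges = [], []
        self.add_chain(*chain)

    def add_chain(self, *chain, after=None):
        """Append a chain; `after` is the existing node the new chain forks from."""
        previous = after
        for node in chain:
            self.nodes.append(node)
            if previous is not None:
                self.edges.append((previous, node))
            previous = node
        return self

    def successors(self, node):
        return [target for source, target in self.edges if source == node]


def run(graph, node, row):
    """Push one row through `node`, then hand its output to every successor."""
    output = node(row)
    for successor in graph.successors(node):
        run(graph, successor, output)


csv_buffer, json_buffer = io.StringIO(), io.StringIO()  # in-memory stand-ins for files


def parse(line):
    name, venue = line.split(";")
    return {"name": name, "venue": venue}


def write_csv(row):
    csv.writer(csv_buffer).writerow([row["name"], row["venue"]])


def write_json(row):
    json_buffer.write(json.dumps(row, sort_keys=True) + "\n")


graph = MiniGraph(parse, write_csv)        # linear chain: parse -> write_csv
graph.add_chain(write_json, after=parse)   # fork: parse -> write_json as well

for line in ["EuroPython 2017;Rimini", "EuroPython 2016;Bilbao"]:
    run(graph, parse, line)
```

Both writers hang off `parse`, so neither depends on the other's output, which mirrors the point made above about the CSV writer not taking its input from the printer.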
I think I need to git checkout something, which is example three. And Git is not really happy, so I will force the checkout. Yeah, okay. Don't look at my commit messages.

So I have a rimini.py, which is a bit similar to what we did before, but here we require a service that we call http. If I open the services.py file, I see that I just defined that http is requests. I could have used anything else, but that means I will probably rely on anything that works like requests. So I use this http service to do an HTTP GET on a new URL as long as I have a next URL, because each batch of 100 results says, if there is a next page, okay, the next URL is that. Because the web service is not very good, I need to replace /node with /node.json, because otherwise it returns HTML. And I iterate until there is no next URL. Then I still have arg0_to_kwargs, and I just write a JSON file from that.

It will take a bit longer, because obviously I can't do it in parallel: I need the result of the first request to know the next URL before I can do anything. So I should have run it before I started talking. But yeah, 100 by 100, there is an extraction from HTTP. Then arg0_to_kwargs is nearly instant, and the JSON writer is barely slower. It's a bit different to use the JSON writer than to just JSON-encode the whole thing, because it encodes each line independently to avoid having to buffer everything. So really, what it needs is only one line at a time, and it writes just a few bytes every time into the file.

So I guess it's about the elections in Rimini. Well, we don't see a lot like this, but we just aggregated all the pages from this REST API. Not REST, just API. There are things I don't understand.
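The extraction loop described above (follow the next URL until there is none, then write one record at a time) can be sketched as follows. Everything concrete here is an assumption for the illustration: the URLs, the `items` and `next` keys and the in-memory `fake_get` stand in for the real web service and for the injected http service, and the one-record-per-line output is only one possible way to avoid buffering, not necessarily the file layout Bonobo's JSON writer produces.

```python
import io
import json

# Fake paginated API standing in for the real web service (made-up URLs and data).
FAKE_PAGES = {
    "/node.json?page=0": {"items": [{"id": 1}, {"id": 2}], "next": "/node?page=1"},
    "/node.json?page=1": {"items": [{"id": 3}], "next": None},
}


def fake_get(url):
    """Stand-in for an HTTP client's GET; returns the decoded JSON body."""
    return FAKE_PAGES[url]


def extract(get, url):
    """Yield items page by page, following `next` until there is none.

    The client is passed in as `get`, so a test can inject a fake one.
    """
    while url:
        # The service links to the HTML view, so ask for the JSON variant instead.
        page = get(url.replace("/node?", "/node.json?"))
        yield from page["items"]
        url = page["next"]


def write_json_lines(rows, stream):
    """Encode each row independently so nothing has to be buffered."""
    count = 0
    for row in rows:
        stream.write(json.dumps(row) + "\n")
        count += 1
    return count
```

The loop is inherently sequential, as noted in the talk: the next URL is only known once the current page has been fetched.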
There are users, maybe the person that created the item on the website, I don't know. And there are the different districts, so I guess that's related to where you have to go to vote when there is an election, but that's really a wild guess. Okay, that's Rimini.

But for something that we all understand, and which is in English, I will show how to extract all the public EuroPython attendees, in a notebook. Of course, it's doable in other ways, but it's just for the example. So I will use a Jupyter notebook. I guess everybody here knows about Jupyter notebooks. So here I have an attendees.json, which should not exist. And I really like restart and clear output, just to be sure. Mostly to clear the output, in fact.

I'm using Selenium here, which is basically something to control a browser, if you don't know about it. It's not new, it's a very old library, and I have a few wrappers, but really they don't contain much code. So I say, okay, there is a who's coming page on the EuroPython website. I need to implement a browser service, which uses bonobo-selenium to create the browser (you could just create a Selenium browser directly), and I open a file system.

Okay, I'm a bit short on time, so I will just execute the graph-building step. I have a graph, great. Then I will just use bonobo.run, and there should be a Firefox opening with the who's coming page, which will scroll down as long as the infinite scroll is not done. It goes on as long as it gets elements. When it can't get elements anymore, it tries to bounce top to bottom once, just to check that it's not some JavaScript that doesn't work, or some lag. And if we really didn't get any new data, it exits after a few seconds. I think there are something like 350 public attendees announced on the website. Of course, a lot of people have made it private, and they're right, because there are people like me who would crawl the thing. But I won't do anything with the data.
It's just an example here. And yeah, so probably here it was bouncing. Lucas is probably the last one. Maybe it already bounced. And yeah, the plus sign here just changed to a minus, to say that this transformation has finished. The other one is instant. So we should have a JSON file with all the attendees now. And yeah, we know if it's a speaker, we know the tagline if you put one. Here there is no tagline. Here there is a tagline, et cetera.

So, with the very few minutes left, I will skip the siren example and go to the conclusion.

So yes, Bonobo is a very young library. Six months old is not a lot, definitely. I'm trying to work as hard as I can, but I'm not superhuman, so it's not enough, of course. But I'm really excited by this, because it's something I'm using all the time, for everything, in fact. I'd really like to get to 1.0 either by the end of the year or next year, and 1.0 mostly means, for me, a stable API you can rely on, that is fully documented, fully tested, et cetera. It's already fully tested and a bit documented, but I need much more.

Python 3.5+ is a personal guerrilla war I started this year. I don't want Python 2 anymore, so I'm trying to push as much as I can to only use Python 3.5 and later. And there is some really handy syntax to work with data. I don't know what it's called, but the star-star operator within a dictionary literal, to expand a dictionary instead of updating things in place using .update on dictionaries, is really, really awesome.

We still have a global interpreter lock, of course. But maybe we will overcome this limitation of running on only one core by using different strategies. For now it's the threading strategy by default.
So we have the GIL. Maybe it's not a problem if you are I/O bound, but if you are CPU bound it can be a problem. But yeah, probably with process pool strategies, maybe a dask.distributed strategy, things like this, we can try to limit a bit the trouble the GIL is bringing us.

1.0 will of course stay one hundred percent open source. It's under the Apache license. I want a very light library. Of course, it should do the basic things like CSV, et cetera, and most file formats and tools should be included. But everything that brings dependencies and complex things should either be implemented by the user or go to extensions. It's small scale. The goal is one minute to install, easy to deploy. Once again, it's not big data, not statistics, not analytics. If you want to do blockchain with that, you're probably not in the right conference. It's basically lean manufacturing for data. It's like a production chain where I have little packets of data and, one at a time, I'm adding something, checking something, modifying something, et cetera.

I'll skip that, but the Internet is completely crazy: Hacker News is more concerned about me knowing the actual taxonomy of monkeys, apes and primates. I really like the last one, which says that Python not only has duck typing, it also has the little-known primate typing feature. And yeah, this one saved my life. Not really, but it was really funny.

I'd really like it to become data processing for humans. Of course, there is a lot more to do. You can read more on the website, bonobo-project.org, or in the documentation; you will find a link on the website. There is a Slack channel where you can come and discuss, it's really open. And there is a GitHub repository.
You will also find the link on the website.

One more thing before I finish: I will try to organize a sprint. Whether you want to come to it or not is not a problem, but you should really consider going to some sprint, whichever it is. It's really amazing at EuroPython to just code on a project with its developers. Last year I did the pytest one, which was a really great way to learn. So come, of course, and code on Bonobo, but even if you don't, go to the other sprints; it's a really, really great thing.

And just before we take a few minutes for questions, it would be really great if you could give me a really fast, few lines of feedback on what you think. I really need raw feedback on this. I have a little form, and that would be really great for me. Thank you very much. And if we still have a bit of time, let's try to answer questions, if you have some.

Thank you very much. So, big applause. Very interesting, and I think there are many questions. At least I have many questions, but I will give the floor to you first. It's always good to ask questions from the back of the room, because it's good for the health of the organizer.

Hello, thank you very much for your great talk and for working on this project. There's one question that came to my mind. When I work with ETL, something can go wrong, especially with URLs on the web and stuff. How would I deal with that in this framework?

Okay, so the question is about error handling. For now, what's happening in the framework is that on each line it calls the function of the node one time. So there are two possibilities. There are errors that I call unrecoverable; it's not really easy to find a short word. Unrecoverable errors just stop the graph and raise the exception, so I can't run the graph at all.
So you, the developer, should fix that. Then there are also recoverable errors, which are errors that happen only on one line of data, or a few lines of data. There is a default error handler, which you can override, that just uses the console to show the errors and skips to the next line of data. If you want to handle it differently, you can override this handle_error thing and do whatever you want. Probably there will be things like Sentry plugins or things like this, but for now it's really not a priority, and it should be just a few lines to implement.

Another question? Okay, then maybe I have a question. Do you know Kafka Streams? It somehow reminds me of that. It's just that Kafka Streams are meant to be distributed on a cluster, running big data on Kafka queues, but you also have queues distributing the tasks to threads, I think, so more or less it looks like the same architecture.

Yeah, so I'm not at all familiar with Kafka, but here we are talking about queue instances, queue.Queue, Python queues within one process. I guess Kafka queues are something else, but maybe I'm wrong, because I don't know it. I guess it's some kind of message bus that is able to pass messages from one server to another, because it's an architecture thought out for big data first. Here, what I really want to solve is the problem of: hey, I need to transform data right now, let's install it and code something. So these are intra-process queues. It's probably a bit similar because, yeah, you have first-in, first-out messages.
So at some point you need a queue, but it's thought out for a different scale of architecture first. Of course, tomorrow, if I can do the same with dask.distributed, for example, there will be the same kind of queues, but not the exact same type of object, which will be able to pass messages from one server to another. But then you have to think about funny problems, like how you optimize the topology of your graph to group nearby transformations on one server, and maybe they don't all cost the same, so how do you balance the number of transformations on each server? It's not really easy, and I'm pretty sure that if you have data at this scale, you can definitely afford to install a big data infrastructure and use either Kafka or Hadoop or PySpark, for example, or things like this.

Thank you. Another question? No one. Okay, so everyone is hungry. Big thanks again.