 talk of this session. It's titled Things I Wish I Knew Before Starting Using Python for Data Processing. So ladies and gentlemen, please welcome our next speaker, Miguel Cabrera.
So welcome to my talk. I remember this room being smaller last year, I don't know. So my talk uses a clickbait title, but I won't show any ads today. So my name is Miguel Cabrera. I'm going to talk to you about some things I learned in the last few years working with Python, things I would have preferred to know before I started using it for data processing. So, a quick introduction. I'm Miguel, I'm from Colombia. I live in Berlin. I work for a company in Munich called TrustYou. We do data processing for hotels. And as I said, I've been doing Python for just a couple of years, so this is more like a beginner-to-beginner talk. However, I think if you're starting with Python, in particular in the data science area, you're going to take some good stuff from this talk. That's my contact information. So, the priors for this talk, what I think you are. You are relatively new to Python. You use Python mostly with the scientific stack, NumPy, SciPy, and so on. Your work, or your desired work, has to do with data or machine learning. You are not necessarily a trained software engineer. So if you have years of software engineering experience, you're probably going to get bored in this talk. So it's your opportunity to walk away. I can give you two minutes now. No, just kidding, but you can leave if you want. So who's who? I want to know. Who wants to be or who is a data scientist? Please raise your hand. OK. Data analysts, data engineers here? Yeah. Machine learning developers? That's good, OK. I hope I'm not going to bore you. Software developers? Yeah. So you've been warned. It's really basic, really high level. So if you already have experience, you might get a bit bored. So "really basic" could be the other title.
So, the agenda. We're going to talk about some basic concepts and practices, in particular object-oriented programming. Then I'm going to talk about some goodies of the collections module. Then something about iterators and iterables. This was like a collection of things I wanted to show. I have many more things prepared, but because of time I had to pick the ones I like the most. These are different things at different levels of abstraction. So be warned that we're going to switch from a really high level to some code, and we're not going to go in depth into any of those. However, I will give you some pointers and some directions on where you can get more information. So this is like a meta talk, so to say. But let's start with a story. This talk, as I said, is based on my experience. But it's also based on the experience of my colleagues and my interaction with them. So let's talk about David. David just graduated from university. He has, say, a math PhD, and he mostly uses R and MATLAB. And he comes to work for a company where he has to do mostly Python. And he starts to write code to classify some documents, text documents, for example. He uses NLTK and scikit-learn, which I assume you are somehow familiar with. The thing is, he writes a really nice IPython, sorry, Jupyter notebook with the code. It has nice graphs, and so on. This is a random image, so don't try to get anything from it. And then my boss tells him, OK, you have to integrate the code into our code base. So he has to go from a Python notebook to a really big project with a lot of dependencies, a lot of scripts and so on. And of course, he is lost. So he tries his best to do that without knowing what's going on. And he ends up writing what some people call spaghetti data science code, which is the same as spaghetti code, but for data science. So if he has to integrate the code, it's going to be really bad for him. Or someone else has to integrate the code.
That person is going to hate him forever, almost forever. So how do we prevent that from happening when we start doing data science and actually integrating data science and machine learning code, say scikit-learn code, into a code base? I think the answer is going back to the basics. And in a nutshell, I think data engineers or scientists have to become more like software developers and meet at the middle point. And how do you do that? Well, first we have to make a distinction between code and software. I take this from a talk by Daniel Moisset at PyData Berlin this year, and I really like how he put it. So code is something that runs on a computer. When you write a script or you write a Python notebook, you're probably writing code. Code might not have tests or follow any convention or have documentation. You just write the code and it does the job. It's OK. Software, on the other hand: some people think it's just the program text inside a deliverable. Some people think it's the whole thing, including all the deployment scripts, testing, documentation; even customer support or technical support are part of the software. And you want to create software that is maintainable, testable, deployable, and all the "-ables" that you can put in. So the question is, what's the way to transform my code into software? So let's go back to the basics. If I'm going to work in Python, what is Python? The important thing here, which I got from the Python documentation, is that Python is an object-oriented programming language. So as a data scientist, you should know what an object is and how to use objects. So this is the first tip I'm going to give you as a Python data scientist or data engineer: learn how to use objects. And for that, I'm going to give you a really quick and dirty introduction to what objects are. So, objects: these are the main concepts.
They have data, called attributes, and they have some operations on that data, called methods. That's it in a nutshell. And how does an object look in Python? Well, before going into that, I just want to make a distinction between cookie cutters and cookies, sorry, classes and objects. So what is a class, and what is the difference with objects? A class is kind of like the template that you use to create objects. In the case of cookies, it's the cookie cutter. If you're from the UK, you're going to call it a biscuit. And you use the cookie cutter to create many, many, many cookies. And you eat them, hopefully, afterwards. Not all of them, because that would be bad for you. So in Python, this is how a class looks. It has a name. That's the template for creating cookies, or the cookie cutter in this case. It has a constructor function, __init__, that is called every time you instantiate it. And it has some data, the attributes, and some methods. Right now, you're already an expert in object orientation in Python. If you want to create a cookie, you instantiate the Cookie class. One of the key concepts in object-oriented programming is that you can subtype. So you can extend one class to make it do something special. In this case, an alfajor is just a type of cookie that is eaten in Spain, and also in South America. And with this example, I extend the Cookie class and I just add additional attributes, for example. So who's familiar with scikit-learn? Just raise your hand if you use Jupyter or the IPython notebook. Yeah, so not so many. But when you're working in IPython or Jupyter and you're calling scikit-learn, for example, you're writing statements. Many of those statements look like this. And what you're doing there is actually creating objects and interacting with objects. So you have to be aware of what you're doing and how you can use that to your advantage.
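The cookie-cutter analogy above can be sketched in a few lines. This is only an illustration, not the code from the slides: the attribute names (`flavor`, `diameter_cm`, `filling`) and the `describe` method are assumptions made up for the example.

```python
class Cookie:
    """The 'cookie cutter': a template for creating cookie objects."""

    def __init__(self, flavor, diameter_cm):
        # __init__ runs every time the class is instantiated.
        self.flavor = flavor            # attribute (data)
        self.diameter_cm = diameter_cm  # attribute (data)

    def describe(self):
        # A method: an operation that works on the object's data.
        return f"{self.flavor} cookie, {self.diameter_cm} cm wide"


class Alfajor(Cookie):
    """A subtype of Cookie that adds one extra attribute (hypothetical)."""

    def __init__(self, flavor, diameter_cm, filling):
        super().__init__(flavor, diameter_cm)
        self.filling = filling

    def describe(self):
        # Reuse the parent behaviour and extend it.
        return f"{super().describe()}, filled with {self.filling}"


# Each call to the class creates a new object (a new cookie).
cookie = Cookie("chocolate", 6)
alfajor = Alfajor("vanilla", 5, "dulce de leche")

print(cookie.describe())
print(alfajor.describe())
print(isinstance(alfajor, Cookie))  # an Alfajor is still a Cookie
```

The same pattern is what happens when you write something like `model = SomeEstimator()` in a notebook: you instantiate a class and then call methods on the resulting object.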
So how do I write good object-oriented code, now that I know how to write code? That's a really tough question. I don't plan to answer it today, but I'm going to give you some tips. There are some basic ideas in the object-oriented world, and actually in the programming world, that you want to know. One is DRY: don't repeat yourself. So if you are writing scripts and you feel you're repeating code, copying the same code from one file to another, you might want to create an object out of it and reuse it. One of the key goals of using object orientation is to reuse things. KISS: always keep it simple. Don't try to put a lot of things inside objects. And also use the SOLID principles. These are really abstract principles. I'm not going to go into details, but basically one class should do only one job, and, well, I'm going to skip the rest because of time, but my recommendation is to check them out and to check the link below. It's really important. It's really nice to know these concepts. So the first thing I think is important if you're going to start doing serious data science with Python is: learn object-oriented programming in Python, and master it. The next thing you have to learn, once you already know how to organize your code in objects, is something called code conventions. Code conventions are like table manners for developers. It's something you want to follow so you don't annoy other people while you're doing it, while you're eating, or while you're programming in this case. When data scientists try to integrate their code, one of the things that annoys people the most is that they have no idea of code conventions, and you always have to send the code back to be fixed, or you fix it yourself. Why code conventions? Well, conventions are important because readability counts, and they are small details, actually. Things like, should I use spaces or should I use tabs?
What are the indentation rules? How do you organize the code in a file? PEP 8 is the de facto standard, so you should learn it. There are some resources online for you to check. A nice, user-friendly way of learning it is pep8.org. This is an example: there's a right and a wrong way to do many things. There are many details in these conventions and you might think, oh, so many things to learn, I just want to code. Well, you can help yourself with your editor. In this case it's Emacs, the one I use, sorry to the vi guys and other editors. You can probably configure yours to help you not only with checking that you're following the convention of your company, or PEP 8 if you're using PEP 8, but also with detecting things that might go wrong. For example, in this case, a variable is never used, and your editor can help you detect such things. There are other topics that I would have loved to go into in more detail, but I don't have the time because I want to show you more cool stuff. Project structure. Testing: data science projects generally don't come with tests. Versioning and branching, namely learning how to use your source control. Code reviews and, in general, the software development lifecycle. There are some books that I recommend you read. They're really general. They're not specific to one language, but if you want to get closer to this side of the data science area, and I'm more of a software developer, those are good books to start with. Also, I was reading the descriptions on the EuroPython website and there are some talks that I think are relevant and probably talk about these issues. If you go to any of them, please tell the guy that I sent you there and he's probably going to buy me a beer for that or something. So let's go into code right now. It's been a really theoretical part and you're probably bored and you want to see some code. So let's do it.
So, the tips and tricks I would have loved to know before starting, in a code sense. In a nutshell, the collections module. It's incredible, when you start using Python from the data science perspective, how little you know about things that are in the standard library, and one of those things is the collections module. And let's start with a basic thing: counting. Counting is kind of the basic building block for many statistical algorithms. From basic naive Bayes to word2vec, they are based on counting. However, I don't think data scientists know how to count properly in Python. Let's see. How do you count in Python? First attempt: you use dictionaries. Who has written such code to count stuff? Oh, you apparently know your stuff. Let's see. So the actually more Pythonic way is this one, where you don't ask for permission but for forgiveness. I think it's something like that. Who has written something like this? Who can do better? Let's use collections.defaultdict. Who's familiar with defaultdict? So, yeah, that's good, some of you. It's basically the same. You use a defaultdict, to which you pass a default value or rather a default factory function; in this case it's int, so by default the value will be zero and I don't have to do any check. But let's use Counter. Who's familiar with Counter? Even fewer of you. So Counter is really cool. It's just like a defaultdict that is already prepared for counting, and it's for free. And that's how you use it. I just pass the list of items and that's all there is to it, and I just get the counts. It also comes with some extra goodies: you can get the most common values and do some set operations on them. And I found that pretty cool.
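The counting progression described above, from a plain dictionary to `Counter`, can be sketched as follows. The word list is made-up sample data and the variable names are only illustrative; this is not the code from the slides.

```python
from collections import Counter, defaultdict

words = ["spam", "eggs", "spam", "ham", "spam", "eggs"]

# 1. Plain dict, asking for permission first ("look before you leap").
counts_lbyl = {}
for word in words:
    if word not in counts_lbyl:
        counts_lbyl[word] = 0
    counts_lbyl[word] += 1

# 2. Plain dict, asking for forgiveness instead of permission.
counts_eafp = {}
for word in words:
    try:
        counts_eafp[word] += 1
    except KeyError:
        counts_eafp[word] = 1

# 3. defaultdict(int): a missing key is created with int() == 0.
counts_dd = defaultdict(int)
for word in words:
    counts_dd[word] += 1

# 4. Counter: pass the items and the counting is done for you.
counts = Counter(words)

# Extra goodies: most common items and multiset operations.
print(counts.most_common(2))
other = Counter(["spam", "ham", "ham"])
print(counts + other)  # adds the counts per key
print(counts & other)  # intersection: minimum count per key
```

All four approaches produce the same counts; the difference is how much bookkeeping you write yourself.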
But remember, Counter is a class, and I just mentioned that you can take classes, extend them, and add your own behavior. For example, say you want to calculate the probability of some items. I can extend the Counter class, add a normalize function, and I already have the probability mass function for that. Easy peasy. If I want to override the initializer to call normalize as soon as I have all the items in the counter, I can do that too. So when you're counting things in Python, as you do in statistics, you can use things like pandas or scikit-learn, but sometimes, when you're building the features, you should totally check out the Counter class. And there's a really nice article by Trey Hunner about how counting has developed historically in Python, and it's a really good read. Named tuples. Named tuples are a thing that I discovered recently, and they're kind of hidden, more or less. Who's familiar with named tuples? Well, some of you, yeah. So the thing is that when you're writing code, people use a lot of dictionaries, lists, tuples. And when you start integrating that into a large code base, you see that code, you see it's a dictionary, and you don't know what to expect. So it makes the code hard to read. In this example, if I remove this, you have no idea what I'm talking about, what pt is in this case. You might guess from the context, maybe. So just by using named tuples, you can make the code clearer. Named tuples are basically sort of like a class generator in one line, with the particularity that the attributes are read-only. So they are basically a nice struct in Python; if you're familiar with C, that's more or less the equivalent. And so you can create classes on the fly. They have cool methods also. If you really need to use a dict, you can transform one into a dictionary, and you can create one out of an iterable.
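The two ideas from this part of the talk, extending `Counter` with a normalize step and replacing anonymous tuples with named tuples, can be sketched together. The names `Pmf`, `normalize` and `Point` are assumptions chosen for illustration, not the names used on the slides.

```python
from collections import Counter, namedtuple


class Pmf(Counter):
    """A Counter with a normalize step (hypothetical class name)."""

    def normalize(self):
        # Divide every count by the total so the values sum to 1.
        total = sum(self.values())
        for key in self:
            self[key] = self[key] / total
        return self


pmf = Pmf("aabbbc").normalize()
print(pmf["b"])  # relative frequency of "b"

# A named tuple is a small, read-only record type created in one line.
Point = namedtuple("Point", ["x", "y"])
pt = Point(x=1.0, y=5.0)

print(pt.x + pt.y)              # access by name instead of pt[0] + pt[1]
print(pt._asdict())             # convert to a dictionary when needed
print(Point._make([3.0, 4.0]))  # build one from any iterable

try:
    pt.x = 2.0
except AttributeError:
    print("named tuple attributes are read-only")
```

Because `Pmf` is still a `Counter`, everything else (`most_common`, item access, iteration) keeps working on the normalized values.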
And I think it's a nice way, when you're writing code, to organize it and to create sort of domain classes that represent things in your code and your ontology. In this case, we work a lot with hotels. So I created a hotel base named tuple, and I actually inherit from it and add a method to calculate something. And I pass instances of this class around my code and that makes it, in my opinion, more readable. So let's go to the nerdier part. And this is really interesting because it's really confusing. For me it was. And actually I remember that during my first interview for the company I'm working for right now, I was asked something about iterators and iterables, and I think I answered correctly, but it was out of luck. Later I discovered, oh, I did it right, but I didn't know why. So let's talk about this. When you see code like this, you're probably familiar with it: how to iterate through a list. But what is happening underneath? Why can you do this? And why does it work? And how can you write your own classes that have the same behavior? I was confused and I was looking for answers: what's the difference between iterators and iterables? Is it the list, the dictionary, and so on? And I found this nice article by Vincent Driessen, and I use this graph from him. You should totally check it out, and we're going to start exploring the concepts using this graph. So two concepts, maybe abstract for you right now: iterable and iterator. An iterable is something that you can call the iter method on, and it will return an iterator. And an iterator is something that produces a value when you call next on it. Abstract, okay, let's go into more detail. So, a container, for example. A container can be a list, a dictionary, and a tuple also.
A container is something for which you can check whether something is inside it. That's more or less where the name comes from. In this case, I check that a number is in the list. In this case, it's a set. And a container is typically an iterable, so you can go through all the elements one by one. In this case, this is a list and I call the iter method, which gives me x and y, which are iterators. If I look at the types of both, one is a list and the other is a list iterator. And I can call next on those iterators and lazily obtain the items from the list. Now, when you write this code, underneath, at the bytecode level, that's what's happening. Python gets the iterator from the list and starts getting the values. So this is more like syntactic sugar in some way. So in a nutshell, an iterable is any object that can return an iterator. That includes containers like lists, dictionaries, files. If you want an object to behave like that, you have to implement __iter__, the dunder method. Some of those things might not be finite. They can just generate values forever. We're going to see an example of that. There's a module in Python called itertools that has a lot of functionality for working with iterables and iterators in general. So how do I implement my own iterator? For kind of pragmatic reasons, you can implement both the iterable and the iterator in one class. So you have an __iter__ method that returns self, and then you implement the next method. In Python 3, it's a dunder method, __next__. In this case, it's just code that reads a file and then iterates over it in reverse order, so it starts from the last line and goes up. When there are no more lines to return, it will raise StopIteration, which is the exception that is raised to stop the for loop. You can use it easily. You instantiate it and then you do the same as if it were a list. So now we've covered the green part of this graph.
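A minimal sketch of such a class, assuming Python 3 and using an in-memory `io.StringIO` as a stand-in for a real file so the example runs on its own. The class name `ReverseLines` is hypothetical, and reading all lines up front is a simplification for illustration.

```python
import io


class ReverseLines:
    """Iterable and iterator in one class: returns lines last to first."""

    def __init__(self, file_obj):
        # Simplification: read everything up front (not suited to huge files).
        self._lines = file_obj.read().splitlines()
        self._index = len(self._lines)

    def __iter__(self):
        # The object is its own iterator.
        return self

    def __next__(self):
        if self._index == 0:
            # Signals the for loop that there is nothing left.
            raise StopIteration
        self._index -= 1
        return self._lines[self._index]


source = io.StringIO("first\nsecond\nthird\n")
for line in ReverseLines(source):
    print(line)

# Roughly what a for loop does under the hood:
iterator = iter(["a", "b"])
while True:
    try:
        item = next(iterator)
    except StopIteration:
        break
    print(item)
```

One consequence of returning `self` from `__iter__` is that an instance can only be iterated once; after it is exhausted you need a new instance.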
So we know that we can get iterables from things like lists and dictionaries and files. And from them we can get iterators that will produce values in a lazy way. But there's another way to get an iterator, and it's by using generators. Who knows what a generator is? Fewer of you, okay. So, I hope you'll get them. Let's start. You can get a generator from a generator expression or from a generator function. Both, as I say, are generators. So, generator expressions; let's start with a non-generator thing. It's basically a list comprehension: I create a list of 10 numbers and then I create a list of the squares of those numbers. If I check the type, it's a list. These are only 10, but what if this is a billion numbers? Probably I won't have enough memory to store them in my RAM or on my disk. You can do the same with generator expressions. And this is not a tuple, although it looks like one; it creates a generator object that will produce the squares, in this case from the list numbers, in a lazy way. So each time I call next, it will calculate the square and return it. Think about it as a factory of items, and the factory uses, in this case, the square function, multiplying x by itself. So if I want to do the same as I do with a list, I can. I generate the squares and the lazy squares, I can print the items, and they will only be generated when the for loop internally calls the next function. Before that, those numbers don't exist. A generator function is the same idea, but it uses a magical word, yield, that works in a nice way. When you create this and call the function fib, you will also obtain a generator. Then, when you call next, what will happen is that the code is going to be executed. Then this yield will return the value back to the program, and execution will continue only after the next next is called.
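The difference between a list comprehension, a generator expression and a generator function can be sketched as follows. The function name `fib` matches the one mentioned in the talk, but its body here is an assumption (starting the sequence at 0, 1), not the code from the slides.

```python
numbers = list(range(10))

squares = [x * x for x in numbers]       # list: all values are stored now
lazy_squares = (x * x for x in numbers)  # generator: nothing computed yet

print(type(squares).__name__)       # list
print(type(lazy_squares).__name__)  # generator

# Each call to next() computes exactly one value.
print(next(lazy_squares))
print(next(lazy_squares))

# A for loop calls next() internally until the generator is exhausted.
for value in lazy_squares:
    print(value)


def fib():
    """Generator function: execution pauses at each yield."""
    a, b = 0, 1
    while True:  # infinite: never use list(fib()) directly
        yield a
        a, b = b, a + b


gen = fib()  # calling the function only creates the generator
first_three = [next(gen), next(gen), next(gen)]
print(first_three)
```

Calling `fib()` runs none of the function body; the code only executes, up to the next `yield`, each time `next` is called on the generator.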
In this case, I'm calculating the Fibonacci sequence, which I'm sure you're familiar with, and I can just call next, and next, and next. Something to be aware of here: this is an infinite generator. You see the while True there; it won't stop. If I put this into a for loop, it will go on forever. It will keep generating numbers of the sequence. I can use some of the functions of itertools to obtain just a subset of that. And in this case, I just get the first three in a for loop. You can also implement your iterators or iterables using the yield keyword, namely by changing the __iter__ function: instead of returning self, you just make it a generator function. In this case, I'm reading a file, for example, from HDFS, a distributed file system; imagine just one server located somewhere. And imagine I pass a source that has an open method and I start iterating through it. This open method might even return an iterable or a generator, for example. I do something with the line and then I pass it back. And I can just use that in a for loop, as I would with a list, and process the lines. So that's more or less iterables and iterators. So you might think, hey, this is supposed to be related to data science and data processing and so on. What is the relation? Well, sometimes you cannot load all your data into memory. And if you're working in the big data field, that's probably your situation. As I said, you might not have enough memory to store all the data you want. That will happen when you use a list. In such cases you can work by using data streaming. You get data streaming through lazy evaluation, which is what I just showed: generating or processing things as they become available or needed. And you can create sort of in-memory data processing pipelines using iterables by chaining them.
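A minimal sketch of such a chained pipeline, where each stage is an iterable whose `__iter__` is a generator function. Everything here is hypothetical and made up for illustration: `FakeSource` stands in for a remote file (the talk mentions HDFS, which is not used here), and the stage names `LineReader` and `CommentFilter` and the choice to drop lines starting with `#` are assumptions.

```python
from itertools import islice


class FakeSource:
    """Stand-in for a remote file: open() returns an iterable of lines."""

    def __init__(self, text):
        self.text = text

    def open(self):
        return iter(self.text.splitlines(keepends=True))


class LineReader:
    """First stage: reads lines from a source and strips whitespace."""

    def __init__(self, source):
        self.source = source  # any object with an open() method

    def __iter__(self):
        # Because of yield, __iter__ returns a fresh generator each time.
        for line in self.source.open():
            yield line.strip()


class CommentFilter:
    """Second stage: passes on only lines that are not comments."""

    def __init__(self, lines):
        self.lines = lines  # any iterable of strings

    def __iter__(self):
        for line in self.lines:
            if not line.startswith("#"):
                yield line


source = FakeSource("# header\nalpha\n# note\nbeta\ngamma\n")

# Chain the stages: the output of one is the input of the next.
pipeline = CommentFilter(LineReader(source))

for line in pipeline:  # lines flow through one at a time
    print(line)

# itertools.islice takes only the first items of any iterable,
# which is also how to take a finite slice of an infinite generator.
print(list(islice(pipeline, 2)))
```

No stage holds the whole data set in memory; each line is read, cleaned and filtered only when the consuming loop asks for the next item.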
An example: I just showed you this class that gets lines from a server and does some processing, maybe splits something. I can create another one that takes that and then checks whether a line is, say, a Python comment or some random comment, and passes it on. So it's kind of like a stream that gets processed and then sent forward. And you can chain them: first create the one object with the source, and pass it as the input of the other, for example. And you can just use it in a for loop. And inside, you're going to be processing in kind of a stream fashion. So think about Inception, the movie. You're going through different levels. The first level does something. The second level does something. They are sending data. You don't have to write an object for this. You can just write a generator function and replace the whole thing with the function. I just do it this way as an example of how I like to do it. There's a talk at EuroPython which will go into more detail. It's on Friday. You probably didn't get the whole idea as clearly as I wanted, but you can for sure get more information in that talk. So to finalize, some conclusions and closing remarks. Data scientists, engineers, developers, you name it: in my opinion, you should start by learning the collections and itertools modules. They're basic. Iterables and iterators are your best friends; build your data processing pipelines using them. Use object-oriented programming for organizing your code. That will help you not only to make your code more maintainable; when you get to integration and you're working in large teams, you will have a better time getting your code into the code base. And finally, you're going to have to start moving toward being more of a software engineer instead of being just a scientist or a data juggler. You will have to become more of a software developer when you want to get your solution into an existing product. Some credits.
These are the images I used. I based most of my talk on a couple of articles, in particular on ideas coming from Radim Řehůřek. He's the creator of a library called Gensim. And I really like how, in the article I'm linking here, he talks about data processing pipelines using such iterators and iterables. As I said, I work for TrustYou. We are hiring. If you want to know more about the company I'm working for, we have a small table where you can get some goodies; just drop by and talk to us. Or if you just want to talk to me about the talk after the Q and A session, you are also welcome there. So, questions, comments, remarks, or if you want to trash my talk, you're welcome to. Thank you. I think everyone is hungry, so I don't expect many questions. We have time for questions anyhow, if you want. If we don't have questions, we thank you again. And good lunch.