Thank you. Hello everyone, thanks for coming. My name is Dimiter Naydenov, and this is my first ever EuroPython talk. I'm quite passionate about pandas, and I hope that by the end of my talk you might want to try it as well.

Let me first tell you a few things about myself. I have been a software developer for over 20 years now. I started back in the day with BASIC and Pascal, went through C, C++, C#, even PHP for three years, and then I discovered Python through Django, and Python has been my favorite language by far ever since. I've used it for pretty much everything: server-side software, scripting, web apps, mobile apps, and all sorts of other things. I worked for Canonical for four years, on a port of a cloud deployment suite from Python to Go. After that I decided it was time to strike out on my own, so I went full-time into freelancing, happily with Python again, and founded my own small company.

So, what about pandas? Seriously, how many of you have used pandas before? All right, great. And have you used it for anything other than scientific or statistical software? Okay.

Just a quick introduction for those of you who don't know it: pandas is an open-source Python library created in 2008 by Wes McKinney. It offers high-performance, easy-to-use data structures and a great API for data analysis, built on the solid foundation of NumPy. It's also very well documented; almost too well, in a way.

I first heard about pandas at EuroPython 2012, I think, and since then I kept hearing about it from all sorts of people, all the time, so I decided to look into it and see what it was all about. I'm not from a scientific or financial background, so that was my first experience with it. What I liked about it was that it's easy to install and has very few requirements, especially on Linux.
Installation is trivial there, but it also works on Windows and macOS. It's as fast as NumPy yet a lot more flexible, and personally I don't really like NumPy that much, because I found it somewhat counter-intuitive and awkward to use. pandas also reads and writes pretty much any format you might have to deal with, CSV, Excel, and HDF5, to name just a few, which was an obvious advantage for me. And since I'm quite a visual thinker, I like how easy it is to plot things with pandas and matplotlib.

So I did try it, but I found some quirks and pain points which rather put me off, and I want to share a few of them with you. It has good documentation, but at the time there were not many tutorials or hands-on guides; it was a bit intimidating to read all that documentation and know where to start. There are also confusingly many ways to do the same thing, which felt kind of un-Pythonic, at least back then. There are also lots and lots of indexing operations, every sort of indexing you can imagine, which is part of its power, but I didn't understand them at the time, and some of them, especially the MultiIndex, seemed pointless to me. It has sane defaults for most things and can handle many types of data intelligently; however, that inference is not as fast as you might like, so you may want to be specific when dealing with particular types of data, like datetimes, floats, or integers, and do some conversions in between.

So let me tell you about a project of mine, in which I found, quite unexpectedly, what a good fit pandas was for some of the tasks
I had to deal with. The project is an SVG mail-labels generator: it lets you send personalized mail with the address labels on the envelope rendered in the sender's own handwriting. This is done by following a few requirements. One of them is to acquire a sample of the user's handwriting on a tablet, captured in a vectorized SVG format. Then extract individual letter and symbol SVG files, small ones, from each of those sample pages, per user. Then, out of those, compose arbitrary word SVG files and make them look as if they were written by hand. And finally, generate mail labels from those words, sticking them together into multi-line, multi-word labels.

First, the acquisition of handwriting samples is done on a tablet with a stylus or pen. Every user gives one or more of those samples, and they are saved as SVG files. This is an example of one of them. Basically, it's a standardized text that every user writes, and each user writes that sample on several different pages, so that there is a basis for comparison. In each of those files the pen strokes are recorded individually as vectorized SVG curves. And this, for example, is what one of the outputs of that process looks like: a mailing label done for one of the users. Sorry, the zooming is a bit weird. So, this is the generalized process.
It's a multi-stage pipeline of sorts. It starts with the parsing of the SVG sample page; enter pandas. pandas is used to read those strokes in and present them in a tabular fashion, in a DataFrame, so they can be easily handed off. Then there is a letter-extraction process, which heavily uses pandas to extract individual strokes and combine them as they were on the page, so that you can go from single individual strokes to actual letters, and then reuse those. Then there is a classification step, which is done manually and basically labels each of those extracted letters as an A, a B, a C, a dollar sign, and so on. After we have this, there is the word-building stage, where we select letter variants for a specific word, stick them together, apply some alignment, and so on. And finally, there is the labeling stage, which produces labels out of those words and aligns them, ready for printing.

So let's look into the parsing first. The problem is how to extract meaningful information from that SVG XML in Python, and what I found is the excellent svgpathtools library, which has a lot to offer. It has a Path base class and a few subclasses, Line, CubicBezier, QuadraticBezier, plus a few other top-level utilities. Each of those classes has a rich API for path intersection, calculating bounding boxes, transformations, scaling, and all sorts of other things; you can cut paths, translate them, and so on. It also lets you easily read and write lists of SVG paths from or to SVG files, applying scaling and other transformations on the way. And it takes just a single line.
So this is basically an example of how easy it is to get those paths from a file. The svg2paths function takes a file name and a bunch of optional arguments deciding how and what to convert. It converts everything to those three primitives, Line, CubicBezier, and QuadraticBezier; it handles arcs, circles, and all the other shapes by converting them into those as well, and it returns a list of Path instances and a list of dictionaries containing the extra XML attributes of each of the paths.

Once we have these, this is the easiest and simplest way I found: we use the pandas DataFrame.from_records class method, which takes an iterable, or in this case a generator, of dictionary-like objects with the same structure. What I cared about here is the index of each Path instance within the file and its bounding box, that is, the minimum and maximum horizontal and vertical coordinates that fully encompass that stroke. We get a structure that looks something like this.

Then, on to the letter extraction. The problem is quite computationally intensive if you address it with a naïve algorithm: you need to compare each stroke with all the nearby strokes that might have something to do with it, and merge them together into letters. What I found is that, using simple DataFrame iteration and filtering over multiple passes, you can do that easily, and quite quickly as well. The multiple passes are done by taking the DataFrame and returning it modified, along with two sets of indices, one for merged paths and one for yet-unmerged paths, which, as you can see here, you can easily extract using the DataFrame. Each of the steps then looks basically like the one I'm going to show, which merges the fully overlapping paths.
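To make the DataFrame.from_records step described above concrete, here is a minimal sketch. The stroke data, the helper function, and the column names are all made up for illustration; the point lists stand in for the parsed svgpathtools Path objects:

```python
import pandas as pd

# Stand-ins for parsed strokes: each stroke is a list of (x, y) points.
# In the real pipeline these would come from svgpathtools Path instances.
strokes = [
    [(0.0, 0.0), (2.0, 3.0)],
    [(1.5, 1.0), (4.0, 2.0)],
]

def stroke_records(strokes):
    """Yield one dict per stroke: its index in the file and its bounding box."""
    for index, points in enumerate(strokes):
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        yield {
            "path_index": index,
            "xmin": min(xs), "xmax": max(xs),
            "ymin": min(ys), "ymax": max(ys),
        }

# from_records accepts a generator of same-shaped dicts directly.
df = pd.DataFrame.from_records(stroke_records(strokes), index="path_index")
print(df)
```

The resulting table, one row per stroke with its bounding-box coordinates, is what the later merge passes filter and update.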
So we iterate over the DataFrame, taking each path in sequence, and then we filter the DataFrame, in this case for all the paths whose bounding box fully overlaps with the current path's. We take those as candidates, a subset of the DataFrame. Then we run a fairly complicated merge procedure, which I won't show because it's about a page and a half, but basically what it does is update the DataFrame so that when two paths are merged they have the same bounding box: it updates the xmin, xmax, and so on of both to match their combined bounding box. It also updates those merged and unmerged sets and returns the DataFrame. After each of those steps we run an update step, which calculates additional properties for each of the paths, and since pandas makes this quite easy, you can chain assignments like this: for example, calculating the width and height of the bounding box; the half-width and half-height, which are used in some of the merge steps; the area, width multiplied by height; and the aspect ratio, width divided by height. Finally, we need to sort the values so that the paths come in natural writing order, top to bottom, left to right.

Once we have this, we have a bunch of smaller letter files, which we then need to classify, and this is a deliberately manual process, as per the client's requirements. There is an external tool they already used for this sort of thing; no pandas there, unfortunately. It loads the merged but still unclassified letter SVGs and shows them one by one to a human, who can align each letter within its box against the background and label it: this is a dollar sign, this is a capital A, this is a lowercase L, and so on. Once we have these, we have labeled SVG letter files, letter variants, and then we come down to the word building.
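The filtering and update steps just described might be sketched roughly like this; the bounding-box values and the exact overlap test are invented for illustration, not the project's actual merge procedure:

```python
import pandas as pd

# A hypothetical bounding-box table for three strokes.
df = pd.DataFrame({
    "xmin": [0.0, 1.0, 10.0],
    "xmax": [5.0, 4.0, 12.0],
    "ymin": [0.0, 1.0, 0.0],
    "ymax": [5.0, 4.0, 2.0],
})

# One merge pass: candidates whose bounding box lies fully inside the
# current stroke's box (stroke 0 here).
current = df.loc[0]
candidates = df[
    (df.index != 0)
    & (df["xmin"] >= current["xmin"]) & (df["xmax"] <= current["xmax"])
    & (df["ymin"] >= current["ymin"]) & (df["ymax"] <= current["ymax"])
]

# The update step: derive extra per-path properties with chained assign()
# calls, then sort into natural writing order (top to bottom, left to right).
df = (
    df.assign(width=df["xmax"] - df["xmin"],
              height=df["ymax"] - df["ymin"])
      .assign(area=lambda d: d["width"] * d["height"],
              aspect=lambda d: d["width"] / d["height"])
      .sort_values(["ymin", "xmin"])
)
print(candidates.index.tolist())  # stroke 1 lies fully inside stroke 0's box
```

Chaining `assign()` keeps each derived column readable, and `sort_values` on the vertical then horizontal coordinate gives the top-to-bottom, left-to-right ordering mentioned above.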
So this is an example of an intermediate output of the algorithm, a debug version showing the letters, their bounding boxes in green, and the running baseline of the word, which is the line along which all the letters are aligned, so that they look as if they were written on the same line. It takes a single word as input, for example "testing". It runs a selection process for each letter, either sequentially or randomly with a seed, picking a labeled variant for that letter. Then it does the horizontal composition, merging the selected variants with variable kerning, which is the typographical term for the spacing between letters. Then there is a vertical alignment step, which, following the running baseline, places certain letters, for example g, y, and a few others, either below or above the baseline as needed, and outputs a single SVG file for that word, at a consistent size.

So, the labeling. Just to remind you how it looks: it takes as input an Excel file with mail addresses; no surprise here, pandas works great with this. The structure is one row per label, one column per line, and parsing it is as simple as using pandas' read_excel. The generation stage builds the words with variable spacing for each column, and the alignment is done with so-called variable leading; leading is the vertical equivalent of kerning, that is, the spacing between the lines. And that's it, basically.

So I think I should tell you what I learned from this process. pandas is great for any sort of table-based data processing; that was an unexpected discovery for me. It might be intimidating at first if you haven't used it.
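The parsing side of that labeling step might look something like this minimal sketch; the column names and address data are invented, and the DataFrame literal stands in for the real pd.read_excel call:

```python
import pandas as pd

# One row per label, one column per line of the address.
# In the real pipeline this would be: df = pd.read_excel("addresses.xlsx")
df = pd.DataFrame({
    "line1": ["Jane Doe", "John Smith"],
    "line2": ["12 Main St", "34 High St"],
    "line3": ["Springfield", "Shelbyville"],
})

# Each row becomes one multi-line label; in the real pipeline each cell
# would be rendered as a handwritten word-SVG line.
labels = ["\n".join(str(cell) for cell in row)
          for row in df.itertuples(index=False)]
print(labels[0])
```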
There's a lot to read, but if you learn just a few things and start from there, like filtering and iteration, you can go a long way. Also, take the time to understand indexing and the power of the MultiIndex, because that gives you the ability to deal with multi-dimensional data in a very comprehensible way. And of course, any time you need to deal with CSV or Excel, which is otherwise quite a pain, pandas makes it trivial and fast; it doesn't have to be financial data or anything like that. Also, the documentation is great. There is a lot to read, so it can be a bit confusing at first, but I would suggest starting with "10 Minutes to pandas", which is one of the main sections of the documentation. There are also a lot of tutorials now, and a lot of cookbooks and hands-on guides; it has all grown a lot. There was actually a documentation sprint for pandas recently, which expanded it even further.

With that, I have just one more thing to say: please consider buying Wes McKinney's book, Python for Data Analysis, because it's great and it will help you a lot on your journey into pandas. And I'll be happy to take any questions. Thanks very much.

[Chair] Are there any questions? We've got lots of time.

[Audience] Sorry, this may be a silly question. In your practical use cases, in your practical work, have you come across any limitations of pandas?
[Speaker] Oh, yeah. Well, there are quirks that you tend to learn to live with, but also to overcome. For example, with any sort of numerical data that can have gaps in it, or possibly strings mixed in, the gaps turn up as NaNs instead of something else, so if you expect to get integers, you might get floats instead. So yeah, type conversion is one such thing.

[Audience] Another use case I would like to raise to the community's attention, from my own work, is the case where the data input we get is a nested JSON file, a nested JSON string. pandas' read_json can only process one level.

[Speaker] Yeah, I haven't used pandas for JSON personally; I think Postgres is better for that, if you can afford it, I mean, if you can have it.

[Audience] No, I mean, my solution was that I had to write my own library to process this into a DataFrame, but that's quite static. So I was always wondering whether pandas could absorb this feature: analyzing the nested JSON itself, since the output will always be a pandas DataFrame anyway. As a first step it would analyze the JSON to identify the keys, and then you could just crunch it and get the different columns. That would be an improvement for pandas. But the talk was splendid; I agree. My question was just about the limitations of pandas.

[Speaker] Yeah, I'm sure you can go a long way using pandas for some part of that process, reading the nested JSON, and for sure, if you can convert it to something more tabular, you'll get a lot more out of pandas.

[Speaker] Are there any other questions? No? Well then, I hope you try it!
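A note in passing on that JSON exchange: pandas does ship a helper for flattening nested records, json_normalize (available as pd.json_normalize in recent versions). A minimal sketch, with made-up sample data:

```python
import pandas as pd

# Nested JSON-like records, as the questioner describes.
records = [
    {"id": 1, "user": {"name": "Ana", "address": {"city": "Sofia"}}},
    {"id": 2, "user": {"name": "Bo", "address": {"city": "Oslo"}}},
]

# json_normalize flattens nested dicts into dotted column names,
# going as many levels deep as needed.
df = pd.json_normalize(records)
print(list(df.columns))
```

This covers dicts nested inside records; for lists of sub-records, json_normalize also takes record_path and meta arguments to explode them into rows.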