Thank you. I'm Alexander, and I'm from Mannheim in Germany. I'm a developer with my own company. I'm an organizer as well, sometimes a speaker, and a MongoDB trainer for our local community. For the Python community, I've served as program work group co-chair building this conference, so if you have any comments or suggestions about what we could do better, I'm around; just grab me and talk to me. I'm very interested in your input.

My talk today is about MongoDB and the MongoDB aggregation framework. We're going to cover the pipeline model, the pipeline stages, and map-reduce in MongoDB. First, who knows MongoDB? Okay, awesome. And who's actually working with MongoDB? Okay. And who has worked with the MongoDB aggregation framework? Okay, cool.

So let's bring everybody up to speed on document-oriented databases in 15 seconds. Basically, a document is a JSON-like object that we can store in our database, and we have no schema enforcement. A collection is basically just a collection of documents, and multiple collections make up our database. So it's a pretty simple concept.

The MongoDB aggregation framework was introduced about three years ago with MongoDB 2.2. It's a framework for data aggregation: the documents are processed through a multi-stage pipeline that gives back aggregated results. It's designed to work straightforwardly, so there are no unions like in SQL. Technically it looks like this: we have our documents, we do a match, which is a find, and we get fewer documents because we found only a subset; then we do some grouping and we get even fewer.

But actually, I thought that's a little bit too technical. I think it's more like a relay race: the relay racers run and pass the baton to each other, and basically this is how the MongoDB pipeline works.
So we have a match, which is a find: we say, please, little doggy, get the baton. Then, please pass it on to the smart fox, who does something smart, which could be a grouping. And then we want to present our data a little more nicely, so we pass it on to the projection stage.

Let me tell you a little bit about the data set we're going to work with. I'm going to present what we're doing, and I've also prepared some live demos. This is built with MongoDB, yes. We're using MongoDB 3.0 with the new WiredTiger storage engine, which gives us compression. PyMongo is obviously the driver we're going to use; it's maintained by MongoDB themselves, it's well maintained and always up to date. It's really good. We're working on a data set of 37 gigabytes, which WiredTiger compresses to about 9 gigabytes.

As you might remember, IT is my second career. I used to be in the record industry, with a techno-house startup business in the 90s, so everything I still do in IT is very close to music. From a project we're doing, called chart guys, we have a collection of playlists from the iTunes Music Store. A playlist is basically all the information about a release that you can find in the iTunes Music Store. It's a set of playlists that appeared in some chart somewhere around the world within the last three years, and basically this is what it looks like. Pretty cool.

But don't worry: I've narrowed it down to what we're going to work with today, just to give you an impression of our document structure for the demos. Basically, this is a document. info is all the release information, like the album artist, the album name, when it was released, and how much it costs in the store. And the children are what we call a subdocument.
It's a list of objects, and those are the songs: each and every song we have in our playlist.

I was wondering which artist to use for my demos, because it's really hard to choose a music artist that makes everybody happy, and I thought I'd found something neutral, so I chose Taylor Swift. It's not because I like her music; actually, I don't know any of her songs. But she wrote this great blog post that made Apple pay artists for the trial period of the new Apple Music service, so artists get paid more money for people using a new service by Apple. She did a good thing, and I think that's worth mentioning, even at EuroPython.

So let's build our first pipeline. I've added some comments for the SQL guys to make it easier. A pipeline is basically passed into PyMongo as a list. $match is basically just a find. As you might remember from our document, there's the artist name, so we're looking for the artist, which is a variable I've already set to Taylor Swift, of course. Then we do a $project, which is basically a select. All we want to do is print out all the releases by Taylor Swift, sorted.

Then we switch over to the demo. Basically it's just an import: we import pymongo, and this is just a simple database connection. So let's look at our database, which is live on this MacBook. I must say it's only been assigned two gigabytes of RAM for this database; usually we work with a lot more RAM in MongoDB. We have 1.3 million playlists, and about 17 million songs are covered in our data set. As you can imagine, we could also just run a query, and our query says we found 4,093 releases by Taylor Swift. With the aggregation framework, it's the same code I showed you on the slide before: we do a match, which is a find, and then a project. Here we just project.
It's really just a renaming of the attribute. Then we sort by release in ascending order, and basically it looks like this. Okay, you see we have many releases. She's quite a busy artist, famous with karaoke, too.

What else can we do? We can extend our pipeline with a grouping. Now we want to group everything by name, which is basically the album title, because, as you can probably see, we have a lot of duplicates in our data set. That's because albums are released by different companies worldwide, so in the actual store they get a new ID. They're basically different products, although the music content is the same. So we pass in the name as _id. _id in the $group operator is basically what we want to group by; it's mandatory, and it's always called _id. And we want to count how many albums there are. We don't have a count operator in the aggregation pipeline, so basically we just sum 1 for each and every document in our group. Then we project and sort, just to make it a little nicer, and this is what we get. We still have some different versions.

Okay, now we have this nice pipeline and we've got a result, and I think it's so nice, let's print out what we found again. We get this, and it's so nice that I just want to print it again, and oops, what happens? res is where we stored the result of our query to the aggregation framework, and I just wanted to print it again. What's happening here? Why doesn't it give us any result back? That's the first trap I want to show you: the MongoDB aggregation framework returns a cursor. The cursor basically points to the data in the database; that's what you get back from the MongoDB aggregation. Once we call list, the cursor is exhausted: all the data comes in, it's printed, and then it's gone.
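
The pipeline walked through above can be sketched as a plain Python list of stage dictionaries. The field names (info.artistName, info.name) and the collection name in the comment are assumptions for illustration, since the slides are not part of this transcript. With PyMongo the list would be passed to collection.aggregate(); the second half uses an ordinary Python iterator with made-up rows to mimic the one-shot behaviour of the returned cursor, so no database is needed to try it.

```python
artist = "Taylor Swift"

# Hypothetical field names; adjust them to the real document structure.
pipeline = [
    {"$match": {"info.artistName": artist}},                 # WHERE
    {"$group": {"_id": "$info.name",                         # GROUP BY album title
                "count": {"$sum": 1}}},                      # add 1 per document
    {"$project": {"_id": 0, "album": "$_id", "count": 1}},   # SELECT ... AS album
    {"$sort": {"count": -1}},                                # ORDER BY count DESC
]

# With a live server this would be roughly:
#   res = db.playlists.aggregate(pipeline)   # returns a cursor
# A plain iterator over made-up rows behaves the same way when read twice.
res = iter([{"album": "Album A", "count": 3}, {"album": "Album B", "count": 2}])
first_pass = list(res)    # consumes the iterator
second_pass = list(res)   # nothing left to read

print(first_pass)
print(second_pass)
```

Keeping the materialised list (first_pass here) is one way to reuse the results without running the aggregation again.
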
So you can't just use the results again unless, of course, you store them in a new variable.

These are all our aggregation stages; I've put their SQL brothers on the right-hand side. $match is a WHERE or HAVING. $sort, pretty obviously, is ORDER BY. $limit needs no explanation either, I think. $project is a SELECT, and we can also use it for renaming in our result, as in SQL. $group is GROUP BY. $unwind, which we're going to go into very soon, is somehow a little bit of a join, but not really. $redact we're not going to cover. And $out is basically just an operator that says: please send the result of the aggregation to a new MongoDB collection and store it there.

To make things a little easier for you to follow: in the next examples, artist name is the artist, and name is the album title. We're making our pipeline a little bigger, with the group operator I've already shown you. This is how it looks, printed out a little more nicely.

The next step is: how can we work with lists of subdocuments? As you see, we have a list here with all the songs on that album, and we want to do something with it. Of course, the natural thing would be to query the database and just iterate over it with Python. But that's quite an expensive task, and we can do it in the database instead, with this $unwind operator. From my experience, at first sight it's a really confusing step, because it's quite unusual; from what I've seen, it confuses people. So I think, let's just show you.
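
A rough, pure-Python sketch of what an unwind stage does to a list of subdocuments. It only mimics the idea in memory for a top-level field and says nothing about how MongoDB implements it; the playlist below is made up and its field names are assumptions.

```python
import copy


def unwind(documents, field):
    """Yield one copy of each document per element of the list stored at `field`."""
    for doc in documents:
        # Documents without the field (or with an empty list) produce nothing.
        for item in doc.get(field, []):
            new_doc = copy.deepcopy(doc)
            new_doc[field] = item   # the list is replaced by a single element
            yield new_doc


# Made-up playlist with a hypothetical structure.
playlist = {
    "info": {"artistName": "Example Artist", "name": "Example Album"},
    "children": [{"name": "Song A"}, {"name": "Song B"}, {"name": "Song C"}],
}

unwound = list(unwind([playlist], "children"))
for doc in unwound:
    # The path "children.name" still works, but now points at a single song.
    print(doc["info"]["name"], "-", doc["children"]["name"])

# The corresponding aggregation stage would be: {"$unwind": "$children"}
```
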
I think that's probably the best explanation, because what $unwind does is basically take all the subdocuments in the list, and for each object in our subdocument list it creates a new document. This sounds like a really expensive operation, but I can assure you they did a really good job and it's not expensive at all. It's really handy. Let me show you what we're doing now. Oh, I'm really sorry, let's go to that later. Here we are, sorry. So let's do this.

So now we have found all three hundred and thirty-two songs by Taylor Swift, and we can immediately work with them here in our grouping stage. As you see, the path has not really changed, although this used to be a list before. There's no need to do anything like iterating over a list index.

I've prepared a little bit more here to show what's happening. We limit to one release, so we get one playlist; then we do the unwind; and then I'm just renaming with the project parameters. This is basically what we get: all of these are single, new documents, created on the fly, that we can immediately work with. So it's basically just an unwinding of the data. It's a slightly unusual concept, but it's really simple.

Okay, another one, which is quite obvious. We also have $sort, which is an obvious pipeline stage. I want all the releases sorted by count descending and by release ascending. It looks really simple, and it returns something like this. So what's going wrong?
Something's wrong, because we said: I want count descending over all our data, and then I want it sorted by release in ascending order. But our result is sorted by release and then by count. So something's going wrong here, and I can assure you it's not broken. It's actually a trap, because in the pipeline we pass in a Python dictionary, and a Python dictionary is, of course, unordered. So we just pass in something that is not ordered, and our results get a little unpredictable. This is the solution, and I encourage you to always use SON from the bson package, or you can also use collections.OrderedDict, and pass in all sort parameters in an ordered fashion, because otherwise your sort order won't really work. And so, oh wow, it works. Okay.

So this was just a really quick introduction to the stages. There are more, as mentioned before: $skip just skips documents; $out writes your results to a new collection; there's $geoNear, which gives you all the documents around a geospatial point. $redact, I don't know: some people use it to restrict access at the document level, but I've never really seen it in production.

These are the stages. This is our relay race, and now we have some data. But this is very limited in what we can do; basically it's just mangling the data around a little bit. So of course we have more. There are the $min, $max, $first, and $last operators, and this is what we're going to work with next. Again we're searching for an artist, and we're using release date and release date epoch. The subtle difference is that release date epoch is actually a date, while release date is not a date; it's a string. We're building a new pipeline.
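
Coming back to the sort trap described earlier in this passage, here is a minimal sketch of an explicitly ordered sort specification. The key names count and release and the sample rows are assumptions; apply_sort is a hypothetical in-memory stand-in for the $sort stage so that the effect of key order can be tried without a server. On Python 3.7 and later, plain dicts keep insertion order, so the problem mostly affects older interpreters, but an ordered mapping still states the intent clearly.

```python
from collections import OrderedDict

# Key order matters: count (descending) first, then release (ascending).
sort_spec = OrderedDict([("count", -1), ("release", 1)])
sort_stage = {"$sort": sort_spec}
# With PyMongo installed, bson.son.SON([("count", -1), ("release", 1)])
# can be used in the same place.


def apply_sort(docs, spec):
    """In-memory stand-in for $sort: least significant key first, stable sort."""
    for key, direction in reversed(list(spec.items())):
        docs = sorted(docs, key=lambda d: d[key], reverse=(direction == -1))
    return docs


rows = [  # made-up rows
    {"release": "2010", "count": 2},
    {"release": "2008", "count": 5},
    {"release": "2006", "count": 2},
]
ordered = apply_sort(rows, sort_spec)
for row in ordered:
    print(row)
```
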
We do a grouping. What I want to find out is: what's the earliest release by Taylor Swift, and what's the latest? So we group by _id, and as you see, _id is empty. It's our primary key, so how can it be empty? Yes, it can be empty, because we want to group over our complete result set, so we can just put None in there. There's no need to look for an attribute that is the same on each and every document; just leave it empty. We introduce two new attributes, min date and max date, and basically it's a really simple operation: we walk the path through info to the release date, take $min and $max, and project it. And, yay, Taylor Swift has been around since 2006. I think she started really early; she's been releasing stuff and she's been around for a while.

So what are $first and $last good for? I mean, we have $min and $max, and those would work here as well. $first and $last are just a little bit different, and they can save you some extra calculations. The difference from our previous pipeline is that we have a match, then we sort by release date, and then we do our grouping, and our grouping instructions are $first and $last. What do $first and $last do? It's really simple: $first gets the first document of the group, and $last gets the last document of the group. So there's no need to iterate over the complete set within the group to find min or max values. You can just say: I want this document and I want that document, and what's in the middle I don't really care about. So this can be really effective, and, as expected, we get the same results.

With dates we can do even more; we have some nice date operators. In the pipeline we basically do the same, sorting by release date. Then we do a grouping, and I want the releases grouped by year. I'm a fanboy now; I've talked so much about Taylor Swift.
I really want to know everything, so I want to see how many releases there were per year. We've extended our _id a little bit: now it's an object with our $year operator. We pass in release date epoch, which is the date, and $year basically just grabs the year from our date. This is then the _id we want to group by, and it works like this. It makes things really easy if you have data with timestamps. So we see the count: she's like a bee, putting out a lot of releases every year. She's a hard worker.

But what if I want to dig even deeper? I'm not only interested in the releases by year; I'm also interested in the release count for each and every month in which she released something. Of course, I wouldn't mention it if we couldn't accomplish it here: we also have a $month operator. And the next thing that happens is that the _id, which is our primary key, can also be a compound key. So we have a compound key of year and month. We basically do the same as before: we get the year and the month as new attributes, and our _id key is built out of two attributes, year and month. Let's just run that, and, wow, we see there's hardly any month in which she didn't release anything.

There are a lot more date operators, as you can guess, like $second and $minute and many more, and we're not able to cover them all in this short talk.

But I'm getting a little bit bored with Taylor Swift now, because it's early in the morning and we want some more tension. So I thought about who else could join, and I thought: hey, I'll just Google "Taylor Swift nemesis", and Google says it's an alien space robot called Katy Perry. So let's bring in Katy.
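
The grouping pipelines from this part of the talk can be sketched as plain Python lists. The field names info.artistName, info.releaseDate (a string) and info.releaseDateEpoch (a date) are assumptions based on how the attributes are described; the result names minDate, maxDate and count are likewise chosen for this sketch. Each list would be passed to collection.aggregate() in PyMongo.

```python
from collections import OrderedDict

artist = "Taylor Swift"
match = {"$match": {"info.artistName": artist}}   # hypothetical field name

# Earliest and latest release over the whole result set: _id is None.
min_max = [
    match,
    {"$group": {"_id": None,
                "minDate": {"$min": "$info.releaseDate"},
                "maxDate": {"$max": "$info.releaseDate"}}},
]

# Same question via $first/$last, which relies on the preceding $sort.
first_last = [
    match,
    {"$sort": {"info.releaseDate": 1}},
    {"$group": {"_id": None,
                "minDate": {"$first": "$info.releaseDate"},
                "maxDate": {"$last": "$info.releaseDate"}}},
]

# Releases per year and month: a compound _id built with date operators.
per_month = [
    match,
    {"$group": {"_id": {"year": {"$year": "$info.releaseDateEpoch"},
                        "month": {"$month": "$info.releaseDateEpoch"}},
                "count": {"$sum": 1}}},
    # Ordered mapping so that year is sorted before month.
    {"$sort": OrderedDict([("_id.year", 1), ("_id.month", 1)])},
]

for name, stages in [("min_max", min_max), ("first_last", first_last),
                     ("per_month", per_month)]:
    print(name, [list(stage)[0] for stage in stages])
```
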
Bringing in Katy Perry is really easy: we can extend our match operator. Katy Perry is now stored in our nemesis variable, and we can do searches with the $in operator. It's basically the same as in Python, so I don't think it's really necessary to explain it to you guys.

And of course now we have a big competition. I'm wondering who delivers more song value for my 99 cents: is it Katy Perry, or is it Taylor Swift? So I want to see the average playtime of their songs. I'm interested in who gives me more, longer songs that I can enjoy for my money. It's not a good measure, but it's a nice example.

As you see, we now have three unwind stages. First we unwind the songs, then we unwind the song offers, and within those song offers we unwind the assets, which is where the playtime is stored. We want to access this information, so that's why we have a pipeline of one, two, three unwinds: unwind, unwind, unwind. Then we can group, just going down the path, by the song name, which is children.name, and then we take an average over the path to the duration we have stored within the assets.

Let me show you this, and... something's wrong. Okay, sorry, I'll just fix it. Okay, something's broken, I'm very sorry. I won't waste any time fixing this live. But I can explain: basically it's the same as what we did before with the releases here, counting the releases, and the next step of course would be getting the playtime. I hope my notebook didn't break.
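
A minimal sketch of the three-unwind pipeline described above, written as a plain Python list. The nested path children.offers.assets.duration is an assumption reconstructed from the spoken description, and the sketch groups by artist to get the comparison directly, whereas the talk describes grouping by the song name first.

```python
artist = "Taylor Swift"
nemesis = "Katy Perry"

avg_playtime = [
    # $in matches documents whose artist is any of the listed values.
    {"$match": {"info.artistName": {"$in": [artist, nemesis]}}},
    {"$unwind": "$children"},                  # one document per song
    {"$unwind": "$children.offers"},           # one document per song offer
    {"$unwind": "$children.offers.assets"},    # one document per asset
    {"$group": {"_id": "$info.artistName",     # hypothetical grouping key
                "avgDuration": {"$avg": "$children.offers.assets.duration"}}},
]

# With PyMongo this would be run as: collection.aggregate(avg_playtime)
print([list(stage)[0] for stage in avg_playtime])
```
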
Oh, it didn't break. Okay, sorry again. So we have our group on playtime, and we just project it, and as a result we can see that Taylor Swift gives us more music, about 10% more than Katy Perry, for 99 cents. That's a really easy operation.

Okay, now something a little more challenging. I'm interested in (thank you) the prices of the artists' releases. My data is basically scraped data, so it's probably not as clean as I would wish. We see a formatted price with the currency in front and then the price, but it's all in one attribute, and I'm interested in getting the prices in US dollars. That's easy to solve with a string operation and a compare operation.

We have to speed up a little bit. Basically, we do a project phase; just focus on the things in bold, they're the important ones here. We have is US dollar, which is a comparison on the lowercased first three characters of our formatted price, which give us back USD or some other currency or numbers or whatever. The comparison basically asks: is this US dollars? It's pretty obvious. And then we just do a new match for is US dollar equal to 0. Okay, that feels a little bit wrong, but the $cmp operator gives us 0 back when the values match, 1 when the first value is greater than the second, and -1 when it is less. So it's a pretty handy one. There's also an $eq operator, which would give us true or false back, a boolean, as we'd expect.

Then we sort and group, and we can do something else as well: we can push every release we find in our group into a new list with the price and the product. $push is basically very similar to JavaScript's push.
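
The price filter described above could look roughly like the following sketch. The field name info.priceFormatted, the projected names (artist, product, price, isUSD) and the sort key are assumptions; the operators $substr, $toLower, $cmp and $push are the ones the talk names or implies. The small cmp helper is only there to illustrate the 0 / 1 / -1 convention in plain Python.

```python
artists = ["Taylor Swift", "Katy Perry"]

usd_prices = [
    {"$match": {"info.artistName": {"$in": artists}}},
    {"$project": {
        "artist": "$info.artistName",
        "product": "$info.name",
        "price": "$info.priceFormatted",
        # Compare the lowercased first three characters with "usd".
        "isUSD": {"$cmp": [
            {"$toLower": {"$substr": ["$info.priceFormatted", 0, 3]}},
            "usd",
        ]},
    }},
    {"$match": {"isUSD": 0}},     # 0 means the two values are equal
    {"$sort": {"product": 1}},    # hypothetical sort key
    {"$group": {"_id": "$artist",
                "products": {"$push": {"product": "$product",
                                       "price": "$price"}}}},
]


def cmp(a, b):
    """Plain-Python analogue of $cmp: 0 if equal, 1 if a > b, -1 if a < b."""
    return (a > b) - (a < b)


print(cmp("usd", "usd"), cmp("zar", "usd"), cmp("eur", "usd"))
```
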
In Python it would be like an append. So let's go back here, and here you go: you see Katy Perry and Katy Perry's products, and the next object is Taylor Swift, with the list of all her products.

There are really a lot more operators. I can only suggest that, if you find the aggregation framework useful, you go to the MongoDB documentation. It's very well written, it has a lot of examples, and it's quite easy to get into.

One more: there's a variable operator, the $map operator, and as you can imagine it's basically the same as a Python map. What do we do here? We're getting the ratings count, which is how many users had given stars to the product when we scraped the data. We want to adjust it a little bit, because management is on our back and we need to make it look a little nicer. That's not something we really want to do, but it's a good example. So we pass the ratings count into $map as input, name it value, and then we can reuse value on our list: we just add 10 to each and every value we find in our list, and then it's applied.

Another thing that's probably not obvious: we cannot use the $sum operator on a list, the way we can in Python, where it's really handy. We have to unwind first. So for each and every value in the list we've mangled, we unwind it into a new document, and then we can do a simple grouping as we've done before. And there we go.

Which brings us to the next thing: you can also do map-reduce in MongoDB. How many of you guys know map-reduce? Yes. And who actually works a lot with map-reduce? Okay, so let's bring everybody up to speed. Map-reduce is basically a really simple concept.
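
Going back to the $map step described above, a minimal sketch of the project, unwind and group sequence. The field info.ratingsCount is assumed to hold a list of numbers, which is how the talk describes it; the names ratings and total are chosen for this sketch. The last lines show the same adjustment on a plain Python list for comparison.

```python
artists = ["Taylor Swift", "Katy Perry"]

adjusted_ratings = [
    {"$match": {"info.artistName": {"$in": artists}}},
    {"$project": {
        "artist": "$info.artistName",
        # $map binds each list element to $$value and evaluates "in" for it.
        "ratings": {"$map": {
            "input": "$info.ratingsCount",   # assumed to be a list of numbers
            "as": "value",
            "in": {"$add": ["$$value", 10]},
        }},
    }},
    {"$unwind": "$ratings"},                 # one document per adjusted value
    {"$group": {"_id": "$artist", "total": {"$sum": "$ratings"}}},
]

# The same idea on a plain Python list: add 10 to each value, then sum.
python_equivalent = sum(map(lambda value: value + 10, [1, 2, 3]))
print(python_equivalent)
```
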
We have all these documents and we map them. Mapping basically means we go through them and find key-value pairs. In our word example, we want to find the most popular words in our release titles, and we just emit them as tuples, as you can see, to the reduce phase, which is run by the reducer. In our example it will just sum up the counts. It's a pretty easy operation. Basically, we'll just use our name attribute.

You might wonder why we would use map-reduce in MongoDB when we have this great aggregation framework. You've seen substrings; we can do so many things. So what's the point of using map-reduce? Most of the time you can work with the aggregation framework, and in most cases it's faster and more accessible. (Thank you.) However, map-reduce gives us more power, because you can actually pass in JavaScript, and then you can build much more complex queries. For example, splitting up the release titles into words to count them could be quite challenging in the aggregation framework. So let's do it.

Okay, this is map. Let me show you a little bit more. Okay, this fits. No, sorry, this is our map. For map-reduce we're just using Code, imported from bson, which we can just pass text to, and this text is the JavaScript function. What does this JavaScript function do? It basically takes the name stored in info, in the name attribute, and splits it. It's a really simple operation. Then we just check for punctuation and such, and remove it.
It's probably not the best way to do this; it's just for a simple example. If we actually find a word, we emit it. So if there's something like "teenage", we emit (teenage, 1), and if the album is called "Teenage Dreams", we also emit (dreams, 1). We send it to the reducer, and the reducer has really simple code here: we just take all the keys and basically sum how often each key appeared from our emitter. And here we get a result.

Now we're going to do a little bit more, because I want to remove stop words. That's not really part of the aggregation framework, but just to make it a little nicer I've added the Natural Language Toolkit to remove stop words. And these are the most popular words in Katy Perry's and Taylor Swift's albums. You see they probably have a younger audience, with "dream" and "teenage" and "one" and "boys" and "fearless" and "speak", "kissed" and so on. So, yeah, it's really easy. Of course it would take some time, and unfortunately we don't have enough time left, but we could also run this operation across the complete data set to see what the most popular words are in album releases being sold in the iTunes Music Store.

To finish, I want to give you some best practices and tips you can use with the aggregation framework.
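
The word-count idea can be tried without a database. The following is a pure-Python stand-in for the map and reduce phases, not the JavaScript functions used in the talk (which are not part of this transcript); the album titles are made up, the info.name path is an assumption, and stop-word removal is left out.

```python
from collections import defaultdict
import string


def map_words(doc):
    """Map phase: emit a (word, 1) pair for every word in the album title."""
    for raw in doc["info"]["name"].split():
        word = raw.strip(string.punctuation).lower()
        if word:
            yield word, 1


def reduce_counts(key, values):
    """Reduce phase: sum all counts emitted for one key."""
    return sum(values)


def map_reduce(docs):
    """Group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_words(doc):
            grouped[key].append(value)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}


# Made-up documents with a hypothetical structure.
albums = [
    {"info": {"name": "Teenage Dream"}},
    {"info": {"name": "Teenage Dream (Deluxe)"}},
]
counts = map_reduce(albums)
print(counts)
```
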
First of all, in the database, think about your indexes, especially if you run queries on them. Of course, if you have a huge data set and you don't have an index, MongoDB has to do a collection scan, and on a slow computer that takes time and is probably frustrating for you. Think about getting your data set, your database, into RAM. There's a touch command in MongoDB, which does something similar to the Unix touch: you touch the collection, and it fills up your RAM with as much data as it can get from the system, so you can work live and in memory.

You have to keep in mind that the result can only be 16 megabytes, because that's the maximum we can store in a BSON document, but 16 megabytes is still huge. A pipeline operation also has a limit of 100 megabytes; that doesn't sound like much, but you will hardly ever really hit it.

On your queries: you can improve them up front. There's this nice (oh, sorry for the break here) explain operation, which basically gives you information about what MongoDB would do when running your query. You get some results: you see how many documents were scanned, whether indexes were hit, whether it was index-only, and then you can say, okay, I can really optimize all my work just by introducing a new index.

Hardware is of course really important, especially RAM: more is better. That's a really simple equation. Mind the disk performance, of course; SSDs and cloud computing make that really easy. And you can also think about working with a dedicated server in case you have something like a replica set and a write-heavy database. You can say: okay, just make another copy, work locally, and do your aggregations without having to worry if you have a lot of traffic on your database.

The last slide is some useful resources. Of course, as I mentioned, MongoDB has very good documentation.
It's kept pretty up to date by MongoDB as well. And I also want to mention Asya Kamsky; she works for MongoDB, also as a trainer, and she always has awesome tips and tricks. Here we go.

We don't quite have enough time for Q&A, so if you want to ask Alexander questions, then try to find him instead.

Yeah, I'm around; just ask any time. No problem.