Good, OK. So the last guy, he was quite good. Machine learning is super interesting and super fun, and I do it for a living, and I'm happy about that. It makes for great talks like that other one. But in an 80/20, Pareto-efficiency kind of way, that's just the 20%. The rest is the mechanics of data science: get the data, clean the data, fit the model. The 80% is the drudge work. And I'm glad you came to join me to hear about that other 80%. So: hassle-free ETL with PySpark. The 80% is not fun, so let's do it as easily as possible.

First, me. Command-line humor. I'm Rob. I'm a data scientist at ZocDoc. If you don't know what ZocDoc is, think of it like an OpenTable for doctors' appointments. You book online, super easy. You want to see a doctor now? Awesome, you got it. I'm also a corgi enthusiast. On the right you'll see Ellie. She is a corgi. She is also my pride and joy. Side note about ZocDoc: part of why the company is cool with sending me here is to let you know that we're hiring for machine learning engineers. I'd be super thrilled if one of you contacted me about that, because I also get a referral bonus.

All right, so what do we want? Again, core premise here: we're lazy. We want easy. And that means being very needy; we want a lot of things. We have all this data, and we want to be able to access it, explore it, wander around a bit, find out what we want. Then we want to package something up nice and tidy and run it every day or whatever, to automate our ad hoc stuff. If we want whatever company we're working for to survive, we'd better start planning for 100x and hope we get there, or else we just fade away. So we're going to need scalability. And again, lazy: we want to reuse code. That's what code's for. Compartmentalize stuff, plug it into other places.

So if the name didn't give it away, part of the hassle-free part is the PySpark part. And we can check off ease of access and scalability in a couple of steps. One: Databricks, Jupyter, or Zeppelin, take your pick. I don't care, no bias here. Two: hook it up to a Spark cluster. You're done. Check, check. That was very simple. I use Databricks. Note on the bottom: not a rep, I don't sell it, it's just good. If you don't believe me, well, I trust that you do. We're Python people. We liked IPython notebooks before they became Jupyter notebooks. We like ad hoc stuff; it's fun, it's a good time. Databricks, Jupyter, Zeppelin: all notebooks, no-brainer. We can fast-forward past this part.

Scalable? For those that aren't familiar with Spark, this is kind of a freebie too. Spark is a distributed in-memory computational engine. The syntax associated with it is expressive, compact, and kind of fun. It's written in Scala; I don't hold that against it, and there's a very nice Python wrapper. And if you have more data, this is the scalability part: just add more machines. It's distributed, what do we care? The same code runs no matter what the size.

But for the main part, the other components we're looking for, we're going to learn by doing. We all write code, so it's easier to see code. Background information for those that don't remember this now seemingly outdated terminology, ETL: Extract, step one, get your data. Transform, make it into something you actually want. Load, save it somewhere nice so you can access it later. Here's a little glimpse of what PySpark looks like. You can extract data in two lines; if I had a little more width on the screen, it would have been one line. Clean it with a few transformations (that part is context-specific, no code there), and then save.
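A minimal sketch of that extract-transform-load shape in PySpark. The paths, column names, and the transform itself are invented for illustration (the slide left the transform blank), but the read, transform, and write calls are the real DataFrame API:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Extract: read a huge JSON source in one line
df = spark.read.json("s3://my-bucket/raw/events/")  # invented path

# Transform: a few transformations; this part is always context-specific
daily = (
    df.filter(F.col("event_type") == "search")      # invented columns
      .withColumn("day", F.to_date("timestamp"))
      .groupBy("day")
      .count()
)

# Load: save it somewhere nice so you can access it later
daily.write.parquet("s3://my-bucket/warehouse/searches_per_day/")
```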
And for those developers that deal with particularly needy PMs or business users (as a data scientist, there are a lot of business users and PMs in my life): you do this once and you do it well, and it's like crack to them. They want more. So ETL jobs can accumulate quickly. You make a nice data warehouse table, you put up a little chart that's fed by it. Graphs are great, insights are great, we want more. Before you know it, you have stuff running all night. It's very inefficient.

You'll notice that there are generally more questions to be answered from data than there are data sources themselves. So if you think of the E, the T, and the L: there are a few Es, many Ts, and hopefully not too many Ls, or else you have data spread all over the place. And if we're dealing with an in-memory computational engine, let's take advantage of that. Let's have jobs that are dependent upon each other, share that memory, and work on the same cluster. We can construct a little dependency graph: one job loads a bunch of stuff, keeps it in the cluster, makes the data readily available, and feeds it to all the little dependent jobs. Once its first-level children are done, delete the data, and repeat, recursively.

Easy way to do this: TreeTL, a super-compact little pipeline package, does exactly that. Maintain the job order, cache the data that's going to be reused. If you have a job that runs and does some stuff, TreeTL will check: does this job have any children? Might those children want the data? If so, cache it. Spark has a notion of lazy evaluation, so caching, in this context, means save what you've done so far and force the evaluation when needed. Then pass it along. For those looking for a web app, a GUI, or a job scheduler (run every day at noon, cron stuff), that's not what this is. This is just about passing the data around. Will that be added? No, there's enough of that out there.

So, the actual learn-by-doing. These jobs inherit from the Job class in TreeTL; there wasn't enough room on screen to put the import statement, you can trust me on that. Picture one loader as a job, and then two children that want that data after it's done. In the classical run-each-pipeline-on-its-own setup, you might have one pipeline that says load data, transform data, save it, and you have a bunch of these going at once. But really it's the same E at the core. So load the data, save it, make it fast. That's only a couple of lines; notice PySpark's lovely syntax again. If you want to load a huge JSON source, just read JSON. It really can't be any easier.

If we want to write a dependent job, we label it as such: this one is dependent upon get_some_data, the original extractor job. You've defined the parameter that its transform method is going to receive, and that's going to be the data from the prior job. You do some stuff with it, who knows what. And then you load it. I snuck in a little more PySpark there for another taste of how easy it is: you have your data partitioned by whatever, because you want to partition data, you want to be efficient about it, and you save it as a type, in this case parquet. Couldn't be any easier. Something like the sketch below.
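Since the slides aren't reproduced here, this is a hypothetical sketch of the shape just described, not TreeTL's actual API. The Job base class, the depends_on tagging, and all paths and column names are invented stand-ins:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

class Job:
    """Invented stand-in for TreeTL's Job base class."""
    def extract(self):
        return None
    def transform(self, *parent_outputs):
        return parent_outputs[0] if parent_outputs else None
    def load(self, data):
        pass

class GetSomeData(Job):
    """The one E at the core: load the huge JSON source once."""
    def extract(self):
        return spark.read.json("s3://my-bucket/raw/events/")  # invented path

class AppointmentsJob(Job):
    """Labeled as dependent on GetSomeData; transform receives that job's data."""
    depends_on = [GetSomeData]  # stand-in for TreeTL's dependency tagging

    def transform(self, events):
        # do some stuff with the parent's DataFrame, who knows what
        return events.filter(events.event_type == "appointment")

    def load(self, data):
        # partitioned for efficiency, saved as parquet
        # (assumes year/month/day columns exist on the data)
        data.write.partitionBy("year", "month", "day").parquet(
            "s3://my-bucket/warehouse/appointments/")
```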
And if we have another job: I said there's this little diamond structure here, so picture a second child that's also going to receive the same data. Now get_some_data has two children. TreeTL will see that it has more than one child, save the data, keep it in memory in the Spark cluster, and pass it along. Once those jobs are complete, the original E will unpersist. And lastly we have a final job that takes the output, the transformed data from the two second-step children, and merges them together. So at each step, TreeTL detects that a job has a child, caches the data, and passes it along. The last job does its thing, and the middle-step jobs unpersist. Each job can extract its own data, receive data from a parent, load its own data, and pass its transformed stuff along. So we have this communication between jobs, just to keep data in memory.

It seems like a lot of tedious bookkeeping, but if you launch a cluster, you need to get the data from somewhere. And if it's a lot of data (we're planning for 100x, remember), you don't want to run up your network costs or anything like that. It's just inefficient. If you're constantly dealing with I/O, you're missing the whole point of using Spark. The whole point is to keep it in memory. Load it once, share it around, and then get rid of all of it.

And when you finally want to run it, there's a job runner in TreeTL. Give it some jobs and let it figure things out. You've tagged the dependencies with the decorators, so it'll organize everything, and in between jobs it will cache the data as needed. The package doesn't do much, but it does that. And I'm willing to check off reproducibility and reusability there.

But there was a lot of setup there, and I had a lot of comments saying there's more stuff to be done, so it doesn't feel like you save all that much code. Consider, though: you're only loading from a few different places, and it's not as if there are no data idioms. You get data every hour or every day or something like that. Well, all these E, T, and L components are composable. You can have a generic job that says get data from some source you're going to dependency-inject, following this year/month/day/hour pattern in the file structure, and then mix that into whatever job you want. Ditto with loading. A lot of my data is about searches per day, appointments per day, things like that. So when I extract messy data and transform it, I save it somewhere for ready access and analytics, usually partitioned by a year/month/day construct. It's consistent, so we can mix that in too. In the end, all you really ever have to write is the transform method. The whole diamond, end to end, looks something like the sketch below.
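Continuing the hypothetical sketch from above (again, the class names and the runner call are invented stand-ins, not TreeTL's documented API):

```python
class SearchesJob(Job):
    """Second child of GetSomeData: the diamond's other branch."""
    depends_on = [GetSomeData]

    def transform(self, events):
        return events.filter(events.event_type == "search")

class MergeJob(Job):
    """Final step: merges the transformed output of the two middle jobs."""
    depends_on = [AppointmentsJob, SearchesJob]

    def transform(self, appointments, searches):
        return appointments.join(searches, on="day")  # invented join key

# Invented runner call: order the jobs from the tagged dependencies,
# cache GetSomeData's output while both children still need it,
# and unpersist it once they're done.
run_jobs([GetSomeData, AppointmentsJob, SearchesJob, MergeJob])
```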
All right, a few guidelines. This first one is completely unobjectionable: jobs should always be rerunnable. Tested and rerunnable, get it? There are always issues, there are always bugs. Everyone likes to say that they write and test their code thoroughly, but let's be honest, we've all put horrible bugs out there before. It's inevitable. And sometimes things just crash, who knows? But you need to be able to rerun your jobs. Say you had an error from five months ago during one specific week. That's not an arbitrary timeframe; it happened to me just a couple of weeks ago while I was putting these slides together. The data is bad for that one week, so I need to be able to rerun the full tree for that timeframe, no problem. If you've written any ETL before, you know: rerunnable. Cool. The rest are more PySpark-specific considerations, because that one is just generic.

So, Spark's preferred, optimized file format is called parquet. It's super easy and it's super fast. If you want to save something as parquet, you just write .write.parquet. If our goal is always easy, let's do something like that. Also, if you're using PySpark for ETL, you're probably doing analysis in it too, and it has a nice machine learning library. It's great. Spark reads parquet very quickly; it's a nice, descriptive, columnar storage format.

And, burying the lede a little bit: always use DataFrames when you can. Spark has two primary data abstractions, the RDD and the DataFrame. The RDD came first; it had no schema information associated with it, but it was flexible, made sense. Then they put a higher-level abstraction on top that carries a little schema information and comes with a huge performance boost. You always want the performance boost. Give it a little schema, it's not too much to ask.

Partition your data. Again, common sense, something databases have been doing since there have been databases. Just because we're moving to data lakes and event frameworks and things like that, just because we're schema-less or whatever, doesn't mean we can't partition our data. I mentioned that a lot of my stuff is partitioned by day. So: dataframe, write, partitionBy, save as parquet. (I forgot the .write on the slide, so that's actually a bug. But again, easy.)

The whole point of the TreeTL organization, the whole point of sharing memory: caching intermediate results in Spark is very important. With lazy evaluation, you construct this huge chain of commands that Spark might run, then you call an action on it and Spark goes through that big chain. But if you call another action on it later, it's going to go through the big chain again. That doesn't make any sense. If we have a break point where we're going to use a DataFrame multiple times, cache it, save it. Memory efficiency is why we're doing this. Between jobs, TreeTL does this; within jobs, go nuts, it's up to you. Flip side: "uncache" isn't a word, but do that. Again, it's easy. Memory is what feeds Spark; it's what makes it work. So if you're not using something anymore, delete it. Make room for lots more data. Again, TreeTL does this between jobs; within a job, go nuts. Be efficient, and that's it. Concretely, it looks like the snippets below.
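Those last three guidelines in plain PySpark. These are the real DataFrame calls; only the paths and column names are invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/warehouse/appointments/")  # invented path

# Partition your data and save it as parquet (with the .write this time):
df.write.partitionBy("year", "month", "day").parquet(
    "s3://my-bucket/warehouse/appointments_by_day/")

# Cache at a break point you'll reuse, so lazy evaluation doesn't
# replay the whole chain of commands for every action:
booked = df.filter(df.status == "booked").cache()  # invented column
booked.count()                                 # first action materializes the cache
booked.groupBy("provider_id").count().show()   # reuses the cached data

# "Uncache" isn't a word, but unpersist is: free the memory when you're done.
booked.unpersist()
```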
And if you work at ZocDoc, you'll find out why we end our slides with cows. Ten minutes for questions.

Separate things: this is for batch jobs. A lot of my inputs are for offline data mining or offline model building, stuff like that, so the cost of waiting until tomorrow and having it run overnight is zero for me. Streaming is kind of its own animal. In that case you don't have to worry so much about these dependencies, because you've already deployed something on the same cluster and it's going through the same pipeline. This is more for, like, the full production scale: a job scheduler like Luigi organizes everything, a bunch of different work streams that don't need to talk to each other or anything like that. This is within one of those batch jobs. Luigi has nice support for running a Spark job as one of its steps; well, within that Spark job, be smart.

Yeah, I believe Jupyter does have integration; you can hook it up to a cluster now. Something like Databricks has the full management of the cluster built into it.

I don't know about cores within a given machine, but I don't know if that's the primary goal for optimization if you're using Spark. Right, the point is you have something in memory and you want to spread it out horizontally. You're not trying to squeeze every bit out of the individual machine. Yeah.

TreeTL is as small and simple as possible. I have it for a very limited number of use cases. It is not a platform framework or anything. Well, it's all offline stuff, so there's no risk to it, yeah.

Yeah. How does TreeTL manage memory versus how PySpark manages memory? How do those work together? Oh, it's not managing memory bytes itself. It's making sure that in between jobs... like, if a job runs in isolation, it has no need to save data for later; the job is done. So it's leveraging PySpark's caching system, just invoking it when the data is needed by other jobs. If you were to replicate that functionality natively in PySpark, you would essentially have one big job. This is so you can write individual little component jobs, plug them onto the end of the list, and TreeTL will say, oh, there's a new dependency here, keep data around for this guy also. So it's about decomposing.

I do sound like a Databricks rep at this point, but they have a free community edition. I would suggest using that, because you don't have to pay for the resources of the machines you hook up. You get the notebook, there's no DevOps consideration, and it's free.

Yeah, because it has a cache method that's essentially called for you; how you inherit and override that functionality is irrelevant. It's more about recognizing that this is a point at which we need to keep data on hand and pass it along to someone else. The way that I use it, my override method is always the same; I basically just say transformed-data-dot-cache, or whatever's associated with it. So yeah, you could. I haven't, though.

Pardon? Yeah, because it's only passing data to the job wrapper, essentially. There's a Job class, right? And it has these components, E, T, and L, and whatever data is being passed along to the next job is just an input variable to the transform method. If under the hood you're then doing some other stuff and dropping it in Cassandra, that doesn't matter, because you're the one implementing that transform.

Five minutes? Not yet, but I will. GitHub sounds like a good place for it. Also, I believe that PyGotham puts all the videos and notes up. Am I right about this? Yeah.

So in your work, do you use Spark-specific functionality? Do you really need to use Spark with TreeTL? No. I'm saying that, for the purposes of our various goals, I find it to be the easiest, no-brainer way to achieve scalability for batch jobs and stuff like that. There are a million ways you can do it; this is just my personal preference. I happen to like Spark a lot. Basically, any system where you can say "oh, I have more data; don't change a thing, just add another machine" gets a thumbs up.

It could be inside a Luigi job. Picture a bunch of Luigi jobs in a sequence, right? A number of them are Spark jobs. If you find that any of them load or share data, then maybe just put them together. You can still write them as separate, nicely segregated little objects.
With TreeTL, just put them all together and say this is one job; keep it in memory. The point is, do anything that avoids duplicate extracting and loading and whatnot. I don't care if you use TreeTL; just, in general, don't duplicate that work. I guess that's about it. Thank you.