started. Okay. Okay. We're live. All right. Hello, everyone. Welcome to the Wikimedia Tech Talk series. Today we have Dan Andreescu, a senior software engineer at the Foundation in the Analytics department, telling us a little bit about what's going on in Analytics. What is it that they do there? I'm really curious personally about the work they do, and I'm excited to hear him tell us a little bit about it. If you have any questions, please use IRC, the #wikimedia-office channel, and direct your questions at me; I'll make sure to pass them along to Dan. And a quick early announcement: next month we're going to have Alex from SRE, who will talk about the deployment pipeline. All right, I'm going to start this, and Dan, you're good to go. I was just launching it and my computer hangs for a second while I do. Everything good? Clear? You can all see the slides? Yes, we can see the slides. Very good. Cool. So welcome. Whether you know a lot about how data flows through our system or are completely confused about what streams and lakes and clouds have to do with anything, you have come to the right place. I'm going to take you behind the scenes and show you how we see our data. I want to have some fun, and basically, yeah, let's look at what it is that we're doing back here. So mostly, we maintain about 111 hosts, and these have access to 2.4 petabytes of distributed storage. Together, that's about a tenth of WMF's resources; I think WMF has something around 1,300 hosts and 15 petabytes. We're a team of six engineers and a manager. Two of us are site reliability engineers. These folks troubleshoot everything that goes wrong with distributed systems: hardware failures, security upgrades, rolling reboots. They're amazing at reading logs, like the best. One of us dreams in Scala and does massive distributed joins on skewed data for breakfast.
I'm going to show you some of that later. And three of us are getting really good at reading logs, because that's basically all we do. The three of us have more of a front-end development background, and we work on some of the UI tools. But a lot of the layers in between have front-ends too, like there's five layers of front-ends, so we got lost in there somewhere, yeah. And our wonderful manager keeps it all together, while also, in her spare time, hacking on bot detection and differential privacy algorithms, which is great. So on top of all the infrastructure, we build and run systems. There are five different kinds of systems; I've got them cataloged here. We handle lots of data coming in; we've got Kafka, Hadoop, et cetera for that. We aggregate and sanitize it, with a bunch of custom code. We serve slow answers to really big questions with Hive and Spark. We give fast answers to questions that are asked frequently with our fast query systems. And we make pretty dashboards and graphs and things like that. Most of these systems are out of the box as much as we can manage, and we integrate them with custom code. I'm going to show you a high-level view of that custom code; it's everything from elegant Python scripts to thousands of lines of complicated distributed Scala programming. We could go into detail on much of this, and there are talks that do; some of these slides are pictures from other slides. We're happy to do that at some point in the future. But for this talk I wanted to go up a level and look at how the data flows at a high level. So I like this kind of corny analogy about water, because I like water, as you know if you've seen me running through the rain. Streams of data come in, just like streams. They get filtered through algorithms that get rid of data we don't need and clean the data that we do need, and aggregated, kind of like a wetland does.
Now, this part of our system is called refinery, because my team refused to let me rename it to wetland, but you know, it's okay, you win some. And finally, the data gets stored in this big, peaceful data repository we call the Data Lake, where you can have a nice relaxing time floating around fishing for the data that you need. Some of that data gets published for everybody in the world to work with, and you can think of that as evaporating to the cloud. And there you go, I've tied a pretty bow around our whole world here. Yeah, so what's all this water? You're here to learn about data, not water, right? So let's look at what our platform does. Basically, we're trying to build a platform that lets people ask questions and get good answers to those questions. And that means the most important part of this whole platform is the question: what is your question? We all work together, WMF and the movement, to ask good questions. The strategy processes and so on are results of us asking questions. So the C-team, the researchers, product owners, data analysts, legal, really everybody gets together and tries to find out what we need to know to evolve our movement. Once we have that, we can instrument, we can add measurements. We call that instrumentation, like instrument panels on planes. For example, the Analytics team maintains code that instruments our Varnish front ends, and we'll be moving that to Apache Traffic Server when the Traffic team migrates the cache. But most of the instrumentation is done by the teams developing user-facing products. All of you working on VisualEditor and putting logging code in there, that's most of the instrumentation, most of the logic that does that. And right now we're in a WMF-wide effort to help improve instrumentation APIs and build more standardized instrumentation libraries.
So this all can be easier and we can just get data flowing in. When we have data, it flows in through Kafka, which is a huge distributed streaming platform. We have a variety of different clusters. Two of them are maintained by the SRE team with a little help from us, used for things like the MediaWiki job queue. Those Kafka clusters may be useful in the future for updating, for example, the complex dependency graph: when someone edits a template, notifying all the places that use that template that they need to update is going to have to happen via a stream of events, and Kafka can help with that. The last cluster, a big one, is used for analytics data; that's where we're getting all of our data. There, as part of the Modern Event Platform effort, we're trying to standardize how the data comes in, so that it all has standard schemas that evolve in a dependable, stable way, so that whenever someone comes in and looks at any topic of data there, they know what to expect. Once that data is coming in, we have to handle it. So we bucket it, we aggregate it, we clean it, we do all those things the wetland was doing, and this is where we have most of our custom code. Until now it was a lot of custom infrastructure; this is where we have a lot of custom code. This system is called refinery. It uses a job scheduler to react to different data: when data comes in, it sends a notice that it's ready, like one hour of data is ready, and we operate on that hour, we aggregate and sanitize it. This is the place where, for example, we parse geolocation and user agent information, reconstruct and reshape faulty data, and much more. And we can talk about that more. We want to think long-term about what this data is and what it's going to do once it lands in a storage system. So one of the first things we think about is long-term retention.
We decide whether to keep the data, what to keep, and how to keep it. We can keep longitudinal data long-term, we can delete after 31 days, et cetera. We can sanitize by stripping certain fields or making them less granular. For example, the user agent: we don't store that raw long-term, we get rid of it after 30 days. We parse it to determine operating system and browser information, and we keep that around for longer, 90 days. Then we delete that as well, aggregate the data, and store the aggregates long-term. We're also investigating differential privacy and other privacy-by-design techniques. Marcel on our team gave a good talk about this at Strata; there'll be links to his slides in mine as well. Now, once we have the data cleaned up and aggregated in a way that we can use, we have to decide how and where to store it. What we use to decide is how it's going to be used. Take huge data that's queried as a whole. For example: how many editors edit articles that get fewer than 10 page views per day, over the past three years? Processing an answer to that question means going over about 14 terabytes of data. We replicate that three-fold, so it's not going to fit on any one machine, right? To query that, we have the Hadoop cluster, which has a distributed file system called the Hadoop Distributed File System, HDFS. We put that data up there, it replicates it three-fold, and so now you have 50-some nodes that have copies of this data all over the place. When you launch a job, a resource manager tries to put your logic as close as possible to the data you're trying to query. So that's HDFS, and that's why we store that kind of data there.
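The retention tiers described above (raw data deleted after about a month, refined user-agent data after 90 days, aggregates kept long-term) can be sketched as a tiny policy check. This is purely illustrative; the tier names and function are made up and are not the refinery's actual code:

```python
from datetime import datetime, timedelta

# Hypothetical retention windows modeled on the policy described in the talk:
# raw 31 days, refined 90 days, aggregates kept indefinitely.
RETENTION = {
    "raw": timedelta(days=31),
    "refined": timedelta(days=90),
    "aggregate": None,  # kept long-term
}

def should_purge(tier, created, now):
    """Return True if a partition in `tier` created at `created` is past its window."""
    window = RETENTION[tier]
    if window is None:  # aggregates are never purged by age
        return False
    return now - created > window
```

A real purge job would walk HDFS partitions and delete the ones this predicate flags, but the decision logic is just this age comparison per tier.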
If the data is not too complicated, like a classic SQL table that doesn't have too many columns but has a lot of rows, billions and billions of rows, and it needs to be queried very fast, then we put it in Cassandra, and Cassandra can serve queries to that data. It doesn't do any aggregation or computation on top of it, but it can serve the data very, very quickly; basically, it's a big indexing system. This, for example, hosts the data that serves the Pageview API, and we're going to go over the Pageview API as an example of this whole flow in a few minutes. If we have not super huge, but pretty huge data that's more complicated, so it has many columns, but none of those columns have too many unique values (a column with all of the user IDs or all of the article titles would be too complicated), then we put it into Druid, which is a place where you can slice and dice really big data, terabytes and terabytes, but not super unique data. There you can slice and dice by different dimensions. You can hold a property constant, like, say, the browser, and look at how different metrics fluctuate over the other properties of the data set: for a specific browser, from what countries, on the mobile versus desktop sites, where is the data being accessed? Druid does this with a clever index that's been covered in other talks. Finally, we have to decide how to present this data. Our audience is pretty varied. It includes researchers, both internal and external, people downloading the data and working on public data, product owners in the Foundation, the Communications team, data analysts, engineers, and community members. A few examples: analysts will log into the cluster via Jupyter notebooks, which can use Python to run Spark or Hive and query data on HDFS. They have full access to basically the big, slow engines there.
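To make the slice-and-dice idea concrete, here's a toy sketch in plain Python of the kind of query Druid answers: hold one dimension constant and break a metric down by another. The field names and numbers are invented for illustration; Druid does this over terabytes with its index, not with a loop:

```python
from collections import Counter

# Made-up traffic rows; real Druid data sets have the same shape at huge scale.
events = [
    {"browser": "Firefox", "country": "DE", "site": "mobile",  "hits": 4},
    {"browser": "Firefox", "country": "FR", "site": "desktop", "hits": 2},
    {"browser": "Chrome",  "country": "DE", "site": "desktop", "hits": 7},
    {"browser": "Firefox", "country": "DE", "site": "desktop", "hits": 1},
]

def slice_by(rows, fixed, breakdown):
    """Filter rows matching the `fixed` dimensions, then sum hits per `breakdown` value."""
    out = Counter()
    for r in rows:
        if all(r[k] == v for k, v in fixed.items()):
            out[r[breakdown]] += r["hits"]
    return out

# Hold the browser constant, break traffic down by country.
by_country = slice_by(events, {"browser": "Firefox"}, "country")
```

Swapping the `fixed` and `breakdown` arguments is exactly the interactive pivoting you do in Turnilo's UI.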
Community members will look at Wikistats, which is a pretty UI for top-line metrics. Some community members download huge dumps of analytics data and build tools and workflows on top of it, while engineers like Erik Bernhardson on the Search team will use pageview data to train models that improve search results. We support different ways of interfacing with the data, and just like with our infrastructure, where we try to use out-of-the-box tools, we do the same here: Turnilo, Superset. All of these come out of the box with lots of value, and we try to contribute to those open-source projects. But at the end of the day, we do have to develop custom code. I'm going to show you real quick a few examples of these interfaces and how you can look at this data. The most general-purpose dashboarding tool we have is called Superset. It can access basically everything on our Hadoop system, all the data sets there. This is showing geo-editors, the data set that looks at what countries people are editing different projects from. It's showing some of the top wikis, English Wikipedia, Commons, and Wikidata, and how many editors are coming to those projects, as well as what countries they're coming from. Basically, if you want a custom dashboard that gives you something you want to look at regularly, this is the tool to use. Turnilo is a place where you go and explore, so you slice and dice data. This, for example, shows traffic for the last 30 days broken down by the browser being used. If you're trying to find a particular pattern in Firefox or something like that, you come here and you can break it down. You can break it down further by versions and things like that. You can see the dimensions on the left side that you can split and filter on. And we have this little tiny tool called custom dashboards. This is very useful for canned data sets that provide a lot of value.
For example, we released operating system breakdowns and browser breakdowns, and these are widely used over the internet: we're a big site, so people want to know roughly what's the most popular browser on Wikipedia and so on, and this data set helps them do that. This is the interface I was talking about before, the custom notebooks you can see here. Basically, you're doing very, very simple Python, literally putting a SQL statement in, calling Spark SQL on it, transforming the result to Pandas, and doing all the data science you want from there. This is running in your browser through an SSH tunnel and doing the work on the cluster, so you have the full power of the 50-some-node cluster right here, and you can do anything you want. And finally, this is our community-facing interface, Wikistats. We built this so folks can celebrate each other's amazing work. Up-and-coming projects can find rises in active editors and registered users; they can run campaigns and try to see if it moves the needle, things like that. So this is really focused on the community and top-level metrics, the stuff that the media cares about and that our Communications team wants to share, all in a nice, shareable format. Okay, so that's an overview of our pipeline, and I can walk you through a little example. I'll talk about page views to give you context; there are many other data sets that flow through our system, and you can find them all in our docs or ask any of us, but it helps to give one concrete example. So here's the question: how many times does content on our sites get viewed by, hopefully, human users? It turns out it's pretty hard to determine who's human on the internet these days. Surprise. So we instrumented that by going to our front-end caches.
We have a tool called varnishkafka, and it sends an event to Kafka every time Varnish services a web request, and the data flows into our Kafka platform. We get about 100,000 requests per second, something like that. We get more from other custom events, but that's just page views. Sorry, that's just web requests. Web requests are the raw, unrefined data; those become page views later, as I'll explain. We bucket them by hour in HDFS, so HDFS now has, for each hour, the web requests that the Varnish layer saw. Then we look at that and do a bunch of things. First, we check for gaps: there could be hardware failures and data loss and things like that, so we check for gaps and alarm if there are any. Most of our systems have data quality alarms that say, hey, this job didn't do its thing, and we look into it, et cetera. We also clean the data. We take out raw information like the IP address, because we don't really need the IP and we don't want to keep it; we keep it for operational purposes but delete it after 31 days. What we keep is the location, the geolocation of the IP; that's useful for analytics. And we parse the user agent, like I was talking about before. So we generally make the data easier to use, then aggregate it hourly by page title, because we don't care too much about each request separately. We care more about what articles are being accessed, by what user agents, and so on. Aggregating reduces the size, allows us to store the data long-term free of any personally identifiable information, lets us query it faster, and lets us move it to the other parts of our flow where it can be used.
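A toy sketch of that refinement step, in plain Python rather than the actual Scala/Hive jobs. The field names are illustrative, not the real webrequest schema: raw requests are bucketed by hour and aggregated by page title and agent type, and the IP simply never gets copied into the output:

```python
from collections import Counter

# Invented raw web request events standing in for what varnishkafka emits.
raw_requests = [
    {"ts": "2019-06-01T14:05", "ip": "203.0.113.7",  "page_title": "Cat", "agent_type": "user"},
    {"ts": "2019-06-01T14:09", "ip": "203.0.113.9",  "page_title": "Cat", "agent_type": "user"},
    {"ts": "2019-06-01T14:12", "ip": "198.51.100.2", "page_title": "Dog", "agent_type": "spider"},
]

def aggregate_hourly(requests):
    """Count requests per (hour, page_title, agent_type); the IP is dropped here."""
    counts = Counter()
    for r in requests:
        hour = r["ts"][:13]  # truncate the timestamp to its hour bucket
        counts[(hour, r["page_title"], r["agent_type"])] += 1
    return counts

hourly = aggregate_hourly(raw_requests)
```

The aggregated output is both smaller and free of per-request identifiers, which is what makes long-term storage acceptable.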
As for the decision of what to keep and what to sanitize, we basically evaluate privacy risks: whether you can de-anonymize a user, whether you can reconstruct browsing sessions from the data that we have. Then we take action: we delete raw data after 31 days and refined data after 90 days, and we're investigating other privacy techniques that we can use to make the data safer, because all we really care about is getting our answers. We don't care about keeping data around for the sake of data. For page views specifically, the data is of course stored in HDFS and Druid in different forms and shown to different people, but I wanted to talk about the Pageview API specifically, because it's relatively simple data. There aren't a lot of dimensions: there's whether it was accessed on the mobile site, and whether the user agent was a spider or probably a human. We have few columns and lots and lots of rows, so lots of data. So Cassandra is the right place; we store it there, and it allows us to query it very, very fast. We built a very simple Node.js service, based on the service template, served through RESTBase, and that's basically the API that lets people access the data in Cassandra. That's the Pageview API, and it's used by a whole set of tools: internally, to combine page view data with other information; externally, the info pages of articles show how many page views each page had, and you can look at project-level page views and things like that. It's also used in other interfaces that we have, like Wikistats and Dashiki and other places.
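As a usage example, the public Pageview API serves this Cassandra-backed data over REST. This sketch just builds a per-article request URL; the path layout matches the public endpoint as I understand it, but check the API documentation before relying on it:

```python
# Base path of the public Pageview API (part of the Wikimedia REST API).
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"

def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL; dates are YYYYMMDD strings."""
    return f"{BASE}/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}"

# Daily views of the (hypothetical choice of) article "Data_lake" for January 2019.
url = per_article_url("en.wikipedia", "Data_lake", "20190101", "20190131")
```

Fetching that URL (for example with `urllib.request`) returns JSON with one item per day; the article title here is just an arbitrary example.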
So that's pretty much the page view example. I just wanted to point out one other small thing: there's a tool called Matomo. If you have a smaller site that gets less traffic, like, for example, the Wikipedia 15th birthday website that we put up, you just get a little snippet of JavaScript, stick it on there, and you get some automated metrics without a lot of work, unique visitors and things like that. So that's our system overall. I wanted to highlight a few things to give you context. If you think in terms of infrastructure, Puppet code, servers, things that are hard for our two SRE engineers to work on, these are the areas where they're doing the most work. Hadoop and HDFS is just huge, it has a lot of moving parts, and Kafka as well, so that's a lot of custom Puppet code. Cassandra is up there because it turns out Cassandra is very hard to operationalize and maintain; it's complicated, and you have to know exactly how it serves data quickly in order to tune it properly. And Kerberos is another thing we're working on right now, to harden the security on our cluster, so that's involving a lot of work. That's roughly where the SRE work is going. As for the application logic that we write, I don't think we have more than 100,000 lines of custom code, but it's somewhere close to that. So it's not a huge amount, it's mostly tuning things out of the box, but it's not a small code base either. Some of it is data transformation. We're processing everything from the data sets flowing in to the data sets we import from MediaWiki: for example, we Sqoop all the data from all the MediaWiki databases, aggregate it, clean it up, and present it in a way that's easy to query and ask questions about editorship and retention and things like that. All the data transformation I mentioned is happening in a big custom code base, and some of that is pretty tricky, but I
want to show you one example at the end of the talk that gives you an idea of distributed programming and why it's a little weirder than normal database programming. Besides that, we have a lot of Python utilities that glue everything together and call out to different APIs to delete and sanitize data, all that stuff. And our front-end turns out to be pretty complex: the UI has a lot of custom code to make things look pretty and feel good for the user, so that's another bigger code base. One of the last things I want to mention about the overall flow is tech that you should keep in mind as you think about how you could benefit from the analytics infrastructure, systems, and experience that we have. I was just talking to Subbu, who's working on porting the JavaScript parser to PHP, and part of that work is testing the results of both parsers to make sure that they match. That could involve a lot of compute, a lot of diffs being computed, and while that's not in any way analytics related, the compute capacity, the cluster, could help that problem very much. Keep in mind the data flowing in through Kafka if you want to react to things in real time or send notifications for different reasons. Think about how we compute and aggregate data. Think about the notebooks if you want to come to the cluster, run a big job, and share the analysis with other people in a friendly Python set of commands. And if you want to publish data sets: for example, we're working with Wikimedia Deutschland and other teams internally who want to publish Commons media data, how many times images and videos are being accessed. We don't have very good instrumentation on that, but we do have some data that we're going to put in an API, and we'll evolve the quality of the data over time. If you have data sets like that, that could be useful to you and others, you can think about that. I think for me that's pretty much it, as I'm
going to declare victory for myself for having gone through and explained the overall picture. Everything that flows through here goes roughly through that. There's a lot of nuance and a lot of stuff I skipped in the interest of brevity, but if you have any questions so far, I'm happy to take them. After this I have an example problem that we solve in distributed systems, so I can either take questions now about the overall picture, or go through the example and take questions at the end. I'll wait for Subbu to make the call on that. I think the stream is at least 30 seconds behind, but I did have one question, I think. So the picture was pretty impressive, you have a lot of systems in there, and you mentioned how everything starts with a question. So I was curious how this whole workflow works. Suppose somebody has a question: how much of it is custom code? How does it work? Is somebody sitting with analytics people trying to figure out how to piece it all together, or is there some kind of language that people write code in? I don't quite follow that. That's a really good question, and I completely forgot to talk about how we interface with people, and those are the most important part, who this infrastructure serves, so thank you for that. Basically, when you're formulating that question, that's, like I said, a teamwork effort. Instrumenting is usually done on your team: if you have a product, you're usually instrumenting it yourself, because you know your code base and all that. We help you define the schema, and we're working on making that process tighter, along with a whole bunch of other people, to make sure the schema makes sense and conforms to all the different policies we have. But once you have instrumentation and a schema picked out, the data starts flowing in, and for the most part you can hand it off to us to take in and process in a way that makes sense.
We have standard ways of processing this data. We decide together how you're going to store it, how you're going to access it, what you're going to do with it. If the raw data is not very useful, we can define data sets built on top of it that are useful, or we can load it pretty much as is, with simple modifications, into Druid. We have, like, a direct-to-Druid path: one of the most common kinds of instrumentation is EventLogging on your feature, and that data can go straight into Druid, updated hourly, and you can query and slice and dice right there from Turnilo, the interface to Druid. So that's all pretty much out of the box; you don't really have to talk to us at all. If you have data sets that need more custom attention, then we can help you to a certain extent; we have a priority list, and you have to get in line for us to help you shape that stuff and load it on whichever platform makes sense for you to query. Yeah, does that help answer it, or do you have... Yeah, that helps answer the question. I have more, but I can follow up later. Andre had a question on IRC. Yeah. He's asking: are there plans to update, redirect, or do something about stats.wikimedia.org, which looks very static and makes it hard to find things? Yeah, so this tool, this is Wikistats 2; if you go to stats.wikimedia.org/v2 you'll see it. We're being very cautious about the quality of it. It's definitely totally okay to browse, but we're being very careful that it meets all the community expectations built up with the old tool. The old tool might look static, but it's actually a collection of brilliant work by Erik Zachte that a lot of people find extremely useful, and they've gotten used to it over the past 12 years that it's been up. So we're displacing that very carefully, and right now this Wikistats 2 project is in alpha mode. It's going to hit beta sometime in the next
few weeks, probably, and after that we have a bunch of other changes that we want to do over the next year. But I think when it goes to beta, we're just going to redirect automatically. Input about that is welcome, of course; everybody here, I think, uses Wikistats to some extent, so it's your tool. Let us know if you think it's ready to go to this new version, things like that; we find that feedback useful. Okay, another question? Another question? Okay, great, let's go and take a look at this little example problem, and then I can answer any other questions at the end. So I chose this because I ran into it, and Joseph, who's the Scala distributed-systems expert on our team, explained to me what the problem was. It was very frustrating for me, because I'm coming from a more traditional database background, and I was so confused about why this particular problem happened. Here, I'll set up the context. We import a bunch of MediaWiki data; let's focus on just the revision table and the comment table. Recently, comments have been factored out of revision, and they're now linked by an ID, so we have to join revision and comment together to get those comments, because some of them are useful for the reconstruction we do. We collect all the data together: when we import, we don't keep each database separate, we just put it all in one place and add a column called wiki_db. That helps us analyze everything together, it helps us do cross-wiki joins and things like that, and it's a lot more helpful for analytics purposes. So when we join those two tables, we're joining all of the revisions from everywhere with all of the comments from everywhere, and the problem is basically that the data is skewed. A lot of revisions have the empty comment, for example. Let's pretend the first couple of revisions have comment IDs A and B, and revisions 3 through 10 billion have comment ID X, which happens to be the empty string. So when you join this
table, what happens? Does it just take a long time, or does it fail? It just completely burns to the ground in spectacular fashion, even though we have a big cluster and plenty of capacity to process all the data. What ends up happening is that you can't parallelize this particular join, and I'll explain why. The data is skewed: we have many billions of revisions on one side trying to join to one record on the other side. When we join in a distributed system, we use a technique called MapReduce. We break the problem down into two steps. In the first step, we have a bunch of mappers running across the big distributed cluster, going to the data and saying, I want this part of the data. That's where the filtering happens, just like you would call map in JavaScript: you filter what you need and send it to another worker called the reducer. The reducer then does the aggregation-type work, like calling reduce in JavaScript: it takes multiple streams from different mappers, does some work on them, and outputs; then you combine the output from a bunch of different reducers, and that's your final output. So that's roughly what it looks like. You have a bunch of data distributed all over the place, and you have mappers that know how to fetch it. They go in and find rev_comment_id X on one side and comment_id X on the other, and they know this is going to be a join, so they hash the key by the number of reducers and send each row to the reducer the hash maps to. You're trying to distribute a big data set of millions of records onto N reducers, so you hash mod N, say, so each row lands on one of those N reducers. And the problem you run into with skewed data, when you have lots and lots of repeated values, is that all of the comment ID X rows will hash to the same reducer, and
that reducer just burns to the ground. If you had a million records, or ten million, it would be fine. The problem is it has so many records that it can't even swap them out to disk as it processes them, because the operation is too slow. Think about it: if you have 1 billion records and each takes 10 milliseconds, that's about 4 months of computation. And suppose you thought, hey, this is kind of dumb, I'm just hashing the same value to the same reducer; why don't I set up a very fast lookup server somewhere, like a Redis server, where I keep a tally of how many times I hashed X to a particular number, and when it gets to a billion, or however many records one reducer can handle, it does a little collision detection and starts hashing to a different reducer? Well, that conversation over the network would take a few milliseconds, and if you add those milliseconds together, you again end up with months of lag. With data this size, you literally can't do it. So what do you do? How do you solve that problem? How do you de-concentrate that load? The solution is very ugly and not at all generic; there is no generic solution that I know of. But one way you can approach it is to add a fake ID to one side of the join: you randomly assign a number between 1 and K, where K is some constant that you magically came up with that works for your cluster size, and you do that on the side that has the many duplicate IDs. Then, on the side you're joining to, you duplicate those IDs with that new fake ID, so that there is a target for all the joins. So you assign random 1-through-K values for the fake ID over the 10 billion records on the left, the
And you multiply the comment_id rows by K on the comment side. Now when the join happens, all of the random one-through-K values from the revision side will find a join target on the comment side, and it will be the same value, because you're just duplicating the comment rows. When it hashes, instead of hashing just comment_id, it now hashes (fake_id=1, comment_id) through (fake_id=K, comment_id), and those are all different hash keys, so they get evenly distributed. Now you've managed to spread your work evenly over your reducer cluster, and you don't burn one of them to the ground anymore. So, yeah, any questions about that? Did I lose anyone? I don't know if I did a good job of explaining; when you've just learned something like that, you don't always explain it best. I'm happy to go over something again or fill in a gap. Can you explain why you picked this example, what you're trying to illustrate with it? Maybe that will help. Yeah, so this is something we do; we ran into this problem when this refactor happened. Let me back up. We put together a data set called the MediaWiki history data set. That's all the editing metadata from all the databases: looking at the logging table, finding where users changed their names, where articles changed titles, where users were blocked and unblocked, pages deleted and restored, revisions reverted, all of the different things that happen on the wikis. And we put them all in one table, literally one table with many columns that explain everything that happened to a revision and give context. For example, one row would have the number of edits the user making that action had done up to that point in their life; it would have the name of that user at that point in time and the name of that user today. That's kind of what the data set looks like. And we build this by doing a complex set of aggregations on top of the MediaWiki data.
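The fake-ID trick above is often called key salting. Here is a plain-Python sketch of it; the real job runs in Spark/Scala, and the field names and values here are illustrative, not the actual schema:

```python
import random

K = 4  # salt range; in practice tuned to the cluster size

# Skewed side (revisions): tag each row with a random salt in [0, K).
revisions = [{"rev_id": i, "comment_id": 7} for i in range(12)]
salted_revs = [dict(r, salt=random.randrange(K)) for r in revisions]

# Small side (comments): replicate each row once per salt value, so
# every possible (salt, comment_id) key has a join target.
comments = [{"comment_id": 7, "text": "edit summary"}]
salted_comments = [dict(c, salt=s) for c in comments for s in range(K)]

# Join on the composite key (salt, comment_id): K distinct keys now
# hash to up to K different reducers instead of all to one.
index = {(c["salt"], c["comment_id"]): c["text"] for c in salted_comments}
joined = [dict(r, text=index[(r["salt"], r["comment_id"])])
          for r in salted_revs]

assert len(joined) == len(revisions)  # every revision found its comment
```

The cost is deliberate: you duplicate the small side K times to buy an even shuffle on the big side.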
So when there's a refactor in the MediaWiki schema, that affects us pretty deeply, and this was one of those examples: all of a sudden we had to join to the comment table. Not so much now, but especially in the middle of that migration, we ended up seeing a lot of skewed data, because not all of the comments had come in yet; a lot of the comment IDs were null, not migrated yet, and we hit that problem. I wanted to illustrate the difference between working with big data, where things can have unexpected nuances, versus doing a join on a couple of big tables in SQL, where it might just take a long time but it's not going to run out of memory or crash some particular worker, because it's just churning through the data and spilling to disk when it needs to. So that's why I picked this example. That's helpful, thanks. Okay. We keep good docs on all of our systems, infrastructure, and procedures, and if we don't, keep us honest and let us know and we'll update them. Feel free to visit them; they're on Wikitech under Analytics. And yeah, thank you; ask me anything, look on IRC. Okay, this is the very beginning: is this recorded? I think so; this is recorded on screen. Oh, awesome, this is very helpful. And will you share the slides as well? Yeah. I noticed that on Commons there are usually PDFs, but I have the slides with the notes as well, and I'm happy to share that, because I think that's where most of my talking is, so I'll do both. Awesome, thank you. I'm going to further share this with other people in Product. That's awesome; I hope it helps. We usually talk about more specific systems or more specific algorithms, so I thought this would be an interesting overview. Yeah, seeing the high-level view and how it all works together is super helpful. I'll make sure
my team sees it, definitely, as well as other people. Okay. You know, if you have questions, ask. Go have the rest of your day, go have fun. We're going to meet the new CTO in a few minutes, right? Wait, what? There's a meet and greet with one of the CTO candidates. Sorry, not the new CTO, one of the CTO candidates. I hope I haven't, like, blown someone's mind there. Yeah. All right, thanks, Dan. Okay, thank you. Thank you all. Yeah, thank you.