We're from Chegg. I'm Chris; this is Girish, nice to meet you all. We're going to talk you through the great data migration. I had to come up with a semi-sexy title: advice on maintaining sanity when saying goodbye to your old Workspace. There's a pun in there somewhere. OK, so what we're going to talk through is basically our migration of our analytics stack, our data product, from Adobe Workspace to Amplitude. But it's as much about the tools as it is about the organization and the culture, and managing that migration through that process. So we'll talk you through some of the challenges that we had, the solutions that we put in place, some of the lessons learned, and some of the pain, a lot of the pain, that we went through. First, what is Chegg? Chegg is an ed tech company. We provide tools and content to help students learn, primarily college students. But how many of you have heard of Chegg? Awesome. OK, I don't need to explain too much about Chegg then. So yeah, we help students learn, and we have a platform to help those in the midst of their career upskill, cross-train, and achieve career success. Within Chegg, we run the data solutions organization. I'm a manager of data solutions; Girish is the chief technology data product manager. Our group, data solutions, enables scalable, data-driven decision making, basically by building a kick-ass data product. That data product encapsulates the capture of data, the systems that allow us to capture data, the systems, processes, and governance that allow us to move that data into sinks, and then the tools that allow access to that data. Primarily, that means we do things like instrumentation, writing instrumentation specs. It means we work with engineering to devise data capture strategies. It means that we stand up Amplitude and roll out Amplitude features.
It means that we get to come do events like this and meet with talented folks like yourselves, one of the perks of the job. OK, so our challenges back in 2021. If we rewind a year, we had three fundamental challenges on our data platform. One centered around data creation, two centered around consumption. On the data creation side, we had literally six-plus different areas where we would log data, six-plus logging technologies. We had Mixpanel for logging in apps. We had Adobe for client-side logging of web experiences. We had our legacy event architecture, our new event architecture. It was a mess, and what this ended up doing was incur a very high cost and cause a lot of confusion. We would end up duplicating logging in places, and we would end up having to educate new engineers on how to log things, because there were so many disparate places to log them. Even worse, this materialized itself in data consumption, because all of these different data sources would be materialized in a kind of spaghetti-like network of data sinks and pipelines. And users who would try to get access to that data, this is probably familiar to many of you out there, users who wanted access to the data would have to go and write these complicated SQL queries. And there would be this tribal knowledge of, like: oh, you want conversion data? Go to Adobe. You want session-level information as a source of truth? You have to use the legacy event architecture, because that's what we report out to earnings from and we can't be subject to data loss. All these hidden rules that were partially articulated in Confluence documents, and mostly articulated by finding the right person to tell you what to do and where to go. Yeah, and one thing to add: data discovery, data cataloging, and lineage were super hard, right? Because everybody is going through eight different systems and all of their data is mismatched.
We heard earlier today how different people would come and say, my data is of different quality, and consistency in data was a huge problem in this left section here. And then the second issue in our data product last year was an unimpactful UI. We had Adobe Analytics, and it worked very well for a subset of users who liked dragging and dropping in a two-dimensional interface. But the vast majority of people who needed self-service data, namely product managers and marketing, couldn't do it. They could only get the very basics. And that left them with basically one option: ask analytics. And what ended up happening then, as we described earlier, analytics is already burdened because the infrastructure is so complicated. So you get all these questions funneled to them, and they're hamstrung because the areas where they can query data are so complicated. Pretty bad data product. I reflected on this data product and drew out some of its characteristics. And this is as much to do with the experience; I think of the product as the experience. These are the characteristics of that experience and what the vision was to move to. In our old data product, people would typically share expert contacts. If you wanted to know where some data was located: hey, Girish knows it, Chris knows it, go to someone on data solutions. In the target state, by contrast, we want people to share information. If you want to know what the conversion rate was last year, you don't need to know to go to this person in order to get the right chart. You know how to get the right chart; that information, that link, is shared directly with you. The second characteristic, or challenge, of our old data product was what I call the Tower of Babel: basically multiple different people speaking different dialects around data, depending on their role, depending on their persona.
And it manifested itself in people having to translate for one another in order to get simple data requests fulfilled. Compare that with the target state, where everyone can speak the same language. The way that we describe an onsite experience is known, and the way that that's materialized into data is shared across the organization. Okay, I'm going to skip over some of these because I want to save time for the crux of the meeting. I did want to call out the two latter ones. Multiple paths, different answers. With the variety of different places you could hit, inevitably you can ask one question, have ten different ways to answer it, and they're all right. And I'd never explain this to a C-level exec, because it would be the wrong thing to do, right? But it's true: you can ask the same question, have ten different ways to answer it, and they can all be right, depending on the structures that are in place and the way that you frame that question and answer it, right? The SQL code that you implement, or the tools, the structuring of the data. What we want to do is put in guardrails. I don't think there's ever going to be a solution that completely prevents multiple different answers to the same question. But what we want to do is put in guardrails so that the same question can at least be put down the same path, and we understand how that question should be answered and how it should not be answered. One thing to add to what Chris said: at Chegg, a student can go from domain to domain to domain while they are clicking through and getting help. Experts were isolated to those individual domains. We'll talk more about how Amplitude helped us, but that was key for us: to see the journey of the user across domains, and have people go, ah, this is how the business works elsewhere, in areas I'm not part of. Worst of all, those four challenges above caused probably the worst challenge of all.
I'm a former data scientist, so maybe this is just me internalizing this and saying it's terrible, but because all of these self-service requests can't be fulfilled, these requests would get funneled down to analytics. And most of the requests were basically to count things, to count metrics so that they can be reported out. This positions analytics effectively as accountants. Nothing against accountants, they're amazing and needed for their job functions, but analytics can do so much more. These are brilliant minds for understanding the user journey. Our target state positions analytics as data scientists. Okay, so what was the approach? We aligned on three coordinated efforts. One was to unify the front-end instrumentation, so get all the logging down to a single standard. The second was to deliver an intuitive, impactful analytics UI. And the third was to drive trust in data. I'll talk you through some of what we went through across each of these. Unifying the front-end instrumentation. As I mentioned, we had six different areas where we logged things. We first created a unified standard. There was a single new logging stack; we call it Rio. Rio is Spanish for river, like a river of data, right? And we aligned the organization that all new product releases would have Rio instrumentation in place. Cool, that covered all the new features. However, all the existing experiences already had logging, primarily in Adobe through client-side logging. What we did then was recognize that there's no way the company's going to invest to re-log experiences that are already in maintenance mode. However, those experiences are very core to understand. So we built a listener.
We built a listener that listens to window.digitalData, that JavaScript object that sends data to Adobe, picks up changes in that object, converts those changes into Rio events, and sends them into the event lake. That covered basically the collection of raw data. Next, Adobe did some things for us that aren't available at the raw-data level, things like enrichment, marketing channel attribution, for instance. So we built out what we called an enrichment framework to append information to these events: things like marketing channel, subscription status, geo information, bot filtering. I'm missing a few on this, but. And this then gave us entire ownership of our data landscape, our event landscape, with the enrichments we need. And we were left with one last thing. This was great for all new data, but all the legacy data was still sitting in the old repos. So we did something that, in retrospect, tested our sanity, and I'm not sure I would make the same decision today. We backfilled all that historical data. We took those experiences and how they were logged, and if anyone's familiar with Adobe, right, you get them into your environment in what they call Adobe Clickstream files. We took those Adobe Clickstream files and we translated them into our Rio architecture. And then we reprocessed four years of history. And it wasn't fun. It wasn't fun. We lost a lot of sanity during that timeframe. Okay. Next was we aligned that we needed a more impactful UI. And so, can anyone guess what tool we initially chose? Actually, it was Adobe Experience Platform with Customer Journey Analytics, AEP CJA. Sorry, yeah, correct. Yeah, yeah. We initially aligned on going down that road. And we loved it from the engineering side and quickly realized it was not at all going to give us what we needed for product analytics. Yep. We then aligned on Amplitude, right? Which is why we're here.
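As an aside, to make the listener idea concrete, here is a rough sketch of what such a digitalData listener could look like. This is our own illustration, not Chegg's actual code: the Rio event shape, the function names, and the assumption that the data layer is push-based are all hypothetical.

```javascript
// Minimal sketch of a digitalData-style listener (illustrative names, not
// Chegg's implementation). It wraps the push method of an Adobe-style data
// layer, translates each entry into a hypothetical Rio event shape, and
// hands it to an emitter that would forward it into the event pipeline.
function createDigitalDataListener(dataLayer, emit) {
  const originalPush = dataLayer.push.bind(dataLayer);
  dataLayer.push = (...entries) => {
    for (const entry of entries) {
      // Map the Adobe-style payload onto an assumed Rio event shape.
      emit({
        event_name: entry.event || 'page_view',
        properties: entry.page
          ? { page_name: entry.page.pageInfo && entry.page.pageInfo.pageName }
          : {},
        ts: Date.now(),
      });
    }
    return originalPush(...entries); // keep the original Adobe logging intact
  };
}

// Example: capture emitted events into an array instead of a real pipeline.
const rioEvents = [];
const digitalData = []; // stands in for window.digitalData
createDigitalDataListener(digitalData, (e) => rioEvents.push(e));
digitalData.push({ event: 'video_play', page: { pageInfo: { pageName: 'study-home' } } });
```

The point of wrapping push rather than replacing it is that the existing Adobe logging keeps working unchanged while a second copy of each change flows into the new stack.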
We have two Amplitude instances, one fed real-time data, one fed the enriched data set that I described earlier. And thirdly, driving trust in data: getting alignment around, when someone asks a question, where to go to answer it. So we built these data marts that basically aggregate and summarize information about key metrics, which are then compared against the equivalent metrics that you can pull from Amplitude. With the introduction of this unified data set, we then worked with analytics and product marketing, who had charts and graphs built up in legacy experiences, like in Workspace or as SQL queries, and managed the migration to now report from this new data source. Which sounds very simple on the surface, but is insanely complicated. Yep. Just because you have so many disparate people and so many different use cases. I would say this drive-trust-in-data piece was such a huge problem for us, because folks would always come and say, I can't reconcile data, right? Two systems have different data. But with Amplitude, and with us owning the instrumentation that Chris talked about earlier, it has actually become quite a pleasant surprise, where we can say, ah, 100 is 100 in two different systems. And that has given us a lot of credibility that this platform actually works and the data is really clean. And we have a ways to go, but this is a good start. Yep, yep. And finally, we put in place some data quality measures, some monitoring. We have a lot more to do there. Yep. Oh my God. What went well? So, we expected adoption to be somewhere in the 250 to 300 range, but I think we've exceeded that greatly. We've got a ton of users accessing it, and the reason for that is the earlier example I gave: many users were stuck in their own specific world or business domain. Now they're able to see the journey across mobile, across web, across different products; like, students go to citations, bibliography, and then use the writing tools and Chegg Study.
So they are able to go, ah, here is the journey for a user, end to end. So our adoption has increased, our questions and sharing have increased; they love to share what they've built, and dashboards and notebooks are springing up left, right, and center. So we are very happy about this one piece here. The other KPI that was kind of interesting: we worked with the analytics team to understand the number of requests that come to them that should be self-served. And it's a little bit qualitative, right? Because they have to bucket in their minds whether something can or should be self-served versus an actual request that requires their amazing minds. But it gives us a measurement against which we can compare our progress in enabling self-service analytics. So that's that 20% you see. Three months after launching Amplitude, a fifth of all requests to the analytics team that should be self-served are being self-served. We hope obviously to increase this number to close to 100%. Yeah, yeah. Engineering dependency on Databricks and SQL has reduced as well with that. So Amplitude, as you guys know, we love it. Our own folks across product analytics, data science, and marketing all use it to a great extent. So why did Amplitude adoption go so well? Amplitude support was great. KC is not here, but shout out to KC. She's been incredible. She was brilliant. Every step of the way, as people embraced it, there's that starting trouble, but she made it so easy for folks to come on board. Thomas is here, shout out to Thomas as well. And access: just half an hour later, folks had access to Amplitude. And the user interface, you guys use it as well, was intuitive. There was really a low barrier to entry, if you will, in terms of how to use the system, and the user training documentation was all great. In fact, after Amplitude adoption increased, there were more questions internally at Chegg asking, hey, where is this data coming from?
What does this data mean? That was never asked before. So it actually was a reflection on us to improve our own data catalog, or event catalog. Integration was super simple. Our engineers really had no trouble at all with the integration. And that's coming from my past experience integrating with various vendors; I've been in engineering most of my life, and Amplitude was one of the easiest ones. And we're not getting paid to say this, and we're not shareholders, so. And the data sources and quality. We have tons of data sources, so one of the challenges we had was to ensure that all of that somehow made sense when getting it into Amplitude. And I think that was easy as well. So overall, I think we were pretty happy that our total effort to go live was, what, two and a half, three months. And we made that with time to spare. So it was a good experience for us. Aha, the moment we've all been waiting for. So a lot of this is more towards Chegg, right? And this is a challenge I'm sure you all are having as well: how do you ensure that the system of truth, which is the data that you produce, has a good taxonomy, right? Good, clear meaning to it, good descriptions. How do you ensure that your users are able to discover it and find that there is trust in the data and trust in the taxonomy? We're still working on it. But I think using Amplitude has challenged us to look deeper. In fact, the next release, the next version we are all working on behind the scenes, is to make that better. So that is a huge, huge challenge for us. The other thing we talk about consistently with Amplitude: most of our data is immutable, doesn't change, right? But there are instances where data does change, and one of the hard problems that we have today is how do we enable data malleability?
Even folks like Databricks are going towards a model where they will allow you to change data. So that's one thing that we would want to have going forward. Self-service for all. This is one of the key things we started off with: we have a huge spectrum of users, from product managers all the way to engineers. All of them use Amplitude, but we found that there are certain pockets, like product analytics, where it's a huge hit, whereas our engineers or, for example, our UX folks run into certain restrictions when it comes to using it. And we have certain requirements; I was just talking to Patrick as well, and we have certain cool ideas we can share to make that easier for us. Data sources and quality are mainly on us, because we need to make sure the client-side logging is clean and doesn't suffer from limitations, missed updates, and loss of data. All of these are critical pieces of our business that we need to manage well. And I think we also have, unfortunately, too many third-party systems that we share data with. There are several partners out there, Braze and mParticle. When we do that, it becomes hard for us to say what we sent and what we got back. So that's one of the things we are going to work on, to make sure our third-party interfaces are well done as well. So those are some of the key things that I think we'll continue to improve. Yeah, just a little more context on the taxonomy challenge. I don't know if this is unique to Chegg, because we came from the legacy Adobe experience, but the way we used to capture data was very specific to that digital analytics 1.0 framework. If it's a page view, then it's called a page view event. If it's a click, then it's an interaction event. A success is a success event. So we initially modeled that new Rio paradigm around that framework.
And what we found is that, unfortunately, that doesn't translate very well to the modern way of describing events, where an event represents a user activity. What we found is that doing that translation and pumping it into Amplitude is a little more painful than we would like. And we're looking forward to the next opportunity, where we're able to give things specific names, streamline the process of defining things, and ultimately streamline the structure of the data itself to simplify consumption. Next steps. So I kind of alluded to it without calling it out: that next journey for us is Rio 2.0 and an event lakehouse. Rio 2.0 will allow us to collect data at the edge, through Lambda on CloudFront or OneGraph calls, so we're not subject to client-side data loss from things like ad blockers. It also means that we are able to enrich that data in motion; today, that enrichment framework I described earlier runs on a delay of four-plus hours. So it allows us to enrich in motion and therefore send data out to our partners immediately, as opposed to waiting those four hours. The next is empowerment, as well as self-service analytics. One of the challenges we've had in standing up Amplitude is that we've gotten great adoption, you can see there were like 450 users, but what we don't have, I think, is great empowerment, where people get on the platform and go, wow, this is amazing, that aha moment. So we've spun up a project literally called Project Aha: to go and meet with product, to define their user experiences and their OKRs, to assemble the official charts that reflect those journeys, and to understand how they compare to the official charts that analytics or BI maintains and reports up to the execs, in order to put those guardrails I mentioned earlier in place. And then therefore give confidence and empowerment to the product team and marketing team of: I know where to go.
I have a chart that I can start with and fork, it's trusted, and I understand how and where to use it. And then finally, driving more trust in data, which I alluded to earlier. One part is those guardrails that I spoke of, and the second part is defining data quality standards and a data quality framework: monitoring, real-time alerts on monitors, and an incident response framework. If any of you have done site incident engineering, there's a very robust process there. But for data management, we found that we didn't have that same process instilled in the organization. So when a data incident emerged, people were kind of like, oh, what do we do now? We want to align that with the way we respond to site incidents, so that key data issues are immediately resolved. Yep, yep. And hopefully, if we're semi-successful, we'll maintain our sanity throughout this. I do want to call out that Chegg is hiring, so if you're interested in ed tech or helping students learn, come join us. Maybe this isn't the right time to say it, since I just said we'll hopefully maintain our sanity, but it's a great company.