My name is Raul Singh and I'm excited to be here. Cassandra Summit hasn't happened for years now, and I'm really happy it's back. I'm also excited that we're talking about AI. When they asked me whether to retune this talk, because we decided to run these summits together, I said: there are going to be enough talks about AI. I'll cover a little AI today, but my goal is to make three points. One: how do you design an open data platform that can withstand personnel changes, people coming and going, and technology changes as technology comes and goes? Two: how do you do that in a way that future-proofs you for more and more AI coming down the pipeline? Three: how do you keep control of it? We're at an open source summit here, so it's not necessarily about everything running on your own hardware; it's really about whether you have control over the software.

The biggest challenge, I think, is that people are going to have to figure out how to stare at the computer while the AI does the work. Trust me, as smart as you all are, you're going to be running an LLM-based application and watching how it runs. Half of this is a joke, but half of it is true, and it will come full circle when I talk about how you build a data platform that is AI-ready, one that AI can help with later on. Really, it's about knowledge management. You can design a good data platform, you can have the right engineers and the right technologists, but how do you manage the process of designing and building so that new people can come help you, and at the same time your organization becomes an expert on that platform? No two platforms are built the same; every single platform is different. We work with DataStax a lot, and DataStax, being a product company, is always saying "this is our product and it can solve these problems." Yes, but this is the industry and this is its problem; how do we work back from that?

So I'm going to cover a playbook for how you design a platform, an opinionated framework for reference architecture, and an approach to knowledge management. We won't get to the migration use case, but we will look at a standard data fabric, and we'll talk about cloud versus open source versus open core. There are three badges you can collect later: the deck itself, with high-resolution images of the things I'm showing that I think will be useful to look at later, and a link to a bootcamp I'll be doing around generative AI specifically with modern data platforms in the mix.

Our company (shameless promotion) helps platform owners go beyond their potential. That's why we work with Cassandra: Cassandra is infinite. We do this using our playbook, which is basically a set of principles, and that's what I'll be talking about. We work with a lot of logos you probably interact with, and those logos end up using databases like Cassandra because at that point it's the only database in town that will work. At the end of the day we see ourselves as a people and knowledge company; ultimately we're solving problems for people. Modern technology is very disconnected. This is a picture from 2020 of the marketing technology stack, and in a consumer-oriented company the marketing stack is where the majority of the data comes from.
How do you consume and process that information so it can help you make decisions? On the bottom right is a stack of all the open-source LLM frameworks and AI tools that are out there. Technology just keeps getting more complex, not because the technology itself is becoming complex, but because there's another player, and another, and another. Yet what do most people need? They just need a way to find information, analyze it, and act on it. It's usually those three things. I joke that database technology has been the same forever: get your stuff into the database, get your stuff out of the database. But as people use it, it's about retrieval, about finding. Visualization is one way to do it, analyzing the data is another, and acting on it and sharing it is another. Keep this in the back of your mind and ask yourself where the AI fits in, because AI can help with all of that.

Mature business platforms cater to four kinds of audiences: employees, partners, customers, and, let's say, things (and I would maybe add AI later on) as services going out to people. It might be a customer experience, an information system like a CRM, or a business intelligence dashboard. I didn't make this diagram; it's from Gartner's digital business technology platform. Big companies have hundreds of systems they have to organize in order to have a business platform, and for me the nirvana is all of your stuff together in one place, so you can find it, use it, and share it. Going beyond the twelve-factor app manifesto: is the data available to the right people at the right time? Is it available in real time where it needs to be? Is it behind a password because it's sensitive? Is it up all the time? Is there a recovery point objective and a recovery time objective of zero? Because that's what's possible with technologies like Cassandra.

So this is what I'm thinking about: in a perfect universe, all the data being generated in an organization could be in one place, and the ability to use that data lets the company, with human hands or AI hands, make decisions quicker, to be cheaper and better than the competition. All companies go through this sequence, by the way: you bring up silos, you create standards, then you create standardized data, and eventually you can have standardized processes so the company can do new things quickly. The value of an organization's technology depends directly on how its optimized data core is built and managed, because if the company acquires another company, how quickly can it integrate that company's resources? Eventually it's not about the technology; it's really all about the data. Today companies use things outside the firewall: software as a service like QuickBooks, Salesforce, and so on. Standardization is about containers; it's not Windows versus Mac versus Linux, it's "are we using Kubernetes, or are we using Kubernetes?" The optimized data store is a little harder to handle, because not every database can handle every problem you have. If you're a big company you can't just get one database. You can pay millions of dollars and get Oracle, but if you want to scale to millions and billions of users, that route isn't going to work either. The goal is business-to-business modularity: how can you build your platform in a way that lets you connect with those other parties I just talked about, customers, employees, partners, eventually things, and maybe AI?
Done that way, if shifts in the business change the direction of your company, you're able to pivot much more quickly than if you didn't have everything together.

You've all seen the generic data platform: data in, data out. Every data platform has some component of a scheduler, streams, and batch; data comes in from different sources and goes out to different systems. What Cassandra does, because of cross-data-center replication, is let the same data set be acted upon by completely different tools. In this case we're looking at one data center running a Spark machine-learning workload, one using Presto for reporting (and by the way, Presto, or Trino, can run on Spark, so they're not mutually exclusive), and another handling all the transactional apps built on Node, Spring, and Akka. That gives us a data fabric, and because it's open source, you can choose to run it in multiple AWS regions, in different clouds, or half on your own infrastructure and half on the public cloud. These data centers can really sit anywhere.

Talking about design: when we think about building a modern data platform, you might ask, what are you using for infrastructure automation, what are you using for orchestration? I say let's step back for a second. Let's think about who's going to be using this and what processes they need to be responsible for. That's where the design component has less to do with the technology and everything to do with the business goals. There are tons of modern stacks out there; if you search for "modern data stack" or "modern data platform" you'll find a bunch of things, usually featuring Snowflake, dbt, Databricks, Spark. So which one do you decide to use? How can you even decide? It just continues to grow; every day there's another logo you have to look at.
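The cross-data-center replication point can be made concrete with a keyspace definition. This is a minimal sketch: the data center names ("transactions", "analytics", "reporting") and replication factors are hypothetical examples, not from the talk; in practice the names come from your cluster's snitch configuration.

```python
# Sketch: one Cassandra keyspace replicated across purpose-specific data
# centers, so Spark, Presto/Trino, and transactional apps each read a
# local copy of the same data set. DC names here are hypothetical.

def keyspace_ddl(name: str, replication: dict[str, int]) -> str:
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy."""
    opts = ", ".join(
        f"'{dc}': {rf}" for dc, rf in sorted(replication.items())
    )
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {opts}}};"
    )

ddl = keyspace_ddl(
    "fabric",
    {"transactions": 3, "analytics": 2, "reporting": 2},
)
print(ddl)
```

The point of NetworkTopologyStrategy is exactly the fabric described above: each workload's data center holds its own replicas, so analytics traffic never competes with transactional traffic for the same nodes.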
Our approach is to look at people, processes, information, and systems as an inventory, and then ask: how do we take the most important business process and make it better? The framework is comprehensive. It covers the user experience, the platform, which data systems we're using, which cloud we're using, and the approach is less programming, less scripting, and more configuration and automation. I obviously can't go through every single step here, but your company will have some version of this. The point is: can you take your approach to your platform and teach it to the person joining your team tomorrow? And then can you take that same approach and teach it to the LLM that's going to be assisting? Because if you're scaling an organization, yes, you can always add more technology and more compute nodes, but when you need to build more features, it's going to be people, at least for now.

We use a canvas. You've seen the business model canvas and the lean startup canvas; ours has two dimensions. One is our contexts (people, process, information, and systems) and the other is responsibility areas. You can inventory every person or persona, the major business processes, the major information requirements for those processes, and the technologies it all runs on right now, to find out whether there are redundant systems or processes that could be sequenced together. You can zoom in. This is a design for a data platform with customers, business owners, engineering, and operations. Every business of scale obviously has customers. What are the business product owners' goals? In one case, to add data, report on it, and integrate it; they want to release and support customers. Developers and engineering want to develop, test, and release, and operations wants to support the infrastructure.
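The two-dimensional canvas can be sketched in a few lines of code. This is an illustrative model only; the personas, processes, and the "Jenkins" entry below are hypothetical examples I've made up to show the redundancy check, not data from the talk.

```python
# Sketch of the canvas: contexts (people, process, information, systems)
# crossed with responsibility areas. Entries below are illustrative.

CONTEXTS = ("people", "process", "information", "systems")

def make_canvas(areas: list[str]) -> dict[str, dict[str, list[str]]]:
    """One empty cell per (responsibility area, context) pair."""
    return {area: {ctx: [] for ctx in CONTEXTS} for area in areas}

canvas = make_canvas(["customers", "business owners",
                      "engineering", "operations"])
canvas["business owners"]["information"] += ["data reports", "integrated data"]
canvas["engineering"]["process"] += ["develop", "test", "release"]
canvas["operations"]["process"] += ["support infrastructure"]

# Redundancy check: a system listed under several responsibility areas
# is a candidate for consolidation or shared ownership.
canvas["engineering"]["systems"] += ["Jenkins"]
canvas["operations"]["systems"] += ["Jenkins"]
shared = {
    s for area in canvas for s in canvas[area]["systems"]
    if sum(s in canvas[a]["systems"] for a in canvas) > 1
}
print(shared)  # systems used by more than one area
```

Filling in every cell is the inventory exercise; scanning across cells is what surfaces the redundant systems and sequenceable processes mentioned above.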
If we think about the goals of each of these users, they're not trying to fight each other, though sometimes it feels like that. When you put it all in one place, you can show them they're actually trying to do the same thing: help the customer. What we found here is that there's actually very little overlap in data requirements between customers and business product owners on one side and engineering and operations on the other, and people expected more overlap.

In our framework we go through a simple checklist: is it distributed, is it real-time, is it extendable or open, is it automated, and can it be monitored and managed? You can build an architecture on the Amazon public cloud, on Microsoft Azure, on Google Cloud. Are they all distributed? Yes. Real-time? Yes. Extendable? Usually they're open in terms of APIs, not open source. Can they be automated, managed, and monitored? Absolutely. So a distributed platform does not require open source. But the case I'm trying to make is that you should treat open source as a priority, because you get more freedoms when you use it. If you look at the ecosystem, there are tons of specialized databases, tons of streams, tons of stream processors out there, so how do you use all this stuff? There are lots of data modernization, automation, and integration tools. Who here has done ETL with the no-code/low-code tools out there, anyone, like Airbyte or Grouparoo? Okay, I didn't think so; it's relatively new. In the whole ecosystem of business software, I believe no-code/low-code is the future, and all of these tools help you automate and build systems without too much programming.
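The five-point checklist can be expressed as a simple scoring function. A minimal sketch; the two candidate platforms below are invented examples, and "extendable_or_open" deliberately means open APIs, not an open license, matching the distinction made above.

```python
# Sketch: the platform checklist (distributed, real-time, extendable/open,
# automated, monitorable) as a scoring function over candidate platforms.

CHECKLIST = ("distributed", "real_time", "extendable_or_open",
             "automated", "monitorable")

def evaluate(candidate: dict[str, bool]) -> tuple[int, list[str]]:
    """Return (score, gaps) for a platform against the checklist."""
    gaps = [item for item in CHECKLIST if not candidate.get(item, False)]
    return len(CHECKLIST) - len(gaps), gaps

# A managed public cloud can tick every box without being open source.
public_cloud = {k: True for k in CHECKLIST}
homegrown = {"distributed": True, "real_time": True,
             "extendable_or_open": True,
             "automated": False, "monitorable": False}

print(evaluate(public_cloud))  # (5, [])
print(evaluate(homegrown))     # (3, ['automated', 'monitorable'])
```

The checklist says nothing about licensing on purpose: open source is argued for separately, as a freedom rather than a functional requirement.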
When you bring it all together, I look at it as a computer. A computer has major components: RAM, CPU, disk, display, motherboard, operating system. The way I correlate them: persistent queues for the RAM, or the bus; queue processing and compute for the CPU; persistent storage for the disk; a reporting engine for the display; an orchestration framework for the motherboard (though you could argue the data pipeline could also be that); and a scheduler for the operating system. Then there are strategies: cloud native (on Google, for example), self-managed open source, self-managed commercial source, and managed commercial source. You don't have to take one approach for everything. You can say: we're primarily an AWS shop, nothing is going to beat S3 for what we need, and we're not going to bring up a MinIO cluster to replace it. You can use managed Kafka services for Apache Kafka (from, I forget, they keep changing their name), but use Cassandra for the database and EMR for Spark. If you have this grid of strategies crossed with the parts of a data platform, it lets you make choices based on your business. There's a reason somebody is happy using Elastic MapReduce for Spark versus Databricks, or Google Dataproc for Spark versus self-managing their own Spark cluster. These decisions aren't objective; they all come down to a business and its goals. What this approach asks you to do is the homework: look at the ecosystem, and look at the grid of what really matters for a particular business. This diagram will be in the downloads, so we won't go through it, but in it I've collected all the open components for a distributed data platform.
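The grid of components crossed with sourcing strategies can be sketched as data. The particular picks below are one hypothetical portfolio in the spirit of the AWS-heavy example above; they are illustrations, not recommendations.

```python
# Sketch: the strategy grid. Rows are platform components (named after the
# computer analogy), columns are sourcing strategies; a real exercise fills
# one cell per row based on business goals. Example picks are hypothetical.

COMPONENTS = ["persistent_queue", "compute", "persistent_storage",
              "reporting_engine", "orchestration", "scheduler"]
STRATEGIES = ["cloud_native", "self_managed_oss",
              "self_managed_commercial", "managed_commercial"]

picks = {
    "persistent_queue":   ("managed_commercial", "managed Kafka"),
    "compute":            ("cloud_native",       "EMR for Spark"),
    "persistent_storage": ("self_managed_oss",   "Cassandra"),
    "reporting_engine":   ("self_managed_oss",   "Presto/Trino"),
    "orchestration":      ("self_managed_oss",   "Kubernetes"),
    "scheduler":          ("self_managed_oss",   "Airflow"),
}

for comp, (strategy, example) in picks.items():
    assert comp in COMPONENTS and strategy in STRATEGIES
    print(f"{comp:18} -> {strategy:22} ({example})")
```

A mixed portfolio like this is the whole point of the grid: the strategy is chosen per component, not per vendor.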
To that I've added things like no-code/low-code for ETL and reverse ETL, as well as low-code/no-code for the user interface. The goal is that when we're thinking about a scalable data platform, we have a starting point, not all of the choices made for us, but a starting point. In here there's Astra, Cassandra, and self-managed DSE as options, because all three can do something specific at the same time; for the query layer we may use Snowflake in one instance and Presto in another.

Finally, the approach we take to managing platforms is making sure we have at least one page for every component and every platform, with the following schema of knowledge documented: how do you set it up; how do you train on it if you don't know what you're doing; how do you do basic administration; what does the configuration look like and where is it stored; and what are our external links to knowledge, whether Jira tickets, a blog from an expert, or a YouTube channel. Periodically, the person who owns the component goes and updates that page. I know this is a technical conference, but trust me on this: knowledge management is a big differentiator between teams that perform and teams that don't. I didn't even come up with that; a McKinsey study found 75% more efficiency with good knowledge management. And think about the AI that's coming, think about prompt engineering: the more relevant data you can give that AI to build or maintain things for you, the better. So you're not just preparing for today; when you start documenting systems in a structured way, you're preparing for the future. The full schema is available in the download, and it can go deeper. If you want to mature the knowledge base beyond one page per component, you can.
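The one-page-per-component schema can be made checkable. This is a minimal sketch of how you might encode it and flag incomplete pages (for example in CI); the field names follow the schema above, while the example page content is hypothetical.

```python
# Sketch: the per-component knowledge page (setup, training, administration,
# configuration, external links) as a dataclass with a completeness check.

from dataclasses import dataclass, field

@dataclass
class ComponentPage:
    component: str
    setup: str = ""            # how do you set it up
    training: str = ""         # how do you learn it from scratch
    administration: str = ""   # basic admin tasks
    configuration: str = ""    # what it looks like and where it's stored
    external_links: list[str] = field(default_factory=list)  # Jira, blogs

    def missing(self) -> list[str]:
        gaps = [name for name in
                ("setup", "training", "administration", "configuration")
                if not getattr(self, name).strip()]
        if not self.external_links:
            gaps.append("external_links")
        return gaps

page = ComponentPage(
    component="cassandra",
    setup="Terraform module plus Ansible playbook in the infra repo",
    configuration="cassandra.yaml, kept in the config repo",
    external_links=["JIRA-1234", "https://cassandra.apache.org/doc/"],
)
print(page.missing())  # ['training', 'administration']
```

Running `missing()` across every page gives the component owner a periodic to-do list, and the same structured fields are exactly the kind of context you can later feed to an LLM.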
You could go further: a whole page per section, or a whole wiki site for a single component or platform. I separate every framework into two sections: the platform, with the components that make it up, and extra resources. The components of a platform could be the infrastructure, the compute, the data; the cloud components; open source components versus cloud-native components. This may seem heavy, but if you do the work and you have a README in your repos for every system, you've kind of already achieved it. It's really about having a structure in place so you can tell people where things are.

We talked about the data fabric very briefly, but let me go a little deeper, and then I'll wrap up. Getting data into Cassandra is hard, though it's getting easier because of the no-code/low-code tools out there. Once data is in Cassandra, you can serve it up more easily today with things like Stargate, which creates an API on top of your tables: GraphQL, JSON over REST, gRPC, and a document API. And reverse ETL, a relatively new concept, lets you take data out of Cassandra, low-code/no-code, and synchronize it back to your systems. Put it all together and you have a unified data fabric that can bring all your data together for a business process and send it back where it needs to be. In an ideal world your users don't even have to use a new system, because you can materialize information into the apps they're already using. Then you can add LLM engineering. Before LLMs and RAG, machine learning was very hard; now machine learning ops is optional. So what is this new pattern? It's really just another workload: a workload for getting data in, processing it into embeddings, and storing them.
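The "it's just another workload" claim can be shown end to end in miniature: ingest text, compute embeddings, store them, retrieve by similarity. This is a toy sketch; the bag-of-words `embed` function is a stand-in for a real embedding model, and the list stands in for a real store such as a Cassandra table with a vector column.

```python
# Sketch of the RAG ingestion workload: ingest -> embed -> store -> retrieve.
# embed() is a toy stand-in for a model; `store` stands in for the database.

import math
from collections import Counter

def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words 'embedding'; real systems call a model here."""
    return dict(Counter(text.lower().split()))

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

store: list[tuple[str, dict[str, float]]] = []

for doc in ["cassandra replicates across data centers",
            "spark runs machine learning workloads",
            "presto handles the reporting queries"]:
    store.append((doc, embed(doc)))          # ingest + embed + store

query = embed("which database replicates data")
best = max(store, key=lambda item: cosine(query, item[1]))
print(best[0])  # cassandra replicates across data centers
```

Structurally this is the same in/process/store shape as any other pipeline workload, which is why it slots into an existing data fabric rather than requiring a new one.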
If you think about the larger ecosystem of what's possible with Cassandra, it's not just REST APIs. You can store features for ML models and serve them up in real time; you can train features and models in real time. That's what Cassandra gives you: the ability to do all of this at once, in the moment, in real time, without breaking at the seams. If you try to do this with a monolithic database, it's just not going to work, and that's where, at least in the vector-database world, the monolithic vector databases are not scaling, because they're just not built like this. And because Cassandra has been around for a while, there's a whole ecosystem of tooling. When people ask "why Cassandra?", I say: when Amazon Keyspaces, Cosmos DB, YugabyteDB, and ScyllaDB are all copying Cassandra, obviously Cassandra is doing something right. Why did CQL, and not only SQL, become the protocol for structured data? Who knows, but it is what it is.

Takeaways. Don't reinvent the wheel. Identify the objectives first: a lot of the time in the data platform world the framing is "we need to warehouse all the data," but that's not the goal; the goal is somebody using it, for some job and some goal they have. Prioritize DevOps and DataOps automation; that's the key to a modern data platform. People should not have to stare at screens while a batch job runs, or sit in front of log analytics to make sure a batch job or a real-time ETL app is working. Use open tools; obviously, it's about an open platform. And document the stack. If you document the stack, you're going to be fine. Thank you. Dream big. The whole deck is available at that QR code.
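The real-time feature-serving point can be sketched with a minimal in-memory store keyed the way you might key a Cassandra table, partitioned by entity with one cell per feature. Cassandra itself would be the real store; this toy class only illustrates the access pattern, and all names in it are hypothetical.

```python
# Sketch: a minimal feature store illustrating real-time feature serving.
# Keys mirror a table keyed ((entity_id), feature_name); values carry a
# write timestamp like a Cassandra cell.

import time

class FeatureStore:
    def __init__(self) -> None:
        self._rows: dict[tuple[str, str], tuple[float, float]] = {}

    def write(self, entity_id: str, feature: str, value: float) -> None:
        self._rows[(entity_id, feature)] = (value, time.time())

    def read(self, entity_id: str, features: list[str]) -> dict[str, float]:
        """Point read of a feature vector for one entity, as a model would."""
        return {f: self._rows[(entity_id, f)][0]
                for f in features if (entity_id, f) in self._rows}

fs = FeatureStore()
fs.write("user-42", "clicks_7d", 17.0)
fs.write("user-42", "avg_basket", 31.5)
print(fs.read("user-42", ["clicks_7d", "avg_basket", "churn_score"]))
```

The read path is a single-partition point lookup, which is exactly the access pattern a wide-column store serves at low latency, and it degrades gracefully when a requested feature (here `churn_score`) hasn't been written yet.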
I can take questions if you have them. I knew this was going to happen, so: I do a bootcamp on large-scale LLM/RAG engineering, and you're welcome to check it out. We did one earlier this year in partnership with DataStax; with so much having changed, we're going to have to update it, but the goal is to get people from zero to hero on the whole RAG/LLM engineering skill set, up to speed to build a large-scale LLM application. A quick preview of the agenda: strategy and theory, which is important if you're looking at this long-term as a career; the design patterns that are out there; the no-code and code LLM stacks, if you'll believe it; and building a custom chatbot with an LLM on your own data. If you don't believe me, come find me later, but I built a production-scale LLM app using Astra and Bubble in an afternoon, so I know I can do it, and I know you can do it too. With that, I'm going to wrap up. Thank you so much for coming; I appreciate it.