A warm welcome to KubeCon 2023 in this beautiful city of Chicago. I'm so happy to be here, and I'm grateful to each one of you for choosing to be at my session today. We are about to explore how Kubernetes and inference decision trees can help us efficiently scale gen AI inferencing for solving complex reasoning problems. So without further ado, let's jump right in. We have an exciting journey ahead and plenty to share. A quick introduction about me: I currently work as a principal solutions architect for AWS, and my daily adventures include collaborating and co-innovating with some of our largest ISV customers, right in the heart of the gen AI world, the San Francisco Bay Area. Previously, I worked at Cisco with the emerging technologies and incubation team; I believe it's now called Cisco Outshift. Anyone from Cisco here? Great team. I loved working there. I miss you guys. My journey with Kubernetes goes all the way back to 2017 at UPS, where I architected UPS's first private cloud using Kubernetes with OpenShift. Any OpenShift folks here? Awesome again. Great product. I learned a lot from you back then; I was a student. Exploring the intersection of Kubernetes and machine learning has been an active area of interest for me ever since. Back in 2021 at KubeCon in LA, I talked about the intersection of Kubernetes and edge AI; that's a link to my talk and a blog. And today I'm thrilled to talk about my exploration into the world of Kubernetes and gen AI. You guys ready? All right. To kick things off, I'll start with a simple math problem. If you've downloaded my slides, you might have gotten a sneak peek at it, but it's still a surprise. So I'll start with a math game. From there, we will examine the challenges that even the most state-of-the-art large language models have in solving such problems, and we'll see how a prompting strategy called Tree of Thoughts approaches such complex reasoning problems. Then I'll talk about the efficiency and complexity challenges, and how various software patterns, and hosting this Tree of Thoughts strategy on Kubernetes, can make it not just functional but practical and efficient. And what better way to bring it to life than with a live demo? I'll show you the Tree of Thoughts strategy solving a complex math problem on a Kubernetes cluster. To wrap things up, I'll share some concluding thoughts on the future of Tree of Thoughts and its evolution. So, in order to understand the core problem, we'll start with a math game. Ready? All right. Take a look at those four numbers on the screen. The problem you have to solve is to take these four numbers and use any of the four basic arithmetic operators, addition, multiplication, subtraction, or division, to get 24. Seems pretty straightforward, right? Give it a try. All right, did you get the answer? It's pretty easy: you just add all four numbers together and you get 24. It's an easy start to the session. But here is when things get interesting. Let's see how the most advanced large language models out there fare with this simple math reasoning challenge. We're talking about GPT-4, Bard, and Claude from Anthropic, the giants in the world of generative AI. That's the prompt you're seeing on the screen. So we'll play a game; we'll see how GPT-4 tries to answer this. A quick show of hands: who here thinks GPT-4 can solve this little math challenge instantly and accurately? Instantly. Well, I'm surprised.
There are some very pessimistic people out there. Come on, it is GPT-4. You know how much money OpenAI and Microsoft invested in that? Billions of dollars. And it can't add four numbers? All right, well, let's give it a try; the proof is in the pudding. OK, here we go. I'm going to start this challenge by entering this prompt, and you can see GPT-4, and this is not 3.5, this is 4, the state of the art, trying to solve it. OK, it's trying to arrange these numbers. Well, that answer is clearly wrong. Oh, it tried to correct itself. It's still way off; that doesn't make any sense, right? OK. All right, let's not jump to conclusions yet. Let's see what Bard, from the all-powerful Google, and Claude do. I'm going to give the exact same prompt to Bard, and while that's happening, we'll try Claude. OK, so they are doing the exact same thing as GPT-4: we give the prompt, and let's see what happens. So, Anthropic. OK, you can clearly see Bard has come up with some nonsensical answer. So does Claude; no better. All right, let's give them another try, to be fair. Let's try it again. Maybe with another chance they can get the answer right. Well, no. So there you go. All three of the most advanced large language models are unable to solve a simple problem that, I guess, a third-grader or a fifth-grader can do mentally, in their head. OK, so all of those are incorrect. It seems that something is not right here. Why can't they answer this simple math reasoning problem? Exploring why that happens gives us a fascinating insight into the state of large language models and how they work, but it requires us to examine the fundamental building blocks of large language models. So bear with me through a little bit of computer science here; I will try to explain it the best I can. It is a complex topic. When we talk about current large language models like GPT-4, Claude 2, or PaLM 2, they fall into the category of self-attention-based autoregressive LLMs, which are a type of decoder-only transformer variant. You will read about various variants of the transformer architecture; these come from the branch that is decoder-only. Because of that, these LLMs generate each token based on the preceding sequence of tokens, without any backward editing. This sequential processing, while effective for many tasks like chat and question answering, poses a significant challenge for problems that require multi-step reasoning or exploring different continuations within a single thought process, like the math reasoning problem we saw in the previous slide. Moreover, these models do not possess a feedback loop. They do not have the capability to self-check their work, iterate upon it, or perform logical correctness checks as they generate text. This can amplify small errors, potentially leading to larger mistakes as the solution progresses. We saw this happen in the GPT example before: as it was trying to correct itself, it was digging a bigger and bigger hole and producing longer and longer wrong answers. Such LLMs cannot backtrack; they cannot correct a mistake and start the process again. Now, this is in stark contrast to how we humans solve problems. We often take a heuristic-guided approach, wherein we navigate through a chain of complex steps with relative ease.
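To make that no-backtracking point concrete, here is a deliberately simplified sketch; nextToken is a hypothetical stand-in for one forward pass of a decoder-only model, not anything real.

```go
package main

import "fmt"

// nextToken stands in for one forward pass of a decoder-only model: it can
// only look at the tokens generated so far.
func nextToken(context []string) string {
	// ... model inference would happen here ...
	return "<next-token>"
}

func main() {
	context := []string{"Use", "1,", "1,", "11,", "11", "to", "make", "24."}
	for i := 0; i < 16; i++ {
		// Each token depends only on the preceding sequence. There is no
		// step where an earlier token gets revised: once an error is
		// emitted, every later token is conditioned on it.
		context = append(context, nextToken(context))
	}
	fmt.Println(context)
}
```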
And understanding these limitations is crucial as we continue to develop and deploy these large language models across various domains. Now let's explore three prompt engineering strategies that have shown a lot of promise in making LLMs reason more like humans, especially on these complex math problems. The first strategy is called chain of thought. This strategy takes a sequential approach and works through intermediate steps. While it offers a clearer path for complex tasks and enhances transparency, since you can see the whole chain of thought as it solves the problem, it can sometimes lead to ambiguous thought decomposition. The second strategy is chain of thought with self-consistency. This is an ensemble method: the algorithm generates multiple independent chains of thought and then selects the most frequently occurring output. It's a robust strategy, and it works, but it does not explore alternate paths within a single chain of thought. And lastly, we have the tree of thought strategy, the core focus of today's talk. Imagine navigating through a vast tree of reasoning paths, with partial solutions spanning a problem space. Exploring that entire combinatorial problem space requires sophisticated and efficient search algorithms. The strategy allows us to locally explore individual thought processes and to make global choices using backtracking and look-ahead techniques. It has shown the highest success in solving reasoning problems such as the one we started with. And that's why we'll take a deep dive into the tree of thought strategy, a powerful framework introduced independently by researchers Yao and Long back in May 2023. So it's not that old; it's about six months back. I have links to all the research papers this talk is based on in the last slide, so you can read them. They're not that fun to read, and you need patience to get through all the math, but you'll understand how they built up this strategy. It is just fascinating. It took me a while to get it clear in my head so I could build my demo for this. So, tree of thought builds a reasoning tree in which each node represents a partial solution. It's like building a whole mental map, solving the problem through intermediate steps at the nodes. Each thought or idea then branches out, creating a tree-like structure of interconnected thoughts. This will make a lot of sense when you see the demo. But first I'll walk you through the high-level strategy. It is more complicated than this picture, but it's a good start, and this is the best I could do; I spent a lot of hours trying to find the best visual to represent this. I couldn't find one in any of the research papers, so it took me a while to get right. Thought generation in tree of thought happens at these nodes: each partial thought at a node is generated and used to explore potential solutions. Then each partial thought, which can be a potential solution to the problem, is sampled and evaluated using self-reflection. By that, I mean we ask the large language model to evaluate the validity of its own thought.
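In code, the pieces described so far could look something like this minimal sketch; callLLM is a stand-in for whichever client actually serves completions, and the inline prompts are illustrative, not the demo's actual ones.

```go
package tot

import "strings"

// Thought is one node of the reasoning tree: a partial solution plus the
// branches the LLM proposed from it.
type Thought struct {
	Steps    []string
	Score    float64
	Children []*Thought
}

// callLLM is a stand-in for whichever client actually serves completions
// (OpenAI, Bedrock, a local Llama, ...).
func callLLM(prompt string) string { return "" }

// generate wraps the node's partial solution in a "propose next steps"
// prompt and turns each non-empty line of the reply into a child node.
func generate(node *Thought, branching int) {
	prompt := "Steps so far:\n" + strings.Join(node.Steps, "\n") +
		"\nPropose next steps toward reaching 24, one per line."
	for _, line := range strings.Split(callLLM(prompt), "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		steps := append(append([]string{}, node.Steps...), line)
		node.Children = append(node.Children, &Thought{Steps: steps})
		if len(node.Children) == branching {
			break
		}
	}
}

// selfEvaluate asks the model to judge its own partial thought and maps
// the classification onto a score the search algorithm can sort on.
func selfEvaluate(node *Thought) {
	reply := callLLM("Can these steps still reach 24? Answer sure, likely, or impossible:\n" +
		strings.Join(node.Steps, "\n"))
	switch {
	case strings.Contains(reply, "sure"):
		node.Score = 1.0
	case strings.Contains(reply, "likely"):
		node.Score = 0.5
	default:
		node.Score = 0 // impossible: a candidate for pruning
	}
}
```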
So now you ask it a question to solve in steps, then each step is given back to the model, which reasons over whether those particular steps can get you to an answer, and these steps keep branching and expanding as a tree. That's the basic strategy here. The LLM then evaluates each thought and classifies it based on the likelihood of that particular partial solution solving the problem. The evaluation is used as a feedback signal for the tree search algorithm, influencing which branches to promote and which branches to prune. The partial thought chains that score the highest are then validated for correctness of the complete solution. So now you have partial thoughts, and you use a large language model to give you some indication of whether or not each one can lead to an answer. Once you find the highest-ranking thoughts, you send them back to the large language model for a full check: can this particular chain of thought give you the answer now, yes or no? That's what the evaluator does. And the tree search algorithm is going through these nodes, asking questions, ranking them, sorting them, pruning branches, expanding branches. The tree search algorithm decides the search strategy from the feedback that comes from the evaluations via the orchestrator. Depending on the nature of the problem, we can choose from various tree search algorithms such as breadth-first, depth-first, best-first search, or A*. I'm also trying to do Monte Carlo tree search, which is an interesting way to look at this problem, but I haven't been able to get it right yet. And then finally, the orchestrator. The orchestrator is sort of a control plane guiding the search process. It facilitates multi-round conversations with the large language models, providing hints and suggestions to the tree search algorithm to steer its trajectory. It ensures that a rich diversity of potential thoughts gets explored. It also utilizes backtracking and look-ahead techniques to make global choices, ultimately identifying the most promising solution candidates. That's the strategy in a nutshell. But we won't stop there. I will actually take the math problem from the earlier slide, the one where we had to arrange 1, 1, 11, and 11, and we'll see how this strategy solves it. It starts by representing the problem as a tree data structure, and then uses a tree search algorithm over that tree to find the correct solution. The tree starts with a root node representing the problem. In our case, the problem was: take 1, 1, 11, and 11, and arrange them to get 24. That is used as an input, this input is wrapped in a prompt, and that is then used to create new branches of the tree. So thought generation works by taking this input node and wrapping it with a prompt; the input in the red box becomes the prompt for the large language model, which responds with a set of partial solutions. You break the whole thing into steps: you tell the large language model, I don't want a solution, give me the steps that can get to a solution. The orchestrator then takes this response from the large language model and sends it for self-evaluation, which gauges the progress of each partial thought in solving the problem. Now pay attention to this step here, this step.
This is where the branching out, the tree expansion, and the evaluation happen in the tree of thought strategy. From the LLM's response you get, in this case, multiple possible chains of partial solutions. The response is parsed and split, and each line of it becomes the next branched-out node of the tree. So now you see there are four nodes in this tree; it has expanded. The self-evaluation of each one of these nodes then happens by wrapping the partial thought at that node in a prompt and asking the large language model to gauge the likelihood of that partial solution solving the game of 24. This is what I mean by self-evaluation of the LLM's own intermediate thoughts. The large language model responds with an evaluation classification; in this case, it gives a classification of sure, likely, or impossible. That classification is used to provide feedback to the orchestrator on whether that particular branch should move forward, be pruned, or branch out further. This thought-generation and thought-evaluation cycle continues until the search algorithm finds the highest-ranking solution or hits the maximum tree depth as specified in the tree of thought hyperparameters. You can't let this go on forever; you need to terminate the tree at some point, otherwise you will use too much compute. So you set a max tree depth, and if it can't find the solution by then, the tree search algorithm just exits. The highest-ranking thoughts are then validated for the correctness of the complete solution. This is done by taking the highest-ranking thought at a node, wrapping it with a prompt, and asking the large language model to judge whether you can get the answer 24 from it. At this point, the classification expected is a binary choice: are you certain, or is it impossible? There is no range here. If the answer comes back as certain, then that node is tagged as the answer node and the search terminates; otherwise it continues to branch further out. So that is the search algorithm solving the math problem we had in the first few slides. Now, this all looks good in a diagram, but how practical is it? If I try to run it using the OpenAI API or the Bedrock API, or a large language model like Llama on my own compute node, what is the cost of doing this? What is the complexity? I'm going to take a deeper dive into that, and this is from my own experience of trying to run it. From a computational complexity perspective, both the time and space complexity of the common tree search algorithms are on the order of b to the power d, where b is the branching factor and d is the maximum depth of the tree. Because b and d are fixed as hyperparameters, the total work is bounded, but as you branch out more, or go deeper in the tree, the amount of compute required grows very quickly. And when I try to run this tree of thought locally on my own compute node, within 15 or 20 minutes I get CUDA out-of-memory exceptions; it just requires too much GPU memory. And if I use an API such as OpenAI's, I get rate limited and all kinds of throttles kick in. Within a few minutes, I burn through all of my quotas. I've spent hundreds of dollars on this; within a few hours everything is gone, and I still don't have an answer.
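Pulling the pieces from the last two slides together, the whole search loop could look like this sketch, continuing the earlier one (same package, plus the standard sort import); validate is a stand-in for the final certain-or-impossible check. You can see exactly where the LLM calls multiply: one per expansion and one per generated child, at every level of the tree.

```go
// validate stands in for the final binary check (certain / impossible).
func validate(node *Thought) bool {
	return strings.Contains(callLLM("Judge if these steps reach 24 using each "+
		"number exactly once. Answer certain or impossible:\n"+
		strings.Join(node.Steps, "\n")), "certain")
}

// solve runs a breadth-first search over the thought tree: expand, score,
// prune, and stop at an answer node or at the maximum depth.
func solve(root *Thought, branching, maxDepth, keep int) *Thought {
	frontier := []*Thought{root}
	for depth := 0; depth < maxDepth; depth++ {
		var next []*Thought
		for _, node := range frontier {
			generate(node, branching) // one LLM call per expansion
			for _, child := range node.Children {
				selfEvaluate(child) // plus one per generated child
				if child.Score == 0 {
					continue // prune branches classified impossible
				}
				if child.Score == 1.0 && validate(child) {
					return child // answer node found: search terminates
				}
				next = append(next, child)
			}
		}
		// keep only the highest-ranking thoughts before going deeper
		sort.Slice(next, func(i, j int) bool { return next[i].Score > next[j].Score })
		if len(next) > keep {
			next = next[:keep]
		}
		frontier = next
	}
	return nil // hit the max tree depth without a certain answer
}
```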
So for example, if I run a tree search for the problem we saw, and I set the branching factor to three and the depth to four, running this strategy takes 80 to 100 LLM API calls. You can quickly see how expensive that gets; it's cost prohibitive. I can't move forward with that kind of compute usage. So what can we do to reduce this complexity? This is where we need to combine prompt engineering with software engineering, and I'll explain this approach in the language of patterns. In order to reduce the time complexity, we can apply patterns such as functional decomposition and task parallelism to break the whole strategy into independent, parallelizable units of execution. For space complexity, we can use patterns such as distributed caching and state externalization to reduce memory pressure: the state now lives outside and is broken into multiple units that run on the cluster, so the amount of memory required on a single node is reduced. Then, to manage throughput limits, we can use patterns such as exponential backoff and rate limiters that track the requests made over time. For latency, we can use event-driven parallelism, and bring in load balancing and asynchronous processing to minimize synchronous blocking calls, so that every operation is asynchronous. Running on local nodes instead of using an API can also improve latency, but you need enough AI acceleration and memory on the compute node. And finally, to reduce cost, we can use patterns such as the operator pattern, which I'll go into in detail in the next slides, and horizontal autoscaling patterns that help you delegate all these aspects to the underlying infrastructure. You can also reduce the number of API calls by batching them; there's a technique called request batching that we can use to package multiple calls into one request. Now, these patterns look good on a slide, but how do we actually take this ToT strategy and build an efficient system out of it? To do that, I'll show the strategy I used to break it into four decoupled, parallelizable software modules. I call them the thought generator, the thought evaluator, the solution evaluator, and the orchestrator. As an example, I'll show you how to take the thought generator through this process. The first step is to modularize the thought generator component. Here we first implement the functional aspects: deduplication, prompting, parsing, and interfacing with the LLM. Then we layer on the non-functional aspects, such as batch prompting and rate limiting. Then we externalize all the state to a graph database, and move all the communication and the distributed caches outside these components. The next step is to containerize it and package all the dependencies into the container. And the last step is to parallelize it using an orchestration platform. So that's how the entire architectural process works: taking this complex strategy, breaking it into independent modules, and packaging each module as a container. When you do that, we end up with the platform I'll show next. If I had more time, I would go into the details of the entire architecture, but I'll walk through it quickly because I really want to leave some time for the demo.
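Before the architecture view, here is a small sketch of what the throughput patterns above could look like in practice, using the golang.org/x/time/rate limiter; the interval, burst size, retry count, and the errThrottled and doRequest names are illustrative assumptions, not the demo's actual code.

```go
package llmclient

import (
	"context"
	"errors"
	"time"

	"golang.org/x/time/rate"
)

// A client-side limiter: at most one request per 500ms with bursts of 5.
// These numbers are illustrative, not tuned values.
var limiter = rate.NewLimiter(rate.Every(500*time.Millisecond), 5)

var errThrottled = errors.New("throttled by provider")

// callBatch packages several thought prompts into one request (request
// batching) and retries with exponential backoff when throttled.
func callBatch(ctx context.Context, prompts []string) ([]string, error) {
	backoff := time.Second
	for attempt := 0; attempt < 5; attempt++ {
		if err := limiter.Wait(ctx); err != nil {
			return nil, err // context cancelled
		}
		replies, err := doRequest(ctx, prompts)
		if !errors.Is(err, errThrottled) {
			return replies, err // success, or a non-throttling failure
		}
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff between retries
	}
	return nil, errors.New("gave up after repeated throttling")
}

// doRequest is a placeholder for the actual provider call.
func doRequest(ctx context.Context, prompts []string) ([]string, error) {
	return make([]string, len(prompts)), nil
}
```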
This is how the entire platform looks. What you see at the top is the externalized tree state in a graph database; I'm using Neo4j. At the bottom is the Redis bus, which acts both as the pub/sub layer and as a distributed cache. So all of these modules, all the pieces of the distributed strategy, are now running independently and asynchronously over events. And wherever the system sees that compute isn't needed, it cuts those parts back and gives more resources to the modules that need more compute. Now, with all of these boxes and software modules, there are so many layers, and you have to think about load balancing and service discovery and interfacing. How do you tame this complexity? In trying to solve the computational complexity, we ended up with operational complexity; the complexity moved somewhere else. How do you tame that? That's where you bring in the big guns: Kubernetes and KEDA. The first thing is to take all the modules we just saw and, as I like to say, Kubify them. By that I mean you express the desired state of the ToT environment in constructs such as Pods, Services, ReplicaSets, Deployments, and so on. Then you delegate all the aspects such as deployment, config management, self-healing, service discovery, and load balancing to Kubernetes, and automate them using a Level 4 Kubernetes operator. You'll see that in the demo too. And the last step is to delegate all the autoscaling requirements to KEDA. KEDA creates a homeostasis of sorts, wherein it adjusts the system resources by watching custom metrics produced by the ToT system. The thought operator is the most important component of the system, so let's take a quick look at what is happening under the hood of this operator. The first thing in building the operator is to create what are called custom resources: for each ToT module, there is a custom resource that defines the desired state of that module. Then we write a custom controller for each of those ToT modules using the Kubernetes operator framework; I used Go to code it. The controller ensures that the observed state of a ToT module matches the desired state defined in the custom resource, and takes corrective action to reconcile any discrepancies between the two. And then we have the actual operator. When the operator gets deployed to the cluster and we instantiate the custom resources, the custom controllers kick in and use the Kubernetes API to create the required deployments and take over all the operational aspects. Now here is the complete architecture. I know there are a lot of boxes there, but to go over it quickly: you have the ToT operators for each of the four modules, the thought generator, the thought evaluator, the solution evaluator, and the orchestrator, and they are watching for any discrepancies between the desired state specified in the CRDs and the actual state of the ToT environment, and reconciling them. KEDA, on the other hand (the custom operator could have done this too, but KEDA is very sophisticated at scaling and manages it much more efficiently), looks at custom metrics generated internally by the orchestrator and scales the pods for each module automatically, depending on what's happening in the environment.
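For a flavor of what one of those custom controllers could look like, here is a minimal reconcile sketch on top of the controller-runtime library that the operator frameworks build on; the ThoughtGeneratorReconciler name and the desiredDeployment helper are hypothetical, and the logic is reduced to the observe-and-correct essence described above.

```go
package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ThoughtGeneratorReconciler reconciles a hypothetical ThoughtGenerator
// custom resource with the Deployment that actually runs the module.
type ThoughtGeneratorReconciler struct {
	client.Client
}

func (r *ThoughtGeneratorReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Observe: does the Deployment backing this ToT module exist?
	var dep appsv1.Deployment
	err := r.Get(ctx, req.NamespacedName, &dep)
	if apierrors.IsNotFound(err) {
		// Reconcile: the desired state in the custom resource says it
		// should exist, so take corrective action and create it.
		return ctrl.Result{}, r.Create(ctx, desiredDeployment(req))
	}
	// Further drift checks (replicas, image, config) would go here.
	return ctrl.Result{}, err
}

// desiredDeployment would build the module's Deployment spec from the
// custom resource; elided here.
func desiredDeployment(req ctrl.Request) *appsv1.Deployment {
	return &appsv1.Deployment{}
}
```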
All right, that's a lot of slides. Let's see some software. Ready for the demo? So, I have a live demo too, but there are a lot of moving parts in it, so I'm going to start with the infrastructure deployment first. I will start with a blank cluster and you can see all the components get built; the actual GUI demo, the one that shows the tree of thought strategy solving a problem, I have recorded, so I can pause it and show you how the strategy is actually working. To do that, I have to switch my screens here, and let's see if this works. You can see here I have a completely clean cluster: no custom resources, no pods, no deployments. The first step, as I said in one of the slides, is to install the custom resources. To do that, I'm going to use the Makefile that comes with the Kubernetes operator framework and run the install target. The moment I hit that, you will see custom resources start to get deployed here. Let me keep this on. OK, so you saw those come in; these are the CRDs for each of the modules of the ToT system. At this time there are no pods, because nothing is running; at this point I have only specified the desired state of the ToT system. Now I'm going to deploy the custom operator, and with this you'll see the KEDA operator also get deployed. So what you see now is the KEDA server, the metrics server, and the custom thought operator installed. Let's give them some time to start. So I'm showing the layers of this stack: we started with deploying the CRDs, next we installed the operator, and the third step, once the operator is in, is to install the two layers I call the infrastructure layers. One is the Redis bus and the other is the Neo4j database. I have the CRDs for those already deployed; all I have to do is kubectl apply -f on the config. Something is not working, I think. OK, this always happens during a demo; the demo gods are not happy. I forgot to run something in the background. All right, right there. So you see the ToT graph? That's Neo4j getting installed. And the thought bus, which is Redis, is now installed. Let me clear this and maximize it so we can see more clearly. See, those are running, and you can see there are deployments there now. It'll take some time for this to come up. All right, they're all green, they're good. The graph database is still pending. And you can see all the persistent volume claims that the services created. I have a bunch of services that are required for discovery, but I also opened a few NodePorts so that I can connect my IDE to this cluster to test it out. And you have the PVs. At this point, the infrastructure is ready to try out. That was the infrastructure part of my demo; now I have all the pieces required to run this. So I'll switch to my presentation again and run it. OK. What you're seeing here is what I call the Thought Prompt; it's the GUI for the tree of thought strategy. Let me pause it. Those three thought prompts are the prompts for evaluating thoughts, for generating partial solutions, and for validating the final solution. You can change these prompts and experiment with them; that's why I call it the Thought Prompt. It's like a playground.
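To give a flavor of what goes into those three prompts, here are hypothetical template versions for the Game of 24, written in the spirit of the prompts from the Tree of Thoughts paper; the demo's actual prompts differ and are tuned per model.

```go
package prompts

const (
	// Thought generation: ask for possible next steps, not a full solution.
	GeneratePrompt = `You are playing the Game of 24.
Current numbers: %s
List possible next steps. Each step combines two of the numbers with
+, -, * or / and shows the numbers that remain, one step per line.`

	// Self-evaluation: classify a partial thought rather than solve it.
	EvaluatePrompt = `Numbers remaining: %s
Evaluate whether these numbers can still reach 24. Answer with exactly
one word: sure, likely, or impossible.`

	// Final validation: a binary judgment over the complete chain of steps.
	ValidatePrompt = `Steps taken:
%s
Judge whether these steps correctly use each input number exactly once
and arrive at 24. Answer with exactly one word: certain or impossible.`
)
```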
So there's a lot of prompt engineering required to get these prompts right, and each language model has its own nuances about what kind of prompts get it to the solution. OK, now let me jump to the ToT hyperparameters. Let me pause here. This is where you set things like the depth of the tree and the branching factor. You can see things like the batch size, and then there is a concurrency model you can choose: multi-processing, multi-threading, or just serial processing. When you're experimenting with these, sometimes you need to run things serially so that you can follow the whole tree; when it's running in parallel it's just too complex to debug. So I left those modes in there. And then you have the traversal algorithm: you can choose between breadth-first search, depth-first search, or A*; I don't have Monte Carlo yet. All right, let me jump to solving the problem now. So I entered the problem: 1, 1, 11, 11. Keep watching the bottom right of the screen; you'll see the pods that are running. As I hit the solve button, you'll see KEDA kick in and spin up more pods for that particular piece of the strategy. At this stage, solving this requires more pods for the self-evaluation than for creating the initial prompts, so pretty soon you'll see more pods getting created right there. That is the KEDA custom scaler kicking in; it's now creating more pods for the thought evaluator module. And this is the entire tree. Right now this is your root node; it represents the problem. As the solution is running, you can see those green lines in the Thought Prompt; those are the logs coming in from the Redis bus, streamed over a WebSocket. All right, so the tree is forming here. You see the initial prompt branched into eight partial solutions, and as we go through this, the tree will keep expanding; let me walk through it real fast, since we don't have much time left. So now it's at this stage. This is the Neo4j browser I'm using; since the whole tree is saved in Neo4j, I'm using Neo4j as the thought state manager. This is your first level of branching out. You'll see that at each stage it only branches three ways, because the branching factor was set to three in the hyperparameters; if you set it to four or five, you'll see a wider tree. OK, now this is getting really complicated; it has expanded all over the place. At this point, I have a solution. Those orange nodes show the answers. What's interesting here is that this tree independently found two answers from two separate branches, but they happen to be the same answer. And what you see here is the solution. This is what GPT-4 should have answered, instead of saying, you know, 11 divided by 11. Those are the steps to get to 24. So that is the real end of my demo. But what I also want to quickly show you is this tree explorer, which is very interesting, because here you can explore the entire tree, the whole thought process. You can go through each node and see what happened and how it reasoned about the problem. Or, instead of looking at the whole tree, you can ask this browser to give you the answers directly. Those are the two answers. And then you can walk back from an answer to see what parent thought led to it.
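As a sketch, that answer lookup and walk-back could be expressed against the externalized tree state like this, using the official neo4j-go-driver; the Thought label, the answer and root properties, and the CHILD_OF relationship are hypothetical stand-ins for the demo's actual schema.

```go
package main

import (
	"context"
	"fmt"

	"github.com/neo4j/neo4j-go-driver/v5/neo4j"
)

func main() {
	ctx := context.Background()
	driver, err := neo4j.NewDriverWithContext("neo4j://localhost:7687",
		neo4j.BasicAuth("neo4j", "password", ""))
	if err != nil {
		panic(err)
	}
	defer driver.Close(ctx)

	// Fetch only the answer nodes, then walk each one back to the root to
	// recover the full chain of parent thoughts that led to it.
	result, err := neo4j.ExecuteQuery(ctx, driver,
		`MATCH (a:Thought {answer: true})
		 MATCH path = (a)-[:CHILD_OF*]->(root:Thought {root: true})
		 RETURN [n IN nodes(path) | n.step] AS chain`,
		nil, neo4j.EagerResultTransformer)
	if err != nil {
		panic(err)
	}
	for _, rec := range result.Records {
		chain, _ := rec.Get("chain")
		fmt.Println(chain)
	}
}
```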
Following those parent links, you can keep going back all the way up to the root node. OK, so that concludes the demo, and I have some final thoughts. These are for you, the audience: where do you take this work from here? I put in the effort to take it to this point, but I see the future of this, the logical evolution, in a new strategy coming up called graph of thoughts. The graph of thoughts is an interesting structure: like what we saw, it is still a graph, but it is not restricted to a tree. It can be any DAG; it can take any path. That's what you're seeing here; any branch can connect to any other branch. This enables us to find more efficient paths through the entire graph of thoughts, and it also has the potential to reduce cost and increase efficiency. This framework was introduced recently, about two months back, by researcher Besta and team from ETH Zurich. That research paper is also in the reference notes; you can study it. It's a fascinating way to look at this problem. Another potential future direction I can see is to use tree of thought with what I'd call a self-play technique, so that we can enable tree of thought systems to develop innovative problem-solving strategies that are not present in their training data, drawing inspiration from how AI agents like AlphaGo or AlphaStar learned to enhance their strategies through self-play in competitive environments. That's another future trajectory this can go in. And finally, there is fine-tuning large language models using a tree of thought-inspired approach that involves high-level counterfactual decision-making, considering different choices for what comes next. For example, if you're trying to generate some creative text, instead of simply generating the next token, you can ask the model to predict the next token using this strategy. This could open new frontiers for our problem-solving abilities using large language models. And who knows, maybe the day P equals NP is coming closer. That's all, folks. Thank you for your time, and I'm open for questions now. Any questions? Yes? So, yeah, let me show you that. This is the actual UI. You can choose Claude, GPT-4, GPT-4 Turbo, or Llama, and internally it switches the API client to use a different wrapper. Yes, yes. Or if you have a model resident on the node, like Llama or Falcon 7B, it can use that. Yes, it's built in. Any other questions? Yeah, I have another question. Hi. So instead of pre-selecting which model, can it have the ability to choose the model itself? Like in this example, we don't really have to burden the LLM; we could simply send it to a mathematical engine. Yeah, I think that's way, way harder to do than it looks, because each model has a very nuanced way of answering these questions. But what you're thinking, I was also thinking along similar lines: for each part of this problem, the first set of thought generations happens, let's say, with GPT-4, but then the evaluation can happen with a different, much smaller model. That's where I was going with this. Instead of using these massive models, which are very expensive and require a lot of compute, you can break the problem up so that your initial thought generation uses the big guns, and then the smaller models do the evaluations.
And finally, when you have the candidate answer and need to validate whether it is correct, that piece can be done with a local LLM that doesn't require GPT-4 or Claude. That will help reduce cost and increase efficiency. So yes, I think what you're saying can be done. Yes? Yeah, I have two questions. One is, LLMs are mostly a probabilistic way of predicting the next word in natural language, whereas decision trees are a more deterministic, brute-force way of exploring different options, right? So they're different methods. Are you advocating combining the two? That's my first question. And the second question is, in this example you had four numbers, and you were saying that you were using prompting to help the LLMs arrive at the right solution, right? What is the connection between LLMs and decision trees here? Did you simplify the problem by going from, let's say, four numbers to two numbers, so that prompting was good enough for the LLMs to solve the two-number problem, as opposed to the four numbers with many combinations of operations? So, to answer the first question: LLMs are always stochastic, probabilistic; there is no determinism there. What you can do is play with these parameters here: the temperature for the overall entropy, and then top-k and top-p, whether you take the top k tokens or sample from the probability distribution. But you always get a representation of the probability of the answer; it cannot be deterministic. What the tree of thought strategy is trying to do, with the last step of solution validation, is reduce that probability space down to a very narrow set of answers. That's what happens here. But if you run the same thing ten times, some of those times you will never get an answer; the tree just goes down nonsensical paths, and there is no way to completely control that. If I set these parameters too tight, keeping the temperature close to zero, then I never get an answer; if you keep it very high, it goes the other way around and starts to create all kinds of hallucinations. The idea is that this strategy has to find that balance. The next part of your question was, can you repeat that? I guess I lost it. Yeah, basically, how did prompting help the LLMs arrive at the right solution? So, in the examples I give here, these prompt examples you're seeing, the first step is the prompt engineering. You have to find those prompts, but they're not tied to just these four numbers; you pick up a class of problems. In this case, the class of problems was the Game of 24. You could take Sudoku, you could take crossword puzzles, you could take creative writing. And there are only three prompts you need to engineer: the first is the initial thought generation; the second is how you evaluate these partial chains; and the third is, once you are close to an answer, how you validate whether it is correct or not. That's what you have to play with, but you can choose any class of complex problems and try to find an answer. Great talk, thank you. So one question I have is, how many infrastructure nodes, GPU nodes, did you have to use to run this tree of thoughts, for the example you explained in the initial few slides? What is the operational overhead, and when you use KEDA, how many GPU nodes do you still need to solve this problem? A lot. A lot.
That's one of the reasons I put this tree of thought strategy and infrastructure together: to address that problem. As you spin up these nodes, the tree of thought strategy starts to expand, and the KEDA scaler is closely watching what the orchestrator is telling it. It starts to power up the nodes that are involved in evaluations and power down the nodes that are not doing much. So it tries to rebalance the entire compute on the cluster, closely watching what is happening as the tree expands. So yeah, there is no easy way around it; you need powerful machines, otherwise you will spend days getting an answer. But the intent here is, knowing that reality, can we make it more efficient? Yeah, so you're right, KEDA has some built-in scalers for Redis and other systems. I'm not using those; this is a custom scaler that looks at the metrics coming from the tree of thought. The orchestrator watches how the tree is expanding and produces custom metrics, and that's what KEDA reacts to. So it's very fine-grained, tuned to the tree of thought strategy. Correct, correct. Yes, you could expose these metrics to Prometheus and then use the Prometheus scaler, but this custom scaler is purpose-built for tree of thought, so I was able to get more control with it. But absolutely, with Prometheus you can do the same; it can get very close to that. OK. All right, that's all for today. Thank you. That's all the time we have.