Hello and welcome. This is Active Inference GuestStream 65.1. It's December 6th, 2023. We'll be hearing from Phoebe Klett and Dan Simpson on Bayesian world models for explainable, transparent reasoning. There will be a presentation followed by a discussion. So thank you for joining, Phoebe. Very much looking forward to the presentation. Over to you. Awesome. Thanks for having us. Yeah, I'm really excited to chat with you all about how we might start to integrate today's state-of-the-art language models into more probabilistic machinery and what that might buy us. All right. Let's get right into it. So some of the things that I'm hoping to discuss today include why we might use a language model for something that isn't long-form text generation and how we might do that, then motivate a little bit why we might need a world model, what simple self-organizing world models might look like and, even in the simplest cases, how we might start to use those as effective recommendation engines in the wild today, and then a little bit of discussion about where this research is going. All right. So what are language models good at? Today's language models, especially the large ones, are trained on next-token prediction. So this means given a sequence of tokens, we're going to estimate which token is most likely to come next. Rephrasing that a little bit, we might also say that language models are trained to estimate which sequences of tokens or words are likely, the caveat being likely to appear in the training set, but today our training sets are quite extensive. So given this simple training objective, it's arguably surprising that we've seen language models do as well as they have on really impressive tasks. They start to demonstrate really great language understanding, meaning the semantics of language itself and not just the syntax of which sentences make sense. They've also started to demonstrate some world knowledge, implicitly learning how the world works just through our own human abstraction and how we articulate it. We start to see language models falter at tasks which look more like symbolic problem solving: math in particular, programming (although we're getting better at this), and in general any kind of long-term planning task which requires abstract reasoning. We see this also in problems which look like word problems but which are really about understanding abstractly how abstract objects relate to each other. And again, this shouldn't be surprising, since it's so far off from the original training objective. In particular, even when the model arrives at the right answer in some of these cases, it's really hard to know after the fact the reasoning for how it ended up at the right answer, and maybe more fundamentally, if we don't know where the abstraction is happening or where the reasoning is happening, it's very hard to guide that process. And so this starts to motivate the need for something that looks more like an explicit world model. Now, to borrow some words from Yoshua Bengio, one of the biggest issues with today's language models is arguably that we're asking the language model to be both the inference machine and the world model implicitly, when this doesn't quite make sense.
And so some things that we might hope for in a world model are to model causal relationships, to be really adept at modeling using uncertainty, and to be modular. Yann LeCun puts this in a similar way: we might ask a world model to be able to distinguish between which details are important versus irrelevant, and to be able to make predictions in sort of this abstract space of representations. And so hopefully in the following slides we're going to motivate what it might look like to use language models as inference machines in maybe the simplest case of self-organizing world models, and how, even in those most simple cases, we start to get something much closer to the kind of explainable reasoning which I think lots of us are grasping for right now. So in a very simple case we might think of a world model as simply a collection of hypotheses where we model confidence over each of those hypotheses. And in particular we care about predictive world models: given some evidence or data that we have observed in the world, we'd like to propose the world models which best explain the evidence that we've observed. And if that language sounded leading, it was: indeed we are proposing in this case to use Bayes' rule, which we all know and love, which tells us exactly how to update our beliefs given some new evidence. And so when our hypotheses are these Bernoulli random variables, the bottom term can simply be expanded into the equation on the slide. The real tricky part in this computation is the likelihood piece: given that our hypothesis is true, what's the likelihood that we would have observed the evidence? And the claim that we're making is that this is actually a natural task to ask of a language model when our evidence and our hypothesis are both semantic objects. As we discussed, language models are trained exactly to understand which sequences of text are likely, so prompting one in a clever way such that we can extract this particular likelihood is actually really natural. So how do we do this? We come up with a prompting scheme that allows us to extract exactly how likely it is that this evidence would have occurred. So given that our hypothesis is h in this setting and our evidence is curly E, we only need to change our evidence into the conditional form and then phrase the question like this to the language model. The input sequence of text goes: given that our hypothesis is, say, "Walmart has been severely impacted by the COVID-19 pandemic," would we have observed the evidence that "Walmart laid off 10% of its retail staff"? And so either using something like few-shot prompting, or guidance, or some kind of controlled generation technique, we can ensure that our language model outputs either yes or no, and then use the logits from its answer to estimate that probability. And again, the claim here is that this is actually a really natural task to ask of a language model, one that leverages its innate reasoning engine better than just allowing it to ramble in free text. And we can even make this updating scheme a bit more sophisticated using things like precision weighting. This will bias our posterior over hypotheses either towards our prior or towards the evidence that we've observed, based on how confident we are in our prior and in our evidence.
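To make the mechanics concrete, here is a minimal sketch of that update in Python. The `p_yes` helper is a hypothetical placeholder for whatever constrained-generation call extracts the yes/no logits, and the prompt wording and the tempering form of precision weighting are illustrative assumptions rather than the exact scheme from the slides.

```python
def p_yes(prompt: str) -> float:
    """Hypothetical LLM call: the probability mass the model puts on 'yes'
    versus 'no' for the given prompt (e.g. a softmax over the two logits)."""
    raise NotImplementedError("wire up a constrained-generation call here")

def likelihood(evidence: str, hypothesis: str) -> float:
    """P(E | H): would we have observed `evidence` if `hypothesis` were true?"""
    prompt = (
        f"Given that {hypothesis}, would we have observed the following? "
        f"{evidence} Answer yes or no:"
    )
    return p_yes(prompt)

def bayes_update(prior: float, evidence: str, hypothesis: str,
                 negation: str, precision: float = 1.0) -> float:
    """Posterior P(H | E) for a single Bernoulli hypothesis.

    `precision` tempers the likelihoods (one illustrative form of precision
    weighting, not necessarily the exact scheme from the slides): 0 keeps
    the prior untouched, 1 is the standard Bayes update.
    """
    like_h = likelihood(evidence, hypothesis) ** precision
    like_not_h = likelihood(evidence, negation) ** precision
    return like_h * prior / (like_h * prior + like_not_h * (1.0 - prior))

# Example usage with the (hypothetical) strings from the slide:
# posterior = bayes_update(
#     prior=0.5,
#     evidence="Walmart laid off 10% of its retail staff.",
#     hypothesis="Walmart has been severely impacted by the COVID-19 pandemic",
#     negation="Walmart has not been severely impacted by the COVID-19 pandemic",
# )
```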
So the world models that we've seen so far are fairly simple; in particular, we're modeling all our hypotheses as independent from each other, and this seems like a large simplifying assumption. So how might we support more complex world models where we condition our hypotheses on each other? Alright, so here's another thing that hopefully will be familiar to this crowd. Bayes nets have been around forever and are often used to model mechanistic failures in large systems, and can be understood simply as a distribution where each variable depends on some small number of ancestor variables. Perhaps more intuitively, we can also think of these as directed graphs where we have edges between variables, directed in the sense that the child is conditioned on the parent. So let's see an example. Alright, so now we have two different kinds of hypotheses: the ones which are parents in our parent-child setup are qualitatively more abstract than the children. So our two child nodes are similar to the hypotheses we saw before, "Market sentiment for Best Buy is poor" or "Walmart will grow its physical footprint this year," and then the more abstract hypothesis is "Retailers were negatively impacted by COVID-19." And so then we specify this conditional structure in the classical sense by specifying all of the joint probabilities upfront, or alternatively learning it given some healthy amount of training data. And again, we're proposing that instead of doing that intensive process, we can use language models to extract those probabilities in a natural way. One natural complaint at this point might be: well, this space is going to start to get very large. If we're trying to update these beliefs in time given some large data stream, this is going to become intractable given our current framework. So how might we start to augment that? Luckily there's been a lot of work historically done on this problem, and we can start to use things like message passing to update our beliefs in an approximate way. So as long as our Bayes net has a tree dependency structure, we can use things like the sum-product algorithm to update our beliefs. And I don't want you to worry too much about the equations on this slide. If you've seen this before, this kind of recursive structure will look familiar, but if not, the general intuitive idea is that we're going to compute these messages from all of the children or the neighboring nodes and use those to propagate through the graph to update each individual belief. But let's just see an example. All right, so suppose we observe some new evidence and we'd like to know how we should update the probability over hypothesis A, which as you remember is the parent node in our graph. So B here on the left-hand side is the belief, X_A is essentially just the variable representing hypothesis A, and then the phi and psi terms are specified by our language model, or by the Bayes net in the classical sense. And so we're summing over all of the values that the variables X_B and X_C can take, which in our case of course is just zero and one. You might recognize at this point that this is indeed the exact marginal probability for hypothesis A, and that's because our graph is singly connected in this example. So in general this isn't true, but it turns out to be the case in our example. And again, we can compute these phi and psi terms using the language model itself.
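As a rough illustration of that computation, here is a small sketch of the message-passing update for the three-node example (parent A with children B and C, all Bernoulli). The numbers in the phi and psi tables are placeholders; in the setup described above they would come from querying the language model for each conditional term, which is an assumption about how the pieces plug together rather than the exact code behind the slides.

```python
# Placeholder numbers throughout; in the setup above, each of these
# conditional terms would be extracted by prompting the language model.

# phi terms: how well the observed evidence supports each child value,
# e.g. P(observed evidence | X_B = b).
phi_B = {0: 0.2, 1: 0.7}
phi_C = {0: 0.5, 1: 0.4}

# psi terms: conditional tables P(child = x | parent = a), keyed by (x, a).
psi_B = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.2, (1, 1): 0.8}
psi_C = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.3, (1, 1): 0.7}

prior_A = {0: 0.5, 1: 0.5}

def message_to_A(phi, psi, a):
    """Sum-product message from a child to the parent A:
    m(a) = sum over child values x of phi(x) * psi(x | a)."""
    return sum(phi[x] * psi[(x, a)] for x in (0, 1))

# Belief over A: prior times the product of incoming messages, normalized.
unnormalized = {
    a: prior_A[a] * message_to_A(phi_B, psi_B, a) * message_to_A(phi_C, psi_C, a)
    for a in (0, 1)
}
Z = sum(unnormalized.values())
belief_A = {a: v / Z for a, v in unnormalized.items()}
print(belief_A)  # exact marginal here, because the example graph is a tree
```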
All right, so because we're all lovers of free energy here, I'm going to walk through how this is maybe the first example of a self-evidencing, or free-energy-minimizing, kind of model. As we noted, belief propagation isn't exact for more complicated graphs, and so it makes sense and might be useful to ask the question: how far apart, or when, are our beliefs close to the exact marginals? We often use things like KL divergence to compare the difference between two probability distributions, and that is explicated on the right-hand side. And then those of us who have a background in physics might recognize Boltzmann's law as well. This is just the idea that we might represent the probability of a given state using an energy function; we're not going to accept this as truth, maybe as some of us have done in the past, but we are going to just use it as a definition for the energy function, such that when we plug that term in here and expand out, we start to see the first two terms look a lot like the kind of energy and entropy functions which we are used to. And indeed we can identify those two terms as the Gibbs free energy. It makes me happy to see this all coming together. And in particular, it might make sense to note at this point that using world models which are self-organizing in this sense seems very compelling, since the kind of world model which we want is one which promotes the evidence that we've seen so far. Why is it useful to formulate this in terms of free energy, besides the fact that we all find it compelling here? There's been a lot of progress made by constructing analytically tractable approximations of the Gibbs free energy. I'm not going to go into the details here, but here are two examples where that's been fruitful. Alright, so now I'm going to chat briefly about how we might use these systems as recommendation systems, and indeed in the world today. One prime example of where a system like this might be useful is a situation where we have lots of data incoming at very high frequencies, and we always want to have some set of naturally discrete hypotheses that we're modeling beliefs over, which are being kept up to date at a very regular cadence. And so actually a lot of the muscle here is just reformatting documents, or however our data comes in, as evidence. This is not always obvious or easy to do, but once you've figured out that part (and in particular we've been using things like RAG, retrieval-augmented generation, or embedding-based systems to figure out when incoming data is relevant to a given hypothesis), once you've built up that machinery, the actual updating computations, as we've shown already, are actually pretty simple. So we do these likelihood computations, we update our beliefs, and then at any given time we can query that model for our marginal distribution over any given hypothesis. And it turns out that this kind of setup has many practical applications. It's also noteworthy that even very simple systems like these are out-of-the-box controllable and explainable. Just by storing the magnitude and the direction of the update to the posterior for each piece of evidence that we observe, we have a very natural built-in explanation for our belief at any given time.
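Stepping back to the free-energy identity gestured at earlier in that segment, here is one standard way to write it out, with q the belief produced by (approximate) belief propagation, p the exact marginal, and the energy defined through Boltzmann's law. This is a textbook reconstruction of the argument, not the exact notation from the slides.

```latex
% q: approximate belief, p: exact marginal, E(x): energy via Boltzmann's law.
\begin{align*}
  p(x) &= \frac{1}{Z}\, e^{-E(x)}
     \quad\Longleftrightarrow\quad E(x) = -\log p(x) - \log Z,\\
  \mathrm{KL}(q \,\|\, p)
     &= \sum_x q(x)\,\log\frac{q(x)}{p(x)}
      = \underbrace{\sum_x q(x)\,E(x)}_{\text{average energy}}
        - \underbrace{\left(-\sum_x q(x)\,\log q(x)\right)}_{\text{entropy } H[q]}
        + \log Z\\
     &= F[q] + \log Z .
\end{align*}
% Since \log Z does not depend on q, driving the (Gibbs) free energy F[q]
% down drives the beliefs q toward the exact marginals p.
```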
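And as a sketch of the "built-in explanation" idea at the end of that segment, the loop below records the magnitude and direction of each posterior update alongside the evidence that caused it. The `relevant` and `bayes_update` helpers are hypothetical stand-ins for the embedding/RAG relevance filter and the LLM-likelihood update discussed above; the class and method names are illustrative, not part of any released system.

```python
from dataclasses import dataclass, field

def relevant(evidence: str, hypothesis: str) -> bool:
    """Hypothetical relevance filter (e.g. embedding similarity or RAG retrieval)."""
    raise NotImplementedError

def bayes_update(prior: float, evidence: str, hypothesis: str) -> float:
    """Hypothetical wrapper around the LLM-likelihood Bayes update sketched earlier."""
    raise NotImplementedError

@dataclass
class TrackedHypothesis:
    statement: str
    belief: float                                # current marginal P(H)
    history: list = field(default_factory=list)  # (evidence, old belief, new belief)

    def update(self, evidence: str) -> None:
        old = self.belief
        self.belief = bayes_update(old, evidence, self.statement)
        # Store direction and magnitude of the move: this is the built-in explanation.
        self.history.append((evidence, old, self.belief))

    def explain(self) -> str:
        lines = [f"Belief in '{self.statement}' is now {self.belief:.2f} because:"]
        for evidence, old, new in self.history:
            verb = "raised" if new > old else "lowered"
            lines.append(f"  - '{evidence}' {verb} it from {old:.2f} to {new:.2f}")
        return "\n".join(lines)

def ingest(stream, hypotheses):
    """Route each incoming piece of evidence to the hypotheses it is relevant to."""
    for evidence in stream:
        for h in hypotheses:
            if relevant(evidence, h.statement):
                h.update(evidence)
```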
And that speaks to these applications where folks might really like to use a language model but require a robust causal relationship between the outputs and the explanation, which you don't get from a language model on its own. A system like this can be very appealing in those situations. So now on to further work. Everything that we've discussed today is early work towards integrating language models into more probabilistic frameworks, and there's a lot of exciting work being done in this vein right now. Some important questions which are especially interesting to me are which parts of the world model should be learned versus encoded, and how we want intelligence to scale. I mean that both in the sense of composing systems (naturally there should be some very natural way that we can compose two intelligent systems) and also such that we can scale them with compute. And I don't mean to restrict myself to the kinds of compute that we have today; we're also working on some exciting new computing paradigms at Normal which might be more compatible with software of this nature. Also, the two folks that we referenced at the beginning of the talk, Yoshua Bengio and Yann LeCun, have done really exciting work in this area which is very inspiring, and in particular GFlowNets are also probabilistic graphical models which I think folks will find a natural next step in reading, if you so desire. And that's it for me. Awesome. Thank you. Very cool. Dan, do you want to give a first reflection or thought? And then meanwhile, anyone who's watching live, please feel free to write questions and I'll relay them in. Absolutely. So hi, I'm Dan. I work with Phoebe on this project. And yes, the thing that's most exciting about this for me personally is twofold. One of them is that it's a way of avoiding having to trust a language model to understand and reason about text. It's not that they're that bad; the extremely strange thing about language models is that they're quite good at being almost good enough, but they're never quite something you could use. You could never use a language model to, I don't know, triage an important situation where a bunch of different things are coming in and you have to make a decision about which is important. The reason you can't do that is that you simply cannot understand the encoded biases, and you cannot get it to reliably generate reasoning. You can ask it for reasoning, but the thing that it prints out is not the reasoning that it used internally, because it doesn't reason. Fundamentally, while these have input and output in natural language, they are not artificially intelligent. They are just prediction machines. And so we have to be very careful about not anthropomorphizing them. So this is a way of using those incredibly powerful prediction machines in a framework where we can make sure that we essentially keep a record of what we're doing so that a human can look at it. Because there's a lot of talk in this world about post-human AI and those sorts of things, the idea that the machines will become intelligent enough, or that the machines will rise up in a slightly more alarming way. And that's all great and wonderful, but that's not particularly interesting to us at Normal. We're much more interested in mimicking explicit decision processes so that a human can audit them and can make these things work. So that's kind of the area that we're coming from. Awesome.
I'll go to a question from the chat. Josh asks: where does hypothesis relevance enter the calculus? Is it folded into confidence? Not sure if it ought to be. Just saw hypothesis relevance mentioned. Hypothesis relevance. Does that mean which hypotheses are conditioned on each other? Is it possible to ask a clarifying question there? Maybe, Dan, you have a better idea. They can follow up, but I also wondered about this. You might know what was relevant; maybe the temperature and the rainfall were relevant, but then how does this approach help us understand when one of those relevant factors no longer is relevant, or when a new relevant factor comes into play? Yeah, these are great questions. I think in terms of understanding, in an automatic sense, when two hypotheses are relevant to each other, we can leverage embedding-type language models for this kind of thing also, if we don't have a more structured, human-intuitive sense for when two hypotheses are related. In terms of how those relationships evolve over time, this is something that's really interesting to me. And I think looking at the theory behind structure learning, when we propose to add new nodes to the network or propose to add a new edge to the network or things like this, is a really exciting research direction, although I don't have a silver-bullet answer for how we should do that. Just to add a little bit more to that: it is a really interesting research direction. One of the things that Phoebe mentioned in the talk is that there is a difficult step that we're not talking about, which is actually translating this natural language into reasonable hypotheses. So there is a step in there where you take essentially a chunk of text and you have to decide if this is a hypothesis, if this is a hypothesis we've seen before, if this is a sub-hypothesis or a clarification of a hypothesis that we've seen before, and so on and so forth. That is in some sense part of the data processing, and it is an important one that we are continuing to work on and refine. The other thing, a different interpretation of the question around relevance, is: is the hypothesis relevant to the thing that you're looking at? We could have a hypothesis that the sky is blue, but if we are deciding whether or not we need to check the car's oil, the truth or not of the color of the sky is very irrelevant. And that then comes into the nice thing about having your world described as a collection of statements with truth values associated with them: you can directly reason over them. So you can put a classical decision framework over that, to take into account both the knowledge you have of the world and also which parts of these worlds are unknown. So in that situation, the person using the world model to construct a decision or an output will be responsible, in some sense, for assigning a weight or a cost to each hypothesis being true. And for some of those hypotheses, obviously, it will be zero, because again we do not care about the color of the sky if all I want to know is whether I need to change the oil in my car. So that's the other end of the answer. There's a version of the answer at the start of the information flow and there's a version of the answer at the end of the information flow. But it is a tricky point, and one that we are continuing to iterate on to try and find good ways on both ends of that.
Yeah, well, a lot there. It's very interesting how in that presentation and response I heard both about probability distributions on rules and rules on probability distributions, and which way around it goes, whether it's the tail wagging the dog or the cart before the horse: how to design these synthetic intelligence systems that appropriately bring together aspects that are more symbolic, more rule-like, and aspects that are more probabilistic, more embedding-like. So where does that land with you, or how do you see the design of these systems with mixed symbolic and probabilistic components? Yeah, that's a great point, and I think this really gets at which parts of the world model should be learned, or represented in some more discrete space, versus encoded based on our own human intuition for rules and structure. Maybe this would fairly be characterized as a cop-out answer, but I think it depends a lot on the application. When we're developing systems like this and trying to iterate through as many different hypotheses as we can quickly, choosing an application and benchmarking and testing and seeing what actually works is a go-to strategy for us. In terms of which parts should be fixed and are actually helpful to increase reliability, such that we can use our human intuition for how this particular system is built, versus which parts we want uncertainty over, that's a really important part of the learning process for us, that kind of iteration. So I think it probably depends on the application. Yeah, it definitely depends on the application. It also depends on where the actual challenge points are. So outside of this, we've got a few other things that we've released publicly that look at this idea of rules external to the system and whether or not we can add those in. One of them is something called constrained generation, where we force the model to only produce something valid, and that's quite a useful way of removing one particular source of stress from the model, which is that it may make syntactically or otherwise incoherent outputs that don't follow the rules. We can then take that as given and focus the rest of our energy on improving the bit that we don't have rules for. And a different version is trying to improve something by saying, no, you broke a rule, go back and make this true, so this kind of chain-of-thought-prompting type of idea. So yeah, the symbolic and the probabilistic, I think, in our minds live very closely together as two tools that don't completely solve the same problem. And I'm not sure how familiar anyone in the audience is with language modeling, but in the world of language modeling, before this explosion of neural networks and artificial-intelligence-type methods, there was a lot of work on the symbolics of language and grammars and all of that sort of stuff, and that work pushed quite a long way forward, and this work is pushing quite a long way forward, and I suspect the next thing is going to involve them joining up again, because they each have good points and they each have bad points, and, you know, two wrongs don't necessarily make a right, but they can make things less wrong. Nice. Yeah, recently we heard from Elliott Murphy talking about neurolinguistics and
about how the statistics of language are not the rules of language: you can always come up with a new expression that's never been uttered, that's not going to be in the training distribution. Okay, I'll ask a question from the chat from upcycle club. They wrote: what are some of the key challenges associated with developing such Bayesian world models? Hmm, I think we've touched on a bunch of them. The ones that are most top of mind for me right now are the structure learning thing that came up, so how do we understand when to propose new hypotheses and how to integrate those into the models, and then just figuring out this proposal and evolution process for the nodes themselves, since everything else in the framework works pretty automatically in a reasonable way, thanks to Bayes and thanks to the development of language models. And then moving from this more discrete case into a continuous case that more fully represents the space we're learning over can be challenging. Yeah, I would also say that, like, it's a sort of a maxim... maxism? What on earth did I just say. There's a common saying, let's go in that direction, there's a common saying in this world that no model ever survives its first encounter with data, and that becomes true here as well. As we've been building these things and using them, we've found lots of little spiky edge cases around making sure that the language models are actually doing what we want them to do. So there are a lot of questions in building these things around how you actually test that the components of it are working the way you want, and then, on a broader level, how you compare something that is fundamentally trying to solve a different problem to other methods. We are solving a problem under the constraint that we want a fully auditable system. We could also solve all of these problems by a thing called in-context learning, which is basically putting the context into the prompt of a large language model and asking it for the answer, and that also works, especially when you've got things like GPT-4, which are just wonders and glories; it works really well. So then we come to the question of how we actually make the case from a bigger-picture perspective: can we find benchmarks that reflect the structural advantages of this approach over something like in-context learning, that don't come across as false? So that's kind of a stranger answer, because it's not really about actually developing the world model; it's about convincing other people that it's a good idea. And that is a thing that is true of essentially all of the things on this slide as well; they are all quite complex and odd little methods where, you know, there's a degree to which we definitely can solve this an easier way. So what is the application or the benchmark where we can say, no, if you do it the easier way, you will fail at this measurable thing? Very interesting. So you mentioned the advantages of using world models that are self-evidencing rather than reward-maximizing, for example. So how do you see that playing out? And I can connect it back to active inference, of course, but how do you see this self-evidencing centrality play out in the kinds of models described here? There are a couple of reasons why it's so compelling to me, and the first just has to do with explainability. It's really convincing to people to say: why did we predict this, why do we believe this? Well, this is
the actual real-world data that we've observed, and this is the impact that it's had. And then, I don't know, you hear a lot about designing these really complicated reward functions, which are often a very clever pitfall, because they very easily become disconnected from the complex world that we're trying to model, and so you end up in weird local maxima or minima and you start solving the problem that you've designed rather than the problem which actually exists. I've just always loved the idea that what we should be doing is self-evidencing; from an intuitive sense, that feels like what an intelligent system should do. Yeah, I don't have anything interesting to add to that. I just agree with Phoebe on explainability and the capacity to explicitly reference previous data, including leave-one-out, so techniques from non-parametric statistics about the effect of adding in another piece of data or removing a piece of data. To bring it to a homeostatic setting, which is commonly considered in active inference: we're trying to be within a homeostatic temperature range, 37 degrees. Yes, we could propose reward functions, but as those start to include open-ended exploration, structure learning, just like you described it, Phoebe, we're solving the problem as designed rather than the actual question of the homeostatic temperature. And the sort of path-of-least-action, first-principles, physics-grounded intelligence perspective from active inference is: make it the kind of thing that measures itself at 37, and then as long as it is, it is, and when it isn't, it's dead. And that's the kind of mortal computing crossover: outside of its zone of surprise, it's not just that it's getting a bad grade in the class, it is a deeper failure signal than that. And to understand, okay, when is it a yellow flag, when is it a red flag in terms of the new scientific literature coming in, those have plain, straightforward interpretations, the kind of basal simplicity that developing larger, higher-order apparatuses will never return to. Yeah, couldn't agree more. Yeah, I mean, absolutely. The other thing that it can do quite well is deal with essentially outlier studies, so situations where you have a new piece of information that is strongly conflicting with all the previous pieces of information, and trying to work through what that really means. And there's a degree to which we can even extend this process to multiple agents that have these belief systems, and then look at consensus of experts or weighted consensus of experts. So for instance, you could have a weather-vane type of situation where somebody really over-indexes to every new piece of information, and technically you would do that with maybe a power posterior type thing, or you can have somebody who has built in strong priors in a particular direction, and you can then take your consensus of artificial decision makers, all of which have, within their universe, well-reasoned updates to the data, and then you can look and try and work out what that swarm of experts can tell you, and do very empirical things like try to work out which of these experts is doing well at a particular moment in time. Because there could be times when the world, or the problem you're solving, is very chaotic, in which case the over-indexing expert would probably be a pretty solid bet, while there are other times where things are
pretty stable, and it would be possible that the more conservative expert is empirically making good decisions and good recommendations. So there are a lot of ways that we can not just incorporate these homeostasis ideas, but also change what that means for different agents, do that artificially, and then combine them together to try and get, I'm blanking on the word, but you know, a forecast under a hypothetical set of situations, and we can actually bring those ideas of the world forward and see what happens when they meet with actual information. Yeah, this angle of mixture of experts, as it's sometimes called more in the language model space, or ecosystems of shared intelligence or diverse intelligences in the active inference area, that's very interesting, and obviously has connections back to human teams and teams of beyond-humans and so on. A lot of this is still text-based, so maybe you did or didn't mention what representation the Bayes graphs are in, but they're plain text. There was a lot of discussion about bringing natural language, scientific papers or however it comes in, into a structured form, and then the explain method that you showed kind of takes the structured form and just gives it a little syntactic fluency so it looks human-readable. So how do you see that essence coming into play with models, and then with action in the world that isn't just producing the next text token, but a robotic actuator or modifying some other control element of the world? Yeah, that's a really good question. I honestly haven't thought much about multimodal stuff in this particular context, but I think the framework is general enough at this point that it definitely could support lots of different modalities. I'd be really curious to see how this did with something like audio in particular. And then to your point about this discrete-versus-continuous relationship, I think that's part of what we're learning: how to go from these long natural-text documents to a system which is appropriately discretizing our hypotheses such that we have these meaningful explanations, like you mentioned. So I think continuing to develop robust ways of surfacing those explanations is a big part of this as well. Over time we're going to observe lots and lots of evidence; how do we make sure hypotheses don't get stale, and how do we use evidence to know when they are? Things like this are part of that also. I don't know if that directly answered your question, but that's some of the stuff that I've been thinking about related to it. So in examples like moving towards robotics and text and video generation and image generation, audio, other sorts of multimodalities, to be honest, I think of these processes in general as building a world model to enable a sort of sequential decision process.
So if that decision process happens to be "should the robot turn left," and that's what the decision process is, it's multimodal in a very classical sense, in that you can put any type of decision framework over the top, but it's not generatively multimodal. I'm not saying write me a song that sounds like Beyoncé and a song that sounds like Beyoncé comes out. I think this sort of Bayesian world model layer is a building block towards that sort of thing, but that's really not the aim of what we're trying to do. It's also, within Normal, our almost, I don't want to say mantra or manifesto because that sounds culty and no one wants to sound culty, but our basic aim is to always center humans within our process, and with some of this multimodal stuff it's less clear where the human lives. So for instance, in a video-generation type thing, where does the human live? Keeping it at this abstraction of sequential decision making, it's a decision that a human could do; a human with their thumbs could be moving a robot around and doing that sort of stuff. But yeah, it's really all about controllability and auditability for us in a sequential decision process. So to the extent that that leads into a multimodal world, that's part of what we're doing, and some versions of multimodality we're just not playing in that particular space. Yeah, not a great answer, but a long one; let them distill it down later. In the auditability area, it almost falls out to me to be like a syntax of auditability and a semantics of auditability. At the syntactic level, just tagging or versioning when a file comes in or when a given computation is executed, that is basically transferable across all settings. And then where I see you honing in with this work is kind of the semantic auditability, which is actually how we compose our accounts, like "I decided to walk because this happened." And so bringing that different kind of trace to systems, what will it open up in science or education, or how do you see this sitting at a console somebody is at now and making things different, and over what timeline? Yeah, I mean, it's really quite nice for storytelling, because as you said, you can say things precisely, like, well, because we observed this thing, or, if we had observed something else; maybe you can even make statements which are conditional in that sense. I think it does empower whoever is sitting in front of this data to feel really sure about, again, the reasoning that went on, which to me is pretty different from what it feels like to sit in front of ChatGPT, even though that's quite useful: often, you know, you try the code and it works or doesn't work, or you ask your friend, is this really true? And that feels pretty different to me from being able to look at the evidence itself and say, oh, well, actually, if this is the reason you think that, that evidence is not true; or, you know, you can bring your own human intuition or world model to bear in terms of validating or superimposing what you believe on top of what the system believes. And so that makes it really easy to make decisions quickly.
I think there's also a converse of this, which is that it also makes it clear which evidence was not used to make a decision, and that can be quite telling in situations where you could be worried that a particular type of evidence isn't being weighted correctly or isn't being formatted correctly. So again, if this is a system that builds an assistant that surfaces all this information and makes a recommendation with reasoning for a person, that person can then look, and they know what the data is, they can look at the data and say, why didn't you consider the make of the car, or why didn't you consider this, or why didn't you consider that? And they can then use their understanding of what's not being prominently used by the model to sense-test it. In some sense that usage is a reformulation of what Phoebe just said, where you use your internal world model, but I think it's important to know when evidence is being used, and this is something I think you simply cannot get from a GPT-type thing or any prompting-based method. We know, for instance, that the order you submit your evidence in is probably going to matter for a prompting-based method. That's obviously not true for a Bayesian update, where we have this coherence principle: if you shuffle your data and enter it in a different order, you will get the same posterior, up to computational artifacts. So all of that, in my mind, is just as important as the ability to write a report that says I made this decision for these reasons. Yeah, well, that makes me think about this kind of view-from-the-inside interpretability, where the rules help, and also knowing what evidence is not used is important for compliance, and knowing what information, like in a healthcare setting, was or wasn't used. What about thermodynamics? We heard about free energy; Boltzmann came up. How do you see the info-thermo nexus? What have we learned from the last hundred years of thermodynamics and information theory and all of this, and on the software or hardware side, how is that kind of free-energy nexus being used? I'm really excited about how all of this seems to be coming together. Free energy just keeps showing up in all of these exciting areas to me. We have a book club for singular learning theory, and they talk all about free energy too, and I think some of those ideas are really exciting. At Normal, the thing that I would highlight is this idea of software-hardware co-design, which is really special, really trying to do this hard and fun dance towards each other, where we're thinking about these new kinds of systems and how they might support each other and empower each other, and how to build full-stack systems, which is really challenging and also really exciting. And from the first-principles-of-thermodynamics perspective, we're all just mathematics and physics people at heart, so taking all of what people have learned in language modeling very much to heart as well, and approaching whatever problem we're facing from first principles, how do physical systems work in the world, what are the assumptions we're really comfortable with, and building up from there, is definitely our natural mode. So I think that makes it easier to start working together. It's also probably worth saying that we
have a sort of secondary, not secondary, a very different stream of interest in thermodynamics as well, which is the ways that we can use actual physical thermodynamic principles to build hardware that specifically has noise in it as a first-class citizen, and because of that is particularly well suited to probabilistic tasks. And so we've built, if anyone wants to look, we have a blog, I believe the URL is blog.normalcomputing.ai, and amongst other things on it there is the very first demonstration of using physical thermodynamic hardware to actually do computations. Is the computation the most vital computation that we will ever do, inverting an 8x8 matrix? No, we can do that otherwise. But it is building up towards this idea that we can use thermodynamics not just in our modeling and our understanding of the world, but also in our low-energy compute stack, to actually realize these things. So I think it would be challenging to find a group of people on this earth who have more investment in thermodynamics and don't work in a physics department, because we have investment all the way through, from active-inference-type things all the way down to "this hot thing goes there," which is kind of cool. I'm not a physicist, so that's my understanding of thermodynamics: hot thing goes there, informative thing goes here, hot thing over there, call it a day. Yep, it's a really cool fusion of the kind of parsimony and elegance and aesthetic of math and physics and first principles with the different parsimony of pragmatism, with the actual material basis, like of a synapse: the size of the synapse and the kind of stochasticity that that size alone entails, with membranes and all this. Those stochastic aspects are leveraged for the compute; the synapse is not simply a variance-reducing machine. And so the platonic slash mathematical-ish spirit finds a common home in these real, simple physical demonstrations. And today it feels like there's a big gap between the mesoscale computational architectures that you describe today, which are very much running on the kind of von Neumann architecture, Turing-completeness paradigm, and yet they are very tantalizingly close to a physical object that has a constrained rule de facto, like only one thing can come out of this at a time as long as the funnel is this wide. And so bringing the rules and the regularities of what we call physical things to bear, with the fundamental and the imposed constraints on informational spaces: very cool directions. A note about just where active inference and action play a role, and also the hypothesis generation you mentioned: in the proactive stance, where we're using expected free energy or something like it to calculate future courses of action over observations that we haven't seen yet, moves that haven't been made yet, there's an explicit epistemic value. And so that can be diagnosed and observed as a measure of where a given computation is on the continuum between purely pragmatic value, just constraint-satisfying and realizing preferences and expectations, and then pure epistemic value, where all outcomes are good and the more information gain the better. And then being able to take control of that balance and know, amidst changing situations, again taking a probabilistic or global-based approach there, when epistemic and pragmatic, like gas and brake, come into play. These are very basal control knobs or features in active inference that are just not
going to show up at the 50th layer of "scaling is all you need." Yep. Cool. Well, do you have any other thoughts or things you want to add, or questions, or where things are heading for your work? Nothing to add at this moment, but certainly excited to keep in touch with this community and, yeah, keep collaborating. Yeah, absolutely. And we share, I mean, we write papers and stuff, but we share most of what we do, be it academic work in the machine learning space or in the physical hardware space, on our blog, which is blog.normalcomputing.ai. Thank you so much for inviting us, it's been very fun. Awesome, thank you. Hope to speak again soon. Peace. Bye bye. Bye bye.