Morning, everybody. David Shapiro here with another video. Today's video is GPT-4 rumors and predictions. If you're watching this, you have probably seen a graphic like this on Twitter or LinkedIn or somewhere else. It supposedly represents the alleged parameter count of GPT-4. Now, this graphic has been circulating for at least a year, so take it with a grain of salt — I don't know that a GPT-4 architecture even existed a year ago. Who knows? But it shows the relative size: GPT-3 at 175 billion parameters versus GPT-4 at an alleged 100 trillion parameters, almost a thousand times larger. So what are we going to cover today? We're going to look at numbers, facts, and data. We'll talk about some rumors at the end, but mostly we'll look at trends and data to try to figure out what we should actually expect from GPT-4. First, let's talk parameter count. This is one of my favorite graphs of all time, and it shows a very clear trend of exponential growth in parameter counts over just the last five years. We've gone from 100 million parameters to over a trillion if you add in Google's Switch Transformer and Wudao, which would sit right up here — still right on the trend line. Now, there are two rumors. One rumor is that GPT-4 is, quote, "not much bigger than GPT-3" — I heard that one a while ago. We've also heard it's a hundred trillion parameters. Who knows?
If you extrapolate the trend line to 2023, we should be somewhere above one trillion parameters, maybe in the ten trillion range, so I wouldn't be surprised if GPT-4 lands somewhere between one and ten trillion, maybe twenty. Let's take a quick look at OpenAI's track record. At the beginning of 2019 they released GPT-2 at 1.5 billion parameters, and just over a year later they released GPT-3, which was more than a hundred times larger. That's one major generation: roughly a 100x jump. If that trend had persisted, GPT-4 could have been a hundred times larger again, which would be about 17.5 trillion parameters. But 2021 and 2022 came and went without a new flagship model. So who knows — maybe because they've been back in the workshop for the last couple of years, we're essentially going to skip a generation; it kind of looks like we already did. Maybe a thousand-x jump isn't so far out of the question: the time from GPT-2 in early 2019 to Google's Switch Transformer in early 2021 — which was roughly a 1000x increase in parameter count — was about two years, and we're now more than two years on from GPT-3. It's not outside the realm of possibility, but it would represent an acceleration — we'll get into the constraints in a moment — and the claim of a thousand-x jump in one generation still seems kind of sus to me. I did hear someone suggest that this is why they released GPT-3.5: that 3.5 is the intermediary step. Who knows? Okay, so that's parameter count.
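To make that generational math concrete, here's a quick back-of-the-envelope sketch. The parameter counts are the publicly cited figures; the projections are just my extrapolations, not claims about what GPT-4 actually is:

```python
# Back-of-the-envelope generational scaling, using publicly cited parameter counts.
gpt2 = 1.5e9      # GPT-2, early 2019
gpt3 = 175e9      # GPT-3, mid 2020
switch = 1.6e12   # Google Switch Transformer, early 2021

gen_jump = gpt3 / gpt2  # roughly 117x in one generation, GPT-2 -> GPT-3
print(f"GPT-2 -> GPT-3: {gen_jump:.0f}x")

# If GPT-4 repeated a ~100x jump:
print(f"100x projection: {gpt3 * 100:.3e} parameters")   # 1.750e+13, i.e. 17.5 trillion

# The rumored 1000x jump:
print(f"1000x projection: {gpt3 * 1000:.3e} parameters")
print(f"GPT-2 -> Switch: {switch / gpt2:.0f}x in about two years")
```

So a repeat of the GPT-2-to-GPT-3 jump lands right in the 10-20 trillion range the trend line suggests.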
Next, let's talk sparsity: is it sparse or dense? The first thing you need to know is that human brains are incredibly sparse. Our brains are composed of repeating circuits called microcolumns. A microcolumn is a cluster of roughly 60 to 100 neurons, oriented vertically — which is why they're called columns — through the neocortex, and most of their connections are local: something like 90% of a microcolumn's connections go to neighboring or nearby microcolumns. But some axons are very long and make very distal connections. So in this picture, most of the connections go to the neighbor, but one jumps way over here. What researchers have found — and I don't just mean OpenAI, this is everyone researching deep neural networks — is that with pruning or distillation you can remove a lot of connections and still get very similar performance. What does this buy you? It makes the network much faster, because it's doing less processing, and it requires a lot less memory, because the tensors are much smaller. So if we're talking about a giant leap from 175 billion parameters to 100 trillion — a thousand-x — you can see that a dense network would not scale well. If there really is a jump in parameter count like that, we almost certainly also have a switch from dense to sparse networks. That's my personal prediction. Another way to see the constraint: I don't know that it scales exactly one-to-one, but GPT-3 takes about 700 gigabytes of memory at full precision, and a thousand times as many parameters at the same precision would be 700 terabytes. A machine like that would cost hundreds of millions of dollars to run, and I don't think OpenAI has invested that much in a single computer — if they have, that's a really big deal. The memory and compute savings you get from sparse networks are probably how you'd actually pull this off.
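Here's a minimal sketch of what magnitude pruning looks like in practice: zero out the smallest-magnitude weights and keep only the survivors. This is my own toy NumPy illustration of the general technique, not anything OpenAI has published:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights, keeping (1 - sparsity) of them."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512))

pruned = magnitude_prune(w, sparsity=0.9)  # drop 90% of connections
kept = np.count_nonzero(pruned) / w.size
print(f"fraction of weights kept: {kept:.2f}")

# Stored sparsely (value + index per nonzero), memory drops roughly 10x at
# 90% sparsity -- and matrix multiplies can skip the zeroed connections.
```

In practice pruned networks are usually fine-tuned again afterward to recover accuracy, but the memory math is the point here.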
The other alternative is that maybe it is smaller — maybe it isn't much bigger than GPT-3, maybe it's in the one trillion or ten trillion parameter range — in which case it would still benefit from sparsity. So I'm going to put my money on GPT-4 being sparse. This could also be why it has taken two-plus years: GPT-3 arrived in early 2020, and now it's early 2023, so it's been a while. Maybe they went back to the drawing board and said, okay, now we've got to master sparse networks. If that's the case, great — that could be why it has taken so long, because switching from dense to sparse means new training algorithms: how do you do dropout, what do your loss functions look like? There's a lot more to it than just pruning connections; you need algorithmic changes. I'm still putting my money on sparse, and I suspect sparse is the way to go because of technologies like Google's Switch Transformer and Wudao. Speaking of which: if it moves to sparse, with algorithmic changes, is it going to have a different architecture? Maybe something like the Switch architecture, which only activates the parts of the network it needs — in which case you can have enormous neural networks where most of the network stays off. You know the myth that you only use 10% of your brain? It's not true, but the reason it originally caught on is that when you're doing specific tasks, relatively small regions of your brain light up a little more than the rest. Your brain has a basal metabolic rate, a baseline rate of oxygen consumption; it uses energy all the time just to keep itself alive and keep its synaptic pathways healthy. But when you're doing a specific task, specific regions activate — your occipital lobe lights up when you're doing a visuospatial task, for example.
(Language, for what it's worth, mostly activates temporal and frontal regions, not the occipital.) The point being: the human brain doesn't activate everything all at once. It would be too chaotic — like you took a bunch of stimulants while at a rock concert while running a marathon. Your brain can't handle that much activation. There was a movie called Lucy that asked: what if there were a synthetic substance that could activate your whole brain? I'm telling you, that would look like a grand mal epileptic seizure, not like Lucy. So if we're going neuromorphic by using sparse networks, maybe there's also a new architectural paradigm — and it's not even new: Google already implemented it (and I don't know who invented it before them) with their Switch Transformer. One of the differences is that you've got these little routers that basically say, okay, we need this part of the network over here, so send the data there. Rather than activating individual circuits of parameters, you activate larger units of circuits. So maybe what we're getting is a slightly more neuromorphic architecture that starts to approximate things like microcolumns and cortical regions. Now, this is wild speculation on my part, based on reading a bookshelf full of neuroscience and watching where this is going — it seems like the more brain-like we make these things, the better they get. We have exactly one working model of strong intelligence, so why not copy the brain as closely as we can — not just individual neuron behavior but the behavior of microcolumns and cortical regions, or at least an approximation of it? Obviously it's not like there's a literal occipital region of the network.
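Here's a toy sketch of the routing idea behind a Switch-style mixture-of-experts layer: a small router picks one expert per input, and only that expert runs. This is my own simplified NumPy illustration of the concept, not Google's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 8, 4

# Each "expert" is just a weight matrix; a real Switch layer has many of
# these, but only one runs per token.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))  # tiny routing network

def switch_layer(x: np.ndarray) -> np.ndarray:
    """Route each input vector to its single highest-scoring expert (top-1)."""
    scores = x @ router                  # one score per expert
    chosen = int(np.argmax(scores))      # top-1 routing, as in Switch Transformer
    return x @ experts[chosen]           # the other experts stay completely off

x = rng.normal(size=d_model)
y = switch_layer(x)
print(f"output shape: {y.shape}")
```

The total parameter count scales with the number of experts, but the compute per input stays constant — that's the "most of the network stays off" property.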
Anyway, I'm getting lost in the weeds. Point being: I'm wondering whether GPT-4, if it is much bigger, allows for these kinds of specialized regions, which could also save on processing and memory. Brains have been around for hundreds of millions of years, so nature has had a while to optimize this algorithm. Okay, let's talk training data: what was it trained on? The general rumor I've seen on Twitter and elsewhere is that GPT-4 has been trained on, quote, "a significant portion of the internet" — i.e., most of the internet. Obviously they probably wanted to filter out a lot of stuff, but at that point you're mostly discriminating against the bottom 25% or 50% of content: gibberish, flat-out lies, genuinely harmful material. There are all kinds of ways to rapidly score content and exclude it — don't let the worms into your brain. There's a concept called the information diet: if you consume really hateful content, you become a more hateful person, because now it's in your brain; if you consume a lot of bad news, you become anxious, because that's in your brain. The same is true of neural networks, or any machine learning model: if you let the toxic stuff in, the toxic stuff is there. So if it was trained on "a significant portion of the internet," I hope they had a really good filter — maybe a reinforcement learning signal, or other ways of discriminating against data: preferring content that is more factual, less hateful, more conscientious and kind, et cetera. Who knows — that's just what I would think about if I were responsible for scraping together a huge dataset. One thing to keep in mind is the scaling laws of data; there were some comments about this on my last YouTube video.
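As a toy illustration of the rapid scoring-and-exclusion I'm describing — this is my own made-up heuristic, not anything OpenAI has described; real pipelines use trained classifiers — a filtering pass might look like:

```python
# Toy content filter: score each document with cheap heuristics and keep
# only those above a quality threshold. These rules are purely illustrative.
BLOCKLIST = {"lorem", "xxxx"}  # stand-ins for gibberish/spam markers

def quality_score(doc: str) -> float:
    words = doc.split()
    if not words:
        return 0.0
    score = 1.0
    # Penalize blocklisted junk tokens.
    junk = sum(w.lower().strip(".,") in BLOCKLIST for w in words)
    score -= junk / len(words)
    # Penalize screaming-caps content.
    caps = sum(w.isupper() for w in words)
    score -= 0.5 * caps / len(words)
    return max(score, 0.0)

docs = [
    "The mitochondria is the powerhouse of the cell.",
    "lorem lorem xxxx lorem BUY NOW xxxx",
]
kept = [d for d in docs if quality_score(d) > 0.5]
print(len(kept))  # the junk document is excluded
```

The real versions of these scores would be things like classifier outputs for factuality or toxicity, but the shape of the pipeline — score cheaply, threshold, discard — is the same.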
I don't know that there's actually consensus yet on exactly how much data, but in general, to get linear improvement in large language model performance you need exponentially more data. So the amount of data might actually be the biggest constraint, because if GPT-3 was already trained on a huge portion of the quality information on the internet, then: is there ten times more data than GPT-3 was trained on? Probably. Is there a thousand times more? I don't know about that — certainly not if you're discriminating against low-quality data. I heard a rumor — I think it was Bakz who told me this — that this is why OpenAI built Whisper: they ran out of data and needed more, so the idea was, let's transcribe every audiobook, every podcast, every YouTube video. Whisper converts audio data into text data, so maybe that's why they did it. I don't know. The thing is, Whisper is still not speaker-aware: it doesn't do speaker identity recognition, it just produces a raw one-to-one transcription, because identifying who is speaking is a much harder problem. (At least I don't think it does — it might be able to discern when one person stops speaking and another starts. Not sure.) Then there's another question: what about private datasets? Every big company has many petabytes of data; every university has many petabytes of data; none of it is accessible on the internet, and most of it isn't even accessible via API. It's cloistered, buried off in warrens. Some of the most valuable data in the world is not available on the internet, and that represents one of the biggest gaps for anyone who wants to create AGI.
Because if you're trained only on internet data — yes, a significant portion of human knowledge and wisdom is on the internet, but the most valuable stuff is not. Most of the content in my bookcase is not available on the internet; you still have to read paper books or e-books to get it, and much of that is protected by DRM. You can't just take it and read it — or maybe you can; I think it has yet to be fully litigated whether you can train a neural network on an author's book without their permission. Point being: did they get access to all credible published works, not just what's on Project Gutenberg? Now, one thing I want to point out is that for about the last year and a half, OpenAI has been putting out calls saying they want to work with top experts. My first thought was: they're going to talk with top experts to figure out what data to add. Or maybe to test it — who knows. I have also not seen much work on external integrations, because OpenAI seems laser-focused on the model itself. They're not thinking about cognitive architectures; they're not thinking about knowledge bases or anything like that. They're focused just on the model — and they seem to be doing pretty well at it — but to me that represents a big gap, and we'll talk about gaps at the end of this video. Okay, let's talk about modality. Modality is what kind of data the model handles. Is it just text? GPT-3 is text in, text out — that's it. But OpenAI also has DALL-E and Whisper, which handle images and audio respectively, so they're clearly experimenting with other modalities. I haven't really seen any papers about how to integrate them, though.
One thing I will say is that they all have text in common: DALL-E is text-to-image and Whisper is audio-to-text, so they're working in that direction. But I think you have to be able to go both ways. If Whisper had another module that let you go from text to audio as well as audio to text — back and forth — that would tell me it's closer to being ready for full integration with a large multimodal model. Ditto for DALL-E. And of course audio synthesis, image synthesis, and image-to-text all exist out there, but until we can do both directions with one model, I don't think it's ready for integration into a GPT-style model. That said, I do remember seeing a paper — I think it was from Google — that wanted to treat all data as just bits and bytes: don't tokenize text, tokenize at the byte level, and then it doesn't matter what file type you put in. You could feed it a text file, a Word document, a JPEG, whatever. That could be another direction things are going; I don't know if that research ever panned out, and I don't remember exactly who published it or what chunk size they used. For comparison, GPT-3's vocabulary is about 50,000 distinct tokens, which is just under 2^16, so each token ID fits in 16 bits; a byte-level scheme would presumably use chunks on a similar scale, and then you could represent any data type. But I don't know if we're there yet, and the other problem is output: if you train a deep neural network to read any file type, how do you know what file type it's going to put out? You'd probably have to standardize the output format. So I suspect we're just going to stick with text.
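To show how trivially file-type-agnostic byte-level tokenization is — this is just my own toy sketch of the idea, not the actual paper's method:

```python
# Toy byte-level "tokenizer": every file, regardless of format, becomes a
# sequence of integer tokens in [0, 255]. Real byte-level models build
# larger learned vocabularies on top of this, but the input is uniform.
def byte_tokenize(data: bytes) -> list[int]:
    return list(data)

def byte_detokenize(tokens: list[int]) -> bytes:
    return bytes(tokens)

text = "hello".encode("utf-8")
jpeg_magic = bytes([0xFF, 0xD8, 0xFF])  # the first bytes of any JPEG file

print(byte_tokenize(text))        # [104, 101, 108, 108, 111]
print(byte_tokenize(jpeg_magic))  # [255, 216, 255] -- same token space as text

assert byte_detokenize(byte_tokenize(text)) == text  # lossless round trip
```

The vocabulary collapses to 256 symbols, so the model has to learn all structure — words, file formats, everything — from the byte stream itself, which is exactly the trade-off those papers explore.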
I haven't really seen quite enough evidence that GPT-4 is going to be multimodal; I'll be surprised if it is. Let's talk about window size. I picked a very small window here because those of us who were the OG users of GPT-3 will remember the 2,000-token window limit, and that got real limiting real fast — it's barely enough for a good few-shot prompt, and then you don't have much left for the output. ChatGPT is rumored to have 8,000 tokens; that seems to be the general consensus, and having used ChatGPT, I kind of agree, because it seems to have a pretty long memory. That said, there are other tricks that can make a model appear to have a longer memory: recursive summarization, scratchpads, search. But people seem to agree ChatGPT has 8,000 tokens — it might be more, might be less; we just noticed the trend of 2,000 originally, then 4,000, and we suspect 8,000 today, so it seems to be doubling. Now, one rumor from Twitter said that GPT-4 can write a 60,000-word book from a single prompt. I'm pretty sure this person was BSing us. Normal English averages roughly 1.3 tokens per word once you count spaces and punctuation, so across the whole document a 60,000-word book would be on the order of 80,000 tokens, possibly more. I seriously doubt that; it seems really sus. What we're more likely to see is something in the 8,000-token range — do the math; it could be more. Window size is one of the biggest constraints for all of us, whether you want to do legal documents, fiction, medical texts, or scientific research.
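The recursive-summarization trick mentioned above is simple enough to sketch: keep a rolling summary so the prompt never exceeds the context window. In this toy version the summarizer is just a truncation stub — in a real system that call would be the language model itself summarizing its own history:

```python
# Toy sketch of recursive summarization for "longer apparent memory".
WINDOW = 200  # pretend context budget, in characters

def summarize(text: str, budget: int) -> str:
    return text[:budget]  # placeholder stub for an LLM summarization call

def chat_turn(summary: str, user_msg: str) -> str:
    """Fold the new message into the running summary, then re-compress."""
    combined = summary + " " + user_msg
    return summarize(combined, WINDOW // 2)  # leave half the window for output

summary = ""
for msg in ["Tell me about sparse networks. " * 5,
            "Now compare them to dense ones. " * 5]:
    summary = chat_turn(summary, msg)

print(len(summary) <= WINDOW // 2)  # the "memory" stays bounded forever
```

The cost is lossy compression: details inevitably fall out of the summary, which is why this only makes the window *appear* longer.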
Because here's the thing: while GPT technologies will write through any problem you give them — no writer's block, no inhibition, they just write — and while they have a lot of latent space, a lot of memorized knowledge, the working memory they have access to is much, much smaller than human working memory. That's one of their biggest constraints. It's like giving a task to a toddler: you can give a three- or four-year-old one instruction, and that's all they have the brain capacity to handle. I guess what I'm saying is that GPT-3 is roughly like a toddler who knows just about everything — that analogy came from my interview with Anna Bernstein, and I really like it. So is GPT-4 maybe as smart as a six-year-old? Who knows. For reference, 8,000 tokens is roughly 6,000 words, or about 20 pages of text, so you could have a couple pages of input and still have plenty of room for output — that would be pretty good. Honestly, I'm pretty excited even about 8,000 tokens; there's a lot I could do with that. There are still limits, though. So: aiming for 8,000, hopefully more, but little windows are no good. Next: confabulation. I have not seen any research about confabulation — that doesn't mean it hasn't happened, just that it hasn't percolated up to my attention. Everyone who has worked with prompt engineering knows that if you tell GPT not to do something, it will do it anyway, because it really does not understand negatives. In my interview with the folks at Tau, they explained that this is a mathematical limitation of this architecture type: it doesn't understand formal logic; all it does is generation. In human neural terms, it doesn't have inhibition.
We have two overarching types of neurons in our neocortex: excitatory neurons and inhibitory neurons. You have neurons that want to generate signals and neurons that want to stop signals, and you can have relationships where activating one neuron deactivates a whole other circuit — that's an inhibitory function. As far as I know, no comparable inhibitory research has been done in deep neural networks, and if that's true, it represents a huge fundamental gap in their abilities. If they can't say "no, don't do that," they can only generate. It's like driving a car that only has a gas pedal: you can steer and you can go, but to slow down you have to wait until you find a hill or run into something. The only ways these models have to stop are logit bias, which just penalizes certain tokens — and I've found that it doesn't even work most of the time — and stop tokens, which simply cut the output off. That's not the same as neural inhibition in humans. Human inhibition is: I'll start thinking in this direction — eh, there's no value there — so I'll stop that thought process and go another way. Inhibition lets us perform a task you could call wayfinding, or task location, in our own minds. GPT technologies can't do that. All you can do is give a positive instruction — do this — and try to say "avoid this," but you really have to give the model a very clear target. You can't give it a laundry list of things not to do; it doesn't understand that, because it doesn't have that inhibition capability.
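To illustrate what logit bias actually does mechanically — this is a toy NumPy sketch of the idea, not OpenAI's implementation — a large negative bias just crushes one token's probability before sampling:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy vocabulary of four tokens; index 2 is the one we want to ban.
logits = np.array([2.0, 1.0, 3.0, 0.5])
banned = 2

probs = softmax(logits)
print(f"banned token prob before bias: {probs[banned]:.2f}")

# Apply a bias of -100, the strongest value the OpenAI API's logit_bias accepts.
biased = logits.copy()
biased[banned] += -100.0
probs = softmax(biased)
print(f"banned token prob after bias:  {probs[banned]:.2e}")  # effectively zero

# Note this only vetoes specific tokens -- it can't express "don't pursue
# this whole line of thought," which is what neural inhibition would be.
```

That last comment is the point: token-level suppression is a blunt instrument compared to inhibiting an entire direction of generation.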
And again, I've seen no research making me think they're going this far with neuromorphic algorithms, so confabulation is likely to still be a major problem with GPT-4. You can try to improve it with fine-tuning, but that's just papering over the problem. People have done lots of experiments with ChatGPT where it says, "I'm sorry, I don't know that," and then you can game it — "hey, pretend that you do" — and it goes, "oh yes, actually I do know this." So fine-tuning is a minimal improvement against confabulation. What we really need is a fundamental shift in the algorithms: an inhibitory function added to artificial neural networks to get more human-like behavior. Brain equivalents: I did find this graphic, and it's wrong, so let me start there. It claims 100 trillion synapses in a human brain. The typical human brain has roughly 86 to 90 billion neurons, and each neuron has on the order of 7,000 synapses — do the math, and depending on your assumptions you're looking at hundreds of trillions of synapses, pushing toward a quadrillion. That said, a parameter is not equal to a synapse, and that's what's most wrong about the graphic. There was a paper released a while ago estimating that it takes on the order of a thousand parameters of a deep neural network to approximate one single biological neuron. So are we talking neurons or synapses? If that estimate is right, then GPT-3 with its 175 billion parameters has the equivalent of only about 175 million neurons, compared to the human brain's roughly 90 billion. We've got a long way to go.
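Spelling out those back-of-the-envelope numbers (the neuron and synapse counts are rough textbook figures, and the 1,000-parameters-per-neuron ratio is the estimate from that paper):

```python
neurons = 86e9              # rough count of neurons in a human brain
synapses_per_neuron = 7e3   # order-of-magnitude figure
total_synapses = neurons * synapses_per_neuron
print(f"synapses: {total_synapses:.1e}")   # ~6e14: hundreds of trillions

params_per_neuron = 1_000   # estimated parameters to emulate one neuron
gpt3_params = 175e9
neuron_equiv = gpt3_params / params_per_neuron
print(f"GPT-3 neuron-equivalents: {neuron_equiv:.3e}")  # 1.75e8, i.e. 175 million
print(f"ratio vs human brain: {neurons / neuron_equiv:.0f}x")
```

So by this crude measure GPT-3 is still a few hundred times short of a human brain in neuron-equivalents, before you even get to synapse counts.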
So the proportions in the graphic are roughly-ish okay, but the whole point here is: take it with a grain of salt. Even if GPT-4 is a thousand times bigger than GPT-3, it's still way down here — but it does scale exponentially, so if it goes up another thousand-x, it could reach human-brain level. Maybe we're just one or two generations of large language models away from human equivalence, at least in terms of raw numbers. Again, there are other problems, like modality and inhibition circuits. Okay, let's start to wrap up and talk about some other rumors and witnesses. Everybody seems to know someone who has seen GPT-4 at this point, but there's also lots and lots of noise and misinformation. There are people who credibly have seen it, but of course anyone who has seen GPT-4 is under NDA and won't give any details — people have been very, let's say, zealous about honoring their NDAs. Basically, the two things I've heard are that it's a step change — that the difference between GPT-3 and GPT-4 is as big as the difference between GPT-2 and GPT-3. That's about all that people who have signed an NDA are willing to say, so anyone who says otherwise... eh, I don't know. The feeling I get when I talk to people is "I know someone who knows someone who saw it" — hushed whispers, "it's coming." I don't really see any fear, though, just some vague awe. And this was the tweet I mentioned, claiming it can write a 60,000-word book from a single prompt. I don't believe that. It can probably write maybe a tenth of a book — six thousand words, that I could believe — maybe even less. So yeah, rumors are running wild, and there's a lot of misinformation.
A lot of it is people just joking around and trolling; I'm not saying it's malicious misinformation — people are just having fun with it. So what's still missing? There has been very little talk about cognitive architecture, though thanks go to Yann LeCun, who released a paper proposing a cognitive architecture. It was not particularly sophisticated, but it's a move in the right direction. We need to talk more about cognitive architecture, because no matter how big your model gets, it's just a brain in a jar. There's also not a whole lot of research on external integrations, and some of this is on purpose. I want to acknowledge the alignment researchers out there who say: no, it's actually quite deliberate that we are not integrating these models until we understand them — until we have things like inhibition, until we can crack open the black box and get more explainability. So cognitive architecture and external integrations may be left out on purpose — but just because they're left out on purpose doesn't mean they're not missing. They're an integral part of the research. In my last video about AGI, I mentioned that some people say that before something can be considered truly intelligent, it needs to be embodied. I don't know that it fully needs to be embodied, but it certainly needs to be connected to what we would consider the real world, or at least a simulated world. As long as it's a brain in a jar whose entire input and output is a little bit of text, it has no idea what world it lives in. Then there are two other problems: short-term memory and long-term memory. On short-term memory: these models are completely transactional. What I mean by that is you put in a little bit of text, you get a little bit of text out, and that state is lost forever. One person explained to me that this is also part of alignment research,
because if it has no persistent state — if it has amnesia — then it can't keep track of long-term goals, and that's considered a safety property. If it's completely ephemeral and forgets everything all the time, it cannot construct long-term goals — unless, of course, you add external integrations, cognitive architectures, and long-term memory systems like semantic search and databases. So it might be very much on purpose that there's no recurrent input or way to maintain a neural state; that could be a safety thing. Again, just because it's on purpose, and for safety reasons, doesn't mean it's not missing from the research, or that it shouldn't be done — but I can understand why such a thing might not be released to the public. I already mentioned confabulation control and inhibition of negatives. So, like I said, it's basically a brain in a jar. But the biggest criticism is openness. I understand there's a huge profit motive here — for everyone researching this, not just OpenAI. Now that the cat is out of the bag and the profit motive is there, I expect everyone — Google, OpenAI, Microsoft, NVIDIA, Meta — to clam up about their innovations. In some respects, okay, I get it: they want their slice of the pie, their special products and services. But one concern I have, especially as we approach AGI and the singularity and whatever else, is that this feels pretty dangerous to me. These models are so big that only billion-dollar companies can afford to build and run them anyway, and by not allowing other people to look under the hood, I'm really concerned.
And this is why I'm glad that groups like EleutherAI and Anthropic exist, and that the BigScience collaboration built BLOOM, an open-source GPT-3-scale model — that was done partly to prove it can be done, and honestly, I think it should be done. I think we should work together more on this stuff, and this criticism is directed at OpenAI, because even with the name "OpenAI," it's not really open anymore, and that's kind of scary. Now, that being said, it doesn't mean they won't release a paper — they did release a paper with GPT-3 — so I want to temper my own criticism by taking a wait-and-see approach. I really hope that with the release of GPT-4 they at least publish a paper, so that other people in the open-source community can start working on recreating it and catching up. That's the biggest thing that's missing. Thanks for watching; time will tell with all of this. Have a good one.