Hello, hello, oh wait, it's quite loud. So I'm going to start my talk in just a second, but I just want to let everyone know, in light of the fact that the contrast on the projector is really a bit weak, that I uploaded the slides. So if you want to pull them onto your own laptop and stare at them right in front of your face, that might be helpful. It's at bit.ly slash capital N, capital L, capital P, the number four, beer: NLP for beer. Then you can pull the slides in and you don't have to deal with the screen. It's bit.ly slash capital N, capital L, capital P, the number four, B-E-E-R, and that "beer" is all lowercase. I can repeat it again. Did everybody who wants it get it? Yeah? Cool? Coming in? All right. Yeah, yeah. It's bit.ly slash NLP, and that's capitalized, four, the number four, and then lowercase b-e-e-r: NLP for beer. Okay. So I'm going to get started by telling you all a little bit about the things that keep me up at night. The image on the screen right now, or perhaps on your laptop if you've grabbed the slides, is really the summary of my nightmare of existence, right? You have a lovely bar with lots of beer on offer and almost no quick and easily accessible information about what any of those beers might taste like. And this is a really common sort of information display in a bar in England: you have a bunch of commercially sensible cask labels that sit on the pumps, and maybe some branding on top of a draft pump, and that's all the information you get. So, I don't know, this one's purple. It's got a nice Helvetica font. What does it taste like? I have no idea. So here I am. I've got like 30 beers in front of me, only so much alcohol I can shove in my liver in the night, and I have to make a choice between all of these things, and I have no information to make that choice with, or no practical information to make that choice with. Some of you might recognize this bar. It's lovely.
It's in Clerkenwell. So this is my dream. This is from an unfortunately now out-of-business bar called Mason and Taylor. This is the paper menu, and it is a lovely piece of text information about beer. Really quickly, you can sort of go through things, and you see they've organized things by broad color, which is a really good way to start with beer organization. And then inside each description, which maybe you can't make out on the projector, you have a couple of sentences that roughly walk you through what kind of flavors you can expect, how strong it is in taste, what the ABV is, of course, and how much it costs. And this menu is quite quick to parse. You can even do it when you've had a few beers already; you can probably work out what's meant to be going on, this sort of thing. So really what I want to do is talk about ways that we might be able to move automatically from the current state of affairs to a menu that looks like what we just saw, right? Ways that we can harvest the broad intellect of the universe on the web to make good facts about beer that are well organized and easily accessible. You know, it's a dream. So today we're going to have a little talk. It's called Tastes Great, Less Wordy. We're going to be using document-oriented natural language processing, and we're going to do it for beer. My name is Ben Fields, and I work for a little company called Fun and Plausible Solutions, and I'm camping right over there for the next few days. So this talk really is about two different things. The first, of course, is beer: lots and lots of beer, shelves and shelves of beer, ideally; how you deal with it, how you think about it, and how it works. The second is text, and not just text, but the organization of text and the organization of big works of text. So we merge these two things together and we end up, effectively, talking about beer reviews.
And it's worth pointing out that everything we're going to talk about today can be generalized to other topics. The things we're talking about specifically with beer can, of course, generalize to other beverages, but more than that, to basically anything that people write about: when people write descriptions of items, you can do the kind of stuff we're talking about to those large stacks of descriptions and get to somewhere like where we're going. And all of the text processing techniques actually generalize beyond text as well, so you can find people who do the kinds of things that we're doing to text to any streams of bytes. Anything that you can chunk up, calling those chunks characters and words, you can do a lot of this stuff to; people do it to audio and video sometimes, and the effects are strange and interesting. The other thing that's worth noting, and then we're done with the caveats, is that the talk's really informal. If there are any academics on these topics in the audience, we can talk about the details of the stats afterwards, but for the sake of brevity, and getting through things, and being able to follow everything, I'm going to skim over a lot of the math in a way that's approximately wrong but still followable at a high level. So let's talk about beer. Beer is made of four things, and maybe some additives, but really four things. The first is water. Water is extraordinarily important in beer: the mineral content in the water has a huge effect on what sort of beer you can get and how different flavors will work. Most of your beer is water. The second thing is malt. Malt is a cereal grain; it's also used to make whiskey, and really this is where all of the alcohol comes from, because alcohol comes from sugar, and the sugar comes from the malt. Most commonly you use barley malt, but you can also use rye and wheat and oats.
Basically anything that is a cereal grain, you can turn into beer, and you have to malt it. Hops: hops are the herb that makes your beer a little bitter. Anything that isn't sweet probably came from the hops, unless it's a flavored beer with an additive. Hops come in a bunch of varieties, and the part of the hop that matters is the flower, so it's just the female hops that we get these things from. And yeah. And then of course, yeast. So yeast is a single-celled organism. It is magic. It is completely magic. You give a sugar solution to this yeast and it turns it into alcohol. It's wonderful. Human society depends on this first of domesticated animals, yeast. So with those four things, you go through a four-stage process. You mash. So, mashing: how many in the audience have ever made porridge? Anybody made porridge? Anybody? Hands? Hands? OK, so that's mashing, right? All you do is take your cereal grains, malt mostly from barley, and soak them in hot water. The temperature you use matters, but don't worry too much about that. And while you soak those grains in water, you activate some chemical processes that turn starch into sugar. And you have to do that because yeast can eat sugar, but it can't eat starch. And you want alcohol, and so you have to feed your yeast. So you mash it. And then after you've mashed it, you boil it for an hour, maybe an hour and a half, depending on the beer style. And while you're boiling it, you're achieving a few things. The first thing is that you're knocking out a bunch of proteins that would make your beer not taste as nice, not keep as well, and be a little cloudy. Nobody wants cloudy beer. The other thing that boiling lets you do is drive off some of the water, so that the ABV will go up a little bit. And actually, there's three things: you also need it to get the hops well integrated into the beer. There are a few things about the way the bitterness happens chemically that require a good hard boil.
So yeah. And then you do a ferment. The ferment is where the yeast goes to town. Once you've made a really sweet, slightly bitter solution from hops and your boiled syrup from malt, you throw a bunch of yeast in it. And the yeast eats all the sugar and bubbles up quite nicely, as you can maybe make out in the picture. That's the foam that the yeast makes, because yeast eats sugar and produces CO2 and alcohol, and the CO2 makes foam. Then you package it. The packaging in this case is bottles, but you can also package your beer in a cask, in a keg, in a can, all kinds of ways. Sometimes there's a carbonation that happens in the packaging. This happens with cask ale; it also happens with bottle-conditioned beer. Bottle conditioning means that the carbonation actually happens via the yeast continuing its process while it's in the bottle. You can also force-carbonate beer, which means you just shove a bunch of CO2 from a canister into the liquid and then put it in the can or a bottle. OK, so that's it. That's four ingredients and four steps, and you get just about every beer that you can think of. I say just about because sometimes there are fruit additives, sometimes there are weird things, sometimes there are industrial processes that have additives. But generally, for most beer, most of the time: four things, four steps, that's it. So that creates an extremely diverse range of beers. And there is a set of people in the world that like nothing better than to sit around on the weekend and taste them and bicker with each other about which one tastes better, and which one tastes of grapefruit, and which one tastes of leather, and which one tastes of elderflowers. And here's a picture of one of these lovely souls. I count myself among their number frequently. So the thing about these people is that we have some websites where we can collect our opinions. And this is one of them: it's ratebeer.com.
RateBeer is quite a good one. There are a few others, too, and I'm not necessarily advocating for RateBeer over all the other ones, but their website is quite easy to scrape, which makes it handy for the latter part of the talk. RateBeer's been around for a long time, and there's quite a lot there. The first thing to mention is that RateBeer's aesthetic hasn't really changed since 2001, so it looks like a website from whence it comes, which is nice. It's quaint. So RateBeer tells you a few things about every beer. You get some basic metadata, which you can see here for Punk IPA, BrewDog's sort of core hoppy pale ale. You see the ABV of the beer, where the brewery is located, what style it is, and then a sort of nominal idea of how good it is according to the community. And the way that they express things, rather than using raw ratings, is a sort of percentile score. You'll see two percentile scores for any of the beers on RateBeer. One of them is an overall score across all beers on the site, and the other one is style-normalized. And the style-normalized one can be more useful for anything that's not really over-hopped IPAs, because that's kind of what RateBeer likes. If you have beer that is simpler or darker or a little easier on the palate, the community at RateBeer tends to not like it as much. And so what you'll find is you'll get beers (and actually one of the ones on the bar is one of these) that rate really highly in their style, and still not very highly across all the beer, right? So be aware of style bias. Okay, but if you scroll down on a RateBeer page, what you see is this lovely collection of reviews. And here we have, for Punk IPA, a bunch of 100-to-200-character reviews, some of which can get quite long, that explain the beer and that also split up people's numeric opinions, quantified opinions, into a bunch of different dimensions: aroma, appearance, taste, feel on the palate, and overall score.
You get a date, you get the location of the user, and then a bunch of text. So here we have, for Punk IPA: clear, medium to dark yellow-orange color, with a large, frothy, creamy, good lacing, mostly to fully lasting, off-white head; aroma is moderate malty caramel, moderate to heavy hoppy, citrus, dusty, grass, orange, citrus, et cetera; it goes on from there. So you get a lot of quite specific flavor characteristics that can be quite useful, in how it's presented in things like this. Just another example, in a different style: this is a Kernel beer from London, the Export Stout, which is quite a large, dark beer, also scored quite highly amongst the community. And for this beer, if we go down to the bottom, we see this one: treacle-like, viscous, thick, a dark brown or black beer with a cream-brown head; can't think of much to improve here; bitter and sweet and mineral, all in one perfect harmony. So that reviewer clearly loved this beer and gave some detail about why. Also, that got a 4.2 out of five, so you understand what sort of critical expectations the community has. Now this slide, if you can make it out from here, is just the text from some reviews of that Export Stout. And in fact, to be a bit more exact, this is all of the text from all of the reviews for that Export Stout. So you probably can't make it out; I can't make it out from the stage. It's about 75,000 characters. Sorry, 75,000 words. And so the question is: if you use a site like this, how can you get a good feel for what the community thinks about a beer, and what the descriptive characteristics of that beer are, without reading pages and pages of text? I mean, you can do some sampling, read a few reviews here and there, but that's not very satisfying and it may also give you the wrong opinion. So: lots of people write lots of things about beer, and we want to get a good, high-level view of all of those things.
So we're gonna use text processing to do that. And specifically, we're gonna use a part of text processing called natural language processing, NLP. It has no relation to the technique for understanding and divining human insight that became quite hip amongst corporate folk in the 80s. They stole our acronym; we want it back. Natural language processing. And what that means is that people are speaking as they would, and we are trying to get the computer to understand that, or at least to have a good enough statistical model that it feels like it understands it. And in fact, most of the time that's actually what happens, because the computers are not conscious, and they're not going to rise up and kill us, at least not yet. So, what we're going to do is summarize documents, and document is in quotes here because a document is just a thing that we declare is somehow related, right? It's part of an external labeling process: this clump of text here is related, and we will treat it as a single unit. So we want to summarize documents, and we want to be able to compare the documents amongst themselves. The first thing with natural language processing that's worth understanding, and where a lot of domain-specific stuff comes in, is that you have to pre-process, you have to clean up, all of your data. The core step here is what's called tokenization. Tokenization is a really fancy academic word for split on whitespace, right? So here's some Python code that does tokenization in the stupidest way possible. What we're doing is lowercasing all the words and splitting them all on whitespace. Hopefully any programming language that you like can do this. There are other, more exotic ways to do tokenization, I should say. You can be really, really specific about it: you can split, for instance, contractions; you can drop punctuation; you can ignore words; you can do all kinds of fun things with your character sets.
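The code on the slide isn't reproduced in the transcript, but the naive tokenizer described, lowercase everything and split on whitespace, is roughly this sketch:

```python
def tokenize(text):
    """Stupidest possible tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()

tokenize("Treacle-like, viscous, THICK dark beer")
# → ['treacle-like,', 'viscous,', 'thick', 'dark', 'beer']
```

Note how punctuation stays glued to the words, which is exactly why the fancier tokenizers mentioned next exist.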
But the simplest and most straightforward way is to just split on whitespace and take every individual word as a unique token. You can use a toolkit in Python called NLTK that does slightly more exact tokenization than just whitespace splitting, with, as I say, punctuation splitting and handling of contractions and things like this. And then the other thing that you can do in a preprocess is declare a bunch of words that may be common but are, for the particular purposes of your language processing, not actually relevant and are never going to be. In language processing these are called stop lists, or stop word lists. For English, you can put in things like "the" and "and" and "I" and "he" and "she" and "we" and "it". It does depend on what you're doing, and you have to be careful with stop word lists, because if you just throw in a big list of words that you don't think are relevant, you can miss some subtleties in the statistical model. But it's quite good for conjunctions and pronouns and things like this when you're looking for descriptive text. Fine. So we do all of that. And then, as a last step for our particular adventure, we can do part-of-speech labeling, which is another fun trick. It's a little bit more complicated, so the details are outside the scope of this talk, but the bit of code on the slide shows how you can do part-of-speech labeling with the natural language toolkit in Python, and then you can just filter out everything but adjectives and nouns. Because the ultimate goal here is to get a list of descriptors for each beer, we don't actually care about verbs or pronouns or past participles or other parts of speech, right? Just adjectives and nouns. So, fine. Okay, so the simplest and most straightforward of the statistical models that we can use is called TF-IDF, and TF-IDF is self-describing, which is quite nice: term frequency by inverse document frequency.
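The stop-word and part-of-speech steps just described might look like the sketch below. In practice you'd call NLTK's trained tagger (`nltk.pos_tag`); here a toy hand-labeled tag table stands in for it so the sketch is self-contained, and both `STOP_WORDS` and `TAGS` are illustrative stand-ins, not real resources:

```python
# Stop-word removal plus part-of-speech filtering, as described above.
# NOTE: real code would use nltk.pos_tag and nltk.corpus.stopwords;
# the tables below are toy stand-ins so the sketch runs on its own.
STOP_WORDS = {"the", "and", "i", "he", "she", "we", "it", "a", "is", "with"}

TAGS = {  # hypothetical hand-labeled tags (JJ = adjective, NN = noun, VB = verb)
    "dark": "JJ", "roasted": "JJ", "bitter": "JJ",
    "coffee": "NN", "head": "NN",
    "pours": "VB", "drinking": "VB",
}

def descriptors(tokens):
    """Keep only adjectives and nouns that aren't stop words."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [t for t in kept if TAGS.get(t) in ("JJ", "NN")]

descriptors(["it", "pours", "a", "dark", "roasted", "coffee", "head"])
# → ['dark', 'roasted', 'coffee', 'head']
```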
And so you count all of your tokens in two ways. First you count how often a given token occurs in one document: that's term frequency. Then you count how many documents that term occurs in: that's document frequency, and you take the inverse of that, and you multiply the two together. The idea here is that a word that's very common, even if it occurs many times in a document, still won't get a very high weight. If you have a word like "the", and your corpus is English-language documents, then "the" is probably going to occur in just about every single one of your documents. So even if it occurs many, many, many times in a single document, it still won't score very highly, right? On the other hand, if you have a word like, I don't know, "brown" or "barley" or "grass", something that is maybe not so common across your whole corpus, it's only in a selection of documents, but in this one document in front of you it occurs hundreds and hundreds of times, well, then that document is probably talking about grass. To put it all together in a standard sort of workflow: we take in documents from the web, a bunch of reviews about beer, and we pre-process them; then we count the term frequencies per document, and we count the document frequency across all the documents, and then we get TF-IDF scores for all of our beers. So let's walk through an example real quick. Here's the Kernel Export Stout; you may recognize it from earlier in the talk. Here's a picture of it poured. Some nice, nice head; it's good. And here's all the text, which you also may recall from earlier.
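The two-way counting described above fits in a few lines of plain Python. This is a minimal sketch of the idea using the common tf × log(N/df) weighting, not the speaker's actual code:

```python
import math
from collections import Counter

def tfidf(corpus):
    """corpus: dict of document name -> token list.
    Returns, per document, a dict of token -> TF-IDF score."""
    n_docs = len(corpus)
    df = Counter()                      # document frequency per term
    for tokens in corpus.values():
        df.update(set(tokens))          # count each term once per document
    scores = {}
    for name, tokens in corpus.items():
        tf = Counter(tokens)            # term frequency within this document
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores

reviews = {
    "stout": ["dark", "coffee", "coffee", "beer"],
    "ipa":   ["citrus", "hoppy", "beer"],
}
scores = tfidf(reviews)
# "beer" appears in every document, so log(2/2) = 0 and it scores zero;
# "coffee" is distinctive to the stout and gets the stout's highest score.
```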
So once we push it through our entire process, along with the full corpus, what we end up with is a bunch of pairs of words and scores. The scores are going to vary from zero to one, if you do some normalization. And so this beer, the Kernel Export Stout, is 0.157 coffee and 0.116 chocolate, and I'm not gonna read all the numbers, but it goes on: coffee, chocolate, black, dark, licorice, espresso, cocoa, brown, bitter, and smoke. And in fact there are about 10,000 words that all have scores; these are just the top 10. So this gives you a much faster and easier way to appreciate what the beer is about than having to read seventy-something thousand characters of people's reckons about the Kernel Export Stout London 1890. The other thing that pops out to me, and perhaps this ages me specifically: we have a bunch of words, and we have a bunch of numbers attached to words in pairs, and that's giving me a flashback to my early web browsing. Here is a word cloud of all of those words. Because, I don't know about you, but whenever I see words and scores for words, I just wanna put them in a word cloud. So we've done that, and here you start to see the relative weights of the terms: it's mostly coffee and chocolate, and then a couple of coloring notes about black and dark liquids. So that's fun.
Okay, but the thing about this is that the computer is not clever, remember. We're performing bits of statistics to trick the computer into tricking us that it has any idea what's going on. It does not understand; it's just counting. So this word cloud has a bunch of overlap of ideas that are expressed in multiple words across all the people. For instance, espresso and coffee: these things are different, but they're very similar, especially amongst all the other words, right? They're not quite the same thing, but they're very close. And here, for instance, are black and dark and brown, which are all color terms, and specifically the refining color terms, right? Dark doesn't tell us exactly the same thing as black, but they overlap the same sort of space of concept. Similarly, we've got cocoa and chocolate; I don't know enough to understand the difference between those two words and their usage here. So what we can do is a thing called topic modeling. Topic modeling adds a separate process where you look at the whole corpus of your text and you count how often words occur with other words. You're looking for co-occurrence, and this allows you to pull out words that are commonly used together, and that's what's called a topic here. So the computer still doesn't know anything about language, or about humans and how we talk to each other, but it's sufficiently good at counting that we can count in lots of ways, and then it's better at pretending it knows what's going on. Also, we can spell it in an American way, and we can advance. So instead of the TF-IDF process, where we preprocess and do term counting and then document frequency counting, we push all the documents through and then we count co-occurrences as a separate process that happens in advance, and then we get a bunch of topics. For most topic modeling methods, and indeed the one that we'll talk about specifically in a second, you need to specify to the algorithm how many topics are expressed across the
documents. So the model is gonna look for, say, 100 topics in your corpus, or 200, or 50, or whatever, and there are some side effects to picking different values for that, but that's a detail we'll have to skim over. This can be refined as well: you can add a new document to the corpus and make your topics a little bit more accurate. And when we say accurate here, it's the idea of how correct the model is that two words are going to occur together in a given document that the system's never seen before. So if we have chocolate, is it very likely that we'll also have cocoa in a document? And that sort of accuracy can get refined with more bits of document. Fine. So if we do this with all the beer reviews, what happens? We get a bunch of beer topics. Here are the top terms from three topics. I don't know if you can read it, but what we've got here: the first topic is chocolate, black, coffee, dark, and roasted; the next topic over is smoke, smoked, smoky, peat; and the third topic is citrus, hops, grapefruit, orange, and pine. So the first topic is basically talking about a particular flavor you're getting off of the malt: dark malts are imparting these chocolatey tones, and a little bit of the color effects come from that same clump of things. The second one is clearly smoked beer, peat being a common source of that smoke flavor. And the third one is a collection of flavors that you get from hops, and indeed a hint that that's what's going on, because hops is also included in that topic. They're also all relative terms: in the smoke-related one, the percentage of likelihood associated with the word smoke is quite a bit higher than, for instance, peat, because not all smoked things come from peat. So a document, in this worldview, is described not as a bunch of words but as a bunch of topics, which are in turn made up of a bunch of words. We have a mixture of topics, and that's a document. So how do we do that?
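One way to fit such a mixture-of-topics model is latent Dirichlet allocation, which the talk turns to next. In practice you'd reach for a library like gensim or scikit-learn; purely to illustrate the counting involved, here is a toy collapsed Gibbs sampler, my own sketch, not the speaker's code:

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=1):
    """Toy collapsed Gibbs sampler for latent Dirichlet allocation.
    docs: list of token lists. Returns (topic-word count table, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    # Random initial topic assignment for every token of every document.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * V for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            doc_topic[d][k] += 1
            topic_word[k][w2i[w]] += 1
            topic_total[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k, wi = z[d][n], w2i[w]
                # Remove this token's current assignment...
                doc_topic[d][k] -= 1
                topic_word[k][wi] -= 1
                topic_total[k] -= 1
                # ...and resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][wi] + beta) / (topic_total[j] + V * beta)
                    for j in range(n_topics)
                ]
                r = rng.random() * sum(weights)
                k = 0
                while r > weights[k]:
                    r -= weights[k]
                    k += 1
                z[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][wi] += 1
                topic_total[k] += 1
    return topic_word, vocab

# Four tiny "reviews" with two obvious themes (dark vs. hoppy).
docs = [
    ["coffee", "dark", "chocolate"], ["coffee", "chocolate", "roasted"],
    ["citrus", "hoppy", "grapefruit"], ["citrus", "hoppy", "pine"],
]
topic_word, vocab = lda_gibbs(docs, n_topics=2)
```

With enough iterations the two rows of `topic_word` tend to concentrate on the dark-flavor words and the hop words respectively; on a real corpus you'd use a proper library instead of this sketch.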
So we are going to use a thing called latent Dirichlet allocation, which is very fancy-sounding. It's got three words, and the middle word is a guy's name, so you know it's fancy. A Dirichlet is a statistical tool that is basically like this: you have a bucket. Here's our color visualization of a bucket. The bucket has four colors of stuff in it, and each of those things has a percentage of occurrence: 45% of this bucket is blue squares, and 24% is pink squares, and 18% is black squares, and 13% is white squares. So that's a Dirichlet: it's basically just assigning ratio values that all add up to one. And the idea is that by describing these things, we can assign probabilities to unseen things we draw out of that set. So latent Dirichlet allocation is simply a method of assigning all those values, given that I have no idea what's going on inside this magical black box over here to the side. This is really informal; I need to take you aside and tell you that I'm telling you things that are not quite mathematically correct. But basically, what you do is: every time you draw a sample, every time you draw a piece from the Dirichlet, you learn a little bit more about its contents. So if I know that there's a bucket of balls, and the balls are different colors, and I pull one out, I know a little bit about this. If I pull two out, I know a little bit more. And I can just repeat forever until I have perfect information about the contents of the bucket, even though I can't actually see all the contents at once. Fine. So if we go back to the Kernel beer that we started this talk with, we get all this text again, and then what happens is we get some topics. These are the top four topics for the Kernel Export Stout, and what we've done with this is we've pushed all of the dark-beer flavor things into a single topic at the front. So instead of having about 40% of the words that showed up in that word graph, or in that word cloud rather, it's just a single topic, and we have a
percentage associated with that topic, and then we have two or three other topics that are not quite the same, things that were missed in the previous method. I'll just read them for those of you who can't quite make out the slide. The first topic is dark, brown, sweet, fruit, raisins. So it's not the sort of coffee-espresso notion of dark that we saw in the earlier example; it's more of a rich sort of dark fruit, like prunes or figs, this kind of dark flavor. Then you have a fruit topic that goes fruit, peach, tropical, mango, apricot; then a hoppy topic that's bitter- and pine-dominated; and then a topic about presentation: pours, nice, head, sweet, aroma. So this is an interesting collection of things that actually says that, rather than just hitting you in the face with roasted notes, what you're actually getting here is a deep, dark beer with a lot of aged-fruit notes, right? A lot of prune notes and things like this, which I think is actually more accurate for this beer, but I'll let you find a bottle of it and make your own decisions. So, to go back: we have these four flavor ideas. Can we map these four flavor ideas to topics in our corpus? Because topic models as an exercise can also tell you quite a bit about all your documents, not just about specific items inside your corpus. So we have water... or sorry, I'm gonna do it in the wrong order. We have hops: hops is citrus, hop, grapefruit, orange, pine. Malt, which is caramel, amber, malt, sweet, copper. Yeast: banana, clove, bubblegum. And then water. Now, the thing about water: golden, white, grassy, light, head. This is a bit of a reach. The thing about water is that the flavor you get from water is flavor you don't notice. If everything goes right with your beer, those flavors are gonna get shoved somewhere else, they're gonna get credited somewhere else. You only really notice your water profile when your beer tastes of sulfur, which is not something you want,
or (just in case anyone thought that was a good idea) when your beer tastes really salty, which, unless you're drinking a really obscure style from northern Germany, is also not something you want. The reason why I've tied this topic to water is that this topic describes a lot of characteristics of Pilsner beers, and Pilsner beers use extremely soft water profiles; if you get the water profile wrong, the style drifts off entirely. So if you see beers that are described as white and golden and possibly a little bit grassy, which is a hop note for that same kind of style, your water profile has to be basically extraordinarily soft or it's all gonna go wrong. These are not beer styles that you can make with ground water from basically anywhere in England, for instance. So I'm gonna round out the talk with some practical and some timely information. This is the beer list at the Robot Arms. The Robot Arms is the pub that's right over there, like four tents down; it's where the beer is at camp. These, as of yesterday, are the beers that are either on right now or will be on at some point in the next three days. These are the ABVs of those beers, also important to know when you're calibrating your drinking sessions. And these are the top three terms that are associated with the beers. So I'm just gonna go through them, because I've got a little time. We've got three beers from Milton. We've got the Pegasus, which is amber and bitter and apparently a little mushroomy; I don't know, it's the internet, you're gonna have to go with it. The Milton Justinian: soft, floral, peachy. So in contrast, even though these two beers are both nearly the same ABV, this beer is going to be more pale, sort of blonde, a little floral from the hops. Then you've got the Milton Minotaur, which is a mild, and it's got a pretty standard profile for a mild: coffee, dark mild, brown. So this is brown more than black. At 3.3%, the Tring Ridgeway, which is
a session beer, so it's session, nutty, amber. And the Tring Moongazer, which I don't have descriptors for, because I had to throw this together speedily. And the Otter Bitter, which is a fruity, session, amber beer. The Castle Rock, which is a fruity, golden, grapefruit beer. And Becks Vier, for those of you that like that sort of thing, is pale and golden and watery; again, I take no credit for these words. And the Staropramen: golden, floral, and metallic. So with that, I'm going to pause here; in fact, I'll go back so you can all take your notes, and I'm happy to take questions. I think I have about 10 more minutes.

So the question was: how do you deal with negatives? And the example was "it's not hoppy" or "it's not golden". The first answer is that that's probably the hardest part of this kind of thing. In this particular case I can get around it because of the nature of the data: if you look through the beer reviews, it's very uncommon for people to describe beer in the negative. It does happen a little bit, but because it's not common, and effectively what we're doing is taking a bunch of averages, it sort of comes out in the wash in this particular case. So it's not a problem I encounter in this data set. There are some methods, though: some of the things you can do with topic modeling will start to suss out some of the negation. Because it's rare that someone's going to say "not hoppy" and "hoppy" in a sort of 50-50 split across different beers, what's far more common is that certain words get negated, and if that's the case, that will get pulled out, or can, in gathering topics. And again, the way you deal with negation, because it's quite difficult, tends to be less generalized; it tends to be that you're going to
use something that you know is particular to the domain. In this case that works, because we care mostly about adjectives and we know that people are generally describing things positively.

Okay, so the question was: we're working with quite a narrow domain, beer; how do you do topic modelling for a larger domain, like the whole of the English language, or something bigger than beer? Beer is quite big, but yes, topics are almost always useful no matter how much generalization they're doing. The work that a topic model is doing is effectively going from every single word being its own little animal to one or two hundred slightly more complicated statistical objects that you can then do things with. How well it works depends on how much work it's doing, but it's always going to be helpful. So if you have the whole of English, and there are, I don't know, a hundred thousand or two hundred thousand words across all of your documents, and you're reducing that to two hundred topics, then each topic on average is covering the space of something like five hundred to a thousand words. It means that your topics are going to be broader. Here in beer we've got topics that talk about hops and topics that talk about
amber malts and topics that talk about smoke; if you did the same number of topics on a corpus that comes broadly from English, maybe you'd get one topic about beer, or, depending on how broad the corpus is, one topic that talks about alcoholic beverages. So it's a matter of what detail you'll see, how quickly you can do the analysis, and how well you can measure inter-document distance. One of the things people like to do with topics, which doesn't quite fit in the amount of time, is measure how close two documents are, and that works really well with fewer topics, maybe fifty or a hundred. There's a mathematical reason, but basically the smaller the number of topics, the more accurate that distance measure is going to be. What you lose is semantic understanding, an appreciation of what each topic means; the underlying maths for distance between documents doesn't depend on the topics being understandable by people, but then you can use the topics in a different way and get that out of it as well. More questions? We have five minutes. Have I stumped you all? One at the back, yes.

Okay, so the question, and I hope I heard it correctly over the windy tent, was: what sort of aspirations do I have for this? Where do I see natural language processing going, in general or just with beer? Both, okay. I mean, Horse ebooks, I guess, is a good answer for this. The thing about natural language processing is that it's quite powerful, but what it doesn't do is teach anyone about language. I've hit on this point a couple of times, but no matter what it looks like, the software doesn't understand speech. It's like
when people get beaten by computers at chess: the computer isn't playing chess the same way a human plays chess. It's a loose analogy, but it's roughly right: you're giving some software tools to pretend like it knows what's going on. One of the ways you can see this is to look at the flip side of natural language processing, which is called natural language generation. In this case that would mean the computer, given some beer, produces a sort of prose review of it. So instead of just a list of adjectives we would get, you know, "the beer initially drank with a bit of straw, and had a medium mouthfeel that closed into a bitter, long finish; it was great, I'd like another", instead of the list: straw, bitter, amber, golden. What you find with most of those systems is that they tend to be highly rule-based, with skeletal structures somewhere in the system. It's very hard to get a computer to improvise; you just make a fixed system complicated enough that it feels like improvisation to humans. To me the aspiration is: can we build sufficiently complicated models that the software actually understands enough to sort of pass Turing tests and things like this? That's a good beer conversation, I suppose. Okay, more questions; I have five minutes, apparently.

Well, I think it depends, as everyone who cares enough is going to say. I would probably start with either the Justina or maybe the Otter Bitter, then go from there to one of the two amber session beers, and then maybe a mild; so probably the Moongazer, even though we don't have the data for it. The thing I will say is that
one of the things that's really nice about doing this kind of analysis for beer is that it's a really good way to figure out what's more likely to satisfy your interests and your tastes without having to drink a bunch of beer first. If I know what I'm after, I can look at a list like this even if I'm not familiar with any of the beers. I think I've had two of the beers on this list before, but if I want to start with a blonde that's reasonably low ABV and quite hoppy, I can tell you with a pretty high degree of confidence that it's going to be either the Justina or the Otter Bitter. And if what I actually want is a beer that's kind of like that but more aggressively hopped, with aromatics up front and maybe the sort of big New World hop character, it's going to be the Castle Rock Harvest Pale. In fact the difference between the Castle Rock Harvest Pale and those other two beers shows up quite clearly in the hop aroma profile, and if you saw a list of ten terms instead of a list of three, this would become more apparent. They're similar in the words describing malt, which are mostly colour-related or bread-and-cereal kinds of words, but the words used to describe the aroma are wildly different. You get softer words around the Justina and the Otter Bitter; the Justina, for example, has "peachy" appear, and peachy is a very soft kind of hop note in beer. Then down here you have "grapefruit" for the Castle Rock, which is a key trigger word for big American hop aroma; it's used quite commonly. So, yeah, maybe that's my non-answer answer.

The code base that does all this analysis lives in a GitHub project called consume_our_beer.
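That sort of shortlist reasoning, matching the descriptors you're after against each beer's top terms, is easy to sketch. Everything below is illustrative: the term sets are just the top-three lists read out above, and the ABV figures are placeholders I've invented, not the real values.

```python
# Toy beer list: name -> (ABV, top review terms). Terms follow the
# top-three lists above; the ABVs are invented placeholders.
BEERS = {
    "Milton Pegasus": (4.1, {"amber", "bitter", "mushroomy"}),
    "Milton Justina": (3.9, {"soft", "floral", "peachy"}),
    "Milton Minotaur": (3.3, {"coffee", "dark", "brown"}),
    "Otter Bitter": (3.6, {"fruity", "session", "amber"}),
    "Castle Rock Harvest Pale": (3.8, {"fruity", "golden", "grapefruit"}),
}

def shortlist(beers, wanted, max_abv=None):
    """Names of beers whose top terms overlap the wanted descriptors."""
    picks = [name for name, (abv, terms) in beers.items()
             if (max_abv is None or abv <= max_abv) and terms & wanted]
    return sorted(picks)
```

So asking for something soft and floral under 4% surfaces the Justina, while asking for grapefruit picks out the Harvest Pale; a real version would draw the term sets from the scraped review data rather than a hand-typed table.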
There's a link to it as part of the slide deck, so if you go to the link the slide deck is on, which I tweeted earlier and will push around a little later as well, it's just a GitHub project that does all the scraping. In fact I have an SQL dump of all the data, already processed, which I will happily give you if you come and talk to me; it's a couple of gigs, and you can do whatever you like with it. Chris?

So, no, not any companies that are commercially viable. I had a go at making a product, and for a variety of reasons that didn't quite work; I think Chris already knew that answer. There have been some attempts, actually, though I'm trying to think of anything that's still viable in beer. The closest thing you get is some of the social-network recommender things that happen around beer, like Untappd, which doesn't really leverage its data very well but does a really good job of collecting a lot of data. There's also a wine app that does similar things, called Vivino, which is actually a little bit better about eliciting text. I think part of this is because wine nerds tend to write more, in a broad, stereotype-making kind of way. The two apps are quite similar, except that one deals with wine and the other with beer: Vivino has a big free-text field, and Untappd is like, give us twenty characters if you feel like it. The result is that on Vivino you get things that are closer to the length of reviews, and it's quite hard to do a lot of these techniques with very short text, because a lot of it depends on words occurring next to each other, and that just works better with longer documents. But Vivino does quite a good job of pulling out
things like "this one is a Bordeaux that's going to be quite full-bodied", that sort of thing. I'm not sure which techniques they use, but they're clearly doing some language processing server-side that does some of this.

The other part, no, I have not, but there is some quite hilarious scholarly work that has looked at comparing commercial price points, social descriptors and commercial descriptions. The study was originally done with wine data about ten years ago. It basically looked at the commercial description, the price, and also expert reviews with numeric scores, you know, did Wine Advocate give it a ninety or an eighty-six or whatever. They found no correlation between price and review score, but they found a strong correlation between price and the average word length in the commercial descriptions, that is, what it says on the back of the label. So expensive wine uses big words, but it's not necessarily any better; that was the broad takeaway. I have spoken with a friend of mine about doing similar kinds of analysis for beer, but the market for beer is a bit different: the price variance is very small. I think with that work a decent amount of the effect comes from the edges; really expensive wine gets really flowery language, and cheap wine people describe as "it will get you drunk". With beer the outliers are much closer together. The cheapest beer you can buy is, I don't know, a twelve-pack of something awful for five pounds, and the most expensive is two hundred quid, and that's a really expensive beer, and there are like two of those. So the range of possible prices is a lot narrower, and most beers are just a couple of pounds, and that's it.
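The word-length measurement in that study is simple enough to replicate in a few lines. This is just a sketch of the method: the back-label descriptions and prices below are made up for illustration, not data from the actual study.

```python
# Average word length of each (invented) back-label description,
# correlated against an (invented) bottle price.
def avg_word_length(text):
    words = text.split()
    return sum(len(w) for w in words) / len(words)

def pearson(xs, ys):
    # Plain Pearson correlation coefficient, no libraries needed.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical label copy and prices, purely for illustration.
labels = [
    ("fruity and fresh", 6.0),
    ("notes of plum and vanilla", 14.0),
    ("opulent concentrated magnificently structured", 48.0),
]
lengths = [avg_word_length(text) for text, _ in labels]
prices = [price for _, price in labels]
r = pearson(lengths, prices)
```

The study's claim is just that this r comes out high for commercial descriptions against price, while the same calculation for expert review scores against price comes out near zero.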
So I don't know how much of a commercial correspondence you'd get. The other thing you could do is compare the commercial descriptions to the social descriptions; my suspicion is that in most cases they would agree, just using different language to talk about the same kind of stuff. It would be interesting to see if that's right, though without actually running it you can't tell for sure. Anybody else? Cool, I think we're about at time, so that works. Thanks very much. This is my Twitter handle; if you follow me on Twitter, I'll tweet all the links in a couple of minutes.