Okay, with that we're live; just a second to finish changing the settings. All right, welcome back, everybody, to the third talk of the third day of the conference. It is my distinct pleasure to introduce a group from Arizona State University, Ankustale and Beckett Sterner, who will be talking to us about explaining ambiguity in scientific language: towards a computational approach. I've seen a little preview of this project before, and it's a really cool project, so I'm really excited to see what you've been up to on it lately. Please take it away.

All right, thank you, Charles, and thank you so much for putting this workshop together. There are all these little gems scattered around, and it's really wonderful to find them and to connect with them as people give their talks.

For today I want to pick up what's been a lightning rod for debate among philosophers, historians, and social scientists, and also in English and rhetoric: the place of ambiguity in science. Is it something persistent that we just have to accept? Is it something we can eliminate? Is it actually something we want to have around? Does it have a stable, continuing role in scientific practice? Many have argued that it leads to confusion and error and so should be eliminated. Others have pushed for the value of metaphor and analogy, for example, in opening up new ways of thinking, even enabling our senses to connect with the world in novel ways, grounding new theories, and providing a foundation for literal language. Still others have argued that ambiguity actually lets us do things together, because it means we can avoid having to agree on everything: we know just enough about what we're supposed to do to get it done, without having to fully understand all the ways in which we differ from each other, so ambiguity can be crucial to social action as well.
I think what everyone can agree on is that ambiguity is still here and seems to be sticking around, but the context in which scientific language operates is changing in really sweeping ways. Data-centric science provides a new set of forces channeling the debate about the place, persistence, and value of ambiguity toward a focus on what machines can do: what computers are good at handling versus not. In everyday language we can rely on contextual information that's just really hard to communicate to computers; a lot of natural language processing is stuck on that background knowledge. We also build shared understandings by holding conferences over many years or by being experts in a particular field, and as access to knowledge and the networks we're building grow larger, more people can enter a conversation while lacking that background: they can't even see the context that others are picking up on.

One strategy, as scientific knowledge gets opened up and globalized, is to respond by standardizing key technical terms to have a single fixed meaning. A lot of computer ontologies have this focus: I want to know exactly what "symbiosis" means everywhere it's used, so that the computer can understand exactly what a data point is signaling, reason about it, and find it in a database. But is pushing for single, fixed, universal meanings always the best strategy for advancing science? Part of what we're going to argue today is that this new context for thinking about ambiguity highlights important and novel gaps in our understanding. How is ambiguity related to productive dissent and competition? If we're building the basic definitions of scientific terms into our data infrastructure, what happens when people disagree with how those concepts are defined? What happens when they want to propose a competing way of classifying and framing data, for example? Do we have to converge on a single standardized definition to avoid confusion? Maybe there are actually alternatives here, where we can figure out how to agree to disagree in a way that still allows us to communicate but doesn't force us to arrive at a false consensus.

I think there are lots of new opportunities here to understand how social factors, like changes in who's in a community or a conversation, how different fields are connected, and how novel words percolate across corpora, influence ambiguous language use: is ambiguity preferred or dispreferred in different contexts, depending on how the social background is changing? I also want to highlight that, starting from a philosophical point of view on the role of ambiguity in science, there are a lot of really amazing things happening in linguistics and computer science that, at least in my experience, are not part of the philosophical conversation yet and really should be; I'm going to touch on a couple of those today. Just to foreshadow: one of the novel insights out of cognitive linguistics that I've been really taken with is that ambiguity can actually improve the efficiency of communication when adequate contextual information is available. The second is that there have been amazing advances in the last couple of years in our ability to detect and quantify the extent and presence of ambiguity using natural language processing; a whole new subfield called lexical semantic change has popped up and has been making pretty big strides.

Okay, so for today, what we're going to do is start general and set the scene with some big philosophical principles, and then try to move down to the specific and ask how you would operationalize these principles: how would you determine whether they seem to be in action in a
particular context. That's the novel computational approaches section. Then, if you stick with us, there are pretty pictures at the end: we'll start to explore how you could apply these new approaches using a corpus from JSTOR to study the word "subspecies," one of these classically ambiguous, much-hated, but persistent terms in evolutionary biology.

All right, so that's where we're headed. What I want to do first is set up some competing ideals, to connect with the prior literature on ambiguity and to frame questions going forward that can inform how we think about ambiguity in the context of data-centric science. Crudely speaking, we know semantic ambiguity, in particular where words have multiple meanings, is sometimes productive and sometimes harmful; if you had to sum up the prior literature in a nutshell, that's the short version. But we don't really have a systematic understanding of where, when, how, and why. A lot of the studies that philosophers, historians, and social scientists have engaged in have been deep qualitative case studies, so we get some inklings about the conditions of productive or harmful ambiguity, but we're not yet in a position to explore them more systematically. Setting the ground for that is what I'm aiming to do in this section.

As background, Piantadosi et al. in 2012 had a really lovely paper on ambiguity in cognitive linguistics, and they proposed two criteria for a communication system, which I'm going to take for granted as things we hope our language will do, and which the principles I'll describe should somehow serve. The first is clarity: a clear communication system is one in which the intended meaning can be recovered from the signal with high probability. So if
the speaker wants to say X, a clear communication system is one where the listener can recover X with high probability. The second is ease: an easy communication system is one in which signals are efficiently produced, communicated, and processed. If you want to say X, how much effort do you have to put into saying X? Do you have to write a whole philosophy paper, or can you use a single word? The spectrum of ease depends on how much effort both the speaker and the listener have to put in to get the intended meaning across. I also want to pull in a third desideratum here, around innovation: an innovative communication system is one in which novel meanings can be rapidly generated and adopted. So not a static equilibrium where all the meanings and all the terms are fixed, but one where we're doing new things and want to say new things in response; we want our language system in science to be flexible that way.

Okay. The first principle captures the response I was describing earlier: in the interest of clear and easy communication, maybe the right strategy is to make sure that every term you use has exactly one shared meaning across all contexts. "Subspecies" always means geographically isolated populations with morphological differences; no changes, no disagreement, that's just what it's always going to mean. This certainly maximizes clarity, because if you see the term, you know the meaning. But if you need a separate term for each of your meanings, uniquely and independent of context, that can lead to a very large vocabulary, which gets quite complicated for both the speaker and the listener to keep track of. And it doesn't have much to say about how to be innovative: as far as this principle goes, you basically have to figure out what you're trying to mean before you introduce the term.

One of the striking points that I haven't really seen highlighted in the philosophical literature, but that comes out of cognitive linguistics, is that ambiguity can actually increase efficiency in a precise, mathematical, information-theoretic sense: when a term has multiple meanings, the context of the term's use provides information about which meaning is intended in that token instance. The reason this helps is that it enables you to reuse short, easy words. A word like "bank" can have multiple meanings, but as long as I'm talking about going on a picnic versus depositing a check, you're going to be quite sure which meaning I intend for this short, easy word. So you can imagine that if there are a lot of things we want to say very precisely in science, overloading meanings into short terms like "function" or "species" might actually be quite helpful for people as they're trying to communicate.

The second principle picks up on this: it proposes that terms should have multiple shared meanings that apply in distinct contexts. The virtue of ambiguity here is that you can reuse your terms, but you have to make sure they do in fact have a correct meaning in each use, and that those meanings are clearly signaled by the context of use. So you preserve clarity: ambiguity does not automatically mean confusion here, as long as you're careful about matching contextual information to intended meaning.

I'm going to keep moving. The third principle picks up on something that doesn't really appear in the cognitive linguistics context but has been a primary focus of the historical and philosophical literature: the importance of metaphor. The idea you might advocate here is that terms should have multiple meanings that are locally determined in context. The highlight here is innovation rather than clarity: if you want to say new things, you want to be able to
use your language in surprising, unexpected ways in a particular context and see what happens. The extent to which you can then communicate effectively with each other will depend locally on how much background you have in common and on the shared history of how the use of that term has developed in that context.

So I've given you three different principles: one treats ambiguity as harmful and wants to eliminate multiple meanings, and with them the possibility of having new meanings emerge interactionally in local contexts; the other two recognize ambiguity as something productive, but they clearly depend on assumptions of shared context and background know-how in order to realize that value in communication. Piantadosi et al. do some interesting studies on patterns of how many meanings short versus long words have, easy-to-say versus hard-to-say words, and so on, in several different languages, but they don't really unpack the background assumption of whether the speaker and listener can even share the same contextual information and know-how: whether they can see the cues and read them properly. So I think this opens up an interesting opportunity to study how changing social and historical conditions can impact the ways that scientists use ambiguous language.

In this next section, what I want to do is start to unpack how we can take these ideals and use them, along with their conditions of applicability, to make predictions that can then be tested using corpora. This looks like a fancy equation, but it's actually quite simple when you get down to it. It says that the predictability of a term's intended meaning, its entropy or uncertainty, is just what you learn about the intended meaning for a particular context, given that
context, times the probability of that context itself, summed over contexts. In a given context, if I know it's going to be this meaning and not that one, I have great predictability; and as long as each context gives me certainty about the correct meaning in that context, then I really know what the term is going to mean overall. But if I go to all the contexts and I'm uncertain about the meaning in each of them, then we have maximal uncertainty, and I'm not going to be able to predict the intended meaning with any accuracy. What's nice about this is that we're now working with precise, distinct meanings and the contexts in which they occur, and these sorts of features are detectable by new natural language processing methods.

Before I get there, I want to draw a couple of novel predictions out of this setup. Does the availability of contextual information affect the use of ambiguous terms? Just thinking about the role of context in providing information about the correct meaning gets us two intuitive thoughts. When we don't learn from context, that is, when each of a term's meanings is about equally likely even given everything about the usage in which it occurs, a term with fewer meanings is going to be preferable to one with more meanings. If I have one term with five possible meanings, one of which I want, and another term with two or only one possible meaning, one of which is the one I want, then if context isn't doing much work, all else being equal, I'm going to prefer the term with fewer meanings. But when context is informative, a term with fewer meanings is, all else being equal, interchangeable with one that has more meanings, in terms of knowing enough to understand the intended meaning in a particular sentence: if the context is doing all the work I need, it doesn't matter whether the term has five meanings or one, I'm going to get the right meaning as I read it. What this is meant to do is set up expectations about the usage patterns of terms depending on the availability of context.

So what we've proposed to do, and started to explore, is investigating these predictions using a text corpus from JSTOR. The setup is to look at a set of synonyms that share at least one meaning in common, so that they're inter-substitutable in a sentence, or in a particular linguistic context, in a way that preserves the semantic meaning of that sentence; we're calling that a synset. Within a synset, some words that share a meaning might be short and easy to say, others longer; some might have three other possible meanings while others have only one. So within the synset the properties of the synonym terms can vary, but in a way that's fairly fixed across a corpus over time, or whose changes we can track. Given the same recurrent context in many different instances, we can look at how often each of the synonyms occurs in that context, see whether some synonyms are becoming more common relative to the others, and then explore whether that's related to external trends. If you have more people publishing in a particular field, do you see a preference for words with fewer possible meanings? If you've got a large group that doesn't share the same background cues and knowledge, do you see language shifting to prefer words that rely less on context to convey the intended meaning? I just want to flag here that there's a problem called synset induction in natural language processing, where basically you're looking to find contexts in which a set of words is
intersubstitutable in a way that preserves the intended meaning in that context. A lot of that work focuses on studying everyday words out of WordNet or Wikipedia, and it has that big-data computer science feel. Our interest is distinct: we're interested in digging deep into relationships among a smaller handful of technical terms within one field, like "subspecies" and some of its related terms, and in combining an expert analysis of the ways those terms are used and their possible meanings with scalable methods that can handle a larger corpus. If we can find the contexts that occur in the equation I showed earlier, and we know the meanings that are possible for each term in those contexts, we can start to get at trends in ambiguity over time, especially when we compare alternative terms that share some of the same meanings.

Okay, and now the pretty pictures, if you've stuck around. The focus of the case study is "subspecies," a kind of overlooked but still really rich and interesting aspect of a bigger debate that philosophers of science are all well aware of, and that is probably infamous much more widely: what is a species anyway, and how should we classify living things into meaningful biological units? We need to classify and name these units in order to communicate facts about living things: when you download biological data on the internet, it all comes with names as identifiers. But how we should carve those groups out is hotly disputed, and has been for many decades at this point. Within this bigger debate about species, the concept of subspecies is really important; sometimes species and subspecies overlap depending on who you're talking to (one person's species is another person's subspecies, and so forth). The subspecies category has been much maligned and attacked, but it's also been in use, even in quite formal and precise ways, since the 1870s. So somehow it's managed to stick around, and that says there's something here we need to understand. It hasn't stuck around because everyone figured out what it's supposed to mean; it's been doing some other kind of work in scientific practice. And there's not a lot of historical or philosophical work digging into the debates over subspecies: how the term has been used in different places, when people have been able to agree on a meaning versus not, and so forth.

The corpus we're going to explore this with is from JSTOR and consists of over a hundred journals pulled from the ecology and evolutionary biology subject group they've defined. It includes journals like Systematic Zoology (renamed Systematic Biology), Oikos, Paleobiology, and Mycologia, which covers fungi. So it's a targeted subset of papers published in a select group of journals, really honing in on discourse from ecology and evolution. Out of all the articles in there, the plot here shows the absolute number of token instances of the word "subspecies" over time; of course the corpus is also growing, but this gives you a good sense of how many instances we have. It's a good number, but still a small subset of the overall set of papers. We're going to focus on a set of six words: subspecies, variety, race, lineage, form, and ecotype. Our results will primarily focus on subspecies, but I wanted to highlight how these are related: we're trying to explore what we can learn about these synonyms using natural language processing methods.

The first thing we're looking at is how the sentences in which these different words occur are related in our corpus. Are they all clustering together, or are we seeing clear
separations based on the keyword that shows up in the sentence? Each dot here is a sentence, and the color of the dot reflects the word that occurred in that sentence: variety is orange, form is blue, subspecies is green, race is purple, and so on, with a thousand dots per color. What you can see is that there are clearly separated clusters here, but also some that overlap. I'll pop out for a second, because what we're working with here is a very high-dimensional representation of the sentences, and the more dimensions you have to play with, the better. This is the same set of clusters we were just looking at: here is ecotype, there is subspecies, and lineage. You can see that subspecies is actually a fairly distinct group compared to race, variety, and form, and that lineage and ecotype really stand out. So it does look like we have some overlap initially in the way this method, BERT, is representing the contexts of these words and their semantic content, which is promising.

But then we want to dig deeper and start to tease apart the lexical context in which a word occurs versus the meanings that are possible for that word in that context. The preliminary work we've been doing here is to define lexical categories, focused just on the term subspecies for now, where we're looking to unpack distinct grammatical ways the term can be used in different sentences, and then eventually pair that with an analysis of the available definitions of the term over time (different published definitions of subspecies), in order to look at the interplay between context and possible meanings. Being conscious of time: the categories we're looking at are fairly coarse-grained, distinguishing between singular and plural nouns and the different grammatical elements that get paired with them. Do you have a "the" in front and a name at the end, or just a "the," or an indefinite article "a," or a plural form? As a side note, you might think part-of-speech tagging would get you this, but when we use the NLTK toolkit it always tags "subspecies" as plural or as a verb, for whatever reason.

Focusing in on the subspecies terms, what we found is that these categories do indeed seem to form distinct clusters within the embedding space I was showing earlier. Here are the five categories again, with examples, and the highlighted dots show the sentences we've labeled by hand in the corpus. To see whether we can scale the manual labeling up to the whole corpus, we applied a k-nearest neighbors approach based on 150 sentences as training data, and it did reasonably well out of the gate, with an accuracy of 80 percent; as we grow the data set, there's reason to hope that will continue to improve.

Okay, so that was a quick tour of how we're starting to define the lexical contexts in which we can then track the use of alternative terms over time, as a way of getting at preferences for terms with more or fewer meanings in response to changing social contexts. Each of the different steps along the way has lots of opportunity for novel work and investigation. Formulating the theoretical principles based on the background literature could really help connect the philosophical questions to what's happening in linguistics and computer science; there's a lot more that could be done in drawing out predictions from the information-theoretic framing that Piantadosi et al. introduced; and continuing to develop the approach to synset induction that pairs the expert lexical categories that we've labeled
with a machine learning approach that can then scale up to a whole corpus is one of our key next steps. So I think I'm going to stop there to make sure we have enough time.

Fantastic, thanks so much. This is really neat; I hadn't seen this data yet, and it's really pretty. I'm impressed at how well this is working already. A couple of questions are coming in, two from Susan Hudson. First question: in ordinary language use, polysemy is common but ambiguity is very rare (the bank example is a good instance of this), so do you think scientific language is following different principles?

I think that's an interesting question. I'm not sure I agree that ambiguity is rare; I've seen it argued, by Piantadosi for example, that ambiguity within a context is rare, but I'm not convinced, given the exchanges I see between interdisciplinary scientists trying to talk to each other and the basic ways my colleagues try to communicate: there seems to be a lot of ambiguity there. So I think that's an interesting question to push on, one that might need a synthesis of data to settle. In the case of subspecies, what you can see historically in the literature is that people redefine it on the go as they're applying it, so they themselves are not always clear about what a given paper actually means by it, unless it's defined explicitly right up front in the paper, and that's often not the case. Then they fight about whether they're using it properly: they often say that people are applying the term incorrectly, or in a way that signals they're not really understanding what the definition is supposed to be. So in terms of available definitions and a fluctuating community, I do think there's substantial ambiguity there. But that's also part of what
we're hoping to get at as a quantitative measure: how well can an inexpert reader actually disambiguate the intended meaning in reading a paper? I think that's a good proxy for what it would mean for the language use of a field to change as the community changes. You've got a whole bunch of new people publishing or using the term subspecies: are they really understanding what the experts mean by it, or are they using it in ways that aren't careful, leaving the ambiguity actually quite substantial?

That's really interesting, thanks. A second question, also from Susan Hudson: when you were looking at the grammatical context of subspecies, did you look at post-modification, as in "subspecies of" something?

Yeah, and it's been interesting. I started to dig into the thousand variety sentences, and there's a distinct pattern there: you might have "the great variety of rock formations" in a place, and the lexical structure of having that intensifier or modifier in front signals a different meaning for the word variety, a kind of range or extent of something, rather than "the northern subspecies of goose," for example. So I'm anticipating that these categories are going to have to get more sophisticated and fine-grained as we bring in the other synonyms, and some of those patterns will pick up on something like "subspecies of" a species name as part of a larger group. Sorry, I'm not sure I'm entirely getting to the question, but I'm definitely seeing that, and I think it's going to be important in defining the lexical contexts.

Yeah, good, that gets it, I think. A question from Stefan Hespergen, who asks for clarification: what outcome are you hoping for from the corpus analysis, and how are you trying to tie that back to these more general questions about ambiguity? That's a thread I thought I didn't follow, my bad, but could you say a little bit more about that?

No, I mean, I did go through it fast, and this is also something where it's been an ongoing struggle to get all the ducks in a row in terms of how the pieces fit together. The big picture is that you've got words with more or fewer meanings, so in a particular context there may be more or less uncertainty about which of those meanings is correct, depending on the contextual information. Subspecies has relatively few meanings compared to variety, for example. What we're ultimately interested in is how the preference for subspecies versus variety or form or lineage or other terms is influenced by changes in the community of people using them. If you have new people or other fields coming in, we're hypothesizing that you can count on relatively less contextual information across the set of speakers and listeners, because there's less shared background knowledge, context, and history in the field. If instead you have a relatively small and stable subfield, you can use terms with many meanings, because you can presume the person you're talking to will know how to use the context to disambiguate a particular instance. So the goal is first to figure out how to define the contexts; then you can track how the terms are used in the same contexts, at different frequencies, across the corpus. Connecting that back to the information-theoretic framework, you can justify how you're tracking predictability of meaning by looking at how the substitutability of those terms informs your ability to estimate the frequencies of the senses. Great, thanks.
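[As an aside: the uncertainty measure discussed in the talk, the entropy of a term's intended meaning averaged over contexts, can be sketched in a few lines of Python. This is an illustrative toy, not the project's pipeline; the sense counts for "bank" below are invented, where in practice they would come from hand-labeled or model-predicted senses per context.]

```python
import math

def conditional_entropy(counts):
    """H(Meaning | Context) = sum over contexts c of p(c) * H(Meaning | c).

    `counts[context][sense]` holds how often each sense of a term was
    intended in each context (toy numbers here).
    """
    total = sum(sum(senses.values()) for senses in counts.values())
    h = 0.0
    for senses in counts.values():
        n_c = sum(senses.values())
        p_c = n_c / total
        # Entropy of the sense distribution within this one context.
        h_c = -sum((n / n_c) * math.log2(n / n_c) for n in senses.values() if n)
        h += p_c * h_c
    return h

# Two contexts that fully disambiguate "bank" versus one merged,
# uninformative context.
disambiguating = {
    "picnic":  {"river_edge": 10, "money_institution": 0},
    "deposit": {"river_edge": 0,  "money_institution": 10},
}
uninformative = {
    "any": {"river_edge": 10, "money_institution": 10},
}

print(conditional_entropy(disambiguating))  # 0.0: context resolves the meaning
print(conditional_entropy(uninformative))   # 1.0: one full bit of uncertainty
```

When each context pins down a single sense, the average uncertainty is zero even though the word itself is ambiguous, which is exactly the sense in which ambiguity plus informative context preserves clarity.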
Actually, that's a nice segue into another question, from Cody O'Toole, who says: hey Beckett (they're at the same institution), this initial analysis was done synchronically; I'm curious how this method could be adapted to apply diachronically, since meaning, and therefore context, often varies over time, especially over such a large time period as in this subspecies corpus. You were just mentioning near the end that that's a target for future work; how are you thinking about adapting this kind of methodology for that?

Yeah, I've seen different approaches in the literature. This cluster diagram is drawn from the whole corpus, so probably from 1870, in terms of the actual usage of subspecies, up to the present, and what we're seeing is a statistical random sample of sentences using these words from that whole corpus. In that sense it's giving us a diachronic picture of the full variation over time. Part of what I'm doing that I haven't shown today is going through some of the key theory papers in the field and pulling out the definitions they propose for these terms, making a giant spreadsheet with the definitions of subspecies being discussed over time as rows and the criteria they mention coded in the columns. You can see that at some point genetic differences enter the picture, whereas in the 1880s it's all phenotypic differences, and a focus on intergrading, a sort of continuity across geography, versus some notion of being a distinct lineage in a genetic sense. What's surprising from what I've seen there so far is that actually not that much has changed at root.
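[As an aside: the spreadsheet coding just described, definitions as rows and criteria as columns, can be mocked up as a simple matrix. The sources, years, and criteria below are invented placeholders rather than the project's actual coding; they just show how a stable semantic core versus late-arriving criteria could be read off such a table.]

```python
# Each row: (year, source) -> set of criteria mentioned by that published
# definition of "subspecies".  Entries are hypothetical placeholders.
definitions = {
    (1875, "source A"): {"trait_difference", "geographic_distinctness"},
    (1901, "source B"): {"trait_difference", "geographic_distinctness", "intergradation"},
    (1963, "source C"): {"trait_difference", "geographic_distinctness", "genetic_difference"},
}

# Criteria present in every coded definition: the stable semantic core.
stable = set.intersection(*definitions.values())
print(sorted(stable))  # ['geographic_distinctness', 'trait_difference']

# First year each criterion appears, scanning rows in chronological order.
first_seen = {}
for (year, _), criteria in sorted(definitions.items()):
    for c in criteria:
        first_seen.setdefault(c, year)
print(first_seen["genetic_difference"])  # 1963
```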
The idea that you've got some amount of difference in some kind of trait between populations that are geographically distinct: that's pretty much been there since 1870. You can add twiddles around it, but in that sense the semantic space for subspecies has, I would argue, been fairly constant. I've seen other approaches where, if there are really novel changes, you start from an early period, do your analysis for a ten-year window, then move forward to the next ten-year window, and so forth, and you'll see things added to your picture that you have to code and categorize step by step as you go. Our hope is that we can get the big picture up front, if we can take for granted a certain stability in the differences in meaning over time. Great, and that's actually a nice segue into the next question, from Stefan Lindquist, who says: great talk. I might have missed this, but could you explain whether there's a quantitative signal that you think will distinguish the good ambiguity from the bad ambiguity? This is the part that I think connects right to your last answer, so maybe you want to answer in reverse: what are your thoughts on what good ambiguity means in the context of the discussion of subspecies in this corpus? Yeah, so the setup I have in mind is that the principles I gave earlier help us interpret what it means to see certain patterns in usage. If you see a community making use of words with many meanings, and those words also tend to be the shorter, easier-to-use ones, then that's consistent with the second principle, which says that ambiguity is beneficial so long as meanings are used in distinct contexts.
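The ten-year-window approach mentioned above can be sketched as a simple grouping of dated usage records into fixed-width time bins. The records here are invented examples, not drawn from the JSTOR corpus.

```python
from collections import Counter, defaultdict

def by_window(records, width=10):
    """Group (year, term) usage records into fixed-width time windows,
    then count term frequencies per window, so each decade can be
    analyzed (and its senses coded) separately before moving forward."""
    windows = defaultdict(Counter)
    for year, term in records:
        start = (year // width) * width  # e.g. 1872 -> 1870
        windows[start][term] += 1
    return dict(windows)

# Hypothetical usage records: (year of publication, term used).
records = [(1872, "variety"), (1878, "variety"), (1879, "subspecies"),
           (1951, "subspecies"), (1953, "subspecies"), (1958, "variety")]

for start, counts in sorted(by_window(records).items()):
    print(f"{start}-{start + 9}: {dict(counts)}")
```

Comparing the per-window counts is the step-by-step alternative the answer contrasts with analyzing the whole corpus at once under an assumed stability of meaning.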
By contrast, if you see a community using terms with very few meanings, in a way that's consistent across all the different discourses, that would be consistent with the first principle, which says you want your terms to have one single meaning in all contexts. What I'm really hoping to get down the road is to actually see those patterns change, and to try to tie those changes to things going on outside the corpus. So if you see a community that was using lots of terms with single meanings, and now it gets smaller, a very stable set of people talking to each other for twenty years, and you see them drift toward using shorter, easier-to-use terms with more meanings, I think that signals there's also been a shift in how they're applying ideals or norms to the language they're using. That's the way I'm ultimately looking to connect patterns in the corpus to the applicability or adoption of these ideals as norms for how scientific language should be designed. Excellent, okay. That's the last question in the Q&A box, so I get to ask one of my own. One thing I was wondering, and this is a bit more technical: how clean did you find your data had to be to be fed into the BERT system? How precisely, manually cleaned? How much did you have to go back through your corpus?
Yeah, so the JSTOR corpus is a kind of grab bag, because my understanding is that they've been running OCR year by year as they add new articles and journals, so the OCR system they would have used in 2010 is not the same OCR system they're using in 2020. But that's not uniformly distributed historically, because they're adding new journals with archives, so it's not that the old articles are cruddy and the new articles are great; it's more that the new journals might be better. The OCR is definitely messy, and we've been exploring the feasibility of not doing a lot of hand cleaning, in part because we're looking at aggregate trends, changes in usage of terms on a fairly large scale over time, rather than trying to interpret single sentences to a high degree of precision. BERT works with bad data, though what it's telling you with bad data is a little bit questionable. One of the things we discovered is that for our application, BERT's existing training vocabulary doesn't really have the best coverage, so ecotype is just not even in there. What BERT does in those cases is treat ecotype as, in effect, an average of the pieces eco and type, and then produce embeddings based on that, which is surprisingly not bad, I guess, all things considered, but not something I would see as ideal. And I hate it, I hate to do it, but I have to cut you off there for time. Okay.
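The out-of-vocabulary behavior described for "ecotype" can be illustrated with a toy greedy longest-match splitter in the spirit of BERT's WordPiece tokenizer. The vocabulary here is a tiny stand-in, not BERT's real vocabulary, and real BERT produces one contextual embedding per piece (averaging them into a single word vector is one common downstream move, not part of BERT itself).

```python
def wordpiece(word, vocab):
    """Greedy longest-match subword split, WordPiece-style:
    an out-of-vocabulary word is broken into the longest known
    pieces from left to right; continuation pieces carry a '##'
    prefix, and '[UNK]' is returned if no piece matches."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece covers this position
    return pieces

# Toy vocabulary standing in for a model's training vocabulary.
vocab = {"eco", "##type", "subspecies", "type"}
print(wordpiece("ecotype", vocab))     # ['eco', '##type']
print(wordpiece("subspecies", vocab))  # ['subspecies']
```

This is why an unseen term like "ecotype" still gets a representation at all: the pieces are in the vocabulary even when the whole word is not, which is also why the resulting embedding is usable but, as the answer notes, not ideal.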