Well, I'd just like to echo Chantel in welcoming you all back. I think the rain may have discouraged some of the native Tucsonans, who aren't used to the idea that you can actually emerge when water is falling from the sky, but I think our crowd will probably pick up a little as the day goes on. You may also have noticed, if you were watching the streets at all, that Tucsonans drive incredibly slowly when it rains; they get a little nervous about it. Just a word about how we'll proceed with the panel. We're going to let our three panelists give their presentations back-to-back, without pausing in between for questions, so we can keep some momentum going. Then we'll have the panelists up here at the front responding to each other's papers, talking about any common points of interest or themes we see emerging, and things they think would be profitable to discuss or would like to ask each other. And then we'll open it up to questions and comments from the audience after that. So without further ado, we are delighted to have Rohini Srihari, who is an associate professor of computer science and engineering from the University at Buffalo, SUNY. Her talk will be on multilingual text mining: lost in translation, found in native language mining. So please join me in welcoming Dr. Srihari. Okay, first of all, I'd really like to thank the organizers for putting together this really fascinating conference and symposium. For me, it's been nothing but a learning experience since yesterday, learning different aspects of multilingualism, so I really welcome the opportunity to participate. Yesterday we heard some talks which were very interesting and which I think made a compelling case for the problems and pitfalls of monolingualism, as Chantel just mentioned. And I think those points really hit home pretty hard yesterday, based on the talks we heard.
And I think they were focusing primarily on the perspective of policy and regulations and government: what is being enforced and what should be encouraged. What I'd like to do today is give you a little bit of a different perspective, in terms of multilingual usage as we see it on the web. And the web, as you know, is a wild and woolly place. There are no rules and regulations. People do whatever they want; they say whatever they want. The advantage of that is we really see the kinds of phenomena that are emerging in terms of multilingual usage. So I just want to talk about that a little bit. Now, what is multilingual? I changed the title a little bit: instead of "lost in translation," I made it "lost in machine translation," because as computational linguists, whenever we think about translation, we're always thinking about machine translation. So I should have made that a little bit clearer. I do a lot of work in multilingual text mining, and so some of the questions might be: what does that mean? What are you trying to mine? What kinds of information are you trying to glean from the web? So I want to talk a little bit about that, and then get into some of the problems with machine translation as we see it, and how the web can actually help with a lot of that. First of all, some statistics on language usage on the internet. I know there's been a lot of emphasis on Spanish yesterday and today, and that's, I think, because of where we are and the prevalent use of Spanish. But if you look at language usage on the web, English is still the number one language, with Chinese a very close second, and you'll see how close in a minute. Spanish is third. These are statistics compiled by an organization that's constantly monitoring the volume of traffic on the web, and in a minute I'll show you some more interesting trends in multilingual usage on the web. And the rest of the languages account for 42.4%.
So from my perspective, that's what's most interesting: the rest. What's going on there, and how do we get to that kind of information? But it should be noted, people talk about social media and Twitter, and I'm always getting asked questions about the Twitter firehose and how much information there is. It's interesting to note that there's a Chinese microblogging site called Weibo, and the volume of traffic on Weibo has exceeded all of Twitter's volume on several occasions. So this is a real phenomenon happening in terms of non-English usage on the web. And just to drive this home, I don't know how much of this you can see in detail, but these are again statistics put together by Internet World Stats; you can go to the site and look at them. The chart shows some interesting trends. One of the first columns shows internet penetration by language, and those are the percentages I showed you on the previous chart. But to me, what's most interesting here is the fourth column, which shows the growth in internet usage. If you look at it, the language whose usage has grown from 2000 to 2011 by 2,501% is Arabic. And then you have Chinese at about 1,500% and Russian at about 1,800% growth. So while the number of English users is increasing, usage of these other languages on the internet is growing at a much higher pace, and I think that's what's really interesting to note. You can go to the site and look at the statistics, but there are some other interesting things in this chart. It shows that the top ten languages here, ranging from English to Korean, account for about 80% of language usage on the internet, and that's about four and a half billion people.
But then there are another two and a half billion people who are also using the internet in other languages, not accounted for by these top ten. So what's happening there? That's one issue. The other interesting thing, when I saw this, was: all right, English, Chinese... you expect this to correlate with world populations, right? English, of course, because it's so universally spoken; China is a very large country, and so it has a huge presence. But because of my own South Asian background, I noticed that not one single South Asian language is here, no Indian language. None of those are present among the top ten languages. And so one asks, why is that? There are some answers. In India, of course, most of the communication on the internet actually happens in English. But does that account for all the people who could use the internet in India? There are some other statistics I recently learned: only 12% of the population in India has internet access, whereas more than 70% have access to mobile phones. So if we enabled mobile access to the internet, we would see a lot more participation from huge populations. And that, I think, is one of the motivations and challenges for those of us working in computational linguistics, getting computers to help understand languages. So anyway, those are some interesting statistics. Now we get to: what do you do? What does multilingual text mining mean? What kinds of things are people interested in analyzing when they do text mining? I'll just give you some examples of the various kinds of things we have been asked to look at. Sometimes people want to know, for example, what is the language usage in a particular region or in a particular city? And here is one way you could do this, based on social media, of course.
And by the way, one thing I didn't mention when I showed you the previous chart: the reason there's this explosive growth of language usage on the internet is social media. It's not that people are creating more pristine websites with literature and things like that. Most of the volume is accounted for by social media, which some of us dismiss as teenagers just twittering about dates and so on. But truth be told, if you sift through all of that, there are some real nuggets of information in social media as well. So it's important to look at both. Anyway, this is one of the things you could do. One of the other types of analysis we did was during the Arab Spring movement, when we looked at social media and did language analysis like this by region. And why would somebody want to do that? Well, for example, if you were looking at the Middle East, maybe Dubai, and you started seeing Chinese posts and other non-Arabic posts from a predominantly Arab region, you might get a glimpse into what, let's say, the imported workers are talking about. They get a lot of migrant workers coming in from other countries, and these people tend to be the first to be aware of any kind of situation, because it impacts their jobs and their lifestyles. So sometimes it's interesting to mine what's going on in Chinese or Hindi from that particular region: what are they talking about, and how is it different from what the native population is talking about? Those are the types of analysis we get asked to do. So this is an example of that. This is, again, just Twitter traffic, and you certainly can't analyze societies by Twitter alone, but it still provides a lot of valuable information. You can analyze social media, and you can also analyze traditional media; I'll show you an example of that.
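The region-by-language breakdown described above can be roughed out surprisingly far with nothing but Unicode script ranges. The sketch below is a toy stand-in for a real language identifier, since script is only a coarse proxy for language (Latin script alone cannot separate English from Spanish, for example); the region labels and posts are hypothetical.

```python
from collections import Counter

def guess_script(text):
    """Naively guess the dominant script of a post from Unicode ranges."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if 0x0600 <= cp <= 0x06FF:
            counts["Arabic"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:
            counts["Chinese"] += 1
        elif 0x0900 <= cp <= 0x097F:
            counts["Devanagari"] += 1
        elif ch.isalpha():
            counts["Latin"] += 1
    return counts.most_common(1)[0][0] if counts else "Unknown"

def language_breakdown(posts_by_region):
    """Aggregate per-region script counts from (region, text) pairs."""
    breakdown = {}
    for region, text in posts_by_region:
        breakdown.setdefault(region, Counter())[guess_script(text)] += 1
    return breakdown
```

Fed a stream of geotagged posts, `language_breakdown` would surface, say, a pocket of non-Arabic-script traffic inside a predominantly Arab region, which is exactly the kind of signal described above.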
So in this case, we were looking at different hashtags that people were using on Twitter, with some sentiment analysis and things like that. Some of these techniques may not be that accurate, and maybe a little coarse, but when you start looking at the volume of data, it sometimes washes out some of those errors, and the trends actually do percolate quite well. Another kind of thing you can do, and this is another effort we've been asked to work on, is trend detection or meme detection: what are people talking about? Of course, you can go to Twitter and see the top trending phrases, but that's from all of Twitter. You might want to look at specific sources: what are the trending phrases from this particular region, from these types of sources, and so on. And also get the equivalent translations. When you do this, you find that if you take a certain time period, a time slice, the memes or trending phrases from different regions vary. Sometimes they share a lot in common, but different cultures, different societies talk about different things. That shouldn't come as a big surprise, but you can actually quantitatively go through and pick some of these out. So this is an example. We looked at Urdu news around the time of the Bin Laden killing, and we looked at what kinds of things were percolating to the top. Some of these may not be as obvious as you'd think. John Brennan, for example: we were wondering who that person was, and it turns out he did have some connection to this story. And if you compared this against what was trending generally on Twitter, there was a lot of difference. There was also something going on about the assassination of a judicial figure in that region, so people were talking about that as well. These are the kinds of things you can do if you can analyze media.
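One simple way to make "trending phrases from a specific time slice and source" concrete is to score each term by how much more frequent it is in the current slice than in a background window. This is a minimal sketch with add-one smoothing, not the system described in the talk; the example documents used in the test are invented.

```python
import math
from collections import Counter

def trending_terms(current_docs, background_docs, top_k=5):
    """Rank terms by log-ratio of frequency in the current time slice
    versus a background window; add-one smoothing keeps unseen terms
    from causing division by zero."""
    cur = Counter(w for d in current_docs for w in d.lower().split())
    bg = Counter(w for d in background_docs for w in d.lower().split())
    cur_total = sum(cur.values()) or 1
    bg_total = sum(bg.values()) or 1
    scores = {
        w: math.log(((cur[w] + 1) / cur_total) / ((bg[w] + 1) / bg_total))
        for w in cur
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Run per region or per source, the same scoring surfaces region-specific memes (a suddenly frequent name like "John Brennan") that global trending lists would drown out.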
Another kind of thing you might want to do is very factual analysis. Not just trends and topics: if there's a disaster, like the tsunami, for example, people are using social media to communicate. What happened here, what happened there; these are the roads that are passable or not passable. There's just a flood of information coming in, and if you can take all that information, summarize it, and present it succinctly to users, there are a lot of very useful societal applications of this type of analysis as well. So that's another thing you could do. And then there's another project we have been working on: analyzing potential bias in media. Is it true that in some societies, certain segments of the press are using very inflammatory language and extremist messages to incite popular opinion, and can you quantitatively show that? This is a project the State Department is actually interested in, and they're interested in it for good reasons: to take the data back to the media, to the press, and say, why don't you work with us and see if we can tone down the rhetoric in certain publications. And they're actually getting a lot of buy-in. They've used this tactic in Afghanistan and Iraq, and now they're trying to do it in Pakistan also. For all of this, you're not going to read English newspapers and get a sense of it. It's really the vernacular, so it would be languages like Urdu and Pashto. We've actually had quite a bit of experience working in Urdu, and we're now getting into Pashto also. And it's fascinating: they all look alike in terms of the script, but in processing these languages there are huge differences between them.
So again, quantitative analysis of this type. Just to give you an example of how challenging it can be to do this properly: for this attitude or non-topical analysis, as we call it, you have to identify the opinion holder, the target (who or what is actually the target of the opinion), and the actual attitude itself. Most people, when they look at this, see the green and red arrows they show you on CNN for positive and negative. If nothing else, I hope to convince you that actually doing this type of sentiment and subjectivity analysis is quite challenging and quite deep. We haven't solved it yet; we're nowhere near it. We're trying, and I think we're making a lot of progress. At least, for example, we can identify the agents and the targets reliably. But to really analyze the attitude, you need to understand so much about the context in which it was said. So there's a lot of work remaining to be done there, but if we can do it, there are a lot of uses for it. If you can do all of that, then you can present it in an interface like this, which I've worked on, where you've got all the different content sources and you can organize by organization or by location and so on, and allow someone to actually see this type of analysis result. Finally, the holy grail of text mining, which we're nowhere near but which people are really trying to attempt, is to see whether you can do some predictive modeling. That is, by analyzing large volumes of data, analyzing trends, seeing what's happening, can you actually predict that there's going to be some kind of really disruptive event? DARPA actually has a challenge out here where not only do you have to say that there's going to be some possible disruptive event, but where and around what time as well.
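As a very rough illustration of the kind of raw signal such predictive systems might start from, one can flag days whose message volume spikes well above the series average. This z-score sketch is my own toy example, nowhere near actual event prediction, and the counts in the test are invented.

```python
def burst_days(daily_counts, threshold=2.0):
    """Return indices of days whose volume is a z-score outlier
    relative to the whole series: a crude first-pass signal that
    something unusual may be happening, not a prediction."""
    n = len(daily_counts)
    mean = sum(daily_counts) / n
    variance = sum((x - mean) ** 2 for x in daily_counts) / n
    std = variance ** 0.5 or 1.0  # guard against an all-constant series
    return [i for i, x in enumerate(daily_counts)
            if (x - mean) / std > threshold]
```

Combined with per-region counts, a burst pins down a "where and when" candidate that analysts could then examine.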
So that type of prediction. They used as an example the London riots; there was a lot of information that could possibly have helped if it had been analyzed in real time to try to make sense of it. So predictive analytics is sort of the holy grail, and that's what people are trying to work on. It's a hard problem. Okay, so if all of those weren't challenging enough, now I'll show you the real challenges in dealing with the different languages. The first question many people ask is: you have all these different languages, and we've spent 30 years and I don't know how many billion dollars working on machine translation technology, so why can't we just translate everything to English? And the answer is: in some cases, if you just want a quick triage summary of what's happening, machine translation works pretty well; it's effective. But for the kinds of analysis we're trying to do, you lose a lot in the translation, because attitude and subjectivity are very nuanced types of information, and you really need to look for them in the native language. So if nothing else, I've been making the argument everywhere I go that machine translation is complementary to processing in the native language, but you really have to be able to process the native language itself. The other thing about machine translation systems is the way they're trained: on parallel corpora of one language and another, from which they learn how to compute the translations. Because of that, they have some pitfalls. One is that they do a really poor job of translating names. Machine translation typically garbles names quite a bit, and that can be expected, because names are an infinite set, right? It's not a closed set; you keep getting new names all the time. Unfortunately, a lot of the interesting things you want to extract hang on names, right?
You need to know who or what or where. So this is just an example: in this case it was some Urdu text, and if you feed it through Google Translate, it comes out as "Education Minister Mohammad Hanif Half-Dead." But that last part is actually his name: it should have been Mohammad Hanif Atmar, and the system garbled it. And once you get that wrong, you lose a lot of the other information too, right? I'm not making this up; it's actually true. So how can you solve something like this? One of the things we do is go through the native Urdu text, and the first thing we try to do, in the native language, is identify names based on context. Based on context, we know that a particular string must be a name. Once you know those characters are a name, you do selective translation of that name, and I can show you how we do that as well; we use a lot of internet resources for it. This is where processing in the native language can really complement machine translation: you look at the native language, you tag things like names and translate those, possibly some subjectivity elements as well, and then you can feed it into a generic machine translation system and get the full translation. So this is where they work together. The other issue is dialects. When people talk about Arabic, everyone means Modern Standard Arabic. That is the standard for printed or communicated forms that the media uses. Unfortunately, there's no one who actually speaks Modern Standard Arabic; it's just something that was created to standardize communication. What people actually speak are Egyptian, Levantine, Iraqi, and so on. So suppose you have a system that has been adapted only for Modern Standard Arabic, and take a sentence like "There is no electricity; what happened?"
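The tag-names-first, translate-selectively idea above can be pictured as a mask-and-restore step around a generic MT engine: known names are replaced by placeholder tokens the engine will pass through untouched, and the vetted renderings are substituted back afterwards. The name lexicon, the placeholder convention, and the Perso-Arabic spelling below are all illustrative assumptions, not the actual system described in the talk.

```python
def protect_names(text, name_lexicon):
    """Mask known names with placeholder tokens before generic MT,
    so the engine cannot garble them into things like 'Half-Dead'."""
    placeholders = {}
    for i, (name, english) in enumerate(name_lexicon.items()):
        token = f"__NAME{i}__"
        if name in text:
            text = text.replace(name, token)
            placeholders[token] = english
    return text, placeholders

def restore_names(translated_text, placeholders):
    """After MT, substitute the vetted English renderings back in."""
    for token, english in placeholders.items():
        translated_text = translated_text.replace(token, english)
    return translated_text

# Illustrative mined lexicon (spelling approximate).
lexicon = {"محمد حنیف اتمر": "Mohammad Hanif Atmar"}
```

In a real pipeline the lexicon would come from native-language name tagging plus mined translations, and the masked text would pass through the MT engine between the two calls.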
If you apply the MSA-trained system to the actual dialect versions of that sentence, you can see what happens to the Google translation: it goes totally wrong. Which means it's not just the languages themselves but the dialects, too, that need a lot of attention. This is where machine translation and multilingual text mining can get pretty challenging, because of all the dialectal variations. We actually worked on a project in collaboration with Columbia University, where they were looking at the various dialects to see if they could at least normalize across dialects, so that you could get better translations. So that's another problem. And here's an opportunity to use the web as a corpus. I talk about multilingual text mining, but one of its major uses is actually to improve the quality of translation. The whole web can be treated as a corpus, and my favorite data set right now is multilingual Wikipedia. We just love it, because typically you have articles like this one on Barack Obama, and you have a Chinese version, and an Arabic version. If you can figure out some points of correspondence, you get immediate translations. You get the correct name translations; you get all kinds of other things you can immediately translate. Machine translation systems, the way they're designed, require someone to sit down and say: this is the English version, this is the Chinese version. That takes an enormous amount of labor and hand annotation; it's very time-consuming and very expensive. But with sources like multilingual Wikipedia, people are doing that for you. This is the beauty of crowdsourcing, really: people are writing this, and they're providing the links. But it's not as perfect as you would think. Some people don't bother to backlink correctly. So yes, you'll have a version in English and a version in Arabic.
But if the person who authored the page doesn't supply the backlinks and things like that, you miss out on some of the information. So anyway, what we are doing right now, on a regular basis, sometimes nightly but at least weekly, is mining multilingual Wikipedia to extract translations, primarily name translations. As all of you know, for any name, any event of any significance, within a few minutes there's going to be a Wikipedia page about it. So why not try to exploit that? In my opinion, one of the biggest applications of multilingual text mining right now is actually learning language resources: how to translate, how to do even basic chunking, and things like that, which I'll talk about in a second. And then you have these other kinds of phenomena on the web that you've seen: code mixing and code switching. This is very, very common in South Asian languages. In India, if you go there, people speak what's called Hinglish, which is a combination of Hindi and English. And really, sometimes it's annoying unless you know both languages: they'll start a sentence in English, use some Hindi words in between, and finish the sentence in English. Just back and forth between the two languages. And it's just how all the younger people speak nowadays. The most interesting thing to me is that Hindi is a language predominantly spoken in the northern and central parts of India; in southern India, Hindi was not spoken. But even in southern India, the youth have started using this Hinglish phenomenon, mixing Hindi and English. And it's not just in South Asia. I'm not an expert in Spanish, but I'm assuming you have Spanglish, where on the web people are using combinations of Spanish and English. Urdish, mixing Urdu and English, is another one.
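The nightly or weekly Wikipedia mining mentioned above can be pictured as a pass over interlanguage-link records: each article title paired with its counterpart in another language yields a free, crowd-vetted translation, and for biography pages that pair is exactly a name translation. The record format below is a simplified assumption for illustration, not Wikipedia's actual schema; missing backlinks simply yield no pair, which mirrors the incompleteness just described.

```python
def build_translation_lexicon(langlink_records, source_lang, target_lang):
    """Harvest title pairs (source_lang -> target_lang) from
    interlanguage-link records; articles without the target link
    contribute nothing."""
    lexicon = {}
    for rec in langlink_records:
        if rec.get("lang") != source_lang:
            continue
        target_title = rec.get("links", {}).get(target_lang)
        if target_title:
            lexicon[rec["title"]] = target_title
    return lexicon

# Illustrative records; real dumps hold millions of these.
records = [
    {"lang": "en", "title": "Barack Obama",
     "links": {"zh": "贝拉克·奥巴马", "ar": "باراك أوباما"}},
    {"lang": "en", "title": "Some Stub", "links": {}},  # no backlinks supplied
]
```

Rerunning this over fresh dumps is one cheap way to keep a name-translation lexicon current as new names appear.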
And this causes nightmares for people who are building computational tools, because it's bad enough that you have to build them for English and Urdu and Spanish, but now you've got this mixture. I had a student who was actually working on this problem. You see this especially in social media, on all the blogs and things like that. There are all kinds of solutions for this, which I'm not going to get into; it's more technical, but essentially you have to apply two different language systems together to try to understand this material. Finally, one of the major areas, which I've hinted at before, is language resource acquisition. I think there's a tremendous opportunity to do this right now because of all the multilingual usage on the web. What do I mean by language resource acquisition? If you're designing natural language processing systems, or text mining, or machine translation, any kind of language technology, you need a lot of resources for the language. You need lexicons, or dictionaries, in electronic form. You need translation lexicons. You need basic syntactic analysis tools like part-of-speech taggers and chunkers, which group words into noun phrases, verb phrases, and things like that. These are all the building blocks that go into a language processing system for a given language. The problem is that it's very expensive to produce these resources by hand. The government has actually spent tons of money producing these kinds of resources, so English, the European languages, and even Arabic and Chinese all have very rich linguistic resources right now. But what happens when you go to what they call the less commonly taught languages?
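A first step in handling code-mixed text like Hinglish is deciding, token by token, which language you are in, so the right tools can be applied downstream. The sketch below treats Devanagari script as a sure sign of Hindi and falls back to a tiny wordlist for romanized Hindi; both the wordlist and the approach are illustrative assumptions, and a real system would use something like character n-gram models trained on both languages.

```python
# Tiny illustrative list of romanized Hindi words (an assumption).
ROMANIZED_HINDI = {"hai", "nahi", "kya", "bahut", "accha", "aur"}

def tag_tokens(sentence):
    """Label each token of a code-mixed sentence 'hi' or 'en'."""
    tags = []
    for tok in sentence.split():
        word = tok.lower().strip(".,!?")
        if any("\u0900" <= ch <= "\u097f" for ch in tok):
            tags.append((tok, "hi"))  # Devanagari script: definitely Hindi
        elif word in ROMANIZED_HINDI:
            tags.append((tok, "hi"))  # romanized Hindi by lexicon lookup
        else:
            tags.append((tok, "en"))  # default to English
    return tags
```

For example, `tag_tokens("this movie is bahut accha")` labels "bahut" and "accha" as Hindi and the rest as English; each run of same-tagged tokens can then be routed to the matching language system.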
Different people have different definitions of what a less commonly taught language is, but it could include things like Yoruba, which is spoken in West Africa. Russian, believe it or not, doesn't have as many resources available. Swahili and Somali, for example; all these kinds of languages keep coming up. The web actually provides a great opportunity to try to construct these resources automatically, by looking at corpora like multilingual Wikipedia and blogs, doing trend mining, and inferring that this phrase must correspond to that phrase, and so on. This is what some of my research has focused on. On this slide, you don't have to look at the diagram on the left-hand side in detail, but we're using a lot of machine learning, what we call semi-supervised machine learning, to see if we can acquire these resources. So I think this is a very exciting time to be working in this area of multilingual text processing, and I have students working on various types of projects here. I was intrigued by some of the pictures I saw yesterday, especially the first one. What was that? You had two people who didn't have bodies, and there was the abstract concept of what they were supposed to be saying, and they were communicating it through language. In my world, this translates to something like this: maybe you do have bodies and so on, but the important thing is that they're using some kind of speech-to-speech translation. I think this is the real challenge for our field. Speech-to-speech is one of the hardest problems you can think of. Here we're going between two different languages, but all of you can relate to this: anyone who has called one of those customer service lines, where you dial one, say yes, say no... it's really hard, right?
And in spite of that, I think we've actually made a lot of progress. For a single utterance, you can actually get reasonable translations. But if you're trying to understand attitude, or trying to understand dialogue, anything beyond one or two sentences, the problem is the context, right? Someone was mentioning yesterday that you have to understand the history, the culture, what was previously said. From a computational perspective, we refer to all of this as context. So the critical question is: how do you come up with computational models for this type of context? There have been several efforts that attempt to do this. One effort recently under way is metaphorical analysis of language. When you're reading something, people use so many metaphors. Can you actually build up a library of these metaphors and know when each applies and what type of context it suggests? I think this is one of the critical problems. Even if we can solve all the, as someone called them, "nouns and verbs" problems, which we're trying to do by looking at multilingual web data, this problem of modeling context is enormously challenging. And you need to solve it if you're going to actually understand dialogue. The last comment I'd like to make is that one of my own concerns is the preservation of certain dialects that I think are going to go extinct. Where I come from, we speak a certain dialect in our community. It continues in this generation, but we're having a hard time conveying it to the next generation. And what I noticed is that there are some younger people who have actually started blogging about this on the net, trying to document why our dialect is the way it is. And actually, nothing like this has existed so far.
It's not like there's an ancient written tradition; dialects are just passed on verbally, right? So there's no book or source material. But I now see efforts from younger people who are actually trying to document what these phrases and words mean and where they come from. So I think there are a lot of positive aspects to this kind of use of the web, even by the youth. So anyway, that's what I have to say. Thank you.