I'm Alolita and I'm very happy to see everybody. Thank you for having me here at IIT. I've been on this campus on many occasions over the past 10 years, doing a lot of open source work and working with many of the professors and initiatives on open source projects. I'm very happy to be here today to talk about language technologies and engineering, Wikipedia and Indic language support.

I will try to keep this discussion a bit interactive, so don't be surprised if I ask you questions. I want to split it into two parts. In the first part I'll give you an overview of some of the areas I'm talking about and thinking about, and the second part will be a more interactive session, to better understand some of the areas you're working on in language technologies and some of the questions you may have about what we talk about. We have approximately an hour, so roughly 40 minutes of presenting and then about 20 minutes of Q&A. If you have a burning question, please reserve it for after we go through. Don't hold back your thoughts. I want to hear all your questions.

So, how many of you use Wikipedia? Come on, I've got to have a full show of hands. Everybody, right? Yes. As you know, Wikipedia is a very popular and very loved project. It really is the hard work of many hundreds of thousands of people all over the world, contributing their knowledge in different ways to share with others. It is an amazing phenomenon: a mega-scale project on the internet and the web, built collaboratively, where millions of people come and read Wikipedia as well as contribute content in the different languages of Wikipedia. Everybody all over the earth uses Wikipedia. So what do I want to talk about today?
Again, very specifically: we are working on a lot of different projects, in a lot of interesting areas that are just so cool, right? We are looking at mobile, at language technologies, at web engineering, at large-scale performance and scalability, at algorithms, at an enormous number of things. But I want to talk about some specific areas related to Indic language computing. The reason I say that is because by background I'm an engineer: I started my career in network engineering, landed into internet engineering, and then moved on to web engineering — and language technologies is one slice of that.

So when I started looking at language technologies — I have been involved in localization and internationalization for many years now, but really digging into what we could support for Wikipedia — I wanted to look at a few facts about Indic languages. What is the reality? All of you read Wikipedia in English? Yes? Any other languages? Hindi, Marathi, Gujarati? Tamil, right? But that's small. Mostly you go to English Wikipedia.

India has 23 official languages at this point, maybe becoming 25 soon, 8 major scripts in terms of writing systems, and literally thousands of years of literature, oral and then written. And yet on the web it is incredibly poorly represented. The collective wisdom of 1,200 million people — India itself, plus hundreds of millions more across the Indian subcontinent — is practically unrepresented on the web. It's almost as if this population doesn't exist. It doesn't. So it really is very poorly represented, and there are an enormous number of factors behind that, which you know about, and technology is only one of them.
We could argue for hours and days about what those factors are and why. But the bottom line is access to the internet, access to computing, and the lack of language tools and data when you land on a computing platform, whether that's mobile, desktop or laptop — assets like fonts, input methods, data, dictionaries, glossaries, all this stuff. Where is it for Indian languages? Nowhere to be found. So there is a fundamental red flag there; that's why I marked these in red.

I look at the web as a really amazing opportunity. It's the first time in the history of mankind where everybody can have equal access to a platform on which everybody can talk, share and present ideas. It really is that powerful. And yet you're completely underrepresented. We call ourselves a very technologically aware population, but we are very, very underrepresented, and when you come to Indic languages it's even worse. We have to be kidding ourselves when we say, hey, we speak so many languages, we are multilingual, we are diverse — and yet we only look at English on the internet. So millions of people we choose to neglect, or we just say, hey, they don't exist. They can learn English. Good for them. If they don't make it, who cares?

So when you look at digital content on platforms like Wikipedia, you'll immediately see that reflected in the type and the amount of content that exists, because that underrepresentation shows up very quickly. And that's what we're going to talk about. Why?

So let me tell you a little bit about Wikipedia's scale. Wikipedia is the largest content repository on the planet today on the web. You know that. It has about 32 million articles in 287 languages. We have hundreds of other languages in incubation, which means there are content communities who are very small, with less than 10,000 articles, who are in incubation.
We will not put them into production until there is critical mass in the community. But we do have 287 languages represented on Wikipedia itself. That's a lot of languages. Even against the thousands of languages that exist on this planet, this is the first time in digital history that so many languages are represented, with content showing up in 287 languages.

So let's talk about the breakdown. Take these 32 million articles. Who is creating this content, and who cares? You have four and a half million articles in English itself. That's quite a lot of articles, right? And that's one of the reasons you consume content in English: because you can find it in English. You may not find the same amount of detail in Polish or in Hindi or in Marathi, but you'll find it in English, perhaps. Then you have four European languages — German, Dutch, French and Swedish — with about one and a half million articles or so each. And then you have one million articles or so for Italian, Polish, Russian and Spanish.

So where are we? We are the second largest nation on this planet. Where are we? Nowhere in the top 10. And that's 50% of Wikipedia's content right there. So either we are a nation of illiterates who don't know how to write, or we are completely underrepresented. And this is the trend on the internet, because Wikipedia is fundamentally a barometer of what kind of interaction is happening on the web in terms of consumption and creation of content. We are nowhere to be seen in the top 10 for Indian languages.

And then we have 287 minus the top 10: 43 languages with one million to 100,000 articles; 73 languages with 99,000 to 10,000 articles; then 101 languages with 10,000 to 1,000 articles. Just think how fast that curve drops. There has to be something wrong.
And then, from 1,000 down to 100 articles, 61 languages. That's fun stuff, right? That's not really real content. But across 287 languages we run 797 production websites — that is, there are 797 projects in Wikipedia's universe, including Wikipedia plus the other related projects. And we have very little representation from the Indic languages, from the Indian subcontinent.

We have half a billion unique users every month. Just think of the scale we are talking about. 21 billion page views a month — that's hundreds of billions a year. And out of that, mobile, where we see consumption of content growing exponentially, is at 4.8 billion at this point in time and growing constantly — almost at a 200% rate every year. So that's a high-level snapshot.

Now I want to talk about Indian languages. Let's get down to the numbers and see where we are. You saw how the numbers break down for the 1-million-plus top 10: 50% of Wikipedia is sitting in those top 10. And from 1 million down to 100,000 there is only one Indian language: Hindi. One. And that is at 111,000 — 1,11,608 articles as of now in Hindi. And how many speakers of Hindi exist in this country, and worldwide? You want to take a guess? How many? 50 crores? Exactly. 50 crore — 500 million. So for 500 million people we have 100,000 articles. That shows how well represented we are.

Then let's get to the next few languages in the big top five. Nepal Bhasa, which is a very, very small language with a very small community — but because it's so passionate, it stands at number two in our Indic language family. Isn't that interesting? It's a Devanagari-based script. So 70,000 articles there. Then Tamil. Tamil articles are very good in quality, most of them, but only 61,000.
And does that represent the richness of Tamil conversations, local culture, history, science? There's so much to talk about. Telugu is 57,000. Marathi is 40,000. That's the top five. What happens after that? You have one language at 100k-plus. You have 16 Indian languages between 99k and 10k. Between 10k and 1k there are seven languages. And then you have five languages with less than 1,000 articles. 1,000 articles. That's pretty pathetic.

So I'm going to go through the numbers here, and I'm going to compare them with speakers. Just look at them; hopefully it will strike you as odd. Hindi: 100,000-plus articles, almost 500 million speakers. Nepal Bhasa: 70,000 articles, only 1.2 million speakers — a small language, very active, very interested in presenting their local information on the web. Tamil: 74 million speakers, 61,000 articles. Telugu: 75 million speakers, 57,000 articles. Urdu — which combines Pakistan, Kashmir, parts of India and Bangladesh coming in for contributions — 104 million speakers, 51,000 articles. Marathi: 71 million speakers, 40,000 articles. Bengali: 215 million speakers, 30,000 articles. Does that represent what you think it should? It's pathetic. Look at it. Look at Gujarati: 46 million speakers, 25,000 articles.

So what's wrong here? Do you hate your own native language? Do you not like it? Are you ashamed of speaking it and using it? What is this? And then you have the smaller ones, 10,000 and lower. Look at the languages: Punjabi, Sanskrit, Oriya, Tibetan, Assamese, Bhojpuri, Sindhi, Kashmiri. Look at the number of languages we have. People are interested. But look at the content that is represented. It's pathetic. It's just very, very low. And the last language — Sichuan? It's a language spoken in the northeast, across the border into eastern China, with a very large community near Tibet.
But it is a very, very marginalized community in China, so they are not allowed to contribute. They've come and started trying to contribute on Wikipedia. So you're seeing their interest, but they're not able to contribute, for other reasons.

So again, I hope that gives you a little bit of perspective, because that's why I want to talk about open data. How do you change this equation? It's pathetic: so many of you come to IIT and work or teach and talk about all these great things you think you're doing — but look at the representation on the web. Who cares what you do?

So let's talk about open data and why I think it's so important from a computational perspective. It is a model for creating more data through the applications we are building, and a model for having more people contribute. How many of you are familiar with different open data projects — whether that's content, or computation like databases, Hadoop, Eucalyptus, different technologies? You've heard of them? Good.

Why do we talk about it? Because remember, data on the digital planet needs seeding. Data creates data, just like money creates more money. When people want to start contributing, they have to have some fundamental components of data they can easily build on online. It's not good enough to just post a status update in SMS-speak for your personal edification and call it sharing data. You need to type, you need to know the language, you need to know some specific things. But open data is the most scalable model when you are looking at a huge universe of people interested in the creation of data.
As an engineer, I'm looking at Wikipedia and at how crowdsourced contribution has literally worked and scaled, with millions of articles of very high quality being created — and the seeding of data is key. How do you jump-start? Even if we were starting today and saying, hey, today we have 111,000 articles in Hindi and I want one million articles in one year — how are we going to do that? Open data is one of the fundamental models for jump-starting that. The other is language corpora. When you ask folks where the language corpora are — the corpus we could use to jump-start other contributions — there's not much for Indic languages online, unfortunately. I'm looking at crowdsourced contribution models, and getting on par with the large European languages is key. You have to do it.

So what do we need? Think of structured, linked data. When you think of structured data, what do you think of? Do you understand what structured data is? It's interconnected data. What does that mean? Can I ask somebody — what do you think of when I say I need linked data? Something like RDF? Yeah, that's a format. Do you think of specific applications? Related data — addresses, names, locations — but also different kinds of applications: map information, the data behind your Google Maps. You want to know everything about a location when you're going there. But we need data. We need data of all kinds. We need cities and location information, we need geospatial information, we need education information. Every component you look at is a challenge. You want that data mapped enormously, in order to build a grid and a network of interconnected data that helps build better content. You have to do it.
These are crowdsourced projects in themselves. I care about all the forts in Maharashtra — I love those forts. I'd love to see all of them with their geo information, their history, their name, when each was built, when it was maintained and when it fell into disrepair. I want to see all that information — structured information — on the web, so that if you're building a mobile application you can go look it up and just build a nice application on top of it. But you fundamentally need to structure information and data in a digitally consumable format in order to share it, reuse it, make it available digitally and consume it. You have to structure information on the web. There are so many different areas where data can be collected, and that in itself is a whole ton of projects which can be done in a very cool, crowdsourced way.

Let me give you some examples. When I talk about this, I'm talking about it at a national level, at a state level, at a city level, at a local level — anybody can do this, right? But there are some national initiatives which actually leverage this. In the US there is data.gov. Take a look at it — look at the amount of structured data that is publicly available for consumption by anybody who wants to build on it. In the European Union, which is a confederation of tens of countries today, you have specific projects funded by the union to go and structure data. Why does this matter? This is not an esoteric computer science theory; it really is reality. If you don't structure the data you're talking about, how on earth will you make it available on the web? You can write as many research papers as you want — they have no value if the data isn't available on the web. Japan, too, has a linked data cloud: a whole interconnection of linked data structures which can be used for different kinds of data.
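To make the fort example concrete, here is a minimal sketch of what one structured record could look like, in the spirit of JSON-LD and schema.org. Everything here is illustrative — the coordinates are approximate and the field names are my own choice — but it shows the point: once the record is machine-readable, any application can pick the reader's language and render it.

```python
# A hypothetical structured record for one Maharashtra fort,
# loosely in the shape of JSON-LD / schema.org. Values are illustrative.
fort = {
    "@type": "LandmarksOrHistoricalBuildings",
    "name": {"en": "Raigad Fort", "mr": "रायगड किल्ला"},
    "geo": {"latitude": 18.23, "longitude": 73.44},  # approximate
}

def label(record, lang, fallback="en"):
    """Pick a name in the reader's language, falling back to English."""
    names = record["name"]
    return names.get(lang, names[fallback])

print(label(fort, "mr"))  # रायगड किल्ला
print(label(fort, "ta"))  # no Tamil label yet, so: Raigad Fort
```

The same record could later gain Tamil or Bengali labels without touching any application code — that is what makes the layered, linked model scale.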
Just think of it as map layers of information: you can take those layers, mix and match them, and create applications people can use, right? So I would say: evaluate these programs, take a similar initiative, and create some of these projects to build large-scale repositories of data. It doesn't matter how accurate it is initially. It has to be done, because if you don't have structured data and you don't have digital data, you cannot get to creating web content at any large scale. It won't happen.

So let's talk about Indic open data. That was the global scale; what am I missing in Indic data? We are building a huge content translation platform in Wikipedia: a platform that will enable people to take a really nice article in English and say, hey, I want a suggested translation in Hindi — and it will give you a suggested translation, which you can then correct, adapt and publish. It's at least bootstrapping some of the effort, making it easier to create content like that. So given that project, we started digging into Indic languages. We want each Indic language to get to a million-plus articles. How do we do this? How do we bootstrap it?

And we found a terrible state. There is just so much underdeveloped or missing data. Terminology glossaries — do you know what terminology glossaries are? It's like taking medical terms, or math terms, or geography-based terms: different categories of information. Dictionaries — give me one good Hindi-to-English dictionary on the web. Tell me which good dictionary exists. Do you ever look one up? Give me an example of any other Indian-language dictionary online. Find me one. Thesauruses, where you want to look at synonyms. Lexical databases.
Corpora, where you have Indic languages compared word for word. If I know Hindi and I want to write in Marathi, I want the right word to express myself in that language. Where will I find it? Unless you knew it from your school or from your mother, where do you go and find it on the web? Grammatical references — give me a grammar book you can look up online to see exactly how to construct a good sentence in, say, Hindi. Give me some examples. Email me when you find them. Spell checkers in Indian languages — yes, I understand there's a lot of research done around this, but I want them in production systems. Auto-completers. Machine translation. And we talk about knowing languages, huh?

So what are the issues we face with Indic data, when I, as Wikipedia, want to go and build all these tools? Quantity of data: there's not enough Indic data, not enough content on the web, very little digitally available. Quality of data, and verifiability of that quality. Representation in computational formats like RDF. Retrieval — give me a fast search engine that can search in Kannada. Tell me. Lack of standardization and consistency in what data is online. Usage variation — I don't want to use the Tamil of 500 years ago; that was classical Tamil. I want the Tamil used today, which may be a conglomeration of English words, of Sanskrit words, of local words that have come into Tamil — but I want to use what is used today. Give me a version of it online, tell me where it is, and adapt it for Wikipedia. The larger issues you know about; read through this. I tried going and looking.

And at this point I'd like Pushpak to talk a little bit about his work on WordNet. Three minutes.

Can you go back to your last slide? Yes. Sorry. Is this on? We do have this on the web.
All these WordNets. And we also have IndoWordNet, which is a linked structure of Indian-language WordNets. So this problem of finding equivalent words — for a Hindi word, the equivalent words in Tamil, Malayalam, Marathi and so on — that you can do through IndoWordNet. Quite a lot, actually. This particular resource is very heavily used and it has very good coverage. Now we are waiting for support from the ministry for the second phase of the project. All these synsets are linked by semantic relations — hypernymy, hyponymy, everything. They also have translated glosses and example sentences, and many of them have morphology analyzers as their front end as well.

Cool, thank you. Other than specific instances where specific academic teams have taken on projects like WordNet, there's very little that exists. You can go looking for websites and you will hardly find a handful online. And that's very problematic, because when I, as Wikipedia, want to go and use these glossaries and this data, it's very hard to consume them: they're not in the right format, or they're not available, or they're not open, or they're not free, or they're not API-driven. Lots of different issues. And these are only some of the examples I can find offhand. It's really not very much.

So I want to very briefly introduce a small project, Wikidata, which Wikipedia is doing — introduce the concept and ask for your contribution. The reason I ask is that all of you are interested in language technologies and building language content, and Wikidata is an attempt to structure information — as I was talking about earlier, in terms of linked information — as a platform for all languages. It is an open source project. Take the example of Berlin as a city: you can go, as a contributor, and create an element that you want to talk about, right?
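To give a feel for it, here is a rough sketch of the shape a Wikidata item takes: multilingual labels plus property-to-value claims. The field names loosely mirror what the real wbgetentities API returns (real claims are much more deeply nested); this dict is hand-written rather than fetched live. Q1156 is Mumbai's actual item id; P17 and P30 are the "country" and "continent" properties.

```python
# A hand-written snippet in the rough shape of a Wikidata entity:
# multilingual labels plus property claims. Values are simplified.
mumbai = {
    "id": "Q1156",
    "labels": {
        "en": "Mumbai",
        "hi": "मुंबई",
        "mr": "मुंबई",
        "ta": "மும்பை",
    },
    "claims": {
        "P17": "India",  # P17 = country
        "P30": "Asia",   # P30 = continent
    },
}

def infobox(entity, lang):
    """Return the (title, facts) pair an infobox in `lang` would need."""
    title = entity["labels"].get(lang, entity["labels"]["en"])
    return title, entity["claims"]

title, facts = infobox(mumbai, "ta")
print(title)         # மும்பை
print(facts["P17"])  # India
```

The key design point: the facts are stored once, language-neutrally, and every language Wikipedia renders them through its own labels — which is exactly why contributing a label in your language makes the data usable everywhere.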
Say you know something about Bombay — Mumbai — and you want to add different details to it. In the languages that you know, it's as simple as adding the name; you can see that it's translated, and people have added versions in different languages. So it's language-to-language translation of a term, and at the same time linking specific attributes to an element. That's structured information. You can also see: the country is Germany — Berlin is in Germany — the continent is Europe, it's an instance of a city, a city with millions of inhabitants. All these attributes are being collected in a digital way, and it is an open repository, so you can create data for all the cities and places you are familiar with, or items you would like to add. You just log in and add stuff. It's an open source project, but very easy: you don't have to program, you just add information. And it's really cool, because you will always get credit — the wiki always records everybody's contributions. Time zone, local dialing code, identifiers, GND identifier, all kinds of stuff, geo information. It's like describing something in multiple dimensions.

Why is this project interesting? Because this format — linked data — is the only way to scale across languages. Say I want to look up the information for Mumbai in Tamil, and show all of it on the Tamil Wikipedia. I need a translation of those terms, and of the words and titles of all the categories. If you add them in the languages you know, they will automatically show up in Tamil. So when I ask for the Tamil version of Mumbai — give me the infobox; you've seen the infobox on the right-hand side of Wikipedia — it is as simple as: if you contribute information here, it will automatically show there. That's the power of structured information: you're not only providing this huge dataset that you can consume anywhere, you can also interchangeably use the same standardized information across multiple Indic languages. And that's very key, because if you don't have Indic languages supported across the board, it will not scale fast enough. So that's a project I'd love for you to go and check out: it's wikidata.org, very simple. Go check it out, and send me questions if you have any — I can help you.

So I want to talk a little bit about what else Wikipedia is doing. Wikipedia is doing a lot in language technologies. We are enabling content translation; we are starting to look deeply into machine translation, into linked data, into ontologies, into translation memories, into data repositories and sources for 287 languages. And Indic languages are prioritized for us, because we see a very high growth rate of consumption there — but we want more data, and we want more computational components to build those tools well. We are building a content translation platform integrated into Wikipedia, so that you can write in your own language easily and translate with suggestions coming from machine translation, translation memories, different data sources, glossaries, terminologies, dictionaries — different kinds of suggestions — building translation memories, adding digital dictionaries, leveraging crowdsourced linked data. That's a massive project.

What else are we doing? Language selection. We have a whole set of libraries — all open source, all reusable on any website you're building — including a language selector you can go and use across 37 languages; we provide an enormous amount of functionality there. Go try out some JavaScript. There are fonts — you know about fonts, right? The tofu, those blank boxes, that you see when most Indian languages cannot be rendered. We provide free, high-quality fonts for 63 languages, including Indic languages, in 81 variants. Why are we doing this? We want to make sure this baseline works for everybody on the web, which means every browser can show your language: you can read and you can write in your language. Input tools help you write: when you are in an edit box on a website, you should be able to type in your language. Go check it out on Wikipedia — it's built into the input boxes. We have 139 input methods for 64 languages, and we keep growing that towards all 287 languages. We don't need them for Latin-based languages, because Latin input comes built into browsers, but unfortunately for other families — Indic, CJK, Cyrillic, Greek — we need that support. On-screen keymaps: any of you up for the challenge, come and help us write some of the on-screen keymaps — that's the library we're going to use for writing on tablets and smartphones. A good challenge. And then of course internationalization — a lot of detail, a lot of work — plus software, UI and message localization. Just think: everything is crowdsourced, everything is open source. A huge amount of work. And we're working with different telecom partners to make sure there is free access to Wikipedia for everyone, everywhere.

So where are we headed? Wikipedia is 14 years old, believe it or not, and we still only have 110,000 articles in Hindi Wikipedia. Wikipedia really is one of the most significant disruptive platforms on the internet — it's an amazing content commons, and we want to be the content commons for the web, which means the magnitude of content we carry will only continue to grow, because we see more and more people contributing. We are building different ways for people to contribute easily: whether that's an image of a place you like that you take and post on Commons, or an article where you just go into the editor and fix a spelling mistake, or anything else. Creating open data is a drive — we will push this, because we do care about it.
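A word on what those input methods actually are underneath: essentially a rule table mapping key sequences to script output, which is the spirit of the jQuery.ime-style schemes I mentioned. The toy table below covers only a few Devanagari rules and ignores conjuncts, matras and context sensitivity — it is a sketch, not any real scheme.

```python
# A toy flavor of a Latin-to-Devanagari input method: try the longest
# matching rule first, so "kh" wins over "k". Illustrative only.
RULES = [
    ("kh", "ख"),
    ("aa", "ा"),
    ("k", "क"),
    ("i", "ि"),
    ("a", "अ"),
]

def transliterate(text):
    out = []
    i = 0
    while i < len(text):
        for pattern, replacement in RULES:
            if text.startswith(pattern, i):
                out.append(replacement)
                i += len(pattern)
                break
        else:  # no rule matched: pass the character through unchanged
            out.append(text[i])
            i += 1
    return "".join(out)

print(transliterate("kaa"))  # का
print(transliterate("khi"))  # खि
```

Real schemes carry hundreds of such rules per language, which is why shipping 139 of them as shared, open source tables is far more maintainable than every site reinventing its own.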
Generating high-quality content is one of our mandates; we want to keep raising the quality of what is created. Delivering a first-class multilingual user experience is a very important concept: as the web changes from being Latin-based to being multilingual, it is imperative that everybody who speaks a non-Latin, non-European language gets the same user experience as English or German or Swedish — because if you have to download an image just to read your language, that's a rotten user experience. That is very key. And engaging a new generation of users, with mobile being everywhere, on smartphones and feature phones — those are fundamental components. We will commoditize language software to be free, open, and available as building blocks on all browsers and across the web. And it doesn't matter if Google doesn't do it or Microsoft doesn't do it, because we can do it. And keeping the web open and free is very important, because if you don't keep the web open and free, you cannot contribute — Google will tell you what to contribute, and you will not be able to. That's why Wikipedia exists.

So again, I won't go too much into this because I want to take some questions; I'll just brush through. These are details about the content translation platform we are building — I'm racing through these slides because this is a tool we are building and I just wanted to show you a visual. This is the source content we are bringing up: as an editor you can select an article in English and say, I'm going to translate it to Hindi, or to Marathi — here it is Korean, but you can see on the right-hand side we have a language toolbar which gives you suggestions from machine translation and from translation memory. I think I skipped some slides, but you can see that here you have dictionaries, and you have the usage from the source. We plan to integrate all kinds of data into that, so that as a user you can just look things up there — I want to see the exact meaning before I write it — making it easier to write content. You can look at different references, and we intend to do that for Indic languages. You can do it paragraph by paragraph, with suggestions showing up from machine translation or from translation memories. And you can even carry over linked data — see those links? Those links carry over. That's the amount of stuff we are doing with this platform.

And that's all I have. Work with us — we are open source and open. Please contribute, and ask me questions. Questions, comments? Yes — stand up, introduce yourself. What is IDC? IDC — Industrial Design Centre. Okay, cool. I have a couple of questions. How is the coverage of Wikidata as compared to before, and how are you differentiating? I wanted to see what your data was. And I personally was not very happy with Wikipedia imposing their fonts, so I preferred my own. Are you talking about English?
Q: I am talking about the design part directly. I believe that Wikipedia is wonderful, which is why I ask: why does Wikipedia want to impose its fonts? Why not let the system fonts apply?

A: We don't impose anything. I can address the web fonts question first, although it's not directly related to this presentation. Today Wikipedia pushes web fonts on demand, because from the data that we collect, most browsers do not have fonts available for specific languages on specific pages. For Indic languages, you'd be surprised how many people do not have fonts on their systems or in their browsers. We detect that, and then we push fonts. We have a default enabled; as a user you have to turn it off. The reason for that is that we want to make sure everybody can read the information. So it's not that we're dictating the font; we're actually trying to address everybody's ability to read the page. As a designer you might be thinking about which font; our fundamental concern is whether people can read at all. I understand the technical and aesthetic issues that Lohit has, but for a very long time Lohit has, unfortunately, been the only family of open-source fonts available on the internet for everybody to use. Q: But you could download the Adobe fonts before that. A: That was work we did for a company, previously. To the specific question about fonts: if you have made fonts available under open licensing, we will push those too. We want to constantly deliver the best fonts that are available with open licenses to the user. We are not in the business of recommending any particular font; we are just taking whatever is available. We are not
in the business of creating fonts; we are just taking what is available on the web, used in the open-source environment with open licenses, and delivering it to the user, in the open. Again — absolutely, I would request that you, as a designer and a font designer, design ten more fonts for every Indian language. What are you waiting for? Give those ten fonts to me and I will roll them out. If you have released a font, thank you for it. All I am saying is: if you are going to complain, come up with a solution and give me some fonts, and I will roll them out. I am not producing them myself. Q: But if you remember, in 2011 a font was brought into Marathi Wikipedia; it was initially available and then brought in. A: Yes, absolutely; again, Narayama was the predecessor of that font, and we generally work with everybody. We care deeply about having a great user experience and having the best fonts for Indian languages, but believe me, there are not enough of high quality, technically. So if you have just released a font, thank you for it, and yes, we want to roll out whatever is open.

Q: Can I ask a question on the earlier slides? I have a few things to say. One: thank you very much, very nice. I think you have much better data than me, and that is why I was taking pictures, because I cannot build a similar presentation along similar lines in the era of Wikipedia. One of the things you listed was the number of speakers of each language, which is a fair comparison. Many others would think, without articulating it directly to you, that internet penetration differs across places — and we could work to change that — so I had a different take: how many internet users are there in each language? If you put that as a third column, then, in the spirit of the classic light-bulb joke,
we asked how many internet users of that language it takes to make one Wikipedia page — and the Indian languages still come at the bottom. A: Yes, that's right; I have that data and I can add it next time. Q: I will look forward to that, but they still come at the bottom of that table, and one of the languages that comes at the top is Waray-Waray, in the Philippines. So that is where I started from. The second thing, very briefly: input methods. That is an area where we work, and we create them as well. We are really concerned about the state of input: per internet user, we blog among the least in the world, and we are among the least Twitter-active in the world as well. All of this points largely to barriers to input rather than to any lack of love for the languages. For example, television viewership in Hindi far exceeds English, which is why the financial channels have moved to Hindi, and so on. So on input, I think there is a lot more we have to collaborate on.

A: Absolutely. I didn't go into that, because this presentation was focused on data, but we have been doing a significant amount of work on input methods. We have taken every single input method that is available — ones that communities and open-source projects have been using — and added them to our open project. If you see input methods that are missing, we will roll them out; it is an open project and we are happy to add them. But the larger challenge, going into the next decade, is on-screen key maps. For on-screen key maps there is no standard; the Indian government doesn't have a standard, and there is no de facto standard either. We can create one and we can use it, but
there are no standards from any body. Q: There is a big standardization project happening. A: Yes, but you know the process of standardization; I don't want to denigrate it, but I cannot afford to wait for it, because the users need this now. Q: I have always had reservations about English-based input methods, mainly because English represents a very, very small community of the world — an even smaller proportion of this country can speak English. A: Fundamentally there are two key problems with input methods. First, even the current desktop input methods, which have now been adapted for the web, were not designed for Indic languages. They were not designed at all — they were just put together, because somebody figured out that on a Latin keyboard you could do three layers of, you know, Control-Shift build-up to reach a character. Who does that? So those input methods were never that good to begin with: they are not well designed, and they are not really adapted for Indic languages, which have far more characters. Second, as we go into the world of on-screen key maps — everything is moving to digital and mobile — it is very important to redesign them from a design and usability perspective, and have them integrated as an open library for every device on the planet, so that a Samsung, a BenQ, a Nokia, or Google can go and just pick up those on-screen key map libraries and use them. They have to be designed by Indians who actually have a sense of usability and empathy for the Indian-language user, and be published as a specification — not a way for the government to tell you what to do. So the point is: publish it, let there be collaborative discussion on it, do it the open way with open source, and just publish a standard. We will build it; it won't take us more than a week to build. Technology is not the issue. The issue will be usability, the design,
and doing a good job. Otherwise you will get key maps from China which simply list Indic characters in reading order, typed one after the other, designed with no sense of how a native user actually types in the language. Then people take that incorrect, incomplete paradigm, copy it over from the key maps that exist now, fundamentally reinventing the same stuff and carrying it into the next generation. What's the point of that?

Q: I have been working in mobile computing for the last few years, and I have two questions. One is about web fonts: how are web fonts going to interact with system-level applications on a mobile phone, especially if you are using a WebView? In Android, the language rendering back end is a very weird architecture, and we do a lot of work in that area. How will web fonts interact in Android or elsewhere?

A: Web fonts are fundamentally pushed on the page; it really is not something Android controls. There are options, and you can override, but web fonts only kick in at the browser or WebView level, not in native apps. I don't want to maintain a browser; what you are seeing is an implementation of technology that is needed by the user and is not available from the browser. What does Wikipedia do in the meantime?

Q: What he is asking is quite relevant: what happens if there is no system-level support? Then there is no point in pushing fonts at the web level; and if there is system-level support, it will most likely be there on release. Have you done a study on that? We essentially found that where there is system-level support, the system provides complete rendering, in which case there is no point; your usability
study probably will not capture that.

A: We do a lot of usability testing and we do a lot of studies. If there is no system-level support, then pushing the web font will add a little delay — understood. But the point about system-level support and system-level provisioning of content is this: a system not having a font loaded while the script is supported is one scenario, and whether Unicode supports that script or standard at all is a totally different thing; that's not what is going on here. [Moderator: There is another meeting.] We actually have another five minutes, so let's get the rest of the questions.

Q: One more question, again about the system level. Do you include any kind of web-font detection? Suppose in Android there is no support for the Hindi language: what kind of web font do you push, and how do you get feedback from it?

A: That will only be for certain cases. Remember that the web is moving to HTML5 and CSS; native apps are actually being discredited very heavily, and the standards work at W3C is moving to HTML5/CSS specifications as the basis of mobile development support.

Q: The reason I raise this point is that I have dug into AOSP Android's HTML rendering, and it was pathetic — even they don't seem to know how they did the rendering, and each and every vendor does it their own way, including for entering text. In that case this will be a big issue.
You can't even get one vendor to follow the standard the same way as another.

A: That may be exactly one of the points we are taking care of. We work with, and push requests very heavily to, Android and to Firefox. The only one where we really have issues is iOS, because of the licensing of the mobile OS and browser. We work very heavily on surfacing these issues. But remember, this is a changing paradigm: the state browsers are in today is a transition into mobile browsers, and the level of language support gets prioritized based on markets. With commercial browsers, you will never get support for a language which is not being used very heavily. Q: You shouldn't have a problem with your languages — hundreds of millions of users matter to Google. A: That is why Google is very, very active in providing language support, at least for some languages. Q: But some languages? A: In the top 58 languages, I would say, yes. Android has seven languages, Samsung has more, and Microsoft is putting out more. Exactly. But my request is that we build and maintain these together, because at the Wikimedia level we can push that, and that in itself is used as a parameter by Google, for example, when they have to prioritize languages. So leverage that, because that is something that drives the support for languages: instead of waiting five years for it off the shelf, we can have it in a year or less. That is the main thing — that is where the collaboration is; working together, we can push that. Q: We are using Google Translate, and I would be very happy to be part of this. A: Yes, I would like to talk to you. Thank you again — you know where to find me. Thank you very much.
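The translation-memory suggestions described for the content translation platform can be sketched roughly as a fuzzy lookup over previously translated sentence pairs. This is a minimal illustration, not the actual platform's API: the `MEMORY` pairs, the `suggest` function, and the 0.7 similarity threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher

# A toy translation memory: previously translated (source, target) sentence
# pairs. These example pairs are placeholders, not real platform data.
MEMORY = [
    ("Delhi is the capital of India.", "दिल्ली भारत की राजधानी है।"),
    ("The river flows to the sea.", "नदी समुद्र की ओर बहती है।"),
]

def suggest(source: str, threshold: float = 0.7):
    """Return the stored translation whose source sentence is most similar
    to `source`, or None if nothing clears the similarity threshold."""
    best_score, best_target = 0.0, None
    for src, tgt in MEMORY:
        score = SequenceMatcher(None, source.lower(), src.lower()).ratio()
        if score > best_score:
            best_score, best_target = score, tgt
    return best_target if best_score >= threshold else None
```

A real system would index the memory for scale and rank several candidates alongside machine-translation output, but the core idea — surfacing a past translation when the new source sentence is close enough — is the same.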
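The on-demand web-font behaviour discussed above — detect that a page contains a script the client may not render, then serve an open-licensed font — can be sketched with a simple Unicode-range check. The function names and the single Devanagari range are simplifications for illustration; the production logic covers many scripts and also checks what the client reports. Lohit Devanagari is used here only as an example of an open-licensed font family.

```python
# Devanagari Unicode block (U+0900 to U+097F).
DEVANAGARI = (0x0900, 0x097F)

def needs_webfont(text: str) -> bool:
    """Heuristic: does the text contain Devanagari codepoints? If so, a
    client without a suitable system font would show empty boxes."""
    lo, hi = DEVANAGARI
    return any(lo <= ord(ch) <= hi for ch in text)

def font_for(text: str):
    """Pick an open-licensed fallback font family, or None if the text
    is renderable with common Latin fonts."""
    return "Lohit Devanagari" if needs_webfont(text) else None
```

In the browser this decision would feed a CSS `@font-face` rule so the font is downloaded only when actually needed, which matters on slow mobile connections.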
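The rule-based input methods discussed in the Q&A map sequences of Latin keystrokes to Indic characters, matching the longest rule first. A minimal sketch of that core mechanism, with a deliberately tiny rule table (these four rules are illustrative; a real input method scheme has hundreds of context-sensitive rules):

```python
# Tiny illustrative rule table: Latin key sequences -> Devanagari output.
RULES = {
    "na": "न",
    "ma": "म",
    "s": "स्",   # bare consonant carries the virama (no inherent vowel)
    "te": "ते",
}
MAX_KEY = max(len(k) for k in RULES)

def transliterate(latin: str) -> str:
    """Greedy longest-match transliteration: at each position, try the
    longest possible rule first, falling back to shorter ones."""
    out, i = [], 0
    while i < len(latin):
        for size in range(min(MAX_KEY, len(latin) - i), 0, -1):
            chunk = latin[i:i + size]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += size
                break
        else:
            out.append(latin[i])  # pass unmapped characters through
            i += 1
    return "".join(out)
```

With this table, typing `namaste` produces नमस्ते. The point made in the talk stands, though: getting the rule tables right for each language, and the on-screen key layouts that expose them, is a usability and design problem far more than a technology problem.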