Hello, everyone. My name is Shinno Iwami. I would like to talk to you about methodologies for grasping OSS as an open source program office. First of all, I have a notice about intellectual property. Although I work as an expert at NEC Corporation, I am giving this presentation in a private capacity, because it includes analyses from before I joined NEC Corporation. I have more than 20 years of experience as an engineer, and in the latter half of my career I have worked on data science for technology management, in line with the current trend of evidence-based policymaking. So I believe that I can give you some hints for your open source program office. The analyses in the figures on this slide were performed to get an overview of the field of robotics and to discover emerging technologies, using databases of academic papers. In the upper figure, the analysis built a citation network and extracted fields such as mobile robots, manipulators, and medical robots. In the lower figure, the analysis discovered the field of perovskite solar cells as a rapidly emerging technology.
This presentation has four parts. The first section is the introduction, in which I will talk about open source program offices and about the scope and framework of the analysis in this presentation. The next three sections introduce the methodologies and results of the analyses.
An open source program office has several roles, which are classified into outside, inside, and both. For the outside, contributions to the OSS community are required. For the inside, the open source program office promotes the strategy and utilization of OSS and fosters an open source culture. Additionally, it has the role of communicating between the outside and the inside, and it maintains compliance related to OSS. The scope of this presentation is to provide methodologies for making a strategy. In particular, I use quantitative analysis, so-called data science, with network analysis, machine learning, and natural language processing.
The merit of data science is that it obtains established evidence with automated operations. However, data science requires a large amount of data, so it takes time until enough data has accumulated. For discovering tiny signs, similar to intuition, qualitative analysis such as interviews and observation has an advantage. The demerit of qualitative analysis is the increase in manual work.
Next, I explain the framework for the analysis in this presentation. To make an OSS strategy, we often want to grasp the latest situation of society, technology, OSS, competitors, and our own organization. I propose a framework for looking at society, technology, and competitors. It comes from a basic and simple analysis, 3C analysis. 3C analysis considers a business from three positions: customer, competitors, and company. Regarding OSS, the concept of an OSS community does not match any of the three positions. An OSS community is like a society, but it has participants from both competitors and the company. This presentation proposes a framework in which the OSS community is added to the conventional three positions.
From this slide on, I introduce three methodologies. The first analysis is about the trends of society and technology. To decide which field my organization should work on, this analysis gives a methodology for getting an overview of society and technology. For this analysis, I explain the data source, methodology, results, and findings. In the methodology I use TF-IDF and cosine similarity, which I often use for other analyses as well.
This slide shows the steps of the methodology. In the first step, videos are selected. In this analysis, keynotes at WWDC 2021, Google I/O 2021, Microsoft Build from 2018 to 2021, and CES 2021 are used. Several years are used for Microsoft because I could find its videos for several years. In addition, the keynotes of CES include a keynote by Microsoft, so I expect to compare keynotes by the same company at different events.
In the second step, the selected videos are played, and the text of the audio is retrieved via a speech recognition function. In this analysis, Google Docs was used. When you perform this analysis, you had better use the text of a transcript if you can get one, because the analysis will become more accurate. However, speech recognition has the merit that we can get text data from any video or audio format. Virtual events are now increasing, and we can access more events easily, so the opportunities to get text data are increasing. In the third step, nouns are extracted with the Natural Language Toolkit for English sentences. In Python programming, the Natural Language Toolkit is called NLTK for short. When you extract words, a lemmatizer is better than a stemmer for reducing words to their original form. Steps one, two, and three are performed for each keynote. Then steps four and five are performed once, comparing the keynotes. In the fourth step, the nouns are scored with TF-IDF. In the fifth step, cosine similarity is computed, and the results are drawn as a heat map.
Next, I explain TF-IDF and cosine similarity, which are applied after extracting the nouns. If you collaborate with data scientists, please remember only what these techniques are and what they are for. Data scientists will know well the tips for using them, and you can just ask the data scientists to do so. TF-IDF computes a score that identifies feature words by comparing several documents. TF-IDF gives a high score to words that appear frequently in one document. Meanwhile, TF-IDF gives a low score to words that appear frequently in all documents. For example, frequent words such as I, mine, me, the, and of are given a low score, and they are called stop words. For implementing TF-IDF, the Python module scikit-learn is available. scikit-learn has a setting that removes stop words before computing TF-IDF. Cosine similarity is a score of similarity between documents.
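The TF-IDF scoring described above can be sketched in plain Python before turning to the similarity step. This is only a minimal illustration under stated assumptions: the function name `tf_idf`, the keynote names, and the noun lists are hypothetical, and the formula is the textbook one (term frequency times the logarithm of the inverse document frequency). scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant, so its numbers would differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score terms per document. docs is a list of token lists.

    Returns one {term: score} dict per document, using the textbook
    formula tf * log(N / df). This is an illustrative assumption,
    not the exact variant used in the talk.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    all_scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        doc_scores = {}
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / df[term])
            doc_scores[term] = tf * idf
        all_scores.append(doc_scores)
    return all_scores


# Hypothetical noun lists standing in for the nouns extracted from keynotes.
keynotes = {
    "keynote_a": ["cloud", "ai", "cloud", "developer", "drone"],
    "keynote_b": ["cloud", "ai", "vehicle", "battery", "vehicle"],
    "keynote_c": ["cloud", "ai", "developer", "security"],
}
names = list(keynotes)
scores = tf_idf([keynotes[name] for name in names])

# Print the two highest-scoring feature words of each keynote.
for name, doc_scores in zip(names, scores):
    top = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)[:2]
    print(name, top)
```

In this toy data, a word that appears in every keynote (such as "cloud") gets a score of zero, while a word that is frequent in only one keynote (such as "vehicle") gets the highest score there, which matches the behavior described in the talk.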
On this slide I write down sentences x and y, but we can apply the similarity to documents by treating one document as one sentence. In addition to cosine similarity, there are several other similarities, such as the Jaccard index, the Dice coefficient, the Simpson coefficient, and others. From my experience, cosine similarity is one of the best. The Python module scikit-learn also includes a function for cosine similarity.
These are the results of computing cosine similarity after computing TF-IDF. These colorful figures are heat maps that show the cosine similarity. Apart from the heat maps, the analysis outputs a list of nouns ordered by TF-IDF score. On the previous slide, I showed cosine similarity using zero or one for each word. The left heat map is computed with a simple frequency count instead of the zero or one of the previous slide. The right heat map is computed with TF-IDF instead of zero or one. On the heat maps, red means high similarity and blue means low similarity.
One output is the list of feature words by TF-IDF. AI software development, cloud, mobile, and cybersecurity were extracted as frequent topics. As outstanding topics, drone, game, and sports were extracted from the Verizon keynote. There is the word Starbucks from Microsoft, and it signifies the digitalization of conventional non-IT equipment. By investigating which words relate to a line of low similarity on the heat map, we can find irregular topics. The upper line indicates Microsoft 2020, which has unusual topics about COVID-19. The lower line indicates GM, which has different topics from the other organizations. GM referred to electric vehicles and batteries. By investigating which words relate to a line of high similarity on the heat map, we can find organizations with similar topics. Microsoft spoke about similar topics except in 2020. Best Buy and Walmart naturally spoke about similar topics because they belong to the same industry.
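The similarity measures mentioned above can be compared on a toy example. The following is a minimal sketch in plain Python, not the code used for the analysis: the token lists `x` and `y` and all function names are illustrative assumptions. It computes cosine similarity with zero-or-one vectors and with simple frequency counts, plus the Jaccard, Dice, and Simpson coefficients, and builds the pairwise matrix that a heat map would visualize.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def binary_vector(tokens):
    """One for each word that appears, zero (absent) otherwise."""
    return {term: 1.0 for term in set(tokens)}


def count_vector(tokens):
    """Simple frequency count for each word."""
    return {term: float(n) for term, n in Counter(tokens).items()}


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def dice(a, b):
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


def simpson(a, b):
    """Simpson (overlap) coefficient."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarities; this is the data behind a heat map."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Hypothetical word lists standing in for sentences x and y.
x = ["cloud", "ai", "cloud", "security"]
y = ["cloud", "vehicle", "battery"]

print("cosine (0/1):    ", cosine(binary_vector(x), binary_vector(y)))
print("cosine (counts): ", cosine(count_vector(x), count_vector(y)))
print("jaccard:         ", jaccard(x, y))
print("dice:            ", dice(x, y))
print("simpson:         ", simpson(x, y))

matrix = similarity_matrix([count_vector(x), count_vector(y)])
```

Replacing `count_vector` with TF-IDF scores per document would give the input for the right-hand heat map described above. Drawing the matrix itself would be a separate step with a plotting library.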
In summary, this methodology can extract technological trends, and changes in technological trends, from the keynotes of worldwide events. Again, the topics extracted in this analysis as trends of society and technology are AI software development, cloud, mobile, cybersecurity, drone, game, sports, digitalization, medical ID, and electric vehicle.
The second analysis is about the key groups of competitors. To decide whether my organization becomes a leader, follows a leader, or withdraws, this analysis gives a methodology for identifying competitors and their alliances by OSS-related topic. For this analysis, I explain the theoretical background, methodology, data source, results, and findings.
Before introducing the methodology, I explain the theoretical background. Regarding the forms of sellers in a market, economics has four forms: perfect competition, monopolistic competition, oligopoly, and monopoly. Against monopoly, some countries have antitrust law. In a class on technology management, I heard that business tends to be stable under oligopoly and monopolistic competition. For example, I consider that operating systems for computers are provided by two companies, Microsoft and Apple. Moreover, operating systems for mobile phones are provided by two companies, Apple and Google. As an ongoing example, there are three strong sellers of cloud computing: Amazon, Microsoft, and Google. In addition to the four forms, there is a form called excessive competition, which is caused by too many sellers or by selling too cheaply. In that class, the Japanese home appliance market around the 2010s was said to be under excessive competition. There were mainly eight makers: Hitachi, Panasonic, Sony, Toshiba, Fujitsu, Mitsubishi, Sharp, and NEC. Since then, several companies have withdrawn from the market and other companies have joined, so there are still too many companies. In short, to decide whether to join, collaborate, or withdraw, we must grasp the situation of the key players in the market.
Next, I explain the methodology. Data were retrieved from Microsoft Academic Knowledge. In the first step, data are retrieved via the Academic Knowledge API, and the formatted data are inserted into a local database. In the second step, the numbers of papers, citations, and co-authoring relations are counted for each organization. In the third step, bar graphs of each number and co-authoring networks are drawn by year. In the fourth step, the figures are organized in order to find insights. In the co-authoring networks, the top five organizations and each one's top four co-authoring organizations are drawn. Namely, a co-authoring network has at most 25 organizations in a figure (5 + 5 × 4). For this analysis, the main selected topic is blockchain, as a topic deeply related to OSS, so the parent topics and child topics of blockchain on Microsoft Academic Knowledge are also analyzed. Additionally, open source software, open source hardware, open innovation, and open data were added to the analyzed topics.
This slide shows a characteristic result. Please pay attention to the grayscale co-authoring figures on the lower half of this slide. Here, a co-authoring relation means a kind of good partnership. The figures show several stages from decentralization to oligopoly. As a model case, before the analysis of OSS-related topics, the case of AI shows the growth from decentralization to oligopoly, as shown in the co-authoring network figures along the blue arrow. By placing blockchain on this timeline, the growth and current status of blockchain are revealed. IBM and Alibaba Group have become leaders in blockchain. In the same way, open source hardware, open source software, open innovation, and open data were plotted on the AI timeline. These topics are too conceptual, so there is no sign of oligopoly. To observe the growth of partnerships through co-authoring networks, one requirement is that the topic be a specific technology.
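The counting and selection behind the co-authoring networks can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the function name `coauthor_network`, the input format (one set of organization names per paper), and the sample organizations are hypothetical, and ranking the top organizations by number of papers is an assumption, since the talk does not say which count was used for the ranking.

```python
from collections import Counter
from itertools import combinations


def coauthor_network(papers, top_n=5, top_partners=4):
    """Select leading organizations and their main co-authoring partners.

    papers: iterable of sets of organization names, one set per paper.
    Returns (leaders, edges, nodes), where edges maps an organization
    pair to the number of co-authored papers.
    """
    paper_counts = Counter()
    pair_counts = Counter()
    for orgs in papers:
        orgs = set(orgs)
        paper_counts.update(orgs)
        # Every pair of organizations on one paper is a co-authoring relation.
        for a, b in combinations(sorted(orgs), 2):
            pair_counts[(a, b)] += 1

    # Assumption: leaders are ranked by number of papers.
    leaders = [org for org, _ in paper_counts.most_common(top_n)]

    edges = {}
    for leader in leaders:
        partners = Counter()
        for (a, b), n in pair_counts.items():
            if a == leader:
                partners[b] = n
            elif b == leader:
                partners[a] = n
        for partner, n in partners.most_common(top_partners):
            edges[tuple(sorted((leader, partner)))] = n

    nodes = set(leaders) | {org for pair in edges for org in pair}
    return leaders, edges, nodes


# Hypothetical papers for one topic and one year.
papers = [
    {"OrgA", "OrgB"},
    {"OrgA", "OrgB"},
    {"OrgA", "OrgC"},
    {"OrgB", "OrgD"},
    {"OrgA"},
    {"OrgC", "OrgD", "OrgE"},
]
leaders, edges, nodes = coauthor_network(papers, top_n=2, top_partners=1)
print(leaders, edges, sorted(nodes))
```

With the defaults `top_n=5` and `top_partners=4`, the result never has more than 5 + 5 × 4 = 25 organizations, which is the upper bound mentioned above. Running the function once per year would give the series of networks used to look for a move from decentralization to oligopoly.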
Next, these figures show all the analyzed topics. Along the horizontal axis, the growth to oligopoly is related to data size, that is, the number of records retrieved from Microsoft Academic Knowledge, which is the number of results when you search for a topic on Microsoft Academic Knowledge. On the previous slide, one requirement for growth to oligopoly was to be a specific technology. However, in the left figure, specific technologies such as Hyperledger, Ethereum, and Bitcoin do not show growth to oligopoly. In this presentation, it is difficult to check the organizations for all topics. The way to use this methodology will be to check the organizations after identifying the stage of the technology you are interested in. In summary, this methodology can identify where a specific technology stands between the era of decentralization and the era of oligopoly.
The third analysis is about portfolio management by country. To discover markets related to my own technology, this analysis gives a methodology for comparing strengths and weaknesses. The analysis of security is shown as an example; thus, I introduce the analysis as a methodology. For this analysis, I explain the data source, methodology, and results.
Next, I explain the methodology. Data were retrieved with the query "security" from Web of Science, provided by Clarivate Analytics. The first step is retrieving bibliographic data from the database. In the second step, the citation network is composed. The third step is extracting the maximum connected component of the graph. The fourth step is clustering by the Newman method. The fifth step is automatic extraction of feature words with TF-IDF. The remaining steps are manual identification of topics and plotting positions by topic and country.
This figure and table show the small fields within the large field of security. After clustering the citation network, each group is drawn in one color. A line is an edge, which indicates a citation relation. Topics were identified manually. The query "security" also retrieves fields other than cybersecurity.
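The step of composing a citation network and keeping only its largest connected part can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function name `largest_connected_component`, the input format (pairs of citing and cited paper IDs), and the sample IDs are hypothetical, and citations are treated as undirected links when deciding which papers are connected. The clustering step that follows it in the methodology is not shown here.

```python
from collections import defaultdict, deque


def largest_connected_component(citations):
    """Return the largest set of papers linked by citations.

    citations: iterable of (citing_id, cited_id) pairs.
    Direction is ignored (an assumption of this sketch).
    """
    neighbors = defaultdict(set)
    for citing, cited in citations:
        neighbors[citing].add(cited)
        neighbors[cited].add(citing)

    seen = set()
    best = set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical citation pairs: p1-p2-p3 form one group, p4-p5 another.
citations = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
core = largest_connected_component(citations)
print(sorted(core))
```

Papers outside the largest component have no citation path to the main body of the field, so dropping them before clustering keeps the groups comparable.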
The largest group, colored orange, has the topics of IoT and cloud. The second largest group, colored light blue, is about food science. The slightly large distance between orange and light blue implies that the relation between the two topics is weak. However, given digitalization such as agricultural IT, food science should have an opportunity to collaborate with cybersecurity. The third largest group, colored yellow, is about energy security, which is deeply related to cybersecurity.
The clustering tool provides, for each topic, numbers by country and the average year. Using these data, each topic can be plotted on a portfolio management chart like this slide. Technologies in the star position are emerging and earning money, and technologies in the cash cow position are mature and earning money. Technologies on the right half of the chart are strengths. Results can be plotted like this slide. The differences between strengths and weaknesses are considered to indicate business opportunities. The United States is strong in food security, so the United States has an opportunity to export something related to food security. Russia is strong in security games, so Russia has an opportunity to export something related to security games. Japan is strong in energy security, so Japan will have an opportunity to export something related to energy security. This analysis was performed before I joined an OSS-related team, but I think the methodology can be applied to OSS. In summary, this methodology can extract technological strengths and weaknesses and find market candidates.
Lastly, in this presentation I introduced one analysis about trends of society and technology and two analyses about competitors. In the future, I will continue to develop other analyses and present them to the extent that avoids conflicts of interest. Thank you for listening.