Welcome to the session. My name is Josh Lee, and I'm from Canberra, Australia. For those who still think Sydney is the capital of the nation: well, Canberra is the capital. I'm going to talk about the project I've been working on: how we use Drupal to create a site that serves the global biosecurity community. A little bit about myself. This is my Drupal ID, this is my Twitter, and I'm a Drupal developer at Technocrat in Australia. I've been in the Drupal community for almost six years, and in those years I've spent most of my time in the Australian government public sector. I'm a module maintainer, and this is my second DrupalCon. My first DrupalCon was in Sydney, and I have to say it's really hard for us in Australia to attend any DrupalCon in America or Europe, so I have to thank my company for getting me here. Let's get the story started. This is a cute Australian wombat. Given that I've been working closely with my client, the Australian Government Department of Agriculture, Animal Biosecurity Branch, I thought it would be cool to add some random photos of Australian animals to my slides, just in case you get sleepy. I'm pretty sure the client is happy about me doing this too. So the story starts with the policy makers from the Australian and New Zealand governments: the Department of Agriculture and New Zealand's Ministry for Primary Industries. What do they do? These are the descriptions I got from my clients. I don't really want to read them all, but in my own words, they make the policies and the decisions about which countries to import animal or plant products from. I put the last item in bold because it relates to my project: they provide independent scientific advice, social analysis, and science-based quarantine and policy advice. Right, that's too long. A better example:
Has anyone here been to Australia before? Okay, so you're Australian. All right, let me explain. If you arrive at Sydney Airport or Melbourne Airport, you may see a lovely puppy sniffing your luggage, because that dog is one of our colleagues, making sure you're not bringing any restricted animal or plant products into the country. They are basically the goalkeepers for the country. Another example: if they find out that a country has had a foot-and-mouth disease outbreak, they will decide not to import any related products from that country. And if they find that other countries are importing animal products from that country, they will stop imports from those countries too. So how do they actually get this information? That's the question. The Department of Agriculture realised they have to find the information in time, so they have a dedicated team of news detectors. What these people do daily is go to Google and search for diseases and biosecurity news around the world, put the URLs into a spreadsheet, go through the spreadsheet, visit all of the URLs, and if they find an article that's really important, pass it on to the branch that actually makes the policy. Obviously that's a really painful process. So the department thought: maybe we can ask a machine to do some of this work for us. The scientists designed a prototype with Google Blog Search and Python, a really early-stage news aggregator. It did the job: it passed the queries to the Google API and got everything returned. But it only focused on aquatic animal news, and there are different branches: plants, animals, aquatic animals. So the department said: this is a really awesome tool, and we need to share the information with other branches, even other departments, even other countries.
But it seemed like the prototype didn't really work that way, so we needed to think about how to build a system we could actually share with other people. This is where IBIS comes from. By the way, this is an ibis, the bird; if you go to Sydney, you'll find it's quite common on Sydney's roads. IBIS stands for International Biosecurity Intelligence System. First of all: why Drupal? That's a really popular question at any DrupalCon. I asked my client the same question, and they had two reasons. The first: Drupal is awesome, with lots of modules that match their requirements. The second, and the most important: they chose Drupal because of the really big community we're all living in. The department wanted to outsource the project to contractors or Drupal shops, and because everyone in the community works in Drupal and contributes their time and skills to it, we all know Drupal. So when they decide to transfer the project from one contractor or Drupal shop to another, it minimises the transition time: we already know what Drupal is, everyone does things the Drupal way, and it's easy to hand over. The site was developed in Drupal 6 and then migrated to Drupal 7. It's an online system that serves the biosecurity community by providing the following functions. The system allows people to create their own search queries. Let me explain that a little. We define search sources in the system, and we have two kinds: untrusted sources and trusted sources. An untrusted source is mainly a search result from Google. Google is an untrusted source because you don't really know whether the information is relevant or genuine. Then there are trusted sources.
The researchers may find that articles from a particular RSS feed are really useful: an RSS feed from an authority, from a government, from a research institution. So people can submit their own RSS feeds and create their own search queries, and the system will search for them. We also have to support multiple languages, because the main goal of the system is fresh news: what's happening in your country right now? For example, someone might say, "I saw some animals die in my yard," or, "I'm a farmer and I've noticed my plants are getting sick." If this happens in a non-English-speaking country, people won't tweet or talk about it in English at first; you'd have to wait a day or two until the government releases a report saying there's a problem. If the site can search in their native language, we can get the fresh news from them directly. The site also needs a workflow: because we have articles from untrusted sources, we need people to look at each article, decide whether it's relevant or not, and change the state of the article accordingly. All users have their own preferences on the site, and we send out a daily digest email to registered users according to those preferences. I was really glad that in Dries's keynote yesterday he talked about the future of the internet, pushing information to end users, and I thought: I'm kind of doing this in Drupal 7 already. So let me compare IBIS with Flipboard. Has anyone here used Flipboard or Feedly? It's a really useful tool that lets you subscribe to any kind of news you're interested in: you go in and set up your preferences for the topics you want to follow around the world.
And Flipboard or Feedly will push that information to you, but not all of it. They filter the information for you: "okay, I think this is the right information for you, and I'm going to filter everything else out." We are not doing that. That's the main difference between IBIS and Feedly or Flipboard. We push all of the information to you and filter nothing, because even if a piece of information is false, we still want to know about it. The researchers want to know: even if this is false information, why was it published? Where is it from? There's going to be a reason behind it. Maybe it was published by some global trader who wants to beat their competitors; that's still relevant to the biosecurity community. All right, how does it work? I'll try not to go too deep into the code, but this is a high-level view of how the system works. "CR" means web crawler. The web crawler is triggered by a cron job every day. What it does is grab all of the search queries in our system and send them to Google; Google does the search via the Google API and returns the results. The results include the title, the summary, and of course the URL. Then we pass the URL to third-party web services: GeoNames and the Alchemy API. I'll explain what those are later. Those web services analyse the URL for us, work out what the article is about, and we grab everything we want to create a node in Drupal. Then we save the node into our system. Simple. This is what's inside the crawler. You can see we have two content types: search source and search query. Users can create search source nodes and search query nodes. And we have two queues; they're all Drupal queues, by the way.
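The crawl pipeline just described could be sketched roughly like this. This is a minimal illustration, not the actual IBIS code: every function name and every piece of data here is invented, and the real system uses Drupal queues and node saves rather than plain Python functions.

```python
# A sketch of the daily crawl: cron takes each stored search query, runs it
# against a search API, enriches every result URL via third-party services
# (GeoNames, Alchemy API), and saves the outcome as an article node.

saved_nodes = []

def search_google(query):
    # Stand-in for the Google API call; returns title/summary/URL hits.
    return [{"title": "Fish deaths reported", "summary": "...",
             "url": "http://news.example.com/fish-deaths"}]

def enrich(url):
    # Stand-in for the GeoNames and Alchemy API lookups: geolocation,
    # country, and sentiment extracted from the URL.
    return {"lat": -35.28, "lon": 149.13, "country": "Australia",
            "sentiment": "negative"}

def save_node(article):
    # In Drupal this would be a node save; here we just collect the result.
    saved_nodes.append(article)

def run_crawler(search_queries):
    """Triggered once a day by cron."""
    for query in search_queries:
        for hit in search_google(query):
            save_node({**hit, **enrich(hit["url"]), "query": query})

run_crawler(['"wild bird" AROUND(3) dead'])
```

In the real system each step is asynchronous: the queries and results sit in separate Drupal queues, which the next section describes.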
The search queue is triggered by the cron job, which pulls all of the search queries submitted by users into the queue. If you know how Drupal queues work, the cron job triggers the queue to process its items: each item is pushed to the Google API, out onto the internet to find the articles, and the URLs and all the related information are released into the result queue. Currently, to make this work in parallel, we use the search source as the result queue: each search source is its own queue, so we're actually processing multiple queues at the same time. And this is our query builder. I'm not sure if you can see it, so let me explain. At the top we have the search entity type: you choose whether you want to search for a host or for a pest and disease name. If you choose "host", the second drop-down list, search entity, lets you pick a host. Currently it shows "wild bird", but you can choose fish or any kind of animal or plant name. If you choose "pest and disease name" in the first drop-down, the search entity changes to the pest and disease names stored in our system. The third drop-down is the language; of course you want to search in different languages, and this does it for you. The most important part is the query expression builder. For those who don't know the format of a Google query, this builder creates the query for you. The first column is the host name you already chose. The second column is the qualifier: a word like "dead", "death", "disease", or "illness", so you can search for, say, fish dying in whichever country. In the third column you have blocked terms, which filter a lot of results out for you.
For example, if you want to know about apples, you probably don't want to know about Samsung, because an article containing both "Samsung" and "Apple" would be about technology and totally unrelated. So you put all the words you want to avoid into the blocked terms. Blocked sites work similarly: if I'm searching for anything around the world, I don't want articles from IBIS itself coming back into my result queue, and if you know some other website that's totally unrelated, you can block it too. Under the query expression builder we have the AROUND option. If you know Google queries, the AROUND operator is really helpful. For example, I want to search for "wild bird" and "dead", and I only want results with at most three words between these two keywords, so I put AROUND(3). If I put AROUND(10), Google will return results with up to ten words between the two keywords. The generated query appears at the bottom, and of course it's editable: if you know Google queries, you can write your own, and it will be saved into the system. Here we go. Time to explain what GeoNames and the Alchemy API are. We want to know in which country, or where, a disease happened. GeoNames analyses the article, works out where it's happening from the keywords, and returns geolocation information: latitude and longitude. So we can save the node and display it in a map view. The Alchemy API is a really awesome tool: it returns all of the metadata, including title, author, created time, and sentiment, so you have all the related information you need to build a new node. This is the API page from the Alchemy API. By the way, I need to thank Alchemy API for sponsoring us.
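The query expression builder could be reconstructed roughly as follows. This is a hypothetical sketch, not the real builder: the function name and argument layout are invented, and only the Google operators themselves (`AROUND(n)`, `-term`, `-site:`) come from the talk.

```python
def build_google_query(host, qualifiers, blocked_terms=(), blocked_sites=(),
                       around=None):
    """Combine a host, qualifier words, blocked terms/sites, and Google's
    AROUND(n) proximity operator into a single query string."""
    # AROUND(n): allow at most n words between the host and each qualifier.
    op = f" AROUND({around}) " if around is not None else " "
    core = " OR ".join(f'"{host}"{op}{q}' for q in qualifiers)
    parts = [core]
    parts += [f"-{t}" for t in blocked_terms]        # blocked terms
    parts += [f"-site:{s}" for s in blocked_sites]   # blocked sites
    return " ".join(parts)

q = build_google_query("wild bird", ["dead", "disease"],
                       blocked_terms=["hunting"],
                       blocked_sites=["ibis.example.gov.au"],
                       around=3)
# q == '"wild bird" AROUND(3) dead OR "wild bird" AROUND(3) disease '
#      '-hunting -site:ibis.example.gov.au'
```

As in the real builder, the generated string stays editable: a user who knows Google syntax can overwrite it before it is saved.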
As you can tell from here, it can analyse sentiment, so you know whether an article is negative or positive, and do text extraction: if you want to know the content of a URL, it will return it for you. Also image information: if there's an image at the URL, it will tell you what that image is about. And the authors. I also received an email two days ago from Alchemy saying they've just released a new service: they now support search functions. So we're planning to add Alchemy search to our site. You just call the web service and ask Alchemy to search everything for you; Alchemy can go into whichever online forum and find all of the negative and positive information for you. So besides Google, we now have Alchemy. Once we have the articles in our system, it's time to introduce our evaluators. By the way, this is a Tasmanian devil. This lovely animal only lives on the Australian island of Tasmania, and it's endangered. I've found that the face of the Tasmanian devil looks quite similar to our evaluators'; let me explain why they're struggling. The first reason is the diagram on the right. I know you can't really read the text, so let me explain: this is the article workflow in IBIS. If you know the Workflow module in Drupal, you'll know what I'm talking about. When we have a new article, the machine detects whether it's from a trusted or untrusted source: is it from Google, or from an RSS feed? If it's from Google, we set the state to "raw". If it's from an RSS feed, we're probably going to keep it in our system. The evaluators then go in and look at all of the raw articles. They have professional backgrounds in biosecurity, so they know whether an article is relevant or not.
If it's not relevant, if it's about an Apple and Samsung fight, they trash it, and a cron job deletes all the trashed articles every day. If it's a kept article, they evaluate it: if it's really important for our stakeholders, they promote it, and if it's extremely important, they send an alert straight away. Our daily digest email goes out every day with all of the promoted articles and alerts. And this is the interface our evaluators have to face every day. You can see it's a page designed with a really big map in the middle and a really long set of filters on the left-hand side. The filters give you all the options: you can search by whichever taxonomy term you want, or by which channel an article came from, for example all the articles from trusted sources, or all the articles from Google. Then it gives you a really long article list at the bottom. I'm not going to display the whole page, because it's a really long list with a pager. The evaluators have to go through that long list and click into each article to find out whether it's relevant. That's why they're struggling. And this is what we produce every day: the daily digest. You can see we have three sections. Section one, plant health; section two, aquatic animal health; section three, terrestrial animal health. In each row we have the link to our site, the time the article was produced, and which country the article is from. Of course we also have the original link, taking you to the original site where we got it from. Currently our site has almost 4,000 users around the world, and they all rely on this daily digest. All right. Time to talk about the problems.
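The workflow just described could be sketched as a small state machine. The state names and transitions here are reconstructed from the talk, not taken from the real Workflow module configuration, so treat them as illustrative.

```python
# Google results start "raw"; trusted RSS items start "kept"; evaluators
# then trash, keep, promote, or raise an alert on each article.

TRANSITIONS = {
    "raw":  {"trash", "kept"},
    "kept": {"trash", "promoted", "alert"},
}

def initial_state(from_trusted_source):
    # Trusted RSS feeds skip the raw stage; untrusted Google hits do not.
    return "kept" if from_trusted_source else "raw"

def transition(state, new_state):
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot move article from {state!r} to {new_state!r}")
    return new_state

def daily_digest(articles):
    # The daily digest email carries promoted articles and alerts only;
    # a separate cron job purges everything in "trash".
    return [a for a in articles if a["state"] in ("promoted", "alert")]
```

The point of the extra "kept" stage is that untrusted material always needs a human decision before it can reach the digest.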
For those who haven't seen this video on YouTube: kangaroo fights really do happen in our neighbourhood. The biggest problem in the system, of course, is performance. Our site had about five or six hours of sleeping time every day. Why? I think it's because of this design; let me explain. This is our current architecture. You can see we have Drupal, and we have everything in Drupal: the forum, the email, the actual CMS, the third-party enhancements, and the web crawler. This is both good and bad. The web crawler can use a lot of features from Drupal; for example, we're using the Drupal queue class, which is really handy. For your information, we're getting 10,000 search results every day from the web crawler, and that doesn't include the articles from the RSS feeds. You have to allow time for the server to process all of those results, and we're struggling. Given that 60% of our users are from Australia and New Zealand, we tried to push the processing into their sleeping time, but that's really bad for the Europeans. We're getting more users from different countries: new users from the Canadian government, the Malaysian government, from Indonesia, from Europe. We need to make the site accessible to everyone in every time zone. And the data structure also has a problem, which is causing the performance issue. Let me explain what the structure is like. We have an article content type, which is a search result. To capture the related information about a search result, we have a field collection called "places". In that field collection we have geo information: a longitude field, a latitude field, and the country names. For those of you who don't know the Field Collection module: it's a module that lets you group different fields together and attach them to a content type.
So when you want to group different fields into one content type, it does the job for you. It's a great module, but you have to use it wisely, and I'm not using it wisely at the moment. A field collection is an entity in Drupal: if you want to know everything it contains, you have to load the entity first, and you get back taxonomy term IDs and related objects; then you have to load each term again to get its content. So it's not really handy. Then we have another field collection called "delivery method", and inside it a content type called "search", which includes all the related information about when the search was launched and how long it took. In the search content type we have another two entity references, linking to the search query node and the search source node, plus more taxonomy. You can see this is really complicated. If you want to load one article, how many tables does the backend have to join? I haven't counted yet, but it's a lot. And think about running the search function: how many tables will it join, and how long will it take? So this is our problem. Here's a bit of a review from New Relic. What it told us is that the search query took about 30% of the system time, the "places" field collection on articles took almost 30%, and node loading took 25%. Recently I made a change to remove the "delivery method" field collection, and that part went down to under 5%, which tells me the same story: something is wrong with that structure. All right, we have to fix this. And this is a really common bird, a cockatoo; it visits my front yard every day. Right. This is a potential solution for IBIS, or I think for any kind of news aggregator: you can't really run the crawler inside your web server. So what do we do? We remove the crawler from the web server.
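The cost of the nested structure can be illustrated with a toy contrast between the relational layout described above and the flat document proposed later in the talk. All the "tables" and field names here are invented stand-ins; in Drupal each lookup below would be a separate entity load backed by table joins.

```python
# Relational layout: article -> places field collection -> taxonomy term,
# each requiring its own lookup.
field_collections = {7: {"lat": -35.28, "lon": 149.13, "country_tid": 12}}
taxonomy_terms = {12: "Australia"}
articles_relational = {1: {"title": "Fish deaths", "places_fc": 7}}

def load_article_nested(nid):
    # Three chained lookups, mirroring three entity loads in Drupal.
    article = articles_relational[nid]
    place = field_collections[article["places_fc"]]
    country = taxonomy_terms[place["country_tid"]]
    return {"title": article["title"], "lat": place["lat"],
            "lon": place["lon"], "country": country}

# Proposed flat document: everything denormalised into one record, so a
# single lookup returns the same data.
articles_flat = {1: {"title": "Fish deaths", "lat": -35.28, "lon": 149.13,
                     "country": "Australia"}}

assert load_article_nested(1) == articles_flat[1]
```

Multiplied by 10,000 new results a day plus search queries that join the same tables, the chained loads are where the New Relic percentages come from.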
We create a dedicated server just to run the crawler, to protect the web server. In this diagram, we keep all of the Drupal-related functions, the things Drupal is good at. Organic Groups? Cool. The emails? Cool. The CMS? Cool. We keep all that in Drupal, but we put the crawler outside. And we think we need to use NoSQL in the backend to store the documents in a flat way. Why is that good? Because even if your crawler is running 24 hours a day, you don't need to worry about your website: the website is still there, people can still use it, and the crawler is busy doing its job. The hard part of this diagram is that we need to create an interface between Drupal and the backend: when someone requests an article, we need to know which article to fetch from the backend database, so Drupal can query the backend and find the article for them. For your information, when I say "NoSQL database" here, I'm thinking Elasticsearch, but it's not confirmed. If we go with Elasticsearch, we get reporting and analysis of database behaviour and user behaviour out of the box, straight away. If you don't know Elasticsearch: it's not just a search engine, it can also serve as a NoSQL document store, similar to MongoDB. And I know chx has been working on Drupal 8 integration with MongoDB; there's an online video about that if you're interested. It's really exciting: Drupal 8 can be installed completely on MongoDB. Another problem we have now is the user experience. There's a common understanding with our clients that all of our users are passive users. We have a lot of users from different countries, and what they want is just our daily digest email. That's not really our goal. Our goal is for people to go into our system and evaluate the articles themselves.
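The proposed split could be sketched like this. The classes and method names are purely illustrative: the backend stub stands in for whatever document store is chosen (Elasticsearch or MongoDB), and the frontend stub stands in for the Drupal site, which keeps no article data of its own.

```python
# The crawler writes into the document store around the clock; the web
# frontend only reads from it on request, so neither blocks the other.

class DocumentBackend:
    """Stand-in for the NoSQL document store."""
    def __init__(self):
        self._docs = {}

    def index(self, doc_id, doc):
        # Called by the crawler, possibly 24 hours a day.
        self._docs[doc_id] = doc

    def get(self, doc_id):
        # Called by the frontend when a user requests an article.
        return self._docs.get(doc_id)

class DrupalFrontend:
    """Stand-in for the Drupal site: groups, email, CMS stay here."""
    def __init__(self, backend):
        self.backend = backend

    def view_article(self, doc_id):
        doc = self.backend.get(doc_id)
        return "Not found" if doc is None else doc["title"]

backend = DocumentBackend()
backend.index("a1", {"title": "Foot-and-mouth outbreak reported"})
site = DrupalFrontend(backend)
```

The interface between the two halves, here just `get(doc_id)`, is exactly the "hard part" mentioned above: it is the only contract the Drupal side and the crawler side have to agree on.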
So everyone should be an article evaluator, but they're not at the moment. Why? Because of this interface. When you log into the site, you see this, and you don't really feel excited about it: "What am I doing here? I'm not going to get involved with this long list. I don't even know what it means." So how do we turn all of these users into active users? This is a proposed new design for our site. When people log in, what they should see is what's been happening recently around the world. We think people should see what's happening in plant health, for example how many articles have been promoted and how many new issues have been created, so they can go in and look at the news, and the same for terrestrial animal and aquatic animal health. The idea is that we shouldn't limit our articles to these three categories: we can create different groups. For example, if you're only interested in the lovely blue-tongue lizard, you can have your own group, create your search queries about blue-tongues, and promote articles about blue-tongues into your group, and your homepage will display what's happening in it. Of course, you also want to know the hot topics, what's trending around the world right now. So we plan to search Twitter and Facebook to find out what the really popular topics are at the moment, what people are talking about. Because people won't say something scientific on their Twitter account about what's happening; they'll say, "here's what I saw today," and that's important for us. I've put these slides side by side so you can compare. Now you love it. All right, what's the future for IBIS? We've contacted different potential users of IBIS to get their ideas, and they said: "You know what? IBIS is a really, really awesome tool, and we see the value in it. But we don't want to know about biosecurity."
"We want to know about food security. Or we want to know about fashion. We want to use the tool to search in our own way." Of course you do. So our plan is to convert IBIS into a generic platform, so anyone can use the ideas behind it. We're working on that plan now. The next step for IBIS is to create a distribution, or Drupal modules, I'm not sure which yet. If we go with a NoSQL backend, I don't see how we can ship IBIS as a distribution: you can't really just install a distribution when you also have to configure the backend. But we definitely plan to open source our Drupal modules, so there will be a set of modules on drupal.org for IBIS, for news aggregation, and you can contribute your time to them and make them better for everyone. So, all the animals of Australia. I think that's the end of the slides. Any questions? Thank you for not sleeping. You have a question? "Yeah, I do. Looking at all the different nodes, or articles, that would be created: what are your storage considerations as that grows?" Yes. Well, we're getting more users every day, which means we have more search queries in the system every day. Currently we return the top five articles from each Google search, and if people want to know more about a topic, we'd have to return, for example, the top ten or top twenty. That would be a problem for us, because we're already running out of time every day. "Do you mean the future storage you proposed?" "Yes. You're going to have to search through a lot of articles as time goes by, because it's going to keep growing." So the first idea is noise control.
Noise control is the most important topic for any kind of news aggregator: we try to keep the noise ratio down, which means doing some work at the Google level and some work at the Drupal level to make sure an article is relevant before we store it in our database. Second, we have to work on the scalability of our servers. I understand that if we convert from a relational database to NoSQL, the size will be much bigger than the original database. But that's not really our concern, because our goal is to collect as much as possible, and we can scale the backend database quite large. The idea is that we have Drupal as a frontend, and any Drupal distribution can connect to the one backend. That one backend can be really big and can share all kinds of information with different users: a massive database with all kinds of information, but with a really good tagging system on it. Cool, thanks. And if you want to know more about news aggregators, or how to use Drupal to build one, drop me a message on Twitter or via drupal.org; I'm happy to work with you. All right, thank you. Another question? "I was just going to ask you..." All right, so the question is about how we avoid duplicate articles. We have two kinds of filters working when we process the search results. One: if the URL is already in the queue, we don't push it into the result queue, so we don't put the same article into our system twice. The other: the Alchemy API returns all of the related metadata, so if we find the title is the same, we don't put it in either.
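The two duplicate filters from that answer could be sketched as one pass over the result set. This is a hypothetical helper, not the actual queue code; in IBIS the title would come from the Alchemy API metadata rather than from the result dict directly.

```python
def dedupe(results, seen_urls=None, seen_titles=None):
    """Drop a result if its URL is already queued, then drop it if its
    title has already been seen."""
    seen_urls = set(seen_urls or ())
    seen_titles = set(seen_titles or ())
    unique = []
    for r in results:
        if r["url"] in seen_urls or r["title"] in seen_titles:
            continue
        seen_urls.add(r["url"])
        seen_titles.add(r["title"])
        unique.append(r)
    return unique

hits = [
    {"url": "http://a.example/1", "title": "Outbreak in region X"},
    {"url": "http://a.example/1", "title": "Outbreak in region X"},   # same URL
    {"url": "http://b.example/2", "title": "Outbreak in region X"},   # same title
    {"url": "http://c.example/3", "title": "Farmers respond to outbreak"},
]
# dedupe(hits) keeps only the first and last entries
```

As the next answer notes, same-topic articles from different sources are deliberately not treated as duplicates; only identical URLs and identical titles are filtered, and stricter custom filters can be layered on per stakeholder.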
But the thing is, even when two articles are about the same topic, the same outbreak in whichever country, our researchers want to see it from different perspectives: what the farmers say about it, what the government says, what the traders say. So it's not really a duplicate for us; but for any other stakeholder who considers it a duplicate, we can always add a custom filter. Does that answer your question? All right, cool. Thank you.