Right. Good afternoon, everyone. Today my topic is security enforcement in OpenStack. First of all, I'd like to introduce myself. My name is Zhou Zhengsheng, the second name. I'm a software developer in AWS Cloud. The author's name is Zhang Xiaobing, and he is the director of the OpenStack Research and Development Center at Suning.com. The bad news is that Xiaobing is too busy to come here. The good news is that we have discussed security topics for a long time, and we also discussed the slides a lot, so I will give the presentation instead of him. If you have any questions, I will try my best to answer and explain. If it's beyond my knowledge, I will take down your email address and send the question to Xiaobing, and I will also call him on Skype and see if he can answer. And if I get you frustrated, don't throw eggs at me and don't be mad, because Xiaobing only pays me a cup of coffee and a small piece of cheesecake.
So here is my agenda. To begin with, I'd like to briefly introduce Suning.com and its use case for OpenStack. This will help us understand the design decisions presented later. Then I will cover some of the risks and possible security holes we can think of in an OpenStack cloud. Then I will talk about the services implemented on top of OpenStack at Suning to provide security enforcement. I think the antivirus part is pretty interesting.
So here is the overview of Suning.com. Suning.com is one of the largest e-business enterprises in China. Its business lines include retail, logistics, supply chain, real estate, investment and so on. By the end of 2012, Suning had stores and e-business platforms in over 700 cities across mainland China, Hong Kong and Japan. The number of staff is 180,000. Among them there are about 4,000 IT-related colleagues. They have four research centers, in Beijing, Shanghai, Nanjing and Silicon Valley. So this is the basic information.
About Suning Cloud: the major cloud business at Suning is a private cloud for its own use. The workloads are mostly the typical services you can find behind an e-business website, such as database clusters, 3Bus servers, web servers, Hadoop clusters and virtual desktops. Besides this cattle, the workloads are also mixed with pets, for example development and testing virtual machines for the research team. As of today, they have thousands of hosts and tens of thousands of virtual machines. The public cloud is not as big, but it also provides various services, for example cloud servers, virtual private cloud, shared and object storage, and monitoring.
Before May 2014, Suning Cloud was not based on OpenStack. It was based on another "blah-stack", whatever. Today Suning has already deployed a mixed KVM and Docker cloud with four regions and thousands of virtual machines. There are approximately 100 new virtual machines every day, and some of them are imported from the older cloud. There are various use cases, including virtual desktop infrastructure, managing CDN nodes hosted in third-party data centers, multi-layered applications, and clustered and standalone applications, as I mentioned. Suning uses host aggregates heavily, with a customized scheduler. The scheduler tries its best to spread the virtual machines of one application cluster across different availability zones and hosts. Suning Cloud offers auto-scaling for typical three-tier website applications using Heat, and it makes use of Sahara, Trove, KVM and Docker in production. I will not go into the details on this slide, because it's not very relevant to our topic. If you are interested in OpenStack operations at Suning, you can send an email to Xiaobing.
So, our real topic. There is a Chinese saying: the collapse of a solid river dam begins from an ant hole. It means that a hacker with patience can make use of any tiny security hole, and step by step the hacker finally breaches the whole system.
When we think of an OpenStack cloud, we can list at least these risks. To name a few: if the physical security policy is not strictly followed, the data center may be broken into physically, through social engineering. If the operations people forget to log out of the terminal, the host may be operated by anyone who can enter your data center. Both the hosts running OpenStack and the virtual machines need security patches. We also need to disable SSH password login and enable only key authentication. The ports are easily guessed, so we have to minimize the services and the ports.
Usually there are firewall rules, but that is not enough, because the hacker may enter your management network. For example, for now the VNC password is empty by default. So there are firewall rules, but if someone has access to the management network, he can easily guess your VNC port and connect to your VNC. I did a little investigation. I see in the normal configuration there is an option to set a default password for the VNC console. But this is not enough, because it does not change. So my proposal is that if anyone wants to get access to your VNC, it should be done like this. First of all, we disable VNC password login by default. Actually, libvirt, KVM and QEMU provide this option: you can set a password expiration time, so we can just set an already expired password. Then, when you want to log in using VNC, Nova generates a random password, sets it, and makes it valid for just five seconds. After you log in, the password expires, but your connection is not interrupted, and no one except you can connect to the VNC. This is not done in Nova yet, but I believe that developers in AWS Cloud will try to submit patches to fix that.
Also, MySQL root login should be restricted to localhost only. There are many more; I will not go through each one. So here are some security accidents from China.
You can get this information from public media. The Hubei Wuhan Health Center, the system I mean, used weak passwords, causing millions of users' private information to leak. Some developers upload their passwords and configuration to GitHub and leak the passwords. And as you know, many companies use a small and short default password for everything, so your cloud is open to everyone. What I want to say is that here I only name a few. The main point is that there are many trivial security risks everywhere, and we are human, and humans make mistakes. So the antidote is automation. We should minimize human interaction and operation. This is our goal.
Have a look. How many passwords are there in an OpenStack cloud? So many. You have to set them to random and strong passwords, and you have to update them frequently. If it is all done manually, by hand, it's not possible.
For the sake of completeness, here is a slide listing the problems to consider when we automate operation and deployment. You can find lots of these guidelines on the Internet. Just Google it. You have to patch the kernel. You have to do service screening, firewall, designated DNS. You have to adjust kernel parameters to restrict resources, and do password restriction, validation, and so on. A single mistake might leak everything.
And if you are running a large cloud like Suning.com, you face two more problems. First of all, cloud platforms involve many different departments. You have to train your developers, your operators, and your users. However, people come and go, join and quit, so not all the people are well trained. They may not be aware of security risks, and a single mistake leaks everything.
Everyone does automation today. As I said, I believe all of you are familiar with the architecture here. Is there anyone who doesn't do automation? No. Okay. So here are the things you might want to consider.
You have to put strong password restrictions into your automation process. You have to patch your images continuously. And so on.
Here is one useful practice, I think: Suning creates a pipeline for handling log entries. See this. In a stable cloud, they said, error and warning messages are unlikely to be caused by code bugs, but rather by the environment and the users. If you operate a cloud for some time, you can accumulate and categorize the problems and the solutions, and you can write scripts or add new features or functions to solve them automatically. Finally you can reach a point where, in most cases, the operator does not need to log into the host and run commands manually. This eliminates the chance to make a mistake.
And the next slide. As I said, Suning has some cloud applications to enhance security. Here is an example, an antivirus design on OpenStack. This actually runs in a prototype and experiment environment at Suning, and they are about to put it into production. Here are the antivirus issues for the virtual desktop use case: it's possible for the user to access web pages containing Trojan horses or spyware, and there is also a kind of virus that can penetrate the hypervisor to the host. So the service runs on the hypervisor host to provide a uniform antivirus service for the Nova instances.
Firstly, we install an agent in the image, which is loaded as a daemon in the guest operating system. The agent listens to file creation and modification events from the inotify interface. This is also feasible on Windows, because .NET provides similar capabilities. There is also a scanning coordinator on the host. This scanning coordinator queries the guest agent about the events through a serial channel. It usually filters the events to leave only changes to executable and shared object files.
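The filtering step the coordinator performs can be illustrated with a minimal Python sketch. It is not Suning's implementation: the `FileEvent` record, the `filter_events` function, the suffix list and the idea that the agent reports the first bytes of each file are all assumptions made for illustration, since the talk does not describe the event format.

```python
import os
from dataclasses import dataclass

# Hypothetical event record; the real format sent over the serial
# channel is not described in the talk.
@dataclass(frozen=True)
class FileEvent:
    path: str            # path of the file inside the guest
    kind: str            # "create" or "modify"
    header: bytes = b""  # first bytes of the file, if the agent sends them

ELF_MAGIC = b"\x7fELF"   # Linux executables and shared objects
PE_MAGIC = b"MZ"         # Windows executables and DLLs
BINARY_SUFFIXES = (".so", ".exe", ".dll")

def looks_executable(event):
    """Guess whether the event concerns an executable or shared object."""
    if event.header.startswith((ELF_MAGIC, PE_MAGIC)):
        return True
    name = os.path.basename(event.path).lower()
    # Also match versioned shared objects such as libfoo.so.1
    return name.endswith(BINARY_SUFFIXES) or ".so." in name

def filter_events(events):
    """Keep one event per path, and only for executable-like files."""
    seen = set()
    kept = []
    for ev in events:
        if ev.kind not in ("create", "modify"):
            continue
        if not looks_executable(ev) or ev.path in seen:
            continue
        seen.add(ev.path)
        kept.append(ev)
    return kept
```

Dropping duplicate paths means a file that changes several times between two queries is fetched and scanned only once.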
Then it gets the file via libguestfs according to those events, and feeds the file to a series of scanning engines. If any of the scanning engines reports positive, then we have caught a virus and we alert the admin. This works because Suning provides virtual desktop machines for the retail stores. Usually the software is pre-installed in the image, and the user should not install software by hand. The user just uses the virtual desktop to do some office work and sometimes goes to the internet, sending and receiving emails. The virtual machine density and the CPU overcommit ratio can be very high in this use case. By running the antivirus service on the hypervisor host, the license can be shared by all the virtual machines, so the expense is a lot lower. It also saves bandwidth when updating the virus definitions. The prototype is running right now at Suning and, as I said, it will soon be put into production. Do you have any questions regarding this mechanism? Sorry. Okay, we take questions in the last part, right? Sorry.
And the next slide. Suning is also developing a virtual private cloud for the public cloud. The service includes an intrusion detection and prevention system, firewall, VPN, NAT, and layer 4 and layer 7 load balancers. This is all done in virtual machines and not in Neutron. If the customer requests this service, he just files a service ticket with the admin. By default, you can see the VM here. This is a server, and by default it's attached to the default network. If the admin creates a new network segment, it's on the right-hand side. He will also create some service machines, and the IDS, IPS and load balancer run in those highly available service machines. As you can see, they are wired together. First is the IPS, the intrusion prevention system, and then it delivers the traffic to the firewall. These two routers are highly available using VRRP, and they run the OSPF protocol.
And then it delivers the traffic to the load balancers. Sometimes we will connect this private cloud to the network of a network service carrier. The carrier runs a parallel VLAN universe, so it's not related to the Neutron VLANs. So here we enable VLAN-in-VLAN mode, and the carrier network VLAN is wrapped inside a Neutron L2 segment. The operator needs to assign the carrier network VLAN to all the service VMs in the guest operating system. Hopefully this will be automated as an improvement.
So here is a list of open-source security tools for your consideration. Suning has done some experiments with Snort and Suricata. Using Snort, you can dump a mirror of the network traffic to another server, and that server analyzes the traffic according to some rules. It's capable of detecting some common layer 4 attacks, such as TCP SYN flood. Another tool, Suricata, is actually used in the IDS service on the previous slide. It's a high-performance, multi-threaded IDS and network monitoring engine. It makes use of Hadoop and Elasticsearch, all the big names you will know, so it's worth taking a look. Suning is also doing some research on OpenSOC. It's a network traffic analysis engine. It provides capabilities for full packet capture. And sorry, sorry, the Hadoop one is OpenSOC. It provides capabilities for full packet capture and indexing, storage, and advanced behavior analytics. It makes use of Hadoop and Elasticsearch. It's this one.
And the next slide. The last use case I want to share is security group isolation in Suning's platform as a service. Suning has developed a platform as a service based on OpenStack. The platform management service, the load balancer and the PaaS applications are all hosted in Nova instances. The API network is exposed to the platform management service, so that the service can create new application virtual machines on demand. This way, this way.
To enforce network isolation in a cheap and flexible manner, Suning uses security groups, because it's not really possible to tune all the switches, VLANs and physical network isolation if you don't have many people and much time. It's also error prone. So in this case, they use security groups. First of all, the traffic comes from the internet and reaches the load balancer. This traffic cannot reach the API network. It can only reach the platform management service and the tenant networks. Next, the security group also prevents the tenant networks from accessing the API network. And each tenant's VMs are only allowed to access the other VMs in the same tenant network. All of these restrictions are done using security groups. A new tenant network is created and deleted on demand, so the security group is also created and deleted on demand. It's flexible enough to change on the fly.
So these are all the cases I got from Xiaobing. You can ask questions, and I can call him. All right. So, the last question, could you please repeat it, using that microphone?
The question was just, you had agents running on the Windows and the Linux VMs.
Yes.
How do you guarantee that they're running?
Sorry, beg your pardon?
How do you make sure that they're running? The agents are actually running as processes and passing their events through?
So the question is about how we can prevent the agent from being killed by the user, right? In this use case, it's a virtual desktop, so the root privilege is not given to the user. The user only gets ordinary privileges, so he cannot kill that agent.
Thank you for the talk. I have two questions. The first question: why did you switch to OpenStack? What made you choose OpenStack as your cloud platform?
And the second question is about the platform as a service you developed: is that an OpenStack project, or is it a project you created by yourselves?
All right. I can answer the second question first. I also asked Xiaobing this question. The PaaS platform is written from scratch on top of OpenStack by Suning itself. It did not make use of any other open source projects. The first question, I think, is somewhat obvious, because today the API matters a lot: you have to integrate other services into your cloud. If you are using OpenStack, there are many choices. But if you are using another stack, we all know you are on your own. And you can see Citrix on the marketplace. That explains a lot.
How exactly are you using libguestfs? You were talking about inotify and the scanning. With libguestfs, are you doing a mount, or do you have a C program? How exactly are you accessing the files?
All right. It copies the file from the image out to the host file system. Then it passes the file to the scanning engine. So it's not mounted directly. Otherwise it might trigger other problems.
And that doesn't cause any performance problems? All that I/O. It seems like there are probably a lot of files that change.
The use case is strictly restricted: it's a virtual desktop use case, and they only trace executable files and shared object files. These files don't change much. If you download some application from the internet, it will scan it. So the workload is actually not very big for the scanning engine.
My question is related to scanning. Is there a full scan done periodically? Is it event-based or time-based?
It's time-based. It queries the file list. First of all, the guest agent stores in its memory all the files changed during, for example, this minute. Then the coordinator queries this information every minute, and then it feeds these events to the scanning engine.
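The time-based flow described in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `GuestAgentBuffer`, `poll_once`, `fetch_file`, the engine callables and `alert` are hypothetical names, and the real system talks over a serial channel and copies files out with libguestfs rather than calling Python functions.

```python
import threading

class GuestAgentBuffer:
    """Hypothetical in-guest buffer that remembers changed paths
    until the coordinator asks for them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._changed = []

    def record(self, path):
        # Called for each file event; repeated changes are stored once.
        with self._lock:
            if path not in self._changed:
                self._changed.append(path)

    def drain(self):
        # Return the current batch and start a new, empty one.
        with self._lock:
            batch, self._changed = self._changed, []
        return batch

def poll_once(agent, fetch_file, engines, alert):
    """One coordinator cycle: fetch each changed file and scan it.

    fetch_file(path) returns the file content as bytes (in the talk,
    the file is copied out of the guest image first).
    Each engine is a callable returning True when it detects a virus.
    """
    infected = []
    for path in agent.drain():
        data = fetch_file(path)
        # A single positive result from any engine raises an alert.
        if any(engine(data) for engine in engines):
            alert(path)
            infected.append(path)
    return infected
```

A coordinator would call `poll_once` on a timer, for example once a minute as in the answer above. Draining the buffer on every query keeps the agent's memory use bounded by the number of files changed in one interval.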
Do you have some performance data for this? I mean, you're running the scanning engine on the host. That thing is a resource sucker in terms of CPU, memory and file system access. And also, because you are copying files from the host, sorry, from the virtual machine to the host, what if all the hosts are doing some concurrent update to download some large files? Could that bring down the host?
Yes. In fact, I didn't get to sleep last night because I quarreled about this with Xiaobing. I think this is a huge problem. My suggestion is that you can use cgroups to isolate all the scanning engines. For example, you can boot a suite of scanning engines for each VM instance, and you can use cgroups to bind them together. So if you are downloading a big file or a big executable file, Linux CFQ scheduling will make you slow, but it will not affect other virtual machines, because it's CFQ, right? But Xiaobing does not agree with me. He says it's a highly restricted environment. It's in his retail stores, so the manager in his company knows everyone. If someone does something unexpected, they just call him. It's solved by the operations team.
One question regarding security. The hypervisor level should be at the highest security level, and yet you are transferring files from the guest domain into the hypervisor for scanning. Isn't that a big security risk? A simple vulnerability in your antivirus scanner might compromise your hypervisor.
All right, this is a nice question. I cannot answer it briefly. I will take it down and send it to Xiaobing. Could you leave me an email? Because Xiaobing is actually the expert in this area.
Okay, I'll do so. Thank you.
Hey, how is the IDS, Snort and Suricata, deployed? In particular, is it monitoring all traffic from every VM?
Okay, so it's actually mirrored traffic from the carrier network, from the network carrier's network.
And as you can see, it's deployed in this node, in this virtual machine, right? Cannot reach it? On this IPS virtual machine. So the traffic is mirrored and sent to another server to be analyzed.
Where are you getting the traffic from? I mean, the source traffic, are you using a virtual tap or...?
So the question is about the interface or the mechanism?
Yeah, I mean, are you using some sort of virtual SPAN port or tap as a service?
Oh, okay, okay. So it's using a SPAN port or something. They are using Open vSwitch. I will check with him. This detail I did not know. Could you also leave your email? All right, thank you. Thank you very much.