My name is Tony Irwin, and I'm going to talk to you today about some work my team has done with the Bluemix UI, which is the front end to IBM's cloud offering, with Cloud Foundry as the PaaS layer. I'll talk about the process we went through in migrating a monolithic application to microservices, specifically Node.js microservices on Cloud Foundry. In this talk we'll cover the origins of the Bluemix UI, the demons of the monolith (our original monolithic architecture), how microservices helped slay those demons, and then, because sometimes you trade one set of problems for another, some new demons we have to slay as well. So as I alluded to, the Bluemix UI serves as the front end to Bluemix. It lets users create, view, and manage Cloud Foundry resources, but not just Cloud Foundry; we also have containers and virtual servers and other resource types coming along. When we first started Bluemix, as some of you may know, it was pretty Cloud Foundry focused. Now Cloud Foundry is just one part of our Bluemix offering. The UI itself runs on top of the Bluemix PaaS layer, which is Cloud Foundry. As I alluded to, it started as a monolithic app. It was a single-page application, so all the HTML, CSS, and JavaScript had to be loaded at once, basically in one HTML page, and this was all served by a single Java app, which was also deployed to Cloud Foundry. This was a common stack in IBM not all that long ago: everyone was using the Dojo JavaScript framework for the UI piece and serving it with Java. That was very popular in IBM in recent years, but we've started moving to other stacks. And this is a screen that shows that the Bluemix UI is pretty large. There are five or six different pages there: dashboard, catalog, resource details like managing Cloud Foundry apps, users, billing, and there's a whole lot more. So it's a large application.
And this is a diagram of the monolithic architecture that we started with. The top, where it says Bluemix UI or client, is basically the web browser, and the orange boxes there roughly correspond to the pages I showed on the previous slide, but that's just for show; really, all the JavaScript logic is out in the browser in our single-page app. And then on the back end, the Cloud Foundry side, we've got a single UI server, which is Java. It was bound to a DB2 service for some of the data persistence we had to do. And that basically passed through to a whole lot of back-end APIs, including Cloud Foundry's Cloud Controller, UAA, et cetera. So the monolith, or this architecture, had some problems. We had performance issues: Dojo is a rather large library, and we also wrote a lot of JavaScript to do all the logic on the client, and heavyweight JavaScript loaded into the browser can be slow. Also, in a single-page app, and a number of you have probably worked on them, you're relying totally on Ajax calls back from the client. That can create bottlenecks, because nothing is in your initial payload except the code that loads your front end; there's nothing about the data that you actually want to see. Another problem was that it's very difficult to integrate code from other teams. As we'll talk more about in a bit, there are probably 15 other teams, and the list is growing, that want to plug in to our UI so that we all look like one big product. With the stack we had, that was really not practical: you'd have to tell people to write Dojo, and there were fewer people and fewer groups that wanted to write Dojo, and it's just not that easy to integrate these sorts of plug-ins into a single-page app. And you have to push the whole product just for small changes.
So if I fix a null pointer exception, I have to redeploy the whole product, as opposed to just being able to deploy a part of it. Poor SEO, search engine optimization: because, as I alluded to, there's not a whole lot of content in the one HTML page that was served, there wasn't much that was crawlable by Google and other search engines. And new hires wanted nothing to do with Dojo. As we brought on some new front-end developers, they were like, why are you guys using Dojo? If you've been in UI development, you know every six months there's a new cool toy. Dojo had a lot of good things about it, but it was also viewed as legacy, old-IBM kind of stuff. So now we'll get a bit into the microservices architecture that we migrated to. I call this slide the weapons of microservices that we used to slay those demons from the previous page. The approach helped us migrate to a more modern, lighter-weight stack based on Node.js and other tools without starting over. This was a live, running product that people were using, and we couldn't just suddenly throw away everything we had during this rearchitecture, but with microservices we were able to slowly break pieces off the monolith while still leaving its core in place, and I'll show a diagram of that in a bit. The goal was to break the monolith into smaller services to improve performance, because these services would be optimized for speed and page size. This architecture, we believed, would also increase developer productivity. You can push smaller changes. There's less chance of breaking the entire product. Loosely coupled services can deploy on their own schedule. Teams can use the stack of their choice as they plug in. And you don't have to wait on others: I lead what we call the core team, and I think we've got about 25 microservices.
We own some of the core components, like the catalog and dashboard, but we have a lot of other teams that want to do custom things. My team doesn't have the resources to provide all of that, and those teams don't want to wait on us either, so this helps solve that problem. The way we started serving pages led to improved SEO, because we started using a bit more server-side templating, so more of the user's data was in the initial payload and there was more crawlable content. And, as I'll show in a diagram as well, when teams or microservices plug in, you want them all to appear to be part of the same product, so we were able to promote some UI consistency with microservice composition. This slide shows our general microservice pattern, focusing on UI microservices; we also have microservices that just serve APIs. All of our microservices are written in Node.js. They serve lightweight HTML, CSS, and JavaScript, going for the simplest approach that works. If we could use vanilla JavaScript for a particular page, we did. We do have some teams, including my team, using other frameworks like React where it makes sense for some of our richer pages, but that's still a far smaller footprint than we had with Dojo. We use server-side templating, Dust.js in particular, to include as much data as possible in the initial payload. Now, of course, you don't want to spend a bunch of time on the server collecting data, but there are some things we have cached, like the username and picture and a lot of the stuff in the header, that we can include. So when the page comes up, the header renders right away. And that goes to the next point: if you look at this diagram, there's a common header microservice that was added.
All of our UI microservices call that to get the HTML for the stuff at the top of the page, which I'll show a screenshot of shortly. We introduced a shared session store. That wasn't required with our Java app, because you just have an in-memory session, but when you have distributed apps and want to share things like user tokens, you need something like Redis in the mix. And then, of course, these UI microservices can call any of the other back-end APIs or other API microservices. The one thing I always kind of gloss over in this picture is the proxy there at the top of the Cloud Foundry box. The proxy is really what holds the whole thing together. Instead of having a route, say console.bluemix.net, that just goes to a Java app, we now have that route go to the proxy, and based on the path of the URL, it routes to the right microservice. This next slide shows what I was talking about with page composition, and I mentioned the common header that microservices call. Basically, the green box is any microservice, let's say the catalog. On the server side, it invokes the common API and gets the HTML for what we call the common header, and there's a picture of it here. That's combined with the server-side templating into one payload and sent to the browser, and then all pages that use this approach look like this. There are other things in common too, like some shared style sheets, and that, plus the header, really enables the product to look consistent. And here's a picture of the first stage of our migration. As of about December 2015, we formally introduced the proxy layer that I alluded to, and we added three microservices alongside our Java monolith. One was the common header. We had a home microservice just for the home page. And solutions, which was some marketing material that we no longer have in the core product.
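The path-based routing the proxy does can be sketched as a simple prefix lookup. This is a minimal illustration, not our actual proxy code, and the route table and upstream hostnames are made up:

```javascript
// Sketch of path-prefix routing, the way a front-door proxy picks a
// microservice for an incoming request. Entries are illustrative only.
const routes = [
  { prefix: '/catalog', upstream: 'https://catalog-ui.example.internal' },
  { prefix: '/dashboard', upstream: 'https://dashboard-ui.example.internal' },
  { prefix: '/', upstream: 'https://home-ui.example.internal' }, // fallback
];

// Return the upstream for a request path: the longest matching prefix wins,
// so '/catalog/services' goes to the catalog UI, not the fallback.
function pickUpstream(path) {
  const match = routes
    .filter((r) => path.startsWith(r.prefix))
    .sort((a, b) => b.prefix.length - a.prefix.length)[0];
  return match ? match.upstream : null;
}
```

In a real deployment this lookup would sit in front of an HTTP proxy library (something like node-http-proxy) that streams the request through to the chosen upstream, so one route, console.bluemix.net, can fan out to many microservices.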
So we started by trying to pick the pieces we thought would be simplest to migrate, as a proof of concept. We also introduced two additional Cloud Foundry services. I alluded to the shared session; we actually used Data Cache, an IBM product, for that back in this time frame, and we're using Redis now. And a NoSQL DB for some data storage. These microservices are just Cloud Foundry apps bound to those services. Phase two: basically I'm just showing more boxes moved from the top, browser side of the product down into Cloud Foundry. At this point, about a year after the previous slide, we were probably about 90% complete. We still had some account and user management pieces that were not yet migrated to the new architecture. And then this is our end goal, which we're more or less at today, except we do still have our Java server. It's working fine for us, but I think we still want to port it to Node before we're done. And we still have a small amount of legacy Dojo code. We had to balance new function against rearchitecture, and over about two years we were able to do a pretty complete migration of the original product while still adding new function. This slide I alluded to earlier: we have a bunch of other teams that want to plug in. What I've been showing are really the core microservices, the pieces of the architecture my team owns. But we have things like Watson, Internet of Things, our new Kubernetes service, and OpenWhisk that want to be part of the console, but not necessarily deployed with all the core microservices or owned by my core team. This just shows that we have proxy rules, for example for /watson, that will route to a server owned by the Watson team that could be deployed anywhere. We proxy through, they can use our common header, and it all looks like part of the same product.
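The page composition with the common header, described a couple of slides back, can be sketched roughly like this. The markup, the function name, and the shared stylesheet path are all hypothetical, just to illustrate the pattern:

```javascript
// Sketch of server-side page composition: a UI microservice fetches the
// common header's HTML from the common-header service, then splices it
// into its own server-rendered page before responding. All names and
// markup here are illustrative, not the real Bluemix console.
function composePage({ headerHtml, title, bodyHtml }) {
  return [
    '<!DOCTYPE html>',
    `<html><head><title>${title}</title>`,
    // A shared stylesheet keeps plugged-in teams visually consistent.
    '<link rel="stylesheet" href="/common/console.css"></head>',
    '<body>',
    headerHtml, // markup returned by the common-header microservice
    bodyHtml,   // this microservice's own server-templated content
    '</body></html>',
  ].join('\n');
}
```

In practice the header HTML would be fetched over HTTP from the common-header service (and cached), and the body would come out of the Dust.js templating step, so the browser receives one payload with the header already in place.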
I was going to do a demo, but I switched to a new PC, so I don't think I'm going to try to fire up the browser and log in. What I really wanted to show is that as you click on different pieces of the UI, you seamlessly go to stuff owned by other teams. You just see the path in the URL change, the proxy routes across appropriately, and it all blends together. So, there are a number of new challenges. As I mentioned earlier, sometimes you trade one set of problems for another. I think we're glad we did it, but there were other things we had to worry about now. There are more moving parts and more complexity, which makes your build pipeline, test automation, and all those things all the more important. I think we probably underestimated that when we started down this path. Collecting federated status and monitoring the health of the system: something goes wrong at 2 AM, the console's not rendering; which of our microservices is the problem? Or sometimes it's outside of our control, like the Cloud Foundry environment we're deployed to has problems. So we needed a way to monitor those aspects of health. We're all responsible for reliability and HA, but sometimes, just to get past a particular problem, you need to find the right team to actually look at the issue. So we quickly developed some monitoring tools to help point the right team at the problem. Sometimes it was my team too, but not always. Then there's the granularity of microservices and memory allocation. The previous talk mentioned that 512 megabytes is good for a Node.js app. When we had the monolith, we had three or four instances at two gigabytes apiece for Java, so that's about six gigabytes total. If you have about 27 microservices with three, four, or five instances apiece, and I think at one point we had about 95 total instances.
They're all at 512 megabytes, or even a bit higher in some cases, and you end up with a system that's taking 55 or 60 gigabytes to deliver a lot of the same function. That's not as big of a deal for our public offerings, but we also deploy into some customers' local and dedicated environments, and they're not necessarily happy about paying for a lot of memory just to run the console. So that's a consideration. We also had to solve some issues with seamless navigation between our new microservice UIs and our Dojo monolith UI, trying to make things look as close to the same as we could. Blue-green deployments were another question. It's one thing to do a blue-green deployment of a single app, but if you have 25 apps, how do you do that? We ended up having an on-deck version and a production version of the apps, so we had two versions of all the microservices deployed, and we would do a blue-green swap on the proxy between those. So you're suddenly routing to a whole new set of microservices. Something we're still working on in our pipeline is being more granular than that and doing blue-green swaps at the individual microservice level. Another big problem is promoting uniformity and consistency while still giving teams freedom. We have a large set of UI designers in IBM, with different designers attached to the different teams, and they often have differing views on how the UI should look and behave. You want to let people's imaginations go and come up with innovative UIs, but then you run the risk of having half the UI looking and behaving one way and other pieces a totally different way. That's still an ongoing challenge for us.
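The on-deck/production blue-green swap described above can be sketched as an atomic flip of which upstream set the proxy routes to. This is a simplified illustration under assumed names (`production`, `onDeck`, the route table), not our actual pipeline:

```javascript
// Sketch of a blue-green swap done at the proxy: every route has a
// production upstream and an on-deck upstream, and swap() atomically
// flips which set receives live traffic. Names are illustrative.
function makeRouter(routeTable) {
  let live = 'production'; // which upstream set currently serves traffic
  return {
    upstreamFor(prefix) {
      const route = routeTable[prefix];
      return route ? route[live] : null;
    },
    // Flip the whole system at once: production <-> on-deck.
    swap() {
      live = live === 'production' ? 'onDeck' : 'production';
      return live;
    },
  };
}
```

Note that `swap()` here flips every route at once, which mirrors the coarse-grained swap described in the talk; the finer-grained goal mentioned above would amount to storing the live flag per route instead of globally.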
Another point I threw in here, and I've got another slide on it with a little more detail: we did this work to be microservice-based within one deployment of Cloud Foundry, with a lot of resiliency and HA work as part of that. But then how do you make that more globally available? I'll get to that in a second. First I want to drill down into the importance of monitoring. I alluded to the fact that we underestimated how important monitoring was when we deployed our first two or three microservices, and as we deployed more and more, it became all the more important. Lots of things can go wrong when you have this many components, and root cause determination can be difficult. So we did a lot of work to start collecting metrics on every inbound and outbound request for every microservice, with response times and error codes and as much detail as we could get. We can look at those things in Grafana; there's a little screenshot here at the bottom, and in fact I have a talk tomorrow that goes even deeper into all this if you're interested. We were very interested in memory usage, CPU usage, uptime, and crashes for all of our microservices. Our monitoring is now set up so that if an app crashes, for example, we send an alert to the appropriate people. We also monitor the general health of ourselves and our dependencies. For example, if we can't get to our Redis server, we can't share tokens, and people can't really authenticate and do what they need to do in the UI. So that's part of the health we have to keep monitoring. And we also run synthetic page loads: we run sitespeed.io scripts regularly in the background so we can always see how certain pages are performing, from different parts of the world. Now, back to the global load balancing. We currently have four public regions of Bluemix: Dallas, London, Frankfurt, and Sydney.
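The per-request metric collection described above can be sketched as a small wrapper around a request handler that records duration and status for every call. The handler shape and field names are hypothetical; in practice the metrics would feed a time-series store behind Grafana rather than an in-memory array:

```javascript
// Sketch of per-request metric collection: wrap a handler so every call
// records its path, status code, and response time. The metrics sink is
// just an array here for illustration.
function instrument(handler, metrics) {
  return function instrumented(req) {
    const start = Date.now();
    try {
      const res = handler(req);
      metrics.push({ path: req.path, status: res.status, ms: Date.now() - start });
      return res;
    } catch (err) {
      // Failed requests are recorded too, so error rates are visible.
      metrics.push({ path: req.path, status: 500, ms: Date.now() - start });
      throw err;
    }
  };
}
```

Applying the same idea to outbound calls (wrapping the HTTP client instead of the handler) gives the other half of the picture: response times and error codes for every dependency each microservice talks to.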
We've got all these microservices deployed in each one, and each one has its own URL: console.region.bluemix.net. As I mentioned, we wanted to do some more HA work here, because if one region goes down, maybe there's a problem with Cloud Foundry, which could never happen, I suppose, or there's a networking problem or something else, then users going directly to a regional URL are going to say, well, this thing doesn't work. They see errors and white pages and all that. So what we're currently rolling out we call the global console. Basically, we're starting to distribute the load over the microservice systems in all of our regions, with one global URL, which is actually live today: console.bluemix.net. We did have a concept of a region switcher even in the old model, but it would totally switch URLs to the deployment in the other region. Now it really just filters within the current deployment: wherever the UI is being served from, you're doing a filter within that. We're using Dyn geo load balancing, so whichever of our regions is closest to you is where you'll get the UI served from. If you're in New Zealand, you would probably get routed to our Sydney, Australia console. And the other thing, which goes back to monitoring and health checks: we've gotten our health checks to the point now where Dyn consults them in the different regions, and if a region is considered down by our health check, Dyn will stop routing there for a bit until it's healthy again. In this way, hopefully, the user never sees what looks like a full outage. They may not be able to manage their Dallas CF resources at any given time, but they can still manage their Kubernetes clusters and such, because they still have a UI that works and is able to talk to the appropriate back-end APIs across the world. And that takes me to the end. I already have a question.
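The health-aware geo routing described here, pick the closest region whose health check currently passes, can be sketched as a simple selection. The region names match the talk, but the latency numbers and data shape are made up for illustration; the real decision is made by Dyn, not by our code:

```javascript
// Sketch of health-aware geo routing: send the user to the closest
// region that is passing its health check, falling back to farther
// regions when the nearest one is down. Latencies are illustrative.
function pickRegion(regions) {
  const healthy = regions.filter((r) => r.healthy);
  if (healthy.length === 0) return null; // full outage: nothing to route to
  return healthy.sort((a, b) => a.latencyMs - b.latencyMs)[0].name;
}

const regions = [
  { name: 'dallas', latencyMs: 180, healthy: true },
  { name: 'london', latencyMs: 250, healthy: true },
  { name: 'sydney', latencyMs: 40, healthy: false }, // failing health check
];
// A user near Sydney fails over to the next-closest healthy region
// until Sydney's health check passes again.
```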
Well, the jump in memory was because we had had one app with two gigabytes per instance, and then we suddenly had 25 apps with 512 megabytes per instance, and if you just do the math, that's a lot more memory. Now, we have looked at this: some of our microservices were probably allocated more memory than they needed, and we've cut some of that back over time. Sometimes, if the microservice isn't doing much, 256 or even 128 megabytes may work. But we were never able to cut it all the way down to the memory usage of the one app. It's not as material to our public deployments, more so when we deploy onto a customer's hardware, and they ask, well, why do we have to pay extra for this? Obviously, my team is not paying for the use of the Cloud Foundry resources, but I do get to see what we would be billed if we did have to pay, and it does add up. So certainly, if you're not lucky enough to be able to work at IBM and deploy on IBM resources, and you need 55 gigabytes for your microservice system, there's going to be some cost involved; no matter which Cloud Foundry provider you use, you'll have to pay more for that. So, on your global load balancing, how did you keep the different regions in sync? Different versions? Different regions in sync, oh, yeah. That's a good question, because we do try when we roll out something new. I mentioned we have an on-deck and a production set of microservices in each of our regions. We typically upgrade them at roughly the same time, so we'll usually upgrade, say, Sydney, then Frankfurt, then London, then Dallas. But there are certainly times when they're not exactly the same version. In general, it doesn't hurt us too badly yet, because if you're getting your UI from Sydney and it's not exactly the same version as Dallas, at least the Sydney stuff is going to work with whatever APIs and such are there.
It may not quite have a new function that we've deployed to Frankfurt, so there could be a half hour where a user being routed to Sydney won't see exactly the same thing they saw in Frankfurt. But it hasn't been a huge issue for us yet. If we do a major upgrade of our theming at some point, though, we'll probably have to think about that a little harder, because if you get routed to Sydney and you see green and orange and whatever, and then you fail over to Dallas and see the old style, that might be a bit jarring. Also, we're in the process of doing something very similar; what suggestions or tips would you give us based on your experience? So, we've had pretty good luck with Dyn for this. We actually have a little bit of a weird situation, because we're using Akamai for our CDN and WAF/DDoS protection, so we have a combination of Akamai and Dyn, which is a little odd. We're actually looking to see if we can move one way or the other, so all the configuration for this is housed in one place. There are also issues we're working through with Dyn right now about why a failover happened when it happens. Sometimes our health check legitimately reported that we're down, because Redis is down. But sometimes requests don't even get to our health check; if there's a networking problem or a firewall config problem, it's not always easy to tell from the Dyn alerts why that was. So we have to get our networking guys involved to ask why those requests weren't getting through. Does that answer the question? Oh, I'm sorry, I thought you were trying to move to a global load balancer. Oh, OK, now I understand the question: tips on moving from a monolith to microservices. Yeah, I think I've got a lot of those. Probably not enough time, but.
Yeah, I wonder if we could let people go to lunch, and anyone who still has questions could come up to the front and talk to Tony. Thank you very much, Tony. Thank you.