It's about time, so let's get started. How are you guys? Good? Thanks for joining. I'm Smith, and I work at dubizzle. Today I'm going to talk about dubizzle's journey of moving a legacy code base to a microservice architecture. I'll talk about what challenges we faced, what mistakes we made, how we got out of them, and where we are today.

A little bit about me: as I said, I work for dubizzle. I also run a consulting company called xypher in my part-time, which provides services for building apps and websites; you can check out our website, xypher.tech. I've also contributed a little to open source. I started my career at Red Hat, where I contributed to some Gluster projects, and I've contributed to the ngui-auto-complete plugin in the past. I also used to run a community called GDG Gandhinagar when I was in India. That's pretty much about me; to know more, you can check out my GitHub, or if you want to connect, we can always connect on LinkedIn.

So let's get started. Just to give you a little idea of what dubizzle is: dubizzle is the number one classifieds website in the UAE. We have users trying to sell their wardrobes and washing machines, all the way up to, you know, cool stuff like a Boeing 747 or a dinosaur skeleton. Actually, that dinosaur skeleton was sold for 24 million dollars, and it is one of the nicest things ever sold on dubizzle. We have people selling their yachts, everything. It's Dubai, you know: antiques, yachts, anything that money can buy, people sell it on dubizzle.
We operate at a massive scale of 3.5 million monthly active users, and every minute 20 new ads are placed on the platform. We help our users find houses, cars, jobs, and used items, and our business is structured along these verticals. Each vertical at dubizzle acts like its own independent startup and is empowered to deliver value and act fast.

To summarize, as an organization what we believe in is empowering teams to sustainably deliver value. In order to sustainably deliver value, a team should have access to all the tools required to solve any problem; a team should be able to experiment and iterate fast on the solutions it makes; the solutions should be data-backed and effective for our users; the team that builds the technology should be able to own it; and the systems should be in place to support it. These are the requirements for delivering value fast and in an independent manner.

Just to give you a little historical context about dubizzle: it was started in 2005. This was back when George Bush started serving as President of the United States for the second time, and the first ever video was uploaded to YouTube. It was started by two American guys who saw the market opportunity to build a platform in the UAE. There was no Craigslist in the UAE at that time, so yeah, they built one. It used to look like this. The initial version was outsourced to a company, and it was built with one of the most sophisticated technologies of that time, called Adobe Flash. As you can see, the concept of individual categories has been there since the very first day. But when we started, we were only operating in one city: Dubai.

In 2010, the movie Inception was released, and we launched in another city, Abu Dhabi. You might have seen Abu Dhabi in Mission Impossible: Fallout. So Abu Dhabi is an Emirate; it's also a city.
It's also a state, and like a country within a country, because in the UAE there are no cities or states; there are Emirates, and an Emirate has its own laws and so on. As you can see, there was a huge gap in time between when we first launched and when we launched in a second city. That was because not a lot of VC funding was available in the region at that time, and the co-founders were hustling to get the platform up and running.

Then in 2011 we decided to launch in other MENA regions like Qatar, Saudi Arabia, and Bahrain. The website had to be localized, and features had to be tweaked as per the market sophistication of these regions. Some features had to be turned on or off per region because of local laws or because of how people adopted the platform, some features behaved differently, and we had different pricing models. All of this led to a spaghetti code base. Also, Dubai being a difficult region to hire in, engineers were learning on the job. As you can see, this could easily get out of hand, especially when you are thinking about operating at a scale of more than a million users and across multiple regions.

To get out of this, we decided to create a whole new platform for these emerging markets. This was when the microservices era was just starting, and our research recommended building CRUD APIs that are REST-based. We went API-crazy: for every new service we built, we built it in the form of CRUD APIs.

That brings us to CRUD APIs. Frameworks like Django and Ruby on Rails promote CRUD and support it out of the box, so yeah, we built CRUD APIs. But what we didn't realize was that when you build CRUD APIs, the business logic seeps into the client. Also, the dogma of CRUD-based APIs opens up a lot of discussions. For example, to feature an ad, should we send a PUT request to the ad, or create a new feature resource via a POST request? Different engineers have
different opinions, and this can lead to a lot of fragmentation. Also, just the fact that the business logic lives in the client makes it prone to reverse engineering. Luckily, at that time we had some security policies, rate limiting, and so on in place, but still, having business logic in the client can be a nightmare. Testing can get difficult, because the logic will be implemented differently across all the platforms and you might notice different behaviors. Refactoring code on the client side can be easy, but refactoring these REST APIs is difficult, because if you update one API, you have to update all the platforms. Doing that on web is possible, but doing it on mobile is impossible, because not all of your users will update the app. So you have to start thinking about backwards compatibility for your CRUD APIs.

We realized that when we build CRUD APIs, it forces us to think in terms of DB operations. But what we are actually creating is a layered architecture across the network, across code bases. Developing a feature is equivalent to touching multiple code bases and multiple platforms. It is important to think holistically about how these different parts will interact with each other.

Then we had another problem. After a few years, our architecture looked like this. It helped us to be fast by being separate, but with every new service we built, we were increasing the load on our monolith. The idea behind running on the cloud is to scale horizontally, right?
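To make the earlier CRUD point concrete, here is a minimal sketch, not dubizzle's actual code, contrasting a CRUD-style PUT (where the client has to know the business rule) with an intent-based endpoint (where the server owns it). All names, fields, and the credit rule are illustrative assumptions.

```python
# Hypothetical in-memory "ad" store; fields are invented for illustration.
ads = {42: {"title": "Used kayak", "featured": False, "credits": 10}}

# CRUD style: the client decides the business rule (enough credits?)
# and PUTs the resulting state. The rule is duplicated in every client.
def put_ad(ad_id, body):
    ads[ad_id].update(body)
    return ads[ad_id]

# Intent style: the server owns the rule; clients just state the intent.
FEATURE_COST = 5  # illustrative pricing rule

def feature_ad(ad_id):
    ad = ads[ad_id]
    if ad["credits"] < FEATURE_COST:
        raise ValueError("not enough credits")
    ad["credits"] -= FEATURE_COST
    ad["featured"] = True
    return ad

feature_ad(42)  # server applies the rule; no client needs to know it
```

With the intent-based endpoint, the PUT-vs-POST debate disappears and the rule cannot drift between web and the mobile apps.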
But this architecture was forcing us to scale vertically as well. A failure in, let's say, Layer, which is a third-party chat service we use, could propagate from the chat service to the mobile service to the monolith, and could potentially bring the whole website down. There were single points of failure, the failures propagated throughout the system, and we could only trace these dependencies once we noticed a failure; we wanted to look ahead of time.

So after two to three years of doing this, we had to take a step back, because every task on our product roadmap was either too big or way too big: the time to deliver these features, and the cost of maintaining them, was very high. Microservices are supposed to give you speed, but we were not getting any speed. We had release pipelines, we had containers running in production, etc., but we were not getting any speed. Being fast is one of our core values, and our teams are structured to work independently, but we were not getting speed, we were not getting any independence, and we were very dependent on each other to do anything. Our monolith, being a single point of failure, could potentially bring all the verticals down. Microservices were promised to take us to the moon, but we couldn't even get off the ground.

This brings us to 2017. We thought it was time to revisit our culture and methodologies, so we came up with a set of guiding principles. We went back to our drawing board to understand what we were doing wrong, how we could get better, and what we could change. We attended a lot of conferences, built PoCs, and did a lot of unlearning, learning, and reading to help us make our way forward. After this entire process,
we came up with a set of guiding principles that we could keep in mind when architecting systems in the future and when re-architecting our existing platform. These principles are inspired by the book Domain-Driven Design by Eric Evans, adapted to the way we work. The gist of these principles is to model your system the way it acts in the real world. We will talk about these principles in the next slides; the first five principles tell you what to do, and the last five tell you how to do it.

The first one is low coupling and high cohesion. We always thought that users and ads were the two main entities; as you can see from our previous architecture, this is very evident. So we built an Ad API and a User API to solve all our problems. But what we didn't realize is that the entire ad information is not relevant in every context. For example, the person who is looking at an ad will be interested in the title, images, description, etc. The person who is placing an ad might be interested in the boost packages, the promotion package information, his credits, the quality score of the ad, etc.
Since these contexts are completely different, they can be decoupled, and the communication between these contexts can only happen via predefined touch points. This way we can minimize the impact surface area if any of the contexts goes down. For example, if the place-an-ad context goes down, the users who are browsing ads will not be affected.

The next one is: no service can depend on another for uptime, functionality, or data availability. Teams are independent, and we need to optimize for the independence of teams in order to be truly fast. The successes and failures of teams should not be coupled with one another: team A's performance should not affect team B's performance. For example, as you can see in our previous architecture, we built a User API and an Ad API. Continuing with that example, say we add one more service, a chat email notification service, responsible for sending the notifications for unread chat messages. This service would require the listing title. What if the listing service goes down? The team whose KPI is engagement will also be affected.

Let's take one more example. We have a feature where a user can perform a search, save that search, and ask dubizzle to send search alerts. Doing this requires us to scan the database to see what new ads were placed on the platform after a given point in time, and send emails to the user. This is a very database-intensive operation, but the performance of other features should not be affected by it, right? So instead, via a message bus or some other mechanism, we asynchronously take this data, store it locally or cache it, and then use it when sending notifications or search alerts. This way, even if the listing service is down, the other teams' KPIs won't be hurt.
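The caching pattern just described can be sketched with a toy in-memory bus. This is a simplified illustration, not dubizzle's implementation; the `Bus` class, topic names, and event shapes are all invented. The point is that the notification side keeps its own copy of listing titles, so it never calls the listing service at send time.

```python
from collections import defaultdict

class Bus:
    """Toy stand-in for a message bus like RabbitMQ."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.handlers[topic]:
            handler(event)

bus = Bus()
title_cache = {}  # local read copy owned by the notification team

def cache_title(event):
    # Consume listing events and keep only what we need: the title.
    title_cache[event["id"]] = event["title"]

bus.subscribe("listing.created", cache_title)

def send_chat_notification(listing_id, message):
    # No synchronous call to the listing service: read the local cache.
    title = title_cache.get(listing_id, "a listing")
    return f'New message about "{title}": {message}'

bus.publish("listing.created", {"id": 7, "title": "Boeing 747"})
```

If the listing service dies after the event was published, `send_chat_notification` still works from the local copy, which is exactly the independence the principle asks for.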
The next one is: one transaction can span at most one service. As you know, we operate in Dubai and it's a difficult place to hire; we don't have any PhDs in our office. So, as a rule of thumb, we decided not to do any distributed transactions till we get a PhD.

Many times you are developing a feature and you need to interact with multiple services. For example, in this case we have the functionality of placing an ad, browsing ads, and then favoriting an ad on our platform. Here, the favorite service is the source of truth for the favorite information. But when you are browsing and looking at the same ad again, the listing detail view should also show you that yes, you have favorited this ad, right? In this setup, if the favorite service is down, you won't be able to see your favorite information.

Instead, we can implement the CQRS design pattern. CQRS stands for command query responsibility segregation. What it tells you is to create a read-only replica database that supports the query you want to perform, and your microservice is responsible for listening to the domain events produced by the owner service. If we implement CQRS in this scenario, we can consume the events produced by the favorite service after the user has favorited the ad, and locally store or cache the favorite information in the ad service. This way, we don't have to depend on the favorite service to get the favorite information when showing the listing. But this also introduces eventual consistency into our systems, and we have to build clients that are smart enough to account for such inconsistencies.

The next one is: no vertical can impact the stability of another. How many of you here know about Conway's law? Okay, can someone give a brief of what you understand about Conway's law? Pardon? Yeah. Absolutely right.
Yeah, so Conway said that any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure. He said this in, I think, 1967 and submitted it to Harvard Business Review, and they rejected it. But I think in the 2000s, Harvard Business Review managed to do a survey and was able to prove Conway's law. Eric Raymond says the same thing: if you have four people, or four teams, working on a compiler, you will get a four-pass compiler.

At dubizzle, our business is aligned along these verticals and thinks in terms of verticals; our teams and our management also think in terms of verticals. So why not embrace the law in how we build things? It's okay to have shared services; there is no harm, as long as they have clear owners and no vertical-specific logic.

In the old days we were fighting against Conway's law. We were building inheritance hierarchies, and we wanted to be DRY, but it actually increased the complexity of our system, because the vertical-specific logic was all over the place. It's complex to build good abstractions, and in many cases, you know, you have a release that you have to ship, and some engineer will just add a statement in the base class and get it done, and that leads to poor and immature abstractions. So instead of doing that, we can separate the verticals and build separate services that contain only that vertical's specific logic. And in case something is going to be shared by all the verticals, which means shared by all the categories, we can build a horizontal service that does not have any vertical-specific information.

The next one is: bounded contexts are defined by the business realm, not by CRUD APIs. Around three years back,
we built the place-an-ad functionality in a generic manner. Today, as our business grows, we want a more vertical-specific user experience, tailored to our users, and we want each vertical to tailor the experience differently as per the needs of its own users. Placing an ad for a property and placing an ad for a used car are completely different, and the customers are also different: most of our customers trying to sell properties on dubizzle are real estate agencies, while a lot of the people trying to sell cars on dubizzle, around 50%, are C2C users. So yeah, we want to optimize and tailor our experience towards our users. But if you build a User API and an Ad API, it does not encapsulate what's happening in the real world. People are registering, saving searches, and so on, so we started to think about bounded contexts as business realms: the business context should be a first-class citizen, not the ad entity as a whole.

So today, in 2018, we have moved completely to an event-driven architecture where most of our communication happens via a message bus. This event bus allows us to decouple our systems, and together with the sidecar model that we have for sending messages in the background, it delivers the stability that we desire. Since our APIs are no longer CRUD-based, degradation of a service doesn't happen because of synchronous network calls. The testament to this architecture has been when our monolith went down but our property vertical was completely up and functional during the downtime.

Also, when building a microservice architecture, observability plays a key role, because a lot of stuff is happening in the background.
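The sidecar-style background sending mentioned above can be sketched as follows. This is a minimal illustration, not dubizzle's actual sidecar: the broker call is faked with a list, and all names are invented. The idea is that the request path only appends to a local outbox; a background worker drains it to the broker, so a slow or down broker never blocks a request.

```python
import queue
import threading

outbox = queue.Queue()  # local buffer the request path writes to
delivered = []          # stand-in for messages that reached the broker

def broker_send(msg):
    # Placeholder for a real RabbitMQ publish call.
    delivered.append(msg)

def publish(event):
    # Called on the request path: never blocks on the broker.
    outbox.put(event)

def sidecar_worker():
    # Drains the outbox in the background; None is a shutdown sentinel.
    while True:
        msg = outbox.get()
        if msg is None:
            break
        broker_send(msg)

worker = threading.Thread(target=sidecar_worker)
worker.start()
publish({"type": "ad.placed", "ad": 1})
publish(None)  # shut the worker down for this demo
worker.join()
```

In a real deployment the worker would be a separate sidecar process with retries and persistence, but the decoupling shown here is the core of it.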
So I'll just quickly talk about the strategy dubizzle adopted towards observability. A lot of what we do to achieve observability has been inspired by the Google SRE book and the four golden signals mentioned there.

We use New Relic for application performance monitoring and error reporting. It also allows us to do distributed tracing for all the synchronous interactions happening between the services. These New Relic alerts are routed to PagerDuty, and PagerDuty is responsible for notifying the people who are responsible, the owners of that component.

Like every other organization, we use the ELK stack for logging. These logs are propagated to Elasticsearch via Logstash. We also have some alerts configured there, but those alerts are more business-specific: some things are not alerts but logs sent to ELK, and we want to have some alerts on top of them. These alerts can be routed via PagerDuty or Slack. In most cases we send them to Slack, because we don't want to jam PagerDuty and send a lot of calls to the people who are on call; if we do that, people will start getting annoyed and won't answer PagerDuty's calls.

For infrastructure monitoring, since we are on Amazon, we use Amazon CloudWatch, and we send those alerts to PagerDuty as well, and PagerDuty routes them to the respective people.

For cron jobs and background tasks, the alerts look like this, or if PagerDuty calls you, it tells you there is some issue with your cron job. We have integrated alerting for cron jobs into our scheduler: we use Mesosphere for container orchestration and Chronos for scheduling crons, so we have put these alerts in Chronos itself. For every new cron job that we deploy, the status is recorded, and we can configure an alert so that if the cron job fails, the right people are notified and can react.

Sorry, this slide was queue alarms.
So, yeah, since a lot of communication is happening via the bus, and things are in RabbitMQ, we need to make sure the queues don't get full. There have been cases where a consumer died, or there was just a lot of load or traffic, the consumers were scaled to their limit, and a lot of messages piled up in the queue. At that point it is important to notify people: please scale the consumers further, or look at why these consumers are stuck and why these messages are stuck. As I said about the cron jobs, we have integrated that in Chronos.

Then we have built a small in-house tool called Telemetry. We use InfluxDB, Grafana, and Redis for it. Why Redis? Because we don't want to put a heavy load on InfluxDB; it might go down and we might lose some data, and we don't want that. So we send the metrics to Redis, then a small service called the telemetry collector runs, takes the data from Redis, and puts it into InfluxDB. Then we have some code in Telemetry that automatically generates the dashboard for the metrics that you send, and after that you can just go to Grafana and configure your alerts.

Out of the box, Telemetry gives us the ability to see the time lag for messages. Time lag is basically the consume time minus the time at which the message was produced. Then you can monitor the execution time: after the message has been consumed, how much time did it take to get processed? You can also monitor the size of messages.
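A toy version of these consumer metrics, backed by a plain list instead of Redis and InfluxDB, might look like this. The class, the metric names, and the message shape are all guesses for illustration; only the time-lag definition (consume time minus produce time) comes from the talk.

```python
import time

class Telemetry:
    """Illustrative stand-in for dubizzle's in-house Telemetry tool."""
    def __init__(self):
        self.points = []

    def record_point(self, metric, value):
        # Simple numeric metric, like a count or a duration.
        self.points.append({"metric": metric, "value": value, "ts": time.time()})

telemetry = Telemetry()

def on_message(msg):
    # Time lag = time at consumption minus time at which it was produced.
    telemetry.record_point("consumer.time_lag", time.time() - msg["produced_at"])

    start = time.perf_counter()
    # ... actual message processing would happen here ...
    telemetry.record_point("consumer.execution_time",
                           time.perf_counter() - start)
    telemetry.record_point("consumer.message_size", len(str(msg)))

# Simulate a message produced two seconds ago.
on_message({"produced_at": time.time() - 2.0, "body": "hello"})
```

A collector would periodically flush `telemetry.points` to the time-series store, and the dashboards and alerts would be built on top of those series.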
There have been instances where our consumers were acting weird because of heavy message sizes, or a consumer just crashed because it didn't have enough memory to process a message. So we integrated the ability to see the message size, and in case a message is stuck because of lack of memory, we can vertically scale the consumer and let the message flow. We can monitor the throughput, which is also very important, and we can monitor the metrics produced by the crons: their successes, failures, execution times, etc.

Telemetry also offers a low-level API to send metrics and generate dashboards. It provides you two methods, basically: record_point and record_data. record_point allows you to send a metric as just counts or numbers; if you want to attach more information, you can use record_data, where you can put UUIDs, an email address, whatever you want. It's like logger.info: just put whatever you like, and Grafana will automatically have the dashboards for the metrics you are sending. And you can configure the alerts, or modify the generated dashboards, because the automatically generated dashboard might not always cover all the needs you have.

So, just to summarize, these were the things that helped us through the phase of transformation. Bounded contexts help us clearly separate the business contexts and user contexts. Eventual consistency helps us decouple our domains and contexts; there is decoupling in the sense that the producer is not aware of the consumers consuming the event, and this makes the system very extensible. Moving to a microservice-based architecture is a cultural shift.
So developer training is essential; developer education and empowerment is an important step in making this transition successful. When migrating to a new architecture, it is necessary to reduce the knowns and unknowns. So we have standardized the tooling, and we have built the tooling in a way that it integrates with the frameworks and the stack that we use, so that it cannot happen that two services have different ways of sending metrics to InfluxDB; that's why we built Telemetry. Similarly, we have tooling in place to make sure that we reduce the knowns and unknowns.

Yeah, questions?

[Audience question]

We implemented it ourselves.

[Audience question]

Yeah, so before getting to a microservice-based architecture, this was our architecture: a service-oriented architecture. We started with a very spaghetti code base, and then, at that time, microservices were booming, so we said, you know, let's do microservices. But that resulted in the architecture on the left side, a service-oriented architecture. In a microservice architecture, systems should not be dependent on each other for uptime and availability, while in a service-oriented architecture your services can rely on other systems for their uptime and availability; the things are just separate.

[Audience question]

Yeah, so we have always been with Python throughout our journey, and most of our services are written in Python. For the message bus we use RabbitMQ, and we have built a small abstraction on top of the messaging library Kombu that hides a lot of the details about how we are sending messages, creating the queues, etc. Telemetry integrates well with Kombu, so whenever you are writing any events for your bus, you get these features out of the box, like knowing the message sizes, the throughput, the execution time, etc. You just have to add a decorator, and it will do the job for you. Any other questions?
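A decorator like the one mentioned in that answer could look roughly like this. This is a hedged sketch, not the actual dubizzle/Kombu integration: the decorator name, the metric fields, and the handler signature are all assumptions.

```python
import functools
import time

metrics = []  # stand-in for Telemetry's metric sink

def record(handler):
    """Wrap a message handler to record execution time and message size."""
    @functools.wraps(handler)
    def wrapper(body, *args, **kwargs):
        start = time.perf_counter()
        result = handler(body, *args, **kwargs)
        metrics.append({
            "handler": handler.__name__,
            "message_size": len(str(body)),
            "execution_time": time.perf_counter() - start,
        })
        return result
    return wrapper

@record
def handle_ad_placed(body):
    # Hypothetical consumer for an "ad placed" event.
    return f"processed ad {body['ad_id']}"

handle_ad_placed({"ad_id": 99})
```

The appeal of the decorator approach is that every consumer gets the same metrics for free, with no per-service boilerplate, which is exactly the "reduce the knowns and unknowns" standardization described above.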
Yeah. What version? Different versions. Yeah, we had to do that for our mobile apps. Yes, of course, API versioning can get tedious, and there were scenarios where we broke backwards compatibility, but then we added tests as much as possible to make sure that we don't break backwards compatibility. These services were interacting internally, and for our mobile apps we had a mobile gateway called Combi that was taking care of the low-level details. So, as you can see, the mobile API was connecting all the services for the mobile apps. And our web can always be changed; we don't need backwards compatibility for that. So yeah, any other questions?

Last but not least: thank you, and we are hiring. We are doing some cool tech stuff, and if you are interested in disrupting how the classifieds industry works, please get in touch; I would be happy to share the opportunities that we have. We are hiring Python engineers, full-stack engineers, and mobile engineers, for all the platforms. So yeah, thank you very much, guys, for listening.