Hi! I'm here today to talk about something I'm very, very passionate about: code and messy data. Yes, I'm passionate about messy data, but I'm mostly passionate about making it beautiful. As a wise man once said, you learn the most by sharing your knowledge with others. So here I am today, and hopefully you'll take away something interesting, even if I get nothing else out of this talk. Here's a picture of me with the dragon. She's the one who causes those Friday fires you're so afraid of when you deploy. I'm Michelle Sanver, and I'm an addict. I call myself a code addict, and it's nothing I'm working to get rid of. I've been programming since I was a little child and I never really wanted to stop; I just kept doing it. And so that you don't spend the entire talk wondering about my accent: I'm from Sweden, I'm half Danish, I lived in the Netherlands for six years, and since 2014 I live in Switzerland. So it's European. In Switzerland I work at this really amazing company called Liip. We are self-managed, which means I am my own manager. It also means it's really tricky to fire people, because I don't like doing that. So basically we can do anything we want. Except we can't, because we still have to get paid. But basically, yeah, it's great. How many people have heard of Liip before? Okay, the people who already know me. I was going to take a picture of all the people putting their hands up because you've heard of Liip, and that's cool, but let's not do that. So, Liip is a web agency, and in this agency I've been working on the same thing since I started, more than five years ago. We build a product that improves the way the Swiss people do shopping. That's pretty cool, because it also improves the way I do shopping. How often do you get to be a consumer of your own product? I love that. A little bit of a disclaimer: I do a lot of ranting in this talk. Retail data is a really complex business, and I don't mean to be condescending or anything.
We have nothing against our data providers. They have a lot of data that was not meant for a digital age, and it's our job to take that messy data and convert it for a digital age. It's their job to provide us with the data, so it's messy, but it's okay. This talk is for everyone, and I hope everyone can get something out of it. Some concepts may be a little bit confusing, especially if you haven't used Symfony before. How many of you have used the Symfony framework before? Okay, so it's probably not that confusing. But if it is, or you have any questions, there will be a Q&A at the end, so you can write down your questions and I'll try to answer them there or in a break.

So, the agenda today. First I will talk about the project: the biggest retailer of Switzerland. Then I will discuss some challenges. It's a huge API: how did we solve the serializer bottleneck? Then importing the data, and when the third-party data provider lies to you. It's supposed to be a string — why is it suddenly an object? I don't know. Then mapping: how we map the data to contain the mess in one little place, with unicorns. And then I will talk about how we evolved a Symfony project, how we kept that project up to date, and how we still love working on that code today. We're still the same team of developers today that we were five years ago, and that's really, really rare in this industry, for everyone to keep working together for so long. How did we do that?

So let's talk about the project. I wanted to have one of those Star Wars title screens. In a country far, far away, it started as a small API. In the beginning it was meant to output a few products on a website. And then the retailer thought, okay, this is really cool, we can actually give them all our data and it looks nice in the end. So they just kept giving us data, year after year after year.
We keep adding features, and now there's a whole range of websites that this retailer has using our API, and apps and everything. Even the cash registers are using our API now. So it started as a small API for a website and became something really complicated that cash registers use, and we grew organically. We managed to do that because we kept our code up to date. We have a huge technology stack, and there's no way one person in our team can know everything. We mainly use Symfony and Elasticsearch. We use RabbitMQ for queuing, MySQL to store the original data, Redis for caching, New Relic and health checks for monitoring, and Xdebug, of course, for debugging. We use OpenAPI, PHP-CS-Fixer, PHPStan, Node and React for the admin panel, and Golang, because the logo is cute. And since there are so many Symfony people in this room, or at least people who know Symfony: yes, we have an API; no, we don't use API Platform. If you don't know about API Platform, don't worry about it. It's just a disclaimer so that you don't wonder about it later in the talk and get confused.

So, this is us, this is our team. Timmer is the new PO, so he doesn't have a cool colorful picture yet. We are eight developers: Ray, Toby, Christian, Theresa, Martin, David, Emanuele, and me. And we all have the same role in our development team. We try not to distinguish between who has the most experience and who has the least. It doesn't matter, because we all have input, we all have eyes, and we all think of things in different ways. That's one of the keys to this: no hierarchy. Then we have Timmer and Colin, who keep us sane when there are a lot of customers trying to talk to us. And there's Leo, our scrum master, who is way too passionate about agile; it's scary. And then there's the stack guy, who's really annoying, because you throw a problem at him that you've been trying to solve for two days, and he solves it in ten minutes.
That's Shrego; we call him our cloud guy, for lack of a better role. So, in this talk I will now assume you know what Symfony is and the general structure of Symfony. When I talk about controllers, services or config, I expect you won't get confused — but if you do, that's fine, it's not the main part of this talk.

This is our API, very, very simplified; there's a lot more to it. We have our REST controllers using the FOSRestBundle. We have serializing, Elasticsearch, mapping, MySQL, importing, and the data providers. So if you look from the bottom: the data providers give us data, we import that original data into MySQL and store it there, and then from MySQL we map that data into Elasticsearch. Elasticsearch is the source of beautiful data that we serialize to get the output the consumers request. I'm not going to talk about controllers or the API layer at all in this talk, but if you have any questions about it later, feel free to ask.

So let's jump into the challenges. Did I mention that our API is huge? Both code and data. If we look at just the code, the source folder is 8.7 megabytes with 2067 items, tests are a bit larger because of all the fixtures, and the config has 329 items in it. And I'm a bit afraid to show you the vendor folder. And that's just code. This is the structure we have under api; that's basically our domain. In here we have things like products and discount coupons, and we name everything "client" that is a third-party client for the data providers that get us the data. Then we have infrastructure for everything we need in between, migration for data and Doctrine migrations, and a few loose things; that makes sense. So one of the challenges is importing a lot of data, and not just from one source. How do you import data from all these sources and put it into one nice system? This is the part of our API that does that.
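The pipeline just described — fetch from a provider, store the untouched original first, map only afterwards — can be sketched roughly like this. This is a minimal illustration with hypothetical class names (the talk's real code uses Symfony commands and MySQL; here the "store" is just an in-memory array):

```php
<?php
// Sketch of the raw-data-first import flow (hypothetical names, no framework).
// Storing the ORIGINAL payload before any mapping means a mapping bug never
// forces a re-import from the third-party provider.

interface DataProvider
{
    /** @return string raw payload (JSON, XML, ...) exactly as received */
    public function fetch(string $productId): string;
}

final class RawPayloadStore
{
    /** @var array<string, string> stand-in for the MySQL table of originals */
    private array $rows = [];

    public function save(string $productId, string $rawPayload): void
    {
        $this->rows[$productId] = $rawPayload;
    }

    public function get(string $productId): ?string
    {
        return $this->rows[$productId] ?? null;
    }
}

final class ImportCommand
{
    public function __construct(
        private DataProvider $provider,
        private RawPayloadStore $store,
    ) {
    }

    public function import(string $productId): void
    {
        // Persist the untouched payload BEFORE any mapping happens,
        // so broken mappings can always be re-run from local data.
        $this->store->save($productId, $this->provider->fetch($productId));
    }
}
```

Mapping into Elasticsearch would then read from the store, never from the provider directly.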
Well, it's a lot of importing, and a lot of data providers. The way we import data is with import commands: we have a command in Symfony that talks to our data providers and imports the data into MySQL. We also have workers using RabbitMQ. And this is really, really, really important and crucial: we store the original data in MySQL. Because if we don't do that and we map the data, then when something goes wrong — and we will make bugs — we can't remap it. We would have to re-import it from the data provider, and it becomes a huge mess. It doesn't have to be MySQL, but always find somewhere to store the original data. Important lesson.

The problem is that any of the data providers can send us data at any time, and when that happens, we have to make sure we can handle it. And they can send us a lot of data at any time — millions of product updates. Let's say a category changes name, the main category of this big retailer: they might send us millions of updates. So we have queues and workers with auto-scaling to handle this. How many people have worked with queuing before? Most of you, cool. For the ones who haven't: if you have questions, feel free to ask them later. For our queues and queuing, we used to use the AMQP RabbitMQ plugin, but we recently switched to Symfony Messenger, which is so much nicer. Switching to Symfony Messenger took a lot of time. These are just a couple of the pull requests that a colleague of mine made to Symfony Messenger itself when we started working on it. So it's partly thanks to our project that Symfony Messenger is so cool today. And here's his merge request: 167 changed files, just switching to Symfony Messenger. You can see there was quite a discussion in the comments. Switching to Symfony Messenger simplified our code a lot. Normally, with any queuing, you would have to define your queues and your workers and everything — but with Symfony Messenger, it's different.
You can remove a lot of that boilerplate code, and that's the result you see here. It also forced us to use more value objects. Before, we had the AMQP envelope; Messenger forced us to write better code. Now we have a CategoryDeleted message, for instance, and everything is clear: we know exactly what message we get. It's not just a JSON blob or an XML blob like before. Of course, we could have done this before, but we didn't, because it's so easy not to. I love it when components force you to write better code. Before, I couldn't even show you all the commands we had on one screen, but now we can just use messenger:consume. So switching to Messenger was well worth the time. How many here use Symfony Messenger? Almost no one — but a lot of you have been using queues and a lot of you know Symfony. You can even use Symfony Messenger without using Symfony. It's a generic tool that doesn't depend on RabbitMQ or Kafka, which really helps, because if we now want to switch to Kafka, we can do that very easily. So play around with it. And I love that we could give back to the Symfony community with it; it's beautiful. But actually, the one thing I as a developer love the most in my day-to-day work: with RabbitMQ, before Messenger, I couldn't exit the worker with Ctrl-C. I can do that now. Yay! It reminds me of that Vim joke about trying to exit Vim. Every time someone tried to exit the worker, it felt like that.

So let's talk about consuming "bad" — in quotes — APIs without crying or becoming an alcoholic. First, let's talk about some of the APIs we consume that are not that great. One of them I call the SOAP-ish API. At first they had a SOAP API with one request, which said "action" or something like that, and that worked: with SOAP you send your XML and everything is great.
But then they decided to do it with REST, and now in a POST body you're supposed to define what you want from them. It's really messy, it's really confusing. That's not RESTful; it's not even proper SOAP. Then we have the flexible API, which means they basically have a key-value store for complex data. It's like using Redis for objects: it doesn't really work. In their flexible API, they try to be so flexible that you tell them what you want in the URL. So if I want a product ID and a name, for instance, I say in the URL: I want the fields product ID and name. Then I get a JSON response, and it gives me a pointer to get the string of that product ID, because it just says "it's a string, and here is where the string is stored". And then I have to make another request. Yeah, you get the point. It's flexible, and it's horrible. Also, whatever order you defined the variables in the URL, the data comes back in that order, but without a key, so you have to remember where everything was. Flexible! It's great.

But when you do have to consume APIs like that, APIs that are not great: first of all, you get a lot of inspiration, because you know that you can do better, and that's great. And then you laugh about it, and you pair program and write songs, and you play Baby Shark and annoy your colleagues. Really, that's how we managed to survive when we got headaches: literally joking about it together. Pairing is caring; suffering is best done together. I like to think that we are superheroes shielding our consumers from the pain that we have. Or maybe that's some kind of Stockholm syndrome, but I like to think that we are superheroes.

So let's talk about third-party data providers and when they lie to you. We had cases where we were supposed to get one format of data but got something else. We were supposed to get one product, but we got a list of ten. We were supposed to get a string, but we got an array. All these kinds of things make it really difficult for you.
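One practical way to survive providers that lie like this is to normalize the data shape at the boundary, in one place. A minimal sketch of that idea — the helper names are hypothetical, not from the actual project:

```php
<?php
// Defensive normalization sketch: providers sometimes send an array where a
// string was promised, or a single object where a list was promised.
// Normalizing once at the boundary beats sprinkling checks through the code.

/** Always returns a string, whatever shape the provider sent. */
function normalizeToString(mixed $value): string
{
    if (is_array($value)) {
        // e.g. ["Chocolate"] instead of "Chocolate": take the first element
        $value = reset($value);
    }
    if (!is_string($value)) {
        throw new UnexpectedValueException(
            'Expected string-ish value, got ' . get_debug_type($value)
        );
    }

    return $value;
}

/** Always returns a list, even if the provider sent a single item. */
function normalizeToList(mixed $value): array
{
    if (!is_array($value) || !array_is_list($value)) {
        // A single product (assoc array or scalar) instead of a list: wrap it.
        return [$value];
    }

    return $value;
}
```

The point is not these two specific helpers, but that every "they said X, they sent Y" case gets handled exactly once, and anything truly unrecognizable fails loudly instead of propagating.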
So again, pairing really is caring. But what we started to do — we realized that it's difficult for them to validate their own data, and that's okay. We live in an age where there's just too much data; they can't do it on their own, and someone has to tell them when the data is bad. So we started using JSON Schema for that. How many of you have used JSON Schema before? Okay, about a third of you. With this, we can take the JSON data and define exactly how it should look, like this. Here we see the title "products": it's an object and it has some properties that we define. For instance, the id property: we can give it a pattern, et cetera. So we tell them exactly how the data should look. And with this, we can validate their data for them. When we get a message from them, we run this to validate. Here you see we validate with the JSON string, and we throw an exception if we can't validate it. Don't worry, I will share the code and the slides later.

Something else that becomes extremely, extremely important when you don't know what data you'll get is defensive programming. There's no such thing as "this won't happen". Write your tests first, and when you think you have all the tests, write the ridiculous test for the thing you think could never happen, because it does. So yes, write the test for what happens when the data is not there, even though they told you the data will always be there. It helps. There's a really, really good talk about this if you want to learn more about defensive programming, by Marco Pivetta. It's on YouTube and it's called "Extremely Defensive PHP". A lot of you might know him better as Ocramius. So, we started out, as I said, as a tiny API. We had only one data source, and it was pretty easy for us. It wasn't that hard. But then they added another data source, and another, and then we had to prioritize.
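To make the "validate their data for them" idea concrete: in practice you would use a real JSON Schema library, but the principle can be shown with a deliberately tiny hand-rolled checker. Everything here (the function, the exception class, the schema shape) is a toy illustration, not the project's actual code:

```php
<?php
// Toy sketch of schema-style payload validation: declare the expected shape,
// throw as soon as the provider's data deviates from it. A real JSON Schema
// library does far more (types, nesting, formats); this only shows the idea.

final class PayloadValidationException extends RuntimeException
{
}

/**
 * @param array{required: string[], patterns: array<string, string>} $schema
 * @return array the decoded payload, known-good after this call
 */
function validatePayload(string $json, array $schema): array
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        throw new PayloadValidationException('Payload is not a JSON object');
    }
    foreach ($schema['required'] as $field) {
        if (!array_key_exists($field, $data)) {
            throw new PayloadValidationException("Missing required field: $field");
        }
    }
    foreach ($schema['patterns'] as $field => $pattern) {
        if (isset($data[$field]) && !preg_match($pattern, (string) $data[$field])) {
            throw new PayloadValidationException("Field $field does not match $pattern");
        }
    }

    return $data;
}
```

The payoff is the exception: instead of bad data silently flowing into the mappers, you get a precise message you can forward to the provider.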
We had to figure out: this data comes from this source, that data comes from that source, this product comes from this source, et cetera. And it becomes messy really fast if you don't think about it. So we made a thing called a decider service: you're that ID, you're from API X; you're that ID, you're from API Y. Before we import a product, we use our decider service to decide which source we should import data from. And importing that way becomes easy.

But mapping. Did I mention that data quality sucks? You know: missing spaces, strings instead of ints, arrays instead of an object, objects instead of a string, differently named fields — when you expect a field to be called "description", but suddenly they decided to use German for "description". Required data is missing, and more and more and more. But you can keep your sanity by containing it, and it's very, very important that you don't try too early to fix their data. You can use validation to tell them that their data is messy, but don't try too early to fix it, because then you get into a loop of having horrible data hacks everywhere in the code. We recently looked for every place in our code where we had a comment that said "hack". We had to change that, because nothing should be a hack; everything has to be done purposefully. So we have a product mapper, and that puts the product into Elasticsearch, and then we output that from Elasticsearch, with serializing in between. Pretty simple concept. A product mapper could look like this: we map name, brand, category, price and description on it. Each one of these is a mapper of its own. This is how it used to work: the N, B, C, P and D mappers. And then we add some more mappers. Oh, and we add some more mappers. And we have so much data on the product that the product mapper eventually explodes.
It's just way too much data. But what if we take all the data we're mapping on a product and split it into tiny parts? We decided that instead of mapping everything we need in one product mapper, we go through every mapper, one by one. So we have the name mapper, the description mapper, the image mapper — everything piece by piece. That way, every little thing we need to do, every piece of data we need to double-check and convert, is contained in its own little box, and we don't have those ridiculous classes and tests anymore. So then we put a mapper interface on each one so we can map with it, and put a factory around that: you give the factory all the classes, and it can loop through them. And that's how you get clean data to store in Elasticsearch.

Great — in theory. In practice, this doesn't work very well. First of all, you have to configure all of the mappers in order, because seven depends on three; you have to have the description before you have the ingredients, or whatever. Four depends on one, twenty-five depends on basically everything. So you have to order it, and you have to be careful. And every time we changed the config, a test might blow up — or even worse, if we missed a test, our entire application could blow up because we changed a little bit of config. That's not nice at all. Also: languages. Since we live in Switzerland, we always have to deal with at least three languages: German, French and Italian. So with this product factory, what we used to do is call it three times: once for German, once for Italian and once for French, with some caching in between. This meant that mapping a product with language-specific data often took three times longer than it had to. It was a mess, but let's see how we solved that. Mapper dependencies in Symfony: there's a thing called a compiler pass. How many of you have written compiler passes before? A few of you. How many of you know what it is?
Fewer than have written one? Okay. So in Symfony, a compiler pass is basically code that runs before you warm up the cache. That's all there is to it. It can hook into the container compilation process, play with your config, and create the services and everything for you. And basically that's what we want to do: we want to create the services. So for every mapper, in the mapper interface we said it needs to have getDependentFields and getFields. Here we see the retailer mapper: it depends on bus number, categories and additional categories, and the mapper itself handles the field retailer. Then in the compiler pass, we process that: we check if it's the product factory class — if not, we continue — and then we deal with the mappers here, finding and sorting them. So this is basically how we deal with dependencies: we handle them in the compiler pass, magically. We don't have to have a config anymore. We just check all of the mappers that we have and write our services. It's great. We sort, and we loop through the mappers until the dependencies can be resolved, and we throw a LogicException if they can't. Luckily, I never saw that exception, but it can happen in theory.

Then, dealing with languages: only deal with languages when you have to. An image doesn't have three languages — unless there's text on the image, which is a whole other story. A lot of things, when it's a code like a category code, don't have a language at all. You don't need to deal with all the languages everywhere in your application. So here's how we did that: foreach over the translated products, we deal with it that way. We just foreach inside each mapper instead of running everything three times, and when we don't need a language, we don't need the foreach. So, obviously I said there would be unicorns. This is our product factory unicorn. Yes, I did implement the unicorn ticket as soon as I started in this team, and that was cool.
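The sorting logic described above — loop through the mappers until every one's dependencies are already satisfied, and throw a LogicException if they never can be — could be sketched like this. The interface and function names are hypothetical stand-ins for the project's real ones:

```php
<?php
// Sketch of mapper dependency ordering. Each mapper declares the field it
// produces and the fields it needs; we repeatedly pick every mapper whose
// dependencies are already resolved. If a full pass makes no progress, the
// remaining dependencies are unsatisfiable (e.g. a cycle) and we throw.

interface FieldMapper
{
    public function getField(): string;

    /** @return string[] fields that must be mapped before this one */
    public function getDependentFields(): array;
}

/**
 * @param FieldMapper[] $mappers
 * @return FieldMapper[] in a safe execution order
 */
function sortMappers(array $mappers): array
{
    $sorted = [];
    $resolved = [];
    while ($mappers !== []) {
        $progress = false;
        foreach ($mappers as $i => $mapper) {
            $unmet = array_diff($mapper->getDependentFields(), $resolved);
            if ($unmet === []) {
                $sorted[] = $mapper;
                $resolved[] = $mapper->getField();
                unset($mappers[$i]);
                $progress = true;
            }
        }
        if (!$progress) {
            throw new LogicException('Unresolvable mapper dependencies');
        }
    }

    return $sorted;
}
```

In the real setup this would run inside the compiler pass, so the ordering is computed once at container compile time rather than on every request.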
Of course, there are some things that go without saying. You have to ensure quality, so you need to write tests. You need to have logging — lots and lots and lots of logging — and you need to write documentation on how you log, so people can actually find those log messages. We had a lot of logging, but a lot of us didn't know how to work with it. That was embarrassing and awkward. And then we have monitoring. You have to monitor your queues and everything. Log a lot, and your future self will love you. Monitor your queues, monitor your uptimes, and react quickly. We also have separate acceptance tests. When you deal with a lot of data, things can happen when you deploy: you remap the data, and suddenly there's a bug you didn't catch in your tests and data disappears. But luckily, acceptance tests will catch that. Test for critical data: if a product should always have a product name, make sure there's an acceptance test that checks whether any product is missing its name.

And then there's, of course, another project challenge: big API responses. We started as an API for the biggest retailer in Switzerland with products and just a few things on each product, and organically our consumers started to want more and more and more things on that product. And that becomes a mess, like this Slack conversation we had recently: how much stuff can we even put in one JSON file? The way we solved that is with Varnish ESI, so that we can partially cache everything, every product. If you have a product listing, it doesn't matter what combination it is; the parts are cached individually. I'm not going to talk much about that, but it's a term you can write down — Varnish ESI — and it will really help you. So let's look at serializing, versioning and groups. We need to handle different versions in our API when we have breaking changes, and we need to handle groups.
So we output some things in the detail view, and we output something else in the list view. We tried a lot of serializers, and we use JMS Serializer right now. And all of them suck when you have a lot of complex data. They are slow, they are cumbersome, it's painful. They handle everything great, but they suck. Plain json_decode won't work that well for this. The Symfony Serializer is cool and all, but it's slow. BetterSerializer would maybe be better if we could make it work, but it doesn't deal with complex data. So we use JMS Serializer, which is great. It has annotations, and it has the version support that we need: you can say, output this until version two, output this since version three. We also have virtual properties, which means we can have a little bit of logic when we output something. It works like magic with most frameworks, including Symfony, which we use. You have this config and it just works. Remember to set your datetime format, because if you don't, things can get really messy if you're not consistent in the way you format your dates. You can read the docs about JMS Serializer.

But as I said, it didn't really work that great for us. It does everything we need, but it was slow, and we had a bottleneck: we called visitProperty for one big product over 60,000 times. How can calling a method that many times ever be fast? Never. But all modern serializers in PHP that I've seen use a thing called the visitor pattern, and that is great — unless you visit each property a lot. So we wrote something else that we like to call the Liip serializer, which is not so much a serializer as it is a generator. You have your model with your annotations, then you parse those annotations and generate a new file based on that. And then, instead of visitProperty and so on, you use your generated code and you call one function, once.
And that generated code is some of the ugliest code I was ever part of writing — but I didn't technically write it, so it's okay. We had an overall performance gain of 55% over JMS for our use case. That means a response could go from 390 milliseconds to 175. CPU and I/O wait were both down by about 50%, and we gained 21% on memory. I would consider this a win. We tried using Golang first, and that's an entire talk of mine — how we used Golang and then went back to PHP for serializing. I could talk about this for hours. If you're curious about the serializer, you can ask me later, or you can read the blog post that I wrote, or look at the Liip serializer on GitHub. It's open source.

So: communication. I would say we have so many technical challenges, but the one we still never got quite right is communication. Communication is hard, because we have to work together, and working together with other people is really difficult. So it's important to give each other feedback and to establish a feedback culture. And that's not only personal peer-to-peer feedback; it also goes into code reviews. I recently had a moment where I felt really insecure and impostor syndrome was hitting me, and I realized that all the code reviews I had gotten lately only pointed out what's bad. You get constant feedback on what's bad, but never feedback on what's good in your code. That can really bring someone down. So we tried something, and what really helped was to also point out what's good. That way you encourage that behavior and people keep doing what's good. "I like the way you generated this here." Good — they will keep doing that. "I like the way you did this and that." Encourage good behavior and always point out when you like something, because if people doubt something, they might not do it again, and that's a shame if you liked it. We also have retrospectives every two weeks. How many of you use Scrum? Okay, about half.
So a retrospective is when you look back and try to improve. You don't need to use Scrum for this, but you discuss things like: why didn't we manage to finish the things we planned? And you try to improve the process — your code, but mostly how you work together. Communication. Communication is hard. As I said: respectfully improve code together, and team events really keep morale high. Our team is called Team Lego, so one time we made little Lego minifigs for everyone. That's great. Another key, which is really important: we have an amazing customer. They listen to us. And if you don't have an amazing customer, or you are your own customer, then you have to listen to them more, and they can become an amazing customer. You have to listen to their needs. And if they're not amazing, well, that makes everything hard.

In a Symfony project, there are some really important things. Prioritize upgrades; upgrade as soon as ever possible, even for minor versions. Fix deprecation warnings. Refactor often; it's not optional. When new components come along, like Symfony Messenger, try to use them. Try to replace bespoke parts of your application with components — a solution that everyone in the community can help you maintain. Also, contribute to open source. If you have something like that, put it in the component, give back, have some control over your tools, and utilize the community. Did I mention that you really need a lot of tests? And an amazing customer.

So, some final words. It's okay to start small and refactor later; in fact, it's preferred. Code for what you need now; refactor when the needs change. We had a lot of cases where we made really complicated code anticipating our needs, and that messed everything up. Refactor later; code for what you need now. Write dev docs. You write really good consumer docs, and your consumer is happy and your customer is happy.
But then you look at the code you wrote three years ago, and you have no clue what's going on. So really write documentation, not only for the people that work with you, but for yourself. You're not going to remember everything you did over five years in a huge code base. And write down the decisions you make, especially architecture decisions. If you don't, you're all going to code in different ways, and someone's going to comment on your merge request, "this is wrong", and you will say, "no, this is not wrong" — and both of you are correct. How do you deal with that? You need to write the documentation, especially for things like how you do things, the architecture. As I said, refactor often, and use defensive programming, because what can go wrong will go wrong. And when there's so much code involved, changing one thing can break something else if you're not defensive enough about it. Work as a team, and work on communication at least as much as you work on code. If you do all these things, then messy data does not have to mean messy code. Thank you.