OK, our next speaker is Philippe Waroquiers, a Valgrind developer, and he works for Eurocontrol, where he is in charge of 2.5 million lines of Ada code. 2.3 million lines. 2.3 million lines, sorry. Welcome. You have reduced it a little bit.

OK, so today I will talk about performance tracking of a big Ada application, from development to operations. In fact, what I will talk about is not very specific to Ada. In fact, there was even zero slides in my presentation containing a line of Ada code, so I was a little bit afraid that the organizer of the Ada devroom would kill me after the presentation, so I added one slide. In fact, I was joking when I said they would kill me after the presentation, because Jean-Pierre, the organizer of the Ada devroom, in fact works for me, because we pay him to support AdaControl, the tool we use to verify our Ada coding rules. And the second organizer of the Ada devroom is currently working in my team, so they have reasons not to kill me; but of course, they may also have reasons to kill me, precisely because they work for me. Voilà. Let's switch to the next slide.

What are the objectives of performance tracking? One objective is to evaluate the major resources needed by new functionalities. Another objective is to verify the estimated resource budget, CPU, memory, of what you develop. We also want to ensure that the new release will cope with the current or expected new load, and we want to avoid performance degradation during development. For example, imagine that we have a team of 20 developers working six months on a new release; that's about the size of the team working on this application. And let's imagine that each developer integrates X changes per month, and that one change out of X degrades performance by one percent. Then, optimistically, after six months we have a new release which is 2.2 times slower: we start from 100% and multiply by 1.01 for every degrading change, roughly 1.01 to the power of 80. The pessimistic view is that the new release is 3.3 times slower: we start from 100% times 1.01 to the power of 120. So clearly we have to do something, and we cannot wait for the end of the release to check the performance and see where the degradation is coming from. So the objective is to do a daily tracking of the performance during development. And this development performance tracking objective is relatively demanding, because we want to reliably detect performance differences of one percent or less.

So, a little bit of explanation about Eurocontrol, about where I'm working and what we are doing. Eurocontrol is the European organisation for the safety of air navigation. It's an international organisation with 41 member states; I'd like to draw your attention to the fact that it is 41 and not 40 since last midnight. We have several sites, directorates, et cetera. As activities, we have operations, handling the management of the traffic in Europe; we do concept development; we do some Europe-wide project implementation and support; et cetera. More info on our website. Where I'm working is the Directorate of Network Management. It has nothing to do with network routers, cables and so on: it manages the network which is used by the flights to fly.
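Spelled out (this is a reconstruction of the slide's arithmetic; the exponents 80 and 120 are inferred from the 2.2 and 3.3 factors quoted above):

   optimistic:  1.01^80  ≈ 2.2    (about 80 one-percent regressions slip in)
   pessimistic: 1.01^120 ≈ 3.3    (20 developers × 6 months = 120 regressions)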
So we have routes, we have points, we have air routes. This is a network, and that's what we manage. We develop and operate the air traffic management network. So we have different operational phases. We have strategic: we prepare the network, the restructuring of the network and so on, months or years in advance. Pre-tactical: a few days in advance. Tactical: today, tomorrow. Post-operations: we have to analyze what has happened and maybe improve for the future. So we are doing airspace and route data, flight plan processing for the whole of Europe, and flow and capacity management. If you have an airspace control center, and too many flights are planned to fly through this airspace control center, the controllers might be overloaded, which is not good for your safety. And the idea is that our systems will detect this load and then provide mechanisms to balance the capacity and the load.

So, network management: we have two mission- and safety-critical systems. These are IFPS, flight plan processing, and ETFMS, flow and capacity management, and my team is developing these two applications, which is quite a lot of code in Ada. Ada is by far the main language used: the core of the system is something like 99% Ada.

Some more words about IFPS and ETFMS. These are big applications: the core is 2.3 million lines of code, and around it we have shell scripts for the build infrastructure and for the monitoring, we have a lot of test code, etc. In terms of capacity, let's speak about a big day of ETFMS, the flow management system. In Europe, at peak, we have more than 37,000 flights per day to handle, and around 8.6 million radar positions, and these are planned to increase to 18 million in one year, because we will add new sources of positions, namely the ADS-B data provided by new systems. Then we have the external users, so the aircraft operators, Air France, British Airways and so on, the airports, the air traffic control centers: they all query our systems in order to get data, and this results in about 3.3 million queries per day at peak. And we are publishing changes on the flight data through publish/subscribe mechanisms such as AMQP, with more than 3.5 million messages per day to publish.

So what hardware do we run on? The online processing is done on a 28-core Linux server; so it is not a small server, but it is by far not a huge setup. We have some workstations that are running a graphical user interface for our internal flow controllers, and on these workstations, because they are quite powerful, we are also doing some batch processing and some background jobs. So we have many heavy queries, and we have complex algorithms that are called a lot. For example, we have queries such as counts, or 'give me the flight list of all the flights that are traversing France between 10 o'clock and 20 o'clock'. We have algorithms like lateral route prediction or route proposal optimization, and vertical trajectory calculation; I'll show some drawings afterwards to give an idea of the complexity. And a lot of other things you can imagine: in 2.3 million lines of code, we do a lot.

So this is a graph of a flight departing somewhere in Turkey and arriving somewhere outside of the European Union, and you see what happens when we have to calculate a trajectory. For example, these orange things that you see here are radar positions, and each time the system receives a radar position, it has to detect whether it is far or not from the planned trajectory.
And if it deviates a lot, it has to recompute the trajectory. So this is just computing one route, or several routes if the aircraft operator changes the flight plan for a flight. But what we also provide as a service is a route optimization service: they can call our system and say, please find me a good route. And then we have to search: maybe this way, maybe this way, maybe this way. Of course, basically, it's a shortest path algorithm, Dijkstra. But we can't use purely Dijkstra; it's a lot more complex than that, because there are constraints which cannot be modeled in a simple graph. And so even searching, let's say, the N shortest routes is a lot more complex than the typical similar problem.

This is the same flight that we have seen, but in a vertical view. We depart here and we land there, and we see that we have plenty of constraints, like forbidden airspaces; we have some levels that have to be respected; we have blocks of airspace that can be traversed or not, depending on the direction and other conditions. So you see that the algorithms that we have to write are quite complex. And for this algorithm, for example, the algorithm that searches where the flight can fly, there are several hundreds of millions of calculations to do per day, which are done either on the central server or in batch or background jobs on workstations.

So, performance needs and scalability. As I have indicated, we have a lot of users and a lot of queries, and so we need horizontal scalability. This is about the operational configuration of our main server; we have a lot of instances of this software, because we have, for example, an instance of the same system which is doing the prediction for the next week, and we have standalone systems that are used to study what happens in several months. Here I'm speaking about the operational system. So on our 28-core system, we have ten high-priority server processes that are handling the critical input: the flight plans, the radar positions, the external users' queries. We have nine lower-priority server processes, having four threads each, which are handling lower-priority queries, such as 'find me a better route for flight Air France 123'. And we have up to about 20 processes running on workstations which are executing the batch jobs or background queries, for example 'search a better route for all flights of aircraft operator British Airways departing in the next three hours'. And that, of course, is quite a high load.

We also need vertical scalability. This is a little bit different from systems where you have a lot of users asking a lot of queries and you can distribute them; that is what we do, but we also have some functionalities, like simulation, where one single user needs a lot of power. Because we have our internal flow controllers, and they sometimes have to take heavy actions: if an airport calls, or an airspace control center calls, and says 'we have a technical problem', then we have to take maybe 1,000 flights and find how to best handle these 1,000 flights. All these actions will be done on the complete set of flights in the system, and so we must provide to our flow controllers a functionality which goes fast for a lot of changes. So we need vertical scalability, something very fast, for example to evaluate heavy actions such as: we close an airspace, we close a country, and we have to reroute or delay the whole traffic impacted.
So, to give an example: starting a simulation implies cloning the whole traffic from the server to the workstation; it's a very fat client. And we need to recreate the in-memory indexes which are needed to execute all these algorithms and so on; that's about 20 million in-memory indexes that we have to recreate. In the release we are busy developing, we have spent quite some time optimizing the simulation start: we are now starting a simulation in less than four seconds, including bringing the data from the server. And this is using multi-threading: we have one task which receives the flight data stream from the server, one task which creates the flight data structures once the data is decoded, and six tasks which are recreating the indexes.

OK, so what we have seen now is that we have high performance requirements and we can't degrade, and so we have to track performance during development. One thing we can use for that is performance unit tests. Performance unit tests are useful to measure things such as basic data structures: hash tables, binary trees and so on. For example, here you see that we have a performance unit test which is checking the speed of an insert in a balanced binary tree, and with this we can double-check that, for example, the N log N behavior that you expect is effectively respected by your implementation.

We can also use performance unit tests to check the performance of low-level primitives such as pthread mutexes, Ada protected objects, etc. So this is a performance unit test which is verifying various things: it checks the low-level pthread calls that are available on Linux, and we can compare them with a protected object, the higher-level Ada concept for this kind of thing. We also have some timings here in this performance unit test for clock_gettime: with CLOCK_MONOTONIC, about 40 nanoseconds; but for the thread CPU time, about 400 nanoseconds. It is interesting to remember these figures, and the difference between the two, because I will come back to them a little bit later.

We can also use performance unit tests to evaluate, measure and be sure that we have the required performance for low-level libraries, for example malloc. In Ada we don't use malloc directly, we use the Ada language allocation and deallocation, but at least with GNAT this is based on calls to the underlying malloc library.

So performance unit tests have a lot of advantages: they are usually small, they are usually fast, and they are usually reproducible and precise. Remember our one percent objective: if there is a degradation, we want to detect it, and we want to detect it with a one percent precision. Now, we have some pitfalls with performance unit tests, and I'll describe a real-life example with malloc. We developed a malloc performance unit test to compare the glibc malloc with tcmalloc.
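To give an idea of the shape of such a test, here is a minimal sketch in Ada (not the actual Eurocontrol test; all names are invented) where a few tasks hammer the allocator in parallel. With GNAT, each allocation/deallocation pair ends up in the underlying malloc/free:

   with Ada.Text_IO;   use Ada.Text_IO;
   with Ada.Real_Time; use Ada.Real_Time;
   with Ada.Unchecked_Deallocation;

   procedure Malloc_Perf_Test is
      Allocs_Per_Task : constant := 1_000_000;
      Num_Tasks       : constant := 6;   --  like the six indexer tasks

      type Block is array (1 .. 32) of Integer;
      type Block_Ptr is access Block;
      procedure Free is new Ada.Unchecked_Deallocation (Block, Block_Ptr);

      task type Indexer;
      task body Indexer is
         Batch : array (1 .. 1_000) of Block_Ptr;
      begin
         --  A million allocations and frees, in batches so that many
         --  blocks are alive at the same time; as the talk shows, such
         --  choices can completely change the conclusions.
         for Round in 1 .. Allocs_Per_Task / Batch'Length loop
            for I in Batch'Range loop
               Batch (I) := new Block;
            end loop;
            for I in Batch'Range loop
               Free (Batch (I));
            end loop;
         end loop;
      end Indexer;

      Start : constant Time := Clock;
   begin
      declare
         Workers : array (1 .. Num_Tasks) of Indexer;
      begin
         null;   --  leaving this block waits for all workers to finish
      end;
      Put_Line ("Elapsed:" & Duration'Image (To_Duration (Clock - Start)));
   end Malloc_Perf_Test;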
So, seven years ago, we switched from the glibc malloc to tcmalloc, because with tcmalloc we had less fragmentation and it was faster. All fine, all good. But when we parallelized the simulation start, where we had, for example, to recreate these 20 million in-memory indexes, we saw some non-understandable 25 percent variations: sometimes the simulation start was taking 4 seconds, and sometimes it was taking 5 seconds. And we saw that the performance difference was varying depending on linking a little bit more or a little bit less code into the executable, even though this code was never called; minimal changes to the size of the executable were causing differences. So we said, let's analyze where this is coming from, and we started to analyze this with Valgrind Callgrind, to really see in detail the instructions executed, and we saw no difference. We then used the Linux perf tool to analyze the behavior, not under the Valgrind Callgrind simulator, but the real thing. And perf showed that the tcmalloc slow path was effectively called a lot more when we had maybe 10 bytes more or 100 bytes less of executable code that was never called. We couldn't understand this mystery: we saw that it was more often in the slow path, but we couldn't determine why.

So we said, that's easy, we will re-measure the malloc library: we made a malloc performance unit test, with tasks simulating our indexer tasks, each doing a million mallocs and then a million frees. What have we seen? We saw that glibc was slower, but had consistent performance with this unit test. And jemalloc, with this performance unit test, was significantly faster than tcmalloc. We were really happy. But when we used jemalloc with the real code, the real simulation start with the complete system, it was slower with jemalloc. So, what was the conclusion? More work needed on the unit test, so that it better simulates what we really do.

So we continued to work on the unit test. After improving it to better reflect the simulation start work, what have we seen? tcmalloc was slower with many threads, but became faster when the unit test was doing N loops simulating 'start simulation, stop simulation'. So if you run the unit test doing one simulation start, tcmalloc is slower than jemalloc; but when you do start, stop, start, stop, start, stop, then tcmalloc is faster, and jemalloc is not OK anymore. We also observed that doing the million frees in the main task was slower, and in fact, when we stop a simulation, it is the main task that is doing the frees. Also, the unit test does not evaluate fragmentation. And I have not listed all the mysteries that we have seen with this unit test. But still, based on what we have measured, we obtained a very clear conclusion from this unit test about what to do about malloc: the conclusion is that we cannot conclude from the malloc performance unit test. So currently we have decided to keep tcmalloc, and we will re-evaluate with the newer glibc on Red Hat 8; we are currently on Red Hat 7, and on Red Hat 7 the glibc is quite old.

OK, so, pitfalls of performance unit tests. As we have seen, it is difficult to have a performance unit test which is representative of the real load: for malloc, we obtained no conclusion. And the pthread mutex timings that we have seen: that was a very simple measurement, in fact we measured without contention. But what should we do? Maybe we should also measure with contention, but then with what type of contention, and so on? If you want a unit test representative of the real load, what kind of contention do you have in the real load?
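As an illustration of how much this matters, here is a hedged sketch (invented names, not our real test) of what measuring with and without contention could look like for an Ada protected object; the same loop run by one task and then by several gives very different per-operation costs:

   with Ada.Text_IO;   use Ada.Text_IO;
   with Ada.Real_Time; use Ada.Real_Time;

   procedure Contention_Test is
      Iterations : constant := 1_000_000;

      protected Counter is
         procedure Increment;
      private
         Count : Natural := 0;
      end Counter;

      protected body Counter is
         procedure Increment is
         begin
            Count := Count + 1;
         end Increment;
      end Counter;

      procedure Run (Num_Tasks : Positive) is
         Start : constant Time := Clock;
      begin
         declare
            task type Worker;
            task body Worker is
            begin
               for I in 1 .. Iterations loop
                  Counter.Increment;   --  the operation under test
               end loop;
            end Worker;
            Workers : array (1 .. Num_Tasks) of Worker;
         begin
            null;   --  wait here for all the workers
         end;
         Put_Line (Positive'Image (Num_Tasks) & " task(s):"
                   & Duration'Image (To_Duration (Clock - Start)));
      end Run;
   begin
      Run (1);   --  uncontended, like the simple measurement in the talk
      Run (4);   --  contended: a different, and shifting, answer
   end Contention_Test;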
And even if you measure where you have contention in the real load, and you simulate this in your performance unit test, that might be representative for your current release; but if you change your code, the pattern of contention might change. So performance unit tests are nice, small, fast, but it's difficult to make them representative. Even for hash tables, binary trees and so on, the real behavior depends on the key types, on the hash function, on the compare function, on the distribution of the key values, etc. And if it is already difficult to have representative performance unit tests for such low-level algorithms, what about performance unit tests for more complex algorithms? For example, how do you make a representative trajectory calculation performance unit test? You remember the picture at the beginning: how do you do a performance unit test for that? With which data? How many airports, routes, airspaces? With what flights? A lot of short-haul, a lot of long-haul, flying where? You might have a lot of variation in the data. So, conclusion on performance unit tests: they are useful, somewhat, but they are largely insufficient.

And so the solution is to complement them, in fact to do most of the performance tracking not with performance unit tests, but by measuring and tracking performance with the full system and real data. So we want to replay one day of operational data. Replaying operational data: the operational system ETFMS records all its external input. It records the messages that are modifying the state of the system: the flight plans, the radar positions, etc. It records the query messages; the queries arrive through front-end systems, but it is really ETFMS recording them, for example a query such as 'flight list entering France between 10 and 12', which might be asked, for example, by the French control center. And ETFMS has a replay tool which can replay this input data. Of course, it means that the new release we are preparing must be able to replay the somewhat recent old input formats, so this brings a little bit of constraint on the development.

And with this we have some difficulties. We need several days of input to replay one day, because you can have flights that are filed several days in advance: a flight plan might have been filed two days in advance, and so we have to replay more days. Now, what is the elapsed time that we need to replay several days of operational data? This is a problem, of course. What is the hardware needed to replay the full operational load? We have seen that we have, let's say, a medium-size server plus workstations; if every developer wanting to evaluate the performance impact of a change needs that, we would have to buy quite a lot of hardware. So that's a problem with replaying the full operational data. And also, remember our objective of one percent or better: how do we get a sufficiently deterministic replay in a multi-process, multi-threaded system? This is quite a challenge, and I will describe it a little bit later. Remember: one percent.

So, volume of data to replay. Replaying the full operational input is too heavy, and so the compromise is to replay all the data that changes the state of the system, flight plans, radar data, etc., but to replay only a subset of the query load: we replay only one hour of the query load of the real system, and even within this, we replay only a subset of the background and batch jobs. We also have the problem that replaying in real time is too slow: as I've said, we have to replay several days to have the result of one day, and if we replay in real time, it means several elapsed days.
And if you want to do daily tracking of the performance, you will have a lot of replays running in parallel, a little bit everywhere. So we have to try to reduce the time needed to replay. But we can't just take all the input and replay it instantaneously, as fast as possible, without doing something, because an input must be replayed at the time it was received, or not far from it. If you have a flight plan at 9 o'clock, and radar data at 9:30 and more at 9:35, you can't replay the radar data together with the arrival of the flight plan; you have to wait until it is the correct time to process the radar data. And many actions also happen on timer events. So what we need is an accelerated, fast-time replay mode. What is it? The replay tool controls the clock value, and the clock value jumps over the time periods with no input and no events. So we are processing the data at the correct simulated clock, but when there is nothing to do, the clock jumps instead of doing nothing.

With the fast-time mode, with all these limitations on the data, and with this time control, we can now replay one day, plus the data needed before that day, in about 13 hours on a fast Linux workstation. We don't need a big server anymore; we can use a workstation, which is not so huge.

Still, we have to pay attention to the sources of non-deterministic results. One source of non-deterministic results is the network, NFS and so on: if the database, the files or whatever are on the network, then the replay is not the only user of the network, and that can introduce quite a lot of variation, much bigger than the one percent we want to detect. The solution for this is very easy: we replay on isolated workstations, which have their local file system, their local database and so on. Another source of non-deterministic results is the system administrators. You might say, how can a system administrator cause non-deterministic results in a replay? Well, they run jobs to audit the systems, to see what's happening, whatever, and if these jobs suddenly intervene during your replay, it changes the performance. So the solution is to discuss with the system administrators, so that we can disable their jobs on the replay workstations; that solution was not too difficult. Another source of non-determinism is the security officers: they absolutely want to be sure that there is no virus, no executable that is not allowed, and so they also want to run audit jobs and scan jobs and so on. Here also the solution is to discuss with them, but it was a little bit more difficult.

We also saw non-deterministic results caused by the input/output past history. You start a replay from scratch, but because you use files and a database that were already used previously, even if you remove all the files, even if you clear the database, you still do not obtain exactly the same results. This was annoying us, with our one percent objective: removing the files and clearing the database was not good enough, and so before each replay we completely recreate the file system and the database. And even with that it was not OK, because of the operating system usage history itself: if you do the same thing twice on Linux, one run after the other, the second time the memory state in the kernel or whatever might have changed. So what we do is: we recreate the file system and the database, and we reboot the workstation, before each replay.
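To come back for a second to the fast-time mode described a moment ago: reduced to a toy sketch (invented names and event times, nothing from the real replay tool), the clock-jumping idea is simply this:

   with Ada.Text_IO; use Ada.Text_IO;

   procedure Fast_Time_Demo is
      --  Simulated clock, in seconds since midnight.
      Now : Duration := 0.0;

      --  Times of the next recorded inputs or timer events, in order:
      Events : constant array (1 .. 3) of Duration :=
        (32_400.0,    --  the flight plan at 9:00
         34_200.0,    --  radar data at 9:30
         34_500.0);   --  radar data at 9:35
   begin
      for E of Events loop
         if E > Now then
            Now := E;   --  nothing to do until E: jump, do not wait
         end if;
         Put_Line ("Processing event at simulated time"
                   & Duration'Image (Now));
      end loop;
   end Fast_Time_Demo;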
With all this, we still have some remaining sources of non-deterministic results. For example, the time control tool serializes most of the input processing; most, but not all, because if we would serialize everything, it would slow the replay down a lot. For example, the radar positions that arrived within the same second are not replayed one by one: we have several processes, and these processes will handle the radar data that arrived in the same second in parallel, and that can introduce some non-determinism. The replays are done on identical workstations, same hardware, and as I've said, with the file system and database recreated and the workstation rebooted. Still, despite the same hardware, the same operating system, restarting from scratch and so on, we are observing differences between workstations. So all workstations are not born equal; that's our conclusion. Small differences in the clock of the CPU or whatever, but we do see some impact.

With all these caveats and limitations and everything we did, we have finally achieved a reasonably deterministic replay performance, with three levels of results. We have a global tracking, where we track the elapsed, user and system CPU for a replay of the complete system. We have a per-process tracking: user and system CPU, and some per-process stats recording. And we have a detailed tracking: a run of the replay under Valgrind Callgrind. This one we run on the side, because it takes quite a lot of time: it's very slow, it takes 26 hours, but it is very precise.

So this is a drawing of the baselines that we are building. We do continuous integration: every day developers are integrating, we are building and replaying, and this is the release tracking of the global performance of the release we are developing, the one where we have done this simulation start optimization. In green you see the total user CPU of all the processes of the replay; in blue, the total system CPU of all the processes; and in red, the elapsed time. What you can see is that we have had a very gradual improvement of the performance during the development; we have managed to track that the optimizations we were doing were effectively optimizing, and we were able to see performance degradations, at a global level. By the way, this dip here is not that we suddenly got a quantum computer that did the replay in zero point something seconds; it is because we had a problem during that replay. I'll discuss this part of the graph in more detail later; remember the pattern a little bit.

So that's the global tracking. We also have a per-process tracking, where we record the user and system CPU, the heap status, how much was used and freed, tcmalloc details, because we are using tcmalloc, and so on. That's the kind of thing that we record for each process. Here we see four processes which are processing flights; there is the one which is processing flight lists and counts; and so on. And for each of these processes we are recording data that allows us to understand, if we see a difference on the big global graph, which process has increased.

And then the third level is when we have to analyze what has happened inside a process: there we have the run of the replay under Valgrind Callgrind, and we use the excellent KCachegrind tool on the Callgrind results: the recorded call stacks, who has spent what; we can see the functions which have consumed the most; we can see the jumps, which condition was often true or false; and we can go down to the assembly language level. So this is the main tool we are using when we want to optimize some specific algorithms, or when we see a degradation. By the way, I am also the organizer of the debugging tools devroom, and
tomorrow there is at least one talk about Valgrind there, so if you are interested... That was the advertisement.

So, another interesting thing to discuss is what we measure. We want to avoid performance degradation, but we also want to see, when we believe we are doing an optimization, whether it really is an optimization. And here is a real-life example of what we believed was a missed optimization, something that we could optimize, that we tried to optimize, and that then turned out to be a failed optimization. This is the slide with a little bit of Ada code; I promised Jean-Pierre we would have some lines of Ada code, and here they are. What we see here is an Ada task, and it has two rendezvous. This task is maintaining the automatic loading of data, like when the airports are changing; it synchronizes the access to and the loading of this data. While someone is accessing this data, it must be locked, and when it is not locked, the task will load new data into memory. So this task, among other things, has to maintain the number of locks. It has a rendezvous called Unlock: when a client task says 'I don't need the lock anymore', it calls Unlock, and the task decrements the number of locks. And there is also a rendezvous called Get Lock Count, which returns the current lock count to the client. What is that used for? When a process is activated to handle something like a flight plan, it is not supposed to hold a lock anymore, and so at the top level of the processing we check that there are no locks, by calling this Get Lock Count.

OK. A rendezvous with an Ada task is something relatively costly, because it is a task calling, between quotes, another task, so there are some system calls to synchronize, which is relatively expensive. So the optimization idea was to decrease the cost by using lower-level synchronization based on a volatile variable: keep the number of locks not as a task-maintained variable, but as a global volatile variable, with a function that simply returns its value; and inside the task, we now accept Unlock and do the decrement in the body of the accept. Why do we have to do it in the body, whereas above, in the original code, the decrement was done outside of the body? Imagine the task which is processing my flight plan: it has finished, it releases the last lock, and just after that it says 'now I will check that there are zero locks'. With this solution, we must be absolutely sure that, when the task that called Unlock reads the lock count, the decrement has been done; so we must do it in the body, because otherwise we could possibly have a race condition in our check. So this should be faster: we have the same number of Unlock rendezvous, but much faster get-lock-counts, since reading a volatile variable is much cheaper than a rendezvous. This was supposed to be much, much faster. That was the idea.

In reality, we detected with the performance tracking that this was a pessimization. The Ada compiler is quite efficient and helps us build high-level synchronization algorithms in an efficient way: for example, if you have a no-body rendezvous, the compiler will optimize it, and it is a lot less costly to have a no-body rendezvous than to have a body in the Unlock accept. And so, with the Unlock rendezvous becoming heavier, the whole thing was in fact a pessimization; and this was detected.
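Since the slide itself is not reproduced here, this hedged sketch (invented names; the Lock entry and the actual data loading are elided) reconstructs the shape of the task as described above, original design first, with the failed variant summarized in the trailing comment:

   procedure Lock_Demo is
      --  Original design: the lock count lives inside the task, so no
      --  client can ever observe a stale value.
      task Data_Loader is
         entry Unlock;
         entry Get_Lock_Count (Count : out Natural);
      end Data_Loader;

      task body Data_Loader is
         Locks : Natural := 0;
      begin
         loop
            select
               accept Unlock;        --  no-body accept: the compiler can
               Locks := Locks - 1;   --  make this cheap; the decrement
                                     --  safely happens after the accept
            or
               accept Get_Lock_Count (Count : out Natural) do
                  Count := Locks;    --  copied out inside the rendezvous
               end Get_Lock_Count;
            or
               terminate;
            end select;
         end loop;
      end Data_Loader;
   begin
      null;   --  clients call Data_Loader.Unlock / Get_Lock_Count
   end Lock_Demo;

   --  The attempted optimization made Locks a global volatile variable
   --  read by a plain function, but then the decrement had to move
   --  inside an 'accept Unlock do ... end' body (otherwise a client
   --  releasing its last lock could still read a stale, non-zero
   --  count).  An accept with a body is heavier than a no-body accept,
   --  so the rendezvous that remained got slower: a net pessimization.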
This is an extract of the big graph from the beginning. Here it was without the optimization; here we had the little problem I mentioned; and here it was with the optimization: the system time slightly increased. So at this point we started to ask ourselves what we had done during the release, and after a while we said, we'd better roll it back; and here we have rolled back the optimization, so that we are back at the original system time performance.

So, what can we conclude from that? You need to track performance, because otherwise you can have problems. You have to track the performance of your optimizations, because otherwise they might become pessimizations. And a third thing to remember is that you can believe you are maybe smarter than the compiler: it's difficult. Difficult, but not impossible, because here, you see, we then did the optimization correctly, not using volatile but using atomic, with part of the decrement operation done outside of the task, and it all went OK. So: we detect that we can optimize something; we optimize; it is a pessimization; we ask ourselves why we are so stupid; we understand it; and after that, we become more intelligent. That's quite a positive evolution.

So, performance tracking, a summary. We have good development performance tracking, using a mix of performance unit tests and replays of operational data, as deterministic as possible. The replay day, we have to change from time to time, because the usage pattern changes: we might have new companies appearing, airspace being restructured, new routes, old routes and so on. So, when there are too many new usage patterns, we change the reference day that we compare against; that happens relatively frequently. We use various tools, Valgrind Callgrind, Cachegrind, perf, top and so on, but you have to take care about the blind spots of your tools. For example Valgrind Callgrind and Cachegrind: it's very easy to use, and it's really the main tool we use to optimize, but it is very slow, and it serializes multi-threaded applications, so for contention it measures something wrong; for example, you never have any contention when you run something under Valgrind Callgrind. And it had limited system call measurement: it was measuring the number of system calls and the elapsed time in system calls, but not the system CPU, and we have seen that the system CPU is something we have to measure. As I also happen to be a Valgrind developer, I changed Valgrind so that it now also measures the system CPU spent in system calls. So we need to have global indicators, and to zoom on the details where needed. And as I have said, there are improvements in the pipeline: the next version of Valgrind Callgrind will measure system CPU, and we are also working on tooling to help visualize the differences, because currently comparing the KCachegrind graphs of two runs is a little bit difficult.

So, it looks wonderful: we have a nice tracking system, nice graphs, we can measure from the global system down to the details of a process. Can we be happy with that? Is that good enough to go operational? What about this: you are on call, or me, I'm on call from time to time, and you are woken up on Saturday at 4 o'clock because the users are complaining that the system is slow. Well, I can't reply 'I will replay the day tomorrow, Sunday, and on Monday I'll explain to you why we had big problems on Saturday.' This is not acceptable. We have other questions, like: is the reference day that we replay representative of what happens on ops? What about the evolution of the ops workload, and capacity planning? For example, suppose we believe that our users will do a lot more queries to optimize their routes, because we have improved, or because we have new users that say,
'we want to use your service'. Will the system cope, or do we have to change the hardware setup, upgrade the hardware, put in more hardware? For this we need something else than the replay: for example, what additional hardware capacity is needed to support ever more queries of that specific type? The solution for this is to have permanently activated response time monitoring and statistics. Here I'm speaking about the tactical response time: tactical, because it is mostly useful during tactical operations, but of course we also use it during replays. The idea is that the application contains measurement code at critical points, such as every remote procedure call invocation, begin and end. So when one process invokes something in another process, it will measure how long it took; and on the side of the process that executes the remote procedure call, we will also measure how long it took to process the call and send back the reply. We measure the database access times, begin and end of the database accesses, and significant algorithms, begin and end, such as 'calculate a vertical trajectory', and so on. The measurements are typically nested: for example, inside an RPC execution begin/end, we will have other begin/ends for the sub-operations used by that RPC.

So this tactical response time package maintains a circular buffer with the last N measurements, and for each begin/end measurement it records the elapsed time, the thread CPU time, and optionally the full process CPU time. You remember, at the beginning, I gave figures for clock_gettime: the elapsed time is measured with clock_gettime CLOCK_MONOTONIC, and the thread CPU time with clock_gettime on the thread CPU clock, which is relatively heavy, relatively costly. The first one is in fact a virtual system call, implemented on Linux in the vDSO; the second one really switches to the kernel to get the data. So, if there are kernel developers in the room: if you could improve clock_gettime for the thread CPU clock, that would be really nice.

The package also maintains statistics: how many measurements of which kind were done, a histogram of the elapsed and thread CPU times, and details about the N worst cases. This gives a reasonable overhead: about 1.7% of the CPU is spent measuring what the application is doing, rather than doing real work. And for that price, we largely prefer to have it always activated: on ops, on our test and replay systems and so on, because this is critical for us to understand what is happening in our system.

This is an example. We have online access to this data structure, so this is a screen which allows us to look interactively, if we are called during the night, at what is happening. You see here a kind of tree of actions: we have received a flight plan, we had a flight deviation, then we did several phases of trajectory calculation, we read the flight from the database, we calculated some other concepts, finally we distributed some data, and at the end we committed the data in the database. So we can track the details of what happened in the last N measurements. And these are the statistics that are maintained: we maintain how many measurements we had, the total time spent, elapsed and thread CPU, the average, and a distribution, and the worst case; if something is really abnormal, it will appear in this data structure.

So this tactical response time, as I have indicated, is used from development to operations. In development, it helps to understand how the system works, to see the message exchanges between processes, the algorithms executed.
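To make the begin/end idea concrete, here is a minimal hedged sketch (invented names; the real package keeps circular buffers and statistics) of one measurement point in Ada, using the two clocks mentioned above; with GNAT on Linux, Ada.Real_Time.Clock maps to the cheap monotonic clock and Ada.Execution_Time.Clock to the costlier per-thread CPU clock:

   with Ada.Text_IO;        use Ada.Text_IO;
   with Ada.Real_Time;      use Ada.Real_Time;
   with Ada.Execution_Time; use Ada.Execution_Time;

   procedure Measure_Demo is
      Start_Elapsed : constant Ada.Real_Time.Time := Ada.Real_Time.Clock;
      Start_CPU     : constant CPU_Time := Ada.Execution_Time.Clock;
      Dummy         : Long_Float := 0.0;
   begin
      for I in 1 .. 10_000_000 loop   --  the work being measured
         Dummy := Dummy + Long_Float (I);
      end loop;
      Put_Line ("Elapsed:" & Duration'Image
                  (To_Duration (Ada.Real_Time.Clock - Start_Elapsed))
                & "  CPU:" & Duration'Image
                  (To_Duration (Ada.Execution_Time.Clock - Start_CPU)));
   end Measure_Demo;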
The statistics are used to analyze the performance of a replay; we can compare replays using this tactical response time data; we can measure the resource consumption of new functionalities, because the things that need to be recorded are being recorded; and so on. On ops, it is used for online investigation of performance problems, and for bug investigation. In our system, in our code, the policy is that exceptions are used for bugs, not for normal behavior, and so if we have an exception, we take a core dump. We can take a core dump without stopping the process: we take the core dump, we drop the input that created the problem, and we process the next message. And the core dump contains the full status of what the process was doing, including, in the tactical response time buffer, the last N measurements of what the process did, because possibly the bug was created by something that was done a little bit before. So with this we can see what the process was doing and what it recently did. We also use this for post-ops analysis and trend analysis, and as input for our capacity planning.

So, performance tracking of a big application, a summary. We have reasonably deterministic performance tracking during development; it allows us to detect performance regressions on a daily basis; we can verify that what we believe are optimizations have the desired effect; and it allows us to plan capacity upgrades for demand growth, new functionalities, etc. We are using a mix of various techniques and tools, such as performance unit tests, replay of real data, and application self-measurement. We have to take care to avoid blind spots, by using various tools: perf, Valgrind, Callgrind, top, strace; each of them will teach you something about your application. And the tooling is also used for other purposes: the replay tool, for example, is also an automatic testing tool, because if we can replay 50,000 flight plans, we can also inject 2 flight plans and verify that this one has this characteristic, that it has received this delay, for example. So the replay tool is also the test tool. It is also used by our users to analyze and optimize operational actions and procedures: if they did something and think 'maybe we should have done something else', they can replay the day, of course on offline systems, and then try other actions than what they did during operations. So, as I have indicated, your operational system needs to have performance tracking and statistics; this is not only for development. Voilà, that finishes my presentation.

Yes... yes, typically, yes. OK, so the question is: when we implement new features, do we inject some data in the replay in order to see how they behave? Yes: the idea is that during development, when we implement a new feature, we are of course developing tests for this new feature, which are run with the replay tool. And whenever we have relevant data that we can use, and for that we might have to discuss with our users, or we might have to create the data ourselves, we measure how the code that we have developed behaves. Now, it is very difficult to know what to do exactly, because the real usage depends a lot on the patterns that the real users will produce, and sometimes users are a big source of non-determinism.

Yes... most of the load, yes. OK, so the question is: I have said that it runs on one server, so how is this highly reliable? Yes: when it is working, it is enough to have one server, but of course we have several servers, we have clusters of servers, and if one server has a hardware problem, for example, we move the system to another hardware server. So operationally it is good enough to have, let's say, a medium-size Linux server, but we can move the application to another place.
Yes... no, we are just taking a second measurement. So the question was: how do we ensure precise measurements in a multi-process, multi-threaded system? First, the measurements that I showed at the beginning were performance unit tests; there we typically have a single executable, which is the performance unit test. In the real system, we just say: this process, sorry, this task, has consumed this much CPU to do this action, in that elapsed time. What we want to measure is what is really happening. Imagine, for example, that we have somewhere a kind of critical section, and two processes are accessing some data and locking this data. You might have one process waiting a long time, and that we will see, because it says: I have spent, let's say, 0.1 seconds waiting, and consumed only one millisecond of CPU. Ah: either there was not enough CPU and the machine was overloaded, or there is wait time. So we just measure what the kernel gives back via the clock_gettime system calls. Other questions? No? Then thank you. And thanks to all the speakers.