This session is about the way we do asynchronous processing at the WMF, basically. Let me start with: what is asynchronous processing for us? Basically, it's everything that happens after some event occurs on the wikis, some action is taken on the wikis, that is not going to be processed before we give a response to the user. Things like: you upload an image, you upload a video, and there are a few ways in which we process these changes. These are the different mechanisms. The most historic one is the job queue, of course, which is part of MediaWiki. Basically, any part of MediaWiki can insert a job into a queue, and then it gets processed asynchronously by a different cluster; we're talking about the WMF cluster here, but that's true for any MediaWiki installation. You can run the job queue in a synchronous mode so that things are run on the spot, but in general you can have them be asynchronous. Then, in some cases, we don't use the job queue, we just use cron jobs that run periodically on the production cluster and do some actions. A typical example is synchronizing Wikidata. Since we are moving away from having just MediaWiki, which used to be the whole stack, we now have services that live outside of MediaWiki. Some of those are stateful, and some of those need to react to events as well. So we created another tool, which is change propagation. Well, it's a combination of different services, but let's say it's change propagation. It has its own set of rules, coded in configuration, that make the software react to these changes and submit things to the different services. So it listens to events and then makes HTTP requests, broadly speaking. And finally, well, not finally, there are the HTCP purges. Whenever we change something on a wiki we need to propagate these changes across our caching layers, and this is done with the HTCP protocol (sorry, it's hard to pronounce for me). That is now mediated by the job queue: basically, whenever we have to purge something, a special type of job is sent into the job queue, and then the job runner sends this purge out as a multicast. And I'm sure there are other methods that I'm unaware of, but even counting just these, we can see that we have four different ways of propagating changes across the cluster. From my perspective, being an ops engineer, the main disadvantage of this is that I need to know how four different systems work, and how to debug these four different systems. Whenever something doesn't work as expected, you have to figure out which of these systems is having an issue. My point is, we are beyond the vision of just having MediaWiki alone serving pages to users. We need to evolve in some ways, but we also need to unify what's happening, or we are generating a ton of technical debt. That's why this session is in the track about managing technical debt.
When we introduced change propagation, I was very happy, because honestly I'm not very happy with how our job queue works nowadays. There are a few issues with the job queue, at least. The first one is scalability. For whoever is not aware, the way our job queue is built nowadays is that it uses Redis as a transport. If you think of Redis as a queue, usually you think of a specific kind of usage; that's not what we do, because of the way our job queue's features work, and we will get into that later. It poses some scalability problems. Whoever of you has access to Logstash can just go take a look at the number of "cannot connect to Redis server" errors that we have. It's a high amount of this kind of error, and it's proving to be a scalability issue. We had to intervene a lot of times during the last couple of years on the Redis infrastructure that we have in the main data center. Right now we have eight servers, each running at least four instances of Redis; it's a lot of machines for this job, and it's not always working as expected. Then there is the problem of durability, which is something that not everybody is aware of: Redis is an awesome piece of software, but it's not exactly the best option if you want your messages to be durable. A ton of things can happen while you're delivering a message: Redis crashes, or the machine crashes in a way that makes the Redis database unreadable, and it will just be scratched away and Redis will start fresh. So nowadays, when something is set as a job in the job queue, there is really no 100% guarantee that it will be delivered and worked upon. Also, we have a problem: we're moving to multiple data centers, and Redis has no such concept as multiple data center awareness. We did set up a multi data center replication mechanism, which is kind of working, mostly working, let's say, but it's complex, it's hard to operate, and it's really not at the level where we want it to be. It's also very hard to debug the job queue. Every time we have some problems with it, like the queue growing indefinitely, it usually takes a while, not by chance, because it usually involves manual intervention. A lot of swearing is what's happening. It's not easy to instrument the Redis part of the job queue and understand what's going on. We should make it better. And lastly, the way the MediaWiki job queue is built is that it sends messages that are basically serialized PHP. So whenever you want to use it to operate on things that are outside of MediaWiki proper, on services external to MediaWiki, it's not really obvious; I mean, it's possible to make it operate on other services, but it's not obvious how that should happen, and it's not properly suited for that. That's why we introduced change propagation, basically. So we have issues, we have multiple systems, and we need to do something about it; that's the main motivation for me to speak today. We need to decide together what course of action we want to take, because it's not really feasible to maintain all these different systems to do basically the same thing, which is: something happens on the wiki, somebody, or even some cron job, does something on the wikis, a bot, a human or a script does something on the wikis, and we want all of our infrastructure to be aware of these changes in an asynchronous way.
I want to discuss what's the best course of action from now on, because it's kind of challenging. I mean, the obvious thing to do would be to just throw away what we have today and use change propagation or something. One of the problems is that we have a strong requirement: MediaWiki should keep working by itself. If you're installing MediaWiki standalone, in a small installation, in a pure PHP installation, we want it to keep working; we don't want to break that workflow for small installations. So we still have the necessity to keep the old job queue working; we can't get away from that completely. We can maybe deprecate it progressively, but we will still need a pure PHP implementation that keeps working on small MediaWiki installations. This is a big requirement that's going to be important for the rest of our discussion. So, what's the model of the current job queue? Basically any part of MediaWiki, or of your MediaWiki installation, can submit jobs: any extension can submit jobs, any part of core can submit jobs by itself. This is very different from an event based model, where something happens on the wiki and then you react based on that. Here, any part of the code can say: ok, you know what, I want to insert a job that's going to do this, and you can do it. It has retry logic, meaning that if a job fails, or maybe not even fails but just isn't acknowledged (and we'll see why this is important), it will be retried until it succeeds or it meets a certain number of failures, and so on. It has some throttling mechanism and a back off mechanism, and I have to be honest, I still don't grasp all the complexity of how that works in detail, but it has some mechanism to avoid the thundering herd problem: for example, somebody edits a very common template, and that causes everything to go down because you're exhausting all the resources of a database. It also has a semantic for executing a job once; well, optimistically executing once, because with the number of issues that we have in connecting to Redis it can happen, and I know it happens, that sometimes the job queue is not able to acknowledge that it has executed a job correctly, it's not able to remove the job from the queue, correctly acknowledging the end of execution. So when we say "execute a job once", it's really "let's try to execute it once"; we don't have a strong guarantee at the moment. The job queue has dependencies between jobs, so a job can be dependent on another one. This is another important feature, one that any system that's going to supersede the job queue in some way will need to have. It also has the ability to do delayed execution: you submit a job and you say, ok, I want this job to happen n minutes in the future. I'm simplifying things, but that's for the sake of the presentation. And finally, very importantly, when you submit a job, you, as the programmer who writes the code that submits the job, are able to tell the system: I want deduplication.
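To make that model concrete, submitting a job from MediaWiki code looks roughly like the sketch below. The job type and parameters are invented for the example, but JobSpecification's removeDuplicates option and the jobReleaseTimestamp parameter are the actual hooks for deduplication and delayed execution, if I recall the API correctly:

```php
<?php
// Rough sketch: pushing a job from anywhere in MediaWiki code.
// 'exampleSearchIndex' is a made-up job type for illustration.
$title = Title::newFromText( 'Some_page' );

$job = new JobSpecification(
	'exampleSearchIndex', // handled by whatever Job class is registered for it
	[
		'pageId' => $title->getArticleID(),
		// delayed execution: run no earlier than 5 minutes from now
		'jobReleaseTimestamp' => time() + 300,
	],
	[
		// deduplication: drop this job if an identical one is already queued
		'removeDuplicates' => true,
	],
	$title
);

JobQueueGroup::singleton()->push( $job );
```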
What does this mean? In a simplistic way, let's say you submit a request to index a new edit of a page in CirrusSearch, for example. You can tell it: ok, there have been 10 edits in the last minute for this page, somebody is doing a new one, and I want to deduplicate all these events, because I'm going to submit just one to CirrusSearch. It's expensive to index new content; it hurts performance if you index the same thing too many times, because if you have 10 edits in a minute and none of these jobs has been executed yet, all these jobs will end up doing exactly the same thing, indexing exactly the same content, which is what we had at the end of that minute. So what you can do as a developer is tell the system: I want my jobs to be deduplicated. There are several ways, several modes of operation; I'm not getting into the details of this. But these are basically the things that are peculiar to the way our job queue works. So there is a difference in models between how the job queue works and how change propagation works. Change propagation is a proper event based system. The idea is that you have MediaWiki generating events, simple events, based on things that happen on the wiki, like an edit, a deletion of a page, you edit a template, or you upload an image. That will send an event to a system called EventBus, which is one of our services, which uses Kafka as a transport; and then on the other end of Kafka you have the change propagation software receiving these events, applying rules that are connected to these events, and based on these rules taking actions. Let's take the example of CirrusSearch again: if you wanted to use CirrusSearch via change propagation, an edit to a page would trigger change propagation to call the endpoint; it would probably call the MediaWiki API to get the content, and then submit that content to CirrusSearch itself. It's very different from the way our job queue works nowadays, because there you don't have a set of defined events that trigger a job; you can just add a job wherever you want in the code and it will be executed. Of course we can say that that is an event in itself, but it's not the way it works nowadays. Also, the new system works with very well defined events, while in the job queue model you can just submit any kind of PHP data structure to the system and have it work; it's serialized PHP, basically. So reconciling the two things seems challenging to me. I thought about it a little bit, and in particular, what we discussed on the ticket for this specific talk was: what are the features that we would probably need to add to change propagation to make it a viable substitute for the job queue in some way? There are a few things; I think Marko is even more aware of them than me, we discussed it a little bit. I would say that what we really don't have in change propagation as an option is a way to let whoever submits the job control the delayed execution and the deduplication of the job itself. Let's make a practical example: change propagation has a system to deduplicate transclusions, for example, but it's controlled by change propagation itself. It has its own rule that says: if you have a duplicate job for transclusions, you drop one of the two. While in the other system you have the coder, the developer, saying: I want this job to be deduplicated, and any other similar event should be dropped.
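For those who haven't seen the change propagation config, a rule is roughly a YAML stanza like the one below. This is an illustrative sketch only: the rule name, topic, domain and URI are invented, and the exact option names may differ from the real service configuration.

```yaml
# Invented example: re-index a page in a search service on every edit.
example_search_update:
  topic: mediawiki.revision-create        # Kafka topic carrying edit events
  match:
    meta:
      domain: en.wikipedia.org            # only react to events for this wiki
  exec:
    method: post
    uri: 'https://search.svc.example/index/{{message.page_title}}'
```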
The other problem is that change propagation is based on fixed schemas, which are kind of restrictive if you want, because they are exact in some ways; while the way the job queue works is that, in principle, you just throw some PHP blob into the queue and then execute it via code you wrote yourself. It's much more free form, the way the job queue works today. So reconciling the two things seems challenging in principle, and I thought of a few ways we can go.

One of the possibilities would be that we decide we just want to deprecate the use of the job queue on WMF wikis and move everything to change propagation. The problem is that, since we have the requirement that small wikis should keep working, we would probably need to provide some PHP processor that works when you don't have a Kafka installation, an EventBus and change propagation: MediaWiki would generate events and then treat them directly in PHP, some system that basically mimics what change propagation does, in PHP.

(Audience) But doesn't that kind of exist already, in a way? Small wikis can still submit job queue jobs, which will be processed on the next request, so we would just need the bridge.

That's if you keep the job queue system inside MediaWiki. What I'm describing is getting rid of the job queue in MediaWiki and just sending events. The big problem is that you would need a processor in PHP for small installations that takes in these events and processes them in PHP directly. Everything we have would basically move to this new system, so it's a proper deprecation of a big chunk of MediaWiki core, plus somehow adding to change propagation the ability to deal with the deduplication logic, and basically making change propagation drive the MediaWiki API. Why is this option attractive in some ways? Because I think the model that change propagation has, which is a true event based system, is logically cleaner. But it's a ton of work, and it would make you write things twice: you would have to code the logic in PHP on one side, and then you would have to code it in change propagation. I would love this option, honestly, but I don't think it's the most viable one.

(Audience) But change propagation is mainly config based, which means that if the PHP processor can read that config, you don't have to do it twice; you'd have to do it once, somehow.

Yes, but it's still a lot of things to undertake. So honestly, is it worth it? I think it would be worth it, but it's a lot of work and I'm not sure it's viable, honestly, not in the short term at least. And I would like to get a solution, or a direction, that's viable in the short to middle term; not something where we say: ok, when we can dedicate six months to this, we can get it done, and then we have to manage the whole transition for everybody else, the deprecation path and everything, for all the extensions.

There are other options, though. One is to treat the job queue somehow without modifying it directly: we can say that any job that we want to submit to the queue is an event. It's not properly an event, but let's treat it as if it were one, and instead of using the job queue transport that we have today, let's use EventBus and change propagation to act as a transport for the job queue. That would mean encapsulating the messages that come from the job queue, sending them as proper events to EventBus, and then having change propagation deliver them to a specific endpoint, like the RunJobs.php entry point we already have; it would be called something like runjobs, probably not exactly runjobs, but something very similar, to run the job.
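Very roughly, that encapsulation could look like the sketch below. The envelope fields, topic naming and service URL are invented for illustration (the real EventBus service defines its own JSON schemas); the only idea taken from the talk is "serialized job as an opaque event payload".

```php
<?php
// Sketch: wrap a queued job as an event and POST it to EventBus.
// Topic naming, envelope fields and the endpoint URL are hypothetical.
function submitJobAsEvent( IJobSpecification $job ) {
	$event = [
		'meta' => [
			'topic' => 'mediawiki.job.' . $job->getType(), // hypothetical naming
			'id' => wfRandomString( 32 ),
			'dt' => wfTimestamp( TS_ISO_8601 ),
			'domain' => wfWikiID(),
		],
		// The opaque job payload: change propagation would POST this back
		// to a RunJobs-like endpoint without interpreting it.
		'type' => $job->getType(),
		'params' => $job->getParams(),
	];

	$req = MWHttpRequest::factory(
		'http://eventbus.svc.example:8085/v1/events', // hypothetical endpoint
		[ 'method' => 'POST', 'postData' => json_encode( [ $event ] ) ]
	);
	$req->execute();
}
```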
The gain in the process is that we would have just one system to maintain. Change propagation and Kafka are much better instrumented than Redis; Kafka replication is much better than what we get with Redis, and it's already working and stable in the WMF cluster nowadays; it's much easier to operate in many ways. This also has an added advantage: in the WMF cluster, for things that are really related to an event, we could progressively just turn off the production of the job. You would still be able to send jobs directly, but we could decide otherwise case by case. Let's take again this Cirrus template issue, which is what started this conversation; that's why I'm referring to it a lot. If we think of Cirrus, for example, we could say: today it works like it has always worked, but tomorrow we decide that we want to modify it to work as a proper event based system. We just turn off the production of those jobs in MediaWiki, with a feature flag or something, and we write a specific rule in change propagation: when there is an edit of some kind, call Elasticsearch directly from change propagation. This would give us a migration path, or at least a partial migration, that could be done incrementally, and not a flag day where we just pass from one system to the next. Sorry, I went ahead of the slides. There is another option, a lower level route even; I'm going from the most complex idea to the simplest one. It would be to just write a Kafka backend for the job queue. We say: ok, the way we are using Redis, because of the characteristics of our job queue, is not viable, it doesn't work at our scale, so let's just use Kafka directly as a transport for the job queue. This would be simpler, because we just need to write a driver for the job queue for Kafka. Yeah, it's not very simple (I see someone is already shaking his head, and I can't disagree, it's not as simple as it seems), but it's relatively simple compared to the other options, I think.
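To make that option concrete: a Kafka backend would mean another driver under MediaWiki's JobQueue base class. The partial sketch below is purely illustrative; it is declared abstract because most of the real work (acknowledgement, retries, delays, deduplication) is omitted, and the producer/consumer calls are stand-ins, not a real client library.

```php
<?php
// Purely illustrative, partial sketch of a hypothetical Kafka-backed queue.
abstract class JobQueueKafkaSketch extends JobQueue {
	/** @var object Stand-in Kafka producer client (hypothetical). */
	protected $producer;
	/** @var object Stand-in Kafka consumer client (hypothetical). */
	protected $consumer;

	protected function supportedOrders() {
		return [ 'fifo' ]; // Kafka partitions preserve ordering
	}

	protected function optimalOrder() {
		return 'fifo';
	}

	protected function doBatchPush( array $jobs, $flags ) {
		foreach ( $jobs as $job ) {
			// One topic per job type, serialized job as the message body.
			$this->producer->send(
				'mediawiki.job.' . $job->getType(),
				json_encode( [
					'type' => $job->getType(),
					'title' => $job->getTitle()->getPrefixedDBkey(),
					'params' => $job->getParams(),
				] )
			);
		}
	}

	protected function doPop() {
		// The hard parts live here: per-job acknowledgement, retries,
		// delays and deduplication do not map naturally onto Kafka offsets.
		$msg = $this->consumer->poll(); // stand-in call
		if ( !$msg ) {
			return false;
		}
		$data = json_decode( $msg->body, true );
		return Job::factory(
			$data['type'],
			Title::newFromText( $data['title'] ),
			$data['params']
		);
	}
}
```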
That said, I think the way EventBus and change propagation do things is the right way for an SOA like the one we have. So this option would still leave us with two separate systems that work with different logic, using the same transport, and that's just by accident; I mean, we could even say Kafka is not ideal for the job queue and we want to use software X for that. It wouldn't give us the unification of the propagation of asynchronous events that I think we should try to achieve. So I'm not really fond of this option, I have to be honest; I don't think it's a great idea. And it's the same problem that I see with another option that has been named on the ticket, which is: let's get away from the "not invented here" syndrome that we kind of have, and use some job scheduling system that's not created by us, something that's already present in the field. There are a few of those, like Gearman or Chronos; those are just the ones I know of that kind of work, in general. It would require a lot of work, I'm not sure they are tailored to our needs, and especially to the peculiarities of our job queue that we just went through; and honestly, it would present the same problem that I just named: a lot of different systems. We already have change propagation; why introduce another thing, and have to replace everything we have to get it working? Also, I have to say, Chronos for example is a very interesting system, created by Airbnb, but on the other side it's oversized for our needs, if you want to do it properly.

So, the options. The first one is: just deprecate the job queue and move everything to the new system, which is change propagation with EventBus. Another option is to encapsulate the job queue in this new system, and then maybe move to the rule based approach progressively, rather than just using it as a transport. Another option is to still use Kafka, or something like it that's better than Redis for our needs, and keep the job queue model; but then we have to think of all the services that we have outside of MediaWiki and how we treat them. And a last option would be to go with a third party executor completely and ditch what we've created. I honestly think the best option, personally, is to start trying to encapsulate the job queue inside the EventBus plus change propagation system. I tried to keep the presentation short because I think we need to discuss this, so: opinions from people around, or questions about what I just said? Please.

(Audience) [Partly inaudible.] ...other than that, you're right, it's certainly showing its age, and we're having scaling trouble, on Redis especially. As far as I'm concerned, I don't have a particular skin in the game in terms of which route we go, but getting away from Redis would be nice. Making sure that whatever solution you use still works out of the box for third parties matters: there are plenty of different users who install extensions that rely on the job queue working. So whatever the plan is, if we decide to go about deprecating and moving towards more of a bus sort of model, having some sort of fallback that works for people who aren't us is, I think, important. Other than that, it sounds like it'll be a bit insanity inducing if you just start to serialize the PHP objects and then deprecate that job queue; choosing a new job queue is the complex part, basically.

(Audience) I'm here to ask: the transport and the types of jobs, are those coupled or not?

Well, currently I would say yes, because that's the way things currently work, but I can see the argument for decoupling the transport from the job types; that would solve some of the problems without having to pick apart what actually is going to be put into, or pulled out of, the queue.

(Audience) Right now these two are bundled together in the job queue, so it seems like the first thing to approach is to decouple, so that we're neutral on our transport, and then we can attack a lot of the problems.
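In code terms, the decoupling being suggested amounts to something like the transport neutral interface below. This is invented purely to show the shape of the idea; MediaWiki's actual abstraction is the JobQueue base class and its per-backend drivers.

```php
<?php
// Illustrative only: job types code against a transport-neutral interface,
// so the backend (Redis, Kafka, ...) can be swapped without touching jobs.
interface JobTransport {
	/** Enqueue an opaque, serialized job payload. */
	public function push( $jobType, $payload );

	/** Fetch the next payload for a job type, or null if the queue is empty. */
	public function pop( $jobType );

	/** Acknowledge successful execution so the job is not retried. */
	public function ack( $jobType, $payloadId );
}
```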
Ok. That proposition, for various reasons, would be for me the mid term objective; it sounds more reasonable because it involves changing the technology of the transport, and not a full software refactoring, just by changing what's underneath. It also goes in the direction of maybe being able to replace Redis more easily: once the transport has been brought to Kafka, we probably have a more scalable way to carry on from there, so that's good.

(Audience) I also go with what I have said, which is: even if you manage to have a more scalable transport, it would be awesome to design this with scale in mind for the job queue itself. I fully agree with the general idea of what has to be done; my question is whether the problems that we have with the job queue are maybe not so much about the technology (or rather, the technology may help, in terms of better monitoring and better debugging) but about the fact that sometimes those jobs create recursive jobs, or have bugs where a single user action creates millions of further jobs. My question for you in general is whether this really can help avoid that, or whether the problem is a bit more the model, what actions are being handled by the job queue. I know it's a bit of a loaded question.

No, it's a fair topic, and I agree there are some intrinsic problems with the way we generate jobs. Recursive jobs are a classical case; that's one of the problems that we have, and you're not going to fix that directly just by transitioning to a new system. The idea is that if you have specific types of jobs that have these kinds of bugs, having a system that's more inspectable, better instrumented, and that more than one person understands in general (and here I'm looking at you, Aaron; sorry, but sometimes I really have to rely on you to understand how things work in the current stack), it's going to get better: with better instrumentation and debugging you should be able to find these kinds of loops and problems. One thing that I have to stress, which I didn't while presenting, is that at the moment there are a couple of things that the job queue is supposed to do that the EventBus plus change propagation system is not doing right now. One is voluntary, caller controlled deduplication of jobs; that's not supported at the moment in the model, but we were discussing this with Marko this week, and I don't think it's extremely hard to make something like that work. And the other thing, correct me if I'm wrong, guys, is that you don't have a semantic for executing exactly once, right?

(Audience) Well, in theory the job queue right now has something like that, but that's theory, not practice. In practice, from what I've seen and heard on the topic, I think change propagation has roughly the same guarantee that you'd want from the job queue: it's at least once. It's somewhat better than the job queue in that we don't suffer from the "cannot connect" errors, but it's also true that when you execute a job, it doesn't get committed until it's acknowledged from the other side; so if that doesn't happen, maybe the job actually did get executed but the connection broke or whatever, then we're going to retry it.

So it's the same problem that the job queue has, basically. Nowadays that's what we usually call at least once; exactly once in a streaming system is usually very difficult to achieve. There are systems that try to do that, but the regular way is at least once, plus making the work itself idempotent.
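Since delivery is at least once, the usual mitigation is making the handler idempotent. Below is a minimal sketch of that pattern using MediaWiki's object cache; the key scheme is made up, and note that this only narrows the duplicate window (the handler can still crash after the marker is set), so it is not true exactly-once.

```php
<?php
// Sketch: skip redelivered events by recording an id we have already seen.
function runOnce( $eventId, callable $handler ) {
	$cache = ObjectCache::getLocalClusterInstance();
	$key = $cache->makeKey( 'job-dedup', $eventId ); // made-up key scheme

	// add() is atomic: it fails if the key already exists,
	// so a redelivered event with the same id is simply skipped.
	if ( $cache->add( $key, 1, 86400 ) ) {
		$handler();
	}
}
```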
(Audience) Definitely. To me it seems that, of the different options that you proposed, some are more targeted at the short term; especially option 2, which you prefer, seems to be very incremental, something you can do soon with limited impact, and it would also be possible to combine it with something that could eventually become a full migration. But the requirement for third party hosting that you called out at the beginning, the pure PHP environment, might be something that changes in the long term: shared PHP hosting will probably not be the dominant hosting environment in 2020 anymore. It's four years away, but it's the kind of time frame in which we can start thinking that this is going to go away, and in the long term, especially if we invest in containerization solutions, we can potentially rely on a service being there even for third party users. So I agree with you that option 2 seems the most promising in the short term, but I think it might be possible to do something like option 1 as a follow up in the long term.

I agree. The point is that deprecating the job queue completely, as it is now, has profound implications even for development, and for a ton of MediaWiki extensions. When I think about this, I think about how WordPress at some point (I don't remember if it was between version 2 and version 3, or version 3 and version 4) changed the whole set of APIs that most plugins were using, and it was a mess, a tragedy for anybody using WordPress outside of Automattic. I don't really want us to get to that situation; that's the main reason why I'm not saying "a full transition right now". Of course, once you do option 2, in one year it's going to be easier to move to option 1, because some of the logic will be there. But nowadays the change propagation plus EventBus system doesn't have things like caller driven deduplication, which might not be needed in the long term, I agree with that, but option 2 is the acceptable debt option for now.

(Audience) I have a question, and it might be something silly. At the beginning, in the scope of things that are done asynchronously, you put change propagation and cron jobs in the same category. But a cron job is not necessarily something that's run from a user action; some of the things we run from cron are proper cleanups, something that you need to run once every day, and it cleans up things one time. That's a bit different from something that's run periodically to get things in sync, which is what Wikidata does. Wikidata nowadays basically does both: we have several jobs, some of which do cleanup, only very rarely; but we also have more or less continuous jobs which do some form of change propagation from Wikidata to the wikis, and we basically need to run those all the time, every 3 minutes at the moment, or 5 minutes or so, several instances of them running all the time. That's a classical thing that should be moved to something job queue like. Still, I think when I was looking at the change propagation rules, I noticed that they actually duplicated a huge amount of configuration that was inside of
ORES as well, so I think there is something a little fishy about the architecture there: ideally the jobs themselves should know about themselves, not change propagation.

Yeah, that's a specific case that we are kind of discussing with the ORES folks. Hardcoding that is not the best way to do it, but unfortunately we cannot avoid it. The reason why we have all that general information there as well is that for each wiki the ORES people want different rules to be applied, so we have to know which kind of model applies to each wiki. It would be much easier if ORES knew: if we could just say "there was an edit on that wiki", and then ORES would say, ok, for that wiki I need to use that rule. But currently that's not possible, so we have to do it this way.

(Audience) That would make sense, but I don't think any of this logic should end up somewhere that is not the wiki it's coming from. I think Aaron wants to say something about this.

(Aaron) I just want to point out that it's not specific to ORES. Even for sending events back to the wiki: change propagation is going to have to know what events to send back to the wiki, which is really the wiki's concern. So you're going to hit configuration duplication again, one copy in change propagation and one in the thing that actually performs the jobs. And you get the situation where ORES supports certain jobs and there is no means to tell change propagation that something is not happening. So I think this is actually going to be a general problem, not an ORES specific one.

I kind of agree, because it describes the fact that there is job management: you send out a generic event from MediaWiki, and then whatever needs to manage that event needs to have some intelligence in it. You can mitigate the duplication by having a shared configuration between the two things, like, I don't know, a YAML file that gets distributed to the two different places. Of course you have to have some knowledge of the kind of job you have to do, and you have to encode it in the rules that change propagation has; that's obvious, whenever you have a system that's event based it needs to have some intelligence. I mean, change propagation is not just a system to relay messages from one side to the other; that's the way I want to abuse it for moving the job queue onto it, but it's not the way it's intended to work, properly. So yeah, of course that's needed in some ways; it's not necessarily a bad thing, and as long as you can manage to have the configuration shared between the two systems in some way, it shouldn't really be a problem.

(Audience) But wouldn't it be cheaper to have the rule just say "send it all to us and we'll handle it ourselves"? But then it's just listening to Kafka directly; you have to decide where the boundary lies. Isn't it better to have the consumers subscribe, to keep the information within the things that are actually using the events? Shouldn't the consumers proactively subscribe to things? That's what change propagation does, but it's totally external to the services that are consuming.

One thing you get by separating those things is the retries and so on, which you would have to implement from scratch if you just consume directly.

(Audience) I'm saying ORES could be talking to
something that would allow it to register itself, saying "subscribe me to these types of events".

Yeah, the problem is that you have a series of features, like retrying logic, concurrency limits, deduplication of jobs, a whole class of things, that you would essentially have to code into every single system. For me personally, as an ops engineer, it would be a nightmare, because I would have to know all the characteristics of the consumers in every different piece of software that we have, and probably all the different hiccups of every Kafka library that we use in the different languages that we use. And when we have, let's say, a Kafka issue of some type (I'm saying Kafka because we are using Kafka currently, but it could be any stream propagation system), every piece of software will react differently to, for example, one of the Kafka consumers going down, or all the Kafka machines going down. We already had that: we have a few things that consume from Kafka in different ways, and whenever one of the machines fails, we find out that the different softwares react in different ways. It's one of the reasons why people say you shouldn't use a database as an integration layer: you don't want five different pieces of software interacting directly with the database.

(Audience) Well, I would be very happy if we had only five different pieces of software interacting with the database!

Yeah, it's basically the same argument, I think.

(Audience) Sorry, but just to give you an example of what I would suggest: extension.json describes the hooks that an extension handles, and I'm proposing that we do something like that for the events as well. You'd say "this expects these events", and then there's something central where we can just check that.

There was an idea to do this in change propagation in the early days, but, like was said, if the concurrency is wrong, then some burst will basically just stop the cluster; right now all these configurations are in one file, which we can look at and tweak, and it's kind of safer. Also, let me give another example. Let's say for some reason we release a new version of MediaWiki and we introduce a bug that sends out a burst of events, like 1000 events for every edit; we just did something wrong. In that moment, what happens is that all the systems go crazy at the same time; that's what's going to happen, for sure. If I know the central point that's causing this kind of harm, I can just turn off change propagation, and then go look at Kafka and understand what's happening. If I don't have something like that, I have to turn off Kafka itself, which means nothing can send to Kafka properly, so we are going to lose those events. It's just a simple example, but in general I think that a central point (which is not a single point of failure, but a single place where you manage the routing of these things) is a good idea.

(Audience) And there you destroy our dreams! Yeah, I agree with that; it also gives you a lot of metrics and so on. So, getting away from that kind of job: maybe in an EventBus job subclass we could just stop doing serialization of PHP; we could push jobs to third parties using a certain class, and of course we'd be able to use services. Compared to the old jobs, we could actually be sending JSON, which enables you to avoid using PHP to dispatch in some of the cases. Because I guess in your case there would be a point where the jobs that you handle with
change propagation have a sort of generic view, and then you have other scenarios where you don't have to do this. Sometimes you can say, ok, let's take a search job: how do we do this now? It's basically opaque PHP, and when you're talking to other services, you don't have that; with JSON everything's there, so if you're a third party it's still useful, you can get at it. It seems like a way to get to something; it's more steps in the right direction.

Do you think that there's something we left out, features of the job queue that are important that we're not considering at the moment? I just tried to summarize what came out on the ticket; the point is not what's hard to do in the current jobs, it's how easy it would be to re-implement. Other questions, comments?

(Audience) [Question about how this works across data centers.]

Okay. Mostly, the way it works depends on the way Kafka works, basically. Otto's not here, so he can't correct me if I say something wrong, which is very good: I'll appear to know what I'm saying. Basically you have a main Kafka cluster in data center one and a main cluster in data center two, and then you have a tool called MirrorMaker, created by LinkedIn, that propagates the messages from one to the other. So let's say you switch data centers. What you would like to do is, basically, nothing on the EventBus side, because the Kafka cluster in eqiad (sorry, I'm trying not to use our internal names: in data center one) can continue to work through the series of jobs it has in its queue. You can just say: ok, I'm turning off MediaWiki here, I'm moving the active MediaWiki to data center two, and you just change the endpoint that change propagation is calling for the MediaWiki API. That's going to work because you would have two clusters active: the change propagation in data center one will exhaust its list of events and send them to the proper softwares, including the MediaWiki API, which at this moment would be in data center two, because you switched it over. And then, since MediaWiki is working in data center two, the Kafka in data center two will produce events there, and the change propagation in that data center, which is listening there, will consume them and send them out. But if you lose a data center, you will still have a replica of its data in the other one, and you can configure change propagation to use that, to process all the events from the first data center that you still hadn't consumed. It's much more solid than what we can do nowadays: say that today data center one goes down and we have to switch everything to the other side; what's still in the queue in Redis at that point, if it went down abruptly, I can't guarantee it.

(Audience) Ok, so: all the topics are prefixed with the producer DC, and they are cross-replicated in both directions, so in each DC you have two sets of topics, and you choose to consume only the local ones. Normally you just listen to the local ones, but if you lose a data center, like in a crash event, you can listen to both and you will process the messages, the ones that are still within the retention time of the topic; and you can still re-synchronize after that. So yeah, it's much more solid in general than
multi data center for Redis, where we have basically nothing. There are still some wrinkles in the switchover process, because it's only in 0.10 that Kafka is introducing timestamp based indexing, which we don't have yet, and right now the offsets are local to each cluster, so you can't just say "give me the same position on the other cluster", because the offsets don't match. You can still manually seek to the right offset and actually re-process something.

Yeah, you might re-process something, but you're not losing jobs, and I would argue that's more important. Maybe the worst that happens is that we send out five or two emails instead of one.

(Audience) We're doing better, definitely doing better than before. It's not like we'd have lost jobs; but I think what's important is not having to do this at all, being able to just specify, per job type, that you want at least once, and you're done. So, I guess one last question or comment. You mentioned cron jobs as part of the scope, but you also said they're not the hard part; do you think they will actually ever be a part of this change, or will they remain a separate piece of all that you've been talking about?

Yeah, I concentrated the presentation on the hard part, which is the job queue, but I think, for example, about the cron jobs that we use to synchronize Wikidata: we could just generate an event and make change propagation do the job, instead of having cron jobs do that. Yes, we have multiple systems, but the hard ones to reconcile are the MediaWiki job queue and change propagation. The MediaWiki job queue is the one giving us real operational issues right now; the cron jobs mostly work, even if they're barely instrumented. Although, at times, we've had cron jobs trampling on one another and not really getting much done because they were contending for basic resources; what was it, one year ago or something, some problem like that. But yeah, for the cron jobs, I mean, we do have a check that verifies that Wikidata is synchronized at least within 300 seconds, but that's basically all we have, and when that alarm goes off, somebody goes to look at logs. That's the amount of instrumentation we have, basically. So anything different from that is probably going to be better.

(Audience) This looks like something that would need coordination between people in different teams.

Yes, but I just wanted to have some people in the room and see if there were things that I was blind to that made it unfeasible, completely unfeasible, as an option. We might work on scheduling it sometime in the next year. Thank you all for your attention.

(Audience) Thanks!