Thanks for coming to our talk today. I'm Peter Webb. My work at the MathWorks focuses on MATLAB deployment and remote execution. I'm joined today by Lucio Seto, who knows much more about machine learning and math than I do. We're here to share a use-case story with you. We used federated machine learning to develop a predictive maintenance application for industrial cooling fans. We trained a model to detect anomalies in sensor data streaming from the fans and trigger service alerts. And we were production ready in 30 days. How did we put a big system like that together so quickly? No big secret. We used a box. You know, that box everything comes out of. Out-of-the-box, off-the-shelf software we could just get and use. We needed a physically accurate simulator to generate training data, a streaming service to transport sensor data from the fans to an analytics engine, and a metrics visualization platform. And we had to configure the connectors that glue all the pieces together. But then we could focus on the work that mattered: Lucio had to write the machine learning analytics, and I had to create that metrics dashboard. But before we get to that, a bit about the project itself. So here's the general idea. Scan the data produced by fans from a factory to generate service alerts. But we've got more than just one factory. And each of them has knowledge that the others don't. If we combine all the data from all the factories, we'll have better models. But that requires extra infrastructure: a central server to train and update the model, and some big pipes to push the data to that server. And then once the model's updated, the local classifiers have to pause to fetch the new model. That architecture works, but we thought we could do better. What we want is called federated learning. We replace that central compute server with a much simpler and cheaper data store. And then the only information we need to send back and forth is the models and their parameters.
No need for those expensive data pipes. And the local classifiers at each factory still benefit from the knowledge accumulated by their peers. This does leave us with a question: can federated learning build a model as accurate as a trainer that has access to all the data? Later on, you'll see some analysis from Lucio that helps us understand the trade-offs. We're using federated learning to simplify our infrastructure. But another important use case is data privacy. Since the data never leaves the site that owns it, any secrets it contains remain local. While this might not matter too much for fan-sensor data, it can be very important for medical or other types of regulated or protected data. And speaking of data, our training algorithm wants lots and lots of it. But it's not just a question of volume. If it were, we could just connect the model to the data from the fans and our data scientist would be out of a job. So instead, we turn to a source that can provide us with both signal and label: a physically accurate multi-domain simulation model. That part about labeling is worth emphasizing. Once the model is generating the data stream, it knows when the fan isn't working right and tags that part of the signal for the trainer. Without the model, we'd have a lot of manual classifying to do. We need to model three characteristics of those fans, electrical, mechanical, and thermal, and scan for anomalies in each domain. A load anomaly, for example, indicates the external temperature has risen unexpectedly quickly, perhaps in response to increased load on the device cooled by the fan. It's not sufficient to model the domains independently, because in a real fan, they interact. For example, when the temperature rises, the fan controller raises the supply voltage, the motor runs faster, and the rate of cooling increases. This provides us with enough accuracy to be production ready on day one. So here we are on day one.
The fans are streaming data to the classifier, and the dashboard provides the operators a near real-time view of the system. But now we have another source of labeled data. The classifier is labeling the live fan data. How can we exploit that to improve the algorithm? Now we need a human in the loop. Operators may periodically determine the classifier missed a real anomaly or raised a false alarm. This shouldn't happen often, but when it does, the operator can send the misclassified signal and the correct label to the trainer. When enough have accumulated to warrant retraining, the trainer updates the model and sends it back to the classifier. Manual classification becomes feasible because it is only occasionally required. The machine learning algorithm handles the bulk of the work. And with that, I hope I've given you the background to understand the next section. Lucio is going to tell you about our modeling techniques and how we evaluated their effectiveness. Lucio? So today I'm going to play the role of the data scientist. For me, there couldn't be a better scenario than having a physical model to start with. This helps us better understand the problem at hand, so that we can propose an incremental and federated learning approach that allows us to detect and classify the anomalies of the fans. Let's start by understanding the data that we can observe. We observe three different variables from sensors in each system, or fan, if you will. We observe the voltage applied to the motor, the angular speed of the fan, and the temperature. For visualization purposes in these examples, we have introduced more anomalies than we usually expect. Anomalies are annotated at the top of the first plot. There are three types of anomalies. A load anomaly is detected when the system is working in overload conditions. That is, we are demanding more work from the system than it was designed for.
A fan anomaly is detected when something bad happens in the mechanical subsystem of the motor or the fan. Finally, a power supply anomaly is a deficiency that shows up as a drop in the voltage. Now let me zoom into a small area. This area is about two minutes. In our example, the anomalies are manifested as small pulses on the traces that last between 10 and 100 seconds. Notice that different types of anomalies have different profiles. The information contained in only one of these traces is usually not sufficient for a perfect classification of the different types of anomalies, as one anomaly or another can be manifested in several of the signals. There are some complexities here. For example, the pulse for the second anomaly in the fan speed is within the noise amplitude. Also, there are two apparent pulses in the temperature trace towards the end, but they do not represent any anomaly. These are the data-analytic challenges that we can identify in this problem. The system conditions vary across different locations, for example, the load schedule. You can see in the second trace on the right that the temperature varies over the day. Individual systems are different. That means the age of every fan might be different, and therefore the characteristics of the things that we measure might be different. Also, observed anomalies are different at different locations. There are cases where, in some factories, we can only observe one type of anomaly. Finally, anomaly detection lags should be on the order of seconds, so we cannot use big windows to correct these signals. Before talking about federated learning, let us make some assumptions. As we observed in the previous slide, there are local system conditions that depend either on the factory or plant, or on the individual fan itself. For example, a given factory might have an array of fans inside a controlled-temperature environment.
Or they might be working 24/7, or only for a reduced amount of time. Additionally, every fan might be at a different point in its life cycle. Therefore, we will assume that these variable conditions are mostly represented by small variations in the signal trend lines. For this, we use remaining-useful-life modeling and some robust mechanisms for detrending. I will not go into detail on this; we already talked about it in our previous presentation a year ago. However, I will show you the effect on the data after I apply these corrections. Here, I plot the data points by looking at only two of the variables. We can see that after removing the local effects, we can clearly separate the load anomalies from the normal data. By the way, the power supply anomaly is also discriminated easily when we look at the motor voltage, which is not plotted in these figures. Most importantly, let me emphasize that any information about the local conditions, that is, the robust trend and the remaining useful life, is not communicated to other systems or to the federation. It stays local to every motor. Now, let's go to the modeling technique. To model the anomalies, we use a set of binary classifiers, each of which models a given type of anomaly versus the normal behavior. We experimented with incremental linear support vector machines and incremental naive Bayes classifiers. SVMs use a stochastic gradient descent algorithm to update a model that starts from a random hyperplane. Naive Bayes classifiers update the summary statistics that represent an independent normal distribution. Naive Bayes classifiers have the advantage that they can provide scores within a probabilistic framework. However, in our experiments, we found support vector machines to be more accurate for this type of problem. Now, recall that not all the anomalies might be sufficiently represented at every node.
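The incremental SVM update Lucio describes is a standard hinge-loss SGD step. Here's a minimal Python sketch of that idea (hypothetical; the project itself used MATLAB's incremental learners, not this code):

```python
import numpy as np

def sgd_svm_update(w, b, X, y, lam=1e-3, lr=0.01):
    """One incremental pass of hinge-loss SGD over a mini-batch.

    w, b : current hyperplane normal and bias (random or zero to start)
    X, y : observations and labels, with y in {-1, +1}
    lam  : L2 regularization strength; lr: learning rate
    """
    for xi, yi in zip(X, y):
        margin = yi * (xi @ w + b)
        if margin < 1:                       # inside the margin: move the plane
            w = (1 - lr * lam) * w + lr * yi * xi
            b = b + lr * yi
        else:                                # correctly classified: only shrink w
            w = (1 - lr * lam) * w
    return w, b
```

Because each call consumes only the current mini-batch, the model can keep learning from the stream without ever revisiting old data, which is what makes the per-node training feasible here.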
Therefore, we only train those binary learners at every fan for which we have enough evidence. Let's make a small parenthesis about the labels. We assume that these are provided in either of the following two ways. In the first case, an operator at a given factory might label the anomalies after experiencing them, retrofitting the information to the stream. In the second case, we can consider an engineering team that might be studying the operation of the system under certain conditions at a given factory, for which they have created a physical computer model that generates the data and the labels. OK, let's see how the federation works. Let's assume that we have five nodes. Each node has been trained on its local data with an incremental learning algorithm. Model parameters and model weights are sent to the federation. The weight is a value that captures the amount of evidence that we have observed for every anomaly at every local node. At the federation, we also keep global model parameters and global weights, denoted here by w0 and theta0. We can simply compute model parameters using a weighted sum. In the case of the support vector machines, we compute new hyperplane coefficients and bias. In the case of the naive Bayes classifier, we compute a weighted sum for the standard deviation and the mean of every variable. This computation might be synchronous or asynchronous. That is, we could update the global model as soon as we receive one model update, or we can update in batches when we receive all model updates. Later, Peter is going to show you that we might also compute the global updates at every node. There are three important aspects to remember from this setup. The federation weights the model parameters using the amount of evidence for each type of anomaly. The federation is also in charge of reconciling new types of anomalies reported by the nodes. That means that as new anomalies appear, the federation is going to update the inventory of known anomalies.
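The evidence-weighted sum at the federation can be sketched in a few lines of Python (hypothetical names; in the real system these parameters live in a shared Redis store and the models are MATLAB objects):

```python
import numpy as np

def federate(node_params, node_weights, cap=None):
    """Evidence-weighted average of per-node linear model parameters.

    node_params : list of (coeffs, bias) pairs, one per node
    node_weights: evidence counts (how much data backs each node's model)
    cap         : optional cap on the accumulated global weight, so new
                  evidence from the nodes keeps influencing the model
    """
    total = float(sum(node_weights))
    coeffs = sum(wt * np.asarray(c)
                 for (c, _), wt in zip(node_params, node_weights)) / total
    bias = sum(wt * b for (_, b), wt in zip(node_params, node_weights)) / total
    global_weight = total if cap is None else min(total, cap)
    return (coeffs, bias), global_weight
```

For the naive Bayes variant, the same weighted-sum pattern would apply to each variable's mean and standard deviation instead of hyperplane coefficients.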
And as I already said, the global federation weight is capped at an arbitrary value, so we keep considering novelty introduced by the nodes. At any point, the new updates are sent back to the federates, and we have updated models at each one of the fans. Let's look at the accuracy of the classifier model. Recall that here we are presenting the combination of two different paradigms: incremental learning and federated learning. To understand better how the accuracy is impacted, we are going to explore the accuracy following these steps. So let's first start by looking at the accuracy of a simple linear SVM. This is an offline linear SVM; we get all the data and train an SVM at once. Let's look at the confusion chart of the offline linear SVM model. Accuracy here is computed by looking at a five-fold cross-validation over all the data for 10 fan systems that run over 24 hours. The first row and column represent the normal data, which you can see is obviously the most frequent condition. Next, we have three types of anomalies. The last row and column represent the most infrequent condition. For us, it represents when we are observing two or more anomalies at the same time, which is a very rare condition. You can notice that we only fail to flag seven real anomalies in the non-overlapping cases. Let's compare with an incremental support vector machine learner. Here, we measure the accuracy by looking at the forecast predictions after a warm-up period of 24 hours. Notice that we also miss very few anomalies. However, the model produces more false alarms. Something worth highlighting is that we do not misclassify the anomalies. This is due to the nature of the modeling; that is, because we are using binary models between normal data and types of anomalies, having a misclassification would actually imply an error in two of the binary learners, the likelihood of which is much smaller.
In summary, in both scenarios we have a very low false negative rate and false discovery rate, which is good. Okay, so now let's compare with the incremental federated learning approach. In this case, every local model only trains on one type of anomaly. The model is then sent to the federation, and we measure the accuracy using forecasted predictions from the federated model. We still miss few real anomalies, but under these conditions, the false discovery rate increased to approximately one error in every 200 flagged anomalies. Let me emphasize here that while the incremental model on the left needs to have access to all the data, the federated model on the right does not require sharing or communicating any data. Instead, we only share the individual models, which is the federated learning paradigm. Let me show you some of the typical errors that we are making. The most common error is when two anomalies occur at the same time. We actually expected this type of error, as the amount of information is not really sufficient to tune up the model. Another error occurs during the transients of the system. This was also expected. There are two possible workarounds for this that we did not fully implement in our proof of concept. First, we could consider some small lags. We could also incorporate into the system the scores or posterior probabilities, to give the operator an idea of the significance of any detected anomaly. To close the data analytics part of this presentation, let me show how I would package the algorithms we just described. There are three streaming functions. Two of them run locally at every node, and one updates the federated model. The anomaly predictor receives a stream with observations and feeds an output stream with anomaly types and scores. There are two additional inputs: the classification model and the local conditions.
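The two metrics Lucio keeps quoting, the false negative rate (missed anomalies) and the false discovery rate (false alarms among flagged anomalies), come straight from the confusion counts. A quick Python sketch, with made-up counts just for illustration:

```python
def error_rates(tp, fp, fn):
    """tp: anomalies correctly flagged; fp: false alarms; fn: missed anomalies."""
    fnr = fn / (tp + fn)   # fraction of real anomalies we missed
    fdr = fp / (tp + fp)   # fraction of flagged anomalies that were false alarms
    return fnr, fdr

# "One error in every 200 flagged anomalies" corresponds to an FDR of 0.5%.
```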
While the first one is not required to be updated, the local conditions of every motor or fan do need to be updated. Recall that we keep the local conditions for each fan or system. The function that incrementally learns the anomaly classifier has a very similar signature, with the difference that it also receives anomaly labels in the input stream. And it also updates the classification model with the updated parameters. Finally, the function that updates the federated model does not receive any data, labels, or local conditions. Its inputs and outputs are only the classification models and the evidence weights. That's it. So now, let me hand back to Peter, who will tell us how he configured the whole system and walk us through some implementation details. Thank you. Thanks, Lucio, for explaining how the anomaly detectors turn analog signals into square waves: discrete state changes, which we need to count. That's the task we started from, after all: scanning a set of real-time signals to create a predictive maintenance solution for cooling system operators. So let's take a look at that system and how we put it together so quickly in a little bit more detail. The system essentially consists of three components: a data source, an analytics engine, and a metrics dashboard. Connectors link or interact with the streaming service to manage the flow of messages through this pipeline. So that's it. That's a high-level, abstract view of the whole system. But it's not going to do anything until we fill in those abstractions. Now we go back to the box and pull out all that sophisticated software. Simscape's multi-domain models generate our training data. MATLAB Production Server runs the training and classification algorithms. Redis preserves model parameters and other state. And InfluxDB and Grafana provide the dashboard. To help you understand the trade-offs in building a system like this, I thought we'd follow a signal on its journey through the pipeline.
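The three streaming-function signatures Lucio just described could be summarized like this (Python stubs with hypothetical names; the real functions are MATLAB streaming functions deployed to MATLAB Production Server):

```python
def predict_anomalies(observations, model, local_conditions):
    """Local: consumes sensor observations, emits (anomaly_type, score) pairs.
    Reads the model unchanged; updates and returns only the local conditions,
    which always stay on the node."""
    ...

def learn_anomalies(observations, labels, model, local_conditions):
    """Local: same inputs plus anomaly labels in the stream; additionally
    returns updated classification-model parameters."""
    ...

def update_federation(node_models, node_weights, global_model, global_weight):
    """Federation: no raw data, labels, or local conditions cross this
    boundary; only classification models and evidence weights do."""
    ...
```

The key design point visible in the signatures is the last one: the federation function never touches data, which is what makes the privacy and bandwidth claims hold.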
We'll stop at each of the three components, where I'll highlight the challenges we faced and the choices we made. So let's take a look at the data generation. This model allows me to vary the type and number of anomalies that occur during a given period. That variation raises the diversity of the scenarios I can throw at the trainer and increases the model's eventual accuracy. The colors you see indicate the different simulation domains. And here's the simulator at work, generating about an hour's worth of labeled synthetic data. There are four charts here: the three fan signals and the load anomaly. And as it runs, you'll see what I mean by the anomaly labels. The load anomalies are that square wave you see up in the top chart. In this section, you see several of our injected load anomalies. The case temperature rises, and the fan speeds up, drawing more power in compensation. In the top chart, you see the labels, which are logical values: true during the time the anomaly is occurring and false when everything's normal. That's why the anomaly signal is a square wave. Here is a voltage anomaly. See the large voltage drop in the bottom chart and the corresponding slowdown in fan speed. The temperature rises a little, but not enough to trigger an anomaly label. Another thing to note is that each of these red areas is about 90 seconds wide, and therefore contains about 900 messages. Those square wave labels are chopped into lots of little pieces. That'll be important later on. The next stop for our traveling signal is the classifier, where we'll see how to design federation for high throughput. Here's the anomaly-detecting pipeline at one of our knowledge-sharing factories. Let's take a look inside that signal classifier. Each factory has a number of anomaly detectors, each using a model cached in a local data store. The parameters required for federation are stored in a shared data store that all the factories can access.
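Peter's arithmetic (a 90-second anomaly spanning about 900 messages) implies roughly ten samples per second. A small Python sketch of how such square-wave labels could be generated alongside a synthetic signal (hypothetical; the actual labels come from the Simscape model):

```python
import numpy as np

def label_signal(n_samples, anomalies, rate_hz=10):
    """Build a square-wave anomaly label: True while an anomaly is active.

    anomalies: list of (start_s, stop_s) intervals in seconds
    rate_hz  : sampling rate (~10 Hz matches 900 messages per 90 s)
    """
    t = np.arange(n_samples) / rate_hz
    labels = np.zeros(n_samples, dtype=bool)
    for start, stop in anomalies:
        labels |= (t >= start) & (t < stop)
    return labels
```

Because each sample carries its own label, chopping the signal into messages chops the label too, which is exactly the "lots of little pieces" problem the windowing stage has to undo later.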
The anomaly detectors run simultaneously, each processing signals from a unique set of fans. When it comes time to integrate parameters from other factories into the local model, we'd like to pause each local detector for as short a time as possible. It turns out that if we incorporate federation into each detector, the atomic read-write guarantees of our data store make it possible for each detector to operate completely independently. As soon as any given detector determines it has seen enough data to change the parameters in a measurable way, it pulls new parameters from the shared store and updates its model. Note that at the same time, the green factory sent an update to the shared store. Did the federating detector get that update? It doesn't really matter. If not, it'll get it on the next cycle. What matters is that the other detectors didn't have to stop and that the parameter data wasn't corrupted. Then the federating detector updates the factory's model in the local store. And at the same time, a different detector loads the model. Again, we don't know if it got the pre- or post-update version, and it doesn't really matter. Since each detector can federate on its own, there's no need for a synchronization stage that would make some of them wait. And that keeps our throughput high. Our signal's been turned into a square wave now, and it's traveling toward the dashboard. I'm going to use this last phase of our signal's journey to highlight a couple of subtle points regarding the time series data. You'll need to at least consider these questions when you design your own data model. Though these two signals look like atomic, indivisible units, the sampling rate divides them into segments, and that's how they're transported to the system: as multiple messages. Now you've probably seen the problem. The data parallelism required for high throughput can combine with network delays to scramble your signals.
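The lock-free pattern Peter describes needs only one guarantee from the store: reading or writing a whole parameter blob is atomic, so a reader sees either the pre-update or the post-update parameters, never a torn mix. Here's a toy Python stand-in illustrating that contract (in the real system this role is played by Redis):

```python
import threading

class AtomicStore:
    """Toy stand-in for a shared store with atomic whole-value read/write."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None

    def write(self, value):
        with self._lock:          # writers replace the whole blob at once
            self._value = value

    def read(self):
        with self._lock:          # readers see pre- or post-update, never a mix
            return self._value

# A detector federates whenever it is ready; no coordination needed.
store = AtomicStore()
store.write({"coeffs": [0.5, -1.2], "bias": 0.1, "weight": 40})
params = store.read()
```

Missing one update is harmless because the next federation cycle picks it up; what the atomicity buys you is that no detector ever blocks another.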
Most messaging services will give your data an ingest or arrival timestamp. That's the time the system accepted the message. But to put these messages back together again, we need the event timestamp, which records the time that the signal was generated. So if you're generating time series data, make sure your data model includes the event time. Now our signal is out in the wild, streaming across the network. We'll next encounter it at one of the connectors. Our poor little square wave is in four pieces, and the pieces are no longer in order. But the connector uses the message timestamp to reassemble it, placing it into little buffers called windows. And then it emits each window as a single larger message. In our case, the window heads to InfluxDB. I hope you can see why using the event timestamp was important here. Note here that the anomaly signal splits across a window boundary. That's why the signal is decorated with those little squares. Those represent a single unique signal identifier that the database queries use to join the windows back together. So at last, a time-coherent window of data is bound for the dashboard. Let's take a look at what happens when it gets there. You've already seen video of the fan sensor data, so I thought I'd start with the classifier and then show you the metrics dashboard. Here's the classifier during development. I'm using a MATLAB session as an ersatz server, which is far too slow for production but very convenient for debugging. Now we've stopped at a breakpoint. We can take a look at the fan input data, which is stored as a timetable. Here are a few of the values of the sensor signals. And after passing this data through the prediction functions, we can examine the results. I'd be better able to judge if these were correct if I were a better data scientist. But I trust Lucio's math, so I'm sure they are. Now that same classifier is running in production on my eight-worker local server.
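Reassembling the scrambled pieces hinges entirely on the event timestamp. A small Python sketch (hypothetical message format) that buckets out-of-order messages into event-time windows and sorts each window, which is essentially what the connector does:

```python
from collections import defaultdict

def window_by_event_time(messages, window_s=60):
    """Group out-of-order messages into event-time windows.

    messages: iterable of (event_ts, signal_id, value) in *arrival* order.
    Returns {(signal_id, window_index): [(event_ts, value), ...]} with each
    window sorted by event time, so the signal is time-coherent again.
    The signal_id in the key is what lets downstream queries join windows
    that split one anomaly across a boundary.
    """
    windows = defaultdict(list)
    for event_ts, signal_id, value in messages:
        windows[(signal_id, int(event_ts // window_s))].append((event_ts, value))
    return {key: sorted(vals) for key, vals in windows.items()}

# Arrival order is scrambled; event timestamps restore the original signal.
scrambled = [(30.0, "fan6", 0.2), (10.0, "fan6", 0.1), (70.0, "fan6", 0.9)]
windows = window_by_event_time(scrambled)
```

Note that an ingest timestamp would be useless here: it reflects network delays, not the order in which the samples were actually produced.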
Mostly, I wanted to highlight the throughput here, since each request contains about 1,000 messages. You can see we're processing between 6,000 and 8,000 messages per second. And here's the dashboard. This is an overview of the cooling system. Number of anomalies at the top, in that bar chart. And below it, a breakdown of the anomalies by fan. On the right, there's a timeline of total anomalies showing when they occurred. We can also drill down to see more detail about each individual fan. This is fan six. You can see the anomaly counts and the signal values that triggered them. Now we're much further along in the run, and you can see another few fans have come online. Let's look at fan nine. It's only reporting one type of anomaly so far. On the right, you can see there are more downward spikes in the fan speed graph. Now we'll go back to the main screen, and you can see we've accumulated a couple hundred more anomalies. So that's the overview of the Grafana dashboard we developed. So this is what we've learned. The training data provided by physical models gives you a head start. And using off-the-shelf components lets you put a system together really quickly. Federation makes your model smarter, and careful separation of concerns and interface design makes it easier to scale where necessary. Physical models provide the seed for rapid growth of a robust predictive maintenance classifier. Thank you very much.