So Stefan is basically from Munich and he's running his own software consulting company named Stefan Barisch Software Consulting. He'll be talking about golden master testing in Python, and he has an interest in text processing and machine learning with respect to text. Apart from that, he's quite inclined towards automation and GUI programming, and he has spent a lot of time on automation testing. So thanks a lot, Brian, sorry, Stefan, for joining. Over to you. Yeah, you're very much welcome. So it's been a long conference so far, so I'll try to keep things very, very simple. We are essentially talking about rubber ducks and, well, how you would test rubber ducks and other things with Python. To provide some background: I did a lot of testing in my past; that's actually what brought me into programming a couple of years ago. And since then I've thought about different ways to test, and how to make testing easier to do well with Python and with some other technologies. Well, before we get started, let's have some background: what is golden master testing? Let's imagine it's summer and you have a summer job, and you work for somebody who produces rubber ducks, a lot of nice little rubber ducks. Now let's also assume that you're not a rubber duck expert. So you don't really know what a good rubber duck should look like or how much it should weigh. And, honestly, you don't really have the time to learn all these aspects of rubber ducks. But in your job, you should do some quality control: you should be able to decide, is this a good duck or a bad duck? So you come up with the following approach. You ask the person that you work for just to, well, be available. And every time you see a duck that looks kind of particular, like this red guy here, you ask them: is this really a good rubber duck, is that how it should look? And if he says yes, well, then you continue with the production. Maybe you even remember: okay, that's a good duck.
That's what a good duck is supposed to look like. Now, we are not only testing rubber ducks, we are testing programs, but you can get into a similar situation. You might imagine that you have an old, well-established system, and this system produces some kind of output. It might be an image, it might be text, it might be a webpage. And you don't really know what this data is supposed to look like. You have a rough overview, it should obviously not be totally defective, but you don't know the details. So you ultimately ask the computer to do the same thing that you did earlier with the rubber ducks. You say: okay, computer, please compare the output of my current run with the previous runs. And if you see something that's different, then please notify me and I'll have a look. But maybe whatever happens now is the new normal, the new standard, and we'll just continue with production. Okay, so bringing this back from our ducks to computer-like things. This could mean, for example: if you look at the website here on the left, there are a lot of little elements that may have changed or may not have changed. It might be a log file; maybe you monitor the behavior of your old system with a log file, and you want to know: did things change? And there things get slightly more interesting, because you would need to make a decision: what do I actually look at? In the example of the log file, do I only look for the presence of some log messages, or do I want to make sure that certain errors are not there? Do I compare execution times? So what you really want to do is take an established, known-good output, compare it with the current results of your execution, and write a program that compares some part of the past execution with the current execution, and if they are not the same, notifies you. And ultimately, that's all there is to this talk.
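The core loop described here, comparing the current run's output against a stored known-good output and flagging any difference, can be sketched with Python's standard library. The function and file labels below are illustrative, not from the talk:

```python
import difflib

def compare_with_golden_master(golden: str, current: str) -> list:
    """Return the unified diff between the known-good output and the current run.

    An empty list means nothing changed; anything else needs a human look.
    """
    return list(difflib.unified_diff(
        golden.splitlines(keepends=True),
        current.splitlines(keepends=True),
        fromfile="golden_master",
        tofile="current_run",
    ))

golden = "duck color: red\nduck weight: 50g\n"
current = "duck color: black\nduck weight: 50g\n"
diff = compare_with_golden_master(golden, current)
if diff:
    print("".join(diff))  # something changed, notify the human reviewer
```

Everything else in the talk, capturing, filtering, reviewing, approving, is workflow around this one comparison.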
So we will just go through some further examples, provide some details, and show an extremely simple implementation of how this can work in Python. I think it's not even on GitHub yet; I think it's 100 lines. Okay, I hope that's interesting. So when would you use this? Maybe you have a legacy system and don't really know the outputs, we talked about this. Maybe you have an extremely large amount of data. You want to compare, let's say, a large data set, and you don't really want to write many, many small assert statements, but just want to compare the output. Or maybe you just made some small changes and want to make sure that there are also just some small changes in the output. Yeah, and this is the process we have to implement for this. We have a program, and we run our program and capture the output. Now we have some files, or maybe some data set entries, that show us: this is how things normally look, what they should look like. Then we change our program, and then we run it again and capture the output again. Now, if nothing has changed, we can assume, if we capture enough output, that the program is still correct. So if you do some calculations on a data set and have maybe two Excel files, and you compare whether the results in Excel file one, from the unmodified program, are the same as in Excel file two, from the modified program, and everything's the same, then it's good. Now, it gets slightly more interesting if you ask yourself what we do when things are different. This brings us into, well, why these kinds of tests are actually known under different names. What I call golden master tests are also called characterization tests, because the execution output characterizes the run. They're also called approval tests; the assumption there is that you'll have to approve changes quite often, because quite often, as our program changes, the output changes too.
Or you could call them snapshot tests, because especially in web programming, in the JavaScript world, you basically take a snapshot of your web page, compare it with another snapshot, and check whether things are still correct. So you might find this under different names. Now, the most simple way to implement this is just to say: okay, I capture all the output, whatever it is, compare it to all the new output, and if there's something different, I'll have a look. But this can be quite tedious. Imagine you have a program that produces, let's say, a couple of megabytes of output and log data, a log file, or you have a large website that you generate with some Django templates. You don't really want to say: okay, something's different, how do I find it? So you need to look at some implementation decisions. If you follow this execution flow: first of all, we want to introduce this into our usual test framework. So we want to have something that we can put inside an assert statement. And we want to check if data already exists. If no data exists for this test, then we just capture whatever we produce. If we already have data, then we have to compare it. Then we'll have to create something that our computer, that Python, can help us compare. So essentially we want to diff some part of the output. When we do this, we can run our test, we have a slightly smaller piece of information that we need to look at, and we can hopefully look at the diff result and approve or reject the test results. That's an essential part. The first part, the check against stored data, will trigger whenever there is a change in the part of the output that you specified. And the approval process, that's where you say: okay, that looks different than last time, but it's still okay. Well, and in order to keep this manageable, we'll have to make some design decisions. We cannot in good conscience say that we just capture everything.
So we have to decide: what do we look at? Do we look at our program state? Maybe we just use some automated debugging and capture whatever the internal data structures of our program look like. Do we look at log data? Do we look at textual output, an Excel file, et cetera? So what can we look at from our programs that would tell us whatever we're interested in? Then the next question is: how many tests do we actually want? If we have just one important output, let's say just one Excel file, it might be tempting to say: okay, we just compare everything. As soon as one cell in this Excel file changes, we want to review it. Or we could say we break it down: we make, let's say, ten tests, and each looks at one specific part of the output, maybe at a different worksheet. Or if you have an image, maybe you say: only look at this particular area of the image, because that's where, maybe in a graph, we print out the relevant information, and the rest is just, well, more for interest. We have to decide how we actually store whatever information we're interested in, because at some later point we want to be able to go back and look at this information. First of all, we have our computer compare and tell us what's different, what has changed. But we also want to be able to look at it ourselves and say: is this a change that we really find relevant? And, well, to make finding the differences easier, we also have to decide on how we actually diff that. Imagine that you, let's say, have a Python program and you write out some of the internal state to disk. You could use pickle, you could use JSON, you could write it out as a, well, purely textual representation, you could write something yourself. And depending on what you choose, you have more or less readable data, and you have different data volumes that you would need to look at.
Well, and finally, whenever you do something like that, there's always something in your data that you most likely want to ignore. Let's say you have a run that you do every day and you create a log file. This log file most likely includes the current date and time. If you just did a comparison including the date and time, then you would have a change in almost every line for almost every run, every day. So, well, you would say that everything has changed. So whenever you design such a system, you need to decide: which of the data that I use for my comparison do I want to throw away, ignore, or simplify? The same with floating point numbers. If you have floating point numbers, maybe you don't really need all the digits; maybe you just take the first, let's say, three digits after the point and work from there. And, well, these are all things that you can decide for your little program: What do I want to test? What is it that I'm really interested in? How do I want to represent it so I can easily compare it and see whether it's still what I expect? And what are the things that will change with every run, even if the program itself hasn't really changed? Yeah, and what I did for this talk is I went for an extremely simple approach. The reason being that while there are some libraries that can do this for you, what I found is that these libraries can only provide so much in terms of workflow. They help you to integrate your tests with pytest or with other unit test frameworks, but all these decisions that we saw before are decisions that you will have to make for your program anyway. So the libraries, for me at least, didn't add too much. On the other hand, Python provides us with a lot of interesting little modules and built-in features that make it quite easy to take a piece of data, transform it in such a way that only the relevant parts remain, and later on compare it with something different.
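The two cases just mentioned, timestamps that differ on every run and floating point numbers with too many digits, can be normalized away before the comparison. A minimal sketch; the timestamp pattern and the three-digit rounding are illustrative assumptions, not the speaker's code:

```python
import re

def normalize_log_line(line: str) -> str:
    """Strip run-to-run noise from a log line before diffing."""
    # Replace date/time stamps so they never show up as changes.
    line = re.sub(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}", "<TIMESTAMP>", line)
    # Keep only three digits after the decimal point.
    line = re.sub(r"\d+\.\d+", lambda m: f"{float(m.group()):.3f}", line)
    return line

print(normalize_log_line("2021-10-17 09:15:02 duration=2.000013 status=ok"))
# <TIMESTAMP> duration=2.000 status=ok
```

Running every captured line through such a normalizer means the diff only ever shows changes you might actually care about.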
And so for my little test case, I just went with a basic JSON format. I went for jsonpickle, so I can pickle an internal data state as a JSON file. But, well, if you wanted to do something different: for an image you have Pillow; for data frames, you can either compare them as Excel, there are even external diff tools for Excel, or you could also write them out as JSON. That's one of the things where Python and golden master testing work quite well together, because in other programming languages, you would either have to implement everything yourself or you would have a harder time representing internal state in a diffable format. To a certain degree, we talked about that: what do we want to ignore? In essence, everything that will change even if you didn't change the program that you want to test. That changes anyway, so there is no reason to keep this data in. On the other hand, if you have a log file and you want to compare execution times, you don't really care whether it executed on a Friday or on a Wednesday, but you want to know: from the time that this particular program part started to the end of this run, it shouldn't be more than two minutes. Then you would need to take this log file and transform it so that you have the difference in time while you ignore the absolute time. And we already went over that: most likely you would start with just one large test, or a couple of large tests, and then realize as you execute them that you have to approve too much at once, that it shows you too much data, and then you break it down into many smaller tests. Yeah. And, well, once you have your data, you'll probably want to store it, meaning that, first of all, you have to be able to compare it to the next run, but in some cases, you might even be interested to see how it changed over time.
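The execution-time example, caring about the elapsed time of a program part rather than the absolute date, amounts to transforming the log before comparing. A possible sketch; the timestamp layout and function name are assumptions for illustration:

```python
from datetime import datetime

def run_duration_seconds(log_lines):
    """Turn absolute timestamps into an elapsed time, so the golden master
    comparison checks the duration instead of the wall-clock date."""
    fmt = "%Y-%m-%d %H:%M:%S"
    first = datetime.strptime(log_lines[0][:19], fmt)   # timestamp is chars 0-18
    last = datetime.strptime(log_lines[-1][:19], fmt)
    return (last - first).total_seconds()

lines = [
    "2021-10-15 09:00:00 INFO step started",
    "2021-10-15 09:01:30 INFO step finished",
]
print(run_duration_seconds(lines))  # 90.0
```

The stored golden master would then contain the duration, which stays stable across runs, instead of the dates, which never do.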
So you don't only compare the latest with the previous results, but you go back some time, just to check whether you have some drift in your data. That could be interesting, for example, if you don't really program the system for one specific output, but if you train some kind of model and want to look into drift in the model. Okay, now we go for an extremely simple implementation; we keep things extremely simple here. I just use jsonpickle to have a more or less human-readable, but also machine-friendly format. You could use XML, you could use a custom text format, but for most of the things that I worked with, JSON is usually something that is easy to get and good enough to compare. I just went for a standard unified diff. We'll see this in an example; it just allows you to see how my object changed. And what I did here was just: I write some JSON files somewhere in my data directory, so close to my program code, and I just keep two versions, the known-good version and the latest version. And I implemented just four operations. The first is a check operation; you call check with a name, and it just says: this is my data, this is the name, please compare it to the previous one, or if you don't know it yet, store it. You have a list operation, usually something that you would run from a command line, that just shows you what the current status of all these golden master tests is: is there currently a conflict between two different versions, or are the versions equal? And finally, for your approval process, you have the review and approve steps. Review just shows you what is different between the current and the last result. And approve just says: okay, I know that this latest version is different, but that's okay. Just save it as it is and consider the new, changed version the new, well, golden master, the version that you test against. I went extremely simple here.
So I just have a simple Python class, just for testing purposes, with, well, some different data types in there. And for the example, we'll just use two different instances of the data. We create one class instance with all the default data here, and then we just change it a little bit: we remove one entry here and we change some data here. Here we don't really have to filter or ignore anything. You could always say that when you prepare the export, you take your exported JSON and just remove some of the keys, but for simplicity's sake, we'll just look at everything. And this is what a standard run would look like. I just have one little class, and this class just stores where it should keep the golden masters, so all the data that we're working from. And then, in order to run or to use golden master tests, assuming that we don't have any data previously saved, we just take our test data, or test class, provide a name, and say check. The first time we say check, we just save the data. As you'll see later, we just create a JSON version and write out this JSON version. And since there's nothing to check, this is a success. Now, the next time we run the same thing, with the same test data and with the same name, the assert statement will still be true, because the data hasn't changed. Normally you wouldn't run these right after one another; in between there would be changes to your program: okay, this was my last run, this is my next one, nothing has changed, okay. Now, if we have some changed data, then the same thing will happen, but now the assert statement will fail. Okay. So I know that my results have changed, and I have a result that differs from the golden master. What I can then do is just list the differences and see: okay, test one, everything's okay; test two, there's a difference. So I'll say: okay, please help me to review test two. What is different here?
And if I'm convinced that the result is still correct, I would just run the approval. Well, and from a user perspective, it could look like this. You list the results and you see: okay, name one, everything's okay; name two, there's a difference, and the difference is apparently here. If I review name two, I just get the unified diff format; that's what the, well, standard library gives me, and I see some value here has changed. So you see, here it was 2.8, now it's 2.3. And we have different entries in our array. And if I say, okay, that's still okay, that's basically what I expected, I did make some changes, then I can approve it. Okay. Now a quick run over the Python implementation. I think it's a hundred or 80 lines. Well, it wasn't much work. And I guess what I mostly want to show you with this is how many of the required parts are already part of the Python ecosystem. You can just take your object and say: jsonpickle, please provide me with a text representation. And then we just check whether we have already saved something with the same name. If we have, we just compare it with the previous version. And if it is the same, well, then everything's good and check returns true. If not, we return false, and we save both versions. That enables us to later list which of our tests were different and which of our tests were the same. So, yes, I guess I'll keep it extremely short on this slide, because it's really not such interesting code. You should just remember it's basically ten lines or less. And then you have review, and in review you just load the two different files again, prepare a diff, and print out the diff in some nice format. Here it's JSON, but you could also take two image files and just use Pillow to say: okay, these pictures changed. Or you could use openpyxl and load two Excel files, or it could be two log files that you have preprocessed with some regular expressions.
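The review output shown on the slide, a value going from 2.8 to 2.3 plus a changed array, can be reproduced with difflib on two serialized object states. This sketch serializes `__dict__` with the standard json module as a stand-in for jsonpickle; the class and field names are hypothetical:

```python
import difflib
import json

class ExampleData:
    """Hypothetical test class with a couple of data types, as in the demo."""
    def __init__(self, value, entries):
        self.value = value
        self.entries = entries

def to_json(obj):
    # Stand-in for jsonpickle.encode: pretty-print the instance dict,
    # sorted so the serialization is stable between runs.
    return json.dumps(obj.__dict__, indent=2, sort_keys=True)

old = ExampleData(2.8, ["a", "b", "c"])
new = ExampleData(2.3, ["a", "b"])
diff = list(difflib.unified_diff(
    to_json(old).splitlines(), to_json(new).splitlines(),
    fromfile="golden_master", tofile="latest", lineterm=""))
print("\n".join(diff))  # lines prefixed with - and + show the two changes
```

The pretty-printed, key-sorted JSON is what makes the diff readable: one changed value or one removed list entry shows up as one or two diff lines.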
It's more the idea that counts than this implementation. And finally, approve: approve just means, okay, take the new file, move it over the old file, and the whole process starts anew. There are some existing libraries, if you're interested to see how other people implemented this, which is mostly interesting if you want to have a really good integration into, for example, pytest or other unit test frameworks. Look at these. This one, well, let's say from a code-writing perspective, I like it more; it's easier to handle. This one is interesting because it is the Python implementation of approval tests, which was originally written in C++. So it's an interesting, different take on how approval tests can work in different languages. And we've already reached the end. Why would you be interested in golden master tests, approval tests, snapshot tests, however you want to call them? Whenever you don't really want to write assert statements by hand. So if you have a complex object, you don't want to assert by hand that every property is the same, but you want to test a lot of data at once. And your idea is that it won't change too much between runs. So when you do that, you just capture whatever is relevant, filter out everything that you don't need, and compare the two versions with each other. And yeah, when you do this the right way, which will take some fine-tuning, you will end up in a situation where the computer does most of the work for you and you just have to look at some results and decide whether they are still correct. So it's basically a good combination of machine work and your work. Why should you do it in Python? Well, because Python has so many modules that can help you transform data into different formats, filter out data that you don't need, and look at output in so many different formats, that it makes it extremely easy to run these tests.
So I would even say, even if you work in C++ or another language for your main implementation, it might be worthwhile to have some Python scripts just to run these specific tests. Because, well, it's basic data modeling, and this is, I would say, one of the things where Python is really good. Okay. And that was it. So in the future, if you look at a large log file and say, okay, I don't know whether this is correct, you might just start to say: okay, what changes would I look at? What's really the interesting part? Maybe at first I compare everything, then I select a certain part, and then I automate the comparison process for this particular relevant part. Well, yeah, that's it. It's a simple idea, and it gets even simpler with Python. Yeah, we have a couple of questions. The first one is from Andy: do you have any external tools to recommend? And let me add one more, also from Andy: something to use without editing the code. The code, yes. So the assumption is that your production code should not be changed and you want to compare the results. Well, you don't have to do the filtering in your production code. Let's take an example, take a log file. You could say: as part of my test preparation, I just break the log file into different segments and filter out the data. The result of that is what is finally saved and compared. And for external tools, there are many different diff tools. I'm on the Mac and use Kaleidoscope, which is basically just a three-way diff that can also compare images. And since most of the formats that I would work with are somewhat textual, it works quite well; it can compare pretty-printed JSON, for example. I guess that Excel, for example, does have some kind of diff functionality, so you could use that. But I think ultimately the idea would be to combine some preprocessing, that is, filtering.
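The preprocessing idea, filtering the log outside the production code so only the relevant parts get saved and compared, might look like this. Keeping only WARNING and ERROR lines and stripping the timestamps is an illustrative choice, not something prescribed in the talk:

```python
import re

def extract_relevant_lines(log_text):
    """Reduce a raw log to the lines worth comparing, minus timestamps."""
    relevant = []
    for line in log_text.splitlines():
        if re.search(r"\b(ERROR|WARNING)\b", line):
            # Drop the leading "date time " prefix so reruns diff cleanly.
            relevant.append(re.sub(r"^\S+ \S+ ", "", line))
    return relevant

log = """\
2021-10-15 09:00:01 INFO starting up
2021-10-15 09:00:02 WARNING cache is cold
2021-10-15 09:00:05 ERROR could not reach backend
2021-10-15 09:00:09 INFO done
"""
print(extract_relevant_lines(log))
# ['WARNING cache is cold', 'ERROR could not reach backend']
```

The filtered list, not the raw log, is what gets stored as the golden master, so the production system never has to change.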
And maybe some external tools, or, well, if it is as simple as here, just print it out. It depends on how large your test is. And if your program produces two megabytes of output and you don't want to change your code, I would still preprocess it. Then you could look at maybe ten little files, and the initial approval test, the automatic part, sees that something has changed and already guides you to the changed, interesting segment. So Andy has a follow-up question on the same topic. He's asking whether this also fits the use case if we replace the C++ version with a Python script, the C++ version of the program, of the production system, the system under test, I suppose. Yeah, why not? It depends on how much data you need to process. To give a counterexample, if your C++ program is some kind of video editing software and you have gigabytes of data that you have to test in something resembling real time, then probably you would need something that is quite optimized, and maybe you couldn't do that with a simple Python script. But if it is something that you can normally process in Python, I would do it in Python. Of course it depends. As usual, if your whole organization only knows C++ or C, and yours would be the only Python program, then you probably wouldn't want to do it all in Python, because your colleagues couldn't work with it. Yeah, that's why there was this slide with the design decisions. It depends very much on what you want to do, what output you have. So, for example, if you had video, I'm just thinking aloud, you could say: okay, I only look at every fiftieth frame, and I want to see if there are large differences from what I would expect.
So if I would usually expect somebody in front of a background, like you with your background, and I see something completely black, just a black frame, that is something that I would need to review. It's a tool; you'll have to apply it. I hope that answers the question. We have a last question, from Christof. Yeah, you were about to say something? Yeah, so if we want to get more specific, maybe we can do it afterwards in the chat; I'm quite curious what the scenario is. Yeah, so the question is: is this valid for database tables also? And if so, are there any frameworks to do so? I'm not aware of any predefined framework. I'm pretty certain that you can do database diffs; I would be extremely surprised if this wasn't a thing, given, well, how central databases are. I guess what I would do, just to keep the number of external libraries and external tools down, is say what I'm primarily interested in and print that out. So if I have a standard test run that creates a customer, does some transactions, and then writes it out, then I would look at these customer records only, even if I have a couple of hundred thousand other records in there just for load testing purposes or to provide some background. So the filtering part and the selection part, what is the change that I would expect, what do I want to look at, that's something that is extremely application-specific. And, well, to come back to the original question, I would just try to get it into some kind of diffable text format, also because I can then store it and look at changes over time. But again, it depends. If you're talking about terabytes of data, that wouldn't work. All right, thanks a lot for joining. People, please join the next talk, and the breakout channel for any further questions if you have them. Thanks a lot, Stefan, for joining. You're welcome. Have a good rest of the conference. Thank you.