Hi everybody. So, now for something completely different: I'm going to talk a bit about what we're doing at Raincode Labs with LLVM. Yes, I did unmute the microphone. My name is Johan. So I was tasked to do this talk, and then the people from marketing came and said, you need to talk about our company, and they gave me some slides. I will go through them as quickly as possible. There are dragons in the room. I don't know where they are, but there are dragons in the room. Do you get the reference? Some people do, some people don't. Okay, whatever. So what is this company about? This company is basically a compiler company. What we do is build compilers; this is the core of the company. Clients come to us, they need a compiler for something, we build them a compiler. You need a compiler, I build you a compiler. I need a compiler, I build you a compiler, et cetera, et cetera. Okay? So this is pure marketing speak. What this slide is trying to say is that usually a compiler division is part of a big company. You have Intel, they build compilers. You have Microsoft, they build compilers. Apple builds compilers, et cetera. Really small companies that do not depend on an external big entity, I don't know any. And people claim that, okay, we are the first, meaning the biggest, small independent compiler company in the world. I told you: marketing speak. So what do we do? A client comes to us and says, okay, we have a problem, usually something in legacy languages. We have this old legacy language, there's no longer support for it, but the core of our business is implemented in this legacy language. What do we do? Do we rewrite everything? The big company started writing it 30 years ago, and it's not 30 man-years, it's 30 years multiplied by the number of people. Do we rewrite that? Or is the other option okay: let's try to take all of this source code and compile it to something else.
So we do consultancy services. We say, okay, this is feasible, this is not feasible. If it's feasible, we can implement the compiler for you. I told you: I can build you a compiler, I can build you a compiler, et cetera. Full scope, I mean, from the beginning all the way to delivery. That's what the company does. So we're a compiler company. Expertise: all of these things are in our DNA. I have a guy sitting in front of me who was a big DSP guy. I am more a DSL guy. We have some people doing this, strangely. All of that. Okay. So what do we do in terms of projects? You guys are all interested in this, but I won't talk about it right now; I'll start with something else. This is a cool one. This has nothing to do with Java, and nothing to do with BASIC, but it's called GBasic. Whatever. My goal here is also to surprise you at some point; I have a number of surprises. First surprise: I don't work on this, I actually work on this. The story behind that is: yeah, we need somebody to go and talk about that, but the guys who are working on the LLVM work don't have time, they can't be here. We need somebody to talk about the stuff that we do, so let's go and talk to Johan. What? Okay. So I know some stuff, but sadly I'm not the guy who is technically doing this. I'll try to answer your questions as best as possible. So people right now take old languages which are no longer supported, take all of the source code, and generate COBOL. Who here knows a bit of COBOL? One, two, three, four, five. No, no, no, it's not because somebody is supporting it. Think like a manager for a moment. Try. Try very hard. Yeah, I know, it's hard. It's very, very simple. Yeah, okay, let's not go there. COBOL is a thing that has existed for 30 years. It's extremely stable, it doesn't break. It has existed forever, and it's going to exist forever. That's basically the idea.
So we, at some point, sadly, generate COBOL. I wanted to shock you; this is the least of it. So who are these clients? Big companies that I can't tell you about: banks, insurance companies, those kinds of big companies. Take a bank or an insurance company. An insurance company, for example: right now, all of their assets are in their software, and all of that software has existed since the 70s. It's written in COBOL, it's written in PL/1, it's written in all these kinds of things. If the mainframe ever goes down, the company is bankrupt. End of story. Those are the kinds of things we talk about. We work with these kinds of companies, but we are not an old company. Actually, we talk a lot to academia. In my previous life, I was a university professor. We sponsor a number of events; for example, we'll be at Compiler Construction this year. Et cetera, et cetera. This is the last marketing slide; I don't feel good saying this. So let's go. End of the commercial. What am I talking about? First, a bit of background. What's the background of the company, where do we come from? We come from Raincode, which is different from Raincode Labs. And the motto of Raincode is mainframe to .NET. So what does Raincode do? Mainframe programs: we try to run them on .NET. Simple. Makes sense. So what am I talking about? We have a PL/1 compiler, we have a COBOL compiler, and we have an IBM assembler compiler. Yes, an assembler compiler. It exists. The first work was actually done on the PL/1 compiler; the PL/1 compiler is the oldest compiler. After that, it got going: work was started on the COBOL compiler, which caught up at some point. The ASM compiler came later because it's really tough stuff, but it's coming along quite well. Why do you need all three? Because usually on the mainframe, two or three programs work together in different languages. You have COBOL calling PL/1, some assembler in between, et cetera.
So if you really want to have everything, in all cases, you need all three. Not everybody uses all three; some only use one, some only use two. But if you want to have the full service, you really want all three. So we work for mainframe people, so stuff which is 30 years old and solid is good. What is really, really crucial for the company is backward compatibility and stability. We cannot break the software, or the insurance company goes bankrupt. That's the key. This is very, very difficult and very, very different from what I did previously, when I was doing research, where whatever you do, whatever you want, it doesn't matter. But here, stability is core. For everything that we do and everything that we support, we need to be sure that we have stability and backward compatibility. So one of the things we do is we do not have any external dependencies if we can avoid them. We're not going to use the latest, greatest new tool, because we don't know what it's going to be like in 15 years. Any external dependency, we try to avoid as much as possible. This is also why we have our own compiler building infrastructure. This is not a typo: this infrastructure was first built 25 years ago. It has been worked on, maintained, et cetera. It's ours. We fully control it. We are sure that we have something stable and backward compatible. The one thing we do need is a C compiler: we generate C code, and then the C compiler turns that C code into executables. Okay, let's try to shock you again. A C compiler can change, and then... Yes, I know. But you need to have some fixed point, right? GCC 2.95. I have no idea what version of GCC they're using now, and I don't want to know either. Luckily, I'm not the guy that builds the infrastructure. Let me show you some COBOL. This is what COBOL in general looks like. Yummy, yummy.
This thing is a comment, because there is a star at position six on this line. Yeah? This is a loop, of course. And this is not a move, this is a copy. This is COBOL. I learned this last week; I still don't know everything about COBOL, which actually makes sense because the standard is huge. There is a statement in COBOL which is called EXIT. What does it do? It doesn't do anything. If you write EXIT. it does nothing. Yeah, whatever. So, okay, COBOL. COBOL is really, really very weird. Let's just leave it at that. So that's one example. I won't show you the assembler, but I'll show you PL/1. PL/1 looks like it's okay, but then you encounter something like this. Keywords are not reserved words: you can have a variable and call it IF, you can have a variable and call it ELSE, et cetera. Yeah? Parse that. Try to make sense of that. So because of these strange kinds of things, we really have a parser which is different from all the other kinds of parsers we usually build. Some of the steps need to take into account where in the source code file, at which offset, the little star is. You need to do some clever backtracking: is this a variable, or is it a keyword, et cetera. So there has been a lot, a lot of investment in getting these parsers to work, and it's tough. It's really, really tough. Okay, LLVM. So let's start with what we have right now. What we have right now is a PL/1 compiler which uses LLVM as a backend. It took us three and a half man-years in total, which is not a lot if you think about the complexity of the language. It doesn't support everything; it supports about 75% of the spec. The spec document is huge and complicated. It's online if you want a bit of light reading; you'll find it on Google easily. So this is the core of the work we've been doing with LLVM: getting a PL/1 compiler with an LLVM backend in the end.
I'll talk more in detail about that later. We have a COBOL compiler. This COBOL compiler, when I wrote this slide, was two weeks of work, so now it's, I guess, three weeks of work. And two weeks ago it could do Hello World, which is great for two weeks. But the thing is, it's not two weeks starting from scratch. The point is that there is a lot, a lot of shared infrastructure. A lot of this infrastructure was written, in this case, for the PL/1 compiler, and a lot of it we can reuse for the COBOL compiler. That's what makes it possible to reuse all of this infrastructure and have a compiler that compiles Hello World in COBOL after two weeks. All right. This builds on our original PL/1 compiler. Our original PL/1 compiler was a mainframe-to-.NET compiler, with .NET at the end. Okay. So what does it look like? We take source code, and we have our custom parser. Remember, keywords are not reserved, so we need to account for that, and there are a lot of other things you need to think about as well. The custom parser gives us an abstract syntax tree. We do a bit of semantic analysis on that, which gives us the types and the cross-references. PL/1 is statically typed, but not really: you can also forget to type things, and then the compiler tries to figure it out for you. Usually it's right, but sometimes you get weird stuff. So that gives us an abstract syntax tree tagged with semantic information: types, cross-references, analysis results. And then we take this tagged abstract syntax tree, we do code generation, and we output Common Language Runtime assembler for .NET: DLLs. All of this is written in our own infrastructure. We have this parser infrastructure called DURA and other things, but let's just say that it's all YAFL in the end. So that was what came before. Then we had this client who was interested in our technology, and he wanted a PL/1 compiler, but he wanted to have it running on Linux.
So .NET was not really a suitable option there. Well, it's a long story; I won't bore you with that. So, the first LLVM version. Remember I had two lines there: the first line is all of the parsing, and the second line goes from a tagged abstract syntax tree to the Common Language Runtime. We just replace the second line, so we reuse the parser and we reuse part of the semantic analysis. The first version was: let's just generate plain C code. We do an intermediate step in between here, the gentry step. I didn't choose the name. What does gentry do? It does a bit of simplification of the abstract syntax tree, because, for example, in PL/1 you can have nested procedures: a procedure in a procedure in a procedure in a procedure. And then you need to think about the scoping of lexical names, which makes the AST a bit complicated. So this gentry thing simplifies the AST. It also flattens the control flow. And from there, the first attempt we did was to do code generation and actually generate plain C code, then pass that through a C compiler, and we have a running executable. Okay. So the "one man" here is actually one person doing it on his own. He didn't have a lot of experience with PL/1, so this was also a big learning experience for that person and for us as a company. Issues: you lose your debug info. Once you have the C code and you want to debug it, you don't know where things come from. You lost your debug info, so it's very hard to debug the generated code. The semantics of things in the C code are not clear: if I have a name in the C code, is it the left-hand side, is it the right-hand side? Do I want to treat this as a pointer or as a value? It's really not clear. And the executables at the end turned out to be very slow. Two reasons for this: the code generation was suboptimal, and GCC, which was used as a backend, didn't do all the kinds of optimizations that, for example, LLVM can do.
And the entire process in the end was too complicated, and we were starting to hit a wall at that point. And then we were lucky, I guess I can say we were lucky: our client said, you know, actually what you would like to do is use LLVM. All right, so let's switch. Let's throw this out, use it as a learning experience, start again, and go to LLVM. What we do now is talk to the backend via the in-memory intermediate representation. So what do we do? This took more man-years; there's more than one person working on it now. From this simple tree with the flattened control flow, we use the C API to generate the in-memory form of the intermediate representation. So yeah, the client wanted this. We did a thorough rewrite of this pass, and we integrated the lessons from version one. Why the C API? Stability, stability, stability; backwards compatibility, backwards compatibility, et cetera. Remember the previous talk: LLVM changes so fast. We want to deal with that. Stability is key; we want this thing to work from now until forever. So the developers were looking: okay, what are our options for generating something as stable as possible? There's the C++ API, which is very good and will be supported, and the C API, which is also very good but not so well supported. But then you look at a procedural language like PL/1, and you try to take a square peg and put it in the C++ round hole: it doesn't work. There's an impedance mismatch there. So the developers made the decision: okay, let's go for the C API, even though we feel that it's treated as a second-class citizen. I have some examples in the next slides, actually. So, we're on LLVM 5, and there are some things in the C API that we are missing. First of all, debugging information on variables, the metadata of variables.
This is something which was discovered quite some time ago, and there is actually a patch, which has been in review for six months, but the thread is dead; nothing has happened there. And by the way, the C++ API has it. So the C API is a bit of a second-class citizen. For us, it's not really a big issue, because the Go compiler actually had these patches first: before the LLVM C patch was submitted, Go had patches that worked for their compiler. So we took them, we reworked them a bit because they didn't compile out of the box, we applied them, and they worked for us. We're still using these patches because they work and, sadly, nothing is happening upstream. I don't have a URL for you; I didn't look it up at the time, and yeah, I should have done that, sorry. There are things missing for the mainframe, which is actually not such a big surprise, because the mainframe is a strange and old beast. You see things there that you don't tend to see anymore. Packed decimals: does anybody know packed decimals? Some people know packed decimals. I've seen the DWARF standard. The DWARF standard says that you should have all the PL/1 and COBOL types, but they're not there. I understand. Endianness: mainframes are big-endian, like the Motorola 68K, SPARC, the original PowerPC, et cetera. Intel is little-endian. So yeah, that matters when you want to start debugging. Floats: in the 60s there was no IEEE standard for floats. IBM has their own floats; they're not in there. Makes sense, huh? Good news: EBCDIC, this thing which I still can't pronounce, the pre-ASCII encoding of characters, works out of the box. In GDB, no problem at all. So that's great. So I was talking to the guys and said, okay, that's all cool, but what's your feeling about it? How do you feel about this? What's really notable? The tough part is actually just doing the mapping from PL/1 to the LLVM intermediate representation. All the rest is easy. So that's really cool, huh?
Because the stuff which is tough for us is tough, and the stuff which we don't care about, that should just work, is easy. So that's great. That's really, really great. We use as simple a setup as possible, plain vanilla, even plainer than that, because we have this requirement: stability, compatibility, no dependencies. We are very simple people because of this. It's cool that you're doing all this advanced stuff, but it's not for us. As simple as possible, so that we are sure we won't have problems in the future. And it actually only took us three to four days to go from version 4 to version 5, which is cool. That's done in one week, boom. It took us that long because we have this old, very strange thing that we needed to regenerate our bindings for, so that took a bit of time. And we had to reapply the Go patches again. That's all we had to do. It's great. Here's something that happened a couple of weeks ago, actually; this much I understood. So in a test, we're compiling a test program. This program takes 30 seconds on Windows; on Linux, it takes 12 hours. Something's wrong here. So we did some digging and found the cause: it's trying to compile a basic block of about 4 million intermediate representation instructions. The inliner went a bit crazy, whatever: 4 million intermediate representation instructions in one block, and it's trying to compile that. So what's happening there? When you calculate the offset of an instruction inside a block, this happens in linear time: it walks the list, apparently. That's fine if you have to compute it for a handful of instructions. But when you're generating code, you're going to do this for all the instructions in the block, so it explodes. What's the fix? Limit the number of instructions in a basic block. That way we're just limited to a reasonable number; the underlying problem is still there, but we won't face it anymore.
So let's just fix that. Ah, yeah, but the C API doesn't give us a count of how many instructions are already in a basic block; the C++ API does. So, okay, our solution: we have our simple tree, which has a number of nodes. Let's just say, if we are in a basic block and we have already passed more than 100 nodes, we cut the basic block off somehow and start the next basic block. Problem solved. We still don't know why it's 30 seconds on Windows and 12 hours on Linux, but we'd already spent enough time on this, so we couldn't go on. Windows is clearly better. I will not comment on that, for many, many, many multiple reasons. So yeah, this was already two days of work. We're a company; we need to move forward. We don't know what happened there, we have a fix, we need to move on. So then I asked the people I was talking to: okay, you guys seem very happy, so tell me, what's really nice? What's cool about LLVM? The first thing that's really cool about LLVM is that it just works. This is great for us. I mean, this is not part of the stuff that we want to spend time on. Most of the job is taking PL/1 and translating it to something that can run, so the fact that LLVM just works is great. The intermediate representation: we like it because it's documented, it's clean, it's focused, it does exactly what it should do. Great. The ecosystem: it's very nice to have a broad ecosystem, a lot of different tools, but also an active ecosystem. It's not a dead thing, it's moving, so it's really great. That's cool. What's not cool about LLVM? This is really a big stumbling block: the documentation of the C API. This was something that took a lot of time. For example, in the beginning it was not clear: if you allocate a string, who is then responsible to free it? Is it me? Is it somebody else? Where does this thing get freed? Do I need to do it? It's not in the documentation. You can't find it in the documentation.
So what did the guy do at that point? He said, okay, let's just generate a whole lot of test programs through the API and see what happens, and via the outputs, try to understand what the API does and what the responsibilities are. So there were three weeks of work just trying to understand what this thing does by generating code and looking at the output. That got him to a certain level of understanding: okay, this stuff works like that, et cetera. And from that point on, he just looks at the source code of the implementation of the API to understand what the API does and how you need to do things. So that's not really nice. And then I said, okay, do you have anything else? He said, no, that's everything. There has to be something, right? I need to fill my slide. Something, give me something. There should be something else, and there was a lot of thinking and humming and hawing. Yeah, okay, if you really want something, I can give you some stuff, but he wasn't really very convinced. Okay, so the thing he came up with: yeah, somewhere there's an assert failure in the backend, because we made some calls through the API and we built something, and we built it in the wrong way. And then we need to figure out where this comes from; we need to map it back to the source code. But yeah, that's actually an issue on our side, because this is our stuff. This is our YAFL code, which takes source code, parses it, builds intermediate representations, et cetera, and then starts calling the API. So this is a complex mapping; there is no straightforward way to make it trivial, no straightforward way to figure this out. It's a problem, but there's not much you can do about it. And then, yeah, the other thing, which makes sense: LLVM is big, it's huge, we're not experts, and it moves around quickly. So we don't really look inside.
It's only when something blows up somewhere that we try to figure out how we messed up to cause the explosion. Then we start looking around inside, and yeah, it's complicated, but that's normal; it's a big thing, so you can't really simplify it a lot. So I filled my slide, but actually this is the important stuff. So, that was a bit quick, actually. We are happy. We are very happy customers of LLVM. Do keep in mind that, by design, we keep our use as simple as possible. We need to take care of backwards compatibility, we need to take care of our stability, so we just do plain vanilla things, by design. The C API could be improved; it feels like a second-class citizen. There's no debug info, as I told you before, and there's this issue that you can't get the number of instructions inside a basic block, where the C++ API can. Our developers said, yeah, we thought about submitting patches, but actually, no, we didn't do it, because in the end these two things were all that we really found. And one of the developers said, yeah, I hate anything which has to do with bureaucracy, I don't want to spend time on all of that, the process is too heavyweight. I've worked in open source software too, not in LLVM, so I know you need the process, because otherwise it won't work. But yeah, he's a special guy, you know? You have them everywhere. Yeah, it's heavyweight; it's very difficult to justify the investment. I will push him a bit more, and we'll see what happens. Okay, this is where we are now. What are we going to do next? For this client, we need to finish the PL/1 compiler. The COBOL compiler, we just started on that, so we're going to move forward on it. COBOL as a language is simpler than PL/1; PL/1 is quite a complex beast. COBOL is simpler as a language, but there is more of it.
I mean, if you look at the definition of the grammar of COBOL, it's huge, gigantic, really incredible. So it's going to go forward linearly. I don't predict any big stumbling blocks, because of the simplicity of the language; it's just going to be a train that goes, and it will take its time, but it should be okay. Then there's the ASM370 compiler, the compiler for the IBM mainframe assembler. The version that we have is for .NET; the version we should have will be the one with the LLVM backend, which is going to be quite complicated, but that's a long story and I won't bore you with it here. It's not clear when there are going to be people working on that; that's why there's a question mark there. The plan is to do it, but I don't know when. This one is at 75%; this one, we're starting on now. This one, the idea is to start on it, but it's not really clear to me when that is going to happen. And that was all I wanted to say. If you have any questions, go ahead; I will try to answer them. Question: I can't remember if you explained exactly why producing LLVM IR directly had a benefit over producing C code. Is there anything you can say about that? So, the problem with generating C code was that it was a very complicated generation and a very complicated output. Maybe the generation step itself could have been redone to do it better, so there might have been a way to fix that. But the first problem that the developer came to me about was the lack of debug information: when you're debugging the resulting binary, there was no way to go back. So we took advantage of the fact that the client said, we want to do this on LLVM, to say, okay, let's switch. Question: about your special guy, you need to go to management. I totally understand where you're coming from on stability, and I've spent 20 years working in banking myself, but you're not gaining anything by leaving your patches out of the tree. I know.
I'm sorry, you're preaching to the choir. Yeah, I know. I think you need to go and have a stronger word. Do you have some time tomorrow? Yeah, it's close to South Station, actually. Tomorrow, I'll be there tomorrow. Question: is performance better? All right, it's much better. Performance is much better, for two reasons. Again, the first time we did this, we were generating C, and the C that we were generating was not optimal, so we learned from that to make sure that the intermediate representation we're generating now is more optimal. And then you have the LLVM optimizations versus the GCC optimizations. No comment. Question: the client is only targeting Intel, but with LLVM you should be able to target other things as well, right? Yeah, but we're a company, right? If the client doesn't want it, we're not going to spend time on it. We're targeting Intel because that's what the client wants. Question: how do you deal with floating point? We need to have our own implementation of floating point, using integer instructions. Whether you could actually use native floating point instructions instead, I don't know. Question: you've talked a lot about backward compatibility. Do you also implement the undocumented behavior of your languages, or do you just implement the reference and tell clients they should fix their software? So the question is: do we also implement the non-standard behavior, or do we tell clients who want to use our compiler to change their software? We implement the behavior exactly as it is on the IBM mainframe, exactly, including all the weird, strange things that don't make any sense at all. We reproduce them exactly, because the core idea is that the client of the compiler does not need to change his or her source code. The idea is that you take the source code, you run it through a different compiler, and now it runs, in this case, on Linux. Question: so you have to adapt your compiler to have the same bugs as the old one? I'm sorry, I don't understand the question.
Question: do you port the bugs of the old compiler to the new compiler, so that software which accidentally depends on compiler quirks still works for the user? Yeah, the question is: if the compiler itself behaves in a non-standard way, do we also reproduce this behavior? The answer is yes. The thing should produce exactly the same output. The warning messages in between, yeah, those are not exactly the same, but that's okay. The output, the executable, should behave exactly the same as the original mainframe software. Question: let's generalize that question a bit. How do you ensure that the code behaves in the same way? The question is: how do we ensure that the code we produce behaves exactly the same? Yeah, this is testing. It's the entire testing question, and we do as much testing as we can, and then it also depends on what the client of the compiler wants. We have a huge, huge battery of tests, for example, for the COBOL compiler. Yeah, it's a huge battery. And we go to the client and run their workload to make sure that our result is identical to the one they have, because sometimes even your tests don't pick up everything, and then one client has the one thing. Yeah, the language is so big, and people are using it in so many creative ways, that there's no way we can test everything. This is also why we do installations: we have people who work with the client, special technical people who are also good at communication, who go and do the installation, and they also take care of part of the testing, et cetera. It's not just: I make a compiler and I leave. No, no, no. It's much more than that. Mainframe migrations in the past used to be: my results are identical to yours in the demo, and everything else, nobody cared. The client didn't care, the marketing guys didn't care, the engineers didn't care. It just has to be identical when you sell it, and everything else is whatever.
And the engineers are saying, no, no, it's fine, it's fine. So, yeah, that is not us. Question: did you consider just writing an emulator? The question is: did we consider writing an emulator for the mainframe? I don't know, but I don't want to consider it. Time's up. Thank you.