Welcome. We would like to present to you the port of Valgrind to the SPARC architecture here at this Valgrind meeting. I am Ivo, and together with my colleague Tomáš we will present it in about 50 minutes. We both work at Oracle. We have been working on Valgrind on Solaris, and recently also on Linux. Solaris runs on x86 and on SPARC, and the obvious conclusion from that is that SPARC support was missing, so we worked on it on Solaris first, and when we took up Linux, we started the Linux port as well. So, what is SPARC, and where is the platform heading in the future? To give you an idea, one of the biggest current machines has 32 TB of physical memory and, I think, about 2.5 thousand hardware threads, and you can get configurations with 64 TB of physical memory and 4 thousand hardware threads. [the rest of this passage is unintelligible in the recording]
So we have many important applications that work with a lot of memory and that need to be checked from various angles, and that is one of the reasons why we want Valgrind on SPARC. Before showing how the port is done, let us look at the architecture itself, because it has several features that matter for Valgrind. [a longer passage is unintelligible in the recording] SPARC machines run under a hypervisor, so we can slice the big machine into smaller chunks, and each CPU runs in non-privileged mode, in privileged mode, which holds the kernel, and in hyper-privileged mode, which means it is actually inside the hypervisor's handling. Why is it important for Valgrind? It means the hypervisor is always there, and it takes care of some things, like interrupt delivery, all the time. So, SPARC has 24 general-purpose registers, I will have another picture, which are aliased into the input ones, the output ones and the local ones.
We'll probably see it on the next picture; it serves for emulating a function call, so we have something like input arguments, output arguments for the next function call, and the L's represent the local variables in a function. But beyond that there are more registers: there are real global registers, there are floating-point registers, and there is something called ancillary state registers, which are technically another kind of register for other usages. And yeah, I will now peek at the next slide and try to explain more about what the I's, L's and O's are. From x86 you are probably used to stuff being pushed to the stack directly. What the SPARC architecture has is many more registers than the number that is accessible at any single moment on the chip. This is called a register window: the I's, O's and L's technically compose a single register window, which represents a single frame on the stack, or one function execution. So whenever you call another function from the caller, the CPU can allocate another window, so it will basically select different internal registers and then perform a translation of the I's, O's and L's to the new window. And you can see that there is an overlap, which means that whatever you have put in the O registers are the output arguments of your function. So after you call a function, which will ask for a new window, the O's will automatically be in the I registers of the next function. I hope that makes sense. The operations are called save and restore: whenever you execute save, you allocate a new window, and if you execute restore, you are basically freeing the window, unwinding the mechanism back. The issue here is that this is all happening inside the chip, so you cannot observe this behavior on the stack unless the kernel or the debugger explicitly dumps the data to the stack.
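The overlap between one window's outs and the next window's ins can be sketched with a toy model (a sketch for illustration only: the register names follow the SPARC convention, but the window count and the flat layout here are simplified, not how any real chip or Valgrind implements it):

```python
# Toy model of SPARC register windows. Each window has 8 ins, 8 locals,
# 8 outs; the outs of the caller's window are the *same* physical
# registers as the ins of the callee's window, which is how arguments
# are passed on `save` without touching memory.

NWINDOWS = 8  # the real number of windows is implementation-defined

class RegisterWindows:
    def __init__(self):
        # 16 fresh physical registers (locals + outs) per window;
        # the ins are aliased to the previous window's outs.
        self.phys = [0] * (16 * NWINDOWS)
        self.cwp = 0  # current window pointer

    def _idx(self, name, i):
        base = self.cwp * 16
        if name == "l":    # locals: private to the window
            return base + i
        if name == "o":    # outs: shared with the next window's ins
            return base + 8 + i
        if name == "i":    # ins: the previous window's outs
            return ((self.cwp - 1) % NWINDOWS) * 16 + 8 + i
        raise ValueError(name)

    def get(self, name, i):
        return self.phys[self._idx(name, i)]

    def set(self, name, i, value):
        self.phys[self._idx(name, i)] = value

    def save(self):        # allocate a new window (function entry)
        self.cwp = (self.cwp + 1) % NWINDOWS

    def restore(self):     # free it again (function return)
        self.cwp = (self.cwp - 1) % NWINDOWS

# The caller puts an argument in %o0; after `save` the callee sees it in %i0:
rw = RegisterWindows()
rw.set("o", 0, 42)
rw.save()
print(rw.get("i", 0))  # 42
```

The point of the overlap is that argument passing needs no memory traffic at all; only when the chip runs out of windows do the spill and fill traps push them to the stack.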
Or there is the case where you arrive at a place where you have allocated all of the windows, so there is no room left for the next function call. At that point the CPU will issue spill and fill traps, asking the system to dump the windows somewhere to memory and then reload them back as necessary. So after you fill all the 8 or 7 windows, you will start seeing the old ones being pushed to the stack, available for debugging software. So yeah, this is quite tricky, and especially for Valgrind it gives us a lot of headache to try to emulate this behavior in the guest state. Yeah, you can see the globals are always the same, so they are not affected by the window switching mechanism. Yeah, floating-point registers: there is a bunch of floating-point registers sized from 32 and 64 bits up to 128 bits, but they overlap, so a double-precision register is actually two adjacent floating-point registers of 32-bit precision, and so on. So that was another thing worth mentioning briefly. So much for the register set. The next highlight of the SPARC architecture is control transfer. SPARC uses delayed control transfer, which means that whenever you branch to some different place, the CPU still executes one instruction after the branch before it gets redirected to the new destination. So there is an example; you can see how the branching works. For example, here we are branching based on condition codes (there are registers which hold the condition codes for us), and there is the destination. So technically this instruction is sitting in the branch delay slot, and it will be executed before the instruction at the new address. This is one part of the problem. The second part of the problem is that you may have an annul bit here, which says: if we actually haven't branched, please discard the instruction in the delay slot, because we are running on a RISC CPU and everything is out of order inside the chip.
So this instruction may have been prepared but not committed yet, and in case we didn't decide to jump, we need to discard it in the CPU's internal buffers. A practical usage: for example, a compiler can put the first instruction of the destination here and just say, if we haven't jumped, discard it, because I have preemptively executed something from the destination. As you can see, this is managed by having two program counters; this is another difference from regular architectures. So you have a program counter which points to the instruction you are actually executing, and you have a next program counter which points to the instruction you will be executing after this one. Technically it's always plus four, because we have a fixed instruction size, but in the case of a jump you will see that the program counter will be pointing to the branch instruction and the next program counter will be pointing to the branch delay slot. After we execute the branching instruction, we move on to the next program counter, but we have set the next program counter to the target address. So basically the transfer is handled by the two registers, and again this is something which is not easily supported in Valgrind, because there is no easy representation for this scenario. So, another nice thing about SPARC: whenever you load or store something from or to memory, we use 64-bit virtual addresses, but we also have something called an address space identifier (ASI), which is 8 bits of additional information telling the MMU how to alter the load or store, so it can do something different. Some of the ASIs are designed for privileged or hyper-privileged mode only, but some of them are accessible from user space, which basically means you can ask the MMU, for example, to load data from memory but switch the endianness. So the MMU will access the address in the argument but return you the byte-swapped value in the register.
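The endianness example just mentioned can be sketched as follows. The two ASI numbers are the standard SPARC V9 ones (0x80 for ASI_PRIMARY, 0x88 for ASI_PRIMARY_LITTLE); everything else is an illustrative toy model, not how a real MMU is implemented:

```python
# Toy model of how an ASI modifies an otherwise identical load:
# same opcode, same address, different behavior.

ASI_PRIMARY = 0x80          # the default: big-endian load
ASI_PRIMARY_LITTLE = 0x88   # same address, byte-swapped result

def load64(memory, addr, asi=ASI_PRIMARY):
    """Load 8 bytes from `memory` (a bytes-like object) at `addr`."""
    raw = bytes(memory[addr:addr + 8])
    if asi == ASI_PRIMARY:
        return int.from_bytes(raw, "big")
    if asi == ASI_PRIMARY_LITTLE:
        return int.from_bytes(raw, "little")
    raise NotImplementedError(f"ASI {asi:#x}")  # privileged / vendor ASIs

mem = bytes.fromhex("0011223344556677")
print(hex(load64(mem, 0)))                      # 0x11223344556677
print(hex(load64(mem, 0, ASI_PRIMARY_LITTLE)))  # 0x7766554433221100
```

This is also exactly the porting problem described below: the opcode is the same load either way, and when the ASI comes from the ASI register rather than the opcode, its value, and therefore the instruction's meaning, is only known at run time.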
There are also other groups of ASIs. Some of them are translating, which means they involve the MMU and use the address as a virtual one; some of them are non-translating, which means they can reference a physical address; and there are of course special ASIs which allow us, for example, to address something which is not tied to an address space at all, like a register. For example, the SPARC chip has extra registers usable as scratch space, so you can use ASIs to play with 8 extra registers if you need some spare place. Or the hypervisor can provide you special ASIs for performance counters, so whenever you read something from a memory address with that ASI, you actually invoke a routine in the hypervisor which will fill in the data for you, check whether you have permissions, and so on. So you can see the instruction itself is not different from a regular load or store, but it can have an ASI in its opcode. If it is a compiled-in ASI, we just say: do this load and use this constant. But to be more dynamic, you can also say: just load from this address, but use whatever is in the ASI register of the CPU. So you can dynamically change what the current ASI will be by writing to the ASI register, and only at runtime do you know what the instruction will actually do. There is a good example: an instruction to load a float from memory loads 8 bytes by default. "By default" means there is nothing like this in the instruction; this works because the CPU has a notion of default ASIs, so if you omit everything, it uses one of the default values described in the documentation. But whenever you put a different ASI on the instruction, it changes the behavior of the instruction, such as how many bytes it loads. There are also things like block stores, which for example transfer 64 bytes of memory instead of 64 bits, and so on. So this is basically the problem we have in VEX: if the ASI is not written there statically, we have quite a hard
time to do any compile-time translation of this address space information. We also have a big problem in the memory tools, because the tools don't know anything about what an ASI means. Some of them are even vendor specific, so they can differ between SPARC implementations from different vendors. And OpenSSL and other things can use them: SPARC has a lot of crypto instructions built into the CPU, and they are, for example, using a special ASI to do their stuff. Then there is a new feature of the new SPARC CPUs, ADI. The logic is that we are trying to detect memory issues in the code in real time, on the running chip. What we do is actually use unused bits in a 64-bit memory address, so we can put a color on a pointer. The color is stored in the cache line and later in the memory module, and on each access the CPU checks whether the color still matches. The CPU will issue a fault whenever there is a mismatch between the colors, and that way we can, for example, mark buffer boundaries by coloring things differently before and after the buffer, and detect overruns at runtime. However, it's cache-line aligned, so it works on 64-byte units; you cannot create a smaller colored chunk of memory. Usually, for a user process, we deliver a synchronous signal to indicate that the ADI check has failed, and it can be handled somehow; it depends on the tool. One last thing: on SPARC, whenever you do a syscall, you need a mechanism to enter the kernel somehow. On other architectures it used to be done by issuing an interrupt, or there are syscall or sysenter instructions; SPARC is still using a trap. However, SPARC has hardware support for traps, so each time you trap, the context changes slightly in the CPU, and you have some fixed number of nested traps you can create. For that reason, if you do a syscall, it will push the CPU to a different trap level, and the kernel needs to move it back to trap level 0 and find the syscall handler. On Solaris we also use something called fast traps, which basically allow us to
create a syscall which is directly handled in the trap-level context. It's used for some really high-speed stuff, like getting timers, out of your process. So instead of going through the whole mechanism, we just trap directly, handle everything in the trap-level handling code, and return back to the client, sometimes even without switching to privileged mode, because it's not required. This is also a big problem for us at the tool level, because we need to handle these differently. Yeah, so now: thank you, Tomáš, for introducing these interesting aspects of the SPARC architecture, and now I'd like to tell you quickly which parts of Valgrind are affected by SPARC support. This is my picture, so don't blame me. The most interesting part is of course done in VEX, because we need to provide a frontend which does the translation from binary to the intermediate representation, and we also need to provide an instruction selector and emitter to go from the IR back to assembly. Then there is a whole bunch of coregrind modules which needed to be hacked somehow, some of them slightly, some of them quite heavily, to support SPARC on Linux. Then there is GDB support, which works over the GDB remote protocol; we had to provide support for this in GDB because it was not there. And for the tools themselves there was not so much work required, only some work in Memcheck to support some more IR ops. So this is our guest layout. You can see the global, output, local and input registers, as Tomáš mentioned; then there is a whole bunch of floating-point registers, some ancillary state registers, and then there is support for lazy evaluation of the condition-code registers, for integer and floating point. So you can see we didn't emulate all the register windows as Tomáš described; we emulate just a single one. Why?
Because all the Valgrind machinery expects the stack pointer to be just an offset in the guest state, and the stack pointer is one of these windowed registers, so it constantly moves between register windows; we cannot simply say "this is the stack pointer", because it always changes. Another reason is that a user-space program has no visibility into the other windows; it just sees one. Also, for other Solaris-specific problems, we need to synchronize the guest state to the stack, to spill the register window, and it would have been very difficult to achieve that if we had all the windows. So we simulate only one register window, and we do an immediate fill or spill whenever we encounter save and restore instructions. We also have a scratch pad here, which we use for loading and storing some of the ancillary state registers which do not support moving between registers but do support moving between memory and a register. So that was register window support. Instructions in branch delay slots need to be simulated carefully because of the annulling bit, so we have a kind of branching code which the frontend generates; this is modeled on how other architectures do it, by evaluating the condition codes. Then there is the problem Tomáš mentioned: if the ASI comes from the ASI register, we have no way to tell at translation time what the instruction actually does. This problem is still not solved; we have just worked around it a bit. Also, because Memcheck does the shadow operations on the 16-byte floating-point registers by converting them to integer operations, we had to provide some basic support for 128-bit integer operations in the instruction selection. Like on the MIPS platform, we had hundreds of warnings from the compiler because of misaligned accesses. We have a problem with how to efficiently represent fast traps, which return five return values, so that is still to be solved. And we also evaluated the possibility of whether Valgrind can leverage Application
Data Integrity somehow, to check the addressability aspect of the program. Also, because loads and stores can be decorated with different ASIs, we had to put a new attribute here on the load and store operations. It's still questionable whether this is the right way to go, I don't know; basically it means that all the tools which somehow instrument loads and stores will need to be taught about the new things. For example, one specific address space identifier says: perform the load or store, but do not fault if there is a problem with addressability; so we will need to teach Memcheck about it, and it is still unclear how. So there are still a lot of open questions and problems to solve, and I hope to facilitate some discussion about these problems at this meeting. The current status of the Solaris and Linux ports: we have been working on the Solaris port for SPARC for two years, so it's quite mature; Memcheck basically works, the other tools somehow. And this month we have started the port for SPARC on Linux, so at this point it just builds, nothing else. And that's it; both ports at this moment live at this address, in the repo, under different branches. I think that's it, so now we take questions and answers, if there are any. [Question: I know there is support for GetI and PutI in the IR; have you considered using it?] We have considered it, and at first we had a prototype with GetI and PutI, but it didn't work, because there is no way to get the current window index from user space; you have no visibility into which window you are in. I would like to add that there are additional registers which keep the state of the window mechanism: there is a current window pointer, registers saying which windows we can save or restore, and how many windows belong to a different address space, because you definitely don't want to pass that kind of stuff to userland; but those are usually privileged-access only. So we had no way to tell which window it was, and we got lost. Exactly, yes. Yes, it is expensive,
because save and restore, pushing a window on and off, is very cheap in hardware: it just switches between the registers. The only expensive path is when you encounter a fill or spill trap. But now we fill and spill immediately on every single save and restore, so it's kind of unfortunate, but we were forced this way. For this reason, and also because of the problems with the two program counters, the current port is quite slow. I haven't done any comparison, but when I run something on x86 and on SPARC, it's noticeably slow, so some performance optimization needs to be done first. Any other questions? [Question: are the ASIs specific to particular extensions of the architecture?] On SPARC we have SPARC V9, then you have the platform families, sun4u, sun4v and so on, and then you have specific chips. At the top level, the only common thing you have are the ASIs in the SPARC V9 standard; those will be the common ones. But as you travel down the architecture, you may say: if I know I am running on sun4v, there will be additional ASIs. Currently we implement the basic set, without any extra things like that. [Question about code using these instructions.] Yeah, this is what we are doing: we actually replace some functions, with the Memcheck replacement machinery, with versions which avoid this stuff; we intercept some functions to avoid it. So that's the workaround currently: just avoid this stuff until we can solve it properly. But I fear the proper solution would be that if we ever encounter a write to the ASI register, we would basically end the basic block. Then, when this block is translated by VEX, we would need to check the current register value against the expected one, provide the proper translation, and have the ability to choose between different translations of the same basic block. [Comment: but that's the kind of thing you already have; you have translations which are conditional on some state.] Exactly, the state in which it can be translated.
That is probably the solution here. But it will be expensive, exactly, because all the translation caches and stuff now assume there is just one translation per basic block. [Question: how big is the ASI field?] It's 8 bits, and only half of them, 128, are accessible from user space. But each of the different instructions has a limited set of identifiers that make sense for it, maybe 20 for every instruction. Most of them change the scope of the whole operation. For example, to zero memory you would normally have to repeatedly write 64 bits to memory; with a single block-store instruction you write 64 bytes behind the pointer at once. That is most of what I have seen this stuff used for. Why they chose this way is hard to say; maybe they were running out of opcode space. It was a cool idea at the beginning, but it has been abused terribly. It's like the delay slot: it was a cool idea, it went into shipping systems, and now it's legacy; you have to deal with it, but no one is happy about it. It's legacy in the architecture. OK, any other questions? [Question: how usable is the Linux port actually, already? If I took the patch and put it in a Debian package...] No, no, you are too far ahead; there is no package yet. [Question: I mean, currently you have a patch, basically on top of the upstream version, and it adds something; how usable is that something, and could it already be given to some users?] Good question. Ask in a few months; just now it builds, and that's it. We have tried, but there are still things: it's not so easy to run anything right after the build, there is unimplemented functionality, and things break for some reason and need investigating. So it's at the beginning. The Solaris port is quite ready.
I have tried some complex programs and all of this is working; the remaining problems are either solved or worked around. So it runs and gives you the expected outputs. [Question: how big are the programs you run on the Solaris port?] Yeah, much more complex ones; for example, on storage servers they take gigabytes of memory just for code and further gigabytes of other data. [Question: is this stuff getting used internally inside Oracle to find problems?] It is, it is. The x86 port on Solaris is used heavily internally, and I have reports from people that they find it very useful, like in the Linux community. The SPARC port is not so widely known yet, so I don't know. [Question about illumos.] Ha, good question. I think illumos runs only on x86; there is some SPARC support, I don't know. [Question: so will it continue to work on illumos with x86?] I don't know. I was in contact with some illumos maintainers who are running regression tests on x86, and they package it in some distributions. I don't know about illumos on SPARC. Big parts of the Solaris port still apply to the older common code, as there is nothing really Oracle-specific there. [Question: so it's based on the latest public spec, I guess?] Yes, yes. The Oracle SPARC Architecture documents, which are extensions to SPARC V9; it's based on the 2015 one, I guess. There is maybe a 2017 one, probably a little bit better, I'm not sure about that; but it's based on the 2015 one, which corresponds to the M7 or T7. [Question about ADI.] OK, so this is the SPARC feature. It can be enabled on Linux; I don't know what the current state of enabling it on Linux is, but this feature needs to be enabled first, and it needs support from the memory allocator, such as malloc in libc. You basically tell it somehow to enable this feature for the current program, and the allocator behaves slightly differently: instead of just returning pointers like it used to, it now returns colored pointers, where the topmost four bits contain the version, the color.
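The colored-pointer idea can be sketched in a few lines (purely illustrative: real ADI keeps the version in the top bits of the address and checks it in the MMU/cache hardware at 64-byte granularity, while the class and method names here are invented for the sketch):

```python
# Sketch of ADI-style pointer coloring: a version number rides in the
# unused top 4 bits of a 64-bit pointer, memory is versioned per
# 64-byte cache line, and a mismatch on access raises a fault.

VERSION_SHIFT = 60                    # topmost 4 bits of the address
ADDR_MASK = (1 << VERSION_SHIFT) - 1

def colour(ptr, version):
    """Return a pointer carrying `version` in its unused top bits."""
    return (version << VERSION_SHIFT) | (ptr & ADDR_MASK)

class ColouredMemory:
    def __init__(self):
        self.line_version = {}        # cache line number -> version

    def alloc(self, addr, size, version):
        # The allocator colors the cache lines AND the returned pointer.
        for line in range(addr // 64, (addr + size + 63) // 64):
            self.line_version[line] = version
        return colour(addr, version)

    def access(self, ptr):
        addr, version = ptr & ADDR_MASK, ptr >> VERSION_SHIFT
        if self.line_version.get(addr // 64) != version:
            raise MemoryError(f"ADI mismatch at {addr:#x}")  # ~ SIGSEGV
        return addr

mem = ColouredMemory()
p = mem.alloc(0x1000, 64, version=0xA)
mem.access(p)            # fine: the colors match
try:
    mem.access(p + 64)   # one cache line past the buffer
except MemoryError as e:
    print("fault:", e)
```

Note how the 64-byte granularity falls out naturally: the overrun is only caught once the access leaves the last colored cache line, which is why small buffers get placed at the end of a line, as discussed below.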
And I don't know if the allocator in the Linux libc is capable of that. [Answer from the audience:] Not yet. The kernel bits are being submitted upstream right now, so they will land upstream very soon, and now we are working to have the support in glibc as well, and also in GDB, so you can get nice messages instead of just a raw fault. I think the important part is that x86 has some debugging support in hardware for kernel engineers, while SPARC has very little debugging support; you are not able to set that many watchpoints, for example, because there are no such debug registers. With ADI set up, you can say: hey, if you touch this space, give me a fault. So this is actually how we can introduce memory watchpoints for the kernel guys; they will definitely want to use something like this, or at least they will be able to. [Question: what is the granularity of the check, a page or a cache line?] A cache line, 64 bytes. So basically for small buffers it's way too big, so they use a clever technique: because you are usually interested in buffer overflows, they position the buffer at the end of the cache line, so you can detect an overrun past the end of the buffer right away. Let me repeat the question, because it was not very clear here: the question was whether we can cheat and not simulate the program counter and next program counter exactly like the processor does. You cannot freely write the program counter on SPARC; the next program counter is not even directly writable, it is just maintained for you, so you need to issue a branching instruction or something like that to actually write something to the next program counter. You would have to do the same on the VEX side. But I think at this moment there is no check which would forbid the client from doing something like this, so probably, if you try hard, you will find some way to cheat the program counters under Valgrind.
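To make the PC/nPC mechanism under discussion concrete, here is a toy interpreter for the two program counters and the annul bit (a sketch only: it uses word-indexed addresses instead of byte addresses stepping by 4, and invented instruction tuples):

```python
# Toy model of SPARC delayed control transfer: two program counters
# (pc, npc) and a branch with an optional annul bit.
#   ("op", name)                     -- an ordinary instruction
#   ("b", taken, target, annul, name) -- a conditional branch

def run(prog, max_steps=32):
    pc, npc = 0, 1
    executed, annul_next = [], False
    for _ in range(max_steps):
        if not 0 <= pc < len(prog):
            break
        insn = prog[pc]
        if annul_next:
            annul_next = False          # ',a' branch not taken: discard
            pc, npc = npc, npc + 1
            continue
        if insn[0] == "b":
            _, taken, target, annul, name = insn
            executed.append(name)
            if taken:
                pc, npc = npc, target   # delay slot at old nPC runs first!
            else:
                pc, npc = npc, npc + 1
                annul_next = annul      # annul bit: kill the delay slot
        else:
            executed.append(insn[1])
            pc, npc = npc, npc + 1
    return executed

prog = [
    ("b", True, 3, False, "bne"),   # 0: taken branch to instruction 3
    ("op", "delay-slot"),           # 1: executes before the target
    ("op", "fall-through"),         # 2: never reached on this path
    ("op", "target"),               # 3
]
print(run(prog))  # ['bne', 'delay-slot', 'target']
```

The delay-slot instruction executing between the branch and its target is exactly what VEX has no natural representation for, which leads to the reordering hack described next.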
We had to hack the block disassembly part, so that when it sees a branch instruction, it also picks up the instruction afterwards, because that is part of the representation of the branch. Exactly; we basically unwind the delay slots, which simplifies the ordering of the instructions. If we know the delay-slot instruction is definitely going to be executed, we revert the order in the disassembly, so we put the delay-slot instruction in front of the branch and then issue the exits; or we do things like putting side exits there, or jumping over the instruction to handle the annul bit. Part of it is just about changing the disassembler. That would be helpful. The MIPS port does quite the same thing, because they also have a delay slot, but they do not have the annul bit, so they have an easier situation. OK, someone else? We still have maybe two minutes? OK, we are done. So thank you for your attention. Thank you.