So welcome to this import/export knowledge-sharing session — this is the second one we've done on the Manage team. I'm James Lopez, working on the Manage team at GitLab. This session will be recorded and uploaded to YouTube. These are not proper slides; I wrote a markdown document instead, because the purpose of these knowledge-sharing sessions is mainly to act as a reference point for the rest of the engineering team to learn more about the code, with some general knowledge sharing as well. We'll create a merge request to get this into the CE repo under the import/export namespace, so we have a README there. The link to these slides is in the Manage issue, and there's a link in the calendar as well. This is probably the prettiest slide we're going to have.

What we'll cover: first a quick demo of what import/export is; then the known issues and problems, with an emphasis on performance, where we've had more trouble recently; then security; then versioning, including probably the most awaited question — when do we have to bump the version; and then we'll look at the code. Hopefully we'll have plenty of time for questions, but feel free to interrupt me at any time as well.

So I have this project here. Import/export basically means that we can export an archive of data for a project in GitLab.
If we want to export this project, we go to the general settings and then "Export project". There are a few things that get exported, such as the project configuration, the wiki, the repository, uploads, and all the issues, comments, merge requests, labels, the LFS objects, and many other entities as well — we keep adding entities to projects in every release. There are a few things that we don't export, like CI traces and artifacts, some variables, encrypted tokens — anything that is encrypted we don't normally export.

If we click on "Export project" it should be pretty fast, because there isn't much in this project anyway, and we refresh the page. There's also a notification that gets sent to the email of the user who schedules this. You need to be a maintainer, by the way, in order to export a project. Then you can click on the export, which is ready now, and we have the archive.

Now we can go to create a new project and choose the export that we just downloaded. There are a few options here. The first one is the GitLab export. There's also a similar one which people sometimes don't know about, which is the GitLab.com import. The difference is that the export, as I mentioned before, exports quite a few things related to the project, while the GitLab.com import uses authentication — you need to configure the GitLab.com integration through OAuth — and then you'll be able to import, I think, mainly issues; not many things at all. People don't normally use those options that much, so let's go for the export one.

Question: did we avoid exporting CI variables in the past for security reasons? Yes, mainly for security reasons, and there's another reason as well: it's not easy to export those variables.
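The "not easy to export" part can be pictured with a toy sketch. This is an illustrative stand-in (not GitLab's actual attr_encrypted setup): ciphertext written with one instance's DB encryption key simply cannot be decrypted with another instance's key, so exporting the raw encrypted column would produce unusable data on the target.

```ruby
require 'openssl'

# Toy attr_encrypted-style column: encrypt/decrypt with an instance-wide key.
def encrypt(plaintext, key)
  cipher = OpenSSL::Cipher.new('aes-256-gcm').encrypt
  cipher.key = key
  iv = cipher.random_iv
  data = cipher.update(plaintext) + cipher.final
  { iv: iv, tag: cipher.auth_tag, data: data }
end

def decrypt(blob, key)
  cipher = OpenSSL::Cipher.new('aes-256-gcm').decrypt
  cipher.key = key
  cipher.iv = blob[:iv]
  cipher.auth_tag = blob[:tag]
  cipher.update(blob[:data]) + cipher.final
end

source_key = OpenSSL::Random.random_bytes(32) # exporting instance's key
target_key = OpenSSL::Random.random_bytes(32) # importing instance's key

token = encrypt('runner-registration-token', source_key)

decrypt(token, source_key) # same instance: works fine

begin
  decrypt(token, target_key) # different instance, different key: fails
rescue OpenSSL::Cipher::CipherError
  puts 'cannot decrypt on the target instance'
end
```

The same failure mode applies to any encrypted column moved between instances, which is one reason such columns are skipped on export.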
Most of these tokens, and everything else that is encrypted, use the DB encryption key, which is unique per instance. Although I'm doing the import in the same instance right now, normally the point is moving projects between instances — exporting from GitLab.com to a local instance or, as happens a lot, people migrating from self-hosting to GitLab.com: they export from the local instance and import into GitLab.com, which has a different DB encryption key, so those encrypted columns won't work properly because the key is different. There's an issue about this somewhere; it comes up very often with backup/restore, because there people are more interested in exporting these things, whereas with import/export it's not something they normally ask about. A workaround could be asking for the DB encryption key at some point, or something like that. But yeah, that's one of the reasons, and obviously security as well.

So let me just grab the archive we downloaded and call this "import test" — very simple. Now it's in progress; we'll get more into what's happening behind the scenes a bit later. You can see it's a copy of the other small project: it cloned the repo, and the one issue, one merge request and everything should be there.

In terms of the configuration — we recently started encrypting webhooks; I guess we export those encrypted? I'm not sure, actually, but I can check. I think there's an issue somewhere. Let me quickly check this, because it's related to what we import and what we don't. Yeah, webhooks — I think they were added by mistake, so there's an issue somewhere saying we shouldn't export them, and this is basically ignored by the import, and probably by the export,
because we have an attribute cleaner class, or something like that, that will probably remove these things — we'll get to this class a bit later. Anyway, this was added by mistake: we do export webhooks, but they probably don't contain anything useful, and they should probably be removed because of the issue I mentioned earlier.

That's it for the demo. There are a bunch of things that I won't cover today, because import/export is used in quite a few places, such as the instance-level templates; we also have the project-level and soon the group-level templates — I don't know if that's been merged already — and all of these templates use import/export behind the scenes as well. There's also the import/export API, which provides a few extras like overrides for some project columns and configuration. And there's something like a hook that runs after we export a project: we can pass a URL to the API saying "after you export this project, upload the export archive to a server" — say S3, or wherever. I won't cover those today because there's quite a lot going on there, but be aware that all of these use import/export.

The next point is specific to debugging and what to do when we find a problem. You normally get instant feedback in the UI: instead of the successfully-imported message, there should be an error there saying "hey, something happened", which also reflects the import error that we can check from the console. So if something happens we usually get notified there. Sometimes it's not as simple, so we have to check what's going on, and there are a few things we can do in order to debug any errors. I think the key columns to check are these three. First, the job ID, which comes from Sidekiq: the import is scheduled as a background job, Sidekiq returns this job ID, and we keep it in the database because it's quite useful to
have, so we can grep for it later. Second, the import status, which is significant as well — we'll talk about why a bit later. Third, the import error, which should reflect the errors I mentioned earlier.

Then there are the logs. These haven't been transitioned to structured logging yet — hopefully soon — so they're a bit annoying to grep. One thing we always do: if there's any error, we always log a statement that starts with "Import/Export error", so we can grep for that, and there's also a backtrace. With the job ID we can also grep the Sidekiq logs, and in the next slide we'll dig a bit more into why that is useful.

Tiago says those columns were migrated to another model, which is associated to the project. Yeah, that's right, I think — it's called import state or something like that — but I think this still works: all of these columns may have a method that basically delegates to the other model anyway. Okay, good to know — in a newer version we can just call the new model, project-dot-something. Maybe Tiago knows — import status or something? Yeah, sounds like it.

We do have structured Sidekiq logs on GitLab.com, which makes this a bit harder to debug now, because I think we removed the other ones, maybe. Having said that, we should move import/export to structured logging too, and this is quite easy with import/export, because there's a single place that handles the errors — a shared class that's called whenever there's any error. Anyway, moving on.

Next: I made performance its own section because it's probably one of the most recurrent problems lately with import/export. We have two kinds of issues that are quite recurrent. One is common to other Sidekiq jobs and other imports anyway, but it occurs frequently with import/export, which is the out-of-memory errors
related to the Sidekiq MemoryKiller. This Sidekiq class watches probably most, if not all, of the Sidekiq jobs, and basically when we reach the maximum RSS memory — two gigs, I think that's the setting on GitLab.com right now — it will send a kill signal (I don't know which one) to Sidekiq, and Sidekiq will end the process after, I think, 15 minutes or something, with a hard kill signal. This happens when any job goes over that limit, so it doesn't happen often unless a job is using a lot of memory, which can happen when we export or import a big repository.

So how do we know that the job got killed? This is not as easy as it should be. The import status will remain as "started", so after a while we'll notice that it didn't finish, and we see no errors; you can also check that the job is no longer there in Sidekiq. The best way to check is the Sidekiq logs, and this is why the job ID is important, because there we'll see something like "job is still running, blah blah blah" — this is probably around the first kill signal — and you'll see a struct with a list of job IDs, one of which will be the one related to the import/export. This is a bit hard to debug, because we can't just grep for the kill signal: it kills the whole process, and the log may show a random job ID, not necessarily this one. Basically, we schedule a Sidekiq worker to do the import job, and it gets picked up by a process that has different threads; each thread has a job ID and does a different kind of job, one of them being the import. So when we kill the process, other things get killed as well.

The workaround for this is easier if you're self-hosted, because then you can just increase the memory threshold —
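The escalation just described can be sketched as a toy decision function. This is illustrative only — the constant names, the 2 GB limit and the 15-minute grace period are the rough figures mentioned above, not the real middleware's configuration:

```ruby
# Toy sketch of a MemoryKiller-style watchdog decision:
# below the limit do nothing; on first breach request a graceful
# shutdown; once the grace period elapses, hard-kill the process.
MAX_RSS_MB = 2048      # ~2 GB limit mentioned above (illustrative)
GRACE_PERIOD_S = 900   # ~15 minutes mentioned above (illustrative)

def memory_killer_action(rss_mb, seconds_since_soft_kill: nil)
  return :none if rss_mb <= MAX_RSS_MB

  if seconds_since_soft_kill.nil?
    :soft_kill  # first breach: ask Sidekiq to stop gracefully
  elsif seconds_since_soft_kill >= GRACE_PERIOD_S
    :hard_kill  # grace period elapsed: kill the whole process
  else
    :wait       # breach ongoing, still inside the grace period
  end
end

memory_killer_action(1500)                                # => :none
memory_killer_action(2500)                                # => :soft_kill
memory_killer_action(2500, seconds_since_soft_kill: 1000) # => :hard_kill
```

The key consequence for debugging, as above, is that the kill is process-wide: every thread's job dies, not just the one that crossed the limit.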
there's more about that in the document there. There's also another thing that may happen: if the job doesn't get killed by the MemoryKiller but the export is really big, it may get killed by the stuck import jobs worker, which basically marks an import as failed after a while — I think the threshold is 15 hours or so. In the logs we'll see something like this — again, we can grep for the job ID — and we'll see that this is what happened. In this case we do mark the import as failed, which means we'll see the error in the import error column.

John asks: would it be possible to predict the size of a project export before we do it? Most of the time we can approximate it. Unfortunately — I don't know if this is on the next slide, but anyway — when we export a project we don't batch anything: we just call to_json on all the project models, and this is done in one go. Let me open one of these archives so you can see the contents. The key file is this JSON file, because we load all of it into memory, so its size roughly represents the memory we're going to use. The issues I'll link later are about fixing this problem, and there are a few things we can do, but it's not bad or anything. So the memory corresponds to the size of this file — maybe a bit more, because in the case of the import we're also creating ActiveRecord models and so on — but the size of the JSON is pretty much what lets us guess.

How can a customer estimate the amount of memory required by the instance for the import/export of a project? It's difficult to figure out exactly, but the project's JSON is a good approximation, because the key performance issue is related to this JSON. You can check the size of it, and then you can
pretty much guess: we're at least going to load the whole JSON into memory, so it's going to use that, plus a bit more. If you have, say, a one-gig JSON, then it will be at least one gig, and the Sidekiq process will already use at least 500 megs just from Rails loading; then the import or export itself may use a bit more memory, so that would be about two gigs, maybe, plus a bit. Well, RSS is not a perfect measure either — it's not trivial to measure memory — but it's a good approximation. And there's more going on, because we're sharing this process with other threads that may use a bit more memory, so it's a bit of a difficult guess, but I would say the JSON file is key. Hope that answers the question.

So, the import job: again, it loads the JSON into memory, and then we do batching here, and this is quite important, because if we didn't, it would use way more memory. We load, say, a bunch of issues and comments, then we commit to the database and free the memory — as in, we get rid of those objects and let the garbage collector do its thing, which works quite well; we don't have to call it explicitly or anything, it just cleans up after a while. But this is still not enough; we'll talk about it a bit more.

Slow JSON is one of the key problems, both loading and dumping, especially when you think about millions of builds and merge requests. Sometimes we have to touch the repository as well: think about merge requests — we may have to do some action per merge request, especially around forks, because we don't have the forked project, so we have to go and check the SHA the merge request points to and things like that, and then create a fake ref for it. I won't go into much detail, but it can take a while to load all the ActiveRecord
models, and this is what drives the high memory usage. One of the things we can do here is split the worker. We do this already for the GitHub importer, which improved things quite a lot — I remember for the Kubernetes project import it took, I don't know, at least three months to get imported, which was crazy, but it was much faster after we split the worker. So we could split this into different threads — well, processes really.

Stan also made a few good points in the issue linked here. Basically, we use ActiveRecord's to_json, and because it doesn't know what we're going to do with each model, it doesn't do anything clever: we may make unnecessary SELECTs to the database that we could have avoided, and things like that. There's a bit more to it, but some of the things we talk about in those issues, as I mentioned: splitting the export, which is everything going in one go at the moment; and optimizing the SQL — ActiveRecord does a lot of things, and it's definitely doing single INSERTs per model instance, where we could batch those as well; we could batch the reading and writing too, which we don't do at all. Moving away from some ActiveRecord callbacks would also be great, because then we could practically just insert straight into the database, as if it were a CSV import — but this would be difficult, because sometimes we do rely on callbacks.

When do we commit? This is another issue, related to ActiveRecord transactions. Basically, when we import something, we batch a few records — a few inserts — and after each batch we commit to the database, because if we don't, all of that stays in
memory and it will use a lot of memory, basically. The problem with this is that it gets slower: the more often we commit, the slower it gets, so finding the sweet spot there is quite useful.

I did try a few other tools, such as Oj, but it doesn't help much. I tested this with, I think, a four-gig import, and it did help, but only by a few seconds, and the memory was exactly the same, because the bottleneck is ActiveRecord behind the scenes, not the JSON parsing. This "fast JSON API" thing is promising — I think Stan mentioned it in the issue as well — and may help. There are a few links there, so feel free to dig a bit more into this if you're interested; there are a few open issues and suggestions. At the moment this is ongoing — we haven't scheduled it, but I think it would be useful. One thing we do now for customers, tracked in the infrastructure tracker, is foreground imports: if we want to import a big project for a customer, we do it in the foreground using a template. These basically avoid the Sidekiq issues; they use a bit more memory, but they do the job. Any questions? Okay, moving on to security.

The thing with import/export is that it's about three years old now — about as long as I've been at the company — and we really haven't touched it that much; it's pretty much the same as it was. That's why performance is a problem now: we keep adding things to it and it got really slow. We should also perform a code audit, because we keep encountering security issues, especially lately, and we haven't really dug into it — there's an issue there to do this, and I think we should prioritize it.

To help with the security aspect, there are a few more things we do. I mentioned the attribute cleaner earlier: this removes anything that ends with "id",
unless it's one of the references we do need — and even those we change, since we need them for mapping: when an ID is 1 we map it to the new ID, and things like that. There are a few other prohibited keys, such as token and the like, that the attribute cleaner will just drop, basically.

The other thing is that we have a few specs that check, for instance, for the addition of new columns. This happens quite frequently when we add new things to anything related to a project, or anything that hangs off the project tree. The spec automatically detects that a model changed and added a new column, and it makes you think about it: "hey, do you think this is safe to export or not?", and it tells you what to do. Similarly, we have the same for new models: if there's a new model that hangs off project, this spec will fail and let you know that you have to decide whether it's a good idea to export it or not. There are a few more, such as detecting encrypted or sensitive columns — practically the same as the other specs: we have a list of sensitive words like password, token and so on, and anything that looks a bit suspicious gets flagged, the spec fails, and you have to confirm whether it should be exported or not. Any questions? I'm going to drink a bit of water.

The next slide is versioning. This is probably quite a frequent question I've seen from engineers: when do we need to bump this version up? The way versioning works with import/export, it doesn't follow proper semver. The version is basically: if we increase this number, it means the archives won't be compatible. And this is because we keep changing it — in a given release we add quite a few changes that do not require a version bump, because they won't break import/export — and if
we were to increase this number per change, even for changes that don't break anything, we'd probably bump it maybe ten times every GitLab release, because we keep adding new things. This is a bit of a pain for customers, because every time we bump the version up, customers may find that they can't import an archive from the old instance into the new one, so we try to limit this as much as possible.

Which brings us to the next question: when do we need to bump the version? Mainly when we rename a model or a column, or when we have format modifications that make it completely incompatible — changes to the structure of the archive or the file itself. You won't see it here, but we may have, say, an uploads folder where we keep those files; if we rename that folder, or we change the structure of the JSON file, it will break. That doesn't happen often, though. What normally happens is renaming, and sometimes removing a column. Most of the time, adding a new column or a model doesn't imply any version bump, because the import ignores it anyway, so it won't complain — we can do those things without needing to change the version.

One thing that happens when we do change the version: the integration specs — one of them contains a file like this with the version in it — will fail, because we bumped the version and the spec will warn about the new version. There's a Rake task, I believe, that basically bumps the version in that spec file for us, so it's quite easy.

Renaming is one of the main problems we have, especially because GitLab.com runs ahead of what most of our customers or users run, which is a bit of a problem. This has changed recently: we now try to support at least one version back, so we don't have this problem. Basically, we can now use this service here so that for one
version — say 11.6 — we make it compatible. In this example we renamed pipelines to ci_pipelines, so for 11.6, exports will work on all the instances, because the relation gets exported as pipelines as well, while in 11.7 we expect this to be removed. This is great for customers, because otherwise, with what we deploy on GitLab.com, most of the exports they have wouldn't work if we bumped the version up — but with this change, hopefully it will help. Any questions?

Okay, there's a question: what was the reason for not using standard semver, given that this can happen a number of times each release? Yes — because there are compatible changes all the time: we keep adding not only models but columns, and changing stuff very frequently. We could do it, but I think it would be a waste of time, because it's not very useful, really. And sometimes you don't even know that you're actually modifying the import/export: as I mentioned earlier, it's very dynamic, and if you add a column or something like that, you're actually changing it — but you're not aware, because it wouldn't complain, it would just be ignored. That's the main reason. To be honest, the version is really useful mainly for compatibility reasons, because you can always check the GitLab version, and there's a mapping and version history in the import/export documentation, which is great as well.

Let's go to the next slide, which is basically a quick dive into the code. Probably the most important thing about import/export — and this is used by customers quite a lot — is this configuration file, import_export.yml. In here we specify what we actually export or import in an instance. Everything comes from this project tree, and then we say we want to export labels, milestones, events, issues and so on — there are quite a few things in there. As you can see here, we export quite a few models; one of
these is the set of models that we export — pipelines, notes, and so on. Sometimes customers may not be interested in exporting certain things; the only issue with this is that they do need to restart the instance after a change here. There's another thing about this file: we can say which attributes we want to include and which ones we don't, so we can exclude certain attributes. This is quite useful for security purposes as well — we can ignore tokens and things like that.

There's also this methods section in here. We don't use it very often, but sometimes we may want to export certain things that don't have a relation, or we may want to override the relation itself and do something to that model. That's the case for a few things, like these utf8 methods — this was a few years ago, but I think there was an issue where, when a value wasn't UTF-8, the JSON wouldn't work properly, so we changed it to export the UTF-8 version through these methods. You can also see there are a few "type" entries there; this is because type is a reserved keyword for Rails, and for some reason it just doesn't export the type unless you specify it.

Next is the import status — we talked about this earlier. This is in general the same for all imports: it changes from none to scheduled when we schedule the job in Sidekiq; once the job gets picked up by Sidekiq it changes to started, and then it changes to finished or failed. What happens in import/export after the job gets picked up by Sidekiq is that it calls this importer class, which does a few things, like extracting the file from the uploader. We keep both import and export files in these file uploaders — there's actually a dedicated class for this here, but it's basically a file uploader — which means we can keep them in object storage if we configure it like that. So this
import file gets extracted, and then the next thing we do is check the version to see if it's compatible or not. Then we call a few restorers, which basically either extract the repo or restore everything to do with the database models, uploads, LFS, and a few other things. If we don't hit any errors, we always clean up after we're done. There are a few places where this can go wrong, such as with the Sidekiq MemoryKiller, for instance, so sometimes we have a few problems there.

The export is practically the same: we just call a few services that do the opposite — they save the version, save the avatars, save all the JSON, uploads, the repository, and so on — and after that we notify about the success and send an email; we also clean up, and send an email as well with everything that happened. That's pretty much it. Any other questions?

I think this is practically the last slide. I listed a few links at the end as well. There are a few things we didn't cover, such as the admin documentation, which covers the Rake tasks, and the API. The link to the presentation is here — well, as I mentioned, it's not a presentation but a markdown document — and I'm hoping to submit a merge request to add this to the root of the import/export namespace, which is around here. I think this would be useful for developers to check, so we'd have a README there with all of this information.

Okay, I'm going to wait a few seconds... okay, I'll give you back 15 minutes then. Thanks for your time. If you have any other questions, feel free to join the Manage channel and ask there. Thanks a lot.