Hey, hello, you can hear me, right? Oh, OK then, I'll share my screen. Yes, that works. Sorry, I was on mute. Good morning, good afternoon, or good evening, depending on where you are. Let's wait for a few more minutes; there should be a few more people joining. I see the Alibaba team are on. Are you good to present the Vineyard project? OK, thanks Alex. Yeah, we are on. Shall we just wait a few minutes, or just get started? Let's maybe wait a couple of minutes; a few more people should be joining soon. Thank you.

Alright, it's nearly five past. Why don't we begin? Hi, Erin. Hi, Alex. OK, let me get started and share the screen first. Can everyone see my screen? Yep. OK.

Hello everyone. My name is Ren Yuanyu, and I am from the Alibaba DAMO Academy. Today with me there are also a few of my colleagues, Andy Tao and Siyuan, online. Today I'm going to present our recent work called Vineyard. It is a distributed in-memory immutable data manager, and we plan to donate it to CNCF as a sandbox project. We would like to hear more from the community, especially from the SIG Storage community, for feedback on our project. Feel free to interrupt if you have any questions.

OK, the first question is: why bother? Why do we need yet another data storage engine? The problem is as follows. PyData is the de facto standard for data analysis. People build various data applications in Python, and they normally combine multiple libraries or projects from the PyData ecosystem for different kinds of work. For example, if we want to do visualization, we use matplotlib; to analyze dataframes, we use pandas; if we want to do numerical calculations, we use numpy; and if we want to do machine learning, we use PyTorch or TensorFlow. All those libraries work together very nicely, because they share data, especially intermediate results, very efficiently between the systems or libraries.

Here is an example where we want to pass an array from numpy to PyTorch. Basically, PyTorch understands the data structure of numpy and all its metadata, and for the payload part, the actual array, it just shares the same array, passing the C pointer and its length between the two libraries. It's very easy. In this case, you change position zero of the tensor to minus one, and the ndarray on the numpy side also changes, because it shares memory. So sharing data between those two libraries happens with zero copies, at essentially zero cost.
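A minimal runnable sketch of the zero-copy sharing just described, assuming recent numpy and PyTorch (the array values here are illustrative):

    import numpy as np
    import torch

    arr = np.zeros(4, dtype=np.float32)   # the payload: one buffer in memory
    tensor = torch.from_numpy(arr)        # zero-copy: wraps the same buffer

    tensor[0] = -1                        # mutate through the tensor...
    print(arr[0])                         # ...prints -1.0, numpy sees the change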
But what if, for some reason, we want to access the same piece of data but cannot do it in the same process, and we need to do multiprocess processing on the same data? It is not as easy as the first example, but it is still possible, with Plasma from Apache Arrow. Plasma is basically a local object store using shared memory, and it comes with a Python client. A process can get an object, and that object's memory is simply mapped into the process, so it can access the data through that memory. But the metadata part is not as straightforward as in the first example, because the object store only manages continuous memory for each object: it requires everything to be stored in a single continuous section of memory. For the metadata, you have two options. Either you handle it yourself (metadata typically does not take much space, so you pass it around separately from the payload part), or you serialize the metadata together with the actual payload and put them into Plasma as a single object, which is just what Apache Arrow does. Those are the two ways you can solve the problem of sharing metadata; either way, you can still share the payload part with Plasma between different processes or runtimes on a single machine.
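For reference, sharing a payload through Plasma between processes looked roughly like the following; note that the pyarrow.plasma module has since been deprecated and removed in recent Arrow releases, and the store path and size below are arbitrary:

    # In a shell, start the store first, e.g.:
    #   plasma_store -m 1000000000 -s /tmp/plasma
    import numpy as np
    import pyarrow.plasma as plasma

    client = plasma.connect("/tmp/plasma")

    # Producer: put() serializes the metadata together with the payload
    # into a single object, the second approach described above.
    object_id = client.put(np.arange(10))

    # Consumer (in another process, given object_id out of band): the
    # object's memory is mapped into this process without copying.
    arr = client.get(object_id)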
But then, what if we want to handle a bigger application? We handle big data, and the big data itself cannot fit into a single machine; plus, we want to run many different tasks, different workloads, on the same piece of data, or on the results of the different workloads. What can we do? In this case we want to leverage Kubernetes, and also the projects we have been building. Let's look at real-life big data applications. Beyond what we have done in Python, real-life big data applications are actually very, very complex; they involve many different tasks. For example, from the raw data we need to do some ETL, to do the joins and the transformations; then we may want to feed that data to a graph system to do community detection, for example with label propagation; then we want to feed that data to a deep learning system like TensorFlow or PyTorch, to try to learn some patterns, some models; and finally we want to do some visualization on the classification, to inspect whether the classification makes sense. If that's the pipeline, you see that for each workload there are dedicated systems, and between the systems we need to shuffle the data through a distributed file system. No more zero-copy sharing of the data anymore.

OK, here's the observation: a big data application involves many systems; each system shares intermediate data through an external file system; and this kind of workload is often organized as a chain or a DAG, where each individual task requires the results produced by the previous tasks.

Here are the problems. First, even with distributed file systems available for big data, building production-ready systems like Hive, TensorFlow, Spark, or PyTorch is very hard. Why? Because we need to consider different kinds of distributed file systems, and we need to consider the file format: do we use CSV for tables, or ORC, or Parquet? There are so many file formats, not to mention graphs: there is no standard way to store a graph, so we need to dump it as tables, and those tables may lose lots of information and be very inefficient. And you end up tightly coupled with this kind of input/output and with those systems. Second, sharing this data with external systems of course involves huge I/O costs, and sometimes those costs are unnecessary: if we want to optimize those tasks as a whole, to pipeline this kind of thing, the job is very challenging.

So that's the motivation. We want to build a system that solves those problems: we want to make big data systems easy to build, we want to reduce this kind of I/O cost, and finally we want to open the opportunity for cross-task optimizations. That's the quest for Vineyard.

So what is Vineyard? Vineyard is a distributed in-memory object store for immutable data, and it supports zero-copy in-memory data sharing between different systems. It comes with out-of-the-box high-level data abstractions for developing big data applications. For example, we have tensors, we have dataframes, we have distributed graphs, we have scalars, and also common data structures like arrays, hash tables, etc. Those data structures come out of the box, and they can be mapped into memory just like native objects, like C++ objects, so you can do local data access just as with a native object. And finally, we provide drivers for data partitioning, I/O, checkpointing, migration, etc. That means the big data applications do not need to care: when the computing engine is started, the data is already there in Vineyard. The computing engine itself does not need to care whether the data has to be loaded from an external file system, or comes from another stream, or whatever; Vineyard comes with drivers that do this for the applications built on top.

Here is the architecture of Vineyard. A Vineyard object consists of the data payload, which is what consumes most of the memory, and the metadata. The data payload is stored in shared memory, just like Plasma data: we open a big chunk of shared memory for storing the payloads. The metadata in Vineyard is synced through the cluster using etcd. Currently we support dataframes, graphs, tensors, etc., many kinds of objects. The Vineyard daemon instances are accessed with IPC and RPC connections. Over RPC you can only access the metadata; over IPC you can also map the shared-memory payload into your process. And Vineyard comes with many pluggable drivers, which provide certain functionality for certain data formats, for example migration, or I/O: loading data, dumping data, saving data to an external system, etc.
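A small sketch of what this looks like from the Python client, based on Vineyard's SDK as described here (the socket path is deployment-specific, and exact signatures may differ across versions):

    import numpy as np
    import vineyard

    # IPC: connect to the local vineyardd over a UNIX domain socket, which
    # allows payloads to be mapped into this process via shared memory.
    client = vineyard.connect('/var/run/vineyard.sock')

    object_id = client.put(np.arange(10))  # payload goes to shared memory,
                                           # metadata to the daemon / etcd
    arr = client.get(object_id)            # zero-copy view of the payload
    meta = client.get_meta(object_id)      # the metadata alone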
Could I just ask a couple of clarifying questions there? Is the data replicated across the different instances, or is it sharded, or is it just separate data sets?

Currently we are just sharding the data, partitioning the data, as is common for big data applications. We also support replication; I will come to that later. For example, if the data is replicated, we can have two processes working on the same piece of data to speed things up; or, for some reason, we want to have a backup of the data in memory; or we just dump the data to an external file system and free that data from memory. That can all be controlled by something called a driver, and we can build drivers to do that. Drivers can work with Kubernetes and the applications on top to decide. At the very low level, Vineyard does not care; it does not really understand what the data means. But because we have managers and drivers, we can easily plug new kinds of data structures or types into Vineyard. So making sense of the data is more of a client thing, or an application thing, not a Vineyard thing; Vineyard just maintains the metadata.

Do you have any concerns over using etcd at a certain scale, due to its brittleness, when using it to sync the data through the cluster? Have you run into performance issues?

You mean performance considerations?

Well, not only performance considerations, but just the additional traffic that flows through etcd that isn't part of its normal functionality. At scale it seems like you could possibly run into issues there.

We have done some testing, and currently we deploy etcd as a standalone cluster, as a standalone workload, in addition to the etcd required by Kubernetes. Also, it's not a big problem, because we only use etcd for metadata, and only if that metadata is consumed across different kinds of applications do we put it into etcd. We keep the metadata as local as possible: if an object and its metadata are not required by a remote worker, we try not to expose them. And our data is mostly immutable, so the metadata is mostly static unless you're creating something new, which does not happen too frequently. So the etcd traffic is not a problem.

OK, thanks.

Here I will show an example of accessing a global dataframe in Vineyard. That means we have a dataframe, a big table, and we partition the table into many chunks, where each chunk may be located on any one of the Vineyard instances. Say we have a client on top of one worker. First, we connect to Vineyard through the domain socket, and then we get an object with a specific ID, the dataframe. Then we can access a chunk: we can check whether the chunk is local or not, get a local chunk, and then just use that chunk like a normal pandas dataframe; and we can inspect the metadata. Each step involves different components, but basically, if the chunk is local, you can use it just like a normal native object.
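A sketch of that chunk-access pattern might look like the following; this is illustrative only: the object ID is a placeholder, and the locality check is an assumption about the SDK surface based on the description above, not a confirmed API.

    import vineyard

    client = vineyard.connect('/var/run/vineyard.sock')

    # Placeholder ID of one chunk of the global dataframe, taken from the
    # global object's metadata (obtained out of band here).
    chunk_id = vineyard.ObjectID('o0000000000000000')

    meta = client.get_meta(chunk_id)  # metadata is accessible from anywhere
    if meta.islocal:                  # assumed locality check on the chunk
        chunk = client.get(chunk_id)  # map the local chunk into this process
        print(chunk.head())           # and use it like a normal pandas frame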
Currently we have some integration with Kubernetes, and we have a vision that, with Vineyard's support and the abilities of Kubernetes, maybe we can find a new paradigm for big data tasks. First I will cover how to deploy Vineyard in Kubernetes, then how we leverage the power of Kubernetes, and further, how we use Kubernetes' abilities to co-schedule data and the workloads on top of it.

Just flashing back to the previous task we want to solve: first, we replace the distributed file system with Vineyard. Then we want the different kinds of workloads to share the data through the means of CRDs: they find the CRDs for the data they want to access, and we use Vineyard to map the data into their workers. For example, we change the workloads: we use a dataframe engine called Mars, which has integration with Vineyard and is built by Alibaba, and a graph system called GraphScope, which is also built on top of Vineyard; the data can be directly shared through the CRDs. Then we use Kubeflow with the graph analysis, and we can still use Vineyard to share the data, because it talks in the language of ndarrays, chunks of ndarrays, and we have a Python SDK to easily integrate Python-based libraries. In this way, the end-to-end big data task is deployed on Kubernetes, and the intermediate results, the data, are abstracted as CRDs and live in Vineyard, in memory. Maybe we can have a scheduler to optimize the locality for the next job; and if the job is mismatched, we can assist the scheduler, or we initiate another job to migrate the data for the alignment, or we repartition the data for the alignment.

I was just thinking that would be an ideal candidate for a custom controller, or a mutating admission controller or something, potentially.

So first, how to deploy Vineyard on Kubernetes. Actually, it's a little bit not straightforward, because Vineyard requires IPC communication between the Vineyard server pod and the application pod to share memory. Currently we deploy Vineyard as a DaemonSet, so we need to either use a hostPath for the IPC socket, or we need a separate volume claim just to put the socket there. We have done the experiments and it works: as long as we can have the domain socket mapped into the different containers, we can bundle Vineyard into the same pod, and the domain socket can be shared. So for deployment, Vineyard can be deployed as a DaemonSet, and we leverage Helm to quickly install and deploy Vineyard on Kubernetes.

Secondly, we expose Vineyard objects as custom resources: basically, if some job requires some kind of data in the form of Vineyard objects, it can look them up through the Kubernetes CRDs.

We also have plans in progress to build a Vineyard operator that is responsible for the dev-ops of Vineyard on your own Kubernetes cluster. We want it to be responsible for managing the status of the Vineyard cluster and managing the CRDs, to provide the scale-in and scale-out capability of Vineyard on Kubernetes, and, if we want some kind of data checkpointing, or recovery for fault tolerance and so on, we can use the Vineyard operator for that too.

Further ahead, we plan to leverage the scheduling functionality of Kubernetes. We want to use Kubernetes to operate on the data: how the data is partitioned, how the data is migrated, how the data is replicated, how the data finds the workload, and how the workload finds the data. First, the worker pod describes the required Vineyard objects in its spec, and the scheduler tries to align the worker pod with the required Vineyard objects by retrieving the location from the CRDs. Then we can trigger a data migration or repartition to ensure that the pods can access the data they require.

Here is an example. First we have a job that generates data partition one and partition two on two Vineyard instances; partition one and partition two together are parts of some global graph. For the next job, we have an opportunity to collocate: maybe we want to put the next job's workers together with partition one and partition two, but if that's not possible, we trigger a migration to satisfy the requirements. Then the job is launched, the data is there, and it can be directly mapped into its processes.
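As a toy illustration of the alignment logic just described; everything here (the object-to-node map and the required-object list) is hypothetical, standing in for what the scheduler would read from pod specs and object CRDs:

    # Required Vineyard objects, as a worker pod would declare in its spec.
    required = {"part-1", "part-2"}

    # Object locations, as the scheduler would retrieve from the CRDs.
    locations = {"part-1": "node-a", "part-2": "node-b"}

    def score(node: str) -> int:
        """Count how many required objects are already on this node."""
        return sum(1 for obj in required if locations.get(obj) == node)

    nodes = ["node-a", "node-b"]
    best = max(nodes, key=score)
    missing = {obj for obj in required if locations.get(obj) != best}
    if missing:
        print(f"schedule on {best}; trigger migration of {missing}")
    else:
        print(f"schedule on {best}; all required objects are local")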
Sorry, just a couple of questions there. The data migration: would that be happening shared memory to shared memory?

Yes. It's done by a driver. Actually, a driver is a special client, a special client that lives in the Vineyard container. Basically, it's a separate process from the Vineyard daemon, but in the same container. For each node there's a Vineyard container, and there will also be a driver. We send it commands: migrate this object to this instance, or just create a new object here, and maybe remove it there, or just keep it here; it depends on what command it was given. So the driver is basically doing very primitive jobs. It's a special client, or a special application, but it can provide things like checkpointing: if we want to save a copy of the data to the disk, we just dump the data to the disk, and then we can move on. That's basically what a driver is.

Interesting. Just one other small comment: we had a project presented a couple of months back called the Dataset Lifecycle Manager. It was an IBM project which had, at least, not the shared-memory aspect, but it had put together a process where you could have CRDs that would identify datasets and load them onto particular nodes, specifically for research and potentially big-data-type use cases. So I'm kind of wondering if maybe there is some overlap.

Yeah, I think that's a good idea, actually. We don't currently have the bandwidth to go very deep on the scheduling part; there's too much work to do. Actually, I'm just thinking that the model of Vineyard fits this kind of ability provided by other CNCF projects very well, and we are currently looking at another project called Fluid to achieve something similar. But I will definitely also look at the Dataset project; I didn't pay attention to the previous meetings, but we will definitely check. Thank you very much.

Very cool.

And for the roadmap: currently we have everything, build and testing, done through GitHub Actions. We have various data types supported already: we have arrays, graphs, and dataframes, etc., and we support several computation engines: PyTorch, Mars, and GraphScope (except for PyTorch, those are from Alibaba). We release the Docker images on Docker Hub and Quay, and we have integrated with Helm for the deployment of the Vineyard operator. We aim to further improve performance, both for those data types and for the basic primitive operations on Vineyard, such as creating or removing objects. We also plan to add more language support, such as Go, and maybe Rust. And we may want to look into how to build a storage hierarchy, for example objects in memory on some device like a DPU, or external storage like local SSDs, and whether we can leverage that. And we want to build a scheduler plugin, so people can integrate it with other scheduler frameworks, to handle the data locality problem.

The status of the project: it is currently hosted in the GitHub alibaba org, and it has 343 stars as of yesterday, 33 issues, 113 PRs, and 6 maintainers, currently all from Alibaba. But we welcome any contribution from the community, and we have a clear path for newcomers to become maintainers. It is under the Apache 2.0 license; issues, discussions, and PRs are all welcome, and we have a website to host the documents.

For the community governance, we have a clear path for newcomers to become maintainers. We have designed and opened many good first issues for newcomers, and before becoming a maintainer, we expect a newcomer to submit five PRs to Vineyard. Then they can reach out to one of the existing maintainers, and we are happy to hold a vote; a majority is required, and it's just managed through GitHub teams. We very much welcome non-maintainer contributors as well. We know that building Vineyard is not something where the effort from one team, or just one vendor, is enough; the community is the key, so we really welcome non-maintainer contributors. We expect a maintainer to spend at least one fifth of their time on the project. Enhancement decisions are proposed as issues, and we can hold a vote by the maintainers; to develop one, you can feel free to self-assign, or a maintainer will assign it. And for the release cycle,
we will cut release packages on a regular cadence: a major release of Vineyard every year, a minor release every two months, and a patch version every one or two weeks. We plan to publish our first major release in April, and we have already released packages as Python wheels, Docker images, and Helm charts.

Oh sorry, wrong slide. Why CNCF? Actually, Vineyard is a very natural fit for cloud native computing: it provides efficient distributed data sharing in cloud native environments, and it can all be orchestrated by Kubernetes. We find that really exciting, and it already fits with the existing abilities provided by many CNCF projects. Currently it is Kubernetes-native: for scaling it comes with a scheduler plugin, we use Helm for deployment as a cluster, and we use etcd and CRDs from Kubernetes for the metadata management. And we really want to work with the community to encourage collaboration and innovation. Vineyard is kind of a foundation for building new big data systems, or for making the existing systems better, so hopefully we can get feedback and contributions from the user community, so that they can engage. We want to work with the CNCF community, and we believe that together we can build the next-generation cloud native paradigm for big data applications. That's all for my presentation. Any questions?

Thank you so much; this was a really good presentation. Are there any questions from anybody on the call? Perhaps Luis? Jane?

I think this is great; I learned a lot from it, actually. One of the things I look forward to is maybe a demo of it; that would be great. I can see the architecture; I would like to see how it gets used.

Maybe we can do that in the next meeting. For much of the cloud native work, especially the scheduler part, we were just doing a very early PoC, a proof-of-concept version, and we haven't got everything linked together yet. Maybe in the next meeting we can do a demo; let's see.

Other than that, I guess we just have to go through the due diligence that we have to go through for the project and check it out, but I am quite impressed with what you showed, especially all the memory things. And just to double-check: are you planning to apply for the sandbox or the incubation level?

Actually, I want to hear from you on that as well. We are applying for the sandbox; I think it's easier, and our project was only open-sourced a few months ago, in October or November, I can't remember. It's pretty new, and we only have maintainers from Alibaba, and we really want to get more external maintainers before we move to the next level. I think that's our plan. Incubation we would be happy with, but it's just not ready yet.

Yeah, I think the sandbox is correct. You need to have end users, right, Alex, I believe? I think it looks like you don't have that yet: for end users you have Alibaba, of course, but you need end users other than yourselves.

Yeah, I understand. So basically, we have two other open source projects built on top, already open-sourced by Alibaba, and they have a few end users; I'm not sure whether those count as our end users as well. But yes, we just want to improve that; it's not enough yet.

Yeah, it looks pretty decent for a sandbox project.

Yes, I agree. It may not be enough; I'm just concerned about the amount of time a sandbox project has. Is there a lifetime limit? Because this project seems good, but I'm just concerned that it may not collect enough end users or other community members. Is there a lifetime for sandbox, or do we wait until maybe the team has more contributors?

Well, I think the
project is just at the right stage, because you're close to the 1.0 release, and using sandbox to increase the number of maintainers and to build out the community is perfect. Once you're in sandbox, there are reviews every six months or so, but I don't anticipate that that's going to be a problem. You can then make the decision to move to incubation once you're ready.

OK, thank you.

Sandbox is correct. Brilliant, OK. One thing I will do is share the recording of the presentation and the deck with the TOC at the next call, so that they'll have some background on the project before they go into the next sandbox review. The TOC have a regular schedule now where they review all of the sandbox applications in one go, every one to two months. I need to double-check when the next review is, but I'll find out and let you know as well.

Thank you. If there's anything more we can provide to make it more solid, just let us know.

Any other questions or comments for the team? Alright, I think we're good. We also had another item on the agenda: to continue reviewing the DR document that Rafael had shared last time. But unfortunately Rafael had to drop off; something came up, so he had to leave partway through the call. So I would strongly encourage you, if you have comments or any other content to feed back on that DR document, to apply them to the document, and we will review the comments in the next SIG meeting.

I'll make sure I review it. Cool, thanks; it's a really good document, and it's coming along nicely.

Does anybody else have any other items they want to raise or discuss? OK, so we get 12 minutes back. Thanks everyone; have a good rest of your day. Thanks. Thank you.