Hello everyone, and thank you for joining this session. I am Takuya, an IT architect at NTT DATA Corporation, and I am presenting together with Masaru Dobashi. In this session we will explain the history and evolution of data lake architecture.

First, a quick introduction of ourselves. I am an IT architect in charge of computing and mission-critical systems. NTT DATA Corporation serves multiple sectors, including public, finance, and corporate. Masaru Dobashi is an IT specialist in OSS who works on distributed computing, machine learning infrastructure, and so on, and he has written technical books about Apache Kafka and Apache Spark. In our projects, we build data utilization infrastructure with OSS.

This session consists of two parts. First, Masaru Dobashi talks about the history of OSS for big data. After that, I will talk about recent related technologies. Before we start, let me explain the scope of this session. A data utilization platform is built from many technologies. This session mainly focuses on storing and processing big data, in particular the processing engines, the access interfaces, and the distributed storage underneath them. It is this storage layer that we will mostly discuss. Now, let me hand over to Masaru.

Hello everyone, I am Masaru Dobashi from NTT DATA. First, let me talk about the history of open source software for processing big data. As you can see on this slide, there is a large ecosystem around Apache Hadoop. From this ecosystem I will pick up several topics: Apache Hadoop itself, HBase, distributed stream processing, and query engines.

First, I talk about Apache Hadoop. Hadoop was born to process large amounts of data in enterprises. At the time, a typical server had only a few CPU cores and a few gigabytes of memory, and we processed data with RDBMSs running on such machines. Hadoop, in contrast, can use many machines together, which makes one large computing resource, and you can add computing resources simply by adding machines. The figure illustrates this scale-out: the same workload that takes about eight hours on the configuration on the left shrinks to five hours with more machines, and down to about one hour on the right.

Hadoop made this kind of large-scale data processing available as open source software. However, it was difficult to write applications directly against it, so various techniques to increase the productivity of application developers were born; they abstracted the distributed computing based on MapReduce. Hadoop MapReduce transparently provided us with features that are difficult to realize by ourselves, such as scalability, high availability, concurrency, and recovery. On top of it, for example, Apache Hive lets you write applications with a SQL-like query language, and Cascading is a framework, or library, for building custom DSLs.

But Hadoop also has weak points. For example, HDFS combined with MapReduce is not good at handling small files. As this figure shows, Hadoop is good at handling large files sequentially; however, it is difficult to handle lots of small files or lots of small pieces of data, to update records in a large dataset, or to partially delete records in large data files. The more small files you store, the more the metadata grows, and this results in a high load on the master server, which manages the metadata of the file system. For example, on the left side there are many one-gigabyte files, and the number of files is less than one million. On the right side there are many one-kilobyte files, and the number of files is one trillion. The total size of the data is the same, but in the right-side case the master server consumes far more memory for metadata, which can hurt the performance of the file system. Secondly, Hadoop is not good at data processing with low latency. Hadoop is good at batch processing over large input and output files; on the other hand, it is difficult for Hadoop to handle small input and output data and respond to requests in an online manner.

OK, then secondly I talk about HBase. HBase has the features needed to handle small pieces of data within the Hadoop ecosystem. You can insert a record with low latency, for example in millisecond order, and you can use a simple API to read data, such as get and scan, so you can efficiently read only a part of the data. With HBase we use a simple API: get, put, scan, delete, and so on. One of its key strengths is the throughput of the put API.
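To make that API concrete, here is a minimal sketch of those four operations from Python, going through the HBase Thrift gateway with the happybase library. The host, table name, and column family are hypothetical, and in production the native Java client is more common.

    import happybase

    # Connect via the HBase Thrift gateway (host and port are assumptions).
    connection = happybase.Connection('hbase-thrift-host', port=9090)
    table = connection.table('user_events')  # hypothetical table

    # put: write a row keyed by user id; keys and values are bytes.
    table.put(b'user-0001', {b'cf:last_login': b'2020-09-28T12:00:00Z',
                             b'cf:score': b'42'})

    # get: read back a single row with millisecond-order latency.
    row = table.row(b'user-0001')
    print(row[b'cf:score'])

    # scan: iterate over a key range, fetching only one column.
    for key, data in table.scan(row_start=b'user-0000',
                                row_stop=b'user-0100',
                                columns=[b'cf:score']):
        print(key, data)

    # delete: remove the row.
    table.delete(b'user-0001')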
Because HBase is based on Hadoop, it also leverages the features of Hadoop, such as scalability. For example, you can expand the storage space by adding nodes, and you can also improve the read and write throughput by adding nodes. A famous use case from when HBase first appeared was as the backend database of Facebook Messenger.

The third topic is distributed stream processing. Apache Storm and Apache Kafka brought scalability to stream processing. Apache Storm was born to process massive stream data in near real time. Apache Kafka is open source software for messaging of massive stream data, which was originally born at LinkedIn. Using Apache Kafka, we can receive and transfer massive amounts of data frequently, and using Storm, we can receive that data and process it in near real time. One example of this kind of use case is analytics over tweets to clarify what content is trending.

The next topic is the rise of SQL-on-Hadoop and the columnar format. After distributed computing became popular, various SQL-on-Hadoop solutions for ad hoc queries were developed to improve usability for end users. Since about 2013, SQL-on-Hadoop products aimed at low-latency queries were born. The difference between the original Hive and such products is the dependency on Hadoop MapReduce: SQL-on-Hadoop in this context does not depend on Hadoop MapReduce. These technologies sometimes perform remarkably better than the Hadoop MapReduce plus Hive approach in the specific workloads they are good at. This is because some of them make good use of memory and networks as well as disks.

This is an image of how hardware trends changed after Hadoop was born. When Hadoop and Hive were born, a mainstream machine had only hard disk drives, and the size of memory was a few gigabytes per machine. On the other hand, this is a typical Hadoop spec from the era when Presto and Impala were born: SSDs were often used instead of hard disk drives, and the size of memory had increased to tens or hundreds of gigabytes per machine. The column-oriented format was also becoming known as a technology to assist ad hoc queries, and systems inside as well as outside the Hadoop ecosystem use these kinds of formats. Compared to the row-oriented format, the column-oriented format achieves the following improvements: first, efficiency in encoding and compressing data; second, better optimization of read operations for analytics queries.

So first, I introduce the improvement called column pruning. Data analysts often use only a few kinds of columns for a specific analysis, and an application can read only the targeted columns when using a column-oriented format. This slide shows an example. In the row-oriented case, applications need to read every column, including ones not required for the analysis. In the column-oriented case, applications can read only the columns targeted by the analysis and skip the ones that are not required. This omission reduces I/O for analytics.

Next is the improvement called predicate pushdown. A data analysis often uses only a partial range of the data, and an application can use metadata to narrow the range of actual read operations. This is an example: here the data is sorted by date. If we want to search for data created on a certain date, then according to the metadata we can skip data 1 and data 3, the left-side and right-side data, and read only the targeted data, data 2.

But the column-oriented format is not a silver bullet. The transformation into the column-oriented format, and update operations on it, need a lot of computation and resources. This is an image of the transformation: first we buffer row-oriented data in memory together with statistics; after that we convert the row-oriented data into column-oriented data. This conversion requires computation and re-arrangement of the data. This is why data that is written once and read repeatedly is a good fit for these technologies. However, I don't recommend using it for the intermediate data of a computation, because the transformation is not cost effective for intermediate data.

It is also not easy to fully leverage the optimizations. For example, the predicate pushdown I mentioned is useful only when the data is sorted well enough by the targeted search column. It is essential to understand the characteristics of the data and the access patterns to it. This is an example where the data is not sorted by the targeted search column: the dataset is sorted alphabetically, so if you want to search for data created on a certain date, you cannot use the metadata and you have to read the full dataset.
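As a minimal sketch of column pruning and predicate pushdown, here is how they look with Parquet through the pyarrow library; the file name and columns are hypothetical. Note that the pushdown in the last read is effective only because the rows are written clustered by date, exactly as discussed above.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Buffer row-oriented data in memory, then write it as column-oriented
    # Parquet; each row group carries min/max statistics per column.
    table = pa.table({
        'date':   ['2020-09-01'] * 1000 + ['2020-09-02'] * 1000,
        'user':   ['alice', 'bob'] * 1000,
        'amount': list(range(2000)),
    })
    pq.write_table(table, 'events.parquet', row_group_size=1000)

    # Column pruning: read only the column the analysis needs.
    amounts = pq.read_table('events.parquet', columns=['amount'])

    # Predicate pushdown: row groups whose min/max statistics cannot match
    # the filter are skipped entirely, which reduces I/O.
    day2 = pq.read_table('events.parquet',
                         filters=[('date', '=', '2020-09-02')])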
The next topic is object storage as a data lake. Object storage is an architecture for storing and managing data at the granularity of objects. It has recently become popular as an architecture for a data lake, that is, a persistent data store for the long term. It has features in common with HDFS: it is scalable and is also able to store petabytes of data. But there are differences from HDFS. For example, object storage is often good at storing lots of small files. On the other hand, it has characteristics that are unfavorable for data processing: moving data within the storage is not lightweight, and the consistency level of some operations is often not high. Amazon S3 is one of the popular solutions and has a long history, and HDFS and other products support the S3 protocol. In the HDFS community, development of an object storage feature as part of the HDFS ecosystem started in the Hadoop 3 era; that work grew into Apache Ozone.
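For illustration, reading a dataset directly from object storage over the S3 protocol can look like this with pyarrow; the region, bucket, and path are hypothetical, and credentials are taken from the environment.

    import pyarrow.parquet as pq
    from pyarrow import fs

    # Object storage is addressed by bucket/key rather than a file system
    # hierarchy; pyarrow exposes it through an S3FileSystem.
    s3 = fs.S3FileSystem(region='ap-northeast-1')

    # Read a whole directory of Parquet objects as one logical dataset.
    events = pq.read_table('my-data-lake/events/2020/09/', filesystem=s3)

One design consequence of what was said above: there is no single metadata master here, which is part of why object stores tolerate many small objects better than HDFS, while moves and renames within the storage are correspondingly heavier.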
Next, I'll talk about the rise and fall of open source software. Some open source software slowed down its maintenance compared to its first-release era, and some changed its main features. Around Apache Hive, many competitors appeared, but its development is still active. On the other hand, for example, Apache Pig's final release was in 2017, and as for Apache Mahout, its main feature is now a distributed linear algebra framework with a mathematically expressive Scala DSL. Apache Giraph's release cycle slowed down. So a product that depends on a particular product tends to be affected by the deterioration of the products it depends on; for example, the products above seem to have been affected by the rise of Apache Spark and the decline of Hadoop MapReduce. Meanwhile, SQL-like languages became the standard way to drive data processing engines combined with a data lake, along with DataFrame APIs; over the long term, many users did not adopt extremely idiosyncratic languages. Also, low-layer technologies such as HDFS and HBase are still actively developed: their architecture is evolving and the developers keep adding ambitious features.

Finally, I summarize the first part. This is our view of typical architectures combining several kinds of technologies, and these are the features such a typical architecture wants to achieve. First, we often need batch processing of massive data and storage of massive data. Second, writing and reading records with high throughput and low latency. Third, we'd like to accept lots of small data and process it in a timely, near-real-time manner. And finally, we tend to want ad hoc queries for analytical purposes. In short, we desire an ideal system with scalability, high availability, low latency, high throughput, efficient sequential access, efficient random access, low cost, and so on. However, it is not easy to realize all of this with one technology, so we often integrate several kinds of products to balance the trade-offs.

This is an example of those trade-offs. These approaches often force users to accept complexity, and they are sometimes too tricky to use appropriately. So there are new demands for the data lake. We can categorize the recent demands for the data lake and the data platform into two groups: the first, on the left side, is about handling data, and the second, on the right side, is about providing features for analytics. The rest of the session is a presentation by Takuya: an introduction of related recent technologies.

In the first half, we mainly talked about the history of OSS around big data. In the second half, I'd like to talk about related recent technologies. Firstly, I'll talk about the requirements for the recent storage layer. Secondly, I'll explain a use case behind these requirements. Thirdly, I'll talk about the challenges for an architecture that realizes the requirements. Furthermore, I'll introduce three OSS products that address the challenges. Finally, I'll mention considerations and conclusions.

On this slide, I'll talk about the traditional requirements for a storage layer that processes big data. There are four main requirements: scalability, usability, affordability, and availability. As for scalability, users want to be able to add resources easily. As for usability, users want to be able to handle big data easily. As for affordability, users want to be able to process big data by arranging inexpensive resources. As for availability, users want to be able to continue the service without losing data even if one unit goes down. We believe that users dealing with big data will continue to have these demands in the future.

Now I will mention the recent requirements for a storage layer processing big data. There are two main requirements: stream processing and various analytics. As for stream processing, the storage layer needs to keep receiving the large amounts of small data that arrive constantly, for example IoT data. As for various analytics, there are four requirements. First, use of real-time data: users may want to analyze real-time data that has just arrived. Second, combination with historical data: users may want to analyze that data together with a large amount of historical data. Third, low latency: users may want to get results quickly. Finally, trial and error: users may want to try various kinds of queries against the same dataset, for example for machine learning. In the upcoming slides we consider real-time analytics based on these requirements.

Next, let's consider an example of a use case that requires real-time analytics. As a generalized use case, we introduce an example of delivering information to stores and users in real time. Let's look at the data flow step by step. First, we accumulate behavioral history and operation history by batch and stream processing in advance. Next, we execute machine learning and create a model that is easy to handle in online processing. That is the part prepared in advance. From here on, it is real-time analytics. When a user with a smartphone approaches a store, the system receives this information via a stream. Then the system analyzes the user's historical data, ranks the customer service policies, and sends a recommendation to the store. As a result, the user receives smooth service. Besides that, the system analyzes inquiries from users and pushes useful information to them in real time. By repeating this process, we can continue to provide useful information based on fresh data to stores and users.

Next, let's consider the challenges of an architecture realizing such real-time analytics. I present two typical architectures. The first is the batch-focused architecture. In a batch architecture, we use a batch ETL pipeline to collect and process data. However, there is a problem: we cannot analyze real-time data, because a long time passes between collecting the data and using the data. The second is the stream-focused architecture. In a stream architecture, we use a data hub to collect data, and we use a stream pipeline to visualize data and send notifications. However, there is a problem: we cannot handle large amounts of historical data, because a stream architecture does not expect queries over as much data as batch does, and it is difficult to execute ad hoc analytic queries against a stream pipeline, because the queries in a stream pipeline are predetermined.

Now I introduce the lambda architecture, which tries to meet these challenges. The lambda architecture integrates batch and stream pipelines. It consists of three layers: the batch layer analyzes large amounts of data with batch processing, the serving layer provides the batch processing results, and the speed layer analyzes what is happening in real time. However, while the lambda architecture looks perfect, it has the following concerns. First, to analyze the batch and stream processing results together, it is necessary to integrate the results of two systems, but it is very difficult to keep them consistent. Second, the complexity of the pipelines increases the cost of operating and updating the system.

Here is the essence of what has come out so far. We think there are four essential properties for a storage layer that realizes real-time analytics. First, the storage layer must be able to take input from both batch and stream processing. Second, the storage layer must be able to accumulate and utilize both real-time data and historical data. Third, the storage layer must be able to execute various analytics queries, for example machine learning and ad hoc queries. Finally, the storage layer must be able to realize a simple pipeline, to optimize costs.

Next, I introduce recent open source storage layer software for real-time analytics. As explained in the first half, the big data world is full of old and new products that can process big data, and of course there are many products not shown here. Even in this situation, we need to pick the proper products. This time, as recent storage layer OSS products for real-time analytics, I would like to briefly introduce Apache Iceberg, Apache Hudi, and Delta Lake.

First, I'd like to introduce Apache Iceberg. I think the main features of Iceberg are its transactional capability and its multi-format support. The transactional feature allows users to analyze a consistent state of the data; besides, users can easily reproduce an analysis and run analyses against past datasets. The multi-format support allows users to choose the format that fits their requirements.
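As a minimal sketch of these two features, assuming Spark 3 with the Iceberg runtime jar on the classpath; the catalog name, warehouse path, table, and snapshot id are all hypothetical.

    from pyspark.sql import SparkSession

    # Register an Iceberg catalog backed by a Hadoop warehouse directory.
    spark = (SparkSession.builder
             .config('spark.sql.catalog.demo',
                     'org.apache.iceberg.spark.SparkCatalog')
             .config('spark.sql.catalog.demo.type', 'hadoop')
             .config('spark.sql.catalog.demo.warehouse', 'hdfs:///warehouse')
             .getOrCreate())

    # The file format is chosen per table (Parquet here; ORC and Avro
    # are also supported).
    spark.sql("CREATE TABLE demo.db.events (id BIGINT, ts STRING) "
              "USING iceberg TBLPROPERTIES ('write.format.default'='parquet')")
    spark.sql("INSERT INTO demo.db.events VALUES (1, '2020-09-28')")

    # Every commit creates a new snapshot; reading an old snapshot id
    # reproduces a past analysis exactly (time travel).
    old = (spark.read.format('iceberg')
           .option('snapshot-id', 123456789)   # hypothetical snapshot id
           .load('demo.db.events'))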
Next, I'd like to introduce Apache Hudi. I think the main feature of Hudi is its multiple views for different purposes. Users can use the read-optimized view when they want to reduce read latency. Users can use the real-time view when they want to analyze the most recent data. And users can use the incremental view when they want to consume only the update differences.

Finally, I'd like to introduce Delta Lake. I think the main features of Delta Lake are its transactional capability and its rich DML commands. As with Iceberg, the transactional function gives users consistency and reproducibility of analytics. In addition, users can comply with requirements such as GDPR by using the rich DML commands, for example DELETE, UPDATE, and MERGE.
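Here is a minimal sketch of those DML commands with the Delta Lake Python API on Spark, assuming the delta-core package is available; the path and the data are hypothetical.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (SparkSession.builder
             .config('spark.sql.extensions',
                     'io.delta.sql.DeltaSparkSessionExtension')
             .config('spark.sql.catalog.spark_catalog',
                     'org.apache.spark.sql.delta.catalog.DeltaCatalog')
             .getOrCreate())

    # A write creates columnar data files plus a transaction-log entry.
    df = spark.createDataFrame([(1, 'alice'), (2, 'bob')], ['id', 'name'])
    df.write.format('delta').mode('overwrite').save('/lake/users')

    users = DeltaTable.forPath(spark, '/lake/users')

    # Rich DML: a GDPR-style erasure request becomes a single DELETE.
    users.delete("id = 1")
    users.update(condition="id = 2", set={'name': "'robert'"})

    # MERGE upserts a batch of changes in one transaction.
    changes = spark.createDataFrame([(2, 'bobby'), (3, 'carol')],
                                    ['id', 'name'])
    (users.alias('t')
     .merge(changes.alias('s'), 't.id = s.id')
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())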
Then I'd like to show you a comparison of the internal approaches of these three products. First, I'd like to focus on the concept of the data store layer and the processing. All three products share the same layer structure and processing concept: Iceberg, Hudi, and Delta Lake are all located in a layer between the application and the data store. We tentatively call this kind of software "storage layer software".

Generally, a data store provides only simple functions to applications, while storage layer software provides a logical dataset, or table, to the application as a useful feature. Applications transparently read from and write to the data store using convenient functions: when an application requests a read or write through the storage layer software, the software reads and writes not only the actual data but also management information. In this way, applications can use the data store with ease. On the other hand, the storage layer software has no data store and no daemon process of its own; it delegates everything related to scale to data stores such as HDFS or object storage in the cloud. In this way, these three products ensure scalability.

Next, I'd like to focus on the formats used by these products. Apache Hudi and Delta Lake mainly use a columnar format for efficient reading during analysis. Apache Iceberg supports multiple formats (Parquet, ORC, and Avro) so that the user can choose. These products keep read efficiency by using formats that are effective for analytics.

Finally, I'd like to focus on the file structure and the read/write mechanism of these products. The file structure and read/write mechanism differ between Hudi and the other two products; I will talk about the difference between Delta Lake and Iceberg on the next slide. First, I'd like to explain Hudi's merge-on-read. Hudi holds two main types of files, called the base file and the append log. When writing, the writer adds data to the row-oriented append log, and this file is compacted into a column-oriented base file at regular intervals. The reader reads only the base files when the user wants to read with low latency, as the read-optimized view; the reader reads both the base files and the append logs when the user wants to read real-time data, as the real-time view. In this way, Hudi realizes various kinds of analytics through the combination of multiple views.
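A minimal sketch of a merge-on-read table and its two views with the Hudi Spark datasource follows; option names vary slightly across Hudi versions, and the path, table name, and fields are hypothetical.

    from pyspark.sql import SparkSession

    # Assumes the Hudi Spark bundle is on the classpath.
    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([(1, 'click', '2020-09-28T12:00:00')],
                               ['id', 'event', 'ts'])

    # Merge-on-read: new records land in row-oriented append logs that are
    # periodically compacted into columnar base files.
    (df.write.format('hudi')
     .option('hoodie.table.name', 'events')
     .option('hoodie.datasource.write.recordkey.field', 'id')
     .option('hoodie.datasource.write.precombine.field', 'ts')
     .option('hoodie.datasource.write.table.type', 'MERGE_ON_READ')
     .mode('append')
     .save('/lake/events'))

    # Read-optimized view: reads base files only, for low-latency scans.
    ro = (spark.read.format('hudi')
          .option('hoodie.datasource.query.type', 'read_optimized')
          .load('/lake/events'))

    # Real-time (snapshot) view: merges base files with the append logs
    # so the most recent records are visible.
    rt = (spark.read.format('hudi')
          .option('hoodie.datasource.query.type', 'snapshot')
          .load('/lake/events'))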
Next, I'd like to explain Iceberg and Delta Lake. Both of these create management information files together with the actual data files: when the writer writes, the version and other details are described in the management information, and the reader identifies the necessary data files from the links in the management information and reads them. Let's take a closer look at each product on the next slide.

I shall explain the difference in mechanism between Iceberg and Delta Lake. Iceberg manages data files using files called the manifest list and manifests: the reader traces the manifests from the manifest list of a snapshot to identify the required files. Delta Lake manages data files using files called delta logs: the reader identifies the required data files from the delta logs. Delta Lake aggregates all the delta logs into a checkpoint to prevent the delta logs from growing without bound. Both products also have a command to aggregate small data files.

Finally, I'd like to talk about considerations and conclusions. From here, I'd like to carry out a consideration based on the internal approaches. In the first half, we mentioned that products processing big data were strongly shaped by particular trade-offs. The recent storage layer software products take an approach that strikes a balance between these trade-offs, addressing both sides. There are three points. The first is to write quickly; the second is to analyze efficiently; the third is to fill the gap between those two. First, in order to write quickly, Hudi first writes to a row-oriented data format, while Iceberg and Delta Lake write the version information at the same time as the data. Second, in order to read efficiently, each product adopts an efficient columnar format. Third, in order to fill the gap between the trade-offs, each product comes with a variety of formats and views, and also has background file aggregation.

I'd like to talk a little more about the future focus. So far, we have realized data utilization pipelines by combining specialized pieces of software. Recent OSS products for big data processing have evolved in the direction of balancing trade-offs that existed in the past, and they have created new value such as real-time analytics. As this evolution continues, we expect that the interfaces will become simpler while the internal mechanisms become more complex. After that, we think new software will be born that takes advantage of newly available hardware to solve the next challenges.

Finally, I'll give a summary of this session. OSS and architectures have evolved to meet the needs of the times, and the range of data utilization is expanding. Recent big data processing OSS has evolved to balance the trade-offs that existed in the past, using various approaches to meet new requirements such as real-time analytics. As a result, the internal structure of these products has become complicated, and it has become difficult to select the optimal architecture and the optimal product. As architects, we should continue to develop our sense and our skills to realize architectures that meet new requirements. If you face any problems in this technical field, let's discuss them together. Thank you for your attention.