I'm Masaki Kimura from Hitachi. I'm very happy to have this opportunity to talk to you today. You may interrupt me at any point during my talk, and you will also have time for questions at the end. There is one important thing before we start the presentation. I'm not a native speaker of English, so please try to ask me questions slowly and clearly, in a loud voice. Thank you.

Okay, let's start. This is the content of my presentation. First, I will share the background: the use cases of SCSI commands from guests and the requirements from these use cases. Second, I will share the KVM features for guest SCSI commands, and then I will share the current status of these features. After that, I will share the summary and future work.

Okay, let's start with the background. Enterprise systems expect a virtualized environment to have the same level of manageability, availability, and reliability achieved on bare metal: for example, thin-provisioned storage for manageability, HA clusters for availability, and backup servers for reliability. In a bare-metal environment, some of these requirements are achieved by using storage features such as SCSI commands. In a virtualized environment, the same use cases exist for guests. Therefore, issuing SCSI commands from guests is required. I will explain three use cases, thin-provisioned storage, HA cluster, and backup server, in the next slides.

The first use case is thin-provisioned storage. Many types of enterprise storage have thin provisioning functions. With thin provisioning, a disk block is allocated on access. However, once it is allocated, it cannot be reclaimed by the storage automatically, even when the disk block becomes unused by the OS. This is a waste of disk blocks, because if we could reclaim the disk block, other systems could use it. So, to reclaim and reuse disk blocks, the OS needs to let the storage know which blocks are unused by issuing WRITE SAME SCSI commands to the storage. Once the storage can see which disk blocks are unused, it can reclaim the disk space. This use case exists for both bare metal and KVM
guests. To achieve this with KVM guests, the guests also need to issue WRITE SAME SCSI commands to the storage. Therefore, WRITE SAME is required to be issued to storage from guests.

The second use case is the HA cluster. To improve availability, HA clusters are commonly used on bare metal. For an HA cluster, the persistent reservation SCSI command is generally used to guarantee exclusive access from the active system. This is the left-hand side figure. An HA cluster consists of an active system and a standby system. Once the active system fails, it fails over to the standby system. Both the active system and the standby system need to share the data, so they share the same LU. However, the standby system must not access the LU while the active system is using the data, because it might corrupt the data on the LU. So we need exclusive access from the active system. To guarantee exclusive access, persistent reservation is used. Once a persistent reservation is held by the active system, I/Os from the standby system are blocked by this storage feature. This use case also exists for KVM guests. To build an HA cluster in the guest OS, the guest OS needs to issue persistent reservation SCSI commands to the LU. Therefore, persistent reservation is required to be issued to storage from guests.

There is one more requirement from the HA cluster: the I_T nexus is required to be unique. Let me explain more. A persistent reservation is held by the so-called I_T nexus, the combination of initiator ID and target ID. Please see the left-hand side figure. The HBA on the server has an initiator ID, and the storage has a target ID. By initiator ID and target ID, I mean the WWN in the Fibre Channel case and the IQN in the iSCSI case. Once a persistent reservation is issued to the storage, the storage holds the combination of initiator ID and target ID. In this case, the persistent reservation is issued from initiator 1 to target A, so the storage stores the combination of initiator 1 and target A. I/O from the standby system is blocked because the I_T nexus is different. In this case, the I/O comes from initiator 2, so I/O from the standby system is
blocked. Therefore, the I_T nexus is required to be unique for persistent reservation to work properly. If the standby system shares the same initiator ID, I/O from the standby system is not blocked, and this is not the expected behavior. The same thing happens in the KVM case. If you set up an HA cluster between guests, the I_T nexus is required to be unique, so the initiator IDs across guests should be unique.

The third use case is the backup server. Persistent reservation is also used by backup server products to guarantee exclusive access from the backup server during backup. Therefore, persistent reservation to storage and a unique I_T nexus are required by these products. Let me explain with these figures. Business systems have a lot of servers, and they need backups. To take backups easily, it is common to use a backup server. The backup server shares LUs with the business systems, and to take a consistent backup, exclusive access from the backup server is required during backup. To guarantee exclusive access, a persistent reservation is held by the backup server. This use case also exists for KVM guests. In this case, all the business systems and the backup server are on the same server. Persistent reservation needs to be issued to storage from the guest. Also, the initiator ID needs to be unique.

This is the summary of the requirements. There are two requirements from the three use cases. The first requirement is SCSI commands to storage from guests, because thin-provisioned storage requires WRITE SAME to storage from guests, and HA clusters and backup servers require persistent reservation to storage from guests. The second requirement is a unique initiator ID across guests, because HA clusters and backup servers require the I_T nexus to be unique.

Let's quickly review which KVM features handle guest SCSI commands. This presentation focuses on the following three device types and their configurations. The device types we are focusing on are virtio-blk, virtio-scsi, and PCI device assignment. Each device type has several configurations. There are eleven configurations in total. I will explain these configurations in this order. The first device type is virtio-blk. I
use this kind of figure to explain and compare these configurations. virtio-blk is a paravirtualized disk, and it is shown as vd devices in the Linux case. To make a host device visible to the guest OS, virtio-blk in the QEMU device layer works together with the virtio-blk driver in the block layer of the guest kernel. The maximum number of disks is limited by the maximum number of PCI devices, because each virtio-blk device occupies one PCI device in the guest, so the maximum number is 32. virtio-blk is also improving its performance with virtio-blk data plane and BIO-based I/O. The details will be explained by Mr. Core from IBM. Oh, yes, thank you, I will go to your session to see the details.

Moving on to the next slide. virtio-blk has three configurations depending on the backend device: we can attach a file as the disk, a block device as the disk, or a LUN as a LUN. The point of this slide is that SCSI commands from the guest reach the storage only when attached as a LUN. In the other configurations, SCSI commands are blocked by the QEMU device layer, which returns an error to the guest OS. I have summarized how to configure each case in the table below.

Next, I'd like to explain virtio-scsi. virtio-scsi has three types of configuration: QEMU target, LIO target, and libiscsi. The QEMU target and the LIO target both work as targets, and the difference between them is whether they work in user space or kernel space. libiscsi works as an iSCSI initiator, and it talks directly to the storage. I will explain these configurations in detail from the next slide.

The first configuration of virtio-scsi is the QEMU target. First, I'll explain virtio-scsi itself. virtio-scsi is a paravirtualized SCSI transport working under the SCSI module, and the guest sees the device as a SCSI device. Therefore, it is shown as an sd device in the Linux case. The QEMU target is a user-space target. Just like virtio-blk, virtio-scsi with the QEMU target has three types of configuration: attaching as a file, attaching as a block device, and attaching as a LUN. SCSI commands from the guest reach the storage only when attached as a LUN. In the other configurations, an emulated result is returned by the QEMU target, so the guest sees the emulated result from
the QEMU target, so SCSI commands from the guest do not reach the storage in the file case and the block case. I have also summarized how to configure each case in this table.

The second configuration for virtio-scsi is the LIO target. The characteristics of virtio-scsi are the same as with the QEMU target; the difference is that the LIO target works as a kernel-space target, and it uses vhost as the backend. LIO supports the following backstores: block, fileio, pscsi, and ramdisk. As for SCSI command capability, I'm sorry, I haven't evaluated it enough yet. As far as I tested by attaching as block, an emulated result is returned to the guest, so SCSI commands do not reach the storage with block. pscsi is pass-through SCSI, so it is expected to work well, but I haven't been able to create a configuration with pscsi. If you know how, please share it with me.

The third configuration for virtio-scsi is libiscsi. libiscsi is a user-space iSCSI initiator, so it only supports iSCSI. QEMU talks directly to the iSCSI storage, so the host does not see the guest disks. As for SCSI command capability, SCSI commands from the guest reach the storage in this configuration.

I've explained virtio-blk and virtio-scsi, and this is the last device type, PCI device assignment. PCI device assignment has two configurations, legacy and VFIO. Both legacy and VFIO assign a PCI device to a guest, and the host PCI device is dedicated to one guest. Therefore, the number of guests is limited by the number of PCI devices. As for SCSI command capability, SCSI commands from the guest reach the storage in both the legacy and VFIO configurations.

This is the summary of SCSI command capability. I have explained eleven configurations across virtio-blk, virtio-scsi, and PCI device assignment. Of all eleven configurations, only five can issue SCSI commands to storage: virtio-blk with LUN, virtio-scsi with QEMU target with LUN, virtio-scsi with libiscsi, and PCI device assignment with both legacy and VFIO. In the next chapter, we will look deeper at SCSI command capability, only for the configurations marked as
"yes" in the table above.

Now, the current status of these features. I have set three evaluation items based on the requirements explained in chapter 1. The first item is whether SCSI commands reach the storage, from requirement 1. The second item is whether a unique initiator ID is assigned, from requirement 2. The third item is whether SCSI commands return proper results. I have evaluated these items with the five configurations explained in the last chapter. From the next slide, I will share what problems remain in which configuration.

The first evaluation item is whether SCSI commands reach the storage. There is a problem in the permission check. There is a permission check in the host kernel when a guest SCSI command is issued with virtio-blk or with virtio-scsi with the QEMU target. The permission check consists of three checks. The first check is whether the QEMU process has the CAP_SYS_RAWIO capability. libvirt-managed KVM guests run as the qemu user, so they lack CAP_SYS_RAWIO. The second check is whether the SCSI command is a read-OK command. The third check is whether the SCSI command is a write-OK command and the process has write permission. The INQUIRY and REPORT LUNS SCSI commands are read-OK commands, so they are allowed to be issued even when the process does not have CAP_SYS_RAWIO. However, persistent reservation and WRITE SAME are neither read-OK nor write-OK commands, so they are not allowed to be issued if the QEMU process does not have CAP_SYS_RAWIO. They are blocked by this check unless KVM is running as the root user.

To solve this issue, the following patches have been submitted by Mr. Paolo Bonzini since 2012. There have been many discussions about this topic; however, none of the patches has been merged yet. The idea is to introduce a flag that allows non-root users to issue SCSI commands. The concept is very simple: if this flag is set, all SCSI commands can be issued to the storage, and the flag can be set through the sysfs interface. If this kernel patch is merged, KVM guests running as the qemu user will be configurable to issue any SCSI command to storage. One of the reasons why these patches have not been merged yet is that it is still under discussion how to avoid the opcode overlap problem. Let me explain this
problem. Different device classes share the same opcode. For example, the READ SUB-CHANNEL SCSI command shares the same opcode, 0x42, with UNMAP for disks. There is no problem when it is used only in a host. However, if it is used in a guest, there is a problem, because a command intended as READ SUB-CHANNEL in the guest might be interpreted as UNMAP in the host. Destructive commands might pass through to the host from the guest. To avoid this problem, there is a discussion on the implementation: splitting the permission check by device class, or using a per-device filter with unpriv_sgio. This discussion is still continuing, so this kernel patch has not been merged yet.

The next evaluation item is whether a unique initiator ID is assigned or not. There is a problem when both guest 1 and guest 2 are on the same host and use the same HBA. They share the same initiator ID when the configuration is virtio-blk or virtio-scsi with the QEMU target. In such a condition, exclusive access is not guaranteed, because the storage cannot see which guest issued the command: the persistent reservations from both guests come from the same initiator ID. There is already a solution with NPIV. NPIV is N_Port ID Virtualization, and it can assign virtual IDs in the host kernel. By using NPIV, we can assign different initiator IDs and pass these devices to the guests, so we can assign a unique initiator ID to each guest. Therefore, we can use persistent reservation properly.

For your information, with libiscsi or PCI device assignment, exclusive access is guaranteed, because initiator IDs are unique with these configurations. Let me explain with these figures. The left-hand side figure shows virtio-scsi with libiscsi, and the right-hand side figure shows PCI device assignment. With libiscsi, each guest gets a different initiator ID by using the libiscsi initiator. Therefore, persistent reservations are issued from different initiator IDs. As for PCI device assignment, it needs to dedicate a host device, the HBA, to each guest. Therefore, a different initiator ID is assigned to each guest, so a unique initiator ID is guaranteed in this configuration. The
last evaluation item is whether SCSI commands return proper results. There is a problem with virtio-blk: REPORT LUNS returns a list of LUNs including LUNs which are not assigned to the guest. Let me explain with this figure. I assume a configuration with three LUNs attached to one HBA, and one of the LUNs is assigned to the guest. REPORT LUNS is a SCSI command which returns the list of LUNs that the OS sees. If the command is issued from the host, the storage returns three LUNs. This is the expected behavior. However, if REPORT LUNS is issued from the guest, the virtio-blk QEMU device layer just passes the SCSI command through and returns the result to the guest. Therefore, the guest gets the list of LUNs that the host OS sees. This is not the expected behavior, because the guest should only see the device attached to it. On the other hand, virtio-scsi returns the proper result by returning an emulated result, depending on the SCSI command. In the virtio-scsi case, the emulated result is produced in the QEMU layer, so only one LUN is returned to the guest. Therefore, virtio-blk needs an emulation function to return proper results for particular SCSI commands such as REPORT LUNS.

This is the summary of my presentation. Enterprise systems require SCSI commands in virtualized environments. KVM has some configurations which can issue SCSI commands to storage from guests. However, they have some restrictions. virtio-blk and virtio-scsi with the QEMU target cannot handle persistent reservation and WRITE SAME if QEMU is not running as the root user, they require NPIV to assign a unique WWN, and virtio-blk has a problem with REPORT LUNS. virtio-scsi with libiscsi and PCI device assignment can properly handle all SCSI commands and can assign a unique WWN. However, they also have restrictions, and I think they are very big restrictions: virtio-scsi with libiscsi can only handle iSCSI, and with PCI device assignment the number of guests is limited. So, as long as these restrictions are acceptable, users can choose virtio-scsi with libiscsi or PCI device assignment. However, some users cannot
accept these restrictions, such as users who want to use Fibre Channel, or who want to run many guests, with HA clusters or WRITE SAME SCSI commands. For these users, the problems with persistent reservation and WRITE SAME SCSI commands need to be fixed.

Future work. There are three items of future work. The first one is to allow the qemu user to issue persistent reservation and WRITE SAME; a proper permission check in the host kernel is needed. The second one is to make REPORT LUNS return a proper result with virtio-blk; an emulation function for REPORT LUNS is required. The third one is to make the SCSI command capability of virtio-scsi with the LIO target clear; evaluation is needed for virtio-scsi with the LIO target.

Any questions?

Let's go back to virtio-scsi, the QEMU target and the file case, and this one. Could you repeat the question again?

The in-kernel target. As far as I tested, I can't issue [unclear], but I expect that, since pscsi is pass-through, it can work well. But I couldn't test it well. There is a [unclear]-side interface, but the kernel-side interface is missing. Noted, I will. Thank you very much.