Hello and welcome. My name is Shannon Kemp and I'm the Executive Editor for DATAVERSITY. We would like to thank you for joining this month's installment of the DAMA International Webinar Series. This webinar series is designed to give our Enterprise Data World Conference attendees education year-round, and we are actually presenting live from Enterprise Data World 2016 — having a great event so far. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, please submit them via the Q&A panel in the bottom right-hand corner of your screen, or if you'd like to tweet, we encourage you to share your questions via Twitter using the hashtag #DAMA. If you would like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the top right for that feature. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and any additional information requested during the webinar. Now let me formally introduce today's speaker, Glenn Bell. Glenn is the Director at Visual Explanations. He has worked in data management for almost 30 years. He is the President of DAMA Australia and delivers training in CDMP and data modeling. He is an independent consultant and has worked for a variety of organizations in Canada, Europe, Malaysia, Singapore, New Zealand, and throughout Australia. He holds a Master of Business in Information Technology Management from the University of Technology Sydney and a Bachelor of Science in Computing and Mathematics from the University of Queensland. He also holds CDMP and CBIP certifications, both at mastery level. Fantastic. And with that, I will turn this presentation over to Glenn to get us started.

Hello and welcome. Thank you, Shannon. Today we're going to be drawing on my experiences with a federal government client in Australia that needed to integrate data from different sources, and needed to do that in a very controlled fashion. I have a series of slides that I'll be working through. The first seven slides set the context, the drivers, and the reasons why we have to be so careful when consolidating data. Slide number eight is actually my favorite slide: it's got a knockout diagram that I'm sure you will all love, and that's where we will focus a fair bit of attention. So if you get a little bit bored during the slides up to number seven, don't leave — it will get very, very exciting on slide number eight. The danger of consolidating data is this: if you consolidate a citizen's data from their education, their employment, maybe some health records, that combined data becomes very powerful and could be misused. It represents reputational risk for the agency consolidating the data and a breach of privacy for the citizen involved. So you might say, well, if there are these dangers involved, why consolidate data at all? Keep away from the danger and just don't do it. However, for government to ensure it is producing the outcomes for citizens that it wants to achieve, it frequently needs data that's located in more than one government agency.
Imagine, as a hypothetical example, a citizen with disabilities. The government puts in place a program to provide support for that person, aiming for improved outcomes for that citizen: improved employment, improved educational outcomes, better health outcomes where possible, and a general increase in integration with the community. No one agency has all of that data, so to measure the efficiency and the effectiveness of that policy, you need to consolidate the data. That then allows us to make policy decisions that are informed and make the best use of public money. So we really do need to engage in the dangerous activity of consolidating data from different sources to ensure we are getting the best outcomes for citizens in a cost-effective manner. What I'm going to explore in this presentation is how to manage the dangers and the risks involved. The Australian government has recognised, as have many governments around the world, both these dangers and the fact that it still needs to embark on this activity. So within Australia they have established what's called Integrating Authority accreditation. This is not law; there is no act of parliament behind it. What has occurred is that the heads of the major Australian government departments, known as secretaries, have come together in a committee, recognised that they need to consolidate data, acknowledged the dangers, and developed the accreditation program to give comfort to those leading their agencies that all the controls are in place to protect citizen privacy when consolidating data. I'm going to step you through an example of an Integrating Authority accreditation submission. In essence, the key elements include having the people who do the consolidation work perform different roles, with those roles granting different access to the data at various stages along the integration path. The implementation can be done in a cheap and manual way — my experience with this government agency was the cheaper, manual approach. We had paper-based logs, with managers physically sitting in the room while people performed the various roles, as a check to ensure nothing bad happened. However, you can also go down the expensive and automated route: more mature agencies use automated monitoring and logging of what occurs during the integration process, and sophisticated mechanisms for controlling access by role, using virtual machines and more advanced concepts like that. So when I step you through the fantastic diagram on slide eight, it's technology-independent, but you can implement it at different levels of expense. We've discussed the dangers involved and the continuing need to consolidate data — so how do you go about it? Essentially, a whole set of policies and procedures surrounds the consolidation of data if you're to get the integration accreditation, giving comfort to the agency that things are being done properly and with auditability. There will be different roles for loading, separating, linking and analysing the data, and my diagram on slide eight will highlight that. There are also costs involved.
There are increased audit resources, both internal and external, for ensuring that data is secure and the policies are being followed, and even in the cheap-and-cheerful approach you still have to do work around software. In the example I participated in, rather than a virtual machine they used folder-level locking to control who could write to and who could read from a particular folder, but even that requires effort. The tension is that the easiest thing in the world, if an agency wants to consolidate data, is to get some extracts from other government agencies, give them to one person, and have them just work with it and put it all together. But unfortunately that would expose all sorts of dangers if that person went rogue. As an aside, let me illustrate how important these controls are with an example from the Australian government. This is publicly available — it's been in the newspapers, so I'm not giving away any secrets. The Australian Bureau of Statistics, probably the most sophisticated, gold-standard agency for integrating data, has very strict controls, procedures and culture around protecting data. Unfortunately, one person went rogue. They were involved in releasing quarterly economic data, and they had been at university with a friend who wound up working for one of Australia's major banks, the National Australia Bank. Prior to the quarterly results being released, this person would SMS the figures to his trader friend at NAB, and the trader would take advantage of that and put in trades that made lots of money. Now, the National Australia Bank's risk controls quickly highlighted that this particular trader's returns were off the scale, and alerted the Australian Federal Police and the Australian Bureau of Statistics. It didn't take too long for them to figure out what had happened. Both of the individuals involved went to jail for years; the longer term went to the National Australia Bank person, who was also found guilty of bribing a public officer. For me, one of the more important outcomes is that the Australian Federal Police also reviewed the ABS's procedures and controls, and they were found to be fine. What I mean is that there was no systemic risk: all the controls you could possibly want were in place, along with the culture and the training and everything around it. But one person went rogue, possibly with a mental health issue. The Australian community could take satisfaction that if something bad happens, the individuals are penalised and jailed for a significant period, and that the controls in place ensure this is not a systemic problem and should never happen again. The sorts of things I'm going to show you on the infamous slide eight look at how you can ensure the systems are in place to prevent, as far as humanly possible, a breach occurring. Now, the Integrating Authority procedures apply when there are at least two data sets — fair enough, you can't consolidate one data set with itself — and the purpose is statistical and research work. There are other powers that control consolidating data for investigating fraud or other operational matters.
They also apply where the data is subject to the Privacy Act — so it's personal information, with the potential to compromise Commonwealth outcomes — and where it occurs at the unit-record level, that is, data about individual people. On slide seven — one more to go before slide eight — the accreditation sets out criteria for evaluating the suitability of the agency's controls, to see whether it is worthy of Integrating Authority accreditation. Criterion one, which is really super important, is the ability to ensure secure data management: linkage is not performed directly — that is, it's not just one person in control of these files and consolidating them, but the work is separated out; the data separation principle is adhered to, which I'll talk about on the next slide; and when you finally get to the point of analysis, the analysis files do not contain identifying data such as name, date of birth, address, or the identifier from the original data set. So you will now see the much-heralded slide eight, which gives you the end-to-end process for controlling consolidation. You will see three large shaded boxes forming the background, labelled the librarian, the linker, and the analytic. These are the roles that are performed. They could be different people, or, if it's the same person, they can only be in one role at any point in time. We'll start with the librarian role. Actually, I'll tell you what: I'll give you a brief overview first, and then I will drill down, as John Zachman would say, to excruciating levels of detail to make sure everything's perfectly clear. The librarian is the one who receives the data sets that need to be consolidated. They put each data set through an extract and standardisation process, and they then separate it into the analysis data — you'll see analysis file 1 on the right-hand side — and the match file data, the data that will be used to match the individuals involved. The match file data — I'm now looking at the linker box — enables the linking of the data into a link file, and then eventually the two analysis files in the analytic box get matched, and you get a consolidated analysis file that has no identifying information. It is completely de-identified. I'm just going to use the pinch feature on my Surface Pro 3, and I'm also checking Shannon's screen — it looks like you can see what's going on here. So let's look at the initial steps involving the librarian. The librarian receives the original data set, but before they can do that, a set of approvals needs to occur. The external agency providing the data set needs to approve, and the request from the agency doing the consolidation also needs to go through approvals. In the Commonwealth Government we have what's called the Senior Executive Service (SES) level — the leaders of the organisation — and they need to understand that this linkage is going to occur and approve receiving the data from the external agency.
So there is full visibility that this is happening, and as a data management person I would also have the involvement of the Data Management Committee or Data Management Board, or whatever governance structure is in place, to provide that SES officer with the guidance that yes, this consolidation work is required. The trigger for the work may have been some analysis of the performance of the organisation or of government; it could be something such as a ministerial question. Step number two, before we even get started with the librarian and original data set number one, is to ensure that all the people involved in the project, whatever role they're doing — librarian, linker, or analytic — have no conflicts of interest. In my scenario, we developed a no-conflict-of-interest form the person had to sign, and we would then scan it and store it in the document management system. That form also made people aware that under the act governing this agency, the penalty for compromising data was two years in jail. So people were very clear that this is serious business and you will be prosecuted if you misuse this information. Step number three, before the project begins under the Australian government's integration accreditation, is to inform the community. The National Statistical Service (NSS) has set up a website — you can go to it at nss.gov.au — and any integration projects happening at any accredited agency are listed there, so that there is transparency to the community that this is occurring. The next step, finally, is to receive the data set. Obviously this can't be sent as an attachment to an email. We used a secure FTP product to ensure the data was encrypted, even though we were behind government firewalls, just to make absolutely sure there could be no tampering with the data. Okay, we're now back to our diagram, looking at the librarian's role. You'll see the original data set has a record ID — the record ID for this data set as held by the original agency, the agency that provided the data. That has to be stripped out before the data can go through to the analysis stage, because if you got hold of that record ID while looking at consolidated data, you could go back to the original data set and you would know who the person was. The other things in the original data set are the data linkage items — the things used to try to determine whether two records are the same person. In our example we used what are known as statistical linkage keys, or SLKs, which take various characters from your first name, surname and date of birth. It's a pretty good guide — not perfect, nothing is perfect when you're trying to match data — but a fairly good guide for seeing that two records are the same person. So we need those data linkage items. If you're interested in statistical linkage keys, there's an excellent explanation on Wikipedia. That data gets extracted into a landing area — you'll see the blue box — and it gets standardised. By that I mean things like Rob, Robert, Robbie and Bob all get standardised to one standard name, to give the statistical linkage key its best chance of matching people. The same goes for things like sex: male and female may have been coded 1 and 2 in the source system, so you apply a translation to put the data into a standard form, again to help with the matching.
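To make the standardisation and linkage-key steps concrete, here is a minimal Python sketch assuming an SLK-581-style recipe (2nd, 3rd and 5th letters of the surname, 2nd and 3rd letters of the given name, date of birth, sex). The nickname table, sex codes and padding rule are illustrative assumptions, not the agency's actual reference data.

```python
# Minimal sketch of the librarian's standardisation step and an
# SLK-581-style statistical linkage key. All reference tables below
# are illustrative assumptions.

NICKNAMES = {"rob": "robert", "robbie": "robert", "bob": "robert"}
SEX_CODES = {"1": "M", "2": "F", "male": "M", "female": "F"}

def standardise_name(name: str) -> str:
    """Map nicknames to a standard form (Rob/Robbie/Bob -> Robert)."""
    name = name.strip().lower()
    return NICKNAMES.get(name, name)

def pick(name: str, positions) -> str:
    # Take the requested letters; pad with '2' when the name is too
    # short, following the SLK-581 convention.
    return "".join(name[p] if p < len(name) else "2" for p in positions)

def slk(given: str, surname: str, dob: str, sex: str) -> str:
    """dob as DDMMYYYY; returns a 14-character statistical linkage key."""
    given = standardise_name(given).upper()
    surname = standardise_name(surname).upper()
    return (pick(surname, [1, 2, 4])      # 2nd, 3rd, 5th letters of surname
            + pick(given, [1, 2])         # 2nd, 3rd letters of given name
            + dob
            + SEX_CODES.get(sex.lower(), "9"))

# "Bob Smith" and "Robert Smith" now yield the same key:
assert slk("Bob", "Smith", "01021980", "male") == \
       slk("Robert", "Smith", "01021980", "1")
```

The point of the standardisation pass is exactly what the assert shows: two source records with superficially different names and sex codes collapse to the same key, giving the matching step its best chance.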
The third step performed by the librarian is to separate the data. You'll see they separate out data that goes into the linker area — I'm just moving my pointer along and checking Shannon's screen; yes, you can see my mouse. So you see match file number one, and — sorry, I should have said, back at the librarian, source file number one over here. We've put in a project ID, replacing the original identifier with an identifier created uniquely for this project, plus the data linkage items. Linkage check items are things such as country of origin, or other attributes that can help with the matching but aren't necessarily part of an SLK, a statistical linkage key. Demographic items are the things we want to analyse — where this person lives, say, or their salary, or whatever the demographic items are — plus some other analysis items. Now, in the match file — this part here — when the person performing the role of the librarian separates the data and creates the match file, in our example we used folder-level locking: the librarian could only write the match file; they could not read anything in the folder the match file goes into. Only the person performing the role of linker — that is, someone who had signed on to the system with the user ID and password giving them authority to enter the linker folder — could see match file number one, and more importantly, the link file that gets created. The other thing that comes out of the separation — moving over to the analytic area — is analysis file number one, which has just the demographic items and the analysis items. So the match file holds the material for matching, and the analysis file holds the material for performing analysis: demographic and analytical items. Again, the librarian can only write to the analytic folder, not read from it, because if they could read it, they'd be able to see the consolidated analysis file — you'll see it down here — and that would compromise things, since the librarian has access to the original file. So match file number one has been created: it's got a project ID, data linkage items, and linkage check items to help link people together. Then we receive a second data set, and the librarian follows the same process: extracting, standardising, removing the record identifier, putting in a project identifier — this is project ID number two — plus linkage items and demographic items. And they do the same thing: they separate the data into the data that will be used for matching and the data for analysis. You can see we've got a match file here, and over on the purple bit, down here, we've got the demographic items.
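Here is a minimal sketch of the librarian's separate step as just described, assuming simple CSV files; the column names and the use of random project IDs are illustrative assumptions.

```python
import csv
import uuid

# A sketch of the librarian's "separate" step. All column names are
# illustrative; real datasets would differ.
LINKAGE_ITEMS = ["given_name", "surname", "dob", "sex"]     # feed the SLK
LINKAGE_CHECK_ITEMS = ["country_of_birth"]                  # extra checks
ANALYSIS_ITEMS = ["postcode", "salary", "outcome_measure"]  # for analysts

def separate(source_path, match_path, analysis_path):
    """Split one source file into a match file (linkage data only) and an
    analysis file (demographic/analysis data only). The original record
    ID is dropped and replaced by a project-specific random ID, which is
    the only field the two output files share."""
    with open(source_path, newline="") as src, \
         open(match_path, "w", newline="") as m, \
         open(analysis_path, "w", newline="") as a:
        match_out = csv.DictWriter(
            m, ["project_id"] + LINKAGE_ITEMS + LINKAGE_CHECK_ITEMS)
        analysis_out = csv.DictWriter(a, ["project_id"] + ANALYSIS_ITEMS)
        match_out.writeheader()
        analysis_out.writeheader()
        for row in csv.DictReader(src):
            # Project ID: traceable only within this project.
            pid = uuid.uuid4().hex
            match_out.writerow(
                {"project_id": pid,
                 **{k: row[k] for k in LINKAGE_ITEMS + LINKAGE_CHECK_ITEMS}})
            analysis_out.writerow(
                {"project_id": pid, **{k: row[k] for k in ANALYSIS_ITEMS}})
```

In the real setup, match_path would sit in the linker's folder and analysis_path in the analytic folder, with the folder permissions — not the code — enforcing who can read what.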
All right. Then the person performing the role of the linker comes in. They access their folder, where they've been provided with match file one and match file two. They don't know any demographic or analysis information about these people; they have only the information used for linking, for identifying who's who. They run the link process, generating statistical linkage keys and matching people up, and create what's called the link file. So now we have project ID number one, project ID number two, and a link ID we've allocated — a random number saying this person here and this person over here, from two different data sets, are the same person, and they're linked together. That takes us through to the analysis portion. The person doing the linker role can write that link file into the analytic section, but they can't read anything in the analytic section. The person doing the analytic part then runs a match program based on the link file, matching the demographic data to produce a consolidated analysis file that is de-identified but has rich information about people — de-identified, and also confidentialised, which I'll come to. (A sketch of these linker and analytic steps appears below, after the remaining criteria.) That's the thrust of what I'm talking about. I'm expecting a few questions on slide number eight, but I'll just finish off the slide presentation and then open it up for questions. Shannon, are we getting a few questions through? Good. Okay, page down. Criterion one also goes on to require audit programs to ensure that data security is enforced; that the people performing the roles of librarian, linker and analytic have been subjected to police checks; that physical security is in place on the premises where the data is held, with key cards and those sorts of mechanisms — it's not much good having all the sign-in, sign-out controls if someone can walk in and steal the computer the data sits on; that internet gateway security is addressed — is this data accessible from the internet, and if so, what security is in place; and that an information security policy is in place — in the Australian Federal Government that's the Information Security Manual, the ISM. Then there is a set of further criteria. Criterion two covers controls on external data access: not only does the data need to be de-identified, which is the natural product of this process, it also needs to be confidentialised. That means that even though we don't know who someone is, if the data relates to a small country town and the person earns a million dollars a year and has HIV/AIDS, then people in that community could pretty much work out who it is. So there's a process statisticians use to perturb the data to prevent that. Criterion three is availability of appropriate skills: the people doing this work need the experience and expertise to do it. Criterion four is about technical capability, so that the software, hardware and other things are in place to make it happen. Criterion five is the absence of conflicts of interest — again, the forms people fill in. Criterion six is cultural values: that when you were onboarded there was training in place, or records that you'd been through the training, and other evidence of a culture of confidentialising and protecting information. Criterion seven is transparency of operations, which covers things like publishing the details of the integration project on the internet, and the existence of public governance and institutional frameworks — in our example, not surprisingly, we had data management committees and we followed the DAMA-DMBOK wheel as an institutional framework. And that was it.
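As promised above, here is a minimal sketch of the linker and analytic steps from slide eight. It reuses the hypothetical slk() helper from the earlier sketch and assumes the CSV layouts from the separation sketch; all file and column names are illustrative.

```python
import csv
import uuid

# Assumes, from the earlier sketch: slk(given, surname, dob, sex) -> str

def build_link_file(match1_path, match2_path, link_path):
    """Linker role: join two match files on the SLK, emit the link file
    of (random link ID, project ID 1, project ID 2) rows."""
    def keyed(path):
        with open(path, newline="") as f:
            return {slk(r["given_name"], r["surname"], r["dob"], r["sex"]):
                    r["project_id"] for r in csv.DictReader(f)}
    keys1, keys2 = keyed(match1_path), keyed(match2_path)
    with open(link_path, "w", newline="") as f:
        out = csv.DictWriter(f, ["link_id", "project_id_1", "project_id_2"])
        out.writeheader()
        for k, pid1 in keys1.items():
            if k in keys2:  # same SLK in both data sets: treat as one person
                out.writerow({"link_id": uuid.uuid4().hex,
                              "project_id_1": pid1,
                              "project_id_2": keys2[k]})

def consolidate(link_path, analysis1_path, analysis2_path, out_path):
    """Analytic role: merge the two analysis files via the link file,
    dropping the project IDs so only the random link ID remains."""
    def load(path):
        with open(path, newline="") as f:
            return {r.pop("project_id"): r for r in csv.DictReader(f)}
    a1, a2 = load(analysis1_path), load(analysis2_path)
    with open(link_path, newline="") as f:
        links = list(csv.DictReader(f))
    # Assumes the two analysis files use distinct column names.
    rows = [{"link_id": l["link_id"],
             **a1[l["project_id_1"]], **a2[l["project_id_2"]]}
            for l in links
            if l["project_id_1"] in a1 and l["project_id_2"] in a2]
    with open(out_path, "w", newline="") as f:
        out = csv.DictWriter(f, list(rows[0]) if rows else ["link_id"])
        out.writeheader()
        out.writerows(rows)
```

Note that the consolidated file carries only the random link ID: nothing in the analytic area traces back to a source record, which is the whole point of the slide-eight design.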
So that was the essence of the presentation. It is about managing the dangers of consolidating data, making sure controls are in place to protect that data, and I wanted to give you a specific example of how you could go about doing that.

Sure. And we have questions coming in already, certainly about slide number — the infamous slide number eight. Yes, I expected that. Just a reminder to everyone before we get started on the questions: I will be sending a follow-up email with the slides, so you can have a copy of slide number eight, as well as a recording of the webinar. So one questioner has three related questions; let me read all three at once and then we can work through them. First: you showed the flow in one direction, left to right — do you also implement a right-to-left process, so that an authorized person can go back up the flow to see where a problem may have occurred? (And I can read it again, too.) Second: the diagram implies a large number of changes to data during the workflow. Do you attach metadata to the data to document its provenance? The thought being that some chain of custody needs to be documented in some manner. Third: you mentioned folder-level locking — is metadata attached to the folder, and is that the lowest level of tracking that your control process implements?

Yes — great questions. So yes, you're right, the diagram goes left to right, and the questioner has pointed out that when something gets to the analysis stage, something may clearly be wrong. There are trade-offs here. We did consider deleting the intermediary files as things move from the librarian to the linker, and similarly when they move from the linker to the analytic stage. But as the questioner has pointed out, if something is found to be wrong at the analysis stage, you would then have to recreate the whole process. So our procedure was to leave deleting the files as a final step. The files are protected, if you like, by the folder-level security; but at the end of the project those intermediary files do need to be deleted, because if they're just sitting there hanging around, they're vulnerable and a breach could occur. I hope that answers the question sufficiently. The next question was about changes to the data. We don't actually change any of the information about the person in terms of their demographics; we do generate an SLK, a statistical linkage key. This is all done as part of one project, so you run it straight through. We have metadata in the sense of a paper form filled in by each person as they perform each step, with the signature of the manager watching each step, the start time and the end time, and signatures as declarations that this is what occurred. That was our cheap, manual approach, if you like. Obviously you can do something much more sophisticated, with monitoring and logging and data lineage metadata kept in place, but we did it as a paper-based form that went step one, step two, step three, with signatures, dates and times. It's still metadata — it's just on a piece of paper. The third question was about folder-level locking: was that our main security mechanism? The answer is yes. However, I would emphasise that the space this work is carried out in is accessible only to the three roles: the librarian, the linker and the analytic. Only those people can get into that space at all, and then at the folder level, the librarian could read and write in their own space but could only write to the linker folder and the analytic folder. Similarly, the linker could read the linker folder but only write to the analytic folder, and the analytic person could only read within the analytic folder.
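For readers who want to see what that permission model looks like in practice, here is a rough sketch on a POSIX system; the project described here used Windows folder-level locking, so this is only an approximation, and the directory names and owner/group assignments are assumptions. A directory with write and traverse permission but no read permission acts as a drop box: the depositing role can create files there but cannot list or read the folder's contents.

```python
import os
import stat

# Hypothetical folder layout for the three roles. On POSIX, granting a
# directory write+execute but not read lets a role deposit files without
# being able to list what is already there.

os.makedirs("linker_area", exist_ok=True)
os.chmod("linker_area",
         stat.S_IRWXU                     # linker (owner): full access
         | stat.S_IWGRP | stat.S_IXGRP)   # librarian (group): deposit only

os.makedirs("analytic_area", exist_ok=True)
os.chmod("analytic_area",
         stat.S_IRWXU                     # analytic (owner): full access
         | stat.S_IWGRP | stat.S_IXGRP)   # librarian & linker: deposit only
```

The enforcement lives in the operating system rather than in application code, which is the point of the control: even a rogue script run under the librarian's account cannot read the linker's files.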
So obviously you can spend more money and do sophisticated things, as other agencies do, in the area of virtual machines and the like. In this project we were subject to external audits and were deemed satisfactory. So that's my answer to that question.

Fabulous. Metadata has become such a hot topic — we get questions in every webinar, no matter what the topic, related to metadata. Anything you want to expand on there, just on the importance of metadata and the role it played in this project?

It's absolutely crucial to this type of work, because we're subject to continual audits. We are undertaking activities that are very dangerous — consolidating data about citizens — so we are continually audited, and the only way the auditor can be satisfied is if we have the metadata in place to support that yes, we did what is documented here, and it's an approved process. Otherwise, what have you got? You've got this lovely diagram on slide number eight, but if nobody's following it, then it's a complete farce; it's a waste of time.

Sure. Everyone's been quiet today — no additional questions coming in. If anything else comes to mind that you want to know, put it in the Q&A section; we'll just hang out here for a couple of minutes. A lot of our attendees are here at the conference — anything you're excited about seeing, or that you've learned so far at EDW?

Yes, there have been, as always, great sessions, and it's great to be at EDW. I'm particularly interested in creating data models for NoSQL databases, and Dave Wells gave a great tutorial yesterday on that. He had some very clear examples of the syntax you would use for document data stores and how that could be modeled using a particular modeling technique, and I found it really quite useful — especially as somebody who teaches Graeme Simsion's Data Modeling Essentials course in Australia. That course is very rich and does a great job around the relational space, but I'm always keen to hear about extensions or ways we can do this for NoSQL databases. Oh yes, that's marvelous. I also went to Donna Burbank's session on setting up a data management practice, and that was perfect for me just at this point in time, because I'm currently at a different client setting up a data management practice. I drew a lot of comfort from the fact that I was doing the sorts of things Donna was describing, but I also like that I can go back to Sydney, Australia, sit down with my clients, step through some of the slides I thought were pertinent to them, and say: look, that's international best thinking. That sort of stuff is just gold.

I love it. We do have a couple more questions coming in on your topic — and we love Donna too. How long did it take to set up the librarian, linker and analytic areas?
To be honest, that was the easier part of it. I've just clicked to slide number nine, with all the material about physical security. I had to demonstrate to auditors that the physical security logs were all in place, which meant talking to the property security people and saying, I need a list of all the people who have access to this area — and of course the property people were not keen to hand that over — because I needed to show the auditors these controls were in place. Some things were very time-consuming: for the internet gateway security, I had to find out what it was and get the proper documentation to support the auditors in saying yes, this is in place, or this particular data is not accessible from the internet. I also had to demonstrate that our people had the appropriate skills, so I had to get their resumes and put them into a format for the submission. The submission itself might have been 40 pages, but it referred to all these other supporting documents — I don't know why I'm using my hands here, because you can't see them, but I'm showing Shannon two hands about three feet apart — because I had that much material to substantiate that we had all the checks and controls in place. And that took time.

Is there additional documentation available about the various criteria — for example, written policy and procedure documents?

Well, I do have those things, but they're client-specific, so I've generalised as much as I can here. If you go to the nss.gov.au website, you can find submissions made by other agencies, look at those submissions, and look at the criteria that are used. However, they don't publish the actual policies and procedures used by those agencies — I suppose there's a bit of security around that as well.

Sure. That's all the questions we've had so far. Anything else you want to add before we end the session?

No, I think I've performed to excruciating detail on slide number eight. Even if you didn't grasp everything on it, my take-home message is this: if you're in the workplace and somebody says we need to consolidate data from different organisations, immediately raise the warning bell and ask, what are the privacy implications, and what are our controls and procedures around that? And if you're being hassled by a boss to just do it, perhaps you can take a copy of these slides, sit them down, and show them slide number eight — show them the level of sophistication required if you're going to do this in a controlled manner that protects people's privacy. If I've achieved that in this webinar, then I'm really happy.

Well, you're certainly getting some accolades here. I just want to thank you for this great presentation, and thanks to our attendees, as always, for being engaged in everything that we do and asking such great questions. We will sign off here from Enterprise Data World 2016, and we will see you next month for the next webinar. I hope everyone has a great day. Cheers.