So my name's Tom Honeyman. I've had a few job titles in my life, and I'd have to say Solutions Architect is probably the wackiest one I've had yet. I'm the Solutions Architect for the Humanities, Arts, Social Sciences and Indigenous Research Data Commons, which is also the longest title I've ever had. And today I'm going to go a little bit deeper on what a research data commons is. I've been saying that phrase a lot yesterday and today. We've been turning it into an acronym a lot, which I try not to do, I promise, although I will later on. But, you know, what is a research data commons?

Before I go any further though, I would like to acknowledge and celebrate the first Australians on whose traditional lands we meet today, and to pay respect to their Elders past and present. I'd especially like to pay that respect and acknowledgement to the Indigenous people joining us today in the room. And, given this presentation, I'd also like to acknowledge Dylan Sara, the artist responsible for the Indigenous iconography used in this presentation.

Okay, so what is a research data commons? I have a pro-forma standard line that I'm supposed to say, which my organisation gives me: a research data commons brings together people, skills, data and related resources such as storage, compute, software and models to enable researchers to conduct world-class data-intensive research. So that's the company line. I'm just going to go off script from this point onwards. What I like to say is that a research data commons is an exercise in digital placemaking: what we're trying to do in forming the commons is to assemble a bunch of people (pro tip: welcome to the commons, you are in it right now) to define a space (pro tip: you're helping to define that space right now) that meets the needs of the community it's serving. Like a real physical place, it's a pretty amorphous thing. You can have places arranged such that a city has suburbs, you can have places within a city, places within places, places that overlap with other places. And so when we talk about research data commons (I don't know if that's the correct plural; if anyone wants to tell me, I'd love to know), it's more that people can move between spaces, and it's about recognising yourself as being in that place. Somewhat more dryly, when you're in that place and we're talking about a research data commons, we are talking about the ability to access particular services, and I will briefly touch on that today. In fact, that's what you're getting exposed to over these three days. In particular, you're going to get a stream of case studies that will take you through those services, and the skills that are associated with those services as well.

So the Australian Research Data Commons is both the name of the organisation I work for and the place we're trying to build. And, like suburbs in a city, we have specific suburbs that we're trying to build out right now: the People Research Data Commons, the Planet Research Data Commons, and that massive mouthful, the Humanities, Arts, Social Sciences and Indigenous Research Data Commons. Or, if you can forgive me, I'm going to use an acronym from this point onwards and just call it the HASS and I RDC, just so we get through this. With our People Research Data Commons, we're looking at health and medical research.
With our Planet Research Data Commons, we're looking at earth and environmental research. And with the Humanities, Arts, Social Sciences and Indigenous Research Data Commons, that's a rather big family of research concerns, as well as of the kinds of data that are relevant there. Now you as a researcher, if that's what you are, might actually see yourself as existing in more than one of these spaces. I know, having spoken with people yesterday, that there are people applying humanities-type approaches in a health and medical space, for instance. There's no reason you have to locate yourself in only one of these spaces; you can move between these research data commons as well. And in fact, you will have noticed, as you've heard particularly from the Indigenous Data Network and the Language Data Commons folk, and perhaps from the Integrated Research Infrastructure for Social Sciences folk over in the library, that there are suburbs within suburbs, if you like, as we move through this space, because as we drill down there are more specific concerns for researchers as well.

So within the ARDC, and particularly in building out the HASS and Indigenous RDC, we actually have nominated areas. We're not trying to build resources for every single area of the humanities, arts and social sciences, and every type of Indigenous research data community, noting that the data that impacts upon community can be very, very broad. We have these nominated areas. So I'll mention it again, but you've been hearing from the Improving Indigenous Research Capabilities program, which is led by Distinguished Professor Marcia Langton at the University of Melbourne. You've also been hearing from, and you might have heard 'LDaCA' a lot, the Language Data Commons of Australia, which is led by Professor Michael Haugh at UQ. And if you were over in the library yesterday, you would have also heard from the folk at the Integrated Research Infrastructure for Social Sciences.

Now, another area that we've been working in: hands up, who's actually heard of Trove, the National Library's Trove? Fantastic. All right, if you didn't have your hand up there, you are going to find out by the end of the day. We had nominated work around improving Trove for researchers. It's a fantastic resource, if you're the one or two people that didn't put your hand up, and you're going to find out more about it today. The specific programs that have been focusing on Trove include the ARDC Community Data Lab, who are joining the fold today and who you're going to hear from, and we also had a separate project to improve some of the back end and documentation for Trove as well.

For the eagle-eyed amongst you, you might have noticed that I didn't touch upon the creative arts and mediated data. Something that we're really excited about over the coming six months is that we have received a major amount of funding, probably the biggest investment in humanities, arts, social sciences and Indigenous data research infrastructure. So we are at the very beginning of the journey of spinning up, if you like, another suburb within the city, around the creative arts and around what we call mediated data, a title we're going to work on, but which essentially refers to web and social data.
Or, if you like, our digital footprint on the internet and the traces of our consumption of the internet as well, through, for instance, advertising. These are both really exciting new areas, and in fact, for other areas as well, we're in the middle of an extensive co-design process right now, where we're really trying to build out these areas based on the lived experience of researchers. If that's something that appeals to you, if you'd like to be a part of that conversation, we do have an events page where you can see what events we're running, and you'll see in that list that we have several co-design sessions happening right now. If you don't get this down, come and talk to me afterwards, but it's our website slash events. And I noticed in some of the feedback we got at the end of yesterday that there were questions about how you can zero in on the resources that are available to you. If you've not done this already, do visit this researcher page, where you can drill down to the particular RDC that you're interested in. You can sign up for a newsletter. You can also sign up to receive a series of emails that will guide you through; it does take a bit of unpacking to get through all our resources, but that will guide you through the particular stream you might be interested in. And I did also mention that we have a newsletter. You might have ticked a box when you registered for this event that signed you up for it, but if you are not signed up, let me reassure you it comes twice a month; it doesn't come every day. That newsletter is a great way to find out about additional activities, developments and opportunities for you as members of this commons that we're trying to form.

I'm a philologist and work on Buddhist inscriptions. You can Google me if you're at all interested in my research background, but my alter ego today is as Systemik. We develop and support a range of open source digital humanities platforms and specialise in immersive and engaging research experiences. We've been delighted to work with the ARDC over the last year or so on a range of fascinating and, we think, really productive projects. We've worked with them on GHAP, the Gazetteer; on Image Annotation Workbench, which I'll walk through today; and on the Intelligent Archive stylometrics tool, amongst other things. So what I'll walk through today, or in the next 20 minutes or so, is a digital framework for scholarly annotation of images that we call Image Annotation Workbench. Whilst I'm going to be talking about the digital, this is not a technical presentation. Rather, it'll be conceptual, and we'll focus on strategy, models, frameworks and research outcomes. I'll describe some of the workflows during the session, but we have our workshop tomorrow, so there'll be more detail then. The design brief, however, was that we build a platform to meet contemporary expectations for a delightful user experience. Our hope is that with a little familiarisation, any researcher, irrespective of their technical skill, can be productive with minimal support. It's always good to frame any initiative in terms of its value proposition, the why question: why are we doing this? Well, we see three facets. We built IAW for its research affordances, for its application in the curation of images in the GLAM sector, and for its pedagogical potential. Now, there's faint hope of covering even one of those in any detail in 20 minutes.
So I'll limit my remarks today to a brief outline of the platform and its research affordances, and a case study of a pilot research project. I won't even try to cover the curation or the pedagogical aspects. On screen is an abstract of a business case we put together for the platform. I promise this is one of only two blobs of text that I'll subject you to. Yes, sorry for the word 'affordances': what are research affordances? What opportunities does the platform open up for a researcher to produce peer-reviewed research or non-traditional research outputs?

In terms of this abstract of the business case, I'll summarise it. The problem we set about addressing was the institutional access issues and technical constraints which were an obstacle to scholarly analysis of images. So: institutional access and technical constraints. The opportunity we saw was the maturity and growing ubiquity of a thing called IIIF, the International Image Interoperability Framework; I always stumble over that, so I won't say the full name again. So: the growing ubiquity of IIIF in the GLAM sector, and its support for seamless collaboration in the analysis of images across institutions. That's the opportunity we saw. And whilst IIIF is open source, it requires an investment in system development in order to be implemented in an institution or in a project. So the value proposition for IAW is that it's a generic framework, and this is a bit of a mouthful, that can support researchers to immediately apply semantic tags from research-specific taxonomies to images stored across multiple institutional repositories, and to publish those annotations as research outputs. It's a bit of a mouthful, but that's what I hope to unpack in the next five or ten minutes. I will speak briefly in a moment a bit more about IIIF, but for the moment the takeaway is that it's a standard protocol.

We originally built a prototype platform in 2022 to support some ARC and North American funded projects. With generous funding in 2023 from the ARDC, we've been able to develop the platform as part of the Community Data Lab (CDL) program. I think it's important to point out that the ARDC contribution wasn't just financial. Firstly, it took vision on the part of people like Jenny Fewster to cast forward and see this as a worthwhile initiative. Also, the input of the CDL team, and they're here today, people like Peter Sefton, Tom and Owen O'Neill: their input with regard to the digital architecture and long-term sustainability has been most valuable, and the platform is much better for it.

In 2023 we ran a pilot project with the University of Sydney on the ARC-funded Opening Australia's Multilingual Archive project, in which we used image annotation to explore cultural history. We also ran a pilot project with the Power Institute at the University of Sydney to trial the pedagogical potential of image annotation with art history students. And through 2022 and 2023 we ran pilot projects with two international collaborations using image annotation as an art historical and epigraphic practice; epigraphy is the analysis of inscriptions. I'll work through one of these, the Gandharan Buddhas, as a case study, and hopefully it will draw out some of the possibilities of image annotation as a research practice. The Gandharan Buddhas project was a partnership with The Met in New York, the National Gallery of Australia, the Art Gallery of New South Wales, the Power Institute and the Peshawar Museum in Pakistan.
The objective was to trial the platform on a selection of Gandharan statues, reliquaries and friezes, and the outcomes were featured in an Australian art journal last year. The project is now being expanded as Image Gandhara, with collaborating institutions in Australia, Pakistan, Europe and North America pulling together related initiatives around the digital imaging of artefacts, the development of vocabularies, and image annotation as a scholarly and curatorial practice.

I've noted down the core practices we engaged in the project, and you'll get a glance at some of these shortly. Firstly, we provide tools to outline components of images and then label and identify those components in multiple languages. You can add scholarly notes, much like footnotes. You can develop vocabularies and then undertake semantic tagging of those components with those vocabularies. You can do grouping and semantic linking of components, and we then undertook the analysis of the annotations data and the publication of non-traditional research outputs for peer review. I know everybody says this, but we did find the collaboration to be really productive. It brought together specialist researchers and the curatorial expertise of the galleries with digital methods and design. And we found the GLAM institutions were both accommodating and genuinely keen to collaborate with specialist expertise in expanding both the scholarship about their collections and the accessibility of that knowledge.

So, our pilot objectives, and I will demo some things in a moment. One objective was to open out new avenues for cross-disciplinary research and for engagement between art historians and textual specialists. We consulted widely with Gandharan art historians, and the initiative, as I said, has taken on quite some momentum. A second objective was to make these cultural assets more accessible to a range of cohorts, not just specialist scholars: general scholars, students and the general public. A third objective, which is a whole topic in itself and probably best left for another day, is the notion of digital repatriation of these artefacts. It's worth pointing out that one of the items we worked on for the NGA has actually been physically repatriated, returned to Pakistan, and the NGA now retains only the digital trace that we developed for them. It's also worth pointing out that an important aspect of the next phase is to open up the annotation practice to other voices in Pakistan outside of scholarly cohorts.

Image Annotation Workbench supports any number of annotation sets, so you can have different researchers complementing each other, or contesting with each other, on the same image. And IIIF supports the flexible disclosure of those annotations: different cohorts, depending on who they are, can see different sets of annotations. In the left-hand corner you can see an image, and there is the ability to click on any part of that image. What pops up are individual annotations. You can see each one has individual tags from domain-specific taxonomies, in this case a Gandharan art taxonomy. The one clicked on there has tags; you can't really see it, but they include the standing position. And if I clicked on the notes tab in there, you would see a list of notes, and those would be the correlative footnotes. You've got five or six editors over the years who have worked on this particular piece and have comments to make about that particular component. There is a containment hierarchy to the outlines, in that if I clicked on smaller portions of the image, individual annotations would pop up for the hand, the hair, the robe, the crown, et cetera.
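To make that concrete: under the W3C Web Annotation model that IAW and IIIF build on, a single annotation of an image region might look roughly like the following sketch. Every identifier, URL and taxonomy term here is a hypothetical placeholder, not real IAW data.

```python
# A minimal sketch of one image annotation under the W3C Web Annotation model.
# All URLs and IDs below are hypothetical placeholders.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "id": "https://example.org/annotations/1",
    "type": "Annotation",
    "motivation": ["tagging", "commenting"],
    "body": [
        # a semantic tag drawn from a domain-specific taxonomy
        {"type": "SpecificResource",
         "purpose": "tagging",
         "source": "https://example.org/taxonomy/gandhara/standing-position"},
        # a scholarly note, much like a footnote
        {"type": "TextualBody",
         "purpose": "commenting",
         "value": "Compare the drapery treatment on the Peshawar Museum relief."},
    ],
    "target": {
        # the annotated region of the image: x, y, width, height in pixels
        "source": "https://example.org/iiif/statue/canvas/1",
        "selector": {"type": "FragmentSelector",
                     "conformsTo": "http://www.w3.org/TR/media-frags/",
                     "value": "xywh=400,310,220,540"},
    },
}
```

Because annotations are just structured data like this, the flexible disclosure described above amounts to choosing which annotation sets are served to which cohort.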
Okay. I'm just going to... well, one way of doing this might be to somehow take the screen over here. I wonder if anybody has any idea how we... oh, maybe it's under here... how we get out of this mode so I can display the whole screen rather than just the... Yes. Do you want to get out of this one? Yes, I do. Okay. We'll just have to live with it. I do apologise.

So, what I was going to show you but can't, I'll just have to describe, and we'll back this up in the workshop tomorrow. We provide tools to allow you to annotate each part of the image, and we can automatically generate a table of annotations from that image, with the cut-out of each annotated region, the taxonomy terms, and any notes that have been applied to it. So you get a table of those annotations generated for you. You can see there, on the other screen at the bottom right, that we've also set things up so that you can click on a particular word in an inscription and it highlights the corresponding annotations on the image, and indeed highlights them on a 3D model as well. So you've got this linking of annotations between text, image and 3D model. And each of those things can produce peer-reviewed digital publications: from our project we have been able to produce digital journal articles that give an overview of the significance and provenance of the piece, the controversies in current scholarship about it, and then include the image and that table of annotations below it.

Okay, so I'll move on. The research outputs we produced in the pilot include the annotated image as a non-traditional research output in itself. So the actual image that's been annotated is a non-traditional research output. We produced peer-reviewed journal articles, both digital and conventional, and we produced the annotations data as searchable data. With one of our projects we've been able to generate annotations data around petroglyphs, which are carved into rocks, and inscriptions on a range of rocks that line the routes in the Upper Indus Valley in Pakistan. Those are now searchable on a website: you can search through and find every instance of a hunting scene, or every instance of a Buddha image, on rocks at these sites. The other possibility is that you can reuse this annotations data as a training set for artificial intelligence: by outlining each of the components of an image and tagging it with a taxonomy, giving it a label, you have a training set that can be used to train AI to recognise those components in other images.

So, I did threaten to show you one more piece of text. Under pressure for a concise description of what IIIF is, I did what everybody does these days: I asked ChatGPT. And I've got to say, it gave me a summary that would have taken me half a day to write. I'll just read out a little bit of it. IIIF is a set of shared API standards for the rich, high-quality delivery of images online.
It's developed collaboratively by a community of the world's leading libraries, museums, universities, et cetera. IIIF aims to provide researchers, scholars and other users with consistent and robust access to visual materials. So the things to take away are that a core strength of IIIF is its support for interoperability, and that IIIF is being adopted by galleries, libraries and museums around the world. It's not just a set of technical specifications; it's also a community of people working around the world to build tools and share best practices. It's built on open standards, ensuring that it's freely available to anyone. The takeaway from all of that is that we've built IAW to conform to, take advantage of, and contribute to IIIF. And if you're at all interested, you can go to iiif.io, where there's a plethora of introductory resources about the standard and its usage.

So, back to the platform we've built. It's been built with an open source stack, and the salient points are that we're running IAW on a Nectar server. Images can be hosted on that Nectar server, though really only as a workspace or sandpit; images can also be accessed from institutional servers. So IAW sits there grabbing images from institutional servers or from our sandpit. Research outputs can be configured for institutional or publication-specific standards: you can theme them to look like NGA outcomes, or they can be themed for your particular project. This platform is really an exemplar of a digital service pattern that emphasises modularity, reusability and multiple access methods. The three components of the solution are, as I mentioned, the image server that you can use as a sandpit; the IAW server, which is a full API, an application programming interface; and the IAW application, which is the user interface that researchers use. I should probably emphasise again: researchers can interact with the platform either through the user interface, click and drag and so on, or programmatically, through the API.

Foundational to the CDL digital architecture is a comprehensive sustainability strategy. The key components of that, for us, are that it's an open source stack; that it's compliant with the IIIF and Web Annotation standards; and that you can export from IAW all of the data you invest in it as a IIIF manifest. So you're not stuck on our platform: you can do work on our platform, export it as IIIF, and use it with other IIIF tools. That's really critical to sustainability. Other parts of that sustainability strategy are a persistent ID solution, so that if servers move, your links will all still work, and an image archive strategy that looks at long-term sustainability and reproducibility of the research outputs, ultimately getting those images onto things like the Internet Archive so that they'll always be there. And of course we're happy to unpack those strategies in our workshop tomorrow.

So what I'll do now is just a very brief walk-through of some of the foundational workflows. I'm not going to do all sorts of clicky-clicky stuff; I'm just going to show you some screens, and that'll give you the general idea. What you can see on screen is actually Image Annotation Workbench, and what I have selected is the Image Gandhara collection. You can see that there are one, two, three, four image sets there, and I could then click into one of those image sets and start annotating it.
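Since everything in IAW round-trips through IIIF manifests, here's a minimal sketch of what reading one looks like in practice. The manifest URL is a hypothetical placeholder, and the code assumes a IIIF Presentation API v3 manifest; any public v3 manifest should behave the same way.

```python
import requests

# Fetch a IIIF Presentation v3 manifest and list its canvases.
# The URL is a hypothetical placeholder, not a real IAW endpoint.
MANIFEST_URL = "https://example.org/iiif/statue/manifest.json"

manifest = requests.get(MANIFEST_URL, timeout=30).json()

# v3 labels are language maps, e.g. {"en": ["Gandharan Buddha"]}
label = next(iter(manifest.get("label", {}).values()), ["(no label)"])[0]
print("Manifest:", label)

# Each entry in "items" is a canvas: one annotatable image surface.
for canvas in manifest.get("items", []):
    canvas_label = next(iter(canvas.get("label", {}).values()), ["?"])[0]
    print(f"  {canvas['id']}: {canvas_label} "
          f"({canvas.get('width')}x{canvas.get('height')} px)")
```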
So the foundational workflows for this workbench are: creating new research collections; adding images to those collections, manually or by searching other image repositories and pulling images in; annotation, which is outlining components and annotating them with tags, notes and links; sharing collections with colleagues, so you can start a project, build your collection, share it with a colleague, and they can make their own sets of annotations that you can work on together; and then publishing, where you share the annotated image with the rest of the world as a research output.

Again, this will only take a few seconds. You can see these are some of the screens. Creating a collection: you just describe the collection. Adding an image set to the collection: locally, or from a remote image, either by URL or by a IIIF manifest; there's also the ability to search a repository for an image. Once you've created your image set, you can create an annotation set. So there are my annotations, and there are my colleague Mark Allen's annotations on this item. Once I've created my annotation set, I can go in, and you can see the image there with all the little blue lines around it. To do that, you simply use one of the tools to outline a component. What pops up is some simple UI that allows you to put in a tag and a description for the annotation; it pops up a taxonomy of terms that you can pick from to tag that annotation; and you can put in descriptions, footnotes, links, et cetera. Once you've annotated your image, you can share it with another user, and they can do the same thing. And then I can publish it: I can generate a public URL that will be persistent, send it to anyone in the world, and they can review it. Or I can embed the annotated image in a web page.

I did mention before that anything you can do with the platform, you can also do via the API, and in our workshop tomorrow we'll work through this. Don't even try to read that; I'll explain what it is. All of the features available from the UI are also accessible via the API, and we'll generate some sample Jupyter Notebooks tomorrow that call the API and give you a chance to run them yourself. The one here is just a simple query that goes over a collection of images and says: find any annotations that have the keyword 'robe', a tag called 'robe', and bring all those back. It has then started to format those at the bottom, where you can see it's given me the annotations, on different images, that are of a robe. So that's just a really simple demonstration of what you can do.
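In the same spirit, here's a sketch of what such a notebook cell might look like. The endpoint path, parameter names and response shape below are assumptions for illustration, not the documented IAW API; the real calls will be covered in the workshop notebooks.

```python
import requests

# Hypothetical sketch of the "find every annotation tagged 'robe'" query.
# Base URL, path and parameters are illustrative assumptions only.
IAW_BASE = "https://iaw.example.org/api"

resp = requests.get(
    f"{IAW_BASE}/collections/image-gandhara/annotations",
    params={"tag": "robe"},
    timeout=30,
)
resp.raise_for_status()

for ann in resp.json():
    # each hit points back to the image and the outlined region carrying the tag
    print(ann["image"], ann["region"], ann.get("note", ""))
```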
So, to summarise all of this, the value proposition as research infrastructure is, firstly, that it's standards-based: the ubiquity of IIIF supports research across institutional collections. It has wide research application; it spans a range of FoR codes that might make use of it. Tools we've developed in the past were just for a tiny little area of research; this one isn't. There's a very low skill threshold: anybody, within 10 or 15 minutes, can start annotating an image. It supports research collaboration and contestation. We've already integrated with Research Vocabularies Australia, so you can search for and pull in the domain-specific vocabularies for your discipline. It's been integrated with the Public Record Office Victoria to search and retrieve their images, and we can use the search APIs of other galleries, libraries and GLAM institutions in Australia to allow you to search their collections and pull images in. It supports non-traditional research output (NTRO) publishing: you can generate research outputs that support scholarship. It also supports multiple voices and selective disclosure: multiple voices can annotate, and with selective disclosure there's a lot that multiple communities can do on that side. And it has sustainability baked in, in terms of its persistent IDs and archiving strategies.

Cognisant of time, I'll skip through this one quickly. We did a bit of a scan of FoR codes to look at where we see obvious applications for IAW. And I've got to say, for us, having built tools for philologists, where there's 20 of us worldwide who use them, this one's great. It's like a shotgun: there are so many different areas of research that can use this sort of tool. Maybe it's not their main research tool, but it's at least an augment to what they do. Rather than riffing on this now, in our workshop tomorrow we'll canvass some of these research propositions with the participants. In terms of customising it, we're currently designing a plug-in for manuscripts and inscriptions. I showed you the pop-ups that let you tag a part of an image; we're developing a custom plug-in tweaked for textual scholars that lets them put in grammatical information and that sort of thing. We've also already had clamour for the design of a plug-in to allow one to annotate an image with multimedia objects: you create an annotation and then embed a sound file, a video file or another image in there. That clamour has already started for museum collection applications. Lastly, the platform allows theming and customisation of the research outputs. I mentioned before that what this looks like is what we wanted for our project, but you can theme the output for institutional branding or for your project-specific requirements, and you can also customise the annotations and build web applications that govern what people get to see. It's not easy to see, but there's a table of outputs here with a toggle: you can switch between the image view and the table view, and then you get a whole bunch of settings about what you see. That was something we built in a couple of days last week for our application, but I think when you start looking at multiple communities, in terms of governing access to what they should and shouldn't see, there's a lot that can be done to control those things. So thanks very much for your indulgence.

I'm Cain. I'm a software developer at the University of Newcastle. I work on the TLCMap project, where I'm responsible for implementing geospatial analysis functionality as well as place-name recognition and geocoding functionality. But I'm not really going to talk about that today. This is not even really a case study, by the way; it's sort of a mini lecture. I wanted to get people interested in the subject of GIS analysis, and I hope you all learn something interesting. All right, so this is the overview of topics.
The field of GIS analysis is both broad and deep, and there's no way I can cover all of its volume in 20 minutes, but I've narrowed things down to several concepts that I think are most important for an introduction to the subject. First we'll answer the question: what is GIS? Then we'll talk a bit about spatial data types, then give an overview of GIS analysis. Then we'll discuss the difference between descriptive and inferential techniques in GIS analysis, and I'll provide an example of each. Then we'll talk a bit about application domains. I did have a case study video, but it's possible we'll skip that. And then we'll conclude with a summary.

So, what is GIS? GIS stands for geographical information system, and it's any computer system that allows you to store, process, visualise or otherwise perform operations on geospatial data. They essentially provide virtual representations of the Earth in space, and often also in time. Examples include Google Earth, ArcGIS and, of course, GHAP, which I've used as the example picture there. If you don't know what it is, you'll be interested in it, trust me.

There are fundamentally two different types of spatial data: raster data and vector data. Raster data is spatial data represented as pixels or grids; an example is raw satellite imagery. So if you're looking at Google Maps in satellite mode, or Google Earth, you're looking at a raster image. Raster data is often used for the analysis of continuous data: if you want to map temperature over some given area on the surface of the Earth, you would shade the pixels based on the temperature of that area, for a simple example. Vector data, on the other hand, is represented as geometric objects: points, lines and polygons. You could use a point to represent, say, a city, or the recording of a disease case, or anything like that that's discrete; vector data is used for discrete entities. You would use lines to represent railways or roads, to take the two examples that come to mind, and polygons would be used to represent boundaries.

So, what is GIS analysis then? GIS analysis is essentially just geospatial analysis applied to GIS data, and geospatial analysis is a subset of spatial analysis. A lot of the techniques used in geospatial analysis can be applied in all sorts of scenarios; they're not limited to maps of the Earth. Many of them you could apply to, for instance, bacteria colonies on a petri dish. But in the context of GIS analysis, we're specifically talking about the surface of the Earth, or data relative to the surface of the Earth. The analysis is often statistical, but not always: some techniques, such as the algorithms in network analysis that are used to find optimal paths (which your GPS uses, by the way, to navigate you from A to B), are examples of geospatial analysis techniques that aren't strictly statistical. But the vast majority are statistical in some way or another. And artificial intelligence can also be used for GIS analysis, which is very exciting and something I'm invested in: for instance, predicting where natural disasters will occur, or predicting where a cyber attack is coming from.
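The three vector primitives are easy to see in code. A minimal sketch using the Python shapely library; the coordinates are made-up lon/lat pairs, not real features.

```python
from shapely.geometry import Point, LineString, Polygon

# Points, lines and polygons: the three vector data types described above.
city = Point(151.21, -33.87)                         # a discrete entity, e.g. a city
railway = LineString([(151.21, -33.87),              # e.g. a railway between two places
                      (150.89, -34.42)])
boundary = Polygon([(150.5, -34.5), (152.0, -34.5),  # e.g. an administrative boundary
                    (152.0, -33.5), (150.5, -33.5)])

print(boundary.contains(city))   # point-in-polygon test -> True
print(railway.length)            # length in coordinate units (degrees here)
```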
And there are many domains of application, but ones that commonly use GIS analysis are criminology, epidemiology, environmental science and so on, and there are plenty of applications in the humanities as well.

So, what's the difference? There are fundamentally two different types of GIS analysis, just like with traditional statistical analysis: descriptive techniques and inferential techniques. Descriptive techniques involve describing a data set, or a sample from a population, rather than trying to make inferences about the population that the sample comes from, or about the underlying phenomenon being examined. Examples in the context of GIS analysis are point distribution analysis (even just plotting points on a map and describing the distribution counts), and grouping and summarising location data; aggregating median household income by state would be a very simple example. Inferential techniques, on the other hand, involve looking for patterns in a data set and seeing whether we can make generalisations or inferences from the sample about the population it comes from, or about the underlying phenomenon. This includes hypothesis testing, and predictions and forecasting.

So, this is an example of descriptive GIS analysis: point distribution. On the map here we see a distribution of points, and the red one is the geo midpoint, the average point in the data set: the centre of mass, essentially. We can see that the points are clustered on the eastern side of Australia, especially in New South Wales and Victoria. I was going to ask if you recognise the map, or rather the points. They're the place references from 'I've Been Everywhere', the Australian version. So he has indeed not been everywhere, but he got kicked out of the truck before he could finish, so we'll never really know.

What we can do, after we visualise it and calculate the geo midpoint, is calculate displacement stats, and there are two ways of doing that. In this case what I've done is calculated the distances between each pair of points and then found the mean, median, minimum and maximum; but you can also calculate the distance from each point to the geo midpoint, to get some other metrics on the distribution. In this case the mean is 720 kilometres, so on average two points are 720 kilometres apart. The median is 636, which tells us that 50% of the point pairs are within 636 kilometres of each other. The shortest distance between any two points is 2 kilometres, and the maximum is 3,218, and that tells us that the data set is skewed, because the max is so different from the median, mean and min; that's being caused largely by that place way up in the Northern Territory. A quick example of where this could be practical in your own lives: if you're going on a holiday to a city and you know all the attractions and places you want to visit, you could calculate the geo midpoint and look for a hotel around that area. And instead of using displacement, which is the straight-line distance, you can also use actual travel distance, which makes more sense in a lot of cases.
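Here's a minimal sketch of those displacement stats in Python: a naive geo midpoint plus pairwise great-circle distances. The coordinates are made up, not the 'I've Been Everywhere' data.

```python
import math
from itertools import combinations
from statistics import mean, median

# Made-up (lat, lon) points standing in for the song's place references.
places = [(-33.87, 151.21), (-37.81, 144.96), (-27.47, 153.03), (-12.46, 130.84)]

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

# Naive geo midpoint: average the coordinates. Fine for points in one region;
# a robust version averages 3D unit vectors to handle the date line and poles.
midpoint = (mean(p[0] for p in places), mean(p[1] for p in places))

# Pairwise displacement distances, then the summary stats described above.
dists = [haversine_km(p, q) for p, q in combinations(places, 2)]
print(f"midpoint: {midpoint}")
print(f"mean {mean(dists):.0f} km, median {median(dists):.0f} km, "
      f"min {min(dists):.0f} km, max {max(dists):.0f} km")
```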
For an example of inferential GIS analysis, the example I've chosen is Moran's I. It's a measure of spatial autocorrelation, and spatial autocorrelation is a measure of how similar geographical features are to their neighbours. You calculate Moran's I, and the value falls on a continuum between minus 1 and 1, where minus 1 represents perfect dispersion, points being nothing like their neighbours, and 1 represents perfect clustering, points being very similar or identical to their neighbours. You'll see this everywhere in GIS analysis, and in spatial analysis in general. Moran's I is often conducted as part of a hypothesis test, so you have a test statistic and a p-value, and it's actually the spatial equivalent of Pearson's r correlation, for any stats nerds. And this is a conceptual example of spatial autocorrelation: the one on the right is plus 1, perfect positive spatial autocorrelation; the one in the middle has no spatial autocorrelation, so it's completely random; and the one on the left is negative spatial autocorrelation.

The application domains are very vast; you can apply these techniques whenever you have geospatial data. But it's very common in disaster management and response, tracking and predicting natural disasters. There are uses in urban planning: land use analysis, for instance plotting the different parcels of land, and also spatial optimisation for service placement. Then of course crime analysis is a big one; hotspot analysis for assigning police patrols is one example. And then transportation and logistics: planning optimal bus routes, optimising postal deliveries. So if you're wondering why your Amazon packages usually arrive so quickly, it's because they apply these sorts of techniques, like Dijkstra's algorithm and the others I mentioned earlier. That doesn't really explain why some of your packages arrive damaged, though. And of course geospatial analysis, or GIS analysis, whichever way you like to look at it, was very important during COVID, because these techniques were used to track the spread of the disease and to identify hotspots.
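For the statistically inclined, here's a from-scratch sketch of the Moran's I calculation just described, on made-up data with a binary contiguity weights matrix. Real analyses typically use a library such as PySAL's esda, which also gives you the p-value for the hypothesis test.

```python
import numpy as np

# Moran's I: I = (n / S0) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2
# x holds an attribute value per region; W is a symmetric adjacency matrix
# (w_ij = 1 if regions i and j share a border). All values are made up.
x = np.array([10.0, 12.0, 30.0, 33.0])
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

z = x - x.mean()                      # deviations from the mean
n, s0 = len(x), W.sum()               # number of regions, sum of all weights
morans_i = (n / s0) * (z @ W @ z) / (z @ z)
print(f"Moran's I = {morans_i:.3f}")  # near +1: clustered; near -1: dispersed
```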
What time have we got? So, I was going to play a video, but I don't want to take up Tim's time with a video... okay, it's okay. It's an example of GIS technology and analysis used for a humanitarian cause, and it shows the profound impact these tools can have when put into practice. I wanted to show an Australian example, but I couldn't find one.

'In 1988 the global polio situation was desperate: we had approximately 300,000 children being paralysed by polio each year. We've reduced polio incidence by 99%, and there's still the potential for the rest of the world to be reinfected if we don't actually finish. Nigeria has been on a bit of a roller coaster with polio eradication. One of the challenges that we have in northern Nigeria is making sure that we reach almost every child in almost every single settlement, enough so that the polio virus has nowhere left to go. We create maps, a pictorial representation of what you need to cover during the course of a polio campaign. Unfortunately that exercise has never been perfect, and as a result we miss kids. One of the gaps in our micro-planning process is that we don't have enough tools to make sure that we're capturing all of our boundary areas and the settlements that sit between boundaries, because those are the people who are most likely to be missed. What could we learn from mapping to try to improve our standard tools, our paper products, the cartoon maps? Could we use GIS technology to improve the quality of these maps, and then use GPS tracking devices as an additional tool for monitoring what the teams do when they actually head out into the field? The beauty of GIS mapping is that it provides a more accurate representation of exactly what's out there, to help inform the decision-making about where vaccinators go. It makes it possible to think about the distance between these communities, so that you can be efficient about how you deploy your teams. It also captures those settlements in boundary areas that have perhaps been missed, because you've really got to sit together with people and decide who's now going to cover each one of these settlements. GPS tracking is a very simple process: the teams carry an Android phone and return it at the end of their shift, and that will tell you whether the team is moving through the village or not. Immunising kids in front of the village chief's house is not what they're expected to do; they have got to visit every household, because there will be children who will not come to the village chief's house, because their parents want to be convinced or have questions. We're getting to this extra 5 to 10% of settlements that we're identifying now at the local level that weren't actually on the original maps, and in the fight to end polio, all of those communities are important.' Okay, so that's basically it for today.

Well, I'll try to cover a few things in this presentation, but the idea is to introduce one part of a big project called IRIS. I will explain what IRIS is and what GeoSocial is; the idea is to see the motivation, why we wanted to do this, and what the context of this whole idea is. I will also talk about some concepts of spatial data and data integration, and afterwards show a service design and a demonstrator of the idea we want to show today.

When I was studying my bachelor's in history, I started to understand that as societies grow and time passes, they become more and more complex, so we also need to create tools that can help us understand how to solve problems and how society changes. In Australia in particular we face challenges every day, but we don't have enough tools to enable us to understand all these changes in real time. As a solution to this problem, a couple of years ago, together with some universities and the ARDC, the Integrated Research Infrastructure for Social Science (IRIS) project was created, with the idea of addressing the fragmentation of Australia's social science research infrastructure and creating tools to improve and help solve this problem. It has six work packages with different approaches; today we are going to focus on just one of these, called GeoSocial. There's more information on the website, where you can read about the other work packages; we have limited time to get into them here.

GeoSocial in particular tries to address one problem that I think many people here have faced before. You have a very good research idea, you have a hypothesis, and you want to know how you can prove whether it is true, and how you can do the integration needed to test the hypothesis. So we have the research question, we have some data sets, for example longitudinal data sets, and we want to integrate them with some geospatial data, and even some other data sets. But this integration sometimes
requires a lot of work: we need to get into geospatial knowledge, we need to get into programming knowledge, and it can be very overwhelming. So the idea of GeoSocial is to try to standardise the protocols that a lot of researchers are following, and at the same time try to simplify this process to avoid redundancy.

Now the motivation, and I will explain more soon about the actors that create constraints in this problem. We have longitudinal data (I'm going to explain what longitudinal data is in more detail soon): data sets observing the behaviour of people in Australia over a number of years, and we want to integrate this data with geographical data. For example, one data set can have the postcodes of the people, where they live while they are being studied. This geographical data is sensitive, because nobody wants everyone to have the data of where they live. So the data custodians who store these data sets have strong policies to take care of your sensitive data and don't give the postcodes of where you live to just anyone. So if we want to do these integrations, we need to deal with data custodians and with strong standards that protect people's privacy.

When we have this longitudinal data, we want to enrich it. For example, we know where people live, but the context in which people live can say more than the person themselves: the neighbourhood where you grow up has contextual and sociodemographic characteristics that can explain a lot. So the idea is that we can observe different waves or years of the census and try to merge this data, to enrich the longitudinal data and answer questions that we cannot solve with traditional methods. I will show some examples and motivation for how we can do this, but the idea is that we want to enrich data with spatial data to solve problems.

The current landscape is that we have fragmented data, and we don't have good infrastructure around the data. Everyone is doing similar things, but in isolation, without good documentation: we each just write our own scripts to our own requirements. The consequence of this is a lot of duplicated work, a lot of researchers in Australia doing the same work, and this is how we lose a lot of productivity, repeating the same things all the time; it is very time-consuming.

So we identified some personas and users of this tool. We have three target groups: low skill level, mid level and advanced user level. The main difference between these three kinds of users is their knowledge of the geospatial data domain, as well as their coding skills. When we want to merge data, we need to have some knowledge of decisions that can affect the whole research project. In this project we focus in particular on mid-level and advanced users, but in the future we want to focus on low skill levels too; the idea is to create a friendlier interface to reach the low skill level. We also identified the preferred language in these groups. A lot of people are starting to use R, so a couple of years ago we looked at which languages more people are using, and we identified that R is growing in this community, so it's important to get into this language. Why R? Well, there are a couple of reasons. We identified that it's very fast, and while not as good as
Python for machine learning, for this kind of study it's easy and fast. You can adapt it to your necessities and environments, and in particular it has a big community, which we love: we can create libraries, and the community will support them and give feedback. Documentation as well: all the libraries now have to meet strict standards and have very good documentation, which is good because we can avoid the problem of missing documentation, and they have good licensing. This picture shows how the community of R has been growing over the last couple of years on Stack Overflow, the very famous platform.

After this, we identified that we want to create a solution that allows us to enrich data with spatial data, delivered to mid-level and advanced users as R code and scripts with clear documentation and examples, and following the standards established by the data custodians. I'm going to explain what the standards of the data custodians are later, but this is an important theme, because we cannot run everything in the cloud. Why? Because we don't want the private postcode data of all Australians sitting out there somewhere; this is a very important thing, as I'll mention later. The other thing that is important is that we want this to be open access and free access, and able to be built on by the community of Australian researchers. Another requirement is the FAIR principles: we want this to be findable, accessible, interoperable and reusable. And since this GeoSocial work package is one of six work packages within IRIS, we want it to interoperate with the others; for example, if another work package improves the metadata, we can connect GeoSocial with this metadata and enrich the data better.

So the solution, given the constraints, was: we want to create a library that contains the main functions and the main code, and that can perform all the data linkage in the local environment. I will show the solution later, but I want to just introduce that we are going to build a library that can address this problem.

Now the motivation for why we wanted to do this. I want to introduce the difference between cross-sectional data and longitudinal data. When we have a population census, or a consumer expenditure survey, or opinion polls, we randomly choose people and observe them at one moment in time, and we can use these people for one question at a time. In addition, we have more complex data called longitudinal data, where we are following people across different years, and with this we can answer different kinds of questions. The main difference between cross-sectional and longitudinal data, and why we are focused on longitudinal data in this case, is that it lets us approach cause-and-effect questions. It's not perfect: in economics you see a lot of constraints when you try to study cause and effect, and you have a lot of models to complicate this, but longitudinal data is good enough to start to capture how people change their behaviour. And this is very good, because you are observing the same person at different points in time, so you are controlling for the same person. So the idea is that when you use longitudinal data, you can avoid noise, and you can try to answer very important and powerful questions while reducing the noise.
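To illustrate why that matters, here's a tiny sketch of a long-format longitudinal table in Python/pandas, with made-up people, waves and columns: because the same person appears in several waves, you can compute within-person change, which a single cross-sectional snapshot can never give you.

```python
import pandas as pd

# Made-up longitudinal data: each person observed across three waves.
df = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2, 2],
    "wave":      [1, 2, 3, 1, 2, 3],
    "postcode":  ["2000", "2000", "2600", "3000", "3000", "3000"],
    "income":    [52000, 55000, 61000, 48000, 47000, 50000],
})

# Within-person change between waves: only possible because we follow
# the same person over time, controlling for the person themselves.
df = df.sort_values(["person_id", "wave"])
df["income_change"] = df.groupby("person_id")["income"].diff()
print(df)
```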
There are many differences between cross-sectional and longitudinal data, and I just want to mention a couple of them. Cross-sectional data, like surveys, polls and censuses, we use a lot; it's cheap, because we use a small random sample of people, but it has the problem of covering just one point in time. The drawback of longitudinal data is that it's very expensive, because people have to be followed for 10 or 20 years. There is also a very high level of attrition: people are happy to be in the study for the first two or three years, but the study runs for ten, and the other seven introduce a lot of noise. So we need to keep asking the same questions over time: please let me know if you change your postcode, your address, your phone, and everything.

In the Australian context we have a lot of longitudinal surveys. For this study in particular we used different data providers and identified the top five longitudinal surveys in Australia. I don't want to focus on each one of them, but one that will be important for us is the Longitudinal Surveys of Australian Youth, which I will introduce a little later. The surveys carry biases as well, and it's important to be aware of the different types of bias when you are doing research. I just want to touch on this: we need to be aware that people, when they are doing a survey, can lie to us, and they have different reasons to do it. When I report my income, maybe I don't want to say how much my income is, because of privacy or other concerns. So we have different kinds of bias, and different methodologies with which we can improve this. The idea is that in the tool we want to show researchers the biases, so you are aware of how noise can affect your research.

After that, another thing that was important for us was the data custodians. We have different institutions in Australia; this is a definition from the ABS of what a data custodian is and what the roles are. Basically, the data custodians protect the data and the surveys. Some institutions create the surveys and try to publish them, so you need to create a formal request to get access to this data. We tried first with one longitudinal survey and we got rejected, because we were basically trying to build a library around sensitive data. So it's kind of difficult to create this engagement, because the data custodians are of course trying to protect the privacy of the users. In particular, when someone wants to get data, they need to enter into a project agreement and justify why they want to use this data, what the research is, everything very formal. And the responsibilities of the data custodians, as this picture is showing: we have a postcode here, and the data custodian tries to protect it. How? We need to have safe storage; we cannot just store this data anywhere; we need to satisfy certain password and storage policies on the computer and everything. And we also need to satisfy safe transmission of this data: we need a very protected environment for this data, and we need to create this environment. The IRIS project needs to be able to satisfy that as well, so it's another thing the IRIS project is working on, and the idea is to try to connect all these ideas at the end.

So, having set out the constraints and all this information: why do we want to do that enrichment? We have postcodes and some characteristics of the people.
For example, imagine I live at this address, and I want to understand how the physical characteristics of the place can shape my behaviour; we can enrich the data using the physical and also the demographic characteristics. We can do this using geospatial data: basically we have a latitude and longitude, or some conception of where this person is, and using this information we can have very powerful ideas. This is an example of Phillip Island: how you can understand an island, and the population living on that island, to make more powerful decisions.

Another example, and this came up earlier from the previous presenter, is spatial autocorrelation. It's very important because, as Tobler's first law puts it, everything is related to everything else, but near things are more related than distant things. For example, the people who live close to me have similar behaviour, and one example Australians love is house prices: my neighbour increases her price, and I'm lucky, because my price increases as well. This is high spatial autocorrelation: we are at a certain point connected, and my neighbours are affecting me in my decisions. So the idea when we introduce the geospatial notion is that I can introduce into my modelling how this happens in a lot of human behaviour. It happens in politics, it happens in economics, it happens in a lot of behaviour, even in COVID. When you have this correlation, and this is a very good paper, you can identify and play with four ideas, called hotspots, donuts, cold spots and diamonds. For example, this is a paper on COVID: the blue areas are cold spots, which means there are no COVID cases in those areas. So as a policymaker I can ignore the cold spots, because there are no COVID cases, and jump to the hotspots, the red areas, because there are a lot of COVID cases there. But the more interesting ideas we can identify with spatial autocorrelation are the diamond and the donut. What is a diamond? A diamond is a hotspot whose neighbours are cold areas: at that particular point there are a lot of COVID cases, for example, but it is surrounded by cold spots (and a donut is the reverse shape). You can apply this idea to house prices too: if all the prices in your neighbourhood are increasing and you find a property that's cheap, you should buy that house, if it doesn't have a lot of problems, because you've identified something called a diamond, something that runs opposite to the direction of all the surrounding behaviour.

Another use of geospatial data is geographically weighted regression, a kind of regression model where you weight using the geospatial data: you calculate the distance to your neighbours and weight the parameters using this distance. This is very powerful for some cases in crime: you can use population and income, and you can understand how crime and other human phenomena are explained by this behaviour. That idea is supervised learning, where you have to train the models, but you can do unsupervised learning as well: you can use the geographical conditions of people and try to create clusters of groups of these people, purely from the characteristics of where they live. So, for example, you can say: okay, I have all these people here, I have all their postcodes, and I try to find similarities between the people in this room using geospatial data.
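As a sketch of that unsupervised idea: cluster people purely by the contextual characteristics of their location. The features and values below are made up (say, per-postcode median income, population density and a remoteness score), and GeoSocial itself targets R, so this Python version is purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up contextual features per postcode:
# [median income, population density, remoteness score]
features = np.array([
    [85000, 4200, 0.1],   # postcode A
    [83000, 3900, 0.2],   # postcode B
    [41000,  120, 4.5],   # postcode C
    [43000,  150, 4.1],   # postcode D
])

# Standardise so no single variable dominates the distance metric.
z = (features - features.mean(axis=0)) / features.std(axis=0)

# Group locations purely by the characteristics of where people live.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(z)
print(labels)   # e.g. [0 0 1 1]: two clusters of similar neighbourhood contexts
```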
So for example you could say: okay, I have all the people here in this room, I have all their postcodes, and I try to find similarities between the people in this room using geospatial data. Based on that motivation, we tried to create this geosocial tool to simplify the whole process, to empower Australian researchers to use geospatial data and create good projects while avoiding all this time-consuming work. The idea is that people can start to use this in their research without having to worry about all the methodology that lies behind it. To explain some of the complications we hit when building the tool, I want to introduce, very quickly, some ideas that matter. At the start it looked trivial, just merge the data, but we had to be aware of some more methodological things. Let me start with the concept of geospatial data. Geospatial data is a representation of reality. We have the real world, we have this room right now, and this room means something; we want to create a mathematical model that abstracts that system. When I share my location on Google, you can find me at that point in time, even though we get some problems: when I share it with my neighbour, she'll say, oh, you're so far from me, this GPS isn't working. The reason is that we've created a mathematical abstraction of the real world. The idea is to represent something unique in a mathematical model, and we can do this with different approaches; we have lots of parameters, mathematics, GPS. We create a representation of the Earth called the geoid, a 3D representation that has all the latitudes and longitudes, the mountains, everything, but it's very hard to use. So we create an abstraction, an ellipsoid, with some parameters: we set a radius, we state this information, and we simplify the geoid into the ellipsoid, even though we're ignoring some things. For latitude and longitude we can improve the approach with GPS data: we just measure a latitude and longitude and simplify that way. With that simplification in place, we use data like the postcodes and the other information in particular types of data, namely vector maps and raster maps. Vector maps are layers of geometric shapes you can run operations on; raster maps are pictures, like a Google Maps image, made of pixels. In the geosocial tool we focus on vector geospatial data, on shapes, and these shapes come in three types: points, lines and polygons. A point is a latitude and longitude, a line is a street, and a polygon is a suburb, or Australia. Once we have these polygons we can do spatial operations: spatial aggregations, spatial joins. I can ask: where is the nearest neighbourhood, where is the nearest street, how long is this street? We have intersections, and we can have union, difference, centroid and other operations; tomorrow we're going to do this hands-on. We also have aggregations: we can take all the postcodes where people were living and aggregate them to another level, for example to a suburb, or even higher, like state or country.
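The spatial operations just described map directly onto functions in R's sf package. A minimal sketch, with file and column names assumed purely for illustration:

```r
library(sf)

people  <- st_read("respondents.gpkg")   # points: one latitude/longitude per person
suburbs <- st_read("suburbs.gpkg")       # polygons: one shape per suburb

# Spatial join: attach to each person the suburb polygon they fall inside
people_in_suburbs <- st_join(people, suburbs, join = st_within)

# Spatial aggregation: count people per suburb (the postcode-to-suburb roll-up)
per_suburb <- aggregate(people["id"], by = suburbs, FUN = length)

# A few of the other operations mentioned:
st_union(suburbs)                            # dissolve all suburbs into one shape
st_centroid(suburbs)                         # centroid of each polygon
st_intersection(suburbs[1, ], suburbs[2, ])  # overlap of two polygons
```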
One very important thing, though, is that these ellipsoids have different parameters. Different countries and different researchers can use different parameters for the world, and if I use other parameters, my study will be totally different from another one; there's a lot of bias in the space, and if I change one number, the latitudes and longitudes will end up very far from the others. So one important thing is to be aware of the different parameterisations. The European Petroleum Survey Group created a registry called EPSG, which works like a library of systems: you can find the different systems that different countries use, and as researchers we need to be aware of which one is best for us. The most commonly used is WGS84, a standard version of the world that a lot of people agree on, but if I'm doing social research in one particular area and that system's metric properties aren't good for me, I can use a better one. That's just to show some of the complications; at the level of Australia the complications really start. Before, we had the Australian Standard Geographical Classification, used from 1984 through 2006, and in 2011 the ABS decided to create a better system, the Australian Statistical Geography Standard, where they changed all the definitions. (I think there's a typo beneath the titles of the table: the Statistical Areas are 2011 and the other ones are Statistical Divisions; sorry for the typo, I'll fix it.) But the idea is: this was before, and this is now. In 2006 we had these definitions, we had something called Census Collection Districts; then they started creating something called Mesh Blocks. So as a social researcher, if my research started back in 1984 and I want to follow people for 10 or 20 years, but now the definition of where these people live has changed, how do I deal with that? This was one of the most difficult things we faced in this project. Mesh Blocks are units, little blocks where you live, and above them there are other levels: Statistical Areas Levels 1, 2, 3 and 4, then states and territories. Here are some examples of what one of these areas looks like. All of this is maintained by the ABS, but there is also information not maintained by the ABS, the non-ABS structures, like the postcode we were talking about. In surveys, and in general, people know their postcode, 3000 or 2000; I don't know which SA3 or SA2 I'm in. So we face another difficulty: how can we translate our postcodes, which are a non-ABS structure? The underlying reason is that postcodes are created by Australia Post, and if you want access, Australia Post charges you for that information, and there's no clean correspondence with the statistical data from the ABS. So the question is how we can connect these non-ABS structures with the ABS structures. One thing we found, and it's very important, is called concordance and correspondence. It lets us work in two dimensions. The first is from postcodes to SA3s, states, or other ABS structures within the same period of time: for example, we can take today's 2024 postcodes and map them onto the 2021 ABS geography. But concordance also lets us play across time. Australia is growing a lot, and the postcodes created in 2011 are not good enough for 2021, so sometimes modifications are made to the current postcodes and shapes, and they alter the shape of the area.
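As a small illustration of working with EPSG codes in R's sf package: the EPSG numbers below are real registry entries, while the layer name is an assumption for the sketch.

```r
library(sf)

pts <- st_read("study_points.gpkg")   # hypothetical layer

st_crs(pts)                           # inspect the layer's current CRS / EPSG code

pts_wgs84 <- st_transform(pts, 4326)  # EPSG:4326 = WGS84, the common global standard
pts_gda   <- st_transform(pts, 7844)  # EPSG:7844 = GDA2020, an Australian datum

# All layers must share one system before joins or distance calculations,
# otherwise the same place can end up with very different coordinates.
```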
That's a difficult thing when you're doing a longitudinal social research study, because you start out living in one area and then the area changes, so we need a translation across time to normalise everything onto the same dimension. We can do this with concordance and correspondence, and the idea the ABS came up with is called a population-weighted correspondence. Basically, imagine a place with 40 people, and in 2016 it was decided to split that area into two. How do I map the previous area onto the new ones? I use the population-weighted correspondence, apportioning by the population in each area. For example, in one of the 2016 areas we have 28 people, and before there were 40, which means 70% now maps to that new area; in the other case, B, we have 12, so that area represents only 30%. But this has a lot of problems inside the regional areas. Australia is a big country, and in some areas a tiny settlement maps to an entire statistical area, so the correspondence is sometimes unclear, because you have a small concentration of people living in a big area, and there's bias there. Here's what it looks like when you map, for example, postcodes to statistical areas in 2011: you can see how one space is approximated from the others. This approximation, the concordance, is better in the cities; the system works very well in the big cities, but in all the yellow areas of Australia we have a lot of problems because the population is very sparse. The ABS also creates some indicators, which we're trying to surface in the tool to communicate to social scientists the quality of their data. This is called a quality indicator, and it comes as different ratios: at 0.9, for example, we can say you can continue with your research, because the mapping is good enough to use; sometimes it's 0.75, which says you need to be careful doing this linkage, because you could be creating a misinterpretation. Sorry, this is all quite dry, but the point of introducing it is that our tool reports these metrics, so when we do the merge and run the code, you have transparency about what is happening behind the tool; otherwise you could complain, why is it telling me this, you never told me this before. We want the tool to give that transparency to all its users. Having introduced all of this: we can do a lot of data integrations. We can do spatial integrations, aggregating the data to different levels; temporal integrations, merging databases from different periods of time; and spatio-temporal integrations, spatial and temporal at the same time, where we try to merge everything. Here are some examples. If I have a longitudinal survey where I'm observing people for 10 years, and I want to link it with the last three censuses, I need to understand how the shapes are changing, how the postcodes are changing, and how to adapt the data. We have more complicated cases too, where we need to work out how to normalise to the same area first, and after that do the merge with the other database.
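Here is a minimal sketch in R of how a population-weighted correspondence like the 40 into 28/12 split above can be applied to survey counts. The table layout is an assumption for illustration, though the ABS publishes its correspondences in a similar long format.

```r
library(dplyr)

# Old area "A" (population 40) was split into "A1" and "A2" in the newer
# classification; 28 of the 40 people fall in A1 and 12 in A2.
correspondence <- data.frame(
  old_area = c("A", "A"),
  new_area = c("A1", "A2"),
  weight   = c(28 / 40, 12 / 40)   # 0.7 and 0.3
)

survey <- data.frame(old_area = "A", respondents = 200)

# Re-express the old-area counts on the new boundaries
survey |>
  inner_join(correspondence, by = "old_area") |>
  mutate(respondents_new = respondents * weight)
#>   old_area respondents new_area weight respondents_new
#> 1        A         200       A1    0.7             140
#> 2        A         200       A2    0.3              60
```

The quality indicators mentioned above essentially report how trustworthy those weights are for a given pair of geographies.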
When we're using spatial data we have two shapes, one for each dataset, and they need to be in the same coordinate reference system to allow the merge. There are other considerations when we're doing research, too. We need to be aware of causality and temporality: everything can happen at different periods of time, and even if we're observing two phenomena at the same time, there can be a lag between them, so we need to be aware of how lags can affect the analysis. The other consideration is the spatial dimension I mentioned before: for example, here are different representations of America under different EPSG systems; the shape of the country is altered based on what your needs are. Ideally, when you're doing spatial work, everything should be in the same system; otherwise you have a bias that changes things, as I mentioned. One more thing I want to introduce is semantics. The censuses have been changing their definitions over time. One example: the 2011 census had certain levels of income, and in 2016 more levels were created. How do you translate from one year to another? Another example is the level of education: there were certain certificates in 2011 and different ones in 2016. So when we're trying to observe people for 10 years and I want to use the census variables, I also need to normalise those variables so I don't compare two different concepts. Semantics was so important in this project that there is another work package focusing on how to improve the semantics, and we receive that information from them. There are also limitations we need to be aware of when using spatial data. Sometimes we want to get more into the detail; a lot of social researchers want to understand why these people, or this person's house, have these characteristics, to go right down to the detail. But sometimes we can't, because of spatial aggregation: we want to protect privacy, so we need to use a higher level, for example statistical areas or state-level aggregated data. We also have measurement error: some people lie, or there are errors in the data they give us, so we need to be aware of this, and aware too of the assumptions we make about people. Remember that when you say, these people have this behaviour, that's a hypothesis, so we need to understand the context behind those assumptions. And a final limitation: this can sometimes be very expensive in computing capacity. After this introduction, I want to show how we created the tool. We started with these ideas, understood all the things we wanted to do, and created the solution using a five-stage process: we empathise with the people first, to understand who our clients are in a certain sense; we work out the requirements for the solution; we create ideas and prototypes; and we test, looping a couple of times to improve the solution. This is the geosocial service design we created. We receive inputs from the different work packages of IRISS: for example, information about vocabularies from work package two, and some curated data from work package six. We also receive inputs from the longitudinal data surveys we want to enrich, and information from the censuses we want to enrich with, and we can add complementary data.
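As a sketch of the semantic harmonisation problem, here is one way in R to recode a variable whose categories changed between censuses onto a common, coarser scale; the income bands here are invented for illustration, not the real census categories.

```r
library(dplyr)

# Map the finer 2016 income bands onto coarser common bands that the 2011
# categories can also be expressed in, so the two years become comparable.
income_map_2016 <- data.frame(
  income_2016   = c("$0-$299", "$300-$599", "$600-$999", "$1000-$1499", "$1500+"),
  income_common = c("$0-$599", "$0-$599",   "$600-$999", "$1000+",      "$1000+")
)

respondents_2016 <- data.frame(income_2016 = c("$0-$299", "$1500+"))

# Recode before comparing across census years
respondents_2016 |>
  left_join(income_map_2016, by = "income_2016")
```

The cost of this harmonisation is resolution: you can only compare at the coarsest shared granularity, which is exactly the trade-off the talk describes.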
So the solution consists of a toolbox that is mainly an R library. The R library contains a main script that lets you run all the functions in the code, driven by a parameter file. The parameter file defines where the longitudinal survey is, how many years you want to process, which variables you're interested in, and all the other parameters for the enrichment process. All of this creates linked data, so the output of the solution is the already-enriched data, which you can use to run regressions or whatever kind of analysis; you'll also have a log report explaining how each step was carried out, and a PDF report that lets you understand how each chunk of the code ran, so you have documentation, plus well-developed training material. The question is: why an R library? The reason is that we cannot run this in a cloud environment or in a browser, because that would mean uploading the data to a server and breaching some of the data custodians' policies. So all of this part, which we call the work environment, is the user's responsibility to maintain, and it has to follow all the protocols and standards of the data custodians. Now, this sounds crazy, because how do you get an R library into the hands of ordinary users? So we created a user interface that simplifies the process and tailors all the parameters for the users. Basically, the user interface generates the whole toolbox and all the parameters; you download it, run the code on your own computer, and you get the outputs in R and in Stata, which a lot of people use. We created the onboarding first. This is the theoretical design of how it will look, and I'm going to show the demonstration of how we did the process. The idea is that we want to target two kinds of users: intermediate users and advanced users. The advanced user is the person who plays with R, has a lot of knowledge, and maybe wants to modify some functions or even create their own. They get an SDK-style package with all the R library's functions and definitions, so they can create their own pipeline: I want to make some modifications, I want to customise all the libraries, so they get one kind of tailoring. Then there are the intermediate users: I just want to do this data linkage easily, so you can just click, click, continue, and trust the results we're creating. So this is a little concept, the overall idea of what the interface looks like: the user enters the website.
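To give a feel for the parameter file mentioned above, here is a purely hypothetical sketch written as an R list; none of these field names come from the actual tool, they just mirror the kinds of settings the talk describes (survey location, years, variables, target geography, output formats).

```r
# Hypothetical parameter file for the enrichment pipeline
params <- list(
  survey_path    = "data/longitudinal_survey.csv",  # where the survey lives
  years          = 2011:2021,                       # observation window
  variables      = c("income", "education"),        # survey variables to enrich
  census_years   = c(2011, 2016, 2021),             # censuses to link against
  geography      = "SA2",                           # target ABS area level
  output_formats = c("rds", "dta")                  # R and Stata outputs
)
```

Keeping every choice in one declarative file like this is what lets the main script log and report each step, which is the transparency goal described earlier.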