So, first, a big thank you to Zürnke and Martin from Akasha who invited me here. I'd like to talk about digital content-based identification today, and especially similarity hashing. First, a little background about me: I'm not from the scientific community, and this is actually the first time we present this work to a scientific audience, so I'm very eager to hear what people think about it. I'm an entrepreneur focused on media technologies, founder and CEO of Kraft, a very small company from Freiburg; we do e-book production and distribution for trade publishing. I'm also a techno-creative software developer: basically I started on the business side and moved more into engineering in later years. And I'm a co-initiator of the Content Blockchain project and the ISCC, which is what we call the International Standard Content Code. A short history of where all this comes from: it started in 2016, when we got funding for a project to find out what blockchain and these new environments and technologies would bring to journalism and the content industry as a whole. We ran a one-and-a-half-year project with quite a few developers working hands-on. The basic idea was that the internet is moving from a communication and information-sharing platform to a platform where you can move value natively. The work was quite extensive and very practical: we took what was out there, looked at the very different blockchain platforms available at that time, worked together with journalists and people from that community, and built prototypes, simple little applications. We launched a custom small blockchain based on the Bitcoin source code, built a wallet, tried out licensing, and built some developer libraries.
It was all prototypes. One thing that came out of all this research is the ISCC identifier, which I will be talking about today. The major outcome of getting our hands dirty there was this: the larger content community, and you scientists are basically part of that community, should develop open standards, technologies, and applications that establish content as the subject of transactions on blockchains. The name of the project was Content Blockchain, and the first thing you have to tell people is that you won't put content on a blockchain, because a blockchain is really not the technology to store content. This is the big picture we arrived at by the end of the project. At the bottom we have the blockchain protocol layer and services, which we currently see as public infrastructure, like what the internet gave us, but for transferring value trustlessly, plus whatever we need on that layer to make the content industry work. In this new environment we will need new tools for the problems we want to solve, such as attribution, smart licenses, and especially content identification, and only above that come the new applications. So with this work we sit between the infrastructure layer and the application layer. Of course we looked at what already exists: if content is to be the subject of our transactions, how do we reference it? There are many content identifiers out there, but it turns out they don't really work. Say you go on Twitter and find an image you want to use in your journal, or wherever: what is the identifier to reference this image? The existing standard identifiers are usually centrally issued, so you have to send a fax somewhere and register a number.
They are often over-specialized for particular communities, they are curated by humans, they mostly have no cryptographic features, they have high management costs and high barriers of entry, and they are really not made for the blockchain world. The goals for the identifier we have been working on are: decentralized issuance, so you don't have to register them anywhere; generic content identification, not just for one sector; support for algorithmic deduplication and proof of data possession; lower management costs and lower barriers of entry; and a design for the blockchain ecosystem. So: decentralized, content-based identification. The first thing to understand, which many people really don't get, is that in a multi-sided ecosystem anybody may have a legitimate interest in generating, looking up, or registering an identifier for some digital content, because it is a means of communication. Authorship or copyright should not be a requirement for getting an identifier. The identifier might be used to communicate authorship and copyright, but it is not a requirement to get one. And the authoritative linking of identifier and content, what we call the binding between the two, can of course be done by algorithm. We know that. That is one precondition. You have seen this slide multiple times today: hashing is basically a very simple thing. You take data of arbitrary length, put it into a function, and get a short, fixed-length output. It is deterministic, so every time you put in the same data you get the same result, and you cannot recover the original data from the hash value. This is the principle of all hash functions, but there are very different kinds of them. Hash values are natural identifiers for data, widely used in IT systems: databases, file systems, version control systems, network protocols, and cryptography.
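The hashing principle just described can be sketched in a few lines of Python. This is an illustrative sketch using SHA-256 from the standard library, not the ISCC code itself:

```python
import hashlib

def digest(data: bytes) -> str:
    """Return a fixed-length hex digest for arbitrary-length input."""
    return hashlib.sha256(data).hexdigest()

# Deterministic: the same input always yields the same hash value.
assert digest(b"some content") == digest(b"some content")

# Fixed length: a one-byte input and a megabyte input both map to
# 256 bits (64 hex characters).
assert len(digest(b"x")) == len(digest(b"y" * 1_000_000)) == 64
```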
The benefits of hashing are many. Hashes are used for verifying the integrity of files, for password protection, for comparing files for equality, and for generating pseudo-random bits, and they are widely used in file and data identifiers, for example in the Git version control system and in IPFS. A hash is even used as a universal clock, because proof of work is based on it. We also have non-cryptographic hash functions, like checksums and hash-table functions, used for data deduplication and similarity measurement. There is a very big difference, because cryptographic hash functions are correlation-resistant: if you change anything in the data, the hash you get is not correlated with the hash of the unchanged data. That is a deliberate feature of cryptographic hash functions, but there are also hash functions that do quite the opposite. Another point we arrived at: if you look at the identifier systems out there, there is a lot of confusion about what is actually being identified. So we approached it in a more philosophical manner, asked what can be identified in the area of digital content, and found basically six layers of identification. On the first, most abstract layer, we can identify any collection of information; it could be a journal that we want to identify together with all its issues. The second, more concrete layer is where we try to identify meaning. This is a very interesting part, because science has made major leaps here. For example, we can today identify meaning with machine learning systems: there are cross-lingual sentence embeddings by now, so we can have an identifier for a sentence that is the same for that sentence in different languages. Say you convey the same meaning in ten different languages, but you can calculate the same identifier, or the same vector, from each of them. That is what we have at layer two.
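The correlation resistance just mentioned can be made concrete: changing a single character flips roughly half of the output bits of a cryptographic hash, so the two digests are statistically unrelated. A small sketch, assuming nothing beyond the standard library:

```python
import hashlib

def bit_diff(a: bytes, b: bytes) -> int:
    """Count differing bits between the SHA-256 digests of two inputs."""
    ha = int.from_bytes(hashlib.sha256(a).digest(), "big")
    hb = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(ha ^ hb).count("1")

# Identical inputs: identical digests, zero differing bits.
assert bit_diff(b"the quick brown fox", b"the quick brown fox") == 0

# One changed character: close to half of the 256 output bits flip,
# so the new digest carries no information about the old one.
d = bit_diff(b"the quick brown fox", b"the quick brown fax")
assert 64 < d < 192  # statistically centered on 128
```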
On layer three we talk about the generic manifestation. Let's assume you have a Word and a PDF document; you extract the plain text, and that plain text is the generic manifestation. It is independent of any encoding format, a very clean and natural way of representing the data. You have the same for images, where in the end you have pixels at coordinates, which you can represent in a quite abstract, generic way. Then we have media-specific manifestations, which are encodings, like a JPEG versus a PNG. Then we have the exact representation, which is bit-by-bit identity, and finally the individual copy, which is basically what you have in the physical world: one specific book on your shelf with your notes in it. This is what we call the individual copy; it does not exist in the digital world, but it comes into play with the new blockchain-based technologies. So the ISCC is a proposal for a modern and open content-based identification system. It is universal across different types of media, the generic media types: text, image, audio, video. It is a lightweight, multifaceted fingerprint designed for digitally encoded content. It is applicable across sectors: journalism, book and academic publishing, the music industry. And the goal is, again, to establish content as a subject of transactions in decentralized and networked environments. So, I guess I switched it off. Yeah, well, okay. It is designed for the management of general, granular, and dynamic content. It supports content clustering, similarity detection, and deduplication, and it is decentralized. So here is what it looks like. What you see here is one ISCC code. You just drop in some file, and what you get out is this code. The different components can be used separately as identifiers, but they are more meaningful if you use them together.
The first component is generated from metadata, very basic metadata like the title of something. The content code is specific to the media type, so it is for text or image or audio. We have a data code that is based on the raw, undecoded data. And the last component is basically part of a cryptographic hash. So without further ado, I'd like to give a short demonstration. I hope this will work, will it? Oh, okay. It should be public. Yeah. Okay. Then I'll just tell you what you would see. You have a JPEG image and a PNG image, which for us have the same content; for a machine they are very different content, because they are encoded differently. You drop them in, and you get two identifiers where the first two components are identical for the JPEG and the PNG and the last two components are different. We have the same here. Ah, maybe this one works. Yeah. So we do this here with an EPUB file and a PDF file. You drop it... no. As you can see, the codes are created from totally different file formats, EPUB and PDF, and you can match them just by the identifiers themselves. And even if these identifiers came out different, you could measure the similarity between them, because each of the components is 64 bits: basically a 64-dimensional vector that places the syntactic structure of the content into a space of 64 dimensions. So, well, okay. What does my time say? Okay. I will briefly go through the separate components. The first component is the most abstract. It is seeded from metadata. Now, because this is a universal identifier, we cannot demand that you put in, say, the author; we cannot put any requirements on the metadata, because, for example, who is the author of a movie? It might be hard to say who the creators are.
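The 64-bit components can be compared bit-wise, as described: the fraction of matching bits between two components gives a similarity measure, where identical codes score 1.0 and unrelated codes hover around 0.5. A minimal sketch with hypothetical component values:

```python
def component_similarity(code_a: int, code_b: int, bits: int = 64) -> float:
    """Fraction of matching bits between two 64-bit ISCC-style components
    (1.0 = identical, ~0.5 = statistically unrelated)."""
    matching = bits - bin(code_a ^ code_b).count("1")
    return matching / bits

# Hypothetical component values for illustration.
a = 0b1011_0010_1110_0001 << 48
b = a ^ 0b1  # differs in exactly one bit

assert component_similarity(a, a) == 1.0
assert component_similarity(a, b) == 63 / 64  # near-duplicate content
```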
So we basically just use the title plus anything else you want; it might be industry-specific, but ideally you only use the title, and for scientific usage it could be TeX or something like that, in a normalized form, that you put in there. We use this data to generate the most abstract, grouping part of the identifier. You have control over what you put there, and if you collide with some other registered identifier, you just put in some more information and get a separate, collision-free identifier. This metadata is then frozen so the code can be reproduced. Of course, metadata may change over time; that is a separate kind of metadata that you attach to the ISCC. So we distinguish between seeded metadata and floating metadata. The content identifier is media-type specific: you always identify image data or text data or video data or audio data. We also have a mixed version, where you can take different assets bundled in a multimedia document and create a compiled ISCC of mixed type. The point here is that if you look at the data and apply cryptographic hashing, the results will be very different, while with the content ID they collapse to the same code, based on the syntactic structure of the content itself. We have this for text and for image, and we already have it roughly for audio, but that is still work in progress, as is video. Then the data ID uses a specific technique for measuring data similarity called content-defined chunking, or shift-resistant chunking. Normally, if you cut data into pieces of a regular size and then insert something somewhere, all the following cut points shift. Content-defined chunking is a trick for finding cut points that are roughly a target length apart, but not always exactly the same length, and that are defined by their surrounding data. So if you change some data here, all the other chunks stay the same.
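Content-defined chunking as just described can be sketched as follows. This is an illustrative toy, not the ISCC algorithm: the window size, hash function, and cut mask are arbitrary choices here. It shows the key property that after a local edit, later cut points re-synchronize and most chunks stay identical:

```python
import hashlib

def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Cut wherever a hash of the trailing byte window matches a bit
    pattern, so cut points are defined by the surrounding content,
    not by fixed offsets."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        if hashlib.sha1(data[i - window:i]).digest()[0] & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

# Deterministic pseudo-random payload (hypothetical test data).
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(100))
edited = original[:100] + b"INSERTED" + original[100:]

a, b = set(cdc_chunks(original)), set(cdc_chunks(edited))
# Only the chunk(s) around the insertion change; the rest re-synchronize.
assert len(a & b) > len(a) - 6
```

With fixed-size chunking, the same insertion would shift every later cut point and change almost every chunk; here the damage stays local, which is what makes the technique useful for deduplication and data similarity.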
So it's used in deduplication systems and the like, and we use it here for data similarity. The last component, as I said, is a cryptographic hash. It is basically SHA-256, actually double SHA-256 in a Merkle tree; all in all this is the Bitcoin block structure. We create a top hash, a Merkle root, from which we take only the front part. If you really want secure, provable integrity, we also provide the full 256-bit hash in the metadata. The idea is that I can give you an ISCC ID and a part of my data, and I can prove that this data is part of what is identified by the code. It is proof of containment. So this is an overview of the process by which the ISCC is created. It might look complicated, but I have actually coded this in 500 lines of Python. It is not that complicated; it is just functions, no classes, very simple, and available as open source. It is not a heavyweight content identification system like, say, YouTube's Content ID; it is more like a lightweight fingerprint. But the idea of standardizing a fingerprint and using it not merely as a fingerprint but as an identifier is, I think, an immensely powerful one, and I'd like to see it happen. Just as an example, I have been looking at the data on Unpaywall. They have 25 million records of open-access titles indexed by DOI, which you scientists all know, so I don't have to explain it, and I did some analysis on this. It turns out a DOI can point to very different documents, which are mostly similar to each other, but not always. For this DOI, we have four repositories where we can find open-access versions; I have linked them here. These are different versions and editions of roughly the same document. And in the ISCC you can see they match up here; they mostly match up here, but not totally.
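The double-SHA-256 Merkle tree mentioned above can be sketched like this. It is a simplified illustration of the construction, not the normative ISCC tree: chunk hashes are folded pairwise, duplicating the last node on odd levels as in the Bitcoin block structure, until a single top hash remains:

```python
import hashlib

def dsha(b: bytes) -> bytes:
    """Double SHA-256, as used in the Bitcoin block structure."""
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def merkle_root(leaves: list) -> bytes:
    """Fold leaf hashes pairwise up to a single top hash (Merkle root)."""
    level = [dsha(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:               # duplicate last node on odd levels
            level.append(level[-1])
        level = [dsha(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

chunks = [b"chunk-a", b"chunk-b", b"chunk-c"]  # hypothetical data chunks
root = merkle_root(chunks)

assert len(root) == 32                    # full 256-bit top hash
assert merkle_root(chunks) == root        # deterministic
assert merkle_root(chunks[:1]) != root    # any change alters the root
```

The proof-of-containment idea follows from this structure: revealing one chunk plus the sibling hashes along its path lets anyone recompute the root and verify the chunk belongs to the identified data.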
And you can measure the similarity of the text between them just by comparing the identifiers. So the ISCC codes actually create an emergent overlay structure of content relations, and you can infer many things by comparing two ISCCs with each other. To show the difference once again: what we currently have are random UUIDs, used in systems that just generate some ID, where there is no link between the content and the ID; there always has to be an authority that says this is the file, or the data, that belongs to this ID, and we have to believe them. The other option is SHA-256 or other cryptographic hash functions, where you don't have to trust anybody. It's very simple: you put in the data and you get the ID, which is very long, of course, to be secure. But it really is secure; you cannot tamper with the data, and you will always get the same ID. But content is dynamic and it changes, and we lose the whole history of the data if we only use a cryptographic hash. So the ISCC comes in as a multifaceted fingerprint generated from the content itself, and then comes the blockchain part. This long form is ugly and not for humans; for citing, for example, we would like to have something short. We can have something this short that is resolved by a blockchain to the long version. We call this the short code; it is in the making. It basically takes the components and collapses them into a shorter variant, and in the header we say: this is a short ID, registered on Bitcoin or on whatever public blockchain, so this part is a pointer to some public blockchain. This part clusters the data. And here we have a counter, which we count up to disambiguate between different entries.
So we have a completely decentralized registry of short, globally unique, persistent, resolvable, owned, verifiable, and authenticated IDs, which could even be used for a URL shortener or whatever. So yeah, this is what I'm working on. Actually, DIN, the German standardization organization, liked the idea, and they put it up for proposal at ISO to make an international standard out of it. We are at stage two of about, I don't know, 49 steps through hell. But I'm very grateful for this, because it brings discussions with the existing community of standard identifiers. It's a hard road, though. Well, that's about it. The digital reality is that there is too much granular content to manually assign and track identifiers. The good news: all your content already has an ISCC; it just needs to be extracted. Come join us. Everything is open source, and of course we are looking for contributors, donations, and funding; currently it is funded only by my passion. So thank you. Any questions?

Thank you very much. And thanks to Martin for bringing you here. This is really interesting and really cool new stuff. But is it too cool and new, or are there other people working on similar things? We understand roughly what you're doing there, but we don't have an overview of the field, right? What's happening there?

OK, so actually it is absolutely not cool and absolutely not new, because all the technologies used here have been widely known for many, many years. What is cool and new, I think, is the combination: using these technologies in a standard way, because this content similarity, deduplication, matching, and vectorizing stuff is used in many systems, but only in closed systems, not for interoperability. I think that is the new part of it. Are there any other questions?

OK, I have one more too. Thank you for your presentation. It's really, really interesting, and I see there are a lot of applications for such a thing.
And I'm just wondering what the technical limitations are. For example, if you compare two documents, one PDF and one EPUB, and let's say a paragraph is missing in one, will that lead to a dramatic change of the hash or not?

If you change a paragraph? Just completely remove one paragraph. Yeah. So what we have there is basically an estimated similarity, encoded by a technique called MinHash. It always depends on what percentage of the document you change, relative to the whole. If you have a million paragraphs, or let's say 200 paragraphs, and you remove one, they will still get the same ID. If you remove two paragraphs, the codes might start diverging, but you can still measure that they are close. The sensitivity is always relative to the whole data, so it is meant only for comparing near-duplicates.

I see. Yeah. So, from another perspective, what if I remove all the articles, let's say every "the" in the document? We do that for you. Yeah? No, we don't do that. But yes, that would definitely make a difference. If you remove large parts of the content, you will find that the codes don't match up.

And you also have a hash of the meaning, which is also very interesting. Well, you saw the layers I was showing. For the second layer, semantic meaning, we don't have a component yet. But that would be something we could do, and it would really not be sensitive to changes like that, because if the meaning is the same, the ID would still be in the same space.

OK, thank you. And you mentioned the media-type-dependent ID. So in the case of images, for example, do you bring the image into a standard form first and then do the hashing?
Yeah. Currently, with the content ID, we are at the syntactic level, not the semantic level. We take the image, make a very small version of it, and convert it to grayscale. It's called a perceptual hash: we measure the difference from one pixel to the next, over 8 by 8 pixels, and this gives us the fingerprint.

How would you handle watermarks? Watermarks? We don't care; that's also covered, you can use it as is. Now, actually, there is a strange thing we came up with in regard to watermarks. If you have a cryptographic hash and you embed it into your document, the hash changes. You cannot embed the hash in the document, because embedding it changes it, which is a problem; you can only deliver the hash separately from the file. With this one, at least for the first three parts, you can embed them, then recalculate; you will get a slightly different code; then you embed that again, recursively, and eventually you find an ISCC code that you can embed which is the same as the one you calculate, which is a little bit weird. I'm not sure what the use case for it is, but I guess watermarking could be one of the use cases where this comes in. Or are you talking about visible watermarks?

Well, I don't mind so much. I'm just thinking about retractions of articles, which are normally still there but watermarked as retractions. Excuse me? If you retract an article, then normally it is still there, but it will carry the watermark: this article is retracted. OK, sorry, now I can follow you. Yeah, yeah, yeah, OK.

Yeah, and I think exactly this is what's fantastic: that we have to look into other fields and then look at the application. This is what we should call the NDPD, basically, right? Like what we did with your talk. The other talks were also great, but this one is great because of the transfer between fields.
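The perceptual-hash step described above (shrink the image, convert to grayscale, compare each pixel to its neighbour over an 8-by-8 grid) is essentially a difference hash. A toy sketch on a hypothetical pixel grid, standing in for the real shrink-and-grayscale preprocessing:

```python
def dhash(pixels: list) -> int:
    """Difference hash on an already shrunk 9x8 grayscale grid:
    one bit per horizontal neighbour comparison, 8 rows x 8
    comparisons = 64 bits. A real pipeline first scales the image
    down and converts it to grayscale."""
    bits = 0
    for row in pixels:                      # 8 rows of 9 pixel values
        for x in range(8):
            bits = (bits << 1) | (1 if row[x] < row[x + 1] else 0)
    return bits

# Hypothetical tiny "image": a horizontal brightness gradient.
grid = [[x * 30 for x in range(9)] for _ in range(8)]
# Globally brighter version of the same image (clamped to 255).
brighter = [[min(255, v + 10) for v in row] for row in grid]

# Only the sign of neighbour differences matters, so a uniform
# brightness change leaves the fingerprint identical.
assert dhash(grid) == dhash(brighter)
```

Because the hash encodes gradient directions rather than absolute values, it tolerates re-encoding, scaling, and mild edits, which is exactly why the JPEG and PNG in the earlier demo got the same content code.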
Yeah, I'm totally not from the scientific background, so bear with me there. Yeah, or science communication, culture, or something. But this is exactly what makes it valuable sometimes. You like to do this, yeah? OK, there are one or two more questions.

OK, thank you very much. I just have one question. Could I use your model to detect deepfakes, where deep learning is used to fake faces in images? Or not, because you can hash the photos, and someone can copy, paste, or change something in them?

Well, I guess that because we are using dimension-reduction technologies in the different components, these are actually viable inputs for machine learning and deep learning tasks. You could, for example, take the identifiers and try to learn which language a text is written in just by looking at the identifier. Of course, all of this is probabilistic information: you don't get guaranteed answers, but you can do statistics with it and use it for machine learning. OK, thank you.

Just to ask: do you also distinguish the gravity of changes? A very simple example: you have essentially the same text, but one author comes to the conclusion that something is true, and the other says it is not true. In the extreme case you would only have one added word, but a completely different statement. Would that rate the same as a filler word like "well" or "so" that you put in, which doesn't mean much? Or, for example, a mathematician has a calculation, and in the simplest case the result is 42, and it would matter if somebody else had 43. Would that just count as a single changed digit, potentially an error? Or can you say: OK, this is a major deviation in the scientific statement? Or would you just rank it as a typo, a grammatical error, or a very minor deviation?
Well, you are already on the semantic level, which we haven't implemented yet. Currently it is really just syntactic similarity that we have in the content ID. Basically, we delete all the whitespace, normalize the text, go through it in 13-character chunks, and create 128 different hashes over it; it is purely structural similarity, with nothing semantic in it. If we do the semantic ID... I guess I have to stop now. We can talk separately. Thank you very much. No, thank you. Thank you very much.
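As a closing illustration, the text pipeline described in that last answer (normalize the text, slide a 13-character window, compute 128 hashes) corresponds to character shingling plus a MinHash signature. This is a sketch under those assumptions, with hypothetical test data; the exact ISCC normalization and hash functions differ:

```python
import hashlib
import re

def shingles(text: str, size: int = 13) -> set:
    """Normalize the text and slide a 13-character window across it."""
    norm = re.sub(r"\s+", "", text.lower())
    return {norm[i:i + size].encode() for i in range(len(norm) - size + 1)}

def minhash(features: set, n: int = 128) -> list:
    """n independent min-hashes; the fraction of matching positions
    between two signatures estimates the Jaccard similarity of the
    underlying shingle sets."""
    sig = []
    for seed in range(n):
        salt = seed.to_bytes(2, "big")
        sig.append(min(int.from_bytes(hashlib.sha256(salt + f).digest()[:8],
                                      "big")
                       for f in features))
    return sig

def similarity(a: list, b: list) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Deterministic pseudo-random "documents" (hypothetical test data).
words = [hashlib.sha256(str(i).encode()).hexdigest()[:8] for i in range(200)]
doc = " ".join(words)
near_dup = " ".join(words[:100] + ["changedword"] + words[101:])
unrelated = " ".join(hashlib.sha256(str(i).encode()).hexdigest()[:8]
                     for i in range(200, 400))

sig = minhash(shingles(doc))
# One changed word in 200 barely moves the signature ...
assert similarity(sig, minhash(shingles(near_dup))) > 0.85
# ... while unrelated text shares almost no positions.
assert similarity(sig, minhash(shingles(unrelated))) < 0.3
```

This also shows why sensitivity is relative to the whole document, as discussed in the Q&A: each edit removes and adds only the handful of shingles overlapping it, so the estimated similarity scales with the fraction of the text that changed.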