This is an overview of SayIt, which is a component for displaying transcripts, that is, the written record of meetings. I'm Dave Whiteland from mySociety, and I'm going to show you SayIt. I'm going to explain why it's so useful and why we built it, and then I'll go into a little more detail about how you can use it. Then we'll have a quick look at some of the technical things you can do to put your own documents into SayIt.

So this is what I'm going to run through: first some examples, and then a case study. We'll have a quick look at how to make your own SayIt, both manually and by importing speeches, and finally a brief look at the Akoma Ntoso XML standard we use for this, and a very brief introduction to parsing.

Right, let's jump right in with an example. This is the trial of Charles Taylor, former president of Liberia, and these are the transcripts of his trial, where he was accused of war crimes and crimes against humanity by the Special Court for Sierra Leone. This is SayIt presenting the details of the first hearing. I can click on a speaker and it will show me a collection of all their speeches and statistics about them. You can see this is Richard Lussick, who was involved in many of the hearings. Or I can search across all the transcripts. This jumps into a specific mention inside one of the other hearings.

Here's another example: this is the People's Assembly parliamentary monitoring website for South Africa. The activity of the South African Parliament is recorded in Hansard, and the group running this site, PMG, take the proceedings of Hansard and put them into SayIt. You'll see this looks a little different, because this time we're looking at SayIt running as an embedded part of a larger website. In this case, if I click on a speaker, it'll take me through to the page on the parent website.

Right away you might think that this is just a tiny bit boring, and that there are more exciting things to put on the internet than the minutes of meetings.
So let's put this in context: the context of civic tech. It turns out that pretty much everything you can and cannot do in your life has been decided by powerful people in a meeting. Local government, parliament, courtrooms: these are all meetings whose outcomes affect normal people. You might not think about this much, but it's pretty much how the bureaucracies that we use to run our societies work. And one of the purposes of civic tech is giving citizens the ability to know about these processes, because once you know what's going on you can hold your representatives to account and ultimately influence or change them. The old "knowledge is power" thing.

Well, the outcomes of the meetings that shape the societies we live in are things like laws and bills and timetables, budgets, and even court verdicts or prison sentences. These are often very important things. But the process by which they happen is nowhere near as widely shared or as accessible to the people it is affecting, and certainly not as much as we think it should be.

So here's a third example, and this time I'll show you the problem we're really addressing with SayIt by looking at the original document, too. This is the Leveson Inquiry into the culture, practices and ethics of the British press. The public hearings for this were held in 2011 and 2012. Some key things to bear in mind. Firstly, this was a parliamentary inquiry funded by public money. Secondly, although the public at the time knew the inquiry was going on, a lot of it was critical of the behaviour of the press, especially the tabloid press, which of course had no interest in reporting on it. And finally, the result of the inquiry was published as a report. But there's an important thing here: the transcripts of the whole inquiry, the actual details of everything that was said, were also published.

So I'm going to show you how the inquiry published the hearings. It's telling that actually the end result is really a paper document. Let's have a look. This is the official
website of the Leveson Inquiry. I can look at the hearings and pick a day that I'm interested in, for example Thursday the 24th of November, when JK Rowling gave evidence to the committee. Here's the transcript of that session. Now, on a pedantic level, the document is digital, because it's a PDF. But it's really just paper behind glass. Let me show you what we mean when we use that term, because it's important to understand what's really wrong here.

In this case, and the Leveson Inquiry is especially poor because of it, you can see that there are four pages on each PDF page. This only makes sense when printed out on paper. PDF viewers know about pages, but this layout even manages to break that. It's actually a staggeringly user-hostile way of laying out a document. It makes it hard to read, and that's somewhat astonishing when the purpose of producing the document in the first place ought to be to let people read it. It turns out it's a hostile way to lay out a document from a machine-reading point of view too, because it makes parsing it harder, although these transcripts were also made available as text files.

Now notice the line numbers. That's an attempt to allow deep linking, because you can refer to any utterance by quoting its page and line number. Again, that makes sense on paper. Well, it would if "page" really meant anything in this case. And in fact it's a successful medieval technology: it's been done this way in Bibles for over a thousand years. But the important point here is that this is a form of deep linking in a technology that does not support it. You can't link directly to a place within the PDF. And in fact, even if you know the page and line number, you still can't jump there using the PDF's mechanism for page or search. The whole thing basically prevents digital linking, which on one level is just annoying. But from a civic tech point of view, it really lessens the usefulness of this document, because if you can't deep-link, you can't cite, you can't share.
You can't show specific utterances. You can only refer to the whole document. And that pretty much stops details in the document being called out in debate, in social media, or in any sensible way online.

So here's what we did. We parsed that PDF, isolated every speech, identified each speaker, recorded how it was all linked together, and put that into a database. SayIt can then reconstruct that structured data to produce the pages of the website that you're looking at now. This is SayIt's presentation of the Leveson Inquiry. You can look at the session we were just looking at. This time you see it's clearer, easier to read, broken down by speaker. Of course I can investigate the individual speakers' contributions. But perhaps more usefully, and this is the key point, this supports deep linking. If I get the link in context, you can see there is a URL here which I can share. I can drop somebody into the document at the utterance that I'm interested in sharing, or here, directly just to the speech. And of course with this I can share it on Twitter, on Facebook, in blog posts, in newspaper articles.

So now you have a good understanding of what SayIt is doing and why it matters. Let's start with a simple example of making your own SayIt site. mySociety currently runs a hosted version, which means you can add a SayIt site, which we sometimes call an instance, without needing to worry about installing the software yourself. So here I'm going to create a new SayIt. I'm already logged in, and it's created a new instance.

Right, let me add my first statement to this instance. It's going to be the March Hare saying "Have some wine". You'll see that SayIt has created the speaker, so I can edit it to give a little more information. For example, I've got a picture of the Hare. And in this case I can use "March Hare" as the name, so that the "the" doesn't get considered when sorting the speakers into lists. So that's already a bit better. Actually, I can add a section, which is just a way of organising speeches. We saw them separately as
hearings in the trials that we've looked at before. I can move the speech that I've put in. Looking at the speeches now, under "Tea Party": that's a start. Let me add another speech. Alice says to the Hare, "I don't see any wine". I can add a picture for Alice. Let's see how we're getting on now. The Tea Party is starting to take shape, and you can see I can build up the transcript of the meeting in this way. Now, obviously this is quite a laborious way of doing it. So let's have a quick look to see what we've ended up with.

So clearly that's quite a laborious way to enter all the text of a meeting. There are advantages to doing it that way: you could crowdsource, for example, because you can nominate other people who you're going to give write access to your transcripts, so they can work on it at the same time. But there is another way, and that is to import the speeches.

The way this works is SayIt will consume a document which contains the transcript of the meeting that you're trying to show, provided it's in a format that it knows how to accept. And that format is an XML standard called Akoma Ntoso, which I'm going to show you now. In fact, the easiest way to see this is to look at the text that we've already got. Here's the Tea Party, and actually SayIt will expose the underlying XML. This is the Akoma Ntoso XML format of the Tea Party. You can see at the top the whole thing is wrapped in a debate tag. There are references to each of the speakers, which you can see here are given an ID. There are four; I've added the Dormouse as an extra one. And then within the debate body you can see each speech is credited to an ID which matches one of the IDs in the references section at the top. So if you're familiar with XML this won't seem very complicated, and if you've seen any HTML you'll recognize some of what's going on inside as well. So really, in order to populate your SayIt, the challenge is to get your speech document into this XML format.

Now, we're using Akoma Ntoso because it's an open standard
that is increasingly widely used in parliamentary and legislative documents. It's used by the Library of Congress in the US, the Italian government, the European Union. Akoma Ntoso itself can get quite complicated, but we're using a small subset of it, which is pretty much summarized by what you're seeing here.

So to show how that works, I'll make a new demo SayIt and import the speeches. It wants to know a source of Akoma Ntoso, so let me use this URL. So I'm actually importing the speeches, for the point of demonstration, from another site. And here we go: now this instance is populated with the speeches from the Akoma Ntoso file. Now you'll notice that the speakers didn't come across. That is because speakers are actually a separate dataset, which we could import separately. In fact there are several ways of doing that, and we use the Popolo JSON format. And we'll look, perhaps at another time, at PopIt, which is another Poplus component, which exports data about people (quite often this is parliamentary people, so it's useful for speeches and SayIt) into your SayIt.

Now, a thing that I want to finish by pointing out: we've glossed over a little about how you get your Akoma Ntoso XML. Because normally you're starting with a transcript or a record of a meeting, which might be a PDF, or a Word document, or even video or audio. At some point that needs to be transcribed into text, which then gets converted into the XML format which SayIt can import. Now, the extreme case of what that may look like is something like this. This is the original text of the Tea Party that I've been using as an example, from Project Gutenberg. So actually, in one sense, this is the original document. This is the text from which SayIt is getting its own. A simple way to look at that is the same meeting described as a script. I did this by hand, reading the original text and converting it into this format. Now, if you can understand that process, which as a human reader is fairly straightforward, you can see it's not a great
leap to do a similar thing here. It's just the script marked up in XML. The catch, of course, is how you go about writing such a parser. The point that we want to make in this short demonstration is that the technical format of the source document, whether in this case it's HTML, or it might be a PDF or a Word file, is actually a small part of the problem, because by and large there are technical ways to get to the text content of those documents. The problem of parsing really becomes how you identify where a speaker is speaking.

So in this case, if this was your source document, to turn this into XML it wouldn't be too difficult to write a script that looks for a name in capital letters followed by a colon, and assumes that what follows is a quotation until the next token which looks like a speaker. This works fine, of course, until you hit something else with capitals and a colon, or if your source document is more complicated. Here's an example. You can probably identify where people are speaking by the use of the quotation marks. But it gets a little complicated, because sometimes we've got things going on here where Alice is thinking, not speaking.

So what I'm drawing attention to is the fact that parsing a transcript, parsing a document to put it into something like SayIt, is not just a technical problem. It actually starts to become a semantic problem as well. Which is why, when we're asked "Can you put this document into SayIt for us?", it actually depends a lot, when we talk about the format of the document, not just on the technical format of it, but on the arrangement of the text and the ordering within it: the way that the different semantic tokens, speakers and speeches, are represented.

So that's it. SayIt is free software for presenting transcripts of meetings online. We'll host it for you, but if you prefer you can install it yourself; instructions for that are on the project website. We looked at how you can enter your transcripts by hand or, better still, how you can import them using
the Akoma Ntoso XML format. And we also briefly looked at parsing, and how, when you're turning your document into the XML format, this can be straightforward or complicated, depending on the nature of the document you're starting with.

If you have any questions, please join the Poplus group and ask on the mailing list. We look forward to welcoming you.
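
As a rough illustration of the parsing idea described in the talk (look for a name in capital letters followed by a colon, and treat what follows as that speaker's speech), here is a minimal Python sketch. It is not SayIt's own importer or parser. The function names (parse_script, slug, to_akoma_ntoso) and the sample script are made up for this example. The element names (debate, references, TLCPerson, debateBody, debateSection, speech) are taken from the Akoma Ntoso debate vocabulary as far as I understand it, and the output is deliberately simplified: it has not been validated against the Akoma Ntoso schema or checked against what SayIt's importer actually requires.

```python
import re
import xml.etree.ElementTree as ET

# Assumed source convention: a speaker name in capitals, a colon, then the speech.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z .'-]*):\s*(.*)$")


def parse_script(text):
    """Split a script into (speaker, speech) pairs.

    A line that does not look like "NAME: ..." is treated as a
    continuation of the previous speech. Any other line made of capitals
    and a colon would be mistaken for a speaker, which is the weakness
    the talk points out.
    """
    speeches = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            speeches.append([match.group(1).title(), match.group(2)])
        elif speeches:
            speeches[-1][1] += " " + line
    return [(name, words) for name, words in speeches]


def slug(name):
    """Turn a display name into a hypothetical speaker ID, e.g. 'march-hare'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def to_akoma_ntoso(speeches, heading):
    """Build a simplified, unvalidated Akoma Ntoso style debate document."""
    root = ET.Element("akomaNtoso")
    debate = ET.SubElement(root, "debate")
    meta = ET.SubElement(debate, "meta")
    references = ET.SubElement(meta, "references")
    # One reference per distinct speaker, in order of first appearance.
    for name in dict.fromkeys(n for n, _ in speeches):
        ET.SubElement(references, "TLCPerson", id=slug(name), showAs=name)
    body = ET.SubElement(debate, "debateBody")
    section = ET.SubElement(body, "debateSection")
    ET.SubElement(section, "heading").text = heading
    # Each speech points back at a speaker ID from the references section.
    for name, words in speeches:
        speech = ET.SubElement(section, "speech", by="#" + slug(name))
        ET.SubElement(speech, "p").text = words
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    script = "MARCH HARE: Have some wine.\nALICE: I don't see any wine.\n"
    print(to_akoma_ntoso(parse_script(script), "Tea Party"))
```

The regular expression is the whole of the "semantic" decision here: it encodes one assumption about how speakers are marked in the source. A document that marks speech with quotation marks, like the original Alice text, would need a different rule and probably some human judgement.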