Hello. Can you hear me? Thanks for coming. More on the spinny globe thing later. So I'm Matt, an engineer at the FT, and I'm going to talk about democratising data. First off, I feel like a bit of an imposter. Many of the talks I've seen today are about openness for the greater benefit of humanity, but my talk is about closed, private data that you're never going to get your hands on. It's mostly for the benefit of one organisation. But a lot of the themes, I think, like hackability, simplicity and usability, are directly applicable to organisations like the FT. So this talk is about that.

So the FT is a 125-year-old news organisation. It's got nearly 800,000 subscribers, 70% of whom are digital subscribers. We've got about 5,000 companies who buy seat licences, and their staff access the site through those companies. And for a sense of scale, our data team is around 25 to 30 people. And this is Tom Betts, our chief data officer. He came up with this phrase last year about democratising data. It describes a push to help the organisation make data central to people's working lives: take it out of senior management land and put it into the hands of the people doing day-to-day delivery. The idea is that whatever job you do, the better informed you are, the better decisions you can take.

So in democratising data, we're really talking about accessibility of data. If your data is locked up in a warehouse, or often dozens of different warehouses, stored in opaque formats, and probably fished out with odd languages or protocols, it's going to be less accessible and less democratic. It's not dissimilar to the various open data manifestos. Here's a list of criteria for what constitutes open data from a site called the Open Definition, part of the Open Knowledge Foundation. Is it available online? Can we use it? Is it machine readable? Is it available in bulk? And is it free? So although the FT and other organisations won't make our data available to the public, the needs of our internal community of users are essentially the same as those of the users of, say, government data, which these manifestos are targeted at. And you can ask yourself: does the data at your own organisation follow these guidelines? It's not entirely universal at the FT, either. Obviously financial data and personally identifiable information are kept a little more private, even amongst ourselves. So in the next 20 minutes I'm going to explain the uses of data at the FT and the systems we've built over the last year or so to support that.

So first up, this is Lantern. It's a tool used by our newsroom to gauge the performance of the stories we publish. These things are quite commonplace in newsrooms; most publishers use some form of bought-in or bespoke solution. Every time a user does something on ft.com, we fire a tracking pixel with metadata describing what they did, and Lantern aggregates all of that and displays it in meaningful ways to the journalist. So it's partly education, like what percentage of users found this article and where did they come from, and it's partly a decision-making tool, like when do you take something off the front page because it's becoming stale.
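As an aside, the tracking pixel is a very simple pattern: the page requests a one-pixel image whose query string carries the event metadata, and the server logs the request. Here's a minimal sketch; the endpoint and parameter names are my own illustration, not the FT's actual interface.

```python
# Hypothetical sketch of the tracking-pixel pattern: the page embeds a
# 1x1 image whose query string carries event metadata, and the server
# logs the request. Endpoint and parameter names are illustrative only.
from urllib.parse import urlencode

def pixel_url(category, action, device_id, **context):
    """Build the URL of a tracking image carrying event metadata."""
    params = {"category": category, "action": action,
              "device": device_id, **context}
    return "https://tracking.example.com/px.gif?" + urlencode(params)

# The page would embed this as <img src="...">:
print(pixel_url("page", "view", "device-abc", url="/content/some-story"))
```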
So this was a UK story published a few weeks ago, and you can see the peak of UK traffic arriving at 8am as people travel to work. We can see people took about 43 seconds to read it, and the retention rate was 14%, so 14% of those people went on to do something else on the site. The news agenda isn't really led by this, but in terms of understanding the audience and the impact of promotion and so on, it has its place in understanding the consequences of our actions.

Similarly, a key part of how we communicate with our users is by email. A lot of the FT's users will typically only interact with us by email. We have dozens of daily emails, some automated, some curated: newsletters, alerts triggered by keywords, breaking news, marketing, and so on. We send several million a day. An email has a life cycle, from the point where you subscribe, to when we send it, to when you open it, to when you click something in it, and ultimately to when you unsubscribe. So editors have dashboards for each of these emails that describe that life cycle. This one has an open rate of 54% and a click-through rate of 34%.

I mentioned earlier that the FT has a central data analytics team, and they use a mix of SQL, Excel, R and, more recently, a hosted reporting tool called Chart.io, from which this screen grab is taken. It's fairly typical of the sort of information they produce: a description of referrals to our HTML5 web app. You can see the influence of social media referrals in red against search referrals in green; it's about a 50-50 split.

So aside from reporting, we also power parts of our user-facing products purely from the data. Here's the standard top 10 articles that you find on most news front pages. This is generated from a simple rolling count of what people have been reading over the last 12 hours or so. And our journalists tag all their stories with topics, so you can display popularity on a more thematic level, like what was in the news today, or what were people reading this week, or what are we writing about as an organisation.
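As I'll mention later, a simple store like Redis is enough for this kind of rolling count. Here's a minimal sketch using hourly sorted-set buckets; the key names and bucketing scheme are my own illustration, not the FT's implementation.

```python
# A minimal sketch of a rolling "most read" counter, assuming Redis.
# One sorted set per hour; the top 10 is the union of the last 12 buckets.
import time
import redis

r = redis.Redis()
BUCKET_SECONDS = 3600   # one counter per hour
WINDOW_BUCKETS = 12     # rolling 12-hour window

def record_read(article_id):
    bucket = int(time.time()) // BUCKET_SECONDS
    key = f"reads:{bucket}"
    r.zincrby(key, 1, article_id)                       # bump this article's count
    r.expire(key, BUCKET_SECONDS * (WINDOW_BUCKETS + 1))  # old buckets fall away

def top10():
    now = int(time.time()) // BUCKET_SECONDS
    keys = [f"reads:{now - i}" for i in range(WINDOW_BUCKETS)]
    r.zunionstore("reads:window", keys)                 # sum counts across the window
    return r.zrevrange("reads:window", 0, 9, withscores=True)
```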
And to power these thematic views internally, we publish a graph of interconnected topics. So we can link our stories to companies, and link the companies to their industry, the people who work for them, their board members, stock prices, and so on. On the FT site you can follow any one of these topics, and so by linking our internal representation of the news with user activity, we can generate personalised journeys through the information we publish. Here's a simple example of how that looks. Rather than recommending based on what everyone else is doing, we can find things that relate to you personally, which ultimately we think is more valuable. So we've got quite a data-led, multi-pronged approach to recommendations at the moment, and these different ways of helping people find what they're interested in are complementary to our editorial line.

A completely different use case: we've got an internal communications team, and they build dashboards to display around the office, in the canteen and at reception. Typically these things give a sense of who's using the FT now and what they're reading. They need something very visual, something that explains itself in a few seconds as you walk past. This one cuts the article stats by country, and you can infer, as you sit there in the canteen eating your salad, that most countries read about themselves.

So marketing is one of the more interesting parts of the FT. The paywall, the pricing and so on are all underpinned by A/B testing, and the projections are all based on behavioural and financial data we've collected. On the screen is the output of an API we call the propensity API. It's a predictive model of any anonymous user's likelihood to subscribe to the FT. If you turn up to the FT, you will have a score. If you don't come very often, your score will be something like zero, which means you have no chance of subscribing. The more frequently you come back, and the more things you do on the site, the higher your propensity score. Within that, we've got different products targeted at different markets and different price points, and the propensity score tries to map your activity onto one of those price points so we can target you as an individual. It's built by the data science team, who have distilled about 500 variables into a dozen or so key indicators of your future behaviour. Having this in API form means you can adapt the user experience to different types of users. For example, what would it take for a person who we think has a high propensity to subscribe to actually do so? Do we show them marketing or discounts at a particular point in their journey? As an example, we offer a 25% discount to people in a certain propensity bracket.
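To illustrate how having the model behind an API lets products adapt, here's a minimal sketch of what a call might look like. The URL, response fields and bracket names are all assumptions for illustration, not the FT's actual interface.

```python
# Hypothetical sketch of querying a propensity-style API; the endpoint
# and the response shape are invented for illustration.
import requests

def propensity(device_id):
    resp = requests.get(f"https://api.example.com/propensity/{device_id}",
                        timeout=2)
    resp.raise_for_status()
    return resp.json()  # e.g. {"score": 0.72, "bracket": "high"}

user = propensity("device-abc123")
if user["bracket"] == "high":
    # e.g. surface the 25% discount mentioned above at this point
    # in the user's journey
    print("show discount offer")
```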
So this is a form on our website prompting users for feedback. In previous companies I've worked for, the customer research team used external companies to collect this sort of information. They work well, but if you use an external company, your information tends to be disconnected from everything else you know about that person. Collecting it ourselves allows us to connect it to the rest of our customer data in the warehouse, so you've got a much better handle on what that data means. For example, you can take each piece of negative feedback and look at what that user has done after sending it. We might want to know if negative feedback results in lower subscription renewals over time. Or we can split out feedback just from loyal users, which we may want to prioritise or pay closer attention to. And the team who do all this work like spreadsheets, so we pump all of that data through the Google Sheets API for them to play around with and do the analysis they need. And because customer feedback is very important, we broadcast it to all staff on an internal Slack channel. It's a mix of praise, disappointment and abuse.

So it's a very diverse mix of use cases, with users right across the business, from the newsroom to the commercial teams to the digital product teams, and they've got different needs. The newsroom needs real-time data to make its decisions, but the behaviour models we saw for the marketing team evolve over a three-month window. There are different levels of complexity: something like most popular is a very simple 12-hour counter, whereas the recommendation algorithms require specialist databases. And there are different skills. Some people are comfortable processing raw data, but getting hold of terabytes of data and processing it is often not trivial, so some people need abstractions over that, and some people need spreadsheets to make it simpler to work with. But most of all, we need one system to do all this.

You find that fragmenting data across lots of different systems ends up in multiple versions of the truth: this data warehouse says one thing and that data warehouse says another. And that's a problem, because it creates conflict, and data should be used to give answers, not create confusion. So to support this variety of use cases we started building our new analytics platform. Architecturally, the analytics system collects events from the world and puts them in a data warehouse. It's a relatively straightforward thing. But if you're going to make a usable system, one that normal people can understand, people need to be able to understand the data they send to it. Almost everything going on inside the FT's ecosystem can be described as an event. There's a mix of things that our users do on the website, in other products or in email, and there are lots of events that happen within back office systems. We express each one as a category and an action. A category is a general domain, like page or signup or email, and the action is a verb that describes what happened: a view, a scroll, or an open in the case of email. The idea is to get away from the arcane descriptions of things inside your stats system and make them more human-relatable.

There are several other concepts attached to this model. Events happen at a time, on a given date. They happen to a user on a device, and the two things are very different: FT users have subscriptions, so we can track all of their activity wherever they come into contact with us, whereas devices are effectively anonymous users, or users that don't have subscriptions. Context describes everything you might want to attach to an event: for a web page it might be a URL, for an email the address of the sender, or for subscriptions perhaps something like the payment transaction ID. And system is just metadata associated with the sending system, things like API keys and versions. The API into the data warehouse is a simple HTTP JSON one, so you should be able to see all the concepts I've just talked about serialised as a plain, understandable JSON document. And again, the emphasis is on simple. There are no strange interfaces, no limitations on key-value pairs, no dependency on libraries. Beneath this level it's effectively schemaless. Anyone at the FT can read these documents, and as long as their system can make an HTTP request, they can start logging events with us.
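To make that concrete, here's a sketch of what a single event might look like when posted to the API. The endpoint, key names and values are my assumptions based on the concepts just described, not the FT's real schema.

```python
# A sketch of one event being logged over plain HTTP JSON. Everything here
# is illustrative: the endpoint and field names follow the concepts in the
# talk (category, action, time, user, device, context, system).
import requests

event = {
    "category": "email",                        # the general domain
    "action": "open",                           # the verb: what happened
    "time": "2015-06-01T08:02:11Z",             # when it happened
    "user": {"id": "subscriber-123"},           # a known, subscribed user
    "device": {"id": "device-abc"},             # the (possibly anonymous) device
    "context": {"sender": "news@example.com"},  # anything worth attaching
    "system": {"source": "email-platform", "version": "1.2.0", "api_key": "..."},
}

requests.post("https://ingest.example.com/events", json=event, timeout=2)
```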
So there are lots of systems generating events. There are client-side events, server-side events, webhooks from third-party systems, things like AMP. And we capture offline events, where a device buffers up a series of events and then sends them in a batch when network connectivity reappears.

The very first version of our system had this API to put events in, and then put them straight into the data warehouse, which at the moment is Redshift. But all databases have limitations. Redshift is very powerful, but very slow. Elasticsearch is fast, almost a cache, but it's unsuitable for ten years' worth of analytics data. If you think back to the examples at the start of the presentation, you can generate a top 10 from something simple like Redis. The newsroom analytics tool uses Elasticsearch, effectively a rolling 30-day window across users' activity. And the recommendation systems at the moment use Neo4j, because they need to query relationships across our graph of topics. So we needed a way to let people consume the event stream in their own specialist systems, and we ended up with two options. The API publishes into Kinesis, which I'll explain only briefly: it's effectively an ordered seven-day record of all the events, so you can tap into any point in that history and start replaying events into your own systems. And SQS is a classic message queue: you pick up a message, you do something with it, you delete it off the queue, you pick up the next one. We have two because people's preferences differ. Some people like the simplicity of SQS and message queues; some people wanted a continuous stream of data. All of the data stores hanging off the back of those two things are effectively our data platform. It's what all of the examples at the start of this presentation have been built on.
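For the message-queue option, the consumption pattern is the classic receive, process, delete loop. Here's a minimal sketch using boto3; the queue URL is a made-up placeholder.

```python
# A minimal sketch of the SQS consumption pattern described above, using
# boto3. The queue URL is an assumption; each consumer keeps only the
# fields it cares about and discards the rest.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/analytics-events"

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                               MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)       # long poll
    for msg in resp.get("Messages", []):
        event = json.loads(msg["Body"])
        # ... store the fields this consumer cares about ...
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
```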
So you've built this data pipeline. You've got a way to put stuff in and a way to take stuff out again, but it's not adding an awful lot of value in itself. So we talked to people and looked at what they were doing with it. They often went on to transform the data, or annotate it with extra information to make it more meaningful, and sometimes they were doing the same thing over and over again. Each variation of that increases the opportunity for mistakes, so we wanted to centralise some of those useful transforms. We call them enrichments: as an event comes into the API, and before it gets to the event stream on the other side, we transform and enrich the data.

So if your event contains a URL of some sort, we tokenise it. Rarely do you need to perform operations on a full URL; you typically want a component, like the query string if you're analysing search terms, or the path, or the domain as a filter. And each annotation is just appended to the event, so the original data exists alongside the annotation.

Then there's a time annotation. This transforms a simple ISO date-time stamp into lots of useful properties. Some of this is to speed up processing as the data hits the downstream systems: if you've got a database with poor date functionality, it's typically quicker to process integers than to parse date strings at query time. We ensure everything is transposed into UTC to avoid any confusion over time zones and dates. And at the bottom there's an annotation for week, which has a very specific meaning within the FT: it's our reporting-cycle week, so that everyone across the business generates their reports from Sunday to the following Sunday. So the enrichments help centralise the business logic.

Sometimes enrichments need to connect to APIs. Some of our events have IP addresses associated with them, so we fire those at the MaxMind API. MaxMind is a service that geolocates an IP address to a reasonable degree of accuracy, and so our events are all now annotated with a city-level geographic annotation, which can help with some of the analysis. The spinning globe at the start was based on that enrichment. And at the bottom somewhere we've hacked on FT office detection, to help filter out false data or test data. So the enrichments let us take data from any FT API, or any external API, and add that to each event.

And lastly, sometimes we need to invent APIs that don't exist. Given a domain name like Facebook, this API will classify it as social media; in other cases it classifies a domain as a search engine, a news site, or one of our partner sites. Doing this in one place helps standardise the analysis everyone would otherwise be doing locally. It's very easy not to keep your list of social networks up to date, and if we add a new partner every week, nobody is going to maintain that across lots of downstream systems, each inventing its own way of achieving it.
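Here's a minimal sketch of how two of these enrichments might look, combining URL tokenisation with the domain classifier just described. The category lists and key names are placeholders, and the annotation is appended so the original data survives alongside it.

```python
# Illustrative sketch of two enrichments: URL tokenisation and
# referrer-domain classification. Lists and key names are placeholders.
from urllib.parse import urlparse

SOCIAL = {"facebook.com", "twitter.com", "linkedin.com"}
SEARCH = {"google.com", "bing.com", "duckduckgo.com"}

def classify_domain(domain):
    if domain in SOCIAL:
        return "social-media"
    if domain in SEARCH:
        return "search-engine"
    return "other"  # could also be a news site, partner site, etc.

def enrich(event):
    """Append URL tokenisation and classification; leave raw data intact."""
    url = event.get("context", {}).get("url")
    if url:
        parts = urlparse(url)
        domain = parts.netloc.removeprefix("www.")
        event["url_annotation"] = {      # appended alongside the raw URL
            "domain": domain,
            "path": parts.path,
            "query": parts.query,
            "classification": classify_domain(domain),
        }
    return event
```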
So the enrichment results in this huge JSON document. The screen isn't big enough; I think it would go up three or four storeys. All of the consumers you saw in the diagram effectively pick and choose what they want to take out of that stream and store in their own systems, and they throw the rest away. We've got 22 annotations to date, each one adding a bit more meaning to the events we collect. It takes about a second for an event to get from arriving at the API onto the event stream, and the enrichment part of that, essentially 22 asynchronous operations, takes under 200 milliseconds. To put those amounts of data in context, we're doing about 200 requests a second, peaking in the morning at 1,500. So it's somewhere in between big data and small data. Oops.

We've got a long list of enrichments we're going to experiment with over the next few months. Some of them, like the Freebase enrichment, start to connect our internal vocabularies to external ones. Some involve hitting external APIs, like SharedCount, which stores social media activity, and harvesting that. And then there are some more experimental ideas, like whether weather or market prices will help us find correlations with the rest of our data.

So it's working OK. As of March this year we're no longer using any third-party tracking libraries on the FT; it's an entirely in-house system. The most painful thing we've discovered so far is a lack of standardisation in what comes in. I said it was schemaless, and that's bitten us a little. We've got three or four different products tracking video in three or four different ways, and that makes it very hard for the people doing the data analytics to join all that data up and fix it. So the big thing going on at the moment is writing schemas for the different types of events coming in and then trying to validate against them. If anyone has a nice way to do that, I'd love to hear from you. Please don't say JSON Schema.

Where's my timer? It says about 15 hours; I don't think I've been talking for that long. So how have we reached this vision of democracy? I think we're somewhere near it. We focused on the users, as you saw from the many examples at the start. We made it very learnable, I think: it's conceptually quite simple wherever possible. We made it very easy to use, with APIs to get stuff in and out in lots of different ways. The whole development process has been very iterative; we've had hundreds of production releases, which lets us adapt very quickly to people's needs. The code base is effectively open for contribution internally, so lots of people, as they come up against problems, just patch the code and get the annotation they need. One of the most gratifying things, I think, is that none of this was planned as a traditional project. We didn't sit down 18 months ago and write a long list of all the things we wanted to do. So the thing is a testament to the philosophy of open data: if we make it accessible, it allows curious people to take it and play with it, it stimulates ideas, they learn something, they share it, they create value from it. That sort of freedom is where the ultimate benefit for the FT exists. And we're hiring, by the way. Thanks.

[Question from the audience] When annotations are removed or changed, users who aren't technical, who just want to work with the data, see elements appearing or disappearing in their feeds. How do you manage that?

We manage it. I think a good API should be managed. It's effectively a standard, and if you want to deprecate something we'll say this isn't going to be here in three months: you've got three months to fix your system, or it's disappearing. For the first six months, I think we annoyed everyone, because we were constantly changing the output and the schema, adding new things and taking things away. But it's been stable since the beginning of the year, and we're just going through a process of tidying up some of the annotations that people aren't using and bringing the new ones in. Anyone else?