Hi, everyone. My name is Ben. I'm a software engineer at BBC News Labs. You might have seen me speak at EuroPython before; I gave a number of talks when I worked at the Raspberry Pi Foundation. I'm based in Cambridgeshire in the UK, and you can see my website, Twitter and GitHub links there. Of course, I would love to have been at the conference in real life. I've been looking forward to coming back to EuroPython post-pandemic; I haven't attended in person since 2019 in Basel. Unfortunately, I recently got COVID for the first time, which was a real shame. Fortunately, I'm getting better, and I did test negative today, so hopefully I'm on the mend now. A big thank you to EuroPython for making this possible; before the pandemic a speaker probably wouldn't have been able to present remotely, so the way things have changed is something I really appreciate.

A little bit about where I work. BBC News Labs is a multi-disciplinary innovation team within BBC News and BBC R&D (Research & Development). We build prototypes of new audience experiences and new ways to get content to audiences. We come up with solutions to help journalists with their jobs: automating work that would otherwise take a lot of time, improving their processes, and making it easier for them to get on with the things they need to do. We also do a lot of research, and we try out lots of ideas that don't necessarily solve an immediate problem but seem worth investigating, just to see what we can do. We like to build prototypes and see what happens. We have a website, bbcnewslabs.co.uk, where we write up all our projects and blog about the ways we work, and we have a presence on Twitter as BBC News Labs.

What I'm going to talk about covers some projects I've worked on in recent months, particularly these three, which also give you some context for the kinds of problems we work on. IDX ("identify the X") was a project where we tried to automate the clipping of live radio content to get it ready for social media. If something is said on air that would be worth clipping, whether it's local radio or national radio, it's nice to be able to resurface that individual clip of a person saying a thing, or a whole interview, and tweet it out as an audio clip. The manual process for doing that is painstaking, takes a lot of effort, and involves many steps. It's worth it for the one clip you really want, but it would be nice if it were automated, so you didn't have to spend all that time on smaller clips whose impact might be bigger than we assume. That was a really interesting project. Another project was Mozra Manager, running order management software. We use it to process TV and radio running orders, the plans for what's going to be in a TV or radio programme, and to extract structured metadata about what was actually in the programme, so that we can chop the content up or help people find the bits of content they might be interested in.
We also did a project around BBC Images, the search tool journalists use to find the appropriate images for an article on the website. We built an image metadata enrichment pipeline to enrich the metadata we hold about images in that system and make it easier for journalists to find what they're looking for. These three projects all use the kinds of approaches I'm going to talk about in this talk.

First of all, I want to explain how our project cycles work. This is how we used to work, and the next slide shows the slightly tweaked version. We used to work in six-week projects: three two-week sprints, followed by an extra two weeks of tweaking things and wrapping up, or continuing to work on something if you hadn't quite finished, or moving on to something else if you had. More recently we introduced a book-ended version of that. There are still the six weeks in the middle, the three two-week sprints, but we now have a week dedicated to research before you get started, which gives you time to speak to journalists, figure out what's going on, and understand the systems before you actually start sprint planning, and then a week at the end to wrap everything up. After that big project is complete, we have a small project cycle, just two weeks, where you can pick up something new or try out an idea that might become a big project in future. This helps us avoid continuously running projects back to back, gives us a way to build up momentum in the first week, and lets us wrap up without adding new features in the final week. That seems to work really well.

Projects tend to start with ideation. There's always something we begin with: a rough idea of working in a particular area, with a particular system, or in a particular problem domain. From there, it's down to the team members to determine what they're going to spend their time doing, what they're trying to achieve, and what they're going to build. We have department-wide objectives for what BBC News is trying to do; reaching underserved audiences, for instance, is a big part of it. We start with those and devise "how might we" statements: how might we achieve a certain goal, how might we find ways to get content to this group of people, that kind of thing. From those statements we come up with loads of ideas, off the top of your head, for what might solve the problem or be worth investigating. We have this concept of "explode and converge". You start really broad, thinking about all sorts of different things, casting the net as far as possible; don't worry about how silly an idea sounds, just get everything down. Then, in a separate session, you converge: these are all quite similar ideas, in the same vein, trying to achieve the same thing, so how might we actually do that? And you converge onto certain ideas from there.
Once we've pinned down what we're trying to achieve in the project cycle, we determine the objectives and write them down. Then we start the project with the research week, which I'll talk about next. After the research week, it's a case of bootstrapping the tech you're going to build on, and doing what we call spikes, which I think are a really helpful use of a bit of time: investing a small amount of time to try something out and see whether a technology is suitable, whether we can do a particular thing at all. Then you come back to the team and say, I've tried this, I spent two days on it, and this is what I think we should do; or, I don't think it's suitable; or, yes, I think we can go ahead with this, and present it back to the rest of the team. At the end of the research week, once the spikes are done, you write out the sprint goals: for the next two weeks, what are we going to try to achieve and focus on? Then it comes down to ticketing and picking up tickets in that sprint.

So, about the research week. The main things to do there: first, identify the stakeholders. Work out whether it's, say, certain types of journalists in local radio you need to be working with, who will use the tool you're going to build, or who are your audience. Set up calls with journalists, start talking to them, build those relationships, and ask questions: what do they have problems with, what are they trying to achieve, what do they know about the thing you're trying to discover? Learn about the existing systems you're working with. We generally don't turn up with a whole new piece of software and say, use this instead of the system you already use. If at all possible, we avoid changing people's workflows: we build on the existing tools, extracting data from them or automating the passing of data from one tool to another, so that journalists can get on with their work without having to learn new things or stop using the systems they're already used to. Also, get access to those systems; sometimes you need to dedicate a bit of time to making sure you have access, getting hold of the data you need, and getting to know the systems.

And set up shadowing, which is what I'll talk about next: sitting down with a journalist or a producer or anyone who works in that kind of environment, watching them do their job, asking questions, just being present in the room while they work. I started this job a couple of months before the pandemic, so I never did any shadowing until very recently, and it has been a real boost to my awareness of what it's really like working on the newsroom floor in a local radio station. Actually going there, turning up, sitting and watching them use the tools, working out their workflows, looking for pain points, inefficiencies, slowness, and manual work that could be automated, and finding ways we could make their lives better and easier.
Sometimes it's a case of: this thing currently takes you an hour, and we could automate it so that it just happens as a by-product of the work you already do, without you dedicating any extra time to it. Sometimes it's a task that simply isn't worth doing manually because it takes so long. If we can provide these things, we can make journalists' lives a lot easier and get more content out to audiences.

Next I'm going to talk about the AWS services we use for rapid prototyping. A lot of what we use is Lambda functions; Step Functions and state machines; the databases AWS provides; S3 for file storage; SNS and SQS, the notification and queueing services; and CloudWatch, AWS's logging tool. I'll go into more detail about each of these.

Lambda is a really cool thing: it's a way to run code without it being on a server as such, which is what we call serverless. A Lambda is just a function that does one particular job and can be triggered asynchronously, separately from anything else, and you're only billed for the compute time you use. Take one of AWS's own examples: every time a photo arrives in S3, the upload triggers a Lambda, the Lambda runs image-resizing code, resizes the uploaded image to various sizes, and sends the results on to wherever they need to be. That tiny task is implemented on its own, rather than living on a server that might sit unused for a month and then suddenly get a thousand requests because somebody uploaded a load of photos. A separate instance of the code runs every single time the function is triggered. You can write Lambdas in Python, in Node, in Go, and in various other languages, so it's really handy for this kind of thing. As I say, you pay for compute time rather than for provisioning servers, and it's very, very cheap to run.
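Here's a minimal sketch of that S3-triggered resize pattern, assuming Pillow is bundled in the deployment package; the output bucket name is made up for illustration:

```python
import io

import boto3
from PIL import Image  # Pillow, assumed to be bundled with the Lambda

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # S3 put events carry the bucket and key of the uploaded object
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Fetch the original and resize it in memory
    obj = s3.get_object(Bucket=bucket, Key=key)
    image = Image.open(io.BytesIO(obj["Body"].read())).convert("RGB")
    image.thumbnail((512, 512))  # resizes in place, preserving aspect ratio

    out = io.BytesIO()
    image.save(out, format="JPEG")

    # Write the resized copy to a separate bucket (name is illustrative)
    s3.put_object(Bucket=f"{bucket}-resized", Key=key, Body=out.getvalue())
```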
Step Functions, or state machines, are essentially a workflow system for Lambdas in AWS. You can design a workflow where this function gets called, passes its data on to that function, which calls another, sets off a parallel job, and so on. You design how a procedure should be structured, whether it's a data-processing pipeline or just a series of jobs being done to something. That means each Lambda can do one small thing and pass on to the next, which has its own job and can do it in its own way, rather than one monolithic system that does everything. All the little environments are isolated; they could even be implemented in different languages, with different requirements and dependencies.

You can also define how the workflow should do retries, how it should handle failures, and where jobs that don't depend on each other should be parallelised; all of that can be done in the designer.

Here's an example of where we used one in the running order project. You execute the state machine with some initial data, in this case the ID of the running order. It goes and fetches that running order file, extracts some data from it, passes the data on to the next step in the diagram, and then there's a parallel path where it goes off and does two separate things. The whole state machine has success and failure criteria, so you get a result saying whether the execution succeeded or failed. If it fails at any point, you can see in the diagram exactly where it failed, click on that step, and see the exception that caused the failure. You get really easy access to the data that was passed in and the exception information, and you can click through to the logs for each specific Lambda to see exactly what it emitted.

It works something like this: you define a Lambda handler function that takes an event, which is the data being passed in. Essentially, you take the piece of the data you were given, do something with it, push the new data onto the event, and pass it on, so the next Lambda receives a new, bigger version of the event.

We use a library called Pydantic to do the data parsing. It's ideal for us because you can validate the Lambda's input data and also its configuration, the environment variables that define how the thing should behave. Pydantic works by defining models using type hints, which is really easy: you just say the input event takes a file ID, which is a string; this field, which is a datetime; this one, which is a timedelta. It even does the parsing of those types for you: the event comes in as JSON, and Pydantic takes each JSON string and turns it into a datetime or a timedelta object, or a list of strings, or whatever you declared, so you know exactly what shape the data is. You can also define optional fields and nested structures. All you do is pass in the keys and values from the incoming event dictionary, and then you access the fields with dot notation rather than dictionary notation, and you know you've got the right data to pass on to the next stage.

Pydantic also handles settings, meaning configuration in environment variables, which is the usual way to provide settings that differ between environments in Lambda, test and live for instance. In this example there's a prefix, MARS_, from the name of the project: any environment variable starting MARS_ gets loaded into the settings object without the prefix, so the environment variable MARS_CERT_FILE_PATH becomes cert_file_path. Running locally, those values might come from a .env file; in the Lambda context they come from the environment variables themselves. I also made a little shortcut, a cert property, which makes a tuple out of two of those values and returns it as the certificate. Then, if I need to send a request somewhere I have to authenticate with a certificate provided within the Lambda, I just use settings.cert, which looks those values up from the environment. It also validates that you've got everything you need to run the code, which is really handy.
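Here's a minimal sketch of that pattern in Pydantic v1 style (where BaseSettings still lives in pydantic itself). The MARS_ prefix and the cert property follow the examples above; the field names, MARS_API_URL, and the request itself are illustrative:

```python
from datetime import datetime, timedelta
from typing import Optional

import requests
from pydantic import BaseModel, BaseSettings


class Settings(BaseSettings):
    """Loaded from environment variables prefixed MARS_ (or a local .env file)."""

    cert_file_path: str  # MARS_CERT_FILE_PATH
    key_file_path: str   # MARS_KEY_FILE_PATH
    api_url: str         # MARS_API_URL (illustrative)

    class Config:
        env_prefix = "MARS_"
        env_file = ".env"

    @property
    def cert(self) -> tuple:
        # requests accepts a (cert, key) tuple for client certificates
        return (self.cert_file_path, self.key_file_path)


class InputEvent(BaseModel):
    """The shape of the data this Lambda expects to be passed."""

    file_id: str
    start_time: datetime     # parsed from an ISO 8601 string in the JSON
    duration: timedelta      # parsed from seconds or an ISO 8601 duration
    tags: Optional[list[str]] = None


settings = Settings()  # validates the configuration at cold start


def lambda_handler(event: dict, context) -> dict:
    data = InputEvent(**event)  # raises a ValidationError if anything is missing

    # Do this Lambda's one small job, authenticating with the client cert
    response = requests.get(
        f"{settings.api_url}/files/{data.file_id}", cert=settings.cert
    )
    response.raise_for_status()

    # Push the new data onto the event and pass it on to the next step
    return {**event, "file_metadata": response.json()}
```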
Now, the AWS databases we've used in various projects. We did a lot with DynamoDB for a while and kind of hit its edges; without going into too much detail, it felt like we were lacking something. We tried Timestream for one project, and it worked quite well for what we used it for, but it wasn't very extensible for larger use. Really, I had always wanted us to be using SQL databases like Postgres, but that sat awkwardly with these short-lived prototypes: we don't want to spend time managing infrastructure, dealing with instances, and having to shut things down at the end of projects. Something like DynamoDB just works out of the box, serverless, and you don't get billed for instances left running, which is a really handy way to work.

Then we found there's a serverless option for RDS, the managed SQL database service. You can run Amazon Aurora Postgres in serverless mode: you specify the capacity range and scaling configuration and enable the web service Data API, which means you can access the database through the boto3 library, and there are various bridges available for the standard ways of working. All you do is add the database as a resource in your CloudFormation template, the infrastructure definition for your Lambda: I'd like a database, please, here are its capacity settings, and you've got a whole SQL database to play with. It will even drop down to zero capacity units if you tick the right boxes, so once the project is over and nothing is running any more, you're not billed for any usage whatsoever.

You get access through the console, but you can also access it via boto3, or preferably via the aurora-data-api package from PyPI, or, if you're using SQLAlchemy, there's an SQLAlchemy bridge for that as well. You connect using AWS Secrets Manager, so you don't have to pass credentials around, or open up ports, security groups and VPCs, which are exactly the kinds of things we wanted to avoid in these short-lived projects.
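At the lowest level, the Data API looks something like this through boto3; both ARNs, the database and the table are placeholders:

```python
import boto3

client = boto3.client("rds-data")

# Both ARNs are placeholders: the Aurora cluster and its Secrets Manager secret
response = client.execute_statement(
    resourceArn="arn:aws:rds:eu-west-1:123456789012:cluster:mars-db",
    secretArn="arn:aws:secretsmanager:eu-west-1:123456789012:secret:mars-db-creds",
    database="mars",
    sql="SELECT pid, title FROM episode WHERE pid = :pid",
    parameters=[{"name": "pid", "value": {"stringValue": "p0123456"}}],
)

for row in response["records"]:
    # Each column comes back as a typed dict, e.g. {"stringValue": "..."}
    print(row)
```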
We also have something called the News Labs app portal, which is really handy. It's essentially one long-lived project, an EC2 web server that hosts static files you put into S3, and it means every project can reuse that infrastructure. You can just throw some HTML in there to create a static site, and some of the rest of the team build React apps and deploy them to the portal just as easily. Without creating any new infrastructure, you dump some files in and you've got a web presence for your project and its data. And it's not public: it sits behind either BBC Login, a two-factor authentication login system for BBC staff, or a client certificate on your machine, which is really handy.

It means we can do things like this. We use Chameleon, a Python template library, as a static site generator. We devise a content structure for the website: say, a list of all the programmes on the first page, and within each of those, a list of the episodes whose running orders we've processed. We create Chameleon templates for each page type (the episode page, the home page, the brands page) and a logic layer for retrieving the data, whether it comes from the database or from the running orders, which defines how to pass the data into each template and write the page out. Then we create a Lambda that applies that logic at the end of the state machine: once all the data has been processed and written to the database, write out the web pages. So, I've processed a new episode: write out the new episode page, update the brand page to include a link to the new episode, and update the home page because it has all the stats on it, or whatever the logic needs to be. We usually create a command-line version too, for when you need to rebuild the whole website; there's a sketch of this below. It seems to be a really good way of working, because there's no Python web framework running that could go down. If a Lambda write fails and the state machine fails, you've got one page that needs fixing and rewriting, but everything else is still up, because it's just static HTML.

We also use structlog, for structured logging. It looks really great when you're running locally: you see all the relevant information, coloured and nicely structured, and you can tell what's going on in your programs, which really encourages good logging practice. It also supports JSON logging, which is ideal when the code runs in AWS, because you can access, and even search, the structured logs in CloudWatch. We have a bit of configuration that says: if JSON logging is enabled, add these processors and configure structlog with a JSON renderer. The first example on the slide is what you get running one of our Lambdas locally: the info message plus the structured bits, the pieces of data I chose to include on that line. When it comes out as JSON, you get a collapsible entry in CloudWatch instead.
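Going back to the page generation for a moment, here's a minimal sketch of the idea using Chameleon's PageTemplateLoader and boto3; the template directory, bucket name and page logic are all made up:

```python
import boto3
from chameleon import PageTemplateLoader

# Directory of .pt templates, one per page type (path is illustrative)
templates = PageTemplateLoader("templates")
s3 = boto3.client("s3")


def render_episode_page(episode: dict) -> str:
    # The logic layer decides which data each page type needs
    return templates["episode.pt"](episode=episode)


def publish_page(path: str, html: str) -> None:
    # The app portal serves whatever lands in this bucket as static HTML
    s3.put_object(
        Bucket="newslabs-app-portal",  # illustrative bucket name
        Key=path,
        Body=html.encode("utf-8"),
        ContentType="text/html",
    )


def lambda_handler(event: dict, context) -> dict:
    # Final state machine step: write out the page for the processed episode
    episode = event["episode"]
    publish_page(f"episodes/{episode['pid']}.html", render_episode_page(episode))
    return event
```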
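And here's roughly what that logging configuration looks like; the JSON_LOGS toggle and the PIDs are illustrative:

```python
import os

import structlog

# Shared processors add the log level and an ISO timestamp to every entry
processors = [
    structlog.processors.add_log_level,
    structlog.processors.TimeStamper(fmt="iso"),
]

if os.environ.get("JSON_LOGS"):  # set in the Lambda environment, not locally
    processors.append(structlog.processors.JSONRenderer())
else:
    processors.append(structlog.dev.ConsoleRenderer())  # coloured local output

structlog.configure(processors=processors)

log = structlog.get_logger()
# The keyword arguments become searchable fields in CloudWatch
log.info("episode_processed", brand_pid="b0000001", episode_pid="p0000001")
```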
You can even search for, say, all the logs where the brand PID was a particular value, and see every log entry that includes it.

I talked about Pydantic settings earlier; we use them for connecting to our database too. Here, instead of MARS_ the prefix is MARS_DB_, and the variables are MARS_DB_ARN, the secret ARN pointing at the Secrets Manager entry holding the database credentials, and the database name. Using that gives you one extra settings object that looks for exactly what you need when you talk to the database. Then we use SQLAlchemy, which lets us define what a table structure looks like; it's quite similar to Pydantic in that you define all the columns, their types, and the relations. You connect to your database engine using the Postgres Aurora Data API bridge, and then you can do things like query for an episode, selecting from the table where the episode PID is the given PID, execute the query, and get the row back really easily; there's a sketch of this below.

There's also something we've been looking at recently: you can give a Lambda a function URL, a dedicated HTTPS endpoint for your function. That means you can build a RESTful API in this serverless context without spinning up servers or having a dedicated machine running your API: you define it in a Lambda, and with that endpoint you can just throw data at the API and get your data back. FastAPI is probably the easiest way to get this done. It's also built on Pydantic, which makes it easy to define the inputs and outputs using Pydantic models; it's self-documenting, it's got all the bells and whistles, and it's really, really good. So with all the tools we've been talking about, you get, practically for free, a serverless API for your serverless database, and you haven't had to manage any resources or instances, which is really handy for the kind of rapid prototyping we do. And you can easily add authentication to that as well.
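Here's a sketch of the settings-plus-SQLAlchemy combination, again in Pydantic v1 style; the table is cut down to two columns, and the dialect string comes from the sqlalchemy-aurora-data-api package:

```python
from typing import Optional

from pydantic import BaseSettings
from sqlalchemy import Column, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base


class DBSettings(BaseSettings):
    """Loaded from MARS_DB_ARN, MARS_DB_SECRET_ARN and MARS_DB_NAME."""

    arn: str
    secret_arn: str
    name: str

    class Config:
        env_prefix = "MARS_DB_"


Base = declarative_base()


class Episode(Base):
    __tablename__ = "episode"

    pid = Column(String, primary_key=True)  # cut down to two columns for the sketch
    title = Column(String)


settings = DBSettings()

# The sqlalchemy-aurora-data-api package registers this dialect; credentials
# come from Secrets Manager, so no password is passed around
engine = create_engine(
    f"postgresql+auroradataapi://:@/{settings.name}",
    connect_args={
        "aurora_cluster_arn": settings.arn,
        "secret_arn": settings.secret_arn,
    },
)


def get_episode(pid: str) -> Optional[Episode]:
    with Session(engine) as session:
        return session.execute(
            select(Episode).where(Episode.pid == pid)
        ).scalar_one_or_none()
```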
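And here's a minimal FastAPI app of the kind you could put behind a function URL. Serving an ASGI app from a Lambda needs an adapter such as Mangum; the imported module and the model are illustrative:

```python
from fastapi import FastAPI, HTTPException
from mangum import Mangum  # adapter that lets a Lambda serve an ASGI app
from pydantic import BaseModel

# Query helper from the sketch above (module name is illustrative)
from mars_db import get_episode

app = FastAPI(title="Episode API")


class EpisodeOut(BaseModel):
    pid: str
    title: str


@app.get("/episodes/{pid}", response_model=EpisodeOut)
def read_episode(pid: str) -> EpisodeOut:
    episode = get_episode(pid)
    if episode is None:
        raise HTTPException(status_code=404, detail="Episode not found")
    return EpisodeOut(pid=episode.pid, title=episode.title)


# The Lambda handler: the function URL forwards HTTP requests to the app
lambda_handler = Mangum(app)
```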
And just to finish off, the learnings from these projects. We work fast in these short project cycles, trying to achieve something in a small amount of time without getting bogged down in the detail of the infrastructure. You get the chance to try new things every project cycle and carry those learnings on to the next project. There have been times when we took one approach that seemed like the right way to do it, went through some pain points that slowed us down, finished the project, moved on to the next one, and realised that actually it would have been better to use SQLAlchemy instead of the other thing; that kind of learning lets us hit the ground running and get going really quickly on the next project.

Use spikes to try out ideas: spend a day or two researching something, then present a one-pager to the rest of the team explaining your thought process and what you learned; it's really handy. No project is perfect, so we're not striving for perfection in any of these; we just iterate every single project cycle. We don't tend to have hard and fast rules: we determine what good practice is and keep improving on it, and we don't get bogged down in "this is the only framework we're allowed to use" or anything like that. We do lots of knowledge sharing: if somebody in the team learns something, whether from a two-day spike or from a whole project, that they think lots of people would benefit from knowing, they share what they learned with the rest of the team, which is really useful. And we prioritise for delivery, making sure you know what you're aiming for and what you're trying to achieve.

The final and most important thing I'll mention is making the best use of the research week and the wrap-up week. Don't get started too early if you haven't spoken to your end users or the people you're going to be working with; really draw out those use cases. And use the wrap-up week wisely, so that you actually finish on time and don't feel the need for an extra two weeks. That's all from me, and I think we've got time for questions.

Well, that was a fantastic talk, Ben. I definitely learned a lot. However, we don't have time for questions, so if anyone wants to ask questions, please do so in the Liffey board room or on the Zoom call. So yeah, cool. Thank you. Thank you, Ben.