All right, welcome. So in preparing this session, I wasn't quite sure what to actually talk about. There were a lot of proposals grouped in this topic area that are all very specific to particular APIs, and I think most of them have found a home in breakout sessions already — things like feedback sessions about what you want to get out of the search API — so I don't think we need to talk about those here. Instead, I would propose to talk about the high-level picture and direction of where we are going with APIs. I'm open to suggestions, though; I don't have a lot of material myself to present. So if you have specific things you would like to talk about, please speak up.

Let me give you two slides of introduction for the high-level questions that I think would be interesting to discuss, and then we can launch into the discussion. This is a short overview, as I see it, of where we are. We have a lot of new functionality being built on APIs. Request volumes are going up, and latency is important because a lot of latency-sensitive services depend on these APIs. The apps, and increasingly the web, are using APIs, and there are a lot of third-party consumers as well, whom we want to encourage to use these APIs because we want them to build on top of all the content that we have. So we need to support this growing volume, ideally at low latency, and we need to figure out how.

The two main APIs that we have right now are the Action API, which is the traditional API and serves most requests — it's very powerful, started out as an edit interface, and has a lot of additional functionality on top of that — and the fairly new REST API, which doesn't have that much functionality and is mostly focused on content access at this point. And now we also have some smaller service APIs, like ORES, that are standalone: basically one API per service, not tied into one overarching API.

There are some differences between the main APIs, with regard to caching in particular. The Action API is mostly based on query parameters, which aren't necessarily always written the same way, so it's very difficult to cache and purge. As a result, there's mostly no caching of API responses; it's very much an RPC model. On the other hand, REST normally has deterministic URLs, so the default is caching — Varnish caching — which helps with latency and scaling. Documentation and testing are orthogonal to that, but there are two different choices in the current implementations: one is using Swagger specs, versus a custom documentation system that has additional features Swagger does not have, like localization. And there are some implementation differences. Per-request CPU overhead has differed quite a lot. Those differences are not necessarily inherent in the platforms that are used — there are a lot of factors that play into that — but they are quite significant: if you look at something like 35 milliseconds versus roughly half a millisecond per request, that's not nothing. And there's a difference in the processing model — a worker per request by default versus asynchronous event loops — which is relevant for proxy use cases especially.

So here are some of the big high-level questions that I think we could tackle in this session. The first is REST versus RPC: which direction are we moving in? Are we, longer term, moving to a REST model so that we can cache most responses? What kind of versioning do we use for the API? There are currently two different approaches.
In the Action API, there's a format parameter where you can select the format, especially the JSON format, to use for the response. In the REST API, there are several parts: there's a major version in the path, and each endpoint has a stability marker. Endpoints start out as experimental and then gradually move towards stable, and the promise is that if a stable endpoint is changed in a breaking way, the major version has to be incremented — so you can basically rely on a stable endpoint continuing to work. And there are content-type headers with spec versions in them, so if there are non-breaking, minor changes in the content that is returned, you can find that out from the content type.

And finally — well, not finally, but — the number of APIs: that is the discussion we have had several times now. Basically, new services like ORES can expose a separate API, and we can keep doing that for each new service. The other extreme would be to have one unified API that binds them all together, has one way to discover all this functionality, and has unified monitoring and all these things — or something in between, where we find some compromise. And finally, if we manage to establish agreement on where we're moving, how do we get there? What are the small steps towards those things, without, obviously, rewriting everything from scratch? So these are my proposals. We should maybe switch to the etherpad.

So I think that one API is probably not a very good solution. The REST API is really good at giving you information about one thing — one page, or one image, or one whatever it is — whereas the Action API is optimized for giving you a piece of information about many pages. If you have an arbitrary set of 500 pages, it's really hard to query that in a REST API and have it be done in any sane way, and it would probably blow up your caching. Whereas that's exactly the use case that a lot of our bots and user scripts on wikis tend to have. So I think that we can pretty easily have both APIs serving these different use cases.

There are definitely different trade-offs around batching, which provides very rich functionality, especially where you can generate one list and then apply an operation to it in one request. On the other hand, the REST model is that you do a lot of very small but cheap requests, so it basically puts the burden on the client to do a lot of this iterative work. Actually, I know the mobile apps want to do very few requests, and so some of the stuff they're looking at is making a REST API that gives them all of the information about one page that they need to do their whole page display. Yeah, it's just a composition of other requests again. That's a good use case for a REST API, because you're accessing the information about one thing and you want it heavily cached. That wouldn't fit all that well in the Action API unless we look into something like Varnish xkey, so that we could have an endpoint that actually was easily purgeable. Even in the REST API there are also POST endpoints that let you do things that are not cacheable at all, like converting your custom wikitext to HTML, or your modified HTML back to wikitext. So it's not completely mutually exclusive in a way — it's REST too, it's a POST endpoint. But I guess what I'm most interested in is: should most of the content — all the high-traffic entry points that people hit all the time for standard tasks — should those ideally be cacheable?
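To make the two request shapes under discussion concrete before the conversation continues, here is a minimal sketch. The endpoint paths and parameters follow the publicly documented Wikimedia APIs, but the specific titles, properties, and printed fields are illustrative assumptions, not a recommendation.

```python
# Hedged, illustrative sketch of the two request styles discussed above.
import requests

session = requests.Session()

# Action API (RPC style): everything hangs off one entry point, api.php,
# and the query string selects the operation. Parameter order is free-form,
# which is part of why responses are hard to cache and purge.
action = session.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "info",
        "titles": "Dog|Cat",   # batching: several pages in one request
        "format": "json",      # response format chosen per request
    },
)
print(action.json()["query"]["pages"].keys())

# REST API: one deterministic URL per resource, so Varnish can cache it
# and purge it by URL when the page changes.
rest = session.get("https://en.wikipedia.org/api/rest_v1/page/html/Dog")
print(rest.status_code, len(rest.text))

# The content type carries a spec version, so a client can detect
# non-breaking, minor changes in the returned format.
print(rest.headers.get("content-type"))
```

The same contrast shows the batching trade-off: the first call fetches several titles at once, while the REST model would issue one small, cacheable request per page.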
It depends on what the people are doing. Yeah, right. It's basically about whether we can push it to all the cases where that is a suitable pattern. Because I think, by volume, all the high-volume ones are fairly straightforward "give me this information" requests, while the more complex interaction patterns — those are bots. I mean, they also have high volume, but I think that volume is not necessarily going to go that far up. Some bots really do make a lot of requests, but they do things that wouldn't really benefit from caching, so caching isn't such a concern. I don't know that high volume is such a useful measure, as opposed to whether something is performance sensitive and makes the same requests over and over again, where caching is important, or not. Yeah, it depends on what you build on top of it. If your main web front end is hitting this API, then caching is pretty important. Yeah.

Brian — I think you guys actually mostly covered the things that I came up here to say. I do think that the REST versus RPC question is maybe a false dichotomy: there are many and varied and complex use cases for programmatic access to the content held inside MediaWiki. Things that are more content-export oriented typically align with a REST API pattern, something where there's a noun that you're asking for things about. REST is typically "tell me about this dog, tell me about this cat, tell me about this pig" sort of based. And the Action API is a bit more — it's not the most beautifully designed API that anybody's ever had in the world, but it's much more cursor oriented: I want to know this aspect about a varied set of content, and I probably don't know exactly which content I want, but I have some kind of search or discovery mechanism, some sort of query that I'm going to present to you that says, come back and tell me about all the things that happened related to these things. So I guess what I'm trying to say is that there are good uses for both patterns, and it really depends on the particular consumer you're trying to satisfy which type gets built.

I guess, as I said earlier, the question is more: for endpoints that are of the type "tell me about the cat or the dog", should we use REST for those, or should we use something that cannot be cached, for everything? That's more the trade-off I'm getting at. I think you're going to have a hard time finding anybody who says that more caching would be bad, as long as we can find reasonably performant ways to invalidate said caches, right? One of the three hardest problems in the history of the planet.

One thing I'd like to be clear about: when you say REST versus RPC, are you referring to RESTful APIs versus RPC APIs, or are you talking specifically about the Action API's paradigm versus the...? Mostly talking about basically everything being a POST, or some similarly opaque and difficult-to-cache request, where you can't hook into standard HTTP cache semantics but maybe still get to do some caching at the application layer further down, which is a lot more complex. I think if you look at the Action API's surface area, it feels much more high level than the RESTful API, but you could implement that on top of the RESTful API. So you could have essentially a client library for the RESTful API that looks and feels almost verbatim like the Action API does, and that would essentially give you both options with a single carrier mechanism. Yeah, I agree, especially where we move more work to the client.
If you provide that code out of the box in a toolkit, so that it iterates over things for you — and there are already toolkits like Pywikibot and so on that do some of this iterative stuff — then I agree, you can get a similar feel in the end. Well, you could get a similar feel, but I think that you would run into more performance problems trying to emulate all the things that the Action API does using our REST API. Yeah, I agree, there are definitely cases. Which goes back to the point that both APIs are good, and I don't think that we need to try to say we only need one. Yeah, okay, but I think that they serve different purposes, for sure. I mean, the two APIs that we have right now serve different purposes, and they're both very legit.

Honestly, we had to do very simple interactions with MediaWiki, and the Action API seemed like overkill in some cases, and, I mean, when I took a look at it, some things looked counter-intuitive for the casual user that I was. So I think that for a casual user — somebody who just wants to build something on top of our data — the RESTful API could be much better if you want to do some quick prototyping, and that's what RESTful is good for, right? It's simple. There are some things, anyway, that REST, because it's built properly on top of HTTP, cannot do. For example, it's supposed to be stateless, meaning you can't properly keep state between requests, so one of the things that, historically, people implementing pure REST APIs have had problems with is transactions. I'm not sure that we do transactions right now in the Action API, but we could, in theory, just allow people to do transactions across requests, while it's, let's say, philosophically wrong for a REST API to do transactions.

We are now moving a lot towards using REST APIs in general between our internal services, and that's going to present some limitations in the long run, maybe. I mean, I've seen quite a few projects moving away from REST for their public interface — some data stores are trying to use, for example, gRPC, which is a new standard for RPC interactions that Google published as open source, and which basically implements on top of HTTP/2 a lot of things that are usually kind of hard to do with REST: making promises, and managing timeouts explicitly from the client side and the server side, which is one of the problems we were thinking of tackling — I think there was a ticket written by you about exactly this. So, I think for the public, the REST API is important for people that just want to consume data quickly from MediaWiki, or that just want to build, I don't know, simple clients, especially for reading; I think it's great. I think that the Action API probably has a big value for people doing more complex things that need richer interactions directly. And I'm not worried about caching for people doing complex things, because complex things are going to be hard to cache, or maybe even, let's say, painful to cache, because — I mean, I don't expect many people to ask for the same set of 500 articles at the same time, right?
So it's probably going to be harder to cache those, and that's why the REST API, which goes just by entity — by single entity — is easier to cache, and that makes sense in that case. But I would say, okay, this is about how we interact with the public right now, how people outside of our cluster interact with our content, and I agree that we need to keep both APIs. I would like us to think a little bit about whether we want to keep moving with REST interfaces for everything within our cluster, so between the various services. I think we should consider whether that fulfills all of our needs completely, or if we have to think of moving, let's say, to an evolution of REST in some ways internally. And that's part of the problem: I mean, probably for simple things like, I don't know, Mathoid, which does just one simple thing, REST is good enough. But if we start to move more things out of the main blob of MediaWiki — bigger things — or, let's make a stupid example, right, image conversion: you might want to have MediaWiki call an external service to do an image conversion or transformation, and you want to come back later to fetch results, and you want to be able to define in a clear way what happens when a request times out or is not fulfilled within the time in which you want to serve the result to the client. In those cases, if we keep REST, it's probably going to be a limitation in the long run.

Yeah, I think there are lots of solutions out there; large companies often have their own frameworks. Google has theirs; some are using REST actually, some are using various frameworks. But it's a big step — I think that's the main issue with moving to a different framework internally. It has a high cost as well, so there have to be compelling advantages. But yeah, I agree REST is not the most efficient or technically best thing ever; it's just ubiquitous and, yeah, easy to understand for anybody.

So one of the things I'd like to suggest is that we — I mean, we can keep going on the REST versus RPC conversation. One thing that I would like to do as well is, A, just provide a reminder of what this area as a whole is meant to be: it's about getting access to our content, getting data in and out of the system, regardless of what format it's in — we've got the content format area to talk about that. This area is about the infrastructure around the content, and the APIs are central to that. So I wanted to provide that reminder, and then, as a follow-up to that, I would like to ask: there was a session earlier that Dan and Andrew were leading, and there were a lot of people in that session. Could one of you — would one of you mind, to put you on the spot here — just give like a 60-second overview of what you all discussed in that?

Sure. So we generally talked about this general concept of data flows — collecting, processing, and serving data — and how that is done by different teams: search, fundraising, research, analytics; the different requirements from all of those different perspectives; and the efforts on Event Bus. Mostly the consensus was that pretty much everyone thought the problem was similar enough that we wanted to work on it together, and similar enough that Event Bus could solve all the different use cases, but that we needed to collaborate and start working on it together. So, do I need to clarify any of that for anyone?
Because I used words that I don't know if everyone knows — Event Bus or things like that. Raise hands, I don't know. It's a REST API for Kafka, essentially. Actually, maybe just a further clarification on Event Bus, because I know that — so, yeah. I'm just asking for further clarification on Event Bus because I know that we've had at least some mailing list discussion in the past about what Event Bus is and how relevant it is to this area. So, yeah, maybe just a little bit more.

Sure. So the general idea is that you want to structure all the data that's flowing in a similar paradigm, so that everyone can understand what to do as a consumer — as someone who's interested in different streams, or who wants to join different streams, or whatever. So the main idea is that all these data pieces flowing through this system should have a schema, so that consumers can understand them, and so that we can go back in time, like two or three years, and be able to consume data that was produced then, when we might not remember what the code looked like. So factoring out this metadata in the form of a schema is one of the primary things. And then, from that point, really, Event Bus is just making it easy to produce events in those schemas to a stream, and is not so much worried about the consumption part, because that can get tricky — but just generally having the streams there so people can join them or do whatever they need with them, and standardizing how all of that works. And one example that might be interesting: the search team wants to integrate things like page rank or page view statistics into how they rank results. Having a stream of that kind of data is useful for them, and knowing the structure of that data is accomplished by the schema. That's a concrete example. Any other session summaries?

So I was curious about the number of APIs. We talked about two APIs having good reasons to exist, but we have recently created a couple of new services, and there are separate APIs for those. What is your take on that? I mean, the trade-offs are possibly user friendliness and discoverability versus independence in how the APIs are designed.

So now — oh, there it is. I wanted to reply to some of the earlier comments. One thing that was brought up is being able to do batch requests in the Action API; it's a very important use case for a lot of bots and tools that need data. But one thing that does maybe reduce the need for that a little bit — and I'm not sure to what extent we can expect this to be ubiquitous in tools — is that HTTP/2, and SPDY by extension, do make it a lot easier to have parallel requests, or other ways to batch more data. And the other thing is, what we're seeing now in RESTBase as well is that it can compose data objects without needing to fetch them again, right? Just because you make a batch request to RESTBase doesn't mean RESTBase needs to do a batch request to the back end; it can fetch the individual objects that it already has and use them. But it does bring up the question of how we want to do caching. So right now, if I understand correctly, we don't really utilize Varnish for RESTBase caching. I don't know if it completely bypasses it, but at least it's not like the main purpose is to be cached there, because we want to cache it in Cassandra instead, if I understand correctly. No, it's mostly relying on Varnish. There are some issues still with logged-in requests, so there's a task to fix that. But all the high-traffic stuff basically relies on Varnish.
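Coming back briefly to the Event Bus summary a few turns above: a rough sketch of what a schema'd event and its production could look like. The endpoint URL, stream name, and field names here are hypothetical illustrations of the idea, not the production interface.

```python
# Hedged sketch of the Event Bus idea: every event carries a reference to the
# schema that describes it, so a consumer years later can still interpret it.
# The endpoint, topic, and fields below are hypothetical.
import json
import datetime
import uuid
import requests

event = {
    "meta": {
        "topic": "mediawiki.revision-create",   # hypothetical stream name
        "schema_uri": "revision-create/1",       # points at the describing schema
        "id": str(uuid.uuid4()),
        "dt": datetime.datetime.utcnow().isoformat() + "Z",
        "domain": "en.wikipedia.org",
    },
    # The payload itself is expected to validate against the referenced schema.
    "page_title": "Dog",
    "rev_id": 123456789,
    "performer": "ExampleUser",
}

# Producing is just an HTTP POST of one or more events; consumers read the
# stream (e.g. out of Kafka) and use the schema reference to decode events,
# including ones produced long ago.
resp = requests.post(
    "http://eventbus.example.internal/v1/events",   # hypothetical endpoint
    data=json.dumps([event]),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code)
```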
Okay, that's interesting. So yeah, what I wanted to bring up is basically using RESTBase as a composer versus RESTBase as a proxy. There are various services that we now bundle into one process, which run essentially as one HTTP service. But there are also services that RESTBase both does and doesn't include, in that they are exposed there but actually run as a separate process, or as a separate service entirely, and are proxied through it — so they're still exposed through it. And I think that's an interesting model that we can explore more. But it does bring up what we would do with storage: do we want the individual services to do their own storage, or do we want Cassandra behind RESTBase to do the storage for long-term things? Separating those concerns is a good thing.

Yeah, I guess it's a little off-topic for the API part only, but it's definitely important for the system design. I think, one, it's a lot simpler to scale and manage stateless services, so there's a strong incentive to keep state out of services; some of that was the idea with RESTBase originally. I think there are limits to that: we shouldn't use one storage layer for everything, and there definitely shouldn't be direct access from one service to another service's storage, so nothing can change things behind its back. There should be clear ownership. Yeah. And I think right now most of the use cases that RESTBase serves are basically cache-like use cases. In some cases it's authoritative, because it's bound to a user session, for example. And we can consider using it for actual storage — for actual authoritative content — in the future, but that's not what it's used for right now.

Yeah, but there's also AQS, for example: the Analytics Query Service has its own Cassandra cluster, its own storage. Which one, sorry? The Analytics Query Service, for pageview data, has its own Cassandra cluster, and they write to it from Hadoop and are independent in their management, and that has worked very well. So RESTBase only proxies for it, but the data isn't stored inside RESTBase's Cassandra cluster? Yes, yes. Okay, I'm sorry. So yeah, there's kind of this dual role of aggregating APIs and presenting one unified API to the outside that RESTBase currently fills, but that can also be implemented for really high-traffic endpoints in something like Varnish: we can tee off this traffic directly to the back end without it ever reaching RESTBase. It's more about what model we want to expose to the developer in terms of documentation: should it be one path in one general API, or a separate domain, for example, or a completely separate branch for a different API?

I just wanted to make a quick point. We worked on the Analytics Query Service, and one of the things I liked about that experience was the modularity between the storage and the API level; I think that's really, really important. We did run into problems with Cassandra, both computing data to stick into it — because it needs everything pre-computed — and serving data out of it while it's being loaded, and lots of things like that. We solved some of it, we're still working on some of it, but we love the flexibility that we can just say it's easy, from RESTBase, to implement, say, a Druid module or a PipelineDB module and query that instead.
I think whatever kind of APIs and things we develop, we should always think about that modularity, because you never really know until you hit scaling issues or query issues or whatever. I have a question regarding performance: did you benchmark reading from the RESTBase API versus reading directly from Hadoop? I mean, those could be comparable in theory, but it would be interesting to see how they compare. Dan, that seems to be analytics-specific. I did not benchmark Hadoop. You don't really want to query Hadoop, so we didn't even try. It's not a system that you want to be responsible for keeping online with the level of reliability that an API would need. And yeah, latency — it's got to start up the VM and do all this stuff every time you ask it anything. Also, we only have one Hadoop cluster; that's important too, yeah. I think a separate storage layer and a separate presentation layer for the data make sense from a lot of different points of view, both for the data model and — I mean, there are databases very specialized for time series, for graphs, or whatever, that you definitely want to use. You don't want to just go against Hadoop.

Just a comment on API design. One thing I've noticed that's a difference between the RESTBase approach and the MediaWiki Action API approach: having more granularity in the endpoints, and more restrictions on what you can request individually, allows you to have more robust documentation of the response type. So you actually have a schema that you can say is guaranteed for a specific version of an endpoint. That's kind of orthogonal to REST versus Action, but the Action API, in its flexibility in allowing you to compose different data objects in the same response, makes it exponentially more difficult to say: this query that you just made up has this schema and has these properties that are guaranteed. Whereas RESTBase can say very explicitly: this is what you're going to get; this field is optional; this field is an integer; and things like that. And the ability to granularly, discretely change that schema over time and enforce backwards compatibility — it's nice that RESTBase has considered that more important from the get-go. And I'm not saying that's something we couldn't explore in the Action API; I think it's just one thing that, in my experience, is nicer to work with as a developer.

Still no comments on the number of APIs? I was on a walk talking about APIs before I walked in the room, so could you maybe restate that question? So, we have a lot of new services, and some of them have started to expose their own API, like ORES right now. You could follow that model, with each new service potentially having its own API, on a different domain or a different sub-branch under slash api slash service name. The other extreme would be having one API that assembles them all and has one unified form of documentation. And then there's obviously the current state, which is somewhere in between, where we have two APIs. So I'm mostly wondering: should we push services, or not, to integrate into one larger API, or two larger APIs, or should we encourage separate APIs, basically? Oh, OK. So when you say one API, you mean one domain that has a bunch of specific endpoints underneath it that are well described? Yeah, something where you have unified documentation so you can discover what is available. You can browse it.
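As a rough illustration of the per-endpoint schema point made a moment ago, here is a hedged sketch of what a published response contract and a client-side check against it could look like. The summary endpoint is a real public one, but the schema below is a simplified illustration, not the actual published spec.

```python
# Hedged sketch: a narrow, versioned endpoint can publish a response schema
# that clients validate against. The schema here is a simplified stand-in.
import requests
from jsonschema import validate  # pip install jsonschema

# What a documented response contract for a summary-style endpoint might look
# like: required fields, their types, and an optional field.
SUMMARY_SCHEMA_V1 = {
    "type": "object",
    "required": ["title", "extract"],
    "properties": {
        "title": {"type": "string"},
        "extract": {"type": "string"},
        "thumbnail": {               # optional field
            "type": "object",
            "properties": {
                "source": {"type": "string"},
                "width": {"type": "integer"},
                "height": {"type": "integer"},
            },
        },
    },
}

resp = requests.get("https://en.wikipedia.org/api/rest_v1/page/summary/Dog")
validate(instance=resp.json(), schema=SUMMARY_SCHEMA_V1)
print("response matches the documented v1 schema")
```

A contract like this is what makes it possible to promise backwards compatibility per endpoint version, which is harder to state for an arbitrary composed Action API query.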
And of course, that unified documentation could also be built on top of multiple separate APIs. But I guess the question is how uniform those would be. So those are the trade-offs that I see. But I'm going to weigh in — well, so I have a sort of different question, but I'll save my question for a second. Yeah.

Just to speak on the multiple-APIs point, I know one of the things we ran into, especially when I first joined and even up to now — oh, I'm Corey, and I'm on the mobile apps team. But yeah, the discoverability of the APIs: so many people have worked on so many things, and they're so disparate. And like I said, multiple domains — discoverability really is a problem. I can't count how many times we've come across, like, "oh, we really need an API that does this", only to mention it in passing and hear, "oh, that's already been done, three times, and here's where you find it." So whether it's actually one URL or however it works, the discoverability, I think, is really important. I think some sort of source of documentation, however we do that — this is a level up from this — but I think that's kind of important.

Boom. OK, this is working. Yeah, so I don't think that there's a perfect answer right now. I mean, the api.php documentation is usually not bad; in most cases, people who are creating an api.php implementation could document their interfaces better. Yeah, we don't right now have the ability to easily state what the output looks like, although I'm sure that would be possible. But generally, I'd be in favor of having most of the high-performance, internet-facing APIs more centrally available, for discoverability purposes and for developers to be able to see things, and to allow api.php to change faster. It's easy for people who are building front-end code for the Wikimedia properties, for example, to adjust their stuff for api.php on the fly; but for people who have packaged implementations, or suppose they're a third party and they need a highly reliable interface, having a central place of documentation would be better. So I don't think there's a simple answer myself.

Well, speaking about ORES in particular, I just wonder if some of the reason behind them having their own API was that they implemented it in Labs, and you can't have a production service depending on Labs. And once they move it into production, I hear that they're looking at integrating into both RESTBase and the Action API. Speaking of ORES, I think that basically, because of how it's written, it's a completely different project, so it has its own API, of course, and you can probably integrate it into RESTBase or any global API endpoint that you want. But I want to say — I can't hear you, Marco, sorry. OK. Anyway, I think that there are good reasons for having both a unified API for most things — I mean, at least the things that more people would like to use for building clients on top of our rich data set, right? — and single-purpose, specific APIs for individual services, because those are, of course, going to be more flexible and can have a different format; they shouldn't necessarily have to conform to the standard that you would conform to if you want to be part of a unified API. What we need and what we don't have — and that's what Corey said before — is a good site for documentation.
Unified documentation of all of our APIs — I mean, when I try to find information, I have to go around searching for it, and we should probably spend some time on that, and maybe even have somebody do that. The microphone — oh, OK, it came back. We should probably have somebody doing that work explicitly, I mean, documenting all the APIs that we expose to the public. Because it's a lot of things — even the Wikidata Query Service is an API, if you want, right? So we should really have a single place where a developer can go and just look around at all the APIs that we offer, and have a unified API for most things, which has a standard structure that's unified for everything and logical in itself.

Hi, I'm Stephen, I'm on the mobile apps team as well. I wanted to make sure that I understood the question. So would another example of multiple APIs be the differences between querying content on — or sorry, retrieving content on — Wikipedia versus Wiktionary? No, I would consider those different sites, and you can have the same API structure for both of them. So whether that is part of a parameter, or whether that's just a different domain exposing the same API — whatever path, or api.php — I would consider that the same API. The question is more about different bits of functionality. For Wiktionary, for example, should the query part be a separate service and a separate API with a different structure, potentially, or should we try to unify at least the parts that are generally useful into one coherent API, or at least gradually nudge things in that direction? So it's kind of like how great it is that you can do a query on Wikipedia and then run a similar query on Commons, is that right? Using the MediaWiki API. Yeah, at least for all the features that are common between the two, which is a lot.

I think, from the perspective of a developer that is really focused on solving a particular problem, it's absolutely wonderful to have an API that is in the same terminology as the particular site. So for example, if I'm on Wiktionary and I say "define the word dog", I probably have a pretty good idea of what that is just by looking at the endpoint. How many is that? Can you give an example of what would differ between Wikipedia and Wiktionary? Perhaps for Wikipedia, the example is article: the endpoint is article, and I say give me the article for dog, and that gives me the full content of an article; or give me the card for dog, and that gives me a small amount of data that I can present in a card-like interface. And some of these, I think, would be common. But having different APIs, or allowing different APIs, would make it much easier, I think, for developers focused on a specific problem for a specific installation of a specific site. But to clarify, you mean slightly different functionality, so you get a different response? For example, on Wiktionary you would get a definition — or, as we were actually just discussing, a summary — while on other sites maybe you would use a summary that has a lead paragraph and then a page image. Yeah. Personally, I don't really think of developing a really generic app that just works with every single MediaWiki install. I think it could be done, but as a developer, I'm really interested in focusing on the content or the data at one site and developing an app for that. Yeah, Stephen, I agree that whenever there are different needs and different functionality, it should be named differently.
There's no point in confusing people by naming things the same and actually having them do something different. But there's also cross-cutting functionality, like "just give me the HTML for this page as it is rendered" — that can probably be named the same if it's doing the same thing. I'm the gatekeeper, so go ahead.

So I think, generally, what I would recommend — because I completely agree with Stephen — is separating things. You have an API surface area, which is the actual things that you call, the functions you call, and those are different in the Action API and the REST API, and there could be specific ones for ORES and things like that. But separate that from the actual transport mechanism and the client-side libraries. What we do is: we have gRPC, which is used internally for absolutely everything, and then we have protocol buffers that define specific APIs for specific things. So if a mobile application needs to talk to its back end, it can communicate directly via that mechanism, using a predefined protocol buffer. So essentially, the only thing that's different between specific APIs is the actual surface area — the protocol buffer and the code that implements it — but everything else is standardized. And that makes discovery easier, it makes implementation easier. You don't have to think about: is this RESTful, is this that, which client library do I use, where do I go to look for this thing — that's all standardized, and it makes it much simpler. And it also allows you, by virtue of being centralized, to add things like HTTP/2 caching and stream reuse — things that greatly simplify the effort to build a new API and to use an existing API from a new front end. I think the API spec work is actually also moving in that space; it might not be quite there, and there are definitely pros and cons. But I think there was a point in time where rolling your own was the only way to get all these benefits, and I think that has changed a lot since then, with HTTP/2 and with specs that have been developed since. So I think REST is no longer as cold a world out there as it used to be.

Okay, I just want to remind everyone that we have less than half an hour left, so refocusing our efforts on what we want to achieve is good. But I also want to bring up two points. One is, regarding the merging of services behind RESTBase: I think it also brings up an interesting point with regard to scalability and uptime. I think it makes a pretty big difference for something like ORES whether it is behind RESTBase or not, in terms of how many instances you need of that service and how exposed it is. And it goes both ways: on the one hand, if RESTBase is highly reliable and highly available, it would cover a small, short outage in ORES; but the other way around, if ORES were exposed directly, it wouldn't be affected by downtime in RESTBase. So those two also play into whether you can still serve content when one is down. This would of course be simplified when you have something like Varnish, which also covers those kinds of gaps. But I just wanted to bring it up. Yeah, and also the isolation — having one cluster versus multiple clusters so that you can avoid one deploy wiping it all out — and the procedures around deploys all play into it. Just a comment, since this is my area, basically: that's exactly the problem that I posed about having a proxy in front of everything. We already have Varnish, for good reasons. I mean, multiple proxies.
It may be good for functionality — for presenting a unified API — but it's surely not a good thing for stability. There's no way it's a good thing for stability; I mean, there's no way. The good thing is that a path-based structure makes it fairly easy to rewrite or map specific requests to specific back ends in something like Varnish. But the downside is that you get more complexity in the Varnish configuration, basically, so you want to keep the number of entry points down. Brandon has been doing work on making that more streamlined — basically registering a path prefix and mapping it to a back end — and that would help. We could still expose it as one set of documentation, but have the requests go straight to the back end without ever touching RESTBase. So I think that could give us the best of both worlds: it looks the same from the outside, but it actually goes straight there. And then there's a long tail of low-volume requests, on not very critical endpoints, where it might be okay to just go through RESTBase. Yeah.

So, something I also mentioned earlier was having a shared place for documentation of the different APIs. And indeed, there has been work on this on mediawiki.org, and by myself as well on Wikitech. But a more significant effort on that is also the developer hub that some people are working on, which is somewhat comparable to what you would see, for example, with developer.google.com, where you have a shared interface to all the different products that we have and the APIs that they expose. So that is definitely in the works, though not something I'm personally working on.

My second point was this: when you have a unified interface, you also have unified versioning. So it gets a little bit trickier how you deal with major breaking changes when everything is under one API — you also have one API version number, at least the way it's currently structured. That can also be something that really quickly expands complexity in terms of who has authority within the API. Because if everybody's in the same API, then who actually owns the version number? And how many times do you want to keep breaking things and rerouting things? And then you get a lot more cache segmentation of things that didn't change, and stuff like that. Yeah, that's related to the discussion about per-entry-point versioning versus global API versions. You can have both, and that gives you the flexibility to keep both up and working. Paths are one option, while parameters could be another option. Generally, as we've experienced in the Action API, we can push backwards compatibility pretty far by being only additive and keeping old requests supported, but there are always going to be breaking changes. We so far don't have per-entry-point versions, but we're considering using Accept headers. Right now, as it is documented, a breaking change to a stable endpoint would force a major version increment, which has the problem you bring up.

Something else I was thinking about, when we're talking about the usability of the APIs, is that as developers, when we get started, it's pretty important that we understand the relationships between the sites, and some sites are more coupled to others than others are. So obviously, all the different language sites for Wikipedia are extremely tightly coupled, and that kind of goes for Commons as well, whereas some other project like Wiktionary is a little further off to the side.
So as a developer using the API, it's kind of odd if I'm looking for Wikipedia content, and then I have to get images, and then I'm going over to the Commons API. In a situation like that, it might make much more sense to have a unified API; whereas if we're talking about some other project off to the side, like Wiktionary or something like that, it does maybe make more sense if they are separate. So it's something to think about — it might be more of a case-by-case basis. And this goes beyond just developers, to our perception by the public: we're basically seen as Wikipedia, and we have all these other projects, and it's hard to understand how they all relate together. I think the way we do our APIs shows that it's also confusing to us how they relate together. So when we think about restructuring our APIs, we should think about that perception — how we want these things to be seen, and how they should be seen, by the outside world. There are some technical issues there, but I totally agree with your point. It's just that we currently don't have a mechanism in place to deduplicate all this at the caching layer, for example. So if you expose the same image through 800 APIs, 800 domains, then you clearly have to deduplicate that.

So "one API" means under one domain? Or does it just mean — because you're talking about proxying between all the different services or something — so is that one domain, or what does that mean? I'm still not really sure what that means, I guess. Yeah, there are two ways to look at it, I guess. One is the API structure, which would be the same for German Wikipedia versus English Wikipedia versus French Wikipedia. So that is one way to say this is one API: it's one structure, one spec, but it's exposed at three different, or 200 different, domains. Or you could treat one domain as one API. And you were talking about routing based on path prefixes, but that would be for different services that might be internal and exposed that way, instead of proxying through RESTBase or something? I'm talking about Varnish routing. Well, you can bypass it, yeah: based on path prefixes, you can bypass RESTBase and tell Varnish to go straight to the back end, and the outside doesn't need to know. I mean, you just see a URL and... What's the benefit of doing that over just using a different domain and routing that way? Oh, domains are fairly expensive. If you are a client on English Wikipedia, you have a connection open to English Wikipedia, you have already DNS-resolved English Wikipedia. So if you want a fast response, you'd better use English Wikipedia for your API as well. You want to minimize the number of domains you use and the number of connections you have to open, especially with TLS — that's several round trips per connection. And that's basically a lot of the motivation for having these APIs at slash api, something on the same domain.

We have 15 minutes left, from the timekeeper. Okay. Yeah, well, that's one of the things I wanted to make sure we talked about, because this area is about content access and APIs, and I wanted to make sure that we didn't just focus on the API as being the only way to get at the data. Ariel ran a session earlier about the dumps and how we generate dumps of our data. Yeah, so just the summary:
One of the main ways that users get at our data is that they download XML or SQL dumps of tables, or of revision and page content, and there are more and more data sets being produced over time as well. These dumps get slower and slower to run, and at some point the old, organically grown architecture has gotten just very difficult to maintain; I hate dealing with it anymore. So the idea is: if we knew what our users saw as flaws or as missing, and we could toss the old system altogether and just solve it ideally, what would that look like — reusing as much of the work being done here as possible? Event Bus was one of the things that came up repeatedly: one, as a way of distributing jobs out to different nodes, and another, as a way that we could make incremental data available, because people beg us for incremental dumps. They want to have a full snapshot, and then they want to just see a feed of what's changed in the last two days or five days, instead of having to download and process this whole huge thing again. So those were the two Event Bus uses, and — oh, already I'm forgetting. Help, is anyone here that was there? Because my brain just went dead. I'm jet-lagged, I promise it's not the speakers. Anyway, there were several other suggestions that were very helpful: thinking of maybe having pages with their revisions together in HDFS, where every page is a separate item, so regenerating a dump just looks like deleting old pages, putting new pages in, and then some packaging that goes up for download. Anyway, a number of different ideas — you can find the etherpad. Please chime in; there'll be a project called Dumps Rewrite, at least hopefully that'll be created, and so I would encourage people to get on and help us make those better. There are also HTML dumps that are experimental — we talked about them, we talked about them; see, this is why I'm dead. We have multiple formats coming out from different sources, and we want it to be easy: instead of having to write a whole different set of "okay, how do these get generated, how often do we index them", and this and that, it should be one infrastructure where you just say, here's my source, here's the script that runs it, here's where they live, and then everything just happens. So the HTML dumps were absolutely a prime candidate — they're a prime example. Those are currently using SQLite databases so they can be updated in place, basically — random access — but the downside is that you have to download a full database; you can't stream it.

So Gabriel, I'd be appreciative if you could help tease out the question I'm about to ask, and it has to do maybe with a long-run vision of both a universal translator and engaging, perhaps, Wiktionary in relation to Wikipedia and to Wikidata, if that was something, for example, that World University and School could further. How would this paradigm work in terms of a generally unified API scenario that you were mentioning? If, as Daniel Kinzler said just half an hour ago, Wiktionary is in the works for both text and spoken language in the future, would the versatility of this API — I'm not a programmer — be able to be applied from Wiktionary, say, to content translation, to a from-language and a to-language in Wikipedia, in a particular article? How to conceive of these various APIs within this unified API structure as one gains more significance, hypothetically, brainstorming-wise, is my question, I think. I'm not sure if you're after word-to-word correspondence or if you're looking for text translations.
What is the use case? The use case might be to go from a Mandarin Wikipedia article, or an MIT OpenCourseWare course in Wikidata or Wikipedia, into an English Wikipedia article — from that Chinese or Mandarin, or from MIT OpenCourseWare, into English. So for article translations we have the Content Translation project, which is very nice: it provides a nice user interface and uses machine translation to support the user, so you start out with a machine translation and you can fix it up. That would be my suggestion for articles. So are these all unique, per-domain APIs in one sense? And is that my understanding of what the previous speaker right here just asked — to conceive of content translation, as well as Wiktionary, as well as each Wikipedia language, as separate APIs? Or could you consider all of Wikipedia one API, vis-à-vis all of Wiktionary as one API, versus content translation as yet a third API? I'm trying to clarify in my own mind how APIs — this sort of scheme of them — work together. I'm not sure I understand the question completely. I think content translation has a separate API right now, you're right. But your question is whether they can be integrated, or what? More and more fully integrated, in different ways. I'm Scott McLeod. Well, they can be, definitely — we are in conversation with the content translation team. But there are concerns around their use of third-party translation services, which we don't necessarily want to expose to the world. So thank you for considering those.

Just to also say: I also work on OpenStreetMap, and the way they handle diffs and dumps, and also their API — maybe not perfect, but it works pretty well. The changes are stored as changesets in the database, and then they provide minutely, hourly, daily — just different incremental dumps, as well as the whole thing. And then there are tools that let you pretty easily slice and dice, or apply the changes to keep your copy up to date. So I think there's quite a proliferation of third parties — Mapbox, for example — providing services and tools. As well, their API is a REST API which is at version six, but it's been around for years and has been stable, and I think that also helps people build stuff around it and just allows it to get used pretty widely. So, I mean, at least for Wikidata, something like having structured diffs and providing them incrementally I don't think would be so hard to do, but it still needs some sort of infrastructure, and just time to get that done, and someone to get that done, but yeah. There's also a lot of interaction between the APIs and the incremental updates: for example, if you just distribute the list of changes, you might have a client just run those updates by hitting a REST API.

So in terms of domain names — how many distinct domain names we need — I guess for myself, I would probably prefer to see something like www.wikimedia.org slash api. I know that we've had the rest.wikimedia.org domain in the past, but getting that to be consistent, so that users, when they land on projects in the future, are operating off of one domain, might simplify certain things, and might help us with things like DNS poisoning.
It also would simplify things, I think, for developers in general. So I'd be kind of interested on that front. I do understand that right now that's very much impractical, because of the nature of how skins are rendered more or less on a per-domain basis, or how experiences are rendered on a per-domain basis, and there are certain assumptions. But yeah, that's kind of where my thinking is: I think it'd be really neat if, when somebody comes to read an HTML page in the browser, it were served off of a singular domain, and that might be bootstrapping some sort of API-backed experience, whether it's done by server-side composition or by client orchestration. But I'm not sure how that works from an ops perspective, if we actually standardized on one domain name kind of across the board. I don't want to conflate having the API endpoints with serving the HTML, but I think inevitably that's probably the simpler approach.

Yeah, we currently have upload, and we have www.wikimedia.org actually for some global data — AQS, yeah, pageview data is exposed there. But the main issue really is the number of connections. If you are mainly on English Wikipedia and you also do some API requests, then you just need to open extra connections if it's a different domain, and on a slow connection that's a second or so: just your TLS setup, several round trips, maybe some packet loss thrown in. Right, that's very expensive with multiple domain names, right? So yeah, I heard you earlier on standardizing for both older user agent implementations and ones that are even newer; I think generally that's good. I wasn't sure, though, whether our kind of VIP architecture can handle an arbitrarily large number of sockets to the same domain name — maybe that's a non-issue in our environment; maybe somebody from ops could speak to that. Like, suppose that we wanted to unify everything under one domain name, with the projects being path-based underneath that, both on the API and on the HTML website side. Or do you know? Yeah, people will still go to English Wikipedia only, unless we can tell them it has to be www slash en, which would be a big change. Yeah, I realize there's a convoluted set of rewrites that happens right now, and maybe that's just the way it will always be.

I think Timo's proposal to have an entry point across projects that directs people to all these different resources that are available — I think that seems a lot more practical in the short term, including "you should use Commons for images", this kind of information. I think the point is that we have to treat this as an outsider would: an outsider doesn't really know what these projects do. As insiders, it in a way seems obvious, but to an outsider, not so much. Yeah, they don't necessarily need to reside on the same domain to be discoverable through one domain, as HTML can point to anything, and so that can be really powerful. In terms of reuse, I think it makes sense to keep wiki-specific APIs exposed under that wiki's domain — if anything, to simplify connection reuse, and so you can contextually discover it when you're already on that particular project. So, that wiki's domain, slash api, gives you what you would expect.
But if you're a high-traffic consumer that consumes a lot of different Wikimedia projects and content — I think internally we will probably always be serving the different projects as one service, right? We don't have a Wikipedia server; we have a MediaWiki app cluster, and the projects are essentially presented as one service. And the same for RESTBase: RESTBase takes the domain essentially as a parameter, as part of a longer resource path. It's not in any way more significant than the page title or the revision, really. And so if you want everything for one domain, we can already do that — that's what rest.wikimedia.org is for. On that domain you can get them all in one place, if that's what you want, and if you just need one, you can use that one. They both work the same. And I don't know if we're already doing it, but if not, we can actually also trivially share the same Varnish cache object. We currently already do that with project domain slash static: they're actually all cached as one object in Varnish; it just ignores the domain part. So that allows some rewriting and cache reuse even at the Varnish level, even though they're presented on entirely different domains. Yeah, if it's uniform, then it can be one rule, basically. Yeah, right now the policy with rest.wikimedia.org is really to discourage using it, because we want to encourage people to use the same connections, but it is still available. So if we think — I mean, it is only one API, so we probably want to have a wider integration for the discoverability part. But yeah, maybe revisit that down the road. Well, the connection reuse goes both ways, right? If you're on English Wikipedia already, your reused connection will be contextual to English Wikipedia; but if you're consuming a hundred different wikis, then your connection reuse is actually going the other way — in that case, you probably want to use RESTBase as your one domain.

Hi, do you have use cases in mind? Yeah, actually, I do. For the counter-vandalism project, we fetch different — it's basically a service that runs in Tool Labs right now, or in Wikimedia Labs I should say; it's its own cluster right now, but it makes API requests to many different wikis, based on whatever recent changes happen, across the whole RC feed input from basically all public wikis. It doesn't do that right now, because it still uses api.php, but if I migrate it to RESTBase, I can see myself using rest.wikimedia.org and reusing that connection throughout the whole process.

So, with the other questions raised there, I think I've got a summary of the things that we discussed. There are the long, detailed minutes, and then there's just the brief summary — like, if you needed to write a trip report for this or something, this would be the piece that you could cut out and say, this is what we talked about. Did I get it right? Is there something missing? And I guess, maybe just as a follow-up, this will probably be the headline for "this is what we talked about in this meeting". So, as far as this summary part goes, to the extent that we can then follow up on which of these questions is really, truly important for us to answer, and which of these questions is, like, shrug, we don't care so much — I think that may be what we should be considering for the coming weeks and months.
Yeah, I think the discoverability part is definitely a very central one that was brought up many times, and it's very interesting to change perspective and look at it from the point of view of somebody who doesn't know anything about the Wikimedia universe — I think that's a very important point that I took away. Well, we don't want to stand between all of you and lunch. So, all right, thank you.