Good morning, everyone. My name is Peter Geoghegan. I work for Heroku, and this is my talk; it's called "Upsert Use Cases". Let's talk about a new feature in the just-released PostgreSQL 9.5. This particular feature has been, I would say, a long time coming, and I'm quite relieved that it made it into PostgreSQL 9.5. So let's get started.

What is upsert? "Upsert", incidentally, is a portmanteau of update and insert; it's the way we casually refer to the feature. The idea is that you insert a row, or, on the basis of an existing conflicting version of that row being present, you instead update that existing row. PostgreSQL 9.5 also added a similar insert-ignore type feature: rather than updating, the insert statement — we'll see the syntax in a moment — inserts nothing at all on the basis of there being an existing conflicting row.

The big picture here is that you don't have to worry about concurrency. You quite simply write the upsert statement and leave it at that; you don't have to give much further consideration to race conditions or concurrency. As we'll see, this is important. It alleviates the burden of writing what can be fairly subtle code. What I specifically mean is that there are various errors that may occur in similar features in other systems, without you perhaps expecting them to occur, that we have avoided. I've sometimes used a more formal term that I myself invented as jargon: the so-called fundamental upsert property. These are the errors I specifically refer to.

First, what I called unprincipled deadlocks, which means you have two sessions that are upserting at the same time and they deadlock with each other, but in a rather unconventional way. Specifically, there's no user-visible mutual dependency. You as an end user cannot very well say: oh, I know, I'll just reorder the statements within my transaction and that'll make it go away. That wouldn't have helped if you'd simply had to live with deadlocking of this nature, because you had no such conventional mutual dependency — you didn't have two updates that you could reorder to make them consistent in every transaction, because there are two locks involved in this one statement. So that doesn't happen here, and it's known to happen with other systems, which we'll have a little more on shortly.

Also, no spurious unique violations. The whole point of an upsert is that on the basis of there being an existing row — by which I specifically mean a row that has the same value in a constraint that you specify — you go and update, or ignore, whatever it is. You certainly don't want a unique violation in that unique constraint or index; the whole point is that you're avoiding that. If that happens, the feature, in my view, is broken. So that also doesn't happen. This may seem almost obvious, and yet, as we'll see, there are examples from other systems where these things do regularly occur in the event of concurrency.

The syntax is driven by an insert; I'll show that shortly. Like I said, this is a very widely requested feature, mostly for OLTP and web-app sorts of use cases. Now, the syntax. What you see here is that we've extended the existing insert statement; we've added a new clause that makes all this possible. [Audience request about the lights] In fact — yes, I'm afraid not: there was an AV guy, and he actually said don't turn off the lights during the presentation. He specifically told me not to.
I think it's being recorded; that's why. Okay, so, the syntax you see here. First of all, we have a fairly basic example of an upsert, a conventional example. Here there is an insert of two rows: one that says 'Gizmo Transglobal' and another that says 'Associated Computing'. This is a simple case where we will insert zero, one, or two of those rows, or alternatively update them, on the basis of a unique violation in — this is probably the primary key. See the way it says ON CONFLICT (did)? That did is the primary key. We're saying that's the constraint we take this alternative update path on.

The second example shows what I called before the ignore variant, where rather than updating, we're simply doing nothing on the basis of an existing version of that row; we're leaving it as it is.

The third example is a slightly more worked-out upsert example, where we are inserting or updating one row and we have an additional condition on whether we update the row. We're signifying that in order to go ahead with the update, the zip code must not be equal to 21201; if the row doesn't pass, the update doesn't go ahead, of course. The update will still lock the row with this variant of upsert, though, which is not true of conventional updates — they won't lock rows that were not affected, otherwise you'd lock almost every row in the table — whereas this will reliably lock the row even in the event of not updating.

I'm highlighting the EXCLUDED pseudo-table here, and this is the way that we can reference rows originally proposed for insertion. It would be inconvenient, and possibly not even possible, to repeat the values from the insert — the first example, certainly, is a multi-row insert. Having the EXCLUDED pseudo-table, which feels almost like a join against what you originally proposed for insertion, is just a convenient way of referencing what you have there in the first place without repeating it; and obviously, because this is, as I said, a multi-row insert statement, it would be rather inconvenient to do it any other way.
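To make the slide concrete, here is a hedged reconstruction of roughly those three examples. The distributors table and the exact values are my best guess, patterned on the very similar examples in the PostgreSQL documentation:

-- Basic upsert: insert two rows, or update on a primary key conflict.
INSERT INTO distributors (did, dname)
    VALUES (5, 'Gizmo Transglobal'), (6, 'Associated Computing, Inc')
    ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname;

-- The "ignore" variant: do nothing if a conflicting row already exists.
INSERT INTO distributors (did, dname)
    VALUES (7, 'Redline GmbH')
    ON CONFLICT (did) DO NOTHING;

-- A qualified update: only proceed when the WHERE clause passes.
-- The existing row is still locked even when the update doesn't go ahead.
INSERT INTO distributors AS d (did, dname)
    VALUES (8, 'Anvil Distribution')
    ON CONFLICT (did) DO UPDATE
    SET dname = EXCLUDED.dname || ' (formerly ' || d.dname || ')'
    WHERE d.zipcode <> '21201';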
This, incidentally, does carry forward the effects of BEFORE INSERT triggers, which may have contributed to the rows taking the alternative path being excluded from insertion. The UPDATE statement in Postgres in general supports UPDATE ... FROM, where you're joining the target table you mean to update against another table, and you are updating one from the other. The EXCLUDED table sort of feels like that in an update, except it's implicit that the EXCLUDED table is available, and obviously it's a special sort of table — a magic, implicitly defined table.

This highlighted syntax, that (did) column: this is used to infer, as I say — or as we say now — the unique index that you are interested in using to arbitrate whether or not you take this alternative update path. This will figure out which of the available unique indexes you had in mind, dealing with various subtleties, and it makes your intent clear. Specifically, what it does is this: if you happen to get a duplicate violation in any other defined unique index that you did not infer here, then you will in fact get a conventional duplicate violation. This is a problem, still, with the MySQL implementation, where you'd just better make sure that anywhere you get a duplicate violation is where you expected to get one. If, for whatever reason, you end up getting a duplicate violation in some other unique index, then that implementation will spuriously update, possibly causing something you might describe as logical corruption. We avoid all that by offering this, and requiring you to use it — requiring you to be explicit about exactly what you mean to update.

This is much better than naming an index or a constraint directly; it's more robust than that. Columns can be in any order; this can certainly be a composite index; you could be inconsistent in the ordering and it would just work fine. There are other sorts of advanced abilities, which we'll see in a little while. So it's going to figure out which unique index you want to use, and, implicitly, which ones you don't. These inferred unique indexes will appear in EXPLAIN output, incidentally: if you EXPLAIN the insert-on-conflict statement, you'll see precisely what was actually inferred, and other details like that.

INSERT ... ON CONFLICT DO NOTHING also supports exclusion constraints. Exclusion constraints are sometimes called a generalization of unique constraints. That means that if a unique constraint is a constraint that enforces equality — that no two rows can be equal on a column or set of columns at the same time — exclusion constraints generalize that to other operators, for example the overlaps operator. This could be useful in a domain like event management. We're in this room right now; we are occupying this room. If we wanted to have not fixed time slots but time slots over continuous ranges — arbitrary slices of time in which things could happen in a room — an overlaps operator would determine if something overlapped with something else. By having that represented as an exclusion constraint, we can enforce, much like with a unique index, that no one can use the room at the same time, and have that work in a robust way. So this will work with the DO NOTHING variant, but you may want to spell out the name of the exclusion constraint yourself, directly, using a syntax which I have not shown, which is basically ON CONFLICT ON CONSTRAINT constraint_name DO NOTHING.
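A minimal sketch of that room-booking idea — the table, constraint name, and use of tsrange with btree_gist are my own illustration, not the slide:

-- btree_gist provides GiST support for the plain-equality part (room WITH =).
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE room_reservation (
    room   text,
    during tsrange,
    CONSTRAINT room_not_double_booked
        EXCLUDE USING gist (room WITH =, during WITH &&)
);

-- There's no way to infer an exclusion constraint, so name it directly;
-- only the DO NOTHING variant is supported here.
INSERT INTO room_reservation (room, during)
    VALUES ('Ballroom B', '[2016-01-28 10:00, 2016-01-28 11:00)')
    ON CONFLICT ON CONSTRAINT room_not_double_booked DO NOTHING;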
Okay. One question I get a lot is why I did not go about implementing MERGE. For those of you that don't know, MERGE is a feature in a number of other major database systems, described by the SQL standard. It's a DML statement which is often thought to be comparable to what I'm talking about, but I would suggest that isn't really the case at all. It's kind of a way of combining a bunch of insert, update, or even delete statements: it's defined as joining two tables together and inserting into, updating, or deleting from one of them — a target, from a source. Useful, certainly, in data warehousing, that kind of thing, where you want to reconcile two tables. But importantly, MERGE as implemented, and as described by the standard — there's no counterexample — does not provide guarantees around definitely getting an insert or an update. There's plenty of evidence that people expect this and are often disappointed when, in production, it doesn't turn out that way. I was able to add some amount of MERGE-like flexibility to what I came up with, but it's not equivalent to MERGE. MERGE has all these various problems, and on the whole I was happy that I was able to offer something that I feel gave people what they really wanted.

Okay — can I get a show of hands: does anyone recognize what this is? Anyone? Oh, sorry, visibility is not so good here. What you're seeing here — this will jog some memories when I describe what it is. Yeah, that's pretty bad. It's showing how, in previous versions, you ought to go about correctly implementing more or less what I've described, so that you definitely get either an insert or an update. What you can see here is a PL/pgSQL function that involves a subtransaction. Initially you attempt an update; in the event of that not affecting any row, you go and insert in a subtransaction, which may itself get a duplicate violation, which you can catch in the subtransaction; and if that happens, you try again from the start. This is pretty ugly, but it does work; it does offer that useful set of guarantees. This is what you would have been required to do previously — and a lot of people weren't doing this anyway; they were doing other things that maybe seemed like they were safe but in fact were not. This would be the safe way of doing it, the way that is correct according to my own definition, prior to 9.5. Even so, there are subtleties: you can have problems, for example, with getting a unique violation in the wrong unique constraint — I already talked about why I think that's important in general. You could even have, say, a trigger defined on the table that affects some other table, which may itself raise a duplicate violation — so perhaps the spurious violation isn't even on the same table. There are a number of subtleties like this that must not be missed when using this particular approach.
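The slide is essentially the well-known recipe from the PostgreSQL documentation; it looks something like this (reproduced from the docs, so details may differ from the actual slide):

CREATE TABLE db (a INT PRIMARY KEY, b TEXT);

CREATE FUNCTION merge_db(key INT, data TEXT) RETURNS VOID AS
$$
BEGIN
    LOOP
        -- First, try to update the key.
        UPDATE db SET b = data WHERE a = key;
        IF found THEN
            RETURN;
        END IF;
        -- Not there, so try to insert the key. If someone else inserts
        -- the same key concurrently, we could get a unique-key failure.
        BEGIN
            INSERT INTO db (a, b) VALUES (key, data);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- Do nothing; loop back and try the UPDATE again.
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

The BEGIN ... EXCEPTION block is the subtransaction: catching unique_violation there is what makes retrying possible, and it's also where the subtleties above bite — the violation you catch might not be from the constraint you expected.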
I mentioned just a moment ago how MERGE doesn't offer these guarantees. I'll now show you how to make MERGE do just this — to make it more or less offer these guarantees — and this is an example from Stack Overflow. This is an Oracle MERGE statement. Again, this is rather difficult to read here, so I'll tell you what it says. It's an example where someone had a MERGE statement and essentially asked: why doesn't it do what I just said it ought to do? That is, why do I sometimes get a spurious unique violation? I'm not expecting that — isn't that the point? And someone proposes a solution that was well received on the Stack Overflow page, where they do what I showed in the last slide for Postgres: they have a subtransaction, they put the MERGE in, and they have to handle all the errors — the catching of errors there at the bottom, in that exception block. So in fact, all they've accomplished over and above what Postgres was doing in earlier versions was to combine an insert and an update into one statement, which is really, in the main, not what people are interested in. People are interested in simple guarantees. I guess this illustrates, in my mind, what the problems are with MERGE: as we see here, it hasn't really saved this person any trouble at all.

So, in summary, on MERGE versus upsert: upsert is there guaranteeing a certain outcome, insert or update, whereas MERGE is quite a different thing; it operates according to the same rules as individual insert, update, or delete statements, just combined together. In order to make upsert have these various guarantees, I had to essentially create new exceptions to the rules of how MVCC — a core mechanism for transaction isolation in Postgres — works. To suit the purposes of this statement, I had to create special cases there, where MERGE does no such thing. That's sort of what I'm getting at when I say there's a fundamental difference: MERGE doesn't make any guarantees at all. It's good for bulk loading; it's good for data warehousing, when you're going to drive everything with a big join — because it's a conventional join, nothing is required to go through a unique index. In fact, maybe there isn't even a unique index defined on the table; after all, in data warehousing that could be very common — if you're doing a bulk load, you might only create the index afterwards. So: quite a different thing.

This is a MongoDB bug. I'm not in the business of bashing MongoDB — it isn't really my style, and in any case I don't know very much about it; I don't know how qualified I am to do that. But I'm pointing at it, and this bug is current today — I did check it again, so this bug still exists and has been acknowledged by MongoDB as a valid bug. It concerned a regression in the latest version of MongoDB, presumably due to their WiredTiger storage engine having more granular locking than before. What we see here is all of a sudden getting duplicate violations where that isn't supposed to happen — that's the point: you're supposed to get an update, not a duplicate violation. So this is a regression that they have, which they're working on. MySQL, I think, had similar problems of this nature in the past; I believe those are sorted out now. But when I was working on the feature, I looked into problems people were having with other systems, with a view to avoiding those. It's rather subtle to get all these details right at once.

[Audience question] No, it's not, no. MERGE is in the SQL standard. It tends to be the case that when vendors — when implementers — implement something like this, they'll use their own flavor of their own syntax. MySQL has its own one; we have our own one now; Teradata has its own one; and there are a number of others. So there's a tendency, perhaps because of these sorts of subtleties, for them to each do it in their own way — yet another non-standard syntax. I don't see that the SQL standards committee would ever standardize this. I mean, the SQL standard doesn't even acknowledge indexes; it doesn't mention indexes, because they would consider that to be an implementation detail. And if they don't mention indexes — not unreasonably, they don't concern themselves with that stuff — then it's the kind of thing that
I think they would never do, basically.

Now, a more general summary of upsert. It offers what I would call a very flexible syntax, which I'm very happy with. We got all the edge cases; we fixed them. Some of the things I've talked about here may have occasionally sounded a bit complicated, but ultimately it's all about guarantees that you as an app developer can leverage — making it easy to use correctly and hard to use incorrectly. We may still do MERGE in the future; I'm not opposed to that; it does have its uses, as I've acknowledged. And I think that's about all I have to say on MERGE in general, or on upsert in general. Okay, at this point I will take any questions on upsert in general, if anyone has them. Does anyone have any questions about anything I've said so far?

I'll now discuss new use cases: some less obvious uses of the feature, in some cases using things that are unique to Postgres, which I think are notable for data integration. What is data integration? For our purposes, it's a requirement to integrate data from diverse sources. These are often autonomous or distributed sources, which is to say that we don't control how they might be formatted; it might be that a vendor does, and it's not practical to get them to do it the way we would prefer. So maybe there are inconsistencies across these various sources, that kind of thing. This is very common; this is just the way of the world. As for how those are represented — maybe, for example, they're using a different character encoding; but it might also be something like using metric units, because it's a Canadian or European supplier, that kind of thing. These are all inconsistencies we must account for. Maybe this arrives as a comma-separated values file, or something terrible like that. So these are practical, real-world scenarios — not especially glamorous, but I think common.

Now, in this sort of complicated landscape, it's sometimes important that we have idempotence, which is to say, in this case, that the upsert statement can be executed multiple times without changing the result beyond the first application. Typically you'll be able to write the upsert statement such that it doesn't especially matter if it's executed multiple times — that's something that can happen from time to time — and that'll be just fine if you write the statement correctly; there's a sketch of this below. If we need to reconcile the table in a live system, I mentioned before the special visibility rules — how I've changed, in very specific cases, how MVCC works. That could be particularly necessary here, because you could have multiple concurrent processes doing this sort of thing, and you can't really quiesce the tables — make sure nothing is happening from any other process at the same time; that's really not practical. And there's probably other various messiness, as there always is.

My perspective here is that when you're dealing with all this complexity, "just be careful" is a bad plan; it's certainly not going to scale all too well. ETL is always complicated, and meshing all this complexity together can sometimes be very difficult, very challenging. By having all these guarantees, you can leverage them to the hilt. We've got a few good, basic primitives that you can use, and you can get a lot of mileage out of them; you don't have to be an expert to get all this benefit, and because it's easy, you sort of get it automatically. I feel this is a very important goal for Postgres in general, and I think I've done that here.
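Here is a minimal sketch of what I mean by an idempotent upsert; the products table is hypothetical:

CREATE TABLE products (
    sku         text PRIMARY KEY,
    description text
);

-- Running this statement twice in a row leaves the table in exactly the
-- same state after the second run as after the first (absent concurrent
-- writers changing the row in between): first run inserts, replays update
-- the row to the same values.
INSERT INTO products (sku, description)
    VALUES ('ABC-123', 'Left-handed widget')
    ON CONFLICT (sku) DO UPDATE
    SET description = EXCLUDED.description;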
Now, getting a bit more into the example. This is something I'm calling an accidental distributed system. It's not something involving Apache Kafka or some kind of message-bus system, something fancy like that; it's more of a day-to-day kind of example. This particular example involves retail systems, with disparate suppliers with ad hoc text file formats. Some of these suppliers may sell the same product, without all necessarily being independent manufacturers — so multiple suppliers could be selling the same product. Essentially, this is a standard, boring sort of business information system, and it requires periodic syncing with regional suppliers. These are basically autonomous systems. The end retailer needs to maintain details of each product, and with autonomous systems it's not obvious which one is authoritative in respect of which values — for example, a product description. So there are subtleties because of that.

Here is the example, a little more worked out. I'm creating a products table here, and a supplier_prices table. These suppliers perhaps don't have a particular interest in making this easy for you; for all they care, you'd buy everything from them, when for practical, economic reasons you're not inclined to. They don't care about distinctions that the end user, the retailer, naturally cares about; they're not especially inclined to accommodate the requirements that that person has. So we need to work around that.

Okay, so this is an example upsert statement. What you're seeing here is: we're downloading a new comma-separated values file, inserting some rows into products, and then potentially updating — but only on the basis of the existing data's timestamp being less than the new timestamp that you see there. So your updates will only go ahead if the incoming data is considered authoritative, on the basis of it being the most recent according to that timestamp. It's sort of like, technically, the update has its own snapshot, which is the timestamp that you specify. Maybe that's going over people's heads, but again, the whole point here is that the semantics it provides are not especially complicated, even though the details of how you'd otherwise avoid problems are. You get that facility and you use it, and you don't have to be an expert; if you try to do it any other way, there is significant potential for problems to arise. As I said, if we make all this easy, if we make it just happen, then even people who are, you know, not experts automatically get all these sorts of benefits, and they will accrue, I guess.
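A hedged sketch of the sort of statement on the slide — the column names here, extending the hypothetical products table with price and last_updated columns, are my guesses, not the actual slide:

-- The update only goes ahead when the incoming feed row is more recent
-- than what we already have, so stale supplier files can be replayed
-- harmlessly, in any order.
INSERT INTO products (sku, description, price, last_updated)
    VALUES ('ABC-123', 'Left-handed widget', 9.99, '2016-01-28 10:00')
    ON CONFLICT (sku) DO UPDATE
    SET description  = EXCLUDED.description,
        price        = EXCLUDED.price,
        last_updated = EXCLUDED.last_updated
    WHERE products.last_updated < EXCLUDED.last_updated;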
Alright. So, you've seen how upsert can selectively update a row; we'll look at more advanced capabilities now. This is where a lot of the detail work went. Your favorite existing feature should work seamlessly: there's a principle in PostgreSQL development that all features must play nice with each other, so you'll have a hard time finding any case where the new upsert statement fails to work with an existing feature.

Okay, going a little bit back to the example in making this further point. Consider that you might have a system where products — or other objects represented in the database — need to be logically deleted, but not actually deleted. Maybe you have historic sales records that need to continue to refer to the products; you have line items that were sold years in the past, and even though those products have logically been deleted, they still need to be in reports, and so on and so forth. And yet you probably don't want to display them when you do a generic search for products. You can't very well delete the product in the first place, because then it simply isn't going to be available for the report. But it's also true that if you instead have a flag — a Boolean is_active column representing logical deletion — there's a difficulty there with making sure that you still enforce constraints: the SKU, the barcode, the supplier code, whatever it is. If you still have the old product record in the table, it can still happen that you need to create the product again, because you need a new representation of that same product, for whatever reason — it does happen. So you cannot very well have both records in the same table if you naively have a unique constraint.

However, Postgres has a good way of working around this: you could use a partial unique index. You could say CREATE UNIQUE INDEX on the column WHERE is_active, and so when you set is_active to false, thus logically deleting the product record, it won't matter when you go to create a new version of it, because uniqueness will only be enforced among those rows that are currently active. So everything works, basically. But the question then arises: where does this leave upsert? That, too, is supported. So — I don't know how visible this is — you see here the WHERE is_active. That is distinct from a predicate on the update: notice it comes before the DO UPDATE, not after (there could also be a predicate afterwards). This is saying: when we go to infer a unique index, it's okay if it only covers those rows where is_active. So that partial unique index will work just fine, as you would expect, because we spelled that out. Again, this just works; there's a sketch of the whole arrangement below.
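Roughly what that whole arrangement looks like, with hypothetical names (a fresh, self-contained version of the products table):

CREATE TABLE products (
    sku         text NOT NULL,
    description text,
    is_active   boolean NOT NULL DEFAULT true
);

-- Uniqueness is enforced only among rows that are currently active.
CREATE UNIQUE INDEX products_sku_active ON products (sku) WHERE is_active;

-- The WHERE here is part of the index-inference specification -- it picks
-- out the partial unique index -- not a predicate on the UPDATE itself:
INSERT INTO products (sku, description)
    VALUES ('ABC-123', 'Left-handed widget')
    ON CONFLICT (sku) WHERE is_active DO UPDATE
    SET description = EXCLUDED.description;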
I also already mentioned this phrase I was fond of: making it easy to use correctly and hard to use incorrectly. I think I've accomplished that — we've accomplished that. It's a little bit unusual that the presence or absence of a constraint affects the semantics of a query, but that just needs to be the way it works. We sort of bend over backwards to make it safe: having offered this inference capability, we ought then to make sure that we don't add further ways to shoot yourself in the foot with it. Having made people use it, we'd better be very sure that we haven't given them more problems than we've prevented; if the cure were worse than the disease, that would be pretty bad. So there are further edge cases that I'll talk a little about.

I mentioned earlier that you can EXPLAIN — well, in general everyone realizes you can EXPLAIN an insert or an update. This is an example of explaining an upsert, showing the novel aspects of it, which are highlighted. You can see the arbiter index; in this case it's called upsert_pkey, the primary key. This, of course, is what determines whether or not we had a would-be duplicate violation, which is grounds for taking the update path; it's specifically limited to that one index here. It is possible sometimes that you would have multiple such equivalent unique indexes — for example, you're doing a migration, maybe you're replacing an index because it's bloated, something like that. These things do happen, and we need to worry about the rare cases a lot, because when it does come up, even if it's very rare, a spurious unique violation is surprising. You can CREATE UNIQUE INDEX CONCURRENTLY and end up, for a short time, with two exactly equivalent unique indexes — and there are a number of other possibilities. If that happens, inference will figure out both of them; you'll use both of them as arbiters. Usually that won't happen, but if it does, you certainly don't want a spurious unique violation in production all of a sudden. So, again: easy to use correctly. This is a partial index example — a similar example — where I'm just showing two arbiter indexes.

I'm going to talk more about advanced features of inference. I haven't said so yet, but this also supports expression indexes. Postgres will allow you to create indexes, including unique indexes, on arbitrary expressions. For example, you could create a lower(column_name) expression index, so everything in the index will be lower case, and if it's a unique index, you'd thereby be enforcing uniqueness over the lower-case versions — you'd recognize as equivalent cases where the only difference was in the case of the text you were inserting. And, as I mentioned, inference is forgiving of ordering and redundancy: the predicate works on any unique index that satisfies it, as we say. So you could still have a non-partial unique index, and if it happened to satisfy the inference specification in every other way, it would work fine, because it covers everything, which is sufficient — it doesn't just cover some subset that you spell out; it covers everything. So that's good too: again, if you added a partial index while doing a migration in production, and had both at the same time for a little while, it would just work. Inference should be used in preference to the ON CONFLICT ON CONSTRAINT constraint_name variant I mentioned earlier; that's really kind of an escape hatch, and we don't recommend you use it. As I mentioned, exclusion constraints require it, because there's no way of inferring an exclusion constraint — you have to spell out exactly which one you mean. But in general, best avoid doing that.
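Going back to expression indexes for a second, a sketch of inference against one, with a hypothetical users table:

CREATE TABLE users (email text NOT NULL, name text);

-- Uniqueness is enforced on the lower-cased email.
CREATE UNIQUE INDEX users_lower_email ON users (lower(email));

-- Repeating the expression in ON CONFLICT infers that index, so
-- 'Alice@Example.com' and 'alice@example.com' are treated as the same row.
INSERT INTO users (email, name)
    VALUES ('Alice@Example.com', 'Alice')
    ON CONFLICT (lower(email)) DO UPDATE
    SET name = EXCLUDED.name;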
Now, a little about other advanced features. Postgres foreign data wrapper — who knows what the Postgres foreign data wrapper is? About half, I guess. For those of you who don't know, it's an extension that Postgres offers that allows you to create foreign tables that more or less appear as tables in your local PostgreSQL instance, but are in fact maintained by another PostgreSQL instance that you connect to. It creates the illusion of those being local tables, with a high degree of accuracy. This will itself work with upsert, but it only supports the DO NOTHING variant — and actually, that's simply because you're required to spell out what constraint you have in mind when you use DO UPDATE. Because foreign tables in general have no concept of a constraint, for that reason alone DO UPDATE does not yet work with the Postgres foreign data wrapper. But the DO NOTHING variant — when you don't ask for a specific constraint, when you don't require that it be limited to one, when it's everything, which is probably the common case with DO NOTHING — will work just fine with the Postgres foreign data wrapper.

Has anyone heard much about RLS, the row-level security feature, also new in 9.5? Can I get a show of hands for that? About the same proportion, perhaps. The existing permission system in Postgres is basically column oriented, or table oriented, where you have permissions on columns: can I see the data, can I update the data, etc. RLS offers a similar, although somewhat different, capability at the row level, where you can also limit which rows are updated. So this is a major item in the 9.5 release, and upsert will also play nice with it, more or less transparently. There is a little special consideration to how this must work with RLS, but not too much. It's nice that we were able to offer them both at the same time, working with each other.

Updatable views, as the name suggests, are a way of defining a view and having an update on that view automatically be rewritten such that it updates the underlying referenced table; if the system can determine that it's equivalent to such an update, that can happen transparently. That also will work just fine here.

[Audience question: is RLS a separate library?] It's not a library as such, no — it's part of the core PostgreSQL 9.5 release, part of the core system. It basically works like this: typically you have a column that holds the name of the user connecting to the database — it doesn't have to work that way — and you create a policy on the table, and that forces queries against that table to have an extra bit on the predicate, essentially limiting what they can see or what they can update. It could be that an update is not allowed to see rows; it could also be that, upon attempting to, it throws an error. There's a lot of complexity and variation there, depending on exact requirements; the possibilities are considerable.

Logical decoding, in the core distribution, is a big feature from Postgres 9.4. It's a way that we can do logical replication: a logical decoding plugin can consume logical change-set records that could be used for any number of purposes. As for how an upsert will be reported: it'll be reported, reliably, as an insert or an update to the row. It's not possible to reconstruct that it was an upsert; it looks like inserts and updates if you're consuming these change-set records from your plugin. It might matter that you're not able to reconstruct that, but probably not — certainly not for logical replication.

What you didn't see yet is that the ON CONFLICT DO UPDATE statement is totally unrestricted in its structure. MERGE won't allow you to put subselects in any of the handler clauses that it has, whereas this has virtually no restrictions at all. You can, for example, have subqueries in the update; they could be correlated subqueries; they may reference the EXCLUDED table. So it's really very flexible in terms of what you can throw in there — an awful lot of flexibility, over and above what MERGE offers.

[Audience question] Yeah, I mean, you're certainly not restricted to using the VALUES form. You could, for example, do INSERT INTO whatever ... SELECT * FROM wherever ... ON CONFLICT DO UPDATE. You can select an arbitrary thing; in general, you can select an arbitrary number of rows to insert, and so it follows that you can select an arbitrary number of rows to insert or update, because this is just an extra clause on insert. Well, that's one possibility, certainly, yes. There's also another new 9.5 feature: updates in general can now update multiple columns from a subselect. In other words, you could do UPDATE table SET (a, b, c) = (SELECT 1, 2, 3 FROM some_other_table) — a subselect with multiple columns, updating multiple columns all at once, as opposed to, for example, having multiple such subselects, which is messy; you can combine it into one nice little one.
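A sketch of that multi-column assignment from a subselect, here already combined with ON CONFLICT DO UPDATE, as discussed next. The stock and supplier_feed tables are hypothetical, and the correlated reference to EXCLUDED is exactly the kind of thing MERGE wouldn't allow:

CREATE TABLE stock (sku text PRIMARY KEY, qty int, unit_price numeric);
CREATE TABLE supplier_feed (sku text, qty_on_hand int, price numeric);

-- On conflict, assign two columns at once from a correlated subquery.
INSERT INTO stock (sku, qty, unit_price)
    VALUES ('ABC-123', 10, 9.99)
    ON CONFLICT (sku) DO UPDATE
    SET (qty, unit_price) =
        (SELECT s.qty_on_hand, s.price
           FROM supplier_feed s
          WHERE s.sku = EXCLUDED.sku);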
As the sketch shows, this will of course work just as well with INSERT ... ON CONFLICT DO UPDATE — that's pretty good. Also, some of you will know that insert and update statements in Postgres in general support a non-standard RETURNING syntax — although, in fact, Oracle has a similar syntax. This allows you to see rows after they were inserted or updated, or, in the case of delete, before they were deleted. You can insert something and then see what you actually inserted. Why would you want to do that? Well, maybe you have, for example, a serial — an auto-increment sequence — and you can't really determine ahead of time what the primary key value will be; with RETURNING, it will be right there, immediately. This works with an INSERT ... ON CONFLICT DO UPDATE statement: with inserted rows, of course, but it will show you updated rows as well. So you can see inserted or updated rows returned to the client as if they were selected, more or less. Again, no reason why it ought not to just work, which it does. I guess we spent some time discussing whether or not we should show the updated rows as well, and the decision was that you ought to, and so you do.

Okay, and because of that — does anyone know what a writable CTE is in Postgres? Like a WITH statement? Okay. Some of you will perhaps know that in other systems there has long been an ability to have what is essentially a temporary table that exists for the duration of a statement: you do WITH something AS, and then you define it — it could be another statement selecting something from something else — and then there's a main statement underneath that can reference the CTE as if it were a table. This is a nice way of breaking up the logic of certain complicated things by sort of pipelining them. Anyway, RETURNING, which I mentioned just now, can work within this in Postgres — and that is a unique Postgres thing; it's not in any of the other systems that do have WITH statements. So you can pipeline things you're inserting or updating across CTEs as well; there's a sketch of this below. Again, immense flexibility — very useful for data integration; I'm really looking forward to seeing how people use it.

Partitioning mostly works as before; that's not significantly different, so I'm probably not going to specifically note that. I guess that's all I have to say on the subject. Any questions? Any further questions? Yes, sir?

[Audience question] Yeah, I always get that question, and the answer is no, but let me just explain why. Basically, if that were offered — if you could upsert with two arbiter constraints and two alternative actions — what would happen if both were violated at the same time? That's where you get the complexity: if they were both violated at the same time, then you can't really take both alternative actions at the same time, right? Now, it is true, as I mentioned, that you can technically — although it's not really intended to be used in a fancy way — use two unique indexes as arbiters at the same time, provided they're more or less equivalent: in the event of a migration, that kind of thing.
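To tie the RETURNING and writable-CTE points from a moment ago together, one last sketch with hypothetical tables — the upsert's RETURNING output feeds a second statement:

CREATE TABLE audit_log (sku text, noted_at timestamptz);

-- RETURNING emits both inserted and updated rows; the outer statement
-- consumes them from the CTE as if they were selected.
WITH upserted AS (
    INSERT INTO products (sku, description)
        VALUES ('ABC-123', 'Left-handed widget')
        ON CONFLICT (sku) DO UPDATE
        SET description = EXCLUDED.description
        RETURNING sku
)
INSERT INTO audit_log (sku, noted_at)
SELECT sku, now() FROM upserted;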