So it seems like most people are back from lunch, so perhaps we can go ahead and get started. The plan for this afternoon is to start with this lecture, and then afterwards, since this is the beginning of week two, we thought it would be good to share a little about the projects you've been thinking of working on. We'll go around the room and each person or group can say a few words about what they're planning to do, so we can share our plans and get feedback at this stage. Later in the week there will be opportunities to present more tools to use in the context of the projects, for example the MJO phase composites or the BSISO, the Boreal Summer Intraseasonal Oscillation, to think about the sources of predictability over your region that you didn't get a chance to look at last week. I think Paula presented some of that already last week, but we can present it again along with some other tools. The emphasis throughout is that there are many tools available as options, but they are by no means obligatory; you are not required to apply them.

To begin, since there were no lectures on this in the first week, and in the second week we're starting to think about how S2S forecasts could be applied in a region toward managing various climate risks or early warning of hazards, we thought it would be good to say something about how we verify forecasts: forecast verification. Some of this may be well known to many of you, but perhaps not to everyone, and I'll be raising as many questions as answers about how we should go about verifying sub-seasonal forecasts. Most of what I'll say relates to the seasonal forecast scale, but many of the same metrics are also used in weather forecasting. We need to think about what it is that we actually want to verify: should it be a weekly average, and on the sub-seasonal scale, what is it that we are really interested in? So we'll talk a little about what makes a good forecast, a bit about skill scores, and about verification of probabilistic forecasts, and I want to show you that we have a sub-project on verification within S2S. So far it has lots of plans but not much done yet, so there are plenty of opportunities for developing forecast verification on sub-seasonal scales, and I'll try to convey some feeling for that.

So, what makes a good forecast? These three aspects come from a slide by Simon Mason of the IRI. Simon has worked on forecast verification for many years and has even developed his own scores; he's a leader in the field, and a couple of these slides come from him. He identifies three things: quality, value and consistency. Forecast quality means the forecast should correspond with what actually happens. That seems intuitive. But right at the outset, and I realize I don't have any slides on this in my presentation, "what actually happens" means you need data, and for the sub-seasonal scale that may not be as straightforward as for the seasonal scale. If you're thinking about a seasonal average of precipitation, you can build it from monthly averages.
But if you're thinking of sub-seasonal scales, where, as Vincent was showing, we're really interested in daily variability, then we need daily precipitation data sets, and those are much more challenging. So for verification, developing daily precipitation data sets and having confidence in them (there are several available) is itself a challenge. So the forecast should correspond with what actually happens, and this includes skill, reliability, sharpness, discrimination, and other forecast attributes; these are called attributes of forecast quality, and I'll say a few more words about them in later slides. Then value, which we talked about in this morning's lecture: forecasts should not only correspond with what actually happens, they should be potentially useful. They should come at the right time and be timely for the decision, be specific, which may require downscaling the forecast to the local level, and be salient, in the sense that the forecast quantities should relate to the decision. And finally consistency: the forecast should indicate what the experts really think. That seems like a no-brainer, but there can be problems actually fulfilling it if forecasters want to hedge their bets. They may have estimates of the probability of below-normal rainfall coming from their tools, but feel that this is too extreme a forecast, so they hedge and don't actually issue what they really think. It's important that they do, because hedging hurts skill. In terms of a skill score, the implication is that the score should be optimal if you use as your forecast the true probability of the event happening.

So, skill. This really gets to the question: is one set of forecasts better than another? Skill is a comparative measure. A skill score is used to compare the quality of one forecast strategy with that of another, a reference set, and it expresses the percentage improvement over the reference forecast. Often the reference is some kind of climatological forecast, so this is a relative measure of forecast quality compared to that climatological reference. But better in what respect? We still need to define what we mean by "good". So how do we do this? To assess a set of forecasts against what actually happened, we compare one forecast with its verifying observation, a second forecast with a second observation, and so on. This is normally done in time, so for seasonal forecasts these would be distinct years, say the summers of 1979, 1980, 1981, and so on. There are two ways this is typically done. One is to verify real-time forecasts that were actually issued, which is the ideal, because if you have a set of issued forecasts you'd like to compare them with what actually happened. The other is to use hindcasts, which on the sub-seasonal scale are usually referred to as reforecasts, where you take your model and run it retrospectively for past years. Forecast centers that have been in the business for some time are able to do this with their real-time forecasts.
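To make the "percentage improvement over a reference" idea concrete, here is a minimal sketch in Python of a mean-squared-error skill score against a climatological reference. This is only an illustration under my own assumptions; the function name, the arrays and the use of MSE (rather than another accuracy measure) are hypothetical and not taken from the lecture.

```python
import numpy as np

def mse_skill_score(forecast, observed, reference=None):
    """Skill score = fractional improvement of the forecast MSE over the
    MSE of a reference forecast (climatology by default).

    1.0  -> perfect forecasts
    0.0  -> no better than the reference
    <0.0 -> worse than the reference
    """
    forecast = np.asarray(forecast, dtype=float)
    observed = np.asarray(observed, dtype=float)
    if reference is None:
        # Climatological reference: always forecast the observed mean.
        reference = np.full_like(observed, observed.mean())
    mse_fc = np.mean((forecast - observed) ** 2)
    mse_ref = np.mean((reference - observed) ** 2)
    return 1.0 - mse_fc / mse_ref

# Hypothetical anomalies: forecasts for a set of past years vs. observations.
obs = np.array([1.2, 0.4, -0.8, 0.9, -1.1, 0.3])
fcst = np.array([0.9, 0.1, -0.5, 1.1, -0.7, 0.0])
print(mse_skill_score(fcst, obs))  # > 0 means an improvement over climatology
```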
It has actually been done now for some of the regional climate outlook forums, since those forums have been running since the late 1990s. People have gone back and tried to verify how good they have actually been, because that is the key question that users of those products need answered; Simon Mason, among others, has done that. At the IRI we have a maproom portal for forecast verification, and that is based on our real-time forecasts, which we have been issuing since around 1998. So if you look at those scores, they are based on what we actually issued, every month, going back to 1998. More typically in seasonal forecasting, though, people don't do that, especially since forecast systems are updated all the time; you might want to know how good today's forecast system is compared with one from 20 years ago. So typically we use hindcasts or reforecasts for the verification. This can also be of greater scientific use, because you can then compare two forecast systems, a previous version against a recent one. That matters particularly in the S2S database, where many of the models are run on the fly and updated often. So in the S2S database you will be able to look back at the reforecasts associated with forecasts issued in 2015 versus those that will be issued in 2017, because, say for the ECMWF system, there are reforecasts associated with every start, and they are all archived in the database. So typically in modeling we do this based on hindcasts and reforecasts.

And this is what I meant about consistency. It is sometimes called being "proper": a proper scoring rule is designed such that quoting the true distribution as your forecast distribution is the optimal strategy when you average over many cases. In the end, when we aggregate all of these together, we get some kind of skill score. I mentioned that this would generally be done in time; in seasonal forecasting these would be different years, so for a particular season we verify over previous years of that season. On the sub-seasonal scale this could also include many starts within a particular season: if forecasts are issued every week we can pool over many weekly starts, and if they are issued every day, as with the NCEP CFS or the CMA model, we can even pool over daily starts. So you can see that for sub-seasonal forecasts it might actually be easier to verify the real-time forecasts, because we have more samples: if we take our real-time forecasts from, say, 2015, starting every week, and average them together, we have lots of samples. Normally there is some seasonality to skill, so we would want to do this for a particular season.

In terms of scores, one we often use in the climate community is a simple deterministic score that gives us an idea of the skill of the model even if we're not thinking much about applications; it helps us scientifically to see whether a forecast has some skill, or whether a new version is better than a previous one. This measure is not at all recommended for applications, precisely because of what I mentioned in this morning's talk about the need for probabilistic forecasts and conveying uncertainty.
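Going back for a moment to the "proper" idea above, a quick numerical illustration may help: the Brier score is one example of a proper score for a binary event, and its expected value is minimized when you issue the probability you actually believe, so hedging can only hurt. The numbers below are made up for illustration only.

```python
import numpy as np

def expected_brier(q, p_true):
    """Expected Brier score when the event truly occurs with probability
    p_true but the forecaster issues probability q."""
    return p_true * (1.0 - q) ** 2 + (1.0 - p_true) * q ** 2

p_true = 0.7                      # the probability the expert really believes
qs = np.linspace(0.0, 1.0, 101)   # candidate issued probabilities
scores = expected_brier(qs, p_true)
print(qs[np.argmin(scores)])      # ~0.7: hedging away from the true value worsens the score
```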
Coming back to the anomaly correlation coefficient: you don't convey any uncertainty with it, but it can be very useful for scientific purposes, for assessing predictability and the model's ability to capture it. It is just a measure of association: are increases and decreases in the forecasts associated with increases and decreases in the observations? It doesn't measure accuracy. One nice property is that if you square the anomaly correlation coefficient, it tells you how much of the variance in the observations is correctly forecast. In the formula, x is the model and y is the observation: on top we have the covariance between them, and on the bottom the product of their standard deviations. Notice that there is an x-bar and a y-bar, so these are anomalies, and if you want to construct this for your sub-seasonal forecasts you have to think about what to use for x-bar and y-bar. They come from the mean of the hindcasts, or reforecasts, for x and the mean of the observations for y, and if you're thinking about a weekly time series there could be some seasonality in this. So in general terms we subtract a lead-dependent climatology: for a week-one forecast covering December 1st through 7th, we would subtract from the 2015 forecast the mean of the forecasts for that same week and lead from all the other years.

This score, incidentally, is quite sensitive to outliers, so you can get a high value from one case. This example shows a good association between the forecasts (in red) and the observations, with the green point marking the 1985 hindcast, and a lot of that 0.64 correlation is actually coming from that one year, so you have to be careful in the interpretation. Here is another little schematic of what the score is doing: the mean bias is removed, because we subtract off x-bar and y-bar, so even if the forecast is biased relative to the observations that bias is subtracted out; and the amplitude bias is divided out as well, because we divide by the standard deviations of x and of y. So it really is just the association of the ups and downs.
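Here is a minimal sketch of the anomaly correlation just described, with the lead-dependent climatology (the x-bar and y-bar) removed from the reforecasts and observations before correlating. The array names, shapes and example values are assumptions for illustration, not the script used in the study.

```python
import numpy as np

def anomaly_correlation(fcst, obs):
    """Anomaly correlation coefficient for one grid point and one lead time.

    fcst, obs : 1-D arrays over the verification sample (e.g. all weekly
    starts in all hindcast years for a given lead). The means removed here
    play the role of x-bar and y-bar: the lead-dependent reforecast
    climatology and the observed climatology.
    """
    fa = fcst - fcst.mean()   # forecast anomalies (removes the mean bias)
    oa = obs - obs.mean()     # observed anomalies
    # Covariance over the product of standard deviations; dividing by the
    # standard deviations also removes any amplitude bias.
    return np.sum(fa * oa) / np.sqrt(np.sum(fa ** 2) * np.sum(oa ** 2))

# Made-up weekly anomalies for one grid point and one lead
fcst = np.array([1.0, -0.3, 0.8, -1.2, 0.1, 0.6])
obs  = np.array([0.7, -0.1, 1.1, -0.9, -0.2, 0.4])
print(anomaly_correlation(fcst, obs))
```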
So what happens if we do this for some sub-seasonal forecasts? This is an example from a paper of ours where we took three models. We did this before the S2S days, several years ago, although the paper is only coming out now, so it uses an earlier version of the ECMWF monthly forecast system. It shows the anomaly correlation for week-one, week-two, week-three and week-four lead times, where red on the scale is around 0.5 to 0.6. You can see how the anomaly correlation lets you compare between lead times; although it's not a score we would want to use for applications, it can be very useful for that purpose. Paula has a MATLAB script for doing this that some of you may want to apply to the forecasts you have downloaded, and we may have an example presented later in the week.

As I mentioned, the lead-time climatology is subtracted. This is for boreal summer, and we've pooled over start dates; at that time the ECMWF model started only every Monday (they now have Thursday starts as well). So we took all the starts between, I think, mid-May and mid-September for the 1992 to 2008 period, and calculated the anomaly correlation at each grid point between the weekly time series of the model forecasts and the observations. What did we use for observations? The CPC Merged Analysis of Precipitation (CMAP), which is a coarse, two-and-a-half-degree, pentad product. This illustrates the challenge of having observed data over the full period of the hindcasts: you could say, okay, I'll use GPCP, which has a one-degree version, but I think it only starts at the end of the 1990s, and the data need to overlap the full hindcast period. So what we did here was interpolate the pentads to daily values. That's not optimal, but since we're verifying weekly averages we think it's a fair approximation.

So what do we find? I think you'll agree there's quite a striking result here in terms of visualizing sources of predictability. In the first week you have a lot of red: that's the predictability from the initial conditions of the atmosphere. In your projects, if you look at different forecast ranges, week one, week two, week three, week four, you will generally see a decay in skill, and you can certainly see the dramatic drop-off between week one and week two in most places. This is where we're being very ambitious in the S2S project, which really starts at week three. What came before, in the World Weather Research Programme, the THORPEX/TIGGE project, looked at forecast leads up to 14 days, which is what we think of as the limit of deterministic predictability for weather. But you can see here that so much has already been lost after seven days; even though the limit may be near two weeks, you really lose a lot after a few days, as in the schematic I showed this morning. But then there are patches of red that persist, and maybe you'll find that in your cases, or maybe you'll be lucky enough that it's true for your country in some particular season. This is boreal summer; it may be different in other seasons, and this is an older version of the model, so maybe the latest ECMWF system has better skill. If you're looking at your own region, say over Colombia here, it looks pretty dire as we move beyond the first week, but maybe this is the wrong season to look at, or we would like to know how well the models are doing and what the sources of predictability might be.

So let's look at those red areas. Does anything stand out? This one here is quite striking in how it doesn't decrease as you go to longer leads; you would think the skill should decrease with lead time, but that doesn't seem to be the case here. Anyone want to throw out a suggestion for the source of predictability? You can see it right along the equatorial Pacific. Yes, exactly, it really doesn't decay through the four weeks. So even on sub-seasonal scales ENSO can be a source of predictability; we tend to think of it as seasonal. Of course this is only association, and if we looked at the mean squared skill score or something else maybe we would see that it doesn't hold up. And then over Southeast Asia and the Maritime Continent the skill is decreasing somewhat, but some of it persists into these weeks, and that is coming from the MJO. We actually looked at three models, and I think that's the great thing about the S2S database now: we can look across models and see how well individual models do in different places, or whether we can combine them to get forecasts more skillful than any single model.

We were intrigued by this persistence of skill over the Maritime Continent, so we went in and looked at some particular cases to understand it better. We picked a particular year, 2002, and what's shown here is CMAP averaged over Borneo island, because Borneo turned out to be the place over land with the best skill. S2S actually has a sub-project, joint with the MJO Task Force, on the Maritime Continent, and Rizan is going to be talking about ASEANCOF on Wednesday or Thursday; it is certainly a region with some skill. So this is rainfall averaged over Borneo, by pentad, starting in the week of May 28, so this is June and this is July. You can see there was a peak in rainfall in late June and then a negative anomaly, a dry spell, in July. The red line is the ECMWF forecast at week-two lead and the green line at week-three lead, and you can see that even at week three the model could capture some of these intraseasonal swings. We could really relate them to the MJO: the blue here marks an MJO episode over the Indian Ocean, and the Maritime Continent, especially over the islands, tends to feel the MJO convection while the MJO is still over the Indian Ocean, ahead of the MJO envelope. The July episode is in brown, when the MJO actually moved into the western Pacific sector. So we could identify these swings in Borneo rainfall with the MJO's observed evolution. We didn't look at the model's own MJO; Paula has been coding up scripts to do that, and although I don't think they're quite ready for this training, it's an option we will have.

Then we looked at three different years, 2002, 2001 and 1999, shown here with the Niño SST anomaly. 2002 was actually an El Niño year, which pushed the whole rainfall envelope down, so lower rainfall, whereas 1999 was a La Niña year, which tended to push the whole envelope up. One thing we often talk about in S2S is the idea of a forecast of opportunity: are there particular co-phasings of phenomena, like an El Niño and the MJO, that give you particular windows of good skill? And this is something to think about for forecast verification: normally we pool over all the forecasts to get an estimate of skill, as in the schematic I showed, but if we wanted to assess some kind of forecast of opportunity, should we be doing something more conditional, on particular ENSO or MJO conditions? It's a question of real relevance to S2S verification.

I'd also like to draw your attention to some aspects of the S2S database that are relevant for verification, especially the hindcast lengths, which are relatively short, of the order of 18 years for ECMWF. If you think ENSO is playing a large role, we don't have many ENSO events in 18 years, so how are you going to verify that? For sub-seasonal verification we have many more starts to average over, but if there is some non-stationarity, if ENSO is playing a role, maybe the effective sample size is smaller, so this could be a problem for models with short hindcast sets. Other models that come more from the seasonal side, like the Australian POAMA model, have a much longer hindcast set, from 1981 to 2013, which is great from that point of view. Unfortunately the Australian Bureau of Meteorology is now switching to the UK's GloSea model, so in a couple of years this will go away and be replaced by the Met Office system; they have said they will do more hindcasts, but it will be an on-the-fly system, so it will be limited in the number of years. These are important things to think about when verifying sub-seasonal forecasts.

Then the other issue is the hindcast ensemble size. If we're doing verification using hindcasts, we would like to use the same configuration of the system as in the real-time forecasts; in particular, if we're trying to estimate a forecast distribution, we should have just as many ensemble members as in the real-time forecasts. That has typically been the case in seasonal forecasting systems.
You can see that the POAMA system there has the same number of ensemble members for the real-time forecasts as for the hindcasts, but it's really the odd one out. For example, ECMWF has 51 members for the real-time forecasts but only 11 members for the hindcasts, and that is actually a big improvement: it was just five members until, I think, May or June of this year, when they updated the system. So that's another potential issue if you want to verify a hindcast from S2S: how are you going to estimate your forecast probabilities with only a few ensemble members? For a large ensemble, estimating the PDF can be just a matter of counting the number of members that exceed a threshold, for example, but with a small ensemble you basically have to resort to parametric methods: fit a parametric distribution or do some kind of parametric regression. I'll show you an example of that.

There was a question: how many ensemble members is enough, and what counts as a low or a high number? Obviously 33 is better than three, but for example the NMME set a ten-member threshold, so each model should have at least 10 members. There has been some work done on that; I think there was work at ECMWF looking at the sensitivity of the skill to the ensemble size. You could think about that here: in places where you do have larger ensembles, what happens to the skill if you degrade the ensemble size? It's also worth mentioning, since you asked, that NCEP and ECMWF both have around 51 members but differ in a big way: the ECMWF forecast is a burst ensemble on Mondays and Thursdays, whereas the NCEP forecasts are issued every day as a lagged ensemble, with starts every six hours. So can we pool together several days of starts from NCEP? If we're starting today, Monday, we can pool the forecasts from Sunday, that's four members, from Saturday, another four, and from Friday, another four, so you already have 12. As you do that kind of lagged-ensemble pooling your forecast gets slightly older, but that is offset by having more members, so what is the trade-off? This heterogeneous database, where the models do things in all sorts of different ways, can help us do research on questions like that, the relative merits of burst versus lagged ensembles.

So I've talked a bit about S2S; let me talk a little about probabilistic verification. This is a forecast that we issue, which I showed this morning, from the IRI. We divide into tercile categories and give the probability of the most likely category. Over Borneo here we have an 80 percent chance of below normal; this was issued in May of this year for June-July-August. Was this a good forecast or not, and how could we go about verifying it? I mentioned that for a skill score we need to average over many forecasts, but what if we just go and look at what actually happened? How do we compare a forecast of some category with some probability, like 80 percent below normal over Borneo, with what actually happened, which is some amount of rainfall?

Let me skip that one. That's the same forecast on the left, the forecast PDF. If you go to the IRI's forecast pages and go back to previous forecasts, and take a regional view like the Asia region here, there's a little box you can tick for the verifying observation. What this pulls up is the CAMS-OPI, the merged satellite-gauge product that's used for monitoring, so it's kept up to date; it's in the IRI Data Library. The way we've plotted it here is as the percentile of the observation relative to the 1971 to 2000 climatology, showing where this past June-July-August ended up within the spread over all years. We said there was an 80% chance of being in the below-normal tercile, the lowest 33%, so where did it end up? You can see it was around the 33rd percentile, and in northern Borneo it was actually much more extreme. There's more spatial variation in the observation than in the forecast, since those forecasts are made at quite low resolution, about a 300-kilometer grid. So this is one way of doing it: a probabilistic forecast can't be right or wrong unless we forecast 100% or 0%, and we're careful never to do that, but we can compare and see where the year actually fell, and we think this is a nice way for users to go in and look at a particular forecast and see how it compared.

So what about some key attributes of probabilistic forecasts? I want to describe two that go together. Sharpness refers to the concentration of the forecast distributions: the sharper the better, provided the predictive distributions are calibrated. Reliability asks whether the forecast probabilities are correct on average, or whether there is some systematic bias toward under- or over-confidence. For sharp forecasts we want lots of color on this map, getting into the deep browns and deep blues: forecasts that stick their necks out and say a particular category has a high probability, not wishy-washy 33-33-33, which is all the white areas where we don't have any skill. To get some kind of score we need to average over many cases, but this gives us a way to look at one particular probabilistic forecast and compare it with a verifying observation.

Someone asked whether it is really fair to say much from a single case. I don't think you can do anything quantitative with it, but it can give someone a feel for what a probabilistic forecast in the format on the left means in terms of the distribution and what actually happened, and help in making sense of that kind of forecast format. I don't know whether you think it can help.

Steve: Yeah, isn't that the reliability measure?

Yeah. Maybe this is going into too much detail by putting the full percentile rank rather than just the category.
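As a small sketch of the comparison just described: given a climatology of past seasonal totals, place the verifying observation as a percentile and a tercile category. The 30-year climatology values and the function name here are hypothetical, purely for illustration; the IRI maproom does this on a grid against the 1971-2000 climatology.

```python
import numpy as np

def percentile_and_tercile(value, climatology):
    """Percentile rank of one verifying observation relative to a
    climatological sample, plus its tercile category."""
    clim = np.sort(np.asarray(climatology, dtype=float))
    # Fraction of climatological years falling below the observed value
    pct = 100.0 * np.searchsorted(clim, value) / clim.size
    lower, upper = np.percentile(clim, [100.0 / 3.0, 200.0 / 3.0])
    if value <= lower:
        cat = "below normal"
    elif value >= upper:
        cat = "above normal"
    else:
        cat = "near normal"
    return pct, cat

# Hypothetical JJA rainfall totals for 1971-2000 and one verifying season
clim = np.random.default_rng(0).gamma(shape=4.0, scale=50.0, size=30)
print(percentile_and_tercile(140.0, clim))
```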
The other way of doing this is to just plot the tercile category: if it fell in below normal, yellow; near normal, gray; above normal, green. That's the way Rizan showed a couple of weeks ago at ASEANCOF, comparing the previous COF forecasts with what actually happened. My thinking is that it's nice to know how far away you were, and the percentile gives you some indication toward the extremes as well, so we're not just looking at three categories. But maybe it is overly ambitious to put the percentile rank of the observation.

There was a question: if we have, say, 15 ensemble members, how can we say whether it will be below normal or above normal? What you can do is simply count the number of ensemble members falling into those three categories. But maybe your point is that 15 isn't very many for counting. What we do instead, in order to make this work, is use a parametric approach: we use only the ensemble mean to set the mean of a transformed Gaussian, and the standard deviation, the spread of the Gaussian, comes essentially from the hindcast errors, by looking at the performance over past years. So there is a parametric approach and a counting approach, and it may well be that parametric approaches are the way to go for sub-seasonal forecasting; I'll show an example of that later. There are quite a few publications in the weather forecasting community, by Tom Hamill and Dan Wilks, on extended logistic regression, where the regression predicts the probability directly rather than going through the ensemble mean and a fitted Gaussian, because the Gaussian or transformed-Gaussian approach is better suited to a seasonal forecast than to a weekly or daily one. There's also a paper by Tippett et al. that compared the parametric approach with the counting approach, and they found the parametric approach was somewhat better; it gave higher skill.

So, reliability and sharpness, let me explain these a bit. Reliability asks: did we correctly indicate the uncertainty in the forecast? It shows how well the forecast probabilities correspond to the subsequently observed relative frequency of occurrence, across the full range of issued forecast probabilities. What happens is that we plot the forecast probability against the observed relative frequency on a chart like this, typically pooling across many points; here it's over all tropical land areas, done separately for the different tercile categories, with above normal in green. For example, we take all the times we issued a forecast probability of 0.5 of being in the above-normal category and ask how often that actually happened; if it's really a 50-50 chance it should happen, on average, half the time. If the forecasts are reliable, the points lie on this diagonal line, where what you forecast on average is what you get. It shouldn't be that you forecast an 80% probability of above normal many times, but when you look at all those cases the event didn't actually happen very often.
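Before continuing with the reliability diagram, here is a hedged sketch of the two probability-estimation approaches just contrasted: tercile probabilities from a small ensemble by simple counting, and a parametric alternative using a Gaussian centred on the ensemble mean with a spread taken from hindcast errors. The numbers, the plain (untransformed) Gaussian and the error spread value are illustrative assumptions, not the IRI's actual procedure, which uses a transformed Gaussian; extended logistic regression is another option.

```python
import numpy as np
from scipy.stats import norm

def tercile_probs_counting(members, lower, upper):
    """Count the fraction of ensemble members in each tercile category."""
    members = np.asarray(members, dtype=float)
    below = np.mean(members < lower)
    above = np.mean(members > upper)
    return below, 1.0 - below - above, above

def tercile_probs_gaussian(ens_mean, error_sd, lower, upper):
    """Parametric alternative: Gaussian centred on the ensemble mean, with a
    spread estimated from the hindcast errors."""
    below = norm.cdf(lower, loc=ens_mean, scale=error_sd)
    above = 1.0 - norm.cdf(upper, loc=ens_mean, scale=error_sd)
    return below, 1.0 - below - above, above

# 11 hypothetical ensemble-member anomalies and climatological tercile boundaries
members = np.array([-1.1, -0.6, -0.9, -0.2, 0.1, -0.4, -1.3, 0.3, -0.8, -0.5, -0.1])
lower, upper = -0.43, 0.43
print(tercile_probs_counting(members, lower, upper))
print(tercile_probs_gaussian(members.mean(), error_sd=0.8, lower=lower, upper=upper))
```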
And so this is a nicely calibrated case, where things generally lie along the diagonal for the above-normal and below-normal categories. But you'll notice that is not at all the case for the near-normal category. What we find in our seasonal forecasts, even though the RCOFs may like to hedge toward issuing higher probabilities in the central tercile, is that we don't have any skill for forecasts issued in that category, and if you go and look at our maps you will scarcely find anything in gray. Maybe that also has to do with the fact that it isn't issued very often, but it's because there's no skill in that category. So there is a mask, if you like, where white means no forecast is issued because there is no skill. Someone suggested that white just means climatological probabilities, one-third, one-third, one-third; well, I don't think it means quite that, it means there is no skill in the forecasts in that case.

Incidentally, this is a forecast for a strong El Niño year, this year, and we have very strong probabilities here, of 70%. But if you look back across our forecasts for previous years, you'll find we much more typically forecast light yellows or light greens. This is a very bullish forecast with a lot of sharpness, but our forecasts are generally not that sharp; we don't often issue an 80% probability of below normal. And that's the other thing usually given alongside the reliability diagram: the sharpness histogram, which measures whether the forecasts vary much from their climatological distribution. Most seasonal forecasts avoid being overly precise by using tercile or five-category formats; a really sharp system would issue probabilities near zero or one, but most of our forecast probabilities are in the range of 40 to 60%, so the forecast system would be said to be smooth, or not sharp. Although that particular forecast was sharp over Indonesia, this is how often, in general, we issue each probability: again, green is the above-normal category, brown below normal and gray near normal. You can see that up at 0.8 it's almost not showing at all; we don't often issue forecasts with that probability. So plotting this kind of reliability diagram is challenging, because we don't have many samples at high or low probabilities: we have lots in the middle, near the climatological probability of 33%, and few as you move away from it. This is typical of seasonal forecast systems; they tend to be relatively smooth, and in order to be well calibrated they have to be smooth. Calibration can be thought of as making the forecasts as sharp as possible while maintaining reliability, so that on average they lie on the diagonal line, even if most of your forecasts are close to the climatological forecast. Adrian? Yeah, I think we haven't done that comparison. If the ECMWF system, done in that way, could be shown to be reliable, I'm sure it would be sharper than this. There was another question you were raising: in the IRI forecasts there may be, say, a 50% probability of above normal, but when the forecast is updated the next month it goes directly from 50% above normal to 50% below normal.
So when this kind of situation happens, what does it say about the reliability of the forecast? Well, those forecasts are not very sharp. At a 50% probability we're still quite near the middle of the distribution, so going between, say, 40% below and 40% above, those are really neighbouring forecasts in a way. I would be more concerned if it went from something like 80 to 20.

So maybe this one, which I put in at the last minute. This shows the advantage of pooling over models and building a multi-model ensemble. It has been shown for seasonal forecasts that you can improve reliability by combining multiple models. That's shown here for three individual models, for July-August-September precipitation, 30 South to 30 North. The individual purple lines, above normal on the left and below normal on the right, are the reliability curves for the individual models, and you can see they're overconfident, because they're more horizontal than the diagonal: if we forecast above normal with 80% probability, that only happens about 40% of the time. If we pool them together without doing anything special, we get this black line, which is just one big ensemble made of the three models; the below-normal case is shown as well. The green line, I forget the exact technique, uses some performance-based combination in the two-tier system, weighting the models, and you can see the reliability improves even more. It may be that with today's models, if we just pool ensemble members, we can get pretty close to this dashed line; I think I've seen results like that coming out of ECMWF, that with a large ensemble you can get quite reliable forecasts without any special calibration. So in the S2S project we would like to know what the situation is on the sub-seasonal scale: is it also the case that a multi-model combination over several models improves the skill of the forecast? We have a sub-project on verification for questions like that, and I thought I'd put this up: if you go to the S2S webpage, under Subprojects, you can find a document with all of this information.

Sorry, I was in Singapore the week before last, and coming back to New York the climate shock, a similar temperature to this or even lower, gave me a cold and a slight chest infection. Hopefully that didn't come from Singapore; a lot of people there are sick because of the haze from the forest fires, which was caused by El Niño. Anyway, I hope you can catch what I'm saying.

So we've talked about a couple of attributes, sharpness and reliability; there are others I didn't mention, such as discrimination and resolution. But what are the forecast quality attributes that are important when verifying an S2S forecast, and how should they be assessed? Should we be looking at weekly averages, or at weeks three and four pooled together? And which verification methods and forecast attributes are appropriate for reporting to users? We may use the anomaly correlation for scientific purposes, but for reporting to users we shouldn't do that.
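Since reliability diagrams have come up several times now, here is a minimal sketch of how the curves on them are built: pool many forecast-observation pairs, bin them by issued probability, and compare the mean issued probability in each bin with the observed relative frequency of the event. Variable names and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def reliability_curve(issued_prob, event_occurred, bins=10):
    """Pool many (probability, outcome) pairs and return, per probability bin,
    the mean issued probability, the observed relative frequency, and the
    bin counts (which also give the sharpness histogram)."""
    issued_prob = np.asarray(issued_prob, dtype=float)
    event_occurred = np.asarray(event_occurred, dtype=float)  # 1 if event, else 0
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(issued_prob, edges) - 1, 0, bins - 1)
    mean_prob, obs_freq, counts = [], [], []
    for b in range(bins):
        in_bin = idx == b
        if in_bin.any():
            mean_prob.append(issued_prob[in_bin].mean())
            obs_freq.append(event_occurred[in_bin].mean())
            counts.append(int(in_bin.sum()))
    return np.array(mean_prob), np.array(obs_freq), np.array(counts)

# Reliable forecasts lie near the diagonal: obs_freq ~= mean_prob in each bin.
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, size=5000)
outcome = rng.random(5000) < p        # synthetic, perfectly reliable example
print(reliability_curve(p, outcome))
```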
Something else I mentioned: how should the issues of the short hindcast period and the reduced number of ensemble members in the hindcasts, compared to the real-time forecasts, be dealt with in constructing these measures? Do we need to use parametric approaches? I also mentioned identifying windows of forecast opportunity: how should we go about identifying them and assessing the contribution of climate drivers? In your cases, when you're looking at the S2S forecasts for your region, you can be thinking about what those drivers might be; soil moisture might be one, for example. What about extreme events? Extremes have the perennial problem of rarity, by definition, so when that is coupled with the small sample sizes in the S2S database, how should we go about it? There are other questions here too, like active and break phases of monsoons and wet and dry spells, and then what about verification in a seamless manner across timescales?

We are trying a couple of things to verify skill probabilistically for sub-monthly forecasts. I showed you anomaly correlation maps at the beginning; this is an attempt, now using the S2S database. It's a reliability diagram for forecasts from the CFSv2 model for July-August-September, over a short period, 1999 to 2010, for one- to four-week leads. The different colors are the leads, week one in purple down to week four in green, for the three tercile categories, over the U.S. What we've been experimenting with here is extended logistic regression, which lets you model the forecast probability directly in the regression, with the ensemble mean of the CFSv2 as the predictor; I can't remember now whether it pools over a lagged ensemble. But as a first result, it seems we are getting some reliability out of these forecasts of weekly averages.

I just want to emphasize the last point here, the forecast format. What should it really be for a week-three, week-four outlook? The way CPC does it is with a simple below-normal/above-normal forecast probability, so it's similar in nature to the seasonal forecast probability format. With a daily weather forecast, where we're looking at what actually happens on each day, there might be a probability of rain; but three weeks out, is an anomaly format really what you want to give, or should you give some estimate of the total precipitation in that week, or some statistic of daily rainfall within it? I think there are real questions about how we want to format and issue forecasts on these scales, and they should be informed by user perspectives on what would be useful. This also plays into how we want to verify these forecasts: what kind of score should be used, and should we look at weekly averages, or are dekads perhaps better? I think ACMAD issues dekadal advisories, so maybe we want to look at dekads within the month rather than these weekly averages, for example.
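Going back to the extended logistic regression experiment mentioned above, here is a rough sketch of the idea, loosely after the Wilks-style formulation: a single logistic regression for the probability of not exceeding a threshold, with the ensemble mean and a function of the threshold itself as predictors, so one fit gives cumulative probabilities at any quantile and hence tercile probabilities. Everything here (the synthetic data, sqrt as the threshold transform, and the use of scikit-learn with its default regularization) is an illustrative assumption, not the actual implementation used in the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_elr(ens_mean, obs, quantiles=(1/3, 2/3)):
    """Extended logistic regression sketch: model P(obs <= threshold) with the
    ensemble mean and a function of the threshold as predictors, by stacking
    the training data over the chosen climatological thresholds."""
    thresholds = np.quantile(obs, quantiles)
    X, y = [], []
    for q in thresholds:
        for xm, ob in zip(ens_mean, obs):
            X.append([xm, np.sqrt(q)])   # g(q) = sqrt of threshold, one common choice
            y.append(1.0 if ob <= q else 0.0)
    model = LogisticRegression()
    model.fit(np.array(X), np.array(y))
    return model, thresholds

def tercile_probs(model, thresholds, ens_mean_new):
    """Cumulative probabilities at the tercile thresholds -> category probabilities."""
    X = np.array([[ens_mean_new, np.sqrt(q)] for q in thresholds])
    p_le = model.predict_proba(X)[:, 1]   # P(obs <= lower), P(obs <= upper)
    return p_le[0], p_le[1] - p_le[0], 1.0 - p_le[1]

# Synthetic hindcast sample: ensemble mean weakly related to the observation
rng = np.random.default_rng(2)
ens_mean = rng.normal(size=200)
obs = 5.0 + 0.6 * ens_mean + rng.normal(scale=1.0, size=200)
model, thr = fit_elr(ens_mean, obs)
print(tercile_probs(model, thr, ens_mean_new=1.0))
```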
So, some of the main points. Forecast verification requires large sets of forecasts or reforecasts, and this is challenging for S2S, where we have shorter hindcast sets and smaller ensemble sizes. Verification involves considering many attributes of forecast quality, and we should think about which ones are most opportune on the sub-seasonal scale. There is a sub-project in S2S on this topic, and we are trying to encourage community involvement in these S2S verification issues. Lastly, calibration intimately involves verification, because it seeks to maximize sharpness while maintaining reliability; that's something I'll talk a bit more about on Wednesday. To derive your calibration you also need to use the reforecast data, so you need to make sure you're not using the same data twice, to calibrate your forecasts as well as to verify them. So thank you, I think I'll leave it there, and I'll take any questions.