I'm going to talk about X-ray nano-imaging: how big can we go? Which is something we've been thinking about in our group. Okay, so here's the team. You're gonna hear from Saugat later on, and quite a few of the things I'll talk about have involved Ming and Sajid, and some Panpan. Chaoling and Shirley are more on biological sample preparation, which I'll talk less about today, and Everett is on sort of correlative light and X-ray microscopy, which I won't talk about at all today. I'm gonna give a shameless plug for my book, if you weren't aware of it. Okay, so here's my outline. I'm gonna talk about ptychography and then how big can we go? The challenges include: what is the resolution that we can achieve, and what do we have to do to get there? What is the coherent flux that we have coming down the line? What do we have to do for speed? What do we have to do for 3D imaging and the number of tilt angles? And then, putting it together, how big can we go? And then I'll sort of segue into leading up to Saugat's talk. So of course, I think everyone on this seminar knows all about ptychography. It's really been a wonderful development. In fact, I remember when I first heard of it, at a coherence workshop here in France, on an island, what was the name of it? And I heard John Rodenburg's talk and I was kind of puzzled by it at first, and didn't connect it very well to what we were doing in coherent diffraction imaging. And it was Franz Pfeiffer, I think, who really noticed that talk. Oh yeah, Porquerolles, yes, yes, right, yeah. And it has really taken off from there. So you all know about that. The way we tend to do it at Argonne is by combining it with fluorescence, so that we get both a high resolution transmission image and a not-quite-as-high, optics-limited resolution image with fluorescence. That lets us do things like image cells, where we can see where trace elements are as well as see the structure of the cell. And we've also done some earlier work, which has now taken off as its own project, on imaging integrated circuits. Before it grew into this bigger project, here's the early demonstration we did. And I just want to point out that imaging this chip here was done through 300 microns of silicon, before any polishing. Being able to image a thin chip layer through 300 microns of silicon: there's no way you could ever do that with an electron microscope. The beam won't penetrate that. With X-rays, you can penetrate that. And so that got us thinking about how thick a sample we can do, and what's involved in the challenges for that. And so that kind of gets into the challenge here. So when you want to make estimations, you have to have a way of accounting for photon noise and contrast, and this is the approach that we use for this. It comes originally from Bob Glaeser, and really from the signal processing community, but there are some key early papers that I've listed at the bottom here. And the basic idea is that your signal-to-noise ratio involves calculating your signal and your noise. The signal is the difference in intensities you measure between whether there's a feature there or it's just background. That difference is your signal: this term right up here. Your noise is the square root of the number of photons, making a Gaussian approximation to the Poisson distribution. And you have uncorrelated noise in the feature measurement versus the background measurement, and so you add those in a root-mean-square fashion.
And that gives you this expression on the right here: the signal-to-noise ratio goes like the square root of the mean number of photons per pixel, n-bar, times the intensity difference divided by the square root of the intensity sum. And that quantity, (I_f − I_b)/√(I_f + I_b), is capital Θ, which in David Sayre's papers is known as a contrast parameter. It's not quite the same as a fringe visibility, as you can see, but it's similar. And then simple manipulation of that formula tells you that the number of photons you need per pixel is the signal-to-noise ratio you want to get, squared (people usually assume five to one as a minimum, from the Rose criterion, which goes back to early studies on human vision), divided by this contrast parameter squared. And then of course, if you wanna get more complicated, you include absorption and overlayers and underlayers. A much more detailed version of these calculations is given in this paper with Ming Du in 2018, listed at the bottom here; that includes stuff like multiple scattering and so on. Okay, so I'm gonna make an assertion here, which is that all good imaging methods require the same fluence: that to get a specified resolution, you have to deliver the fluence of photons per area required by the intrinsic contrast of the sample. And also Porod's law says that the scattering signal drops off as the fourth power of q, the fourth power of spatial frequency or angle. And so whether you're doing a ptychography experiment, or an imaging experiment where you're collecting the light and phasing it with a lens, in all these cases a two times improvement in spatial resolution needs 16 times more fluence. So I assert that, if you do everything else right, it's fluence that limits the imaging. Lenses collect scattered light and re-phase it to yield an image. You can do that, of course, in amplitude or absorption contrast, and in phase contrast. And so there are examples of doing Zernike phase contrast for full-field imaging and scanning microscopy; the little reference there is for scanning. But if you're using a lens downstream of the sample for full-field imaging, the efficiency of that lens, for soft X-ray zone plates, is only about 5%, typically 5 to 10%. And that means you need to use 10 to 20 times more fluence on the sample in order to get the required fluence in the intensity measurement. Coherent diffraction imaging has the advantage of using pixel array detectors that are near 100% efficient. And the phasing algorithms really don't introduce, as far as we can tell, any extra noise; there are many examples of computational studies of that, including the one I've listed there. And I maintain that the fluence requirements are the same whether you use Fresnel diffraction, in holography or near-field ptychography, or Fraunhofer diffraction, far-field diffraction, for CDI and ptychography. And we did a numerical study of that. I think Tim Salditt may feel a little differently about that, but I still maintain this statement. And just to give you an example of this, here's some work from Junjing Deng, where he imaged a test pattern multiple times, except with different exposure times per ptychography illumination spot. And if you look at the curve at the upper right here, which is giving power spectra of the reconstructed images, you can see that you're getting higher and higher spatial frequencies as you increase the dwell time.
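As an aside, here's a minimal sketch of that photon-counting bookkeeping, before coming back to Junjing's data. The intensities here are made-up normalized transmission values, purely illustrative:

```python
import numpy as np

def required_photons_per_pixel(I_f, I_b, snr=5.0):
    """Photons per pixel needed to see a feature against background,
    using the Rose-criterion SNR and Sayre's contrast parameter
    Theta = (I_f - I_b) / sqrt(I_f + I_b), so n = (SNR / Theta)**2."""
    theta = abs(I_f - I_b) / np.sqrt(I_f + I_b)
    return (snr / theta) ** 2

# Made-up normalized intensities: feature transmits 80%, background 90%.
n0 = required_photons_per_pixel(0.80, 0.90)
print(f"{n0:.0f} photons/pixel at resolution d")

# Porod's law: the scattered signal falls as the 4th power of spatial
# frequency, so halving the resolution length costs 16x the fluence.
print(f"{16 * n0:.0f} photons/pixel at resolution d/2")
```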
And it goes roughly as that fourth-power scaling. And then in a simulation study at the bottom, we have an estimate in that paper of the critical fluence for a given contrast of object, which in this case predicted that a fluence of 350 photons per pixel would be needed to see this simulated test object. And then we did reconstructions with a variety of methods: far-field ptychography (FFP) with either a Poisson or the standard least-squares noise model, versus near-field ptychography, versus holography. And we found in this simulation study that they had quite similar requirements for the exposure, and they all worked out well at this kind of critical fluence. So again, I claim that fluence determines resolution, if you do everything else right. And probably most of you have heard of this long ago, but there was a paper that came out in 1976 in the electron microscopy literature that made this statement: "A three-dimensional reconstruction requires the same integral dose as a conventional two-dimensional micrograph provided that the level of significance and the resolution are identical. The necessary dose D for one of the K projections in a reconstruction series is therefore the integral dose divided by K." So in other words, if you're trying to detect whether a voxel has a feature or background material, it doesn't matter whether you bring all the photons in from one direction, in one projection image, or if you distribute the photons around all the projection directions. And this was quite a controversial idea when it was first introduced, the idea that there's no more dose involved in 3D imaging than in 2D imaging. But now it's really embedded in the practice of single-particle electron microscopy, where you're taking many, many images of identical objects and combining them to get the combined statistics from the totality of the measurements, and also in medical X-ray tomography. So this is now used routinely in all sorts of cases, even if they don't refer back to this original paper. Okay, so with all that, we carried out a study where we asked what would be required to image large samples. In this, we said: okay, we don't know, at each thickness of sample, what the ideal photon energy is, or what the fluence required is once you include all the overlying absorption and so on. And if you're gonna do a simple model study, you pick a simple model. So we did basically two samples. One is 20 nanometer copper features in silicon, with varying silicon thickness. And the other one is 20 nanometer protein features in ice. And we actually made the ice kind of cytosol-like, with a little bit of protein in it. So those are the two samples that I'm gonna talk about. And this is kind of a complicated plot here; you're gonna see another one for biological samples in a second, so I need to explain it. What's being varied on the x-axis is the photon energy. What's being varied on the y-axis is the overall thickness of the silicon in which we have this little feature that we're trying to image. And then we're showing both a grayscale image and contour lines of the number of photons required per voxel. So when you see the number six there, that means 10 to the sixth, and when you see a five, that's 10 to the fifth. What it's showing you is that for very thin samples, of course, we need only a few tens of thousands of photons to see this object, and that the optimum energy is rather broad, between about 3 and 9 kilovolts.
But when you get to a thicker sample, you wanna increase the energy, because you need to penetrate all of this other silicon. And so you'll see the optimum is where the cusps of these contours are. So for a sample that's like 300 microns thick, your optimum energy would be something like 13 kilovolts or so. And for even thicker, you need to go to a higher photon energy. And the rule of thumb that we see in doing this, while the full calculation is a much more detailed thing, is that a good approximation is to choose a photon energy for that thickness such that the mean transmission through the overall material is about 1/e, roughly 37% transmission. If you pick that, you're usually gonna be pretty near the optimum photon energy for that sample thickness. So that's what's shown here for silicon. And then from this plot, from this whole large array of values, for each thickness we pick out the optimum photon energy and plot it, and we pick out the required fluence and plot it, and we get this. We have the fluence, the required number of photons, in red; you can see again it goes from a few thousand up to near a million. And the optimum photon energy at which the minimum fluence was found is shown in blue. Okay, so again, as you get to a thicker and thicker sample, you should crank up the photon energy. And it also follows approximately this trend of the 1/e attenuation length in silicon. And of course, you see the silicon K edge in there affecting the choice of energy. Okay, if we do the same thing for biological samples, here's the curve for that. There's a little less contrast than copper in silicon, so we need more photons. But again, we follow the 1/e absorption choice for the optimum photon energy, approximately. And we also plugged into this model a limit on the dose of 10 to the 9th gray; all of these conditions keep the dose below 10 to the 9th gray, because that's about the point at which you start to see degradation, even in a cryo sample. Okay, so we have an estimate then, and here's the optimum photon energy and the required number of photons for a biological-type sample in our simulations. You can see you need a lot of photons, because it's a low contrast object. So the next question we wanted to ask is: well, if we need a lot of photons, what might we get down the road from these diffraction-limited storage rings? MAX IV, of course, being the first one of them in operation. And so we've had this wonderful history in recent years, going back for decades in fact, of very significant gains in coherent flux. And so we start to think about how far we can push X-ray nano-imaging in terms of large samples, now that we get more and more flux. And of course, XFELs offer even more coherent flux, but because it comes in these very short pulses with a long time in between, you have to worry about ablation of the sample. And so usually they use the diffract-and-destroy type approach, and that's not really compatible with tomographic-type imaging, because you have to have your sample stick around for another viewing angle. Okay, I'm gonna then take two example cases, which are the planned upgrades of the APS at Argonne, at high photon energies, and the ALS at Berkeley, at low photon energies. We can then calculate the coherent flux from those sources, which is shown here.
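To make that 1/e rule of thumb concrete before turning to the source side, here's a minimal sketch. The attenuation lengths below are rough, illustrative placeholders for silicon, not tabulated data; in practice you'd pull them from something like the CXRO/Henke tables:

```python
import numpy as np

# Illustrative attenuation lengths for silicon (order-of-magnitude
# placeholders only, not tabulated values).
energies_keV = np.array([5.0, 8.0, 10.0, 13.0, 16.0, 20.0])
atten_len_um = np.array([13.0, 70.0, 130.0, 280.0, 500.0, 900.0])

def pick_energy(thickness_um):
    """Pick the energy whose transmission exp(-t/Lambda) is closest
    to 1/e -- i.e., a thickness of about one attenuation length."""
    T = np.exp(-thickness_um / atten_len_um)
    return energies_keV[np.argmin(np.abs(T - 1 / np.e))]

print(pick_energy(300.0), "keV for a 300 micron silicon sample")
```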
The coherent flux curves here use the best undulator of their set at each photon energy, so it's kind of the best you could get at the best-optimized beamline in each case, at each photon energy. And of course, because coherent flux goes like brightness times lambda squared, you get more coherent flux at larger lambda, at lower photon energies, and less coherent flux at smaller lambda, at higher photon energies. That's just how it rolls. Okay, so we can then use that required number of photons per voxel times the number of voxels in the image. Actually, the time for 3D imaging only goes as n squared, not as n cubed; I won't get into that here, but it's written up in our paper in the Journal of Applied Crystallography. And then you can calculate the per-pixel time with the available coherent flux at the optimum energy. So this involves the required number of photons and the available coherent flux, and then picking out the best combination of those two for the shortest pixel time. And if you do this, you find that, okay, you get a slight shift in the optimum energy because of the way the coherent flux drops off, and you get a pixel time which is rather fast for the copper sample. And if you do it for the biological sample, it's still quite fast. So this tells me that we want to be doing very short per-pixel exposures, and so we really need to push for high frame rate detectors, like megahertz-range frame rate detectors. There are some R&D efforts on that, especially at the FELs. And so we have some hope that this might come down the line, but I think we need these for this kind of imaging at synchrotrons. Okay, so next I'll say something brief about detector pixels versus frame rate for speed. I know there are a couple of different ways that people do ptychography. Using bigger coherent illumination spots requires more detector pixels to work with, and requires more monochromaticity, but it isn't as demanding on the detector frame rate. If you use small coherent illumination spots, you require fewer detector pixels and also less monochromaticity, so you can use a little bit more flux from a quasi-monochromatic source, but you do need the higher detector frame rates. And you can show rather easily that the information transfer rate to the detector is the same in each case if you're trying to achieve the same overall imaging speed. So it's really just a question of what information rate you need to get out, and we need to get high information rates out. Okay, 3D imaging now. So far I've talked about kind of standard ptychography, and said that dose fractionation works, but now we need to think about the details of 3D imaging. Of course, we know very well from what's been pioneered, and is really operating routinely and beautifully, at the Swiss Light Source, that one very successful way to do 3D imaging with ptychography is to do 2D ptychography at each rotation angle, and combine those 2D images in a standard tomographic reconstruction algorithm to get a 3D image. And of course that's not the only thing they do at the Swiss Light Source these days, but that's how it was pioneered, and it's a very simple way to understand it. However, this only works within a depth-of-focus limit, and the depth-of-focus limit goes like the transverse resolution squared over lambda. And so here we have a plot of the depth of field in microns on the vertical axis, versus the transverse resolution on the horizontal axis.
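A minimal sketch of that depth-of-focus scaling, taking DOF ~ (resolution)²/λ with an order-unity prefactor (I use 2 here; the exact prefactor depends on convention):

```python
# Depth-of-focus scaling: DOF ~ resolution**2 / wavelength (the
# order-unity prefactor depends on convention; 2 is assumed here).
def depth_of_focus_um(resolution_nm, energy_keV):
    wavelength_nm = 1.2398 / energy_keV   # E [keV] -> lambda [nm]
    return 2 * resolution_nm**2 / wavelength_nm * 1e-3  # in microns

# 20 nm resolution with soft (0.5 keV) versus hard (12 keV) X-rays:
print(depth_of_focus_um(20, 0.5))   # ~0.3 microns
print(depth_of_focus_um(20, 12.0))  # ~7.7 microns
```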
And you can see that for soft X-rays, you have a pretty small depth of field when you're doing high resolution imaging. For harder X-rays the depth of field gets bigger, but if you're gonna talk about doing 100 micron samples, and your optimum photon energy from an exposure point of view is 10 or 12 kilovolts, you've got a problem here: your sample spans many depths of field, so it's gonna have out-of-focus regions. So how do we deal with that? Conceptually, the way to deal with that, and the forward problem, was put forward in electron microscopy quite some years ago: the multi-slice method, where you treat your sample conceptually as being made up of a bunch of thin slabs. What you do to model wave propagation through the object is apply the optical modulation of one slab, and then do a free-space propagation to the position of the next slab, and you keep doing this over and over again. And it turns out, much to our delight, that this even reproduces mirror reflectivity phenomena and waveguide effects in X-ray optics; Kenan Li did a paper on that some years ago. So it's a very nice method, and again, it's how you have to interpret images in electron microscopy. And of course, then Andrew Maiden and John Rodenburg took that over for ptychography and said, well, we've got coherent wave fields; we can include objects at several discrete planes, model the propagation of the wave between these planes, and image these multiple planes in an object, in multi-slice ptychography. And it's been demonstrated with X-rays, first in Japan, then at the Swiss Light Source, and then elsewhere. So there's lots of work now in multi-slice ptychography. So when you do that, you can ask questions about how many projection images you need to acquire to do successful 3D imaging. And there's kind of a surprising result here. If you think about standard tomography, you take a projection image. A projection image means there's no variation in the information you get along the depth direction. So you get a pure projection image, and it has transverse spatial frequency information in Fourier space, as shown at right, but it has no depth information perpendicular to that, as also shown at right. And so when you do standard tomography, you take these projection images and you fill in Fourier space, as shown at lower right here, from all these projection images. And the Crowther criterion, which tells you when you have complete information filling in Fourier space, is basically the statement that at the outer edge of Fourier space, the outer radius here, you have no gaps in coverage: you have filled all of Fourier space completely. Now, of course, you can do reconstructions with less than the Crowther criterion's worth of information. It works much better in algebraic-type reconstruction than it does in Fourier-based reconstruction, because the gaps in Fourier space give you artifacts unless you deal with them somehow. But still, if you really want to measure all the spatial frequencies in the object, the Crowther criterion tells you that the number of tilt angles you need is pi over two times the number of transverse pixels. Well, let's think about what happens in multi-slice ptychography. In multi-slice ptychography, we're reconstructing several depth planes, because of the wave propagation effects from one plane to the next. And so if we say that, well, we get good separation from one plane to the next at one depth of focus, then what we get is shown at lower left: we get a number N_A of axial depth planes.
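To make that modulate-then-propagate loop concrete before counting tilt angles, here's a minimal sketch of the multi-slice forward model under a paraxial (Fresnel) approximation; the slab transmissions here are random stand-ins:

```python
import numpy as np

def multislice(illumination, slices, dz_m, wavelength_m, pixel_m):
    """Toy multi-slice forward model: at each thin slab, multiply by
    that slab's complex transmission, then free-space propagate a
    distance dz to the next slab (angular-spectrum propagator)."""
    n = illumination.shape[0]
    fx = np.fft.fftfreq(n, d=pixel_m)
    FX, FY = np.meshgrid(fx, fx)
    # Fresnel transfer function for one inter-slice gap.
    H = np.exp(-1j * np.pi * wavelength_m * dz_m * (FX**2 + FY**2))
    wave = illumination
    for t in slices:               # t: complex transmission of one slab
        wave = np.fft.ifft2(np.fft.fft2(wave * t) * H)
    return wave                    # exit wave at the downstream face

# Example: 5 random weakly-modulating slabs, ~10 keV, 20 nm pixels.
rng = np.random.default_rng(0)
slabs = [np.exp(1j * 0.01 * rng.standard_normal((256, 256)))
         for _ in range(5)]
exit_wave = multislice(np.ones((256, 256)), slabs,
                       dz_m=1e-6, wavelength_m=1.24e-10, pixel_m=20e-9)
```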
And then if you look at that multi-slice representation in Fourier space, well, you don't have just a single projection with no z information. You have some z information, as much as the number of planes you measured. And so the volume that you fill in in Fourier space has some extent along the beam direction. And then if you want to cover all regions in Fourier space, without gaps, as you rotate the sample, it turns out that your Crowther criterion is modified by dividing by the number of axial planes that you reconstruct. And so that's how you can think about how many tilts you need. Because if you thought you had to take million-pixel transverse images to reconstruct a million pixels cubed, and had to do a million tilts, that would sound pretty crazy. And it turns out to be unnecessary; that's what we get in this calculation. In fact, if you work it out in terms of the transverse resolution, the object thickness and the wavelength, the object thickness drops out. It's rather amazing, if you can really do this multi-slice approach. And so the required number of tilts is actually reasonable, even for doing million-pixel-across projection images in multi-slice ptychography: you need a few thousand tilts, not a million. So that's good news for doing 3D imaging of really large objects. Okay, so now let's put this all together. We have estimates of the per-pixel imaging time, and we have estimates of the coherent flux from the source. If we could do everything right, how long would it take us to image a 3D object? This assumes a 10% efficient beamline and 100% efficient everything else, which of course is not reality. But if we could get close to that, how long would it take us to image a volume of a thousand microns cubed, a millimeter cubed? If we could do all this right, these calculations tell us that it would be about a minute. Okay, this is insane, because we'd need really fast scanning and really fast detectors. But it's a goal we can reach towards. We can try to imagine doing really large objects, and in these calculations, which are very optimistic, it's plausible. And if we do it for a biological sample, again, it's plausible: we could imagine doing a centimeter cubed in a week. Now that's kind of exciting for things like connectomics. If you want to image a whole mouse brain, for example, at high enough resolution to distinguish synapses from near misses of dendritic processes, you want to be able to image a whole brain at 20 nanometer resolution. If you could do everything right and fully optimize, you might be able to do a whole mouse brain in a week. And if you could do a whole mouse brain in a week, that means in a year you could do many, from many animals, with different animal models for human neurological conditions, different stages of development. You can really do science if you can do many, many samples. And so this is exciting for us, and it's challenging. We're nowhere near this imaging time today, but in principle we should be able to get towards something like this, and that's what we wanted to understand in this simulation study. Okay, so then let me end with a few brief comments that lead up to Saugat's talk. So if we can model the imaging process, we can reconstruct the data; that's a statement. And of course, in conventional tomography, people have already done large volume datasets of teravoxel size.
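Backing up to the tilt-count argument for a moment, here's a sketch of why the thickness cancels. It assumes an object as wide as it is thick and a depth-of-focus prefactor of 2, both of which are convention-dependent assumptions:

```python
import math

def tilts_needed(res_nm, thickness_um, wavelength_nm, dof_prefactor=2.0):
    """Crowther criterion divided by the number of axial slice planes."""
    n_transverse = thickness_um * 1e3 / res_nm          # pixels across
    dof_nm = dof_prefactor * res_nm**2 / wavelength_nm  # slice spacing
    n_axial = thickness_um * 1e3 / dof_nm               # slice planes
    # Note the thickness cancels in the ratio:
    # result = (pi/2) * dof_prefactor * res_nm / wavelength_nm
    return (math.pi / 2) * n_transverse / n_axial

print(tilts_needed(20, 100, 0.1))    # 100 um thick sample
print(tilts_needed(20, 1000, 0.1))   # 1 mm thick -- same answer
```

Whatever the exact prefactor, the count stays in the hundreds-to-thousands range, not millions, independent of thickness.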
Those teravoxel volumes were done with conventional projection tomography, no multi-slice, no ptychography, and at about a half micron resolution, but it can be done: one can deal with these large volumes by parallel processing on supercomputers. And then in ptychography, the approach you can use is: if you can make a forward model from a given guess of the object, you can predict what you would measure, compare that against your measurement with a cost function, and do a non-linear optimization approach to reconstructing the object. That was done first, as far as I know in ptychography, by Guizar-Sicairos and Fienup. Then you can include things like Poisson noise models instead of Gaussian noise models for the lower exposures. And you can do it for this beyond-depth-of-focus ptychography, where you don't do multi-slice and then combine the slices into a projection, but really just include it all in the whole problem; that was done in a paper with Marc Gilles. And then there's using it with automatic differentiation on supercomputers, which you're gonna hear about in a moment from Saugat. The nice thing is there are these packages, written for large datasets for calculating your neural network weights, which use reverse-mode automatic differentiation. Ming Du and Saugat have a paper on doing that for 3D imaging, and there's another, more recent paper describing their code base, which is nicely implemented and parallelized on supercomputers. And so you're gonna hear now more about what's meant by automatic differentiation in Saugat's talk. And so this is a good time for me to end and thank the team and many collaborators. And now I'll hand off to Saugat. So I'm gonna stop sharing my screen, and I don't know if you wanna do any questions for me first. Yes, thank you Chris. This was a very inspiring talk. And I would say we have some time for questions or comments before we pass on. So Carlos Sato, please; you can unmute yourself. Yes. Yeah, okay. Thank you for the presentation. So just out of curiosity: you mentioned what would be the upper limit, 100% efficient and so on. So I just wanna know, do you have any idea of the current situation? How far are we from this dream that you are showing us? At the APS, really far. We're using beamlines that tend to have crystal monochromators, so they're providing way too much monochromaticity; multilayer monochromators could provide the required monochromaticity. We're using zone plate optics, which are really inefficient, especially at hard X-ray energies. And in most cases, people are probably overexposing the sample and not using a Poisson noise model. So today at the APS, where we also don't have that 100-fold gain in brightness, we're well off this curve. I think a beamline like, what is it, ID16 at the ESRF, where they have a very efficient KB mirror pair, and they have these kind of fast frame rate detectors, which are based on phosphors though, which are not very efficient: if you could combine that with nice direct detection in silicon, I think that could be a beamline where you could get closer to this. But we're a long way off; I don't have a good number, but probably a thousand-fold away from this right now. Okay, thank you. Chris, you were mentioning that the detector development is actually crucial here.
I can imagine building a beamline a little more efficiently, increasing the bandwidth and improving the optics. But what about the detector? I mean, there are some developments of detectors for XFELs. How is that going? Well, you know, the XFELs are the only people really having the direct need right now to push this. Of course, they have rather big pockets compared to kind of individual beamlines. But still, you know, at LCLS they talk about eventually getting up to 100 kilohertz, and that they could go to megahertz. But it's not been kind of a central place on the roadmap, to my understanding. And I'm really not 100% sure of what's happening lately at European XFEL with their detector developments. You know, being able to capture data from every pulse at European XFEL would drive you towards megahertz, and LCLS-II would drive you towards megahertz. But I don't think there's a real well-funded effort on this yet. And so that's why I'm kind of trying to push this a little bit: what we all need, both synchrotrons and XFELs, is high frame rate detectors. And by the way, I didn't include a citation here, but there are a couple of papers that talk about how to do data compression in ptychography, both in the GPU reconstruction code of the Swiss Light Source, and we have one where we talk about doing it on the detector chip. And we're right now in the midst of a study with Fermilab about doing other schemes, like PCA-based compression, on the detector chip. And so that is going to be something you also need, to minimize the demands on the data channels coming out from these detectors. Chris, may I ask you to use the chat to share any links you think may be interesting for this community while Saugat talks? So if you want to point to any other publication, that would be really appreciated. You could also send a link to your book; I don't think there's anything wrong with that, so please do. This is really a sharing place. So Manuel first, and Pablo after; please, Manuel. Thank you. So thanks Chris, it was a very great talk, and very inspiring indeed; inspiring and also scary, I guess. So I was very curious: you put up the example that, in an ideal world, you can get this one millimeter cube at 20 nanometer resolution in tissue. What would be the data rate needed to achieve that? Yeah, so let me get to the right slide here. For a biological tissue, for the thicker sample, it's megahertz, and for a thinner sample, it's 100 megahertz; it's this pixel time here. Now, again, you can play trade-offs. This pixel time assumes that a pixel is acquired in one frame, which of course you really don't need to do; you can get many pixels in one frame. And so it goes into the trade-off of whether you use a bigger spot or a smaller spot. So I think there's some wiggle room here, but it does imply getting into this megahertz range of frame rates for detectors. I think beyond that, you'd have to do more specific modeling of a specific experiment to really get at that. Does that give you a hint? Yes, I wanted to just clarify: so with megahertz, you would mean like the number of resolution elements per second? If we reach this speed, then we would be able to do that. Yeah. Okay, thanks. For the thicker samples. It's surprisingly smaller than I would have expected somehow, but thank you.
Well, again, if you look at a thinner sample here, we're talking about 100 megahertz. Yeah. Okay, good. Thank you. Yep. Thank you; Pablo is next. Thank you for the nice talk, Chris. I have some questions related to high energies. I mean, if I'm not mistaken, in your presentation you mainly focused around 10 keV, and you discussed also up to 30. And there are these famous papers from the crystallography side saying that coherent imaging, depending on the sample size, could be much more interesting around 30 to 40 keV. Could you comment? Why don't you go there, or what do you think, from the coherent imaging side, the challenges at 30 keV are? So, as you go to higher energy: the crystallographers have made the statement that, well, it comes down to the X-ray refractive index, right? The scattering, the phase modulation that produces scattering, goes like the real part of the refractive index, which tends to drop off as lambda squared, whereas the absorption goes like the absorptive part of the refractive index, which tends to drop off as lambda to the fourth. And so that tells you you should go to higher and higher energies. But the problem is you still need to get some scattering signal. So even though the scattering-to-absorption ratio gets better, you lose scattering, and you need more and more photons, and you just don't have time to collect that many photons. Again, the coherent flux drops off like lambda squared. So, if you see this plot in front of you here, it does go out to 30 kilovolts, and you can see that you actually need more and more photons, which takes more and more time, and so it becomes less and less favorable. And, I don't have it here, but we have dose plots that look quite similar. And what you see in these calculations is that the dose sort of levels off at high energies: as you go up in energy, you don't really gain any more in dose, because you need so many more photons. So that's how we look at higher energies. So to us, the optimum is always to be where the transmission through the overall sample is about 1/e; that's gonna be about where you're best. So, okay. So you always look at absorption; you don't care that much about the dose, because you lose the efficiency because of the scattering power. Right. Okay. Thank you. Then I see next is Saugat. Did you raise your hand? Yeah, I have one question for Chris. Do you know what the failure modes for multi-slice ptychography are, like what are the scenarios where it doesn't work? That's a good question, and I don't think it's been explored as much as it could be yet. So for example, what are the monochromaticity requirements for multi-slice? Do you need to be more and more monochromatic, the more slices you need to reconstruct? I suspect you do. I think it's probably rather easy to show that you do, but I don't know, and maybe someone else who's on the talk does know. That would be one thing. And of course, you know, the more demands you're putting on the reconstruction, the less tolerant you are of recovering any other parameters, like scan position errors. The more information you demand from your reconstruction, the less you can tolerate refining other parameters. But I think it's an underexplored topic. All right, do we have any idea how many slices we can reconstruct?
I mean, if you talk about slices that are pretty far apart, again, I think probably monochromaticity and also spatial coherence come into play, because you need to be able to propagate the wave field between those two planes, and that implies a degree of monochromaticity. Okay, so this is a good source of ideas for papers, I guess. Well, I graduated, Chris. You're gonna have to find an undergrad student. Yeah. Okay, so I don't see any other burning questions. We can still have some more discussion at the end. So then it's my pleasure to introduce Saugat. And yeah, we already said something about you before: you've been working in the department of computing, and now you are also working at the APS, right? You're taking a new position. So, yeah, the floor is yours, Saugat. Okay, yeah. So I'm going to talk about a big portion of my PhD work, where we used automatic differentiation for ptychography, using first- and second-order reconstruction methods. In this talk, first I'm just gonna talk about optimization-based phase retrieval for ptychography, and then try to give a good reason for why automatic differentiation is of interest for someone doing ptychography, or phase retrieval in general. Then I'm gonna talk about what the challenges are when we do phase retrieval with first-order algorithms. And then, yeah, some work we did developing a matrix-free Levenberg-Marquardt algorithm for ptychography. Okay. So to start with, this should be a familiar idea: representing phase retrieval as an optimization problem, where you have a complex-valued signal that you want to reconstruct, and you propagate it, maybe far field or near field, detect it using a pixelated detector, and measure real-valued intensities. So the propagation matrix can be a Fourier transform or a near-field propagator; the detected intensity measurements are real-valued. So the optimization problem is: based on the current guess of the complex signal, and based on the forward model, we can calculate an expected intensity. And then we can define a metric that calculates the difference between the expected intensity and the actual intensity. And what we want to do is find the value of the current guess of the signal that minimizes this metric of interest. Yeah. So this is our optimization problem: find the value of z that minimizes some metric between the expected measurements and the actual measurements. Yeah. And one way to do that is, I think Chris mentioned, the Gaussian noise model; that would be one way to define the metric between the expected intensities and the measured intensities. And then here c(z) is a regularizer, a constraint, something else. A very popular way to solve these kinds of optimization problems is gradient descent, which I assume everyone is familiar with. Gradient descent is an iterative minimization method where, at every step, you move in the negative of the gradient direction with some step size. Because our loss function, our error metric, the distance between the measured and the expected data, is a real-valued number, but the variables we're optimizing are complex-valued, we need to use something called a Wirtinger gradient.
In practice, for analytical purposes, if you want to do the derivations, it's convenient to use this definition of the Wirtinger gradient, where you're taking the derivative with respect to the complex conjugate of the variable. But the basic idea behind the Wirtinger gradient is that if you want to solve for the complex variable z, you need to work with two variables: either the complex variable z and its complex conjugate, or the real part and the imaginary part of the complex variable. So for analytical calculation, the conjugate form is very convenient, but in practice, as long as we know how to work with the gradient expressions, we can use either that expression or, as we tend to do in our automatic differentiation code, take the derivative with respect to the real and imaginary parts of z. And yeah, as I've mentioned, this is an iterative method, and this is how you would perform phase retrieval. And even a lot of the popular, really old-school phase retrieval algorithms can be put in this form: the error reduction algorithm can be represented as a gradient descent algorithm; hybrid input-output is not exactly a gradient descent method, it's more like a min-max method, but a lot of these approaches can be represented using this kind of optimization framework. So the optimization framework is a fairly general framework. And in particular, we are concerned with the non-convex optimization problem; we're not going to talk about convex optimization. Again, everyone here is familiar with ptychography, so this is the example that I'm going to be talking about, where you have focused illumination coming in and hitting the sample, with overlapping probe positions, and you have the intensities, and you might also have an additive background term. In particular, I'm going to look at two types of ptychography reconstruction problems. One is when we know what the probe profile is, what the illumination looks like when it hits the sample. And the other is when we don't know what the illumination looks like when it hits the sample. The second problem is generally called the blind ptychography problem in the literature. The first problem is referred to simply as ptychography, and it's an example of something called the coded diffraction pattern problem in the general applied mathematics and phase retrieval literature. So this is going to be the problem of interest for the rest of the story: ptychography. Okay, so next we're going to talk about automatic differentiation. Well, for ptychography, what we want to do is, again, minimize the loss metric that we define to measure the distance between the measured data and the expected data. And this is the Gaussian noise model for the ptychography case in particular. And if you calculate the gradients, then, the notation here is a little bit off, but the gradient per probe position comes out to be something like this in the ptychography case: this is the gradient with respect to the object, and this is the gradient with respect to the probe. And this is a fairly complicated expression, and it's not trivial to calculate.
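Jumping slightly ahead to the automatic differentiation idea, here's a minimal sketch of the real/imaginary-part handling of the Wirtinger gradient in PyTorch, for a single far-field measurement. The probe and measured intensities are stand-in tensors, not real data:

```python
import torch

# One far-field diffraction measurement: I = |F(probe * obj)|**2.
# Optimize the object as real + imaginary parts, one of the two
# equivalent ways of handling a real loss of complex variables.
obj_re = torch.randn(64, 64, requires_grad=True)
obj_im = torch.randn(64, 64, requires_grad=True)
probe = torch.ones(64, 64, dtype=torch.complex64)   # known illumination
I_meas = torch.rand(64, 64)                         # stand-in data

obj = torch.complex(obj_re, obj_im)
I_pred = torch.fft.fft2(probe * obj).abs() ** 2
loss = ((I_pred - I_meas) ** 2).sum()               # Gaussian noise model

loss.backward()          # autograd fills obj_re.grad and obj_im.grad
grad = torch.complex(obj_re.grad, obj_im.grad)      # assemble if desired
```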
Now, if we want to amend the forward model, maybe go from the far field to the near field, then the only way we can do that in the closed-form approach is to derive these closed-form expressions all over again. If you want to change the loss function, and go from the Gaussian model to a Poisson model or a mixed Gaussian-Poisson model, or if you want to introduce regularizers, or if you want to optimize for probe position correction: in this optimization framework, the only way to do that is to recalculate the gradients for all of these different scenarios. And that becomes difficult and tedious pretty fast. And yeah, ideally we want to avoid that. So then the question is, how can we calculate these gradients automatically and efficiently? The solution, spoiler alert, I'm going to argue, is automatic differentiation. But the key idea that goes into it is that derivative calculation is a mechanistic process. Whether we're calculating it by hand or using a computer for it, the basic approach is to repeatedly use the chain rule of differentiation. And the chain rule is based on this idea of elemental differentiability: any function we define, even the complicated loss functions we define, is composed of individual elemental functions that we know how to differentiate. So in this simplified example, we have sines and cosines of x. We know how to take the derivative of x: it's one. We know how to take the derivative of cosine: that's negative sine. We know how to calculate the derivative of sine: that's cosine. So when we want to calculate the derivative of this more complex expression, what we do is use the chain rule, and if we did it manually, we would calculate the closed-form expression for the derivative. But in the automatic differentiation procedure, the idea is that we do not calculate the closed-form expression; we keep track of these individual derivatives. First we evaluate the function of interest, and at each point we store the intermediate values in memory. And then we calculate the derivative; this is called the backward pass. And when we calculate the derivative, we use those stored intermediate values from the forward pass, and we again calculate the derivative stepwise, without ever accumulating a closed-form expression. This is completely numerical; we just perform the calculations to combine the stored pieces using the chain rule. Another popular way to calculate derivatives automatically in a computer is the finite difference method, where you approximate the derivative. The big difference between the finite difference method and automatic differentiation is that in automatic differentiation, you're calculating exact derivatives: the only approximation is machine precision. So the gradient value you obtain is as accurate as you could get even with the closed-form expression. That's automatic differentiation.
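A minimal sketch of that exactness claim, on a toy scalar function built from elemental pieces (the function itself is made up for illustration):

```python
import torch

# Exactness of reverse-mode AD versus a finite-difference estimate.
def f(x):
    return torch.sin(x) * torch.cos(x) + x ** 2

x = torch.tensor(0.7, requires_grad=True)
f(x).backward()                      # backward pass: chain rule over
ad_grad = x.grad.item()              # stored forward-pass values

eps = 1e-4                           # finite difference: approximate,
with torch.no_grad():                # and sensitive to the choice of eps
    fd_grad = ((f(x + eps) - f(x - eps)) / (2 * eps)).item()

exact = (torch.cos(2 * torch.tensor(0.7)) + 2 * 0.7).item()  # closed form
print(ad_grad, fd_grad, exact)       # AD matches the closed form
```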
Now, some historical notes. For the phase retrieval application, as far as I know, the first use was proposed by Jurling and Fienup in 2014. In that paper, they actually comment on how automatic differentiation is difficult to use because there's no software available for it, so they present the entire mathematical framework that you need to set up a basic automatic differentiation procedure. This is in 2014. But even in 2014 there were already some tools, like Torch, I think; then around 2015 TensorFlow came out, and then PyTorch came out, and we just had this explosion of deep learning methods. So now there are multiple different platforms which can perform automatic differentiation in high performance computing settings, on GPUs and tensor processing units. So we had this huge change in landscape between 2014, when that paper came out, and 2017, when Youssef Nashed published a paper that used TensorFlow for ptychography. So now suddenly using automatic differentiation is extremely easy, and there are tons of tools available that can handle all of this. That was one line of work, but much later I also found out that there was already some work on using automatic differentiation in the electron microscopy literature. In this case, though, the perspective they used was to describe the multi-slice electron scattering model as an artificial neural network, which is exactly the same idea as using automatic differentiation; it's simply that in this perspective they called it backpropagation. But because they're representing it as an artificial neural network, this becomes a little bit of a limiting perspective; it's like you're focusing on the tree and not on the forest. So it didn't find very wide application, but it was there. The first paper there was by Van den Broek in 2012, as far as I know, and then Kamilov in 2013 as well. Okay, so again, going back to the question of why we would want to use automatic differentiation for phase retrieval: there's no need to manually write the gradient expression, which decouples the forward model from the optimization. What I mean by this is that we don't need to be thinking about exactly how we're going to optimize this model when we define the forward model, and vice versa. If you want to use a second-order method, the optimization technique is just independent of the forward model. This means that we can easily change the forward model. We can introduce new optical elements. We can change the propagation matrix. We can change the loss function to introduce other noise models. We can introduce regularizers without having to think about how to derive their gradients. Two very popular regularizers are the ℓ1 regularizer, which encourages sparsity, and the total variation regularizer, which encourages smoothness. And if we want to optimize new experimental parameters, again, we do not have to grind through the derivatives. So I want to particularly highlight this paper that Ming put a lot of work into; I think it's a very comprehensive paper that we published recently. And yeah, this paper showcases all of these applications, as I'm going to talk about in a little bit.
But yeah, the change between 2014 and now is that there are now all these ML software frameworks that are heavily optimized and can be used directly on CPUs, GPUs, TPUs. On top of that, the automatic differentiation back ends of these frameworks are also developing really fast, so things that were not possible two years ago are possible now. Examples of this could be combining forward models, or combining optimization with partial differential equations. There are many avenues that we haven't really explored, and again, this field is developing so fast that there's just a lot we can explore in the future. The big drawback with the particular kind of automatic differentiation that we talked about, reverse-mode automatic differentiation with backpropagation, is that you need to store all of these intermediate values. So there's a significantly increased memory consumption. This is the trade-off we have to think about; this is the big trade-off. Okay, yeah. So in the first paper I published on the automatic differentiation approach, we looked at automatic differentiation for ptychography. Before that, Youssef Nashed, Chris and some others had published a paper on AD for ptychography, but that was more of a proof of concept that did not rigorously analyze the application of the AD framework. In this paper, we did that. We showed that the derivatives we calculate are exactly what we expect, and we also showed that this is very easy to extend to other forward models. The key idea that we use in this paper, and this is the key idea for automatic differentiation in general, is that all we define is the forward model: we define the object, then we calculate the illumination at each position, we define how the far-field detection works, and then, yeah, we define the intensities and we define the error function. This is all we do in the forward model, and the software automatically uses the chain rule of differentiation to calculate the gradients that we need. And this is just to showcase that changing the loss function, in this case going from the Gaussian noise model to the Poisson noise model, is as easy as changing this expression here. Changing the propagation model is as easy as changing this expression here. This is the power of the automatic differentiation approach. And we showed this in a couple of different ways. We used the same basic framework, with these small changes in the code, to also reconstruct near-field ptychography, both probe and object in the near-field ptychography case. And we also looked at multi-angle Bragg ptychography, where now you're reconstructing a 3D structure and not just a 2D structure, and you're looking at the Bragg geometry and not the transmission geometry. This is our paper. And in collaboration with Ming, we also published papers that looked at the beyond-depth-of-focus type of tomography, where we are combining this near-field propagation, as Chris talked about, multi-slice propagation, to reconstruct the 3D object. Chris talked about this already. But yeah, we showed that all of this is possible with only small changes; well, the multi-slice ptycho-tomography part requires significant changes in the code, but otherwise the basic idea of the optimization remains the same in all of these different approaches.
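A minimal sketch of that "only the loss line changes" point, with stand-in tensors for the predicted and measured intensities:

```python
import torch

# Swapping noise models under AD: only the loss function changes;
# the gradients come from autograd either way.
def gaussian_loss(I_pred, I_meas):
    # least-squares (Gaussian noise) metric on the intensities
    return ((I_pred - I_meas) ** 2).sum()

def poisson_loss(I_pred, I_meas, eps=1e-9):
    # negative Poisson log-likelihood (up to a constant in I_meas)
    return (I_pred - I_meas * torch.log(I_pred + eps)).sum()

obj = torch.randn(64, 64, dtype=torch.complex64, requires_grad=True)
I_meas = torch.rand(64, 64)                     # stand-in measurement
I_pred = torch.fft.fft2(obj).abs() ** 2
loss = poisson_loss(I_pred, I_meas)   # or gaussian_loss(I_pred, I_meas)
loss.backward()                        # identical call for either model
```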
And so one thing I mentioned earlier is that we can arbitrarily change the loss function, and that we can optimize new elements in the model, like position and tilt correction. Ming showed this again really nicely in the paper published recently, where the test case is multi-distance holography. In our simulation, we introduce errors in the hologram distances and also in the tilt orientation of the detector; collectively, we can call these affine errors, because these errors can be represented with a single affine matrix. On top of that, we also introduce shot noise, Poisson noise, into the simulated intensities. And okay, so first, what I want to show here is: if we look at the reconstructions without any correction whatsoever, the reconstruction just doesn't work at all. If we use only the distance correction, the longitudinal z-distance correction for the holograms, then our reconstruction is not very good; this is with the LSQ loss, which is the Gaussian noise model. So this reconstruction is not very good in this case. Now, if we introduce both distance and tilt correction, then we get a very accurate reconstruction in the noise-free case. The highlight here is just that the software can easily incorporate both the hologram distances and the orientations into this refinement. On top of the noise-free case, we also looked at different noise models; here, we looked at the difference between the Gaussian noise model and the Poisson noise model. There's no discernible difference in the noise-free case, which is what we expect, but there is a difference with noise, and it depends on the fluence. All of these reconstructions use, as a default, the affine correction; so in these reconstructions, we are already correcting for both the tilt and the hologram distance. And this shows different cases for different fluences and different noise models, and also with the application of the total variation regularizer. You can see that when you apply the total variation regularizer with the Gaussian noise model, the reconstruction is much better than either of the reconstructions without it. But the key highlight here is that we're using a single modular software framework to do all of this. Another thing I mentioned was that AD is HPC-ready, or the AD frameworks tend to be HPC-ready. This was demonstrated by Youssef in his paper in 2017, where he compared sync and async variants: he divided a large object into subdomains, and then used the ePIE algorithm to reconstruct each of the individual subdomains. And sync and async in this case just refer to how often the solutions of the subdomains are synchronized; in the async case, the solutions are not synchronized. Youssef had spent a very long time carefully optimizing the ePIE code he used for this, and it was very carefully optimized for multi-GPU use.
And when he coded up the equivalent framework using TensorFlow, without doing this extremely careful GPU coding, he found that the scaling of the TensorFlow reconstruction was actually better than that of the very careful, customized code. There's so much effort that these companies, TensorFlow is from Google, PyTorch is from Facebook, put into their GPU back ends and their HPC back ends, that us putting in a lot of extra work to try to develop our own code base for this kind of multi-GPU, multi-CPU reconstruction is often unnecessary and almost always suboptimal. This is what that work showed, and you shouldn't be very surprised by that. Another perspective on AD, or another, I guess, use case: in the multi-slice propagation, what we do is we simulate a large number of slices, and we don't actually obtain a closed-form expression for the forward model. If we have a very large number of slices, we just write a program that calculates the propagated exit wave at the very end; it just keeps track of the illumination at each slice, and propagates slice by slice by slice. We write this program; so the forward model is a program. And automatic differentiation, by definition, is a way to convert a program that calculates the forward model into a program that calculates the gradient. Now, if we wanted to incorporate this into the optimization by hand, for this case we cannot derive a closed-form expression anyway, so what we would eventually end up doing is writing a program to calculate the gradient, to do this backward propagation. And by definition, automatic differentiation is just a way in which the computer does this automatically. So this is the power of automatic differentiation. And this is actually the basic idea that makes automatic differentiation so attractive for the neural network case, where you might have many different layers, and you cannot write either the closed-form forward model or the closed-form gradient. So all you can do is work in terms of these programs. Another perspective, another argument for the use of automatic differentiation, is that if we want a modular software framework that addresses a variety of forward models, that solves for the object and probe variables, and that can also correct for translation, tilt, rotation angle, et cetera, then one approach would be to write a closed-form expression for each of these combinations, which is just completely unrealistic and extremely tedious. So if we wanted to write this framework by hand, we would try to think of a modular approach that uses the chain rule intelligently to combine the gradient calculations; and that is exactly what automatic differentiation is. So this is another argument for why automatic differentiation is an interesting approach for the future. Now, the applications that I've shown so far of automatic differentiation phase retrieval, or gradient-based phase retrieval, all use first-order algorithms, in that they only use the gradient information and do not use any higher-order derivative information for the optimization. This is a little bit limited, and a little bit difficult, because when we do gradient-based phase retrieval, we have to define a step size.
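Before getting into step sizes, here's the "forward model is a program" point in code form: a minimal PyTorch sketch with a stand-in propagator and stand-in data, where the multi-slice loop is the forward program and reverse-mode AD mechanically produces the gradient program:

```python
import torch

# "The forward model is a program": a multi-slice loop in PyTorch.
# There is no closed-form expression to differentiate by hand, but
# loss.backward() mechanically yields the gradient program.
n, nslices = 128, 20
slices = torch.randn(nslices, n, n, dtype=torch.complex64,
                     requires_grad=True)
propagator = torch.exp(1j * torch.rand(n, n))   # stand-in transfer function
I_meas = torch.rand(n, n)                       # stand-in measurement

wave = torch.ones(n, n, dtype=torch.complex64)
for k in range(nslices):                        # modulate, then propagate
    wave = torch.fft.ifft2(torch.fft.fft2(wave * slices[k]) * propagator)

loss = ((wave.abs() ** 2 - I_meas) ** 2).sum()
loss.backward()             # reverse-mode AD walks the loop backwards
print(slices.grad.shape)    # gradient w.r.t. every slice at once
```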
This conversion of programs is also the basic idea that makes automatic differentiation so attractive for the neural network case, where you might have many different layers and you cannot write either the closed-form forward model or the closed-form gradient; all you can do is work in terms of these programs. Another argument for the use of automatic differentiation is that if we want a modular framework, one that addresses a wide variety of forward models, that solves for the object and probe variables, and that can also correct for translation, tilt, rotation angle, et cetera, then one approach would be to write a closed-form gradient expression for each of these combinations, which is completely unrealistic, or at least extremely tedious. If we wanted to write such a framework by hand, we would try to think of a modular approach that uses the chain rule intelligently to combine the gradient calculations, and that is exactly what automatic differentiation is. So this is another argument for why automatic differentiation is an interesting approach for the future.

Now, the applications of automatic-differentiation, gradient-based phase retrieval that I have shown so far all used first-order algorithms, in that they only used the gradient and did not use any higher-order derivative information in the optimization. This is a little bit limiting and a little bit difficult, because when we do gradient-based phase retrieval we have to define a step size. This is the very basic steepest-descent approach, where we use a step size. We can analytically calculate a conservative choice of step size, the inverse of the Lipschitz constant, and ePIE, for example, uses this kind of step size; but using this conservative choice, as in the ePIE algorithm, leads to very slow descent and very slow optimization. So if we want faster optimization, we have some options. One option is to use an accelerated method, and the accelerated method we are using here is the momentum method, which is a very popular method in the optimization literature and particularly in the machine learning literature. The momentum method is much faster, but you introduce an extra parameter, the beta parameter, on top of the step size. There are other accelerated methods as well; an extremely popular one in the optimization and machine learning literature right now is the Adam optimizer, which also introduces extra tuning parameters. The difficulty is that once we introduce these extra tuning parameters, the optimization becomes very sensitive to their values. And even in the basic steepest-descent case, once we move beyond the ePIE forward model to more complex forward models, it becomes harder and harder to define the conservative step size, so even then we have to tune the step size anyway. In the end, our reconstruction becomes very dependent on these tuning parameters. For example, in our paper we looked at ptychographic reconstruction with the Adam optimizer, and we found that for different photon counts the ideal values of these step sizes were very different. A step size that works for one photon count and one loss function might not work for another; here the color scale shows the reconstruction error after a fixed number of iterations, and a step size that works fine for the low photon count leads to very slow optimization, or no optimization at all, for a larger photon count. So our optimization procedure depends on this kind of tuning, which is difficult and computationally very expensive, and this is a big challenge not only in our work, not only in the inverse-problems literature, but in the machine learning literature as well. In effect, as scientists, we are stuck between a rock, which is slow optimization, and a hard place, where we have to spend a lot of effort tuning these parameters.
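Here is a deliberately tiny toy example, on a stand-in objective rather than a real phase retrieval problem, of the two knobs being discussed: the fixed step size of plain steepest descent, and the extra beta parameter that the momentum method adds. Whether either run converges at all depends entirely on how those knobs are set, which is exactly the tuning burden described above.

```python
import torch

target = torch.randn(32)

def run(step_size, beta=0.0, iters=200):
    x = torch.zeros(32, requires_grad=True)
    velocity = torch.zeros(32)
    loss = None
    for _ in range(iters):
        loss = (torch.sin(3.0 * x) + 0.5 * x - target).pow(2).sum()  # stand-in non-convex objective
        g, = torch.autograd.grad(loss, x)
        with torch.no_grad():
            velocity = beta * velocity - step_size * g   # beta = 0 recovers plain steepest descent
            x += velocity
    return loss.item()

print(run(step_size=0.01))              # plain gradient descent
print(run(step_size=0.01, beta=0.9))    # momentum: usually faster, but one more knob to re-tune
```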
So what is the solution? How can we achieve fast and robust optimization without doing this kind of tuning? One idea is, instead of trying to define sophisticated strategies for the tuning, to use a method that produces well-scaled directions, and we can do that by using higher-order derivatives, for example with the Newton method or quasi-Newton methods. To understand what we mean by second-order methods, we can go back to a very basic picture of even first-order optimization.

When we do even first-order optimization, what we are doing is this: to calculate a step for a cost function g at a particular point, we approximate g with its second-order Taylor expansion. Here we have the zeroth-order term, the first-order term, and then the second-order term, where M is the matrix of second-order derivatives. In this graph I am looking at a very simple function, where g is the original function. Let's say we have no information about the second-order derivatives at all; then we substitute the identity matrix for M, and we get the approximation shown by the red curve. When we calculate a descent step at this point for g, what we are doing in effect is taking the minimization step for this quadratic model. In this case, the minimization step would take us here, and that is in fact the steepest-descent step with a step size of one. But that is not optimal, because it does not actually lead to a decrease in the original function; it leads to an increase. So what we have to do is tune our approximation: we have to scale the identity matrix appropriately so that we find a step size that leads to a decrease. That is what we are doing in a basic first-order method, a basic steepest-descent approach: we approximate our function by a quadratic, and we calculate the scaling of that quadratic that gives us a descent direction.

Now, what if we use the exact second-order approximation, the exact matrix of second derivatives, which is the Hessian? That step is Newton's method, and as we can see, Newton's method is well-scaled, in that it takes us in a decreasing direction without any further tweaking; that is just a natural output of the Taylor expansion. But Newton's method is also a little bit limited when we are working with non-convex functions, because the Hessian matrix is not always positive semi-definite. If our function is not convex, if we have some kind of feature like this, then starting here, Newton's method might take us in the increasing direction, toward the closest minimum or maximum, the closest critical point of the quadratic approximation at this point. That is not desirable. One solution is to use the Gauss-Newton matrix, which is a positive semi-definite, convex approximation that replaces the Hessian in Newton's method. In the literature, the Gauss-Newton method originally applies only to least-squares optimization problems, that is, to the Gaussian noise model, but we generalize this a little bit further to apply to other noise models.
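In the notation used on this slide, the local quadratic model and the resulting step can be summarized as follows; this is just my transcription of the standard textbook expressions, with M the curvature matrix and g the cost function.

```latex
\[
  g(x + p) \;\approx\; g(x) \;+\; \nabla g(x)^{T} p \;+\; \tfrac{1}{2}\, p^{T} M\, p ,
  \qquad
  p^{\ast} \;=\; -\,M^{-1} \nabla g(x) .
\]
```

Choosing M = I gives the unit steepest-descent step, which then has to be rescaled; choosing M equal to the Hessian gives Newton's method; and choosing M = J^T H_L J, with J the Jacobian of the forward model and H_L the curvature of the loss, gives the (generalized) Gauss-Newton step, which stays positive semi-definite as long as the loss is convex in the model output.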
Then the question that comes up is: if Newton's method, if second-order methods, can produce well-scaled directions without any further tweaking, why don't we just use them all the time? The problem is that the matrix M is an n-by-n matrix of second-order partial derivatives, so the number of calculations and the storage required to compute and hold this matrix grow quadratically. From one perspective, quasi-Newton and accelerated first-order methods can actually be interpreted as attempts to approximate this matrix, to approximate the Hessian, without ever calculating those n-squared terms. For small-scale optimization problems, people use second-order methods all the time, because at that scale it is a perfectly reasonable approach: if we have 10 variables to optimize, we have 100 elements in the matrix, which is fine. But once we go to any large-scale problem, this becomes completely unreasonable. The solution we are pursuing is a class of methods called matrix-free methods, which is fairly new in the literature. The idea is that instead of calculating the full matrix of second-order derivatives, we substitute a function that calculates a matrix-vector product without ever forming the full matrix. What does that mean? In the Gauss-Newton method, this is the optimization step we take, where g is the gradient, G is the Gauss-Newton matrix, and p is the step we calculate: we would form this matrix, invert it, and multiply the inverse with the gradient vector. If we just multiply both sides by G, we get a linear system, and for linear systems there exist plenty of methods, like the standard conjugate gradient method or the GMRES method, which can calculate the desired step p without inverting the matrix. Not only that: if we have a way to calculate the matrix-vector product directly, without ever calculating the full matrix itself, then we can still use conjugate gradient or GMRES to calculate the update direction, and that is a matrix-free optimization method. And it turns out that, particularly in the Gauss-Newton case, we can calculate this matrix-vector product using the existing reverse-mode AD machinery in TensorFlow or PyTorch without ever forming the full matrix, so that the computational and memory costs are only about four times that of a gradient calculation.
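As a sketch of how the matrix-free Gauss-Newton-vector product and the linear solve fit together, here is a generic PyTorch example; the forward model, loss, damping value, and iteration counts are illustrative placeholders, not the implementation from the paper, and the sketch assumes a real-valued parameter vector for simplicity.

```python
import torch
from torch.autograd.functional import jvp, vjp, hvp

y_meas = torch.rand(64)                        # stand-in measured data

def forward(x):                                # placeholder forward model f(x)
    return torch.sin(x) + 0.1 * x**2

def data_loss(y):                              # placeholder (Gaussian) noise model L(y)
    return 0.5 * (y - y_meas).pow(2).sum()

def ggn_vec(x, v):
    """Generalized Gauss-Newton product G v = J^T H_L (J v), never forming G."""
    y, Jv = jvp(forward, x, v)                 # J v      : action of the forward-model Jacobian
    _, HJv = hvp(data_loss, y, Jv)             # H_L (J v): curvature of the loss at the prediction
    _, JtHJv = vjp(forward, x, HJv)            # J^T H_L J v
    return JtHJv

def cg_solve(x, b, damping=1e-2, iters=20):
    """Solve (G + damping*I) p = b with plain conjugate gradient, matrix-free."""
    p = torch.zeros_like(b)
    r, d = b.clone(), b.clone()
    rs = r.dot(r)
    for _ in range(iters):
        Gd = ggn_vec(x, d) + damping * d
        alpha = rs / d.dot(Gd)
        p, r = p + alpha * d, r - alpha * Gd
        rs_new = r.dot(r)
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

# one damped (Levenberg-Marquardt-style) step
x = torch.zeros(64, requires_grad=True)
g = torch.autograd.grad(data_loss(forward(x)), x)[0]
x_new = x.detach() + cg_solve(x.detach(), -g)
```

Each call to ggn_vec costs a small constant multiple of a gradient evaluation, which is where the "about four times the cost of a gradient" figure above comes from; the conjugate-gradient loop then uses a handful of such products per update.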
While the Gauss-Newton method is fairly powerful by itself, the method we look at is the Levenberg-Marquardt method, an improvement over basic Gauss-Newton in that it is more robust. The Levenberg-Marquardt method is actually one of the go-to algorithms for small-scale nonlinear least-squares problems: if you look at optimization and curve-fitting software, at MINPACK, MATLAB, or scipy.optimize, the Levenberg-Marquardt algorithm is typically one of the options you will find there. But generally the Levenberg-Marquardt method is only applicable to least-squares problems, where you have the squared difference between expected and measured data, and the Poisson noise model, for example, does not look like that. We found that Schraudolph, around 2000, proposed a method called the generalized Gauss-Newton method, which extends the Gauss-Newton method to general noise models, to general error metrics, instead of only the least-squares model. On top of that, and I will only mention it briefly, for the blind ptychography case in particular, where we want to solve for both the probe and the object, we ideally want to apply additional constraints that limit the solution space so that we can reliably obtain a solution.

If you want to apply constraints, one way is to use a projected gradient method, and we can actually incorporate this kind of constrained optimization within the Levenberg-Marquardt method as well. So in our paper, what we did was combine the projected gradient method with the generalized Gauss-Newton method to develop a more general, matrix-free, constrained Levenberg-Marquardt approach. Going back to our original problem, we were interested in ptychography. In the paper recently accepted at Optics Express, which should be coming out soon, we looked at different noise models, and we looked at the standard object-retrieval problem and the blind ptychography problem. One of the things we were interested in was what a good second-order optimization strategy looks like: do we alternate between updating the probe and the object, like we do in ePIE, or do we calculate the probe and object updates together in the same step? We call these alternating and joint optimization. For object retrieval alone, when we are not retrieving the probe, we compared the Levenberg-Marquardt performance with some of the state-of-the-art methods in the literature: one is Nesterov's accelerated gradient, the Nesterov momentum method, and the other is the preconditioned nonlinear conjugate gradient method. What we see is that, in terms of the number of iterations needed for convergence, the second-order method converges much, much faster; it shows what is called superlinear convergence compared to the first-order methods. That is the expected result. But the important caveat, the reason we don't use second-order methods in the first place, is the computational cost: calculating these huge matrices would ordinarily make the cost per step, in terms of flops, much higher than for a first-order method. In this case, though, with our matrix-free method, we found that the Levenberg-Marquardt method had a total computational cost comparable to or better than the first-order methods; the number of iterations is different for each method, but here we are calculating the actual computational cost required to reach the solution, and that cost is as good as or better than the first-order methods. This is the object-retrieval case with the Gaussian noise model, and in this case the Nesterov momentum method and the conjugate gradient method also performed really well. But the limitation of the nonlinear conjugate gradient method is that it is very difficult to adapt for constrained optimization; I am not aware of any popular adaptations of the conjugate gradient method for optimization with constraints. And while constrained adaptations of the Nesterov momentum method do exist, they can require a lot of step-size tuning, or we have to use the Lipschitz constant for the step size, which is difficult to calculate. So when we look at the blind ptychography case, we drop the conjugate gradient and momentum methods from the comparison, because they are either not easy to adapt or, in the conjugate gradient case, not guaranteed to converge for blind ptychography.
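Schematically, the constrained (projected) Levenberg-Marquardt iteration being described combines the damped Gauss-Newton system with a projection onto the constraint set; the display below is my shorthand for that idea, not necessarily the exact update rule used in the paper.

```latex
\[
  (G + \lambda I)\, p_k \;=\; -\,\nabla f(x_k),
  \qquad
  x_{k+1} \;=\; P_{\mathcal{C}}\!\left(x_k + p_k\right),
\]
```

with the damping parameter lambda decreased when the step reduces the objective and increased (and the step recomputed) when it does not, and with P_C the projection onto whatever constraint set (support, probe power, non-negativity) is being enforced.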
So what we compared our method against is the ADMM method, which was published by Stefano Marchesini and others in 2018, and the PHeBIE method, which was published in 2015; both of these methods have guaranteed convergence for the blind ptychography case. We also compared against the ePIE method, just because ePIE is so popular in the literature. One interesting result here concerns ePIE: even with constraints, I don't really know whether ePIE has been proven to converge for the blind ptychography case, but if we don't apply any constraints, ePIE does not converge in the low-SNR case, where we have a very low incident photon count. And in this case, the nonlinear conjugate gradient method would also fail to converge without the use of constraints. The result was that for the blind ptychography case in particular, the Levenberg-Marquardt algorithm with our constrained adaptation performed extremely well, particularly when we optimized the probe and the object together instead of alternating between the probe and object updates. Interestingly, the ADMM method was very difficult: it actually requires a lot of tuning. There is a penalty parameter that needs to be tuned carefully, and it turned out that the ADMM penalty parameter was very difficult to tune for the low-SNR, highly noisy case. So it is very likely that our tuned ADMM parameter is not the optimal one; it is just difficult to tune. In general, for the low-SNR case in particular, we found that the computational cost and the robustness of the Levenberg-Marquardt method were much better than the state-of-the-art methods for the blind ptychography application.

That was the Gaussian loss model. The Poisson loss model, it turns out, is much more difficult to solve with general second-order solvers, not just the generalized Gauss-Newton method but also the basic Newton method; the Poisson loss model is very difficult. Interestingly, the nonlinear conjugate gradient method actually performed really well there, while the other first-order methods did not; even the ADMM method, as I'll show in the next slide, does not perform very well for the Poisson loss model, but the nonlinear conjugate gradient method does, and I'm not really sure why. Anyway, since the basic Poisson loss model was difficult to solve and very computationally expensive, and this is the preconditioned and the unpreconditioned Levenberg-Marquardt method, whose computational costs here are extremely high, what we did instead was design a surrogate formulation for the Poisson loss model, where we stabilize the Poisson loss with an extra term that we drive to zero as the optimization proceeds. After maybe 50 iterations, the noise model we are optimizing, our surrogate, is exactly the Poisson loss model, but this small change in the early iterations is enough to bring the computational cost down from completely unreasonable to comparable to, or somewhat higher than, the nonlinear conjugate gradient method. So the Levenberg-Marquardt method with the surrogate Poisson formulation works well for a lot of the cases we looked at, and it also works pretty well for the blind ptychography case.
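For reference, the Poisson negative log-likelihood being discussed has the standard form on the left below, and the right-hand expression is one generic way of writing a stabilized surrogate whose extra term epsilon is driven to zero over the iterations; the precise stabilization used in the paper may differ from this sketch.

```latex
\[
  \mathcal{L}_{\mathrm{Poisson}}(x) \;=\; \sum_j \Big[ I_j(x) \;-\; d_j \ln I_j(x) \Big],
  \qquad
  \mathcal{L}_{\epsilon_k}(x) \;=\; \sum_j \Big[ I_j(x) + \epsilon_k \;-\; d_j \ln\!\big(I_j(x) + \epsilon_k\big) \Big],
  \quad \epsilon_k \downarrow 0 ,
\]
```

where I_j(x) are the modeled intensities and d_j the measured counts.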
Again, in this case the PHeBIE method was not really designed for this: the PHeBIE method that I mentioned earlier uses Lipschitz constants for its step sizes, and I don't really know what the Lipschitz constants of the Poisson loss model are, so we didn't use it here; that is also the reason we didn't include the Nesterov momentum method in the earlier comparison. For the ADMM method, Stefano and co-authors analytically derived the actual steps to take at every iteration, and we are using their derivation, their exact expressions. But again, we found that for the low-SNR case the ADMM method was very difficult to tune and very computationally expensive. For the higher-SNR case with the Poisson loss model, the ADMM method and our method performed comparably, with similar computational costs; the big difference is that for the Levenberg-Marquardt method we don't need the extensive parameter tuning that we need with the ADMM method.

So in this paper, what we basically showed is that we can use second-order methods for robust phase retrieval with minimal parameter tuning, and in particular, for the Gaussian loss model the rate of convergence is really high; the Poisson loss model remains hard to optimize. The more general takeaway is that matrix-free second-order optimization is a viable approach. This is a fast-developing field, with a lot of work being done on it; even by the standards of the general optimization literature, the work we did with the Levenberg-Marquardt method is pretty much the state of the art, but matrix-free second-order optimization is being developed quickly, so there may be new developments that address something I haven't mentioned yet: the main limitation of all the methods we compared, including ePIE, is that they are batch minimization methods, in that they use the entire data set at once. For large enough data sets we can't do that; we have to use smaller chunks of the data so that it fits in memory, and we can't use the Levenberg-Marquardt method, the nonlinear conjugate gradient method, or the ADMM method in that setting. But again, this is a fast-moving field, and there are a number of up-and-coming mini-batch second-order optimization approaches as well. We could also extend this work to apply regularizers or other extensions via proximal operators, but we haven't done that yet.

Just as an aside, I also want to mention that the basic error-reduction approach in CDI is a limited version of the projected gradient method that we have incorporated into the Levenberg-Marquardt method. So if you wanted to accelerate the error-reduction method in CDI, one thing you could do is, instead of using a fixed step size, actually tune the step size or use a line search for it, which would make it the actual projected gradient method; just that would significantly accelerate the basic error-reduction approach. And then we could even use the Levenberg-Marquardt method for it, which would be a much more significant acceleration.
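To illustrate that aside, here is a small NumPy sketch in which the error-reduction update appears as a projected gradient step with a unit step size, and letting alpha be tuned or line-searched is the simple acceleration being suggested. The sketch assumes a real, support-constrained object; the projections and the toy usage are mine, not code from the paper.

```python
import numpy as np

def P_modulus(x, magnitude):
    """Replace the Fourier magnitudes by the measured ones, keep the phases."""
    X = np.fft.fft2(x)
    return np.fft.ifft2(magnitude * np.exp(1j * np.angle(X)))

def P_support(x, support):
    """Keep the (real) object inside the support, zero it outside."""
    return np.where(support, x.real, 0.0)

def er_like_step(x, magnitude, support, alpha=1.0):
    grad = x - P_modulus(x, magnitude)              # gradient of 0.5 * ||x - P_M(x)||^2
    return P_support(x - alpha * grad, support)     # alpha = 1 recovers plain error reduction

# toy usage on synthetic data
support = np.zeros((64, 64), dtype=bool); support[16:48, 16:48] = True
truth = np.random.rand(64, 64) * support
magnitude = np.abs(np.fft.fft2(truth))
x = np.random.rand(64, 64) * support
for _ in range(100):
    x = er_like_step(x, magnitude, support)
```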
However, despite all of these advances, the big challenge that remains, as has been pointed out, is that phase retrieval is a non-convex optimization problem. So even if we use the Levenberg-Marquardt method, even if we use ePIE or the conjugate gradient method, we generally only tend to go to the nearest minimum, not necessarily the global minimum. If we start from different positions, and this figure is from a paper by Stefano and co-workers, we might end up in different local minima, not all of which are true solutions to the problem. Workarounds exist, for example the hybrid input-output method is a popular workaround for the basic CDI phase retrieval case, but they are not necessarily robust, and they are also not guaranteed to get us to the true global minimum. So even with all of these advances, we are still nowhere close to actually solving the phase retrieval problem.

Okay, that was the presentation of my own work, and I just want to mention some other directions related to AD. As Chris mentioned, we have the APS upgrade and other storage-ring upgrades coming, which will give us a lot more coherent flux and so will allow much higher resolution, or higher throughput, for existing experiments; we expect that to enable a lot of new physics. AD methods come into play here because we can also incorporate, for example, full wave-propagation equations into the optimization framework: instead of multi-slice, there might be regimes where actually solving the wave equation produces a faster or better solution. That is certainly worth exploring, and with AD we can incorporate it. For the beyond-depth-of-focus case, or the general dynamical diffraction case for materials interfaces, instead of just using the projection approximation we might want to treat dynamical diffraction and solve the Takagi-Taupin equations. There, in work with people like Tao, Ross, and Martin Holt, automatic differentiation is being used to directly solve the Takagi-Taupin equations instead of using the projection approximation for BCDI, and we are hoping that incorporating AD in these ways will let us robustly image materials interfaces beyond the depth of focus. Another application of AD is to incorporate prior information: as I mentioned, Ming has incorporated total variation, and he has also used something called a deep image prior in his paper. These become easy to incorporate using automatic differentiation, but there is still a lot of work required to find which priors are appropriate and how to tune them. Hopefully, in the future, we can use some of these methods to also reliably reach the global minimum, to actually solve the phase retrieval problem, though I'm not really sure what that approach would be. A different line of work that combines automatic differentiation with phase retrieval is the machine learning approach, where one option is to use machine learning as a supplement to physics-based models: we might use machine learning to regularize the physics-based model, or we might use machine learning models to define some kind of natural-image prior. In that case the aim is to find ways to combine deep neural networks with the physics-based forward model, and this is very convenient to do with the automatic differentiation approach.
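As a toy sketch of that last combination, here is a deep-image-prior-style example in which the object is parameterized by a small network and AD backpropagates through both the network and a placeholder physics forward model; the architecture, sizes, and data are made up for illustration and are not from any of the papers mentioned.

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(16, 256), torch.nn.Tanh(),
                          torch.nn.Linear(256, 64 * 64))
z = torch.randn(16)                          # fixed latent input, never optimized
I_meas = torch.rand(64, 64)                  # stand-in measured intensity

def physics_forward(obj_phase):              # placeholder physics-based forward model
    field = torch.polar(torch.ones_like(obj_phase), obj_phase)
    return torch.fft.fft2(field).abs()**2

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    obj_phase = net(z).reshape(64, 64)       # the network output is the object estimate
    loss = (physics_forward(obj_phase) - I_meas).pow(2).mean()
    loss.backward()                          # gradients flow through physics AND network
    opt.step()
```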
Further, we might even be able to develop end-to-end deep learning methods using physics-based approximants. From the APS, I know that Mathew Cherukara has published a couple of papers doing this for the ptychography case and for BCDI. Again, one thing we might be interested in is using a deep learning model to find an approximate solution and then doing a physics-based refinement, and AD is a natural way to combine these approaches. And further on, if we want to experiment with automation, with experimental automation that tries to automatically find the spots in a ptychography scan where we need more scan points versus fewer, then we might want to use Gaussian processes or deep learning methods, and again, relying on automatic differentiation is the easiest way to implement this. Either way, this is a very fast-moving field, and the future is bright, just like our coherent beams. I want to thank my PhD advisors, Stefan and Chris; Youssef was also involved with the PhD project for a while. Among collaborators, Marc Allain has been a very important collaborator, and Sid and Ming have also been very important collaborators for me. Other collaborators include Mathew, Ross, and Doga at Argonne, along with the lab members of the Jacobsen group, and everyone else. Any questions?

Thank you so much, Saugat, for this talk. Are there any questions from the audience? Please raise your hand or unmute yourself, and you can also put your question in the chat if you prefer.

I think you've maybe not been in the field long enough, but the complexity of phase retrieval is actually, in my opinion, hindering a little bit the expansion of this method to a larger community. My feeling from your talk is that this is an effort to make it more generic and more robust, so that people can actually use it a little more blindly. So what is your feeling? Is it actually going to help people approach it? Do you think there is hope to make phase retrieval so robust that one needs less expertise?

I think that is the hope: that we can just hand a phase retrieval problem to an algorithm and it will spit out the solution. That is the hope, and at the APS, for example, that is something we are trying to work towards, either using these kinds of numerical algorithms or also using deep learning approaches. I think second-order methods are one way forward for that. An alternative would be to automate the hyperparameter search within the algorithmic framework, but typically that requires a lot more computation than the second-order method. So hopefully this kind of approach will become a standard part of a bigger package that can robustly solve the problem; I don't think we're anywhere near there yet, so this would be just one component of that kind of approach.

Do any of the other people in the audience have any comment about this, and about how easy it actually is to teach? I just want to mention one thing: I would recommend that interested people look at Ming's paper, the software paper that we mentioned, and Chris may have shared the link to it, because that really demonstrates the kind of generic approach we talked about, where it handles the tilt correction and everything. I think Manuel has a question. Yes, Manu, please. Thanks for the presentation. Well, I had one quick comment.
In particular, you mentioned that there is no closed-form solution for the multi-slice gradient. Well, maybe we have different definitions of closed form, but there is at least an analytical solution for it that is just as closed-form as the normal gradient for the single-slice case; it does rely on back-propagating slice by slice, et cetera, but I just say that as a comment. And actually, when you compute these gradients and calculate them analytically, you also learn a little bit about which parameters influence the back propagation of the error, of the difference, into the next step of the solution. So there is a bit of value in that. But of course, the calculation of the gradient for the multi-slice case was very long and painful, and we got it wrong a couple of times, et cetera. So there is a lot of value in automatic differentiation, and I think not only for ptychography but also for tomography and many other cases: when the forward model becomes complicated, having to start from scratch to calculate the gradient is very tedious. So I think there is a lot of value. And my question was: for the most part, I think you used your approach in numerical simulations, right, to study the topology of the error and how different algorithms behave, but did you get a chance to try it with real data? And the second part...

So I would point to the paper I mentioned, with Chris and Ming, that we published recently; it actually looks at experimental data as well, lots of experimental data. I don't know, Chris, do you have a link to the paper? We can share a link; it looks at all of that. I'll paste a link in the chat.

And my second part of the question was: are these methods already in use at the APS, or is this something planned for the near future, now that you are joining the APS?

I think it's planned; I think that's part of it. As far as I know, it is not incorporated into the normal workflow at the APS, so this is still a research approach. I'm going to be doing a postdoc at the APS soon, working with Mathew Cherukara, who is in the APS computational science group now, and part of the reason I'm going to be involved is to try to incorporate this into the normal day-to-day work. Ming was working there until very recently, and he was much more involved on the research side, again using this kind of method. Similarly, there are other people, for example at the CNM there is Tao, an assistant scientist now, who has been looking into using automatic differentiation to solve the Takagi-Taupin equations. So there are a lot of different branches of research work going on, but I don't think this has been fully incorporated into the day-to-day workings of the APS yet. Yeah, thanks. Oh, thank you.

I don't see any other raised hands, and I think we've been together for a couple of hours now, so it's maybe time to wrap up. I thank again Chris and Saugat for the presentations today, and everybody for staying so long. There was also a request to share the slides; maybe you could get in touch with Saugat, the speaker. Yeah, I can share the slides. Yes, through me as well. And if there is any way I can serve as a bridge within the community on this, I'll be happy to.
So again, thanks everyone and we'll see you after the summer.