3D projection is the process of rendering numerically described 3D geometry as a 2D image. In simple wireframe rendering, this involves transforming the 3D vertices of our objects into 2D coordinates, taking into account our chosen vantage point in the world, and then simply drawing straight lines between those coordinates. So, to render a wireframe cube, for example, we transform the eight 3D vertices of its corners into 2D vertices and, in our image, draw straight lines between those vertices. The trick, of course, is in how to do this transformation, taking into account the optics of cameras and the human eye, which cause distant points to converge to the center of the image. In other words, we must take into account the phenomenon of perspective. In this video, I'll explain why cameras and the human eye perceive the world in perspective and explain how to compute a perspective projection. We'll also briefly discuss the simpler alternative, orthogonal projection, which is the style of rendering seen in architectural blueprints, where distant points do not converge to the center of the image. In a later video, we'll revisit how the projection computation is more commonly described using matrices. For now, though, we'll stick to just simple algebra, geometry, and trigonometry. In traditional photography, an image is formed by light striking a piece of light-sensitive film such that the different intensities and frequencies (a.k.a. colors) of light affect the different parts of the film surface differently. In digital photography, it's the same idea, except instead of a surface of light-sensitive chemicals, we have a surface with a grid of sensors which each report a digital measure of the light striking them. These sensors are called CCD light sensors, where CCD stands for charge-coupled device. Light sensors are actually just one kind of CCD, but in the context of digital cameras, CCD usually implies a CCD light sensor. 
Now, to get an image from the world, what we cannot do is simply hold a film or light sensor array in front of the scene, like shown here. We need the light from the respective parts of the scene to hit the corresponding parts of the surface, e.g. the light coming from the top of the scene, and only that light, should hit the top of the surface, and only the top of the surface. As depicted by the white lines here, the ray of light from the point at the top of the Eiffel Tower should hit a point towards the top middle of our surface. What happens in the real world, however, is that light is bouncing all around, such that generally light from all parts of the scene hits all parts of the surface. Here we see some of the unwanted rays of light in red. Light from all parts of the scene is hitting the same point on the surface, adding up to far too much light and probably not the right frequency. This happens for all points on the surface, such that we get a blank, all-white image. This is why cameras and the human eye have lenses: to focus the light from different parts of the scene onto different parts of the film or the sensor array. The very simplest kind of lens is a pinhole lens, which is exactly what it sounds like. Punch a small hole in a box and you have a pinhole camera. As the diagram shows, light from a point at the top of the tree passes through the hole and only strikes a point at the bottom of the box's interior. Light from a point at the bottom of the tree passes through the hole and only strikes a point at the top. Effectively then, the image on the interior back surface gets mirrored upside down. This happens on the horizontal axis as well, at least from the perspective of a person standing behind the box. Light from points on the left side of the tree ends up on the right side of the box, and light from points on the right side of the tree ends up on the left side. This isn't actually depicted correctly in the diagram. 
Look closely at the smaller tree and you'll see it hasn't been correctly flipped horizontally. As you might imagine, a pinhole lens doesn't produce a great image. Making the pinhole small tends not to allow in enough light, producing a weak image, but making the pinhole larger produces a blurry image. As the pinhole becomes larger, light from each point in the scene passes through the lens in a larger cone. These cones of light strike the back interior of the camera, producing overlapping blotches of color instead of focused points. In practice, only the very smallest cameras, like hidden cameras, use pinhole lenses. Most practical cameras use a lens made up of one or more elements of glass. The idea is that the pieces of glass are shaped in such a way that all light rays from one point in the scene get refracted to the proper point on the film. Here, all light rays from the same point in the scene that reach the front side of the lens are refracted onto the same point on the film. Notice that, like in a pinhole camera, the scene is getting flipped vertically and horizontally. That, of course, isn't a big deal because we can just take our film or digital image and flip it around when we display it. Given a particular lens, the proper distance from that lens to the film is called the focal length of the lens. If we moved the film closer to the lens or further away, the rays of light from one point in the scene wouldn't converge onto the same point of the surface, producing a blurry image. A related issue is that, for detailed reasons of optics we won't get into, a lens cannot properly focus points at all distances in the scene onto the image at once, producing an effect called depth of field, in which objects in the foreground and/or background may be blurry. With certain lenses and lighting conditions, points in a large range of distances can be nearly focused. 
Probably the most famous examples of such shots are in Citizen Kane, such as this one, where the woman in the foreground is in good focus, but the men well behind her remain mostly in focus as well. In our 3D rendering, we will not account for depth of field. Effectively, our images will have infinite depth of field: all points at all distances will always be in focus. Getting back to focal length, a very important thing to understand is that lenses with different focal lengths produce different images. Here we have a longer focal length (top) and a shorter focal length (bottom). Because light through the long lens refracts at a shallower angle, fewer parts of the scene get focused onto the film, capturing a narrower portion of the scene. With the shorter lens, we get more of the scene onto the film. This is why shorter lenses are also known as wide-angle lenses. Somewhat confusingly, though, it's not common to refer to a long focal length lens as a narrow lens, even though they do capture narrower portions of the scene. The other relevant term here is field of view, which is the measure of the angle of the portion of the scene captured. The shorter the lens, the wider the field of view. Now, consider an observer standing in a hall with a floor and a ceiling that run parallel. Looking down the hall, the light rays from the floor and ceiling that reach the observer converge towards the same vertical part of the observer's vision. In fact, if the hallway were long enough, the positions of the most distant parts of the floor and ceiling would be only imperceptibly different. This is perspective: light from the scene converges on the observer at a single point, such that points further away in the scene converge towards the center of the image. With a simple round lens, light converges directly towards the middle of the image, producing a curvilinear effect, in which straight lines from the world may end up curved. 
More elaborate rectilinear lenses correct for this distortion by converging distant points separately along the x and y axes of the image instead of directly to the center. This preserves straight lines, but at the cost of artificially stretching the image at the edges. In the example here, notice how much wider the rectilinear image is, and notice how the wall panels on the left seem to get larger towards the left edge of the image even though those panels are all the same size. The distortions of both curvilinear and rectilinear lenses become more extreme for wider-angle lenses because, with a wider-angle lens, points in the distance effectively converge faster to the center. An object at a given distance from the lens converges more towards the center the wider the lens. So, the question now is which is more realistic, curvilinear or rectilinear? Well, on the one hand, curvilinear perspective better preserves the relative sizes of objects, and it arguably better reflects an idealized perspective in which light from a scene converges to a single point and so distant points converge directly towards the center. On the other hand, rectilinear simply looks right. To most people, rectilinear better matches human perception, and explaining why this is the case quickly gets bogged down in the murky philosophy and science of perception, so we'll just take it as given. Getting to 3D rendering now, what we generally want to simulate is a rectilinear projection of a virtual world. In some advanced cases, we might strive to simulate aspects of real-world cameras or the human eye, such as depth of field or some degree of curvilinear bend. We, however, are just going to keep things simple by ignoring such issues. We'll produce perfectly rectilinear images with an infinite depth of field. This, in fact, is basically the default case used in much 3D rendering, games especially. If what we're simulating isn't necessarily a camera or human eye, we can think of our task this way. 
We have a virtual world and a virtual window in that world, and we want the 2D image which a virtual observer sees looking through that window. Be clear, though, that our observer is neither a camera nor a pair of human eyes but really just a point in space, a focal point. The next thing to note in this setup is that the observer's distance to the window changes the field of view: getting close to the window widens the field of view, backing away narrows it. To determine what point from the scene should appear at a point of the image, we extrapolate a line from the observer through that point on the window until it collides with something in the scene. So here, what the observer sees at the red dot on the window corresponds to a dark point on the tree. The color at that point on the tree is the color we want to see at that point on the image. Extrapolating like this through every point on the window gets us our complete image. The question now is how to do this extrapolation from 3D coordinates. Consider this process in just two dimensions, here from a side view. The line extrapolated through the window passes through a certain point on the window but hits the green apple at a higher point. In effect, this point of the scene gets translated down to where it should appear on the window. Notably, were the apple closer along the line of extrapolation, it would require a smaller translation. Also note that the angle of the extrapolation line affects the size of the translation: the smaller the angle, the smaller the translation. Once the angle is 0, that is, where the observer sees straight through the window, the translation is also 0. So the point in the scene directly straight ahead never gets translated, no matter how close or far away it is. 
For the points that do need translation, however, the formula to find the point on the window is quite simple, derived by noting that the line of extrapolation forms two overlapping right triangles with the line extrapolated straight through the window. The value of A1 here, the distance from observer to window center, is our focal length, and it's up to us to select it when rendering. Assuming then that we have the values A2 and B2, we can find B1 by noting that the ratio of B1 to A1 equals the ratio of B2 to A2, because these are corresponding sides of two right triangles sharing the same angle. Solving for B1, we get B1 equals A1 times the quantity B2 divided by A2. Assuming the window center to be our origin, B1 is then the height coordinate on the image for our point in the scene. We can apply the exact same logic to find the point's horizontal position on the image. The only change is that this time B1 and B2 are coordinates of our horizontal axis, so finding B1 gets us our horizontal coordinate. Another way of thinking about this process is that we are squeezing the observer's field of view into a rectangle such that all the points in our field of view get squeezed along with it. Points farther back from the window and farther from the center axis get squeezed proportionally more. And again, because we're going for a rectilinear projection, we squeeze the vertical and horizontal axes separately, one before the other, instead of at the same time, which would produce a curvilinear projection. So here, this is what we end up with. Looking at the before and after side by side, note that the two red dots had the same distance from the center axis before the squeeze, but the red dot farther from the window gets squeezed more towards the center axis. Imagine, then, that those two points describe the side of a wall running parallel with the observer's direction of vision. 
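This similar-triangles formula is easy to express in code. Here's a minimal sketch in Python; the function names, and the convention that the window plane sits at z = 0 with the observer looking up the z-axis, are my own choices for illustration:

```python
def project_axis(a1, a2, b2):
    """Similar triangles: B1/A1 = B2/A2, so B1 = A1 * (B2 / A2)."""
    return a1 * (b2 / a2)

def project_vertex(focal_length, x, y, z):
    """Perspective-adjust a 3D vertex onto the window (view plane).
    With the view plane at z = 0 and the observer focal_length in
    front of it, the vertex's depth-distance from the observer (A2)
    is focal_length + z.  Each axis is squeezed separately, which
    is what makes this a rectilinear projection."""
    a2 = focal_length + z
    return (project_axis(focal_length, a2, x),   # horizontal coordinate
            project_axis(focal_length, a2, y))   # vertical coordinate
```

For example, with a focal length of 20, a vertex at (5, 10, 30) projects to (2, 4) on the view plane.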
Once we account for projection, the wall seems to converge in the distance towards the observer's center of vision. This is just like what we observe with our eyes or our cameras: looking down these train tracks, the parallel lines of the rails converge to a point in the distance. Once we have squeezed all our vertices, we have their coordinates as they should appear on our 2D image. Assuming again that the window center is the origin of our coordinate system, and that the horizontal axis is x, the vertical axis is y, and the depth axis is z, the x, y coordinates of the vertices denote their positions on the image. This, though, assumes that (0, 0) denotes the center of our image, which is actually often not the case. Recall from earlier units that 2D pixel coordinates are commonly described in terms of (0, 0) at the top left corner with the y-axis pointing down. Moreover, we don't necessarily want one world coordinate unit to have the same dimensions as a single pixel in our image. So, to account for all of this, we must translate from our 3D xy coordinate system centered on the window to a 2D xy coordinate system, which is possibly centered elsewhere and possibly of a different scale. It's a simple translation process we'll come back to shortly. Now, ideally, we're modeling a field of view that's shaped like a pyramid with its tip at the focal point, our so-called observer. Any object inside or overlapping this pyramid, here shown as a blue triangle from the side, should show up in our image. So far, though, we've thought of a projection as capturing a virtual observer's view through a virtual window, implicitly disregarding anything between the observer and the window. To render objects in front of the window, however, we can actually use the very same formula. Just like with objects behind the window, we can extrapolate a line from the observer through an object on the near side, and where that line intersects the window represents where it should appear in the image. 
The only difference is that the object vertices end up expanding away from the center axis instead of contracting towards it. Problems arise, however, when rendering objects very close to the focal point, for reasons having to do with floating point rounding error and aspects of rendering which we'll get into later. Briefly, imagine what happens in our formula as A2, the distance to the vertex, approaches 0. When we do the division, the smaller and smaller fraction that results may begin to exceed the limits of our floating point precision, producing ugly errors, especially when we start drawing filled-in polygons. Even worse, a vertex lying on the focal point itself would have an A2 value of 0, which, in our formula, would trigger a divide by 0 and thus break the code. To avoid these issues, the usual practice is to simply clip the drawn geometry with a near clipping plane, such that only geometry behind the plane gets rendered. Here, for example, this apple lies in the field of view but in front of the near clipping plane, so we ignore it in our rendering. For different reasons, it's also usual to specify a far clipping plane to cap rendering of objects past a certain distance. By rendering only objects within a certain distance, the rendering job can often be greatly simplified and hence made faster, which is of course especially important in games. Games of recent years often have the clipping distance set far enough away that it's not noticeable in most scenes, but earlier 3D games often had to set it distractingly close, using distance fog to hide the appearance of distant objects popping in and out of the world. Again, modern games still use these techniques, but usually set the clipping and fog distances far enough away to be less noticeable. By the way, the lopped-off pyramid formed by our truncated field of view is known as the frustum. Make sure to spell that correctly: there's no R after the T. 
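The clipping test itself is trivial to sketch in code. Here the near and far distances are arbitrary illustrative values, not prescribed ones:

```python
NEAR, FAR = 1.0, 1000.0   # illustrative choices; tune per application

def within_clip_planes(a2):
    """Keep only geometry whose depth-distance A2 from the focal point
    lies between the near and far clipping planes.  Because NEAR > 0,
    this also guarantees A2 is never zero, so the projection division
    is always safe."""
    return NEAR <= a2 <= FAR
```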
Now, as a matter of convenience and simplification, it's common to assume that the viewing plane corresponds with the near clipping plane, that is, to assume that it corresponds with the small end of our lopped-off pyramid. If the image plane and clipping plane are tied together, what if objects you want to see are getting clipped? How can we fix this without changing the apparent camera position and field of view? Well, we have two solutions. First, simply scale up the coordinates in your world such that everything is bigger and further apart. This would be as if the world outside your window grew but also moved away from you, such that everything through the window looks the same. Crucially, though, we scale the world relative to the focal point, not the origin at the center of the image plane. This way, some vertices may get moved to the far side of the near clipping plane, and hence show up in the rendering. Here, for example, one of the two points lies in front of the near clipping plane, and so won't get rendered. If we double the scale of the world relative to the focal point, now both points get drawn, but we otherwise haven't changed the image. The alternative solution is to proportionally scale down the image plane and focal length, effectively moving the image plane closer to the focal point without changing the apparent image. Be clear that, within a given field of view, changing the distance to the view plane doesn't change the resulting image, as long as the view plane changes size to fit the same field of view. But as mentioned earlier, we should be cautious of letting the distance from the focal point to the near clipping plane get too close to zero. Finally, let's complete the process of rendering a wireframe image. So, again, we start by defining our camera to be facing up the z-axis, with our view plane centered on the origin. Having specified a focal length, i.e. 
the distance of the observer to the origin, we then perspective-adjust each vertex in our scene. For example, given a focal length of 20, the vertex at (5, 10, 30) gets adjusted with A1 equal to 20 and A2 equal to 20 plus 30, i.e. 50. Computing for x, we plug in the x-coordinate 5 as B2, giving us a new x-coordinate of 2. Computing for y, we plug in the y-coordinate 10 as B2, giving us a new y-coordinate of 4. So, this vertex gets perspective-adjusted to coordinate (2, 4) on our view plane. Once we start filling in our polygons, we'll have further use for the z-values, but in wireframe rendering, we can ignore the z-values once we have our perspective-adjusted vertices. The last coordinate adjustment is to account for a possible difference between the image plane dimensions and the destination image dimensions. In other words, we must translate between view plane coordinates and pixel coordinates, because one coordinate unit does not necessarily equal the width or height of one pixel. Here, for example, if our view plane is 100 units wide and 70 units tall, but our destination image is 200 pixels wide and 210 pixels tall, and assuming that our pixel coordinate system is centered at the center of the destination image, then the coordinate (20, 15) on the view plane gets translated to (40, 45) in pixel coordinates. The formula is quite obvious. We observe that the ratio of the pixel x-coordinate to the view plane x-coordinate equals the ratio of the pixel grid width to the view plane width. Likewise, the ratio of the pixel y-coordinate to the view plane y-coordinate equals the ratio of the pixel grid height to the view plane height. So solving for the pixel grid x, we get that the pixel grid x equals the view plane x times the pixel grid width divided by the view plane width, and solving for the pixel grid y, we get that the pixel grid y equals the view plane y times the pixel grid height divided by the view plane height. 
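As a sketch of that scaling step, using the numbers from the example (the function name is my own):

```python
def plane_to_pixel(x, y, plane_w, plane_h, pix_w, pix_h):
    """Scale centered view-plane coordinates to centered pixel
    coordinates: per axis, the ratio of pixel coordinate to plane
    coordinate equals the ratio of pixel grid size to plane size."""
    return (x * pix_w / plane_w, y * pix_h / plane_h)
```

With a 100-by-70 view plane and a 200-by-210 pixel image, the view plane coordinate (20, 15) comes out as (40, 45).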
The last complication is that the pixel grid origin is usually not at the center of the image, but rather at the top left or sometimes the bottom left. When the origin is at the bottom left, we add half the pixel grid width to x and half the pixel grid height to y, which in this example would mean adding 100 and 105, yielding the coordinate (140, 150). When the origin is at the top left, we do the same thing, but flip our y-coordinate by subtracting it from the pixel grid height, which in this example would mean subtracting 150 from 210, yielding a y-coordinate of 60. Finally, once we have our pixel coordinates of the vertices, we get our wireframe rendering by simply drawing lines between the connected vertices. So here on the left, for example, the cube is made up of eight vertices, and we simply draw lines between the vertices which define our edges. How exactly we keep track of which vertices connect to which is just a detail of how exactly we define our polygons in our data. The only other thing to note here is that, were our projection meant to be curvilinear rather than rectilinear, we couldn't simply draw straight lines between the vertices, because we'd have to account for how straight lines should bend around the image center. A rectilinear projection spares us that problem. Because this video already runs long, I'll not go over actual code implementing this wireframe rendering, but you'll find running code examples on the site that build upon our previous 2D drawing code. There are only a couple hundred lines to it, actually, so it shouldn't take much effort to read and understand. So by now, we should understand how to render a 2D image from 3D geometry for the special case where our virtual camera's view plane is at the origin of the world and the camera is looking up the z-axis. But what if we want to render the world from another angle and position? Well, it turns out to be easiest not to move the camera in our virtual world, but rather to move the world around the virtual camera. 
Whatever camera moves we want to make away from the origin, we get the same effect by instead moving everything in the world in the inverse direction. So, for example, if we want to dolly the camera forward, closer to the apple, we could instead just move the apple the same distance in the opposite direction. Likewise, instead of moving the camera up, we could just move the apple down the same distance. The same idea applies to rotations. If we want to pitch the camera down, pivoting around the origin, we could instead rotate every object in the world by the same angle the opposite way, pivoting around the origin. Now the question is, how do we move objects and rotate them? It's an important question, not just for moving our camera, but also for moving objects in our world relative to each other, and it's something we'll look at in detail when we talk about transformations. Lastly here, recall that, for a given view plane size, longer focal lengths effectively narrow the field of view. Imagine then what happens when our focal length grows to infinity: the bounds of the field of view run parallel to the center axis of vision. What this means is that, when it comes time to squeeze our coordinates, none of them move at all. They're already all in their proper positions for an infinite focal length. This special case is called an orthogonal projection, and it has the effect that lines running into the distance parallel with our vision never converge. Orthogonal projection is the sort of projection used in architectural blueprints because, while it often makes object depth difficult to interpret in the image, it usefully cuts down on the number of visible lines in the image and preserves the relative distances between points in the 2D plane. You might assume that orthogonal projection has no place in games, but most 2D games today are modeled as layers of flat images stacked in 3D space and rendered with an orthogonal projection. 
Here, for example, this scene is made up of a few layers of flat images: a couple of layers of background depicting scenery at a few distances, then the foreground walls and floors that actually collide with the player, then the player avatar and the enemies along with gun projectiles, particle sprites, and effects, and then, on top, a layer of vines, plants, and other decorative objects. This layering in 3D and rendering as an orthogonal projection not only makes it more straightforward to construct layered scenes, it allows 2D games to utilize the pixel-pushing power of the GPU. The HUD elements here are also likely drawn as layers of an orthogonal projection, but because they're always rendered on top, in fixed positions on the screen, they're likely rendered in a separate coordinate system. This is in fact also how proper 3D games render HUD elements: the 3D world is drawn with a perspective projection, then the HUD is drawn on top with a separate orthogonal projection.
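To tie the perspective pipeline from earlier together, here's a sketch that projects a cube's eight vertices and produces the line segments of its wireframe in top-left-origin pixel coordinates. All the dimensions and the cube's placement are my own illustrative choices, and the final segments list is what you'd hand to a 2D line-drawing routine like the ones from our earlier units:

```python
FOCAL = 20
PLANE_W, PLANE_H = 100, 70   # view plane size, in world units
PIX_W, PIX_H = 200, 210      # destination image size, in pixels

# A cube straddling the z-axis, 30 to 50 units behind the view plane.
VERTS = [(x, y, z) for x in (-10, 10) for y in (-10, 10) for z in (30, 50)]
# Cube edges connect vertex pairs that differ in exactly one axis.
EDGES = [(i, j) for i in range(8) for j in range(i + 1, 8)
         if sum(a != b for a, b in zip(VERTS[i], VERTS[j])) == 1]

def to_pixel(x, y, z):
    """Perspective-adjust, scale to pixels, then shift to a top-left
    origin with the y-axis pointing down."""
    vx = FOCAL * x / (FOCAL + z)                      # perspective adjust
    vy = FOCAL * y / (FOCAL + z)
    px = vx * PIX_W / PLANE_W + PIX_W / 2             # scale, recenter
    py = PIX_H - (vy * PIX_H / PLANE_H + PIX_H / 2)   # recenter, flip y
    return (px, py)

# The wireframe as pixel-space line segments, ready for a line drawer.
segments = [(to_pixel(*VERTS[i]), to_pixel(*VERTS[j])) for i, j in EDGES]
```

Note that a cube has exactly twelve edges, and the edge-finding comprehension recovers them by pairing vertices that differ in a single axis.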