How to write a (software) 3d polygon pipeline
or : TLC (Transform, Light & Clip)

July 1, 2000 (updated July 11)
by Charles Bloom, [email protected]

I. Article
II. Notes
III. Pseudo-Code
IV. Comparison

--------------------------------------------------------------------------

I. Article

This is another review article, with some pretty slick tricks thrown in that I've learned over the years. If you're new to 3d, then you have almost no chance of organizing your polygon pipe well the first time around, but it's amazing how many old hands do it wrong as well. This is also a bit archaic, because you should just be using D3D to do all your work, since you need hardware T&L. (see note 8 at end)

The basic principle here is that on modern CPU's, you want to do things the same way a graphics chip (GPU) does them. That is : deep pipelining, no conditionals, few state changes, and very few memory touches. So, how do we do this?

All of our objects are stored in Indexed Primitives, that is either Indexed Strips or Indexed Lists. This means there is an array of verts (xyz, rgba, uv, normal) in model space, and a set of indexes which build triangles by referring into the vertex array (VB). I'll assume from now on that we only have indexed lists, since strips provide no benefit to software T&L. Depending on the maximum number of verts in the VB, you can use bytes or words for the indexes. These VB's may be compressed (see note 1). The rendering api looks like (for example) :

TLC_Primitive(
    MyVBType * Verts, int NumVerts,
    WORD * Indexes, int NumIndexes,
    Transform3d * ModelWorldTransform,
    UVMap and Texture and Lighting Parameters );

Each VB has an AABB (axis-aligned bounding box) associated with it, which absolutely contains all the verts in the VB (note that the box is typically in model space, and the frustum is in view space, so you must transform the AABB using the Model->View Transform (MVT); see later). When you tell the VB to render, it checks if the AABB is totally outside of the view frustum, and trivially rejects it when that's the case. (Actually, you should have used a broad-phase like an oct-tree to already have rejected groups of non-visible objects.)

When you tested the AABB against the view frustum, you also gathered information on which of the six planes (four sides + front & back) the box crosses. These are your clip flags (CF), and you organize them as bit fields, (1<<0) to (1<<5). Your triangles will only need to be clipped against the planes that the AABB crossed (eg. only clip against plane 'p' if bit (1<<p) is set in CF).
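To make the AABB test concrete, here's a minimal sketch of testing a view-space box against six inward-facing frustum planes while gathering the crossed-plane bits (this plays the role of the FrustumAABBTest in the pseudo-code below). The plane representation, the ordering of the planes, and the CF_* names are assumptions of this sketch, not a fixed API :

// Test a view-space AABB against the six frustum planes (inward-facing :
// a point is inside a plane when dot(n,p) + d >= 0).  Returns true if the
// box is entirely outside some plane (trivial reject).  Otherwise *pCF gets
// one bit set for every plane the box crosses; triangles only need to be
// clip-tested against those planes.

struct Vec3  { float x, y, z; };
struct Plane { Vec3 n; float d; };
struct AABB  { Vec3 mn, mx; };

enum  // planes[] is assumed to be in this order, so bit p matches plane p
{
    CF_FRONT = 1<<0, CF_BACK   = 1<<1,
    CF_LEFT  = 1<<2, CF_RIGHT  = 1<<3,
    CF_TOP   = 1<<4, CF_BOTTOM = 1<<5
};

static bool FrustumAABBTest( const AABB & box, const Plane planes[6], unsigned int * pCF )
{
    unsigned int cf = 0;
    for(int p = 0; p < 6; p++)
    {
        const Plane & pl = planes[p];

        // box corner farthest along the plane normal (most inside),
        // and the opposite corner (most outside)
        Vec3 inC  = { pl.n.x >= 0 ? box.mx.x : box.mn.x,
                      pl.n.y >= 0 ? box.mx.y : box.mn.y,
                      pl.n.z >= 0 ? box.mx.z : box.mn.z };
        Vec3 outC = { pl.n.x >= 0 ? box.mn.x : box.mx.x,
                      pl.n.y >= 0 ? box.mn.y : box.mx.y,
                      pl.n.z >= 0 ? box.mn.z : box.mx.z };

        float dIn  = pl.n.x*inC.x  + pl.n.y*inC.y  + pl.n.z*inC.z  + pl.d;
        float dOut = pl.n.x*outC.x + pl.n.y*outC.y + pl.n.z*outC.z + pl.d;

        if ( dIn  < 0.f ) return true;    // whole box outside this plane : all-out
        if ( dOut < 0.f ) cf |= (1<<p);   // box straddles this plane : must clip against it
    }
    *pCF = cf;
    return false;
}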
Now we are going to process the verts. First we allocate an array of PipelineVerts (PVs), one per vert in the VB, plus some slack for the verts that clipping will create (the exact PV layout is shown later, split into PipelineVertA and PipelineVertB).

Then we concatenate the per-object Model->World transform (a parameter) and the World->View transform (a global). Each transform is a 4x4 matrix with {0,0,0,1} in the bottom row (that is, a 3x3 matrix and a 3-vector translation); all 3d vectors have an implicit {1} in the final spot when treated as 4d vectors. The result is the Model->View transform (MVT). Then we create an orthonormalized version of the MVT using Gram-Schmidt reduction.

We then transform all of the source verts in the VB (xyz, rgba, uv, normal) into view space using the MVT, and rotate all the normals into view space using the orthonormalized MVT (so that the normals stay normalized!), filling out the ViewSpace Position & Normal in the PV array. This is just a bunch of matrix-vertex multiplies. Some notes are very important here :

1. we will be doing all lighting in view space, not model space. This is because the model transform might contain shears and scales, and you can't take a spherical light through a shear and still have a spherical light!! (* new : see note 0 at end)

2. we transform all the verts in one big pass. This means that we might transform verts which are not actually used, but doing it in one pass means there are no branches, no conditionals, just a bunch of SIMD math, which we can do with SSE or 3dNow, and it's blazing fast.

3. using the orthonormalized xform for normals isn't quite right. To get totally accurate normals after a shear, you would have to recompute them from the faces. However, this is pretty irrelevant; no one can see a small error in normals. It's more important to keep your "normals" normalized! (see note 15 at end)

Note that if you are doing smooth-skin character skinning, then instead of computing the ViewSpace Position & Normal using the MVT, you would take the skinning information from the VB and blend matrices into the ViewSpace Position & Normal. You can concatenate the MVT onto your skinning matrices, so that the result of skinning gives you verts and normals in view space. Any other procedural geometry operations would go in this step as well.

When we do the transforms, we also project all the vertices into screen space. This is just a divide, which can be totally simultaneous with the matrix math on new CPUs like the Athlon/K7. We also copy the rgba and uv from the VB into the PV; since we are already reading the VB memory and writing the PV memory, this copy is free. Note that if you have a nice big L2 cache, the PV should never leave L2. The PV's are only read after being written, so a good L2 will resolve that read-after-write and return directly, flushing out to the main-memory mirror in the background (though it will never be used). We also set IsLit to false.

The final part of our per-vertex pass is computing the ClipFlags per vertex. This is a bit-flag field just like the whole-VB CF that we computed earlier from our AABB. For each vertex, we simply do :

PV.ClipFlags = 0;
if ( VB_CF & FRONT_PLANE ) PV.ClipFlags |= ( PV.ViewSpacePosition.Z <= Near_Z ) ? FRONT_PLANE : 0;
if ( VB_CF & LEFT_PLANE  ) PV.ClipFlags |= ( PV.ScreenSpacePosition.X <= 0 ) ? LEFT_PLANE : 0;
etc...

Similarly for all 6 planes. The if ()'s on the whole-VB clip flags will be totally predictable branches (if you want to get fancy, you could use a switch or self-modifying code). On the other hand, the if's to fill out the per-vertex clip flags may be unpredictable, eg. very bad for pipelining. To prevent that, we use the ?: construct, which lets the compiler use the CMOV instruction so that no branches are necessary. Note that if VB_CF is zero (eg. no clipping at all), all of these tests are simply skipped.
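Spelled out for all six planes, that branch-free clip-flag computation might look like the sketch below. Which planes get tested in view space versus screen space follows the snippet above; the names, constants, and signature are just illustrative :

enum  // same bit assignment as the AABB sketch above
{
    CF_FRONT = 1<<0, CF_BACK   = 1<<1,
    CF_LEFT  = 1<<2, CF_RIGHT  = 1<<3,
    CF_TOP   = 1<<4, CF_BOTTOM = 1<<5
};

// The outer if's test the whole-VB flags and are perfectly predictable;
// the ?: forms compile to CMOV/SETcc, so there are no unpredictable
// branches on per-vertex data.
static unsigned int ComputeVertClipFlags( float viewZ, float screenX, float screenY,
                                          unsigned int vbCF,
                                          float nearZ, float farZ,
                                          float screenW, float screenH )
{
    unsigned int cf = 0;
    if ( vbCF & CF_FRONT  ) cf |= ( viewZ   <= nearZ   ) ? CF_FRONT  : 0;
    if ( vbCF & CF_BACK   ) cf |= ( viewZ   >= farZ    ) ? CF_BACK   : 0;
    if ( vbCF & CF_LEFT   ) cf |= ( screenX <= 0.f     ) ? CF_LEFT   : 0;
    if ( vbCF & CF_RIGHT  ) cf |= ( screenX >= screenW ) ? CF_RIGHT  : 0;
    if ( vbCF & CF_TOP    ) cf |= ( screenY <= 0.f     ) ? CF_TOP    : 0;
    if ( vbCF & CF_BOTTOM ) cf |= ( screenY >= screenH ) ? CF_BOTTOM : 0;
    return cf;
}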
Now we are ready to clip the triangles. We run through the triangle list: we get the indices of three PV's and increment the list pointer by 3.

First, we check if the triangle is all-out of the view frustum. This is done simply with one bit-field test :

if ( PV1.ClipFlags & PV2.ClipFlags & PV3.ClipFlags )
    continue; // triangle is all out

That is, we reject the triangle when all three of its verts are marked as being outside of one plane (the same plane for all three).

Second, we check to see if the triangle is back-facing. We use the simple screen-space check, which means that you can't do the back-face test when your object crosses the Z=0 plane in view space (because the X's and Y's get flipped around when you divide by a zero or negative Z).

if ( (PV3.Screen.X - PV1.Screen.X) * (PV2.Screen.Y - PV1.Screen.Y) >=
     (PV3.Screen.Y - PV1.Screen.Y) * (PV2.Screen.X - PV1.Screen.X) )
    continue; // triangle is back-facing

This is just a cross-product in the 2d space of the screen; this test also automatically rejects triangles of zero area.

Now we know this triangle will actually be visible in some form. For each of the verts, we check IsLit. If it's not set, we compute the color and specular for the vertex, using a set of lights which are in view space and have already been set up in some global scene state. We then set IsLit to true. When we do lighting, we also do any procedural uv-mapping that's desired. The procedural uv mappers take the ViewSpace Position & Normal as inputs and fill in the uv's of the PV. These uv mappers might be environment mappers or projective mappers. We also do any per-vertex fog computation that might be desired. Fog can be computed using the screen-space Z, and placed in the alpha component of the specular color.

Note that this means we only light verts that are actually visible. With closed models, roughly half of the triangles will be back-faced, so only a bit over half of the verts will be lit on average. Since lighting is quite expensive, this is a big savings.

Now we must clip our triangle. We need to clip against the planes specified by the joint clip-flags of the three verts :

CF = PV1.ClipFlags | PV2.ClipFlags | PV3.ClipFlags;

If this CF is zero, the triangle is all-in, and we simply add the three vertex indexes to the VisibleIndexList (which starts empty). If CF is not zero, we must clip. I won't describe clipping in great detail; see Graphics Gems or Blinn for that. The point here is just that clipping is very, very rare. In a scene of 100k triangles, you typically must clip only 1k. Don't worry about optimizing your clipper. Much more important is to quickly find the tris that are all-in or all-out, which is what this pipe does well.

I clip against the near Z in view space, and then clip against the other planes in screen space. I produce new PVs where the edges intersect a clip plane. These new PVs are filled out by interpolating ViewSpacePosition (but not normal! it's not needed any more), color, uv's, and specular.
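Here's a minimal sketch of building one of those new PVs by interpolating exactly the fields named above along the edge that crosses the plane. The struct and field names are illustrative, and it keeps the colors unpacked for simplicity (in the real PV the colors are packed dwords, so you'd unpack, lerp, and repack) :

// One vert as the clipper sees it.
struct ClipVert
{
    float viewX, viewY, viewZ;      // view-space position
    float u, v;                     // texture coordinates
    float r, g, b, a;               // diffuse color
    float sr, sg, sb, sa;           // specular color (fog lives in the alpha)
};

// Make the new vert where edge A->B crosses a clip plane.  For the near-Z
// clip in view space, t = (nearZ - A.viewZ) / (B.viewZ - A.viewZ).
// The normal is deliberately not interpolated : the new vert is created after
// lighting, so the normal is never needed again.
static ClipVert ClipLerp( const ClipVert & A, const ClipVert & B, float t )
{
    ClipVert o;
    o.viewX = A.viewX + t*(B.viewX - A.viewX);
    o.viewY = A.viewY + t*(B.viewY - A.viewY);
    o.viewZ = A.viewZ + t*(B.viewZ - A.viewZ);
    o.u = A.u + t*(B.u - A.u);      o.v = A.v + t*(B.v - A.v);
    o.r = A.r + t*(B.r - A.r);      o.g = A.g + t*(B.g - A.g);
    o.b = A.b + t*(B.b - A.b);      o.a = A.a + t*(B.a - A.a);
    o.sr = A.sr + t*(B.sr - A.sr);  o.sg = A.sg + t*(B.sg - A.sg);
    o.sb = A.sb + t*(B.sb - A.sb);  o.sa = A.sa + t*(B.sa - A.sa);
    return o;
}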
The reason I clip against the sides of the view in screen space (instead of view space) is to absolutely guarantee that the verts are inside the frame-buffer rectangle. If you clip in view space and then project, floating-point imprecision may push the verts slightly outside of the view, which can cause memory trashes. BTW, it's an important "gotcha" to be careful about these vert slippages (unless you're lucky enough to be on a system with a "guard band" or "scissoring"). If not, then whenever you compute the ClipFlags, you must use a slightly smaller-than-view frustum (eg. use epsilons) so that you do slightly more clipping than necessary. You must do this because a vert which is barely inside the view frustum in view space might be barely outside the frame-buffer rectangle in screen space. You could just clamp these verts, which is the old-school technique of Quake 1, but clamping is way too expensive, and we can avoid it completely with some care and thought.

After clipping the triangle, some new set of verts has been created and tacked onto the end of the PV array. Also, a new bunch of triangle indexes has been added to the VisibleIndexList. These triangle indexes refer (perhaps) to some old verts on the original triangle, and to some of the newly created PVs. We proceed to the next triangle and test it for all-out, backfacing, and clip it if needed, repeating for all triangles in the input list.

Now we're all done! We simply render an indexed list using the VisibleIndexList and the screen-space position of the PVs. In order to make the PV's ready to hand off to hardware, the API must either accept strided verts (DX8) or we must split the PV structure into two parts. For D3D, this would look like :

struct PipelineVertA
{
    Vector3d ViewSpacePosition; // xyz
    Vector3d ViewSpaceNormal;   // xyz
    uint8    IsLit;             // a boolean
    uint8    ClipFlags;         // six bit flags
    uint16   Wasted;
};

struct PipelineVertB
{
    Vector4d ScreenSpacePosition; // xyzw
    uint32   Color;               // rgba, packed
    uint32   Specular;            // rgba, packed
    float    u,v;                 // texture coordinates
};

Then we simply fire the array of PipelineVertB's at D3D (for example). In DX7 I think it's a bit better to use a vertex buffer for the PVB's (it eliminates a memcpy internal to D3D). You may also want to use Structure-of-Arrays (SOA) instead of Array-of-Structures (AOS) for the PipelineVertA's to make them more friendly to the SSE transform code, but listen to wise Abrash and do that type of thing very late in the development cycle.

--------------------------------------------------------------------------

II. Notes (eg. things I'm not going to cover in detail) :

0. On lighting in view space : you can save some work (the xform of the normal from model space to view space) by lighting in model space when possible. You can only do this when there are no non-uniform scalings or shears in the model xform (eg. it's just rotations, translations, and uniform scalings) (note that the view xform should always be orthonormal), and also when there's no environment mapping (which uses the view-space normal). To do this, you simply back-xform the lights into model space and do the lighting there. The problem with this is that you don't get the win of not lighting vertices that are backfaced and culled. Actually, this is probably a no-go, because it prevents you from using the same source data for multiple objects, which is such a huge win that we can't abandon it. It's cheaper to load an object from memory once and light it N times in different positions than it is to just load it N times with baked-in lighting.

1. Your VB source data could be compressed. The memory for the PV is allocated only once, so the result can be a very low memory footprint indeed. You could un-pack the source VB at the same time that you do the big walk to fill out the PVs. You might compress, for example, by storing the vert coordinates as bytes, where 0-255 corresponds to the extents of the object's AABB. You can easily decompress with a few table lookups (eg. no int->float conversion ever required).
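A minimal sketch of that kind of table-lookup decompression, assuming a 3-byte quantized position and per-object, per-axis 256-entry float tables built from the AABB (names and layout are illustrative only) :

// Per-object dequantization tables, built once from the object's AABB :
// table[b] = min + (b/255) * (max - min) for each axis.  Unpacking a vert is
// then three byte-indexed loads -- no int->float conversion in the inner loop.
struct DequantTables { float x[256], y[256], z[256]; };

static void BuildDequantTables( DequantTables & t, const float mn[3], const float mx[3] )
{
    for(int b = 0; b < 256; b++)
    {
        float f = (float)b * (1.f/255.f);   // int->float happens here, once per object
        t.x[b] = mn[0] + f * (mx[0] - mn[0]);
        t.y[b] = mn[1] + f * (mx[1] - mn[1]);
        t.z[b] = mn[2] + f * (mx[2] - mn[2]);
    }
}

struct PackedVert { unsigned char qx, qy, qz; };  // the 3-byte xyz of note 1

static void UnpackPosition( const PackedVert & v, const DequantTables & t, float out[3] )
{
    out[0] = t.x[v.qx];
    out[1] = t.y[v.qy];
    out[2] = t.z[v.qz];
}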
If your application looks good enough at this quality, you can pack your verts into 3-7 bytes. That's 3 for xyz, 1 for a normal lookup into a table, 2 for an rgba (in 4444 format), and 1 for a uv (4 bits each for u and v). Any of the normal, color, and uv can be optional, which is where you get 3-7 bytes per vert. More realistically, you probably need 5 bytes for xyz, and maybe 2 for the normal, giving you a 5-11 byte vertex. (* note that this doesn't count any skinning information, which would take roughly another 4 bytes). The best form of compression, of course, is if the "myvb" is a function (eg. procedural geometry).

2. You could have multiple uv's per PV to do multi-pass.

3. The source VB could optionally not contain rgba's, uv's, or normals (to reduce memory use and processing). Each VB would have to come in with flags indicating which it had. The PV's would be the same format in all cases, though, so there would be no pipeline stall for vert-format switches like D3D has.

4. I believe that this pipe is optimal on modern CPU's. If you can tell me how it could be improved, I'd love to hear about it. (For example, I know this pipe is superior to the one in DirectX 7 in various ways, lighting fewer vertices and walking memory more coherently.)

5. I said at the beginning that you should be using D3D anyway, but that's not really quite true, except when you're running on an HT&L card. If you have to do ANYTHING in software (eg. character skinning, lighting, uv-mapping) because D3D can't handle it, then you should do EVERYTHING in software: since the most expensive part of this whole business is the memory touches, if you are walking the verts, you should do everything in that one walk.

6. On an HT&L card, you can use the same opaque API (eg. call render with a VB, a triangle list, and an AABB). Instead of filling out a PV array, though, you fill out a single static D3D VertexBuffer from the VB and fire the whole thing at D3D. This will be much better in DX8, at which time D3D will be able to do all the good stuff we need.

7. This technique is great with VIPM (see my VIPM article). The verts in your VIPM are stored in a big linear VB which is heavily compressed (see note 1), and can also be paged out (truncated) when a large suffix of the VB is not in use. The VIPM's VB is unpacked into the PV array, which results in memory use that goes like :

    sizeof(PV) * Num_Visible_Verts + sizeof(VB Vert) * Num_Potential_Verts

Here the PV is quite large (60 bytes), and the VB vert can be packed very tight (3-7 bytes).

8. Software T&L is still necessary on most consoles. This pipeline is good for consoles, but maybe not great, because the PV array will be too big for the small caches found on most consoles, unless you work with very few verts at a time (eg. <= 256). I'd love to hear better solutions for consoles. If you did this algorithm on a console, what you'd see is :

1. stream in compressed VB and tri-list from main memory (this can be done asynchronously by DMA on some hardware)
2. unpack into PV's and do all that work and flush back out to main memory (one big walk; must write out to memory cuz the cache is not big enough) (the CPU doesn't stall anywhere here)
3. walk the triangles and read PV's back in from main memory (here the CPU cache acts like a vertex cache) (the CPU cost of lighting should hide most of the cache-miss stalling)
4. fill out a display-list and flush it to main memory or the graphics system (the CPU doesn't stall anywhere here)
This isn't terrible; I can't really see a way around it unless you constrain the VB's to be tiny. Steps 1 & 2 are basically a memcpy which reads little & writes lots, with a bunch of CPU work tossed in to hide the memory latency. You can use prefetches and DMA to totally hide the memory-read stalls in step 1. Step 3 does a bunch of out-of-order reading. The CPU cache should help a lot if your verts are in a good coherent order. Again, the CPU cost of lighting should hide most of the latency, and prefetching helps again (eg. prefetch the next three verts while you are lighting the current three). Step 4 is just a bunch of writing; there are no stalls. Step 3 is the only place we read-after-write.

9. How to handle strips : the input is still an indexed strip, with null indices to indicate breaks between strips in a set. All clipping proceeds as usual, except that you break the strips : after a triangle is clipped or marked as out, the next included triangle must restart the strip. The easiest way to handle this is by putting all clipped triangles into a temporary triangle list, and only keeping the strips generated by triangles that are all-in.

If the hardware doesn't handle indices (eg. just vertex strips), then you push out the PVB's instead of indices. When doing this, it's better not to generate the PVB's in advance (eg. with the PVA's) but instead generate them after you fetch the PVA's for the given triangle (eg. right before you push the PVB out to the strip). Let me make this clearer. You do the initial input-from-VB stage and fill out the PVA's. Then, per triangle, you fetch the single new PVA (two carry over from the previous tri in the strip), project and light it into a PVB (the other two verts are already in lit PVB's in the output strip), and tack it onto the output vertex strip.

If you're on a severely memory-limited system, you could also not precompute the PVA's, and instead do it all in one pass as you create the output strip. Note that if you do this, you are transforming one vert per triangle, whereas with proper sharing most meshes average half a vert per triangle, so you are doing double the transform and lighting work (this is the way the GeForce does it, because its CPU is so much faster than its memory).

10. Notes on Tom Forsyth's Subdivision+Displacement+VIPM method. See www.muckyfoot.com first to get Tom's PowerPoint presentation. What Tom does is a form of compression. Verts that are created by subdivision and displacement are stored in the post-VIPM VB as 3 indices to the verts of the base-mesh triangle, 2 barycentric coordinates for the location inside the triangle, and a displacement. If you also store a normal, that takes about 8 bytes per vertex. Note that color, uv, and skinning information are all implicit by triangular interpolation using the barycentric coordinates. Also note that the interpolation can be done in View Space (!) so that these verts need never be skinned or transformed! So, this is really excellent for compression and for speed.

How does Tom do it? The answer is that these subdivided triangles only move in chunks along with the base-mesh triangles. If you read my note on how to do subdivision by storing the equation to compute the vert from the base-mesh vertices, then you can think of Tom's trick as taking this equation and truncating it down to only three verts, eg. a form of lossy compression.
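To make the storage concrete, here's a sketch of how such a vert might be stored and reconstructed, under my reading of the description above; the struct layout and the convention that the displacement is along the interpolated base normal are assumptions of this sketch :

#include <math.h>

struct V3 { float x, y, z; };

static V3 Bary3( const V3 & a, const V3 & b, const V3 & c, float wa, float wb, float wc )
{
    V3 o = { wa*a.x + wb*b.x + wc*c.x,
             wa*a.y + wb*b.y + wc*c.y,
             wa*a.z + wb*b.z + wc*c.z };
    return o;
}

// One subdivided+displaced vert : three indices into the base mesh, two
// barycentric coordinates (the third is implied), and a scalar displacement.
struct DisplacedVert
{
    unsigned short i0, i1, i2;  // base-mesh vert indices
    float b0, b1;               // barycentrics; b2 = 1 - b0 - b1
    float disp;                 // displacement along the interpolated normal
};

// Reconstruct in view space from the already-transformed (and skinned) base
// verts, so the displaced verts themselves never get skinned or transformed.
static V3 Reconstruct( const DisplacedVert & dv, const V3 * basePos, const V3 * baseNrm )
{
    float b2 = 1.f - dv.b0 - dv.b1;
    V3 p = Bary3( basePos[dv.i0], basePos[dv.i1], basePos[dv.i2], dv.b0, dv.b1, b2 );
    V3 n = Bary3( baseNrm[dv.i0], baseNrm[dv.i1], baseNrm[dv.i2], dv.b0, dv.b1, b2 );

    // renormalize so the displacement is in consistent units
    float len = sqrtf( n.x*n.x + n.y*n.y + n.z*n.z );
    float s   = ( len > 0.f ) ? dv.disp / len : 0.f;

    V3 o = { p.x + s*n.x, p.y + s*n.y, p.z + s*n.z };
    return o;
}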
The neat thing, as Tom has pointed out to me, is that you can control where the error occurs by distributing your base-mesh verts so that there are more verts where the mesh will bend more, eg. at elbows and knees and such. You'll never get the real "rubber-bandy" stretchy effect of subdivision surfaces; you'll just get a smooth carapace effect, like an ant. Basically, what we've gained here is that we've packed a vert into 8 bytes instead of the 15 I got in note 1 (with skinning info), and we've reduced the CPU work a little bit by doing the view-space interpolation, at the cost of a bit of generality and freedom.

Theoretically, you could use Tom's method as a form of compressing any mesh, not just one generated by subdivision+displacement. This is the "inverse displacement problem" that Hoppe and others will be discussing at Siggraph 2000. It's a difficult problem : for a given input mesh, you must find a base mesh which you can subdivide, and then convert all the verts into displacements from the base mesh.

11. Shadow Maps and Detail Layers should be enabled or disabled for the entire chunk using the AABB of the VB. See the shadow mapping article for more on this type of thing.

12. How to do clipping on a system with a guard band : you keep two view frusta, the usual View Frustum and the much larger Guard Frustum. You test your AABB against the View Frustum for all-out rejection. Then, you set your whole-VB clip flags (VB_CF) using the guard-band frustum. Almost always, this will give you zero, meaning all-in the guard band. Finally, when you do your clipping (only on the planes specified by VB_CF), you use the original view-frustum clip flags (you might as well reduce the fill for free). Note that polygons drawn to the guard band still burn fill rate on most cards, so you may wish to limit your guard frustum to be only 256 pixels larger than the screen in each direction. On cards that do fast hardware clipping, the guard band is infinite, and you should simply not clip any polygons (even though you are transforming them!); the card can clip them in screen space. You may wish to still clip-test polygons against the original frustum so that they don't need to be transformed and lit.

13. When you have lots of duplicated objects around your scene, which differ only by their xform, you should batch them up and render them all at once. When you do this, there's lots of stuff that you can cache, eg. not change from one render to the next, such as texture settings and renderstates. You also might not need to refill a VB!

14. Rant : it's amazing how many people don't understand the point of LOD. The goal of LOD is not just to speed up your old content, or to make your ambitious content run on today's crappy GPU's. No matter how fast the GPU is, LOD will still be important. Why? LOD relieves the load not only on the GPU, but also on the CPU (fewer verts to animate, collide against, etc.), and on the memory (fewer verts to store). When the GPU can do 1 MTris/frame, then the user should expect scenes that appear to have 100 MTris!! Only with LOD will you be able to push scenes that have far more triangles than the GPU can do.

15. On xforming the normals. Eric Haines makes the good point that if you just transform the normals by the orthonormalized version of the MVT, your normals will truly not do the right thing. If you want to be more correct, you need to transform the normals by the MVT *with inversed scalings* and then normalize the normals afterwards. You can understand the inversed scalings quite easily. Imagine a sphere with a nice normal on it; if I stretch this sphere out into an ellipse, I have a scaling > 1 in X in my MVT, but the normals tend to point more *away* from X, which is a scaling < 1.
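In matrix terms, "the MVT with inversed scalings" is the inverse-transpose of the MVT's upper 3x3. Here's a minimal sketch (my own formulation; it uses the cofactor matrix so no explicit determinant division is needed, and it assumes the 3x3 isn't a mirroring, eg. its determinant is positive) :

#include <math.h>

// Transform a normal by the inverse-transpose of the upper 3x3 of the MVT,
// then renormalize.  The cofactor matrix equals det(M) * inverse-transpose(M),
// and the det(M) factor washes out in the normalize (as long as det(M) > 0).
// For a pure rotation this reduces to M itself; for a scaled M it applies the
// *inverse* scalings, which is exactly the behaviour described in note 15.
// m[r][c] is row-major.
static void TransformNormal( const float m[3][3], const float n[3], float out[3] )
{
    float c[3][3] =
    {
        { m[1][1]*m[2][2] - m[1][2]*m[2][1],
          m[1][2]*m[2][0] - m[1][0]*m[2][2],
          m[1][0]*m[2][1] - m[1][1]*m[2][0] },
        { m[2][1]*m[0][2] - m[2][2]*m[0][1],
          m[2][2]*m[0][0] - m[2][0]*m[0][2],
          m[2][0]*m[0][1] - m[2][1]*m[0][0] },
        { m[0][1]*m[1][2] - m[0][2]*m[1][1],
          m[0][2]*m[1][0] - m[0][0]*m[1][2],
          m[0][0]*m[1][1] - m[0][1]*m[1][0] },
    };

    float r[3];
    for(int i = 0; i < 3; i++)
        r[i] = c[i][0]*n[0] + c[i][1]*n[1] + c[i][2]*n[2];

    float len = sqrtf( r[0]*r[0] + r[1]*r[1] + r[2]*r[2] );
    float inv = ( len > 0.f ) ? 1.f/len : 0.f;
    out[0] = r[0]*inv;  out[1] = r[1]*inv;  out[2] = r[2]*inv;
}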
--------------------------------------------------------------------------

III. Pseudo-Code for the whole process :

TLC_Primitive( MyVBType * Verts, int NumVerts,
               WORD * Indexes, int NumIndexes,
               Transform3d * ModelWorldTransform,
               UVMap and Texture and Lighting Parameters )
{
    MVT = WorldViewTransform * ModelWorldTransform; // the model->view transform

    View_AABB = MVT * ModelAABB; // transform the AABB to view space

    AllOut = FrustumAABBTest(View_AABB, ViewFrustum, &VB_CF); // also fills out clip flags in VB_CF
    if ( AllOut )
        return;

    NormMVT = Orthonormalize(MVT);

    // the big per-vertex pass : transform, project, copy, compute clip flags
    for(i = 0 to NumVerts)
    {
        unpack Verts[i];
        Prefetch Verts[i+1];

        PV[i].ViewSpacePosition   = MVT * Verts[i].Position;
        PV[i].ViewSpaceNormal     = NormMVT * Verts[i].Normal;
        PV[i].ScreenSpacePosition = Project(PV[i].ViewSpacePosition);
        PV[i].IsLit     = false;
        PV[i].ClipFlags = ComputeClipFlags(PV[i], ViewFrustum, VB_CF);
        PV[i].Color     = Verts[i].Color;
        PV[i].uv        = Verts[i].uv;
    }

    pVisibleIndexList = VisibleIndexList;

    // the per-triangle pass : reject, lazily light, clip
    for(pIndexes = Indexes to end, pIndexes += 3)
    {
        PV1 = PV[ pIndexes[0] ];
        PV2 = PV[ pIndexes[1] ];
        PV3 = PV[ pIndexes[2] ];

        if ( PV1.ClipFlags & PV2.ClipFlags & PV3.ClipFlags )
            continue; // triangle is all out

        if ( View_AABB.Min.Z > 0.f ) // totally predictable branch
        {
            // screen-space back-face test; only valid when the object doesn't cross Z=0
            if ( (PV3.Screen.X - PV1.Screen.X) * (PV2.Screen.Y - PV1.Screen.Y) >=
                 (PV3.Screen.Y - PV1.Screen.Y) * (PV2.Screen.X - PV1.Screen.X) )
                continue; // triangle is back-facing
        }

        TriCF = PV1.ClipFlags | PV2.ClipFlags | PV3.ClipFlags;

        if ( ! PV1.IsLit ) { Light and UVMap PV1; PV1.IsLit = true; }
        if ( ! PV2.IsLit ) { Light and UVMap PV2; PV2.IsLit = true; }
        if ( ! PV3.IsLit ) { Light and UVMap PV3; PV3.IsLit = true; }

        if ( TriCF == 0 )
        {
            *pVisibleIndexList++ = pIndexes[0];
            *pVisibleIndexList++ = pIndexes[1];
            *pVisibleIndexList++ = pIndexes[2];
        }
        else
        {
            // must clip : add new verts to PV and new tris to VisibleIndexList
        }
    }

    Hardware_Rasterize( PV_ForHW, VisibleIndexList );
}

--------------------------------------------------------------------------

IV. Comparison to the Naive Implementation

The naive implementation of a TLC pipeline is :

    compute vertex normals from the triangles
    light all the verts in world space
    store the computed colors
    for each triangle :
        transform the 3 verts
        clip the 3 verts against the view
        render the verts

A lot of people write their pipeline like this; it's sort of the intuitive way. Let's see what's wrong with it, or rather in what ways the pipeline I've described here is better.

1. Vertex normal computation from triangles is hugely expensive, and it's very bad for the CPU cache because you must reach all over the vertex set in memory to grab the vertices for each triangle; you must also do normalizations, which are very expensive.

2. You light verts that aren't visible, such as back-faced verts and off-screen verts. Since lighting is the most expensive operation, this can almost double your CPU work. In contrast, the pipeline here only lights verts when needed.

3. You touch verts separately for lighting and rendering, which is bad for the cache.

4. When rendering and transforming, you don't do a linear walk over the verts, which is again bad for the CPU cache, and prevents you from using a nice SIMD transform.
5. Since you do transform-clip-render, transform-clip-render, instead of transform-transform, clip-clip, render-render, you make worse use of the CPU registers and L1 cache. It's best to do all the operations of the same type at once.

6. You don't share the plane-out computations for each vert. Also, you branch on the plane-out computations immediately after computing them. The latter causes a hard CPU stall on modern machines, which is horrendous. You should compute all booleans well before you ever branch on them. (You have the same problem for backfacing.)

7. You clip against all the planes of the view when typically you only need to clip against one, or even none!

8. You transform and project each vertex more than once, eg. once for every time a triangle uses it, instead of sharing the xform.

--------------------------------------------------------------------------

EOF
