Well, for the same reason you cannot do perfect triangle-level sorting on the CPU, you also cannot do it on the GPU. It turns out it's hard to answer the question "is that triangle completely behind other already rendered objects" without asking the counter question "is any part of that triangle visible", and as long as "that triangle" doesn't have a specific size, "any part of that triangle" doesn't have a specific size either. That means you have to ask the question recursively until at some point you are asking "is that pixel of the triangle visible", which finally is a question you can answer. Once you have answered it with "no" for every pixel of a triangle, you can safely cull that triangle, but at that point you have already rasterized the whole triangle, so all you can do now is not run the fragment shader.
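To make that concrete, here is a minimal sketch (plain Python, just illustrating the logic, not real GPU code) of why the culling only happens after rasterization: you need the per-pixel depths first, and only then can the depth test reject every one of them.

```python
# Hypothetical sketch: rasterization happens first, producing per-pixel
# depths; then a per-pixel depth test decides whether the fragment
# shader runs for each pixel.
def depth_test_triangle(tri_pixels, depth_buffer):
    """tri_pixels: list of ((x, y), depth) produced by rasterizing one
    triangle. Returns the pixels whose fragment shader would actually run."""
    visible = []
    for (x, y), depth in tri_pixels:
        # classic "less" depth test: the fragment survives only if it is
        # closer than what the depth buffer already holds
        if depth < depth_buffer[y][x]:
            visible.append(((x, y), depth))
            depth_buffer[y][x] = depth
    return visible

# every pixel of this triangle is behind the stored depth (0.5), so the
# whole triangle ends up rejected -- but only after rasterizing all of it
depth_buffer = [[0.5, 0.5], [0.5, 0.5]]
tri = [((0, 0), 0.8), ((1, 0), 0.9), ((0, 1), 0.7)]
print(depth_test_triangle(tri, depth_buffer))  # -> []
```

The point of the sketch is that there is no step at which the triangle as a whole could have been rejected; the "no" answer only exists as the conjunction of all the per-pixel "no" answers.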
And because all pixels are rasterized already, in case you find that some are visible, you of course don't have to shade the whole triangle either: the fragment shader is still skipped for each individual pixel that is occluded. That is what happens automatically, as long as you don't change the depth of the fragment in the fragment shader. There is also an extension (ARB_conservative_depth, which made it into core in OpenGL 4.2) that allows you to declare, for example, that you might change the depth of the fragment, but if you do, you will only ever increase the value and never decrease it. In that case the fragment shader can still be skipped whenever the depth buffer already contains a value lower than the initial fragment depth, given the depth test condition is "less" (you can also use depth tests that only pass when the fragment's depth is equal to, or greater than, etc., the depth buffer's value).
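The skip-or-shade decision can be sketched like this (again plain Python, with illustrative names; the real decision is made in hardware, and the "promise" corresponds to the conservative-depth layout qualifier):

```python
# Sketch of the early depth test decision, assuming the depth test
# condition is "less". depth_promise mirrors the conservative-depth idea:
# None (shader does not touch depth) or "greater" (shader may only ever
# increase the depth, never decrease it).
def can_skip_fragment_shader(stored_depth, fragment_depth,
                             shader_writes_depth, depth_promise=None):
    if not shader_writes_depth:
        # early-z: the depth is final before shading, so the test can
        # run up front and skip the shader on failure
        return not (fragment_depth < stored_depth)
    if depth_promise == "greater":
        # the shader may move the depth, but only away from the camera:
        # if the initial depth already fails "less", the final one will too
        return not (fragment_depth < stored_depth)
    # no promise: the final depth is unknown, the shader must run
    return False
```

Without the promise, a depth-writing shader always forces the fragment shader to run; with it, the GPU can keep rejecting fragments early, which is exactly the win the extension is for.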
And because the fragment shader, with complex lighting, textures and whatnot, is way more taxing on the GPU than rasterizing a triangle, you still get decent performance improvements.
EDIT: I am sort of lying here; potentially you could cull a whole triangle. Imagine you have a single triangle centered on the screen. You could check whether the depth values at its 3 corners are greater than the values stored at those positions in the depth buffer (you have the positions of the vertices on screen and the screen size / depth buffer size). If you could additionally make sure that no depth buffer pixel between the corners holds a greater (farther) value than the depth interpolated from the corners, then you could cull that triangle. And you can do exactly that by generating a mipmap chain of the depth buffer and using the level at which the corner lookups land on adjacent pixels (because then there are no pixels between the ones you used for the lookups). That technique is called Hi-Z culling (or Lo-Z culling, depending on which way the comparison goes), and I actually implemented it; there is a link somewhere in the "Suggestions for 3.4" topic that I created, if you want to go for the adventure. Just note that in practice the technique is not used to cull individual triangles; instead it is used to test the bounds of whole objects against the depth buffer mipmap chain and cull the whole object.
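A tiny sketch of the Hi-Z idea, sticking with the "less" depth convention from above (with reversed depth you would use a min reduction instead, the "Lo-Z" variant). All names are illustrative, and the depth buffer is assumed square with a power-of-two size:

```python
# Build a mip chain of the depth buffer where each level stores the MAX
# (farthest) depth of its 2x2 footprint. An object is occluded if even
# its nearest point is behind the farthest depth under its whole footprint.
def build_hiz_chain(depth):
    chain = [depth]
    while len(depth) > 1:
        depth = [[max(depth[2 * y][2 * x],     depth[2 * y][2 * x + 1],
                      depth[2 * y + 1][2 * x], depth[2 * y + 1][2 * x + 1])
                  for x in range(len(depth[0]) // 2)]
                 for y in range(len(depth) // 2)]
        chain.append(depth)
    return chain

def is_occluded(chain, min_x, min_y, max_x, max_y, nearest_depth):
    """min/max x,y: screen-space bounds (pixel indices) of the object,
    nearest_depth: the smallest depth value the object can produce."""
    # pick the mip level where the bounds cover roughly a 2x2 footprint,
    # so there are no uninspected pixels between the lookups
    level = 0
    size = max(max_x - min_x, max_y - min_y)
    while size > 1:
        size //= 2
        level += 1
    mip = chain[level]
    sx, sy = min_x >> level, min_y >> level
    ex, ey = max_x >> level, max_y >> level
    return all(nearest_depth >= mip[y][x]
               for y in range(sy, ey + 1)
               for x in range(sx, ex + 1))
```

On a real GPU the chain is built with a downsampling shader pass and the bounds test runs per object (e.g. in a compute shader driving indirect draws), but the comparison is the same one as in this sketch.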