A couple of weeks ago, I was approached by a graphics firm to discuss the best way to test graphics hardware as we move forward into 2006. In an email, I outlined the bit-tech approach to testing graphics hardware, an approach we have refined over the last 12 months. When I read the recent HardOCP editorial on the subject of the graphics industry, it occurred to me that my thoughts might be of interest to the rest of the web. Consider this, then, my treatise on benchmarking graphics cards as we move through the next generation.
Problems with benchmarking today
There are many problems with testing graphics cards on PC systems today.
These include: the CPU as a bottleneck; the fact that today’s games can’t always act as a predictor for tomorrow’s performance; the unreliability of timedemos as reflective of real gameplay; the margin of error in a 120FPS result; the focus on speed rather than image quality.
Many publications, online and offline, fall foul of these problems.
A new definition of performance
When gaming, readers aren't looking at the frame rate - they're enjoying the gaming experience. As you might know, this is why we have switched to a performance evaluation method that substantially eschews numbers in favour of telling readers, in simple terms, 'which card plays which games best'. 'Performance', to us, is not just about numbers of frames - it's also about usable featureset and image quality. Frame rate matters only up to a point: once we're over a 70FPS average and a 35FPS minimum, in most cases, the focus should be image quality.
Our method of benchmarking a graphics card
Our benchmarking, then, involves a number of steps.
The first is to play through the game we are benchmarking, fully, with both NVIDIA and ATI hardware. In the case of something like Half Life 2, this means completing the game twice, with frame rate counters running throughout to see where the hardware is having to work harder. (This means we are able to spot if ATI and NVIDIA cards are struggling in different places). We can then evaluate what kind of minimum and average framerate is necessary for an 'enjoyable' experience in that game.
We also then go back through the game to find sections which will highlight image quality differences - areas where we can see things are being rendered differently, or areas with power lines etc that will show up anti-aliasing techniques.
When we come to testing individual cards, we then play through the 'struggle sections' repeatedly with different settings for image quality and resolution, to see how those affect the performance of the card, and to draw a picture of the card's 'performance characteristics' with regards to the specific featureset of the card, chip or platform.
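As a crude illustration of that first pass, here is a sketch of picking out 'struggle sections' from a per-second frame rate log. The samples and the 35FPS threshold are invented for illustration, not our actual data:

```python
# Hypothetical sketch: find contiguous runs in a per-second FPS log where
# the hardware is 'having to work harder'. All numbers are made up.
fps_log = [62, 58, 31, 24, 22, 29, 55, 60, 27, 26, 59]

THRESHOLD = 35  # below this, we treat the section as a 'struggle section'

def struggle_sections(log, threshold=THRESHOLD):
    """Return (start, end) index ranges where FPS stays below the threshold."""
    sections, start = [], None
    for i, fps in enumerate(log):
        if fps < threshold and start is None:
            start = i                      # a struggle section begins
        elif fps >= threshold and start is not None:
            sections.append((start, i - 1))  # the section just ended
            start = None
    if start is not None:                  # log ended mid-struggle
        sections.append((start, len(log) - 1))
    return sections

print(struggle_sections(fps_log))  # → [(2, 5), (8, 9)]
```

In practice this is done by eye against the FRAPS counter while playing, but the principle is the same: it is the sustained dips, not the average, that identify the sections worth replaying.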
Once we have done all this, we can give an authoritative opinion on the 'best playable image quality settings' for the game on a certain card - the settings that will maintain an 'enjoyable' frame rate (often 30FPS minimum, 75FPS average). This will often involve a subjective judgement of whether, for example, 1024x768 with 2x AA looks better or worse than 1280x1024 with no AA. Making that call with conviction can involve substantial work.
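The settings search can be sketched in the same spirit. The resolutions, AA levels and frame rates below are invented for illustration; only the 'enjoyable' thresholds echo the figures above:

```python
# Hypothetical sketch of the 'best playable settings' search described above.
# Candidate settings are ordered from highest to lowest image quality;
# each entry: (label, measured minimum FPS, measured average FPS).
candidates = [
    ("1280x1024, 4x AA", 22, 48),
    ("1280x1024, no AA", 28, 61),
    ("1024x768, 2x AA",  33, 78),
    ("1024x768, no AA",  41, 92),
]

MIN_FPS, AVG_FPS = 30, 75  # 'enjoyable' thresholds for this example game

def best_playable(results, min_fps=MIN_FPS, avg_fps=AVG_FPS):
    """Return the highest-quality settings that stay above both thresholds."""
    for label, mn, avg in results:          # ordered best image quality first
        if mn >= min_fps and avg >= avg_fps:
            return label
    return None  # nothing playable at these settings; drop detail further

print(best_playable(candidates))  # → 1024x768, 2x AA
```

Note that the mechanical part is trivial; the hard part, as the text says, is the subjective ranking of the candidates by image quality in the first place.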
Applying this method to actual boards
We believe that this provides a great way to look at the performance of a card. It tells you, simply, which is going to give you the best gameplay experience. For example:
The FEAR page of our X800 GT review.
In the example, we have played through a section of FEAR with the cards and looked at the FRAPS minimum, maximum and average frame rates, across a number of re-runs.
Here, we can see that all the cards reviewed play best at 1024x768. These are 'mid-range' cards, but FEAR is a demanding title. We have determined that a 15FPS minimum and a 45FPS average are acceptable for FEAR, based on our experience of actually playing it on mid-range cards. When evaluating the 'performance' of the hardware, we can see that most X800 GTs are capable of 0x / 2x with medium-low detail and 'minimum' shadows. However, the HIS version allows for that detail with 4x AF at the same frame rate (or within the same 'bracket' of acceptable frame rate). We can also see that the 6600 GT allows for the same performance as the HIS, but with 'medium' shadows rather than 'minimum'.
What this tells the reader is that the 6600 GT is the best card in this test for FEAR, but that the HIS is faster than other X800 GTs. We tell you not what card is 3 FPS faster, but what the difference in 'real world playability' is. We can tell you what architectural features actually make a difference to gameplay, and which are unable to contribute much for whatever reason. We believe this is a simple, useful and effective way of conveying relative (and absolute) performance.
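To make the 'bracket' idea concrete, here is a sketch of aggregating FRAPS numbers across re-runs per card. The card names echo the example, but every frame rate is invented, and the pass rule (worst minimum plus mean average) is an assumption about one reasonable way to fold re-runs together:

```python
# Illustrative aggregation of per-run FRAPS results - all numbers are made up.
# Per-card list of (minimum FPS, average FPS) pairs, one pair per re-run.
runs = {
    "6600 GT":     [(17, 47), (16, 49), (18, 46)],
    "HIS X800 GT": [(16, 46), (15, 48), (17, 45)],
    "X800 GT":     [(13, 44), (14, 42), (12, 43)],
}

MIN_OK, AVG_OK = 15, 45  # the acceptable bracket for FEAR in this example

def acceptable(card_runs, min_ok=MIN_OK, avg_ok=AVG_OK):
    """A card passes if its worst minimum and its mean average stay in bracket."""
    worst_min = min(mn for mn, _ in card_runs)
    mean_avg = sum(avg for _, avg in card_runs) / len(card_runs)
    return worst_min >= min_ok and mean_avg >= avg_ok

for card, r in runs.items():
    print(card, "OK" if acceptable(r) else "below bracket")
```

With these invented numbers, the two faster cards land inside the bracket and the stock X800 GT falls below it on its worst minimum, which is the 'real world playability' distinction the review draws.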
Known problems with the bit-tech method
There are, however, substantial problems with this method. The first is that it requires an awful lot of time to benchmark a card. It is clearly not suitable for less technical publications / writers / readers, because of the intimate knowledge of image quality that it requires. It is also not suitable for journalists who are not gamers, because it requires that every time a game is added to the benchmark suite, at least 2 days are set aside to play through it (twice) and work out the specific performance details of the game.
(We often go through gaming PR companies to get answers from developers about game engines, so that we know the details of how they work and can use those details to inform our testing. This is also extra work for an editorial team.)
It is not suitable for a readership that wants a buying decision boiled down to a single benchmark number (the 3DMark approach). It is not suitable for a readership that is not focused on PC gaming.
It throws up questions about whether or not mid-range cards should be tested on mid-range systems for added realism to the real-world testing, or whether we should continue to test on the platform that suffers from the least CPU bottleneck. Many benchmarkers like to tweak Windows, disable processes, disable sound etc to give what they believe is the most ‘clean’ graphics result - but since that doesn't happen in the 'real world', should that be done?
People will also complain that the method isn't 'scientific' enough, since we're not using timedemos which are repeatable frame by frame, and that when using gameplay rather than timedemos, CPUs can become a bottleneck. To evaluate high-end cards using this method requires a high-end gaming rig akin to what readers will be playing on. Smaller publications may find it harder to get the hardware to be able to do this - FX-57s don't grow on trees.
To evaluate cards in this way requires subjective interpretation of results, combined with a 'big picture' analysis. Simply put, bit-tech is incredibly lucky to have Tim Smalley doing this. He came up with this method of performance evaluation (based on the original HardOCP treatise and that publication's subsequent approach); we have refined it together, and his subjective analysis and the conclusions he draws from it are, in my opinion, beyond reproach. I believe this puts bit-tech at the bleeding edge of performance evaluation, and places our reviews amongst the 'best', if 'better' means more accurate and more useful to the reader.
I do believe that this method is the most accurate way of testing graphics 'performance', based on our new definition of performance. It focuses on the experience of the gamer, and obviously that experience scales across price points. It is a technical, in-depth method focused on gaming, which suits our specific web audience of technical readers who have an enthusiasm not just for the hardware, but for the games they play on them too. Obviously, other publications have to come up with ways of conveying results that are relevant to their own readership, and a 'one size fits all' approach is unlikely to be found.
Approaches of other publications
There is a lot of 'sensitivity' out there about benchmarking techniques, all the way from first principles to execution to presentation of results. I've seen sites do good testing, but create meaningless or unreadable graphs out of it. I've seen fancy-ass flash, which conveys information brilliantly, based on substantially flawed premises. Nobody wants to be told they're doing things 'wrong' and nobody wants to be told what they should be doing by the boys in green and red.
As we go on into next year, we are seeing the same themes come up again and again. Once, 'driver optimisations' were dirty words at ATI, when it took NVIDIA to task over its drivers. Now, we have ATI releasing entirely new OpenGL routines to the press. The boot is on the other foot. Once, NVIDIA played off ATI for the lack of WHQL support. Now, it is NVIDIA that has issues with passing Shader Model qualification. Shader Model 3 went from a non-issue at ATI to the major selling point of its new architecture. Image quality was once NVIDIA's forte, and now it suffers from dubious filtering quality.
What we are seeing is all the old themes revisited. Optimisations, 3DMark, SLI profiles, vertex textures... everything is a regurgitation of everything else, and many 'issues' reported are nothing more than the inane fiddling of one PR team hoping to score points off another. When all is said and done, what we have to focus on are actual, shipping games, as well as forthcoming games of interest to enthusiasts, and how those games play on the hardware we are evaluating. I believe that real-world game testing, conclusions on performance characteristics, and a high standard of that often underrated lubricant, journalistic integrity, will yield the best results.
As always, we welcome your comments in our discussion forum - especially if you think you have the definitive answer to one of the known problems.