bit-tech.net

Intel's Core Duo meets the desktop

Comments 26 to 38 of 38

Marquee 20th May 2006, 00:12 Quote
With that much firepower coming from a 2.6GHz CPU, I wonder what a 3GHz overclocked Intel CPU can do. These new Intel chips are very good and I am happy that I am going Intel for my next rig. The fact that there is no Nvidia board for the chip is kinda a bummer though. I can't wait to get one of these bad boys on some water and overclock it.

One thing I think this chip is missing is HT on the lower speeds. Can you just imagine what a dual-core CPU with 2MB cache for each core + 4 threads would be capable of? Man, that chip would be a monster with multi-threaded games.
hitman012 20th May 2006, 00:15 Quote
Quote:
Originally Posted by Solidus
smaller pipelines mean :(
(this is highly simplified)

A CPU does not execute a complete instruction, start to finish, in a single cycle - this is mainly because running an instruction involves a very large amount of work, which requires a lot of transistor switching in the processor. If you had a processor that did run each instruction in a single cycle/clock, it would be very hard to raise the clock speed - eventually (and at quite a low clock speed), the time taken to execute the instruction would in fact be longer than a single cycle of the processor. This makes normal operation impossible.

Partly to rectify this problem and allow higher clock speeds (and partly to allow more advanced processing techniques), a processor does not execute the whole instruction in one cycle. Instead, it runs the instruction in discrete steps - these are known as pipeline stages. For example, here is a simple pipeline:



As you can see, there are a number of operations that are performed on the instruction in order to execute it, each of these a stage in the pipeline. Usually, each operation takes a single clock cycle, so it takes 4 cycles (excluding cache latency) in this instance to execute an instruction. This means that several instructions are all running simultaneously - while one instruction might just be entering the pipeline, the one dispatched three cycles ago is just entering writeback (think of an assembly line).

Now that we've broken down the execution into units, the processor has less work to do in a single cycle. This allows it to hit high clock speeds and hence gain better performance. However, you might have noticed that it takes longer for instructions to execute - you're looking at a latency of 4 cycles before you get writeback on the instruction that you dispatched. This means that, at low clock speeds, your processor is now much slower per cycle than it was when we had 1-stage execution.
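The latency/throughput trade-off described above can be put into numbers with a small sketch. It assumes an idealised pipeline (no stalls, the work splits evenly between stages, no overhead per stage); the function name and the 4 ns figure are made up for illustration.

```python
def total_time_ns(n_instructions, stages, work_per_instruction_ns):
    """Time to finish n instructions on an idealised pipeline.

    Assumptions (not from the thread): the work of one instruction splits
    evenly across the stages, and nothing ever stalls.
    """
    if n_instructions == 0:
        return 0.0
    # Less work per stage means a shorter cycle, i.e. a higher clock speed.
    cycle_ns = work_per_instruction_ns / stages
    # The first instruction needs `stages` cycles to reach writeback;
    # after that, one instruction completes every cycle.
    cycles = stages + (n_instructions - 1)
    return cycles * cycle_ns


# One instruction on its own takes just as long either way (latency)...
print(total_time_ns(1, 1, 4.0), total_time_ns(1, 4, 4.0))
# ...but a long stream finishes far sooner on the 4-stage design (throughput).
print(total_time_ns(1000, 1, 4.0), total_time_ns(1000, 4, 4.0))
```

If both designs were forced to run at the same clock speed, the 4-stage one would only lose: each instruction would still need 4 cycles to come out the other end.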

Complicating the matter is the issue of branch prediction. You might have noticed in the above diagram that there is a stage labelled "BPU"; this stands for Branch Processing Unit, and it is used to evaluate branch statements (statements of or similar to the format "if X then Y, otherwise Z"). The BPU sends this conditional statement to the execution stage of the pipeline to be run - this obviously takes time, which is a problem... any cycles spent doing nothing are wasted time. Modern processors will actually guess which branch of the conditional will be taken and start running that branch's code before they know whether the guess is correct.

So, imagine that it's chosen to run Z in the above example, and say that you've just made your way 3 cycles into executing when you find out that the instruction actually branches to Y. Damn. That means that the pipeline has to be "flushed", i.e. all the instructions that were just executing need to be removed; they are wrong. This wastes a huge amount of time :(. This sort of situation is common in games, which involve a large amount of conditional branching that's difficult to predict.

How does this relate to the P4, Conroe and the Athlon 64? Well, Intel, when designing the Pentium 4, decided that they would have a very long (20-stage initially) pipeline - this gives very bad performance per clock, but it can hit stratospheric clock speeds when compared to shorter pipes. This was what NetBurst (the P4 archy) was all about: hitting high clock speeds. Intel hoped that their poor execution speed would be offset by the high CPU clocks. However, it wasn't optimal at all: branch prediction problems and "bubbles" in the pipeline - places where an instruction can't be scheduled to fit into it - meant that their idea wasn't as good as they hoped. You might have surmised that the longer the pipeline, the longer it takes to flush when you've got a mispredict (IIRC the branch misprediction penalty for Northwood is over 30 cycles) and the longer it takes for bubbles to propagate down it.

On the other hand, AMD (& the Pentium M/Conroe) use a shorter pipeline, in the region of 14-16 stages (can't remember exactly). This means that while you've got quite low clock speeds by comparison, you can do a lot more work per cycle than you can with a P4. The advantage here is that when the branch prediction gets it wrong, you waste many fewer cycles on flushing the pipe than you do with, say, a NetBurst chip.
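To see how the flush penalty eats into a clock speed advantage, here is a rough back-of-the-envelope model. Everything in it is an assumption for illustration: the 20% branch share, the 5% mispredict rate, the 12-cycle penalty for the short pipe and both clock speeds are invented; only the "about 30 cycles" penalty for the long pipe echoes the figure recalled above.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once mispredict flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


def ns_per_instruction(cpi, clock_ghz):
    """Average time per instruction; 1 GHz means 1 cycle per nanosecond."""
    return cpi / clock_ghz


# Hypothetical long pipeline: high clock, expensive flush.
long_pipe = effective_cpi(1.0, 0.20, 0.05, 30)
# Hypothetical short pipeline: lower clock, cheap flush.
short_pipe = effective_cpi(1.0, 0.20, 0.05, 12)

print("long pipe :", long_pipe, "CPI,", ns_per_instruction(long_pipe, 3.6), "ns/instr")
print("short pipe:", short_pipe, "CPI,", ns_per_instruction(short_pipe, 2.6), "ns/instr")
```

This only models the mispredict cost. It leaves out the other reasons a shorter pipe gets more done per cycle, so it should not be read as a prediction of which real chip wins. Raising `mispredict_rate` shows why branchy code such as games hurts the long pipeline more.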

(If you want to see a real pipeline, here's the 20-stage in early P4s. It's interesting to note that a few of the pipeline stages are simply called "Drive" - no work is done in these, but they are there just to allow signals to propagate across the chip. That's an extreme measure that Intel used to allow these high clock speeds.)
Quote:
Originally Posted by Solidus
cache mean :(
(again, highly simplified)

Cache is a lot simpler to explain than pipelines ;). Basically, your RAM is much slower than your CPU - much slower. Your processor can run at, say, 2.4GHz, but your RAM is stuck trundling away at 400-500MHz effective. This means that each time RAM is accessed by the processor, it effectively has to slow itself to the speed of your RAM... not good. To combat this, the CPU has 512K-4MB of ultra high-speed memory integrated into it, known as cache.

The whole reason that cache functions as it does is due to a concept known as locality of reference. Most of a program consists of repeated loops of running the same code on similar data over and over again. Cache takes advantage of this by storing the commonly used items in the high-speed memory on the CPU, guessing what will be needed next and loading it where appropriate. Hopefully, this means that the processor rarely needs to go out to RAM for a piece of info not in its cache; such misses are actually fairly rare (cache hits, i.e. finding the data in cache, occur >90% of the time).
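Locality of reference is easy to demonstrate with a toy model. The sketch below is a deliberate simplification: it treats the cache as a small least-recently-used store of whole "addresses", whereas a real CPU cache works on fixed-size lines and sets. The function name and the access patterns are invented for illustration.

```python
from collections import OrderedDict


def hit_rate(addresses, cache_lines):
    """Fraction of accesses served from a toy LRU cache with `cache_lines` slots."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True             # miss: fetch from "RAM" into cache
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / total if total else 0.0


# A loop that keeps touching the same 8 items: only the first pass misses.
looping = list(range(8)) * 100
# A stream that never revisits anything: the cache cannot help at all.
streaming = list(range(800))

print(hit_rate(looping, 16))
print(hit_rate(streaming, 16))
```

The looping pattern is the "same code on similar data over and over" case from the explanation above, which is why typical programs see such high hit rates.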

I could ramble on for ages about this, but I think this should sum it up. Hope this helps :)
Quote:
Originally Posted by specofdust
I would have thought a further decreased pipeline length would have made the Core Duo downright terrible at encoding videos, and I still find it strange that it's so good.
I imagine that the addition of SSE3 to the Duo would help a fair bit in these tasks. It is interesting, though - perhaps the disadvantage of the short pipeline is negated by the large amount of onboard cache, which is very useful for computationally predictable tasks.
Highland3r 20th May 2006, 11:09 Quote
Quote:
Originally Posted by specofdust
I was just going on what highland3r had said, which was that it was a P-M with a shorter pipeline(he did say afaik though, shoulda looked into it I guess).

Everything I knew about the P-M told me that it wasn't very good at encoding and linear tasks, due to such a short pipeline. I would have thought a further decreased pipeline length would have made the Core Duo downright terrible at encoding videos, and I still find it strange that it's so good.

Aye, sorry, wrong on that one. It seems (if sources are correct - Intel doesn't seem to have disclosed the length that I can find) that Dothan is 10-12 stages and Yonah will be 14.
fantastic dan 20th May 2006, 14:03 Quote
Hitman 012. That was very informative and helpful. Thanks for that.
freeloader 20th May 2006, 15:02 Quote
Hardly a "clobbering" by any stretch. It's a nice chip (performance per watt) but your comparison was hardly equal by any means. You had to overclock the bus to 800mhz. You should've used some type of divider to reach the 2.6ghz on the 667mhz default bus speed. Anyhow, differences break down like so...

Overclocked Core Duo 2.6ghz (800mhz FSB) vs. Stock Athlon FX60 (2.6ghz, 400mhz DDR)

Xvid encoding 11% faster than FX60
DVD Ripping 1% faster than FX60
MP3 encoding 22% in favor of Athlon FX60
Image Processing 6% faster than FX60
Quake 4@640 by 480 - Tie (less than 1 FPS)
Quake 4@1024x768 - 4% faster than FX60
Fear 640x480 19% faster than FX60
Fear 1280x960 - Tie
Far Cry 800x600 - 12% faster than FX60
Far Cry 1280x1024 - 15% faster than FX60

As I've said, hardly a clobbering. And for anyone who accuses me of being an AMD ***boy, I have both systems. An Athlon 64 3800X2 and a P4@3.6ghz. I buy what's best when it's time to purchase.
specofdust 20th May 2006, 15:07 Quote
You make a good point Freeloader, I mean we all know FX chips are born to clock. What'd be really interesting would be if we knew how far they were both, on average, going to clock, so that a comparison could be made at that level. I mean, I'm sure most people can probably take FX60's up to 2.8, if not 3.0Ghz. So it'll be interesting to see how average overclocks pan out.

Oh and interesting info there Highland3r. It did seem extremely odd it was so good at Xvidding :D
Bindibadgi 20th May 2006, 15:27 Quote
How are margins like 15% and 19% not a clobbering? Although the mp3 encoding is a serious weakness at 22% less. Consider that most stuff reviewed these days differs by single-digit percentages at most.

If you want to take clock averages you'd need a dozen motherboards and a tray's worth of chips, and then you'd need to take into account ram/psu/graphics. It's not an exact science and shouldn't be treated like one.
Tim S 20th May 2006, 15:28 Quote
Quote:
Originally Posted by freeloader
Hardly a "clobbering" by any stretch. It's a nice chip (performance per watt) but your comparison was hardly equal by any means. You had to overclock the bus to 800mhz. You should've used some type of divider to reach the 2.6ghz on the 667mhz default bus speed. Anyhow, differences break down like so...

Overclocked Core Duo 2.6ghz (800mhz FSB) vs. Stock Athlon FX60 (2.6ghz, 400mhz DDR)

Xvid encoding 11% faster than FX60
DVD Ripping 1% faster than FX60
MP3 encoding 22% in favor of Athlon FX60
Image Processing 6% faster than FX60
Quake 4@640 by 480 - Tie (less than 1 FPS)
Quake 4@1024x768 - 4% faster than FX60
Fear 640x480 19% faster than FX60
Fear 1280x960 - Tie
Far Cry 800x600 - 12% faster than FX60
Far Cry 1280x1024 - 15% faster than FX60

As I've said, hardly a clobbering. And for anyone who accuses me of being an AMD ***boy, I have both systems. An Athlon 64 3800X2 and a P4@3.6ghz. I buy what's best when it's time to purchase.
Hi there, thanks for posting.

The purpose of testing the Core Duo T2600 at 13x200MHz was to see how the architecture performed clock-for-clock against K8. Multipliers are locked on our retail Core Duo T2600, so the only way to reach 2.6GHz was to raise the bus speed. The T2600 at stock is comparable to Athlon 64 X2 4800+, but is set for a near-33% price reduction in just over a week when Intel announces the T2700.

It certainly doesn't beat the X2 4800+ into a pulp, but all things considered, I think it's a better chip. Can you achieve a 20% overclock on either X2 4800+ or FX-60 with a 0.0125V vCore increase with a 'stock' cooler, and can you achieve a 35% overclock with a 0.25V vCore increase on the same cooler?
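For anyone following the clock arithmetic, the figures quoted in this thread (13x multiplier, 667MHz default bus, 800MHz overclocked bus) line up as in the sketch below. It assumes the quoted bus speeds are the quad-pumped "effective" rates, so the actual base clock is a quarter of each; that detail and the function names are not stated in the posts.

```python
def core_clock_mhz(multiplier, base_clock_mhz):
    """Core clock = locked multiplier x base (bus) clock."""
    return multiplier * base_clock_mhz


def overclock_percent(stock_mhz, overclocked_mhz):
    """Overclock expressed as a percentage over stock."""
    return (overclocked_mhz / stock_mhz - 1) * 100


MULTIPLIER = 13
stock_base = 667 / 4   # assumed: 667MHz effective bus is quad-pumped
oc_base = 800 / 4      # 800MHz effective bus -> 200MHz base

stock_core = core_clock_mhz(MULTIPLIER, stock_base)
oc_core = core_clock_mhz(MULTIPLIER, oc_base)

print(stock_core, oc_core)
print(round(overclock_percent(stock_core, oc_core), 1))
```

The stock result comes out a hair above the official clock only because 667 is itself a rounded marketing number; the overclock works out to roughly the 20% mentioned above.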

Another thing worth considering is the power consumption - how much is this chip going to save you? According to TechReport's numbers, a Core Duo system (at stock) consumes under 60% of the power of the equivalent X2 4800+ system.... That's quite a massive difference to your electricity bill if you're using the system under load all of the time. At idle, the difference is closer to 70% - still a big difference for similar performance.
freeloader 20th May 2006, 15:39 Quote
The benchmark that has me really confused is the FAR CRY bench.
You'd think that as the resolution goes up, you'd be GPU limited and not CPU limited. But the benches clearly show the Core Duo's lead growing by 3 points at 1280x1024 (15%) vs 800x600 (12%). Now that's interesting.

Bindibadgi...the 22% you mentioned is in favor of the Athlon FX-60. I'll definitely give you the 15 and 19% in the game benches. That is impressive.

As far as electricity consumption goes, in my area it costs me $2.74/month to run my X2-3800@2.6ghz under full load (Folding@Home). So the Core Duo would probably only cost me about $1 or so less. Over one year that would come to about $16 on average. Not worth mentioning. Of course if you're running a data center or a cluster of some kind, then that difference would be huge. Chalk one up for the Core Duo.
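The yearly saving can be sanity-checked from the numbers already posted in this thread: the $2.74/month figure above and the "under 60% of the power" figure quoted from TechReport. The sketch assumes the bill scales linearly with power draw and that the 60% system-level ratio applies to this particular setup, neither of which is established here; the second call uses a flat $1/month running cost purely as a comparison point.

```python
def annual_saving(current_monthly_cost, new_monthly_cost):
    """Yearly saving from switching, given two monthly running costs."""
    return (current_monthly_cost - new_monthly_cost) * 12


X2_MONTHLY = 2.74  # quoted cost of the X2 under full load, in dollars

# Assumed: Core Duo system draws 60% of the X2 system's power.
print(round(annual_saving(X2_MONTHLY, X2_MONTHLY * 0.60), 2))
# Comparison point: a flat $1/month running cost.
print(round(annual_saving(X2_MONTHLY, 1.00), 2))
```

Either way the saving for a single desktop lands somewhere around $13 to $21 a year, the same ballpark as the rough $16 estimate.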
Bindibadgi 20th May 2006, 15:46 Quote
I read it wrongly, the 22% was my mistake.

Running a 7800GT at 1280x1024 with no AA or AF means the card only needs to generate more pixels rather than do anything different with them. There is the difference between max and min texture settings, which the game handles differently between CPU-RAM accesses and GPU texture accesses, but that's just memory bandwidth, which the Core and graphics cards have in spades, not GPU-intensive processing. You'd need 4xAA/16xAF to get anywhere near GPU limitation at that res. For games, if the GPU needs to access RAM on the Core it doesn't have to go through the CPU, unlike with AMD, although if the CPU needs memory access on the Core it has to go through the northbridge. Swings and roundabouts.

I've got to agree with you about the price difference for leccy. Leave a light on for a few hours and you've just wiped out any possible gain from buying a different CPU. It's just not worth it.
specofdust 20th May 2006, 16:35 Quote
Bigz, something I didn't really pick up on first time around, mainly because I was thinking "OMG FAST" and not really paying attention to the style, is how you're happy to use some fairly thick terms at times. This I think is great, it lets us get on with knowing the details of it all. The first page of the review was really quite in depth, but it told me so much about the thing. I really like that sorta..crammed with info thing. Keep it up :)

*not sure if that makes much sense, hoping you can extrapolate my intended meaning from that jumble of words.
sjprg 20th May 2006, 17:03 Quote
One of the tests I miss is a test of Adobe Camera Raw processing about 500 raw images from a High end DSLR. How fast can the board do what I want? How many Images per second? A really good test would be to process 100 images from an Aptos 75 camera.
Tim S 20th May 2006, 23:19 Quote
Quote:
Originally Posted by sjprg
One of the tests I miss is a test of Adobe Camera Raw processing about 500 raw images from a High end DSLR. How fast can the board do what I want? How many Images per second? A really good test would be to process 100 images from an Aptos 75 camera.
Hi there,

We are looking to add a Photoshop benchmark to our CPU testing suite soon - it will not be ready in time for the launch of AM2, but it will follow in reviews after that. We will also be including it in all motherboard reviews for the foreseeable future.