Anyone who payed attention at Uni (Computer Science) in the last decade will have been taught threading. Threading was pushed as the way to program. This meant that when we moved to multiple cores your programms would execute in parallel over them. At the time though threading introduced overheads in single CPU systems so this practice was often abandoned.
yes, it does seem that it simplifies the use of locks and semaphores, but like they said, it's something alot of people get wrong alot of the time. If these can be simplified, that means less code to write, and less time spent debugging to make sure your locks were done correctly. much welcomed imo
any indication of when we'll see this in VS? just VS2010 or will we see it in 2008 and perhaps 2005 as well?
why is this so hard? all embedded programmers such as all Electronic engineers in ECS, Southampton uni understands and knows how to implement concurrency on a hardware level.
software programmers are too used to sequentially writing their code. emply a couple embedded systems engineer they'd tell you how to take advantage of parallelism available in any embedded chip.
wuyanxu, there is a huge difference between the messing around Electronic Engineers do and what Software Engineers do...the two are not interchangeable, they are totally different skill sets.
Lets for arguments sake say that programmers did manage to find a way to write multi threaded programs without increasing the bug count through some technique. Its unlikely because we've been struggling with multicore programming for over 40 years but lets hope for a break through beyond languages such as say Erlang.
Even in this new utopia of computing marvel we have a problem. Not all algorithms behave well with multiple cores. The major algorithms fall into 3 general categories:
1) There are thankfully a set of embarrassingly parallel algorithms which will scale for the coming years to thousands of cores without loosing speed. They have almost no serial element at all, are driven entirely by CPU cycles applied and have no locks and just work well with multiple cores. One day these problems will struggle to fill the cores (screen drawing hits a problem once you have a CPU per pixel for instance) but that problem is a decade away at least.
2) A set of mostly parallel problems where something like 95% of the algorithm will run in parallel, but where the rest must be executed in a lock or atomically. But that means that 5% of the algorithms time has to run in a single thread and assuming an infinite number of cores the best speed up we could thus get is around 20x. 1000 cores wont make any performance difference to these problems, they are dominated by the serial part of the algorithm once you have enough cores. Early on we'll see the problem, 20 CPUs on this problem will only provide around a 10x speedup. Many of the programs we're seeing running on multiple cores today fall into this category,
3) Then there are a class of algorithms that naturally fall into a tree structure, such as say a large arithmetic problem. Right now they might be O(n) algorithms but with multiple cores they become log(n) algorithms. But that is also far from a linear speed up that we want from adding additional cores. These problems have diminishing parallelism such that the height of the tree dominates processing time assuming we have enough cores to cope with the maximum width of the problem. Don't worry about the details, but 1000 cores is only going to provide around 10x the performance of 1 core, and 2000 cores just 11x the performance. Serious diminishing returns on these algorithms and yet they will consume all those cores to achieve the speed up, but only for some of the time.
4) Algorithms that are simply serial and have no obvious parallel implementation at all.
In addition to all this the parallel algorithms cost more to run in general. On a single core they take considerably more time to execute and they are notoriously hard to debug. Multicore CPUs wont provide the necessary computing power going forward except on a subset of problems so it really will become a big challenge to keep making programs faster. We'll soon be measuring progress based on algorithm breakthroughs rather than on clock speed. Algorithm breakthroughs are rare and getting rarer. With clock speeds stagnated we're at the end of the ever increasing performance every 18 months. TM is not a silver bullet for this problem, it simply makes one small class of problems easier to program, and that although a problem isn't really the reason why multicore programming hasn't taken off. Its because many problems don't lend themselves to multicores as the algorithms don't get faster.
+rep BrightCandle, was easy to read your post, and well thought out.
Isn't Intel working on a way to combine core's to allow an optimized amount of cores to run threads so that 2 cores could be working on one thread, while 2 other cores run 1 thread each, depending on how much they load the cores are using?
I thought I heard the Intel rep saying this is what the I5, & I7 cpu's could do, while at CES.
Originally Posted by BrightCandle very interesting post
... is very interesting. rep++
It's all nice and shiny when it comes to multi-CPU/GPU algorithms marketing-wise but right now, as we all know, there aren't too many programs [other than encoding software and number crunchers] that REALLY benefit from having more cores.
Give BC's post a read, I think it pretty much sums it up nicely. ;)
What's new about locking critical code areas for concurrent CPU access? OpenMP btw does the same with the omp atomic (single statement) and omp criticial (multiple statement) pragmas. That's parallel programming basics.
BrightCandle, your post sounds all nice and dandy in theory, but there are a lot of real-word problems that lend themselves very well to parallelization, and often they are very simple to implement as parallel code. No big deal. You should be careful not to throw out the baby with the bath water.
2) A set of mostly parallel problems where something like 95% of the algorithm will run in parallel, but where the rest must be executed in a lock or atomically. But that means that 5% of the algorithms time has to run in a single thread and assuming an infinite number of cores the best speed up we could thus get is around 20x. 1000 cores wont make any performance difference to these problems, they are dominated by the serial part of the algorithm once you have enough cores. Early on we'll see the problem, 20 CPUs on this problem will only provide around a 10x speedup. Many of the programs we're seeing running on multiple cores today fall into this category,
The trick to beating Amdahl's law (what you describe in point 2) is to increase your problem size as the number of cores increases. You may not be able to do the same thing faster, but you can do more for no (or little) added cost. If your problem size scales, you will commonly see quite good parallel scalability (until you run out of memory ).
While many current applications will wilter in a massively-parallel environment, IMHO people aren't going to just lie back and say "oh well, too bad, this is a fast as things will ever be". If there's processing power to be used, they will use it, even if it means inventing new applications.
Regarding the article, transactional memory sounds like a useful model, but it doesn't exactly scream "scalability" to me. I'd like to say though that I'm really happy to see this kind of article on Bit ;). The whole desktop parallelism issue is rapidly becoming critical and it's great to see some discussion of the finer points.
Originally Posted by wuyanxu why is this so hard? all embedded programmers such as all Electronic engineers in ECS, Southampton uni understands and knows how to implement concurrency on a hardware level.
software programmers are too used to sequentially writing their code. emply a couple embedded systems engineer they'd tell you how to take advantage of parallelism available in any embedded chip.
Uh yeah, software engineers learn all that stuff, too. In practice, it is not easy to keep track of for anything other than trivial applications. Try writing a massive, distributed server application and handle all the concurrency with only locks and semaphores, see how far that gets you.
Having spent a great many years programming, I'm always amused by these people who want to "educate" programmers, or "do more research" into how to program when they themselves aren't programmers.
It's certainly difficult to do parallel programming. But, that's not why it isn't done more. Hard-core programmers dearly love a challenge, particularly if it involves some unusual problem solving. That's not the issue, though. The real issue is that many things simply can't be done in parallel. You need the results of the previous operation in order to proceed with the next operation. There's simply no way to start on the next operation otherwise. The vast bulk of problem resolution is a serial process. If you need to find E and it depends on finding what D-C is, and you don't know what C is until you subtract B from A, then you're screwed. E is going to have to wait until the first couple operations are done, and they'll have to be done one after another. I don't care how many processors you have, if you don't know what A-B is, you won't know C, and until you know that, you can't subtract C from D, so you can't find E. Looks like three operations, and it is, but you can't save any time trying to use three processors to complete those operations. In fact, it'll probably COST you time.
But, let the wishful thinkers throw money at the problem. The programmers will make a lot of money trying to show them the obvious until it finally dawns on them.
That was the sound of this all going over my head!
This is all far too technical for me but if I could just say that the very fact some very intelligent people seem to be "working" very hard to unlock the full potential of multicore cpu's, whilst playing Quake. I think they deserve all the kudos in the world for getting Microsoft to pay for that.
Originally Posted by karx11erx What's new about locking critical code areas for concurrent CPU access? OpenMP btw does the same with the omp atomic (single statement) and omp criticial (multiple statement) pragmas. That's parallel programming basics.
BrightCandle, your post sounds all nice and dandy in theory, but there are a lot of real-word problems that lend themselves very well to parallelization, and often they are very simple to implement as parallel code. No big deal. You should be careful not to throw out the baby with the bath water.
+1
Most points raised in the thread are valid, but horses for courses, we may as well be arguing that blue is more useful than red or yellow. Any hardware level assistance that gets us there quicker is welcome in my book, even if it's often just a case of every service using it's own core (and of course we have shared memory bandwidth to worry about, but that's another challenge.....). Just as hardware advances, programming techniques need to as well. We're looking forward, here, not back.
@crazyceo, maybe they have an odd crush on miss hoolie from balamory! It's not as bad as a pink Range Rover (there's one of those driving around my county, & that's just SICK...big man-car in girly pink, it's just not right)
Comments 1 to 24 of 24
ReplyFrom what I can see, all this does is simplify the use of locks? I'd still prefer to use locks over this, you have more control!
Shift the burden to the software guys and make life easier for the engineers and fab plants.
any indication of when we'll see this in VS? just VS2010 or will we see it in 2008 and perhaps 2005 as well?
amen to that o_0
threatening to taking over the world
Rather than writing a instructions that are
"thread' - single or double quotes, choose your style and stick to it.
This requires further steps need to be taken
software programmers are too used to sequentially writing their code. emply a couple embedded systems engineer they'd tell you how to take advantage of parallelism available in any embedded chip.
It's top desing Microsoft style ^^
Even in this new utopia of computing marvel we have a problem. Not all algorithms behave well with multiple cores. The major algorithms fall into 3 general categories:
1) There are thankfully a set of embarrassingly parallel algorithms which will scale for the coming years to thousands of cores without loosing speed. They have almost no serial element at all, are driven entirely by CPU cycles applied and have no locks and just work well with multiple cores. One day these problems will struggle to fill the cores (screen drawing hits a problem once you have a CPU per pixel for instance) but that problem is a decade away at least.
2) A set of mostly parallel problems where something like 95% of the algorithm will run in parallel, but where the rest must be executed in a lock or atomically. But that means that 5% of the algorithms time has to run in a single thread and assuming an infinite number of cores the best speed up we could thus get is around 20x. 1000 cores wont make any performance difference to these problems, they are dominated by the serial part of the algorithm once you have enough cores. Early on we'll see the problem, 20 CPUs on this problem will only provide around a 10x speedup. Many of the programs we're seeing running on multiple cores today fall into this category,
3) Then there are a class of algorithms that naturally fall into a tree structure, such as say a large arithmetic problem. Right now they might be O(n) algorithms but with multiple cores they become log(n) algorithms. But that is also far from a linear speed up that we want from adding additional cores. These problems have diminishing parallelism such that the height of the tree dominates processing time assuming we have enough cores to cope with the maximum width of the problem. Don't worry about the details, but 1000 cores is only going to provide around 10x the performance of 1 core, and 2000 cores just 11x the performance. Serious diminishing returns on these algorithms and yet they will consume all those cores to achieve the speed up, but only for some of the time.
4) Algorithms that are simply serial and have no obvious parallel implementation at all.
In addition to all this the parallel algorithms cost more to run in general. On a single core they take considerably more time to execute and they are notoriously hard to debug. Multicore CPUs wont provide the necessary computing power going forward except on a subset of problems so it really will become a big challenge to keep making programs faster. We'll soon be measuring progress based on algorithm breakthroughs rather than on clock speed. Algorithm breakthroughs are rare and getting rarer. With clock speeds stagnated we're at the end of the ever increasing performance every 18 months. TM is not a silver bullet for this problem, it simply makes one small class of problems easier to program, and that although a problem isn't really the reason why multicore programming hasn't taken off. Its because many problems don't lend themselves to multicores as the algorithms don't get faster.
A lot of us do know how to write highly-threaded applications and have done for many years.
Isn't Intel working on a way to combine core's to allow an optimized amount of cores to run threads so that 2 cores could be working on one thread, while 2 other cores run 1 thread each, depending on how much they load the cores are using?
I thought I heard the Intel rep saying this is what the I5, & I7 cpu's could do, while at CES.
There is no editing or proofreading, period.
It's all nice and shiny when it comes to multi-CPU/GPU algorithms marketing-wise but right now, as we all know, there aren't too many programs [other than encoding software and number crunchers] that REALLY benefit from having more cores.
Give BC's post a read, I think it pretty much sums it up nicely. ;)
BrightCandle, your post sounds all nice and dandy in theory, but there are a lot of real-word problems that lend themselves very well to parallelization, and often they are very simple to implement as parallel code. No big deal. You should be careful not to throw out the baby with the bath water.
The trick to beating Amdahl's law (what you describe in point 2) is to increase your problem size as the number of cores increases. You may not be able to do the same thing faster, but you can do more for no (or little) added cost. If your problem size scales, you will commonly see quite good parallel scalability (until you run out of memory ).
While many current applications will wilter in a massively-parallel environment, IMHO people aren't going to just lie back and say "oh well, too bad, this is a fast as things will ever be". If there's processing power to be used, they will use it, even if it means inventing new applications.
Regarding the article, transactional memory sounds like a useful model, but it doesn't exactly scream "scalability" to me. I'd like to say though that I'm really happy to see this kind of article on Bit ;). The whole desktop parallelism issue is rapidly becoming critical and it's great to see some discussion of the finer points.
It's certainly difficult to do parallel programming. But, that's not why it isn't done more. Hard-core programmers dearly love a challenge, particularly if it involves some unusual problem solving. That's not the issue, though. The real issue is that many things simply can't be done in parallel. You need the results of the previous operation in order to proceed with the next operation. There's simply no way to start on the next operation otherwise. The vast bulk of problem resolution is a serial process. If you need to find E and it depends on finding what D-C is, and you don't know what C is until you subtract B from A, then you're screwed. E is going to have to wait until the first couple operations are done, and they'll have to be done one after another. I don't care how many processors you have, if you don't know what A-B is, you won't know C, and until you know that, you can't subtract C from D, so you can't find E. Looks like three operations, and it is, but you can't save any time trying to use three processors to complete those operations. In fact, it'll probably COST you time.
But, let the wishful thinkers throw money at the problem. The programmers will make a lot of money trying to show them the obvious until it finally dawns on them.
That was the sound of this all going over my head!
This is all far too technical for me but if I could just say that the very fact some very intelligent people seem to be "working" very hard to unlock the full potential of multicore cpu's, whilst playing Quake. I think they deserve all the kudos in the world for getting Microsoft to pay for that.
And guys, who the hell paints a building pink?
+1
Most points raised in the thread are valid, but horses for courses, we may as well be arguing that blue is more useful than red or yellow. Any hardware level assistance that gets us there quicker is welcome in my book, even if it's often just a case of every service using it's own core (and of course we have shared memory bandwidth to worry about, but that's another challenge.....). Just as hardware advances, programming techniques need to as well. We're looking forward, here, not back.
@crazyceo, maybe they have an odd crush on miss hoolie from balamory! It's not as bad as a pink Range Rover (there's one of those driving around my county, & that's just SICK...big man-car in girly pink, it's just not right)
-
« Previous
-
1
-
Next »
Discuss in the forums