In a keynote presentation at this week's IEEE Hot Chips Conference at
Stanford University, Rattner noted that designers must deal with complex memory
hierarchies and sophisticated on-chip interconnect fabrics to ensure the cores
are not data starved. At the same time, the processor must provide explicit
thread support and deal with time-critical functions, as well as include
fixed-function accelerators.
For a four-core system, the Gaston algorithm provides five times the
throughput of the gSpan algorithm, but it slows as the number of cores increases
beyond four. The gSpan algorithm is more scalable and provides higher
performance than the Gaston algorithm as the number of cores increases.
But algorithms that can leverage the cores and hardware threading are only
the starting point, noted Rattner. Improving the cache architecture of the system
can also boost throughput by a factor of two, he added. Tuning the instruction set
enables designers to further improve throughput.
Amdahl observed that there is no significant gain beyond 10 parallel cores (and typically beyond 4 cores). Gustafson did show that gains beyond 10x are possible (an actual 250x), but for general-purpose computing Amdahl's observation still holds true!
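To make the comparison between the two laws concrete, here is a minimal sketch in C that evaluates the textbook forms of Amdahl's law, 1 / ((1 - p) + p / n), and Gustafson's law, (1 - p) + p * n. The function names and the parallel fraction p = 0.9 used in the usage note are assumptions chosen for illustration; they are not figures from the keynote or from the comment above.

```c
#include <assert.h>
#include <stdio.h>

/* Amdahl's law: speedup of a fixed-size workload on n cores, where p is
 * the fraction of the work that can run in parallel (0 <= p <= 1). */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Gustafson's law: scaled speedup when the problem size grows with the
 * number of cores, so the parallel part expands to fill n cores. */
double gustafson_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return (1.0 - p) + p * (double)n;
}

/* Print both predictions side by side for a few core counts.
 * The core counts are illustrative, not measured configurations. */
void print_speedups(double p)
{
    static const int cores[] = {1, 2, 4, 10, 100, 1000};
    size_t count = sizeof cores / sizeof cores[0];

    printf("cores  amdahl  gustafson\n");
    for (size_t i = 0; i < count; i++) {
        printf("%5d  %6.2f  %9.2f\n", cores[i],
               amdahl_speedup(p, cores[i]),
               gustafson_speedup(p, cores[i]));
    }
}
```

Calling print_speedups(0.9) from a main function shows the pattern: under Amdahl's law the speedup can never exceed 1 / (1 - p), which is 10 for p = 0.9, no matter how many cores are added, while Gustafson's scaled speedup keeps growing with the core count. Which model fits better depends on whether the workload stays fixed or grows with the machine.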
Simple source code changes can often result in substantial performance
enhancements using modern optimizing compilers on high-end embedded
processors. But why is performance necessary? After all, the
capabilities of modern microprocessors dwarf the capabilities of
1980-era supercomputers.
First, the average-case response time of a real-time system is
irrelevant. It is the worst-case response time that guards against a
dropped call, a misprint, or other error conditions in a product. In
other words, performance is necessary to minimize the response time of
one’s system, thereby achieving product reliability.
Second, increased performance allows for the implementation of more
features without compromising the integrity of the system. Conversely,
performance might be used to select a cheaper microprocessor or one
that consumes less power.
This paper has discussed some basic concepts of cache operation and
organization and presented some software examples to demonstrate how an
awareness of the operation of the cache on your system can help improve
software performance.
The use of cache in computer systems can dramatically improve system performance. Increasing cache size is a cost-effective way to improve performance on microprocessors as the transistor count on the chip increases. While the operation of cache is generally transparent to the programmer, some issues can arise that influence a program’s performance.
This paper will look at some of the issues that affect single-processor system performance arising from the tendency of programs to refer to data and instructions in memory locations close to previously accessed locations. This tendency is called Locality of Reference and is the basic property that allows cache to improve processor performance.
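As an illustration of locality of reference, the following C sketch sums the same two-dimensional array in two loop orders. It is not taken from the paper: the array size N, the function names, and the timing helper are hypothetical and chosen only for this example. Because C stores arrays row by row, the row-major loop touches consecutive addresses, while the column-major loop strides N * sizeof(int) bytes on every access.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

#define N 1024  /* hypothetical matrix dimension, chosen for illustration */

static int grid[N][N];  /* C lays this out row by row (row-major order) */

/* Fill every element with 1 so the expected sum is simply N * N. */
void fill_grid(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            grid[i][j] = 1;
}

/* Inner loop walks consecutive addresses: good spatial locality. */
long sum_row_major(void)
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += grid[i][j];
    return sum;
}

/* Inner loop jumps a full row per step: poor spatial locality. */
long sum_col_major(void)
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += grid[i][j];
    return sum;
}

/* Run one traversal, store its sum in *result, and return the CPU time
 * it took in seconds, measured with the standard clock() function. */
double time_traversal(long (*traverse)(void), long *result)
{
    assert(traverse != NULL && result != NULL);
    clock_t start = clock();
    *result = traverse();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Print the sum and CPU time of each traversal order. */
void report(void)
{
    long sum = 0;
    fill_grid();
    double row_time = time_traversal(sum_row_major, &sum);
    printf("row-major:    sum=%ld  %.6f s\n", sum, row_time);
    double col_time = time_traversal(sum_col_major, &sum);
    printf("column-major: sum=%ld  %.6f s\n", sum, col_time);
}
```

Both functions return the same sum; only the order of memory accesses differs. On a typical cached system the row-major version tends to run faster, but the size of the gap depends on the cache line size, the cache capacity, and the compiler, which may reorder the loops itself at higher optimization levels, so any timings should be measured on the target rather than assumed.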
Recently I got the chance to implement an ANSI C program based on dynamic memory allocation, and since I am a performance freak, I ended up writing the same program in three styles, with performance ranging from 1X to 5X. Currently, I am working on an article that shows various (simplified) coding styles for dynamic memory allocation, with a performance evaluation of each! Stay tuned. Ideas are always welcome!