James Bulpin
I attended ISCA primarily to present some of my Hyper-Threading measurement work at the Workshop on Duplicating, Deconstructing and Debunking. I also attended the main conference.
The conference contained a diverse range of work on architectures and related fields. There were a few themes. Firstly, power consumption: a keynote from IBM and several of the papers argued the importance of energy efficient design. Work included efforts to reduce the complexity and therefore cost of architectural components. Secondly, throughput: Sun advocated high machine throughput through combined CMP and SMT. Other papers presented performance improvements in macro- and microarchitectures. There was some work on evaluation with some silicon realisations of previous work being measured and some work on better design-stage performance methodologies. Memory consistency was covered with work on memory ordering checking and simplification using a transactional memory interface. A theme that came up in my talk, one of the keynotes and a couple of the papers was the greater stressing of the cache hierarchy by multithreaded and multicore processors. Some work on cache compression and prefetching was presented, designed to counter the increasing cache use.
This half-day workshop was where I presented some of my Hyper-Threading measurement work. The workshop was split into two sessions: the first was on simulation; the second, which included my paper, was a more general session on multithreading and parallel architectures.
The simulation session had two papers on reducing simulation time by using workload reduction through modelling and temporal sampling respectively and a paper advocating the use of a common framework and library for simulators enabling repetition of (other people's) work. The organisers decided to supplement the presentations with a panel of simulation experts. The discussions were polite and the participants mostly in agreement.
The workload reduction work essentially boils down an application workload to a smaller, representative synthetic workload. It was suggested that this works well but only for easy-to-characterise workloads (e.g. SPEC Int). The sampling work highlighted two different approaches to selecting samples: statistical, and selective samples based on identified phases of execution (stratified). The main argument was that statistical sampling is easy and can provide confidence numbers but at the expense of duplication of similar execution sequences. Stratification avoids duplication but is costly to set up and is less accurate. The paper proposed combining both methods to get the best of both worlds.
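To make the two sample-selection approaches concrete, here is a minimal sketch, not the method from the paper. It assumes a made-up trace of per-interval CPI values with two already-labelled phases; the trace, the phase labels and all function names are hypothetical.

```python
import random
import statistics

def make_trace(seed=0):
    """Hypothetical trace: a long low-CPI phase "A" then a short high-CPI phase "B"."""
    rng = random.Random(seed)
    trace = [("A", 1.0 + rng.uniform(-0.05, 0.05)) for _ in range(900)]
    trace += [("B", 3.0 + rng.uniform(-0.05, 0.05)) for _ in range(100)]
    return trace

def random_sample_estimate(trace, n, seed=1):
    """Statistical sampling: pick n intervals uniformly at random.

    Returns the sample mean and its standard error (the "confidence number").
    """
    rng = random.Random(seed)
    cpis = [cpi for _, cpi in rng.sample(trace, n)]
    return statistics.mean(cpis), statistics.stdev(cpis) / n ** 0.5

def stratified_estimate(trace, per_phase, seed=1):
    """Stratified sampling: sample each phase separately, weight by phase length."""
    rng = random.Random(seed)
    phases = {}
    for phase, cpi in trace:
        phases.setdefault(phase, []).append(cpi)
    total = len(trace)
    return sum(
        len(values) / total * statistics.mean(rng.sample(values, per_phase))
        for values in phases.values()
    )

trace = make_trace()
true_mean = statistics.mean(cpi for _, cpi in trace)
rand_mean, rand_se = random_sample_estimate(trace, 50)   # 50 simulated intervals
strat_mean = stratified_estimate(trace, 5)               # only 10 simulated intervals
```

In this toy setting the stratified estimate needs far fewer simulated intervals because each phase is internally uniform, but it relies on the phases having been identified beforehand, which is the set-up cost mentioned above.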
The paper on a simulator library and disclosure was by Paris/INRIA who are building and releasing a library. To illustrate the library they reverse engineered several data cache tweaks from published papers and compared them to each other directly. They advocated the use of their library for all future simulation work with authors putting their modules back into the library. There was some worry that everyone using the same library would reduce the gene pool somewhat. It was suggested that a formal specification language would be the best way of providing repeatability but would probably not be popular in the community.
The second session contained a more diverse range of papers so there was no panel but plenty of audience interaction.
An antipodean collaboration of Cantabria and Adelaide presented a debunking/duplication of an old piece of work on on-chip interconnects which, at the time of publication, was received with scepticism because of its large performance claims. The problem turned out to be one of communication and misunderstanding: the authors had designed a half-duplex network rather than the assumed full-duplex one. The workshop paper reimplemented the mechanism and measured it using both half- and full-duplex versions. They found that half-duplex communications gave a better system throughput due to not wasting allocated capacity on the return half of the full-duplex channel for the parts of the conversation when only one end was transmitting.
New York University debunked their own work on memory access combining switches in a multicore, multi-memory processor. They found a bug in their original implementation. When fixed, they managed to improve the performance.
My talk on the multiprogramming performance of Hyper-Threading seemed to go down well. Whilst most of the audience were familiar with SMT, they hadn't thought much about these issues. One of the organisers was from Intel; he gave me some useful insight which backed up some of my observations. An IBM researcher was keen for me to try the same experiments on the Power5 but was sadly unable to offer an equipment loan/donation.
There were plenty of Wisconsin and Illinois (two big players in the field) talks with all the usual suspects (MIT, Stanford, etc.) represented. The commercial world was represented with work and/or keynotes from Intel, IBM and Sun.
The conference had two keynotes. The first on day one was given by Tilak Agerwala of IBM. His main focus was on how important power is in architecture design and evaluation. It is no longer sufficient to just go for pure performance as we are on the limits of air cooled systems. He gave several examples where having a fixed power budget would favour the use of lower raw performance features such as shorter pipelines and narrow issue width due to the power:performance ratio at that power.
He suggested that frequency scaling is about to slow down to half the current rate due to CMOS limitations. If we are to maintain the current rate of _system_ performance growth then we have to find growth in other areas. He was advocating the benefits of "scale out", i.e. lots of smaller computing units connected in some manner, instead of "scale up", making bigger versions. He used the IBM Blue Gene as an example.
The first session was on architecture evaluation. Two papers were on evaluating research processors that have been fabricated in silicon. The third paper evaluated an IA64 compiler in terms of its planned IPC versus its realised IPC.
MIT, with their usual media-friendly sparkle, presented an evaluation of their "RAW" tiled processor. This processor distributes a large number of ALUs and registers across a number of identical "tiles" connected by a Manhattan-style routing network to forward registers to where they are needed. The caches are also distributed across the tiles. The compiler exploits locality to minimise the amount of forwarding necessary. The architecture is like a microarchitectural NUMA. The design is a few years old now and they have actually fabricated the chip. The presented work was evaluating RAW against the Pentium 3. Despite having an actual working chip, motherboard and complete system, they evaluated against a RAW simulator validated against the actual hardware. They probably chose to do this to avoid system slowdown due to their prototype chipset which is built on FPGAs. They received a lot of criticism and sarcasm for this. They report that all benchmarks used, with the exception of a few with very low native IPC, experienced a high speedup compared to the P3. There was some criticism that they had not tried any full applications, instead focusing on benchmarks with small kernels.
A rather nervous and hard-to-understand Stanford PhD student presented an evaluation of their "Imagine" media stream processor. A "stream" is a sequence of data records of the same type. Stream processing proceeds by executing "kernels" which perform the same small operation on each record in a stream to produce an output stream. They measured the fabricated version and the simulated version and observed a small error (6%). They did use actual applications (MPEG encoding) but didn't compare themselves against things like specialised media CPUs. The talk went on for a little long and most of the audience switched off part way through.
The Illinois IMPACT compiler group have been looking at EPIC compilation for a long time. The work presented introduced some of the mechanisms they use to reduce branches and avoid code expansion when using techniques such as inlining. They used SPEC INT2000 to see where the time was being lost and what speedup various optimisations gained. They compared the planned speedups as expected by the compiler (based on the exposed ILP) against the realised IPC running on actual hardware. On average they saw about half the expected IPC which is apparently what one would expect. This is due to dynamic effects such as cache misses.
The rest of the conference was split into pairs of parallel sessions. I've described the sessions I attended and given a brief overview of the other sessions based on the papers.
The session on parallelism in microarchitectures covered topics from SMT to vector architectures.
Work from Purdue focussed on wire delays in superscalar and SMT processors. Wire delays do not improve as quickly as transistor delays. Future superscalar pipeline depth will be limited by wire delays but SMT processors are more tolerant of wire delays so can have deeper pipelines while still maintaining Moore's law performance increase for the time being. The problem of bandwidth to RAM- and CAM-based processor resources still remains though. They proposed a mechanism for pipelining RAM in order to offset the long wire delays.
A Dean Tullsen student argued that heterogeneous multicore chips can be better than plain multicore. The idea is that the chip area used for a generation n processor is several times that required for a generation n-1 core but the speedup is not commensurate. He showed simulated results that demonstrated that when there are more runnable threads than homogeneous cores, the heterogeneous approach works well due to having more cores. They rely on the fact that some workloads do not get much benefit from newer generation cores over older ones so these can be allocated to the other cores. They discuss the difficulties with job scheduling and mention how the added dimension of multithreaded cores makes the problem very hard. They suggest some heuristics and conclude that the problem is hard. They didn't address issues of memory bandwidth requirements or the quadratically increasing cost of the memory switch needed for more cores.
Sun introduced the term MLP (memory level parallelism) which expresses the number of simultaneous outstanding memory accesses. They discussed issues that limit the MLP that can be exploited. These include the size of the issue window and re-order buffer and the ordering and serialisation constraints. They evaluated how a particular tweak, runahead execution, would impact MLP. Runahead execution is where, once the ROB is full, the fetching continues and loads are pulled out and executed as prefetches. They saw large MLP increases which translated to large throughput improvements for database workloads.
The memory consistency session had papers aiming to simplify, reduce the implementation cost of, and test memory ordering guarantee hardware.
A paper from Wisconsin noted that current speculative superscalar processors have to maintain potentially large load and store queues in order to check for memory ordering violations due to speculative, out-of-order execution. The authors argue that this fully associative structure does not scale well in power, area or speed as more accesses are simultaneously in flight. They propose removing the structure entirely. This would leave speculative loads unchecked so they suggest that all loads are executed again, in program order, just before retirement and the fresh value is checked with the original value. Any mismatch would cause execution to roll-back to the mis-speculated load. They do need to store the original value with the load to facilitate this comparison; this is effectively a distributed load queue but need not be fully associative. They show that their scheme has little impact on performance as the load-store queue grows.
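The retirement-time check can be sketched in a few lines. This is only an illustration of the idea as described above, not the authors' design: memory is modelled as a dictionary, each in-flight load carries the value it saw when it executed speculatively, and the function name and data layout are hypothetical.

```python
def first_misspeculated_load(memory, speculative_loads):
    """Re-execute loads in program order just before retirement.

    memory: dict mapping address -> current (architecturally correct) value.
    speculative_loads: list of (address, value_seen_when_executed_early),
        in program order.
    Returns the index of the first load whose fresh value differs from the
    value it originally saw (execution would roll back to it), or None if
    every load can retire.
    """
    for index, (address, seen) in enumerate(speculative_loads):
        fresh = memory.get(address, 0)
        if fresh != seen:
            return index
    return None

# A load of 0x10 ran ahead of an older store that later wrote 7 there,
# so it saw the stale value 0 and must be squashed.
stale = first_misspeculated_load({0x10: 7}, [(0x10, 0)])
# The same load executed after the store sees 7 and retires normally.
clean = first_misspeculated_load({0x10: 7}, [(0x10, 7)])
```

Note that the check compares values rather than addresses, so no associative search over in-flight accesses is needed; the price is a second cache access per load.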
Stanford approach the problem of memory consistency by arguing that the current memory ordering models are too complex and lead to the use of costly software synchronisation mechanisms. They propose a fully transactional memory system exploiting high interconnect bandwidths available on current and forthcoming SMP and CMP systems. The programmer annotates the start and end of the transaction. On transaction start the processor makes a register checkpoint using (e.g.) flash copying of the register file. During the transaction, loads from the local cache cause that line to be marked with a 'read' bit. If a line needs to be evicted the use of the value must be remembered so a metadata-only victim cache is added. Stores are buffered locally until commit. On commit the processor broadcasts (atomically) the stores. Other nodes snoop the interconnect and check for violations of cached values marked as 'read'. A rollback will be initiated if necessary. This mechanism is expensive in terms of store buffers and interconnect bandwidth but the authors argue that the cost is justifiable in the context of forthcoming CMP systems and the gain in performance and programming simplicity is worthwhile.
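The commit and snoop steps described above can be modelled roughly as follows. This is a toy sketch under stated assumptions, not the Stanford design: the `Node` class and its method names are hypothetical, a shared dictionary stands in for memory, tracking is per address rather than per cache line, and the register checkpoint and victim cache are not modelled.

```python
class Node:
    """One processor taking part in a toy transactional memory system."""

    def __init__(self, name, memory):
        self.name = name
        self.memory = memory      # shared dict standing in for main memory
        self.read_set = set()     # addresses that carry the 'read' bit
        self.store_buffer = {}    # stores held locally until commit
        self.rollbacks = 0

    def begin(self):
        """Start (or restart) a transaction with empty speculative state."""
        self.read_set.clear()
        self.store_buffer.clear()

    def load(self, address):
        if address in self.store_buffer:      # forward from our own buffered store
            return self.store_buffer[address]
        self.read_set.add(address)            # mark as read in this transaction
        return self.memory.get(address, 0)

    def store(self, address, value):
        self.store_buffer[address] = value    # invisible to others until commit

    def snoop(self, written_addresses):
        """Called when another node commits: roll back if we read what it wrote."""
        if self.read_set & written_addresses:
            self.rollbacks += 1
            self.begin()

    def commit(self, others):
        """Publish all buffered stores at once and let other nodes check them."""
        written = set(self.store_buffer)
        self.memory.update(self.store_buffer)
        for node in others:
            node.snoop(written)
        self.begin()

memory = {}
a, b = Node("a", memory), Node("b", memory)
a.begin()
b.begin()
b.load(1)          # b reads address 1 inside its transaction
a.store(1, 42)     # a writes the same address, buffered locally
a.commit([b])      # the broadcast violates b's read, so b rolls back
```

After the commit, `memory[1]` holds 42 and `b.rollbacks` is 1; a node that had read only unrelated addresses would be unaffected.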
A paper from Sun presents a tool to check memory systems using the total store order (TSO) consistency model. Their algorithm runs in polynomial time (the problem is NP-complete) and covers many possible faults. They report having found bugs in both pre- and post-silicon commercial designs.
The first day finished with a lively panel session on "tiled architectures" (processors built as grids of smaller computational elements with a communications matrix). They were discussing whether it is worth trying to support instruction level parallelism or not. The main arguments were that ILP is expensive to extract but the gains are not massive (e.g. Pentium 4) but that many workloads need this type of support to get good performance. The non-ILP camp was arguing that the gains from ILP are small and stream workloads (media stuff generally) show much higher gains on tiled architectures than ILP-type workloads. They say that tiled systems are meant to replace ASICs (often in media scenarios) rather than general purpose processors.
The second keynote was from Marc Tremblay of Sun. He was talking on "throughput computing", arguing that we do not need to go for either ILP or TLP but can have both by using multithreaded-multicore processors, or CMT in Sun-speak. He noted that SMT, CMP and CMT all place more demand on the caches and we can't scale the caches up sufficiently so in effect we are taking a step backwards in cache size. He used the forum to disclose a feature in development for the future Sun "Rock" processor. "Hardware Scout" is a speculative thread that is spawned when a main thread stalls on an L2 or lower cache miss. The thread continues into the program stream speculatively, warming the branch prediction and data caches. He suggests that this mechanism is sufficient to buy back the cache performance lost by using CMT.
I attended the session on IO and interconnects:
Wisconsin presented a mechanism for caches within RAID controllers to model the file system block cache in order to turn the current inclusive caching into a less wasteful exclusive model. Their main contribution over previous work was to not change the interface to the RAID controller, instead having the controller infer the FS cache contents.
Andrew West, a CUCL Rainbow Group RA, presented some of Robert Mullins' work on on-chip routing networks (for e.g. tiled architectures). They have a standard Manhattan-style network with a router node at each intersection. To reduce blocking on input queues to the routers they use the idea of "virtual channels" which splits incoming traffic into different lanes depending on where it's going (much like a left turn filter lane at traffic lights). The problem with previous work has been the latency of the control plane. They propose a speculative precomputation of the control logic with the ability to abort if it goes wrong.
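The blocking that virtual channels avoid can be shown with a toy model of one router input. This is only a sketch of the general idea, not the router presented: packets are represented by their destination direction alone, timing is ignored, and both function names are hypothetical.

```python
from collections import deque

def forward_single_fifo(packets, blocked_outputs):
    """One input FIFO: only the head packet may leave.

    packets: destination directions in arrival order, e.g. ["E", "N"].
    Returns the packets that can make progress before the queue stalls.
    """
    queue = deque(packets)
    sent = []
    while queue and queue[0] not in blocked_outputs:
        sent.append(queue.popleft())
    return sent

def forward_virtual_channels(packets, blocked_outputs):
    """One lane per destination: a blocked lane does not stall the others."""
    lanes = {}
    for destination in packets:
        lanes.setdefault(destination, deque()).append(destination)
    sent = []
    for destination, lane in lanes.items():
        if destination not in blocked_outputs:
            sent.extend(lane)
    return sent

arrivals = ["E", "N", "N", "W"]
blocked = {"E"}                                             # east output is busy
fifo_progress = forward_single_fifo(arrivals, blocked)      # nothing gets through
vc_progress = forward_virtual_channels(arrivals, blocked)   # N, N and W proceed
```

With a single queue the east-bound packet at the head holds up everything behind it, whereas the per-destination lanes let the north- and west-bound packets through, like cars passing a queue waiting in the filter lane.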
Meanwhile in the other session on power and energy, the following ideas were being presented. One paper suggests that most energy saving architecture mechanisms are heuristics based and ad-hoc and require a lot of tuning. The paper proposes a formal approach which can be mechanically optimised to avoid the large manual tuning effort normally employed. The next paper presented a tile-based architecture which grouped columns of tiles into independent clock domains allowing independent frequency scaling to minimise energy requirements. The final paper in the session was from Intel's Israel lab and suggested that trace optimisation could be performed selectively on commonly executed traces. They focus on energy reduction as well as performance increase.
The annual awards were presented before the lunch to which we mere students were not invited [Shame -- REJ]. After the obligatory city promotion by the deputy Mayor, the awards were presented to Kourosh Gharachorloo (Maurice Wilkes award; his work on memory consistency), Fred Brooks [in absentia] (Eckert-Mauchly award; for defining the field, inventing most stuff that matters within it, and his continued work in it) and Przybylski, Horowitz and Hennessy [in absentia] (most influential ISCA paper from 15 years back: "Characteristics of Performance-Optimal Multi-Level Cache Hierarchies"). IEEE Fellowships were awarded to Joel Emer, Guri Sohi, Josep Torrellas and David Wood.
The session on compression and debugging contained, unsurprisingly, a paper on compression (of cache contents) and one on debugging:
In an effort to effectively increase the size of the L2 cache, the University of Wisconsin presented a compressed cache scheme. They observed that previous designs work in some cases but the added latency of decompression hurts some workloads. They proposed an adaptive design that left data uncompressed if that would perform better. They used past behaviour to decide on future policy. They see good results. Obviously, cache compression cannot do anything for compulsory misses.
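One simple way to use past behaviour to decide on future policy is a saturating counter that weighs the cycles compression has saved against the cycles it has cost. The sketch below assumes that mechanism purely for illustration; the talk summary above does not say how the decision is actually made, and the class, method names and limit are hypothetical.

```python
class AdaptiveCompressionPolicy:
    """Toy policy: compress new lines only while compression has been a net win."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.counter = 0    # positive: compression has helped overall

    def record_avoided_miss(self, miss_penalty):
        """A hit that only happened thanks to the extra effective capacity."""
        self.counter = min(self.limit, self.counter + miss_penalty)

    def record_penalised_hit(self, decompression_latency):
        """A hit that would have happened anyway but paid to decompress."""
        self.counter = max(-self.limit, self.counter - decompression_latency)

    def should_compress(self):
        return self.counter >= 0

policy = AdaptiveCompressionPolicy()
for _ in range(10):
    policy.record_penalised_hit(5)      # small working set: compression only hurts
hurting = policy.should_compress()      # False, counter is -50
policy.record_avoided_miss(400)         # one avoided memory access outweighs that
helping = policy.should_compress()      # True, counter is 350
```

Weighting each event by its latency reflects the trade-off mentioned above: a decompression penalty is small but frequent, while an avoided miss is rare but large.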
Illinois presented an alternative to standard "watchpoints" for debugging. A watchpoint is a memory location which, when accessed, causes the processor to generate an exception to allow the OS to perform whatever check/function is required. This is an expensive and limited mechanism. The proposed alternative was a microarchitectural version that caused control flow to be transferred to the user-defined handler function when one of a potentially large number of watchpoints was accessed. Since the common case is (hopefully) that there is no problem and the execution can continue, they use thread-level speculation (TLS) on an SMT processor to allow the original execution to continue speculatively until the simultaneously running watch function has completed and decided whether to allow execution to continue.
The parallel session on superscalars contained two papers, the first one building on reconfigurable functional unit work by automatically translating sequences of instructions into combinatorial functions. The second paper supplements a branch predictor with a "critic" cache which records how well each predictor entry is doing. When the critic notices that the predictor is frequently getting the prediction wrong, the critic adjusts the decision made by the predictor (they use the analogy of a back seat driver when the actual driver keeps making wrong turns).
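The critic idea can be illustrated with a deliberately naive base predictor. This sketch is not the design from the paper: the base predictor simply repeats each branch's last outcome, the critic is a small saturating count of recent base mispredictions per branch, and the class, threshold and helper function are all hypothetical.

```python
class CriticisedPredictor:
    """Last-outcome branch predictor with a critic that can overrule it."""

    def __init__(self, threshold=2, max_count=3):
        self.last_outcome = {}    # base predictor state, keyed by branch address
        self.wrong_count = {}     # critic state: recent base mispredictions
        self.threshold = threshold
        self.max_count = max_count

    def predict(self, pc):
        base = self.last_outcome.get(pc, True)
        if self.wrong_count.get(pc, 0) >= self.threshold:
            return not base       # the back seat driver takes over
        return base

    def update(self, pc, taken):
        base = self.last_outcome.get(pc, True)
        count = self.wrong_count.get(pc, 0)
        if base == taken:
            self.wrong_count[pc] = max(0, count - 1)
        else:
            self.wrong_count[pc] = min(self.max_count, count + 1)
        self.last_outcome[pc] = taken

def count_correct(predictor, pc, outcomes):
    """Run one branch through the predictor and count correct predictions."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct

# An alternating branch defeats a last-outcome predictor every single time;
# once the critic has seen two misses it inverts the prediction.
alternating = [False, True] * 5
correct = count_correct(CriticisedPredictor(), 0x40, alternating)
```

On this pattern the first two predictions are wrong while the critic warms up and the remaining eight are right.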
I attended the parallel session on register file design. The work presented focussed on reducing the size, power and latency of large, multiported register files. The first paper added to previous work on using a register cache. Their contribution was a new policy for deciding which registers to cache and a replacement policy to use. They use a "physical register use" predictor, keyed from the address of the instruction generating the value. This predicts how many times a register will be used and is used to support the policy of caching only what is going to be needed a lot in the future, and replacing what won't be needed much. Additionally they filter some writes to the cache when all the predicted uses are made by instructions currently in flight via the bypass network. The second paper noted that there is a lot of redundancy in the register values, particularly with commonality in a number of the most significant bits. They proposed a register file design to capitalise on this. The third paper presented a design to abuse the register rename tables to store short register values.
Meanwhile in the other session on reliability Intel discussed the problem of transient faults on transistors due to neutron and alpha particle strikes. They propose limiting the amount of time an instruction can sit in a vulnerable storage location, forcing it to be squashed and restarted when experiencing long delays. They discuss how not all faults will affect the final outcome of a program and suggest that fault detection tags possibly faulty instructions and data, with the error only signalled once it is known that the fault will have an effect on the execution. Work from Illinois and IBM describes hardware wear-out leading to hard errors as the hardware ages. They propose a reliability-aware processor that can adapt to gradual wear-out by reducing its performance as it starts to age. The third paper, from Purdue, examines the problem of inductive noise on power supply lines.
The final session was on performance methodologies. The theme was avoiding time-consuming simulations. The first talk presented an analytical model of a superscalar processor. They note that all work at the moment uses simulation which is time consuming and prone to errors. The use of an analytic model will allow more efficient exploration of the design space. The second paper was related to one of the WDDD papers and presented a statistical synthetic trace generation system for simulation. The idea is that a benchmark is statistically profiled and used to build a much shorter synthetic trace representative of the original benchmark.
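As a rough illustration of profile-then-synthesise, the sketch below profiles only which instruction class follows which and replays those statistics as a short random walk. The first-order model, the instruction classes and the function names are all assumptions made for this example; the summary above does not say which statistics the presented system actually collects.

```python
import random
from collections import Counter, defaultdict

def profile(trace):
    """Count how often each instruction class follows each other class."""
    transitions = defaultdict(Counter)
    for previous, following in zip(trace, trace[1:]):
        transitions[previous][following] += 1
    return transitions

def synthesise(transitions, start, length, seed=0):
    """Generate a shorter trace with the same class-to-class statistics."""
    rng = random.Random(seed)
    synthetic = [start]
    while len(synthetic) < length:
        followers = transitions.get(synthetic[-1])
        if not followers:
            break                       # class never seen with a successor
        classes = list(followers)
        weights = [followers[c] for c in classes]
        synthetic.append(rng.choices(classes, weights=weights)[0])
    return synthetic

# Hypothetical 1000-instruction "benchmark" reduced to a 100-instruction trace.
original = ["load", "alu", "alu", "store"] * 250
stats = profile(original)
synthetic = synthesise(stats, "load", 100)
```

The synthetic trace is ten times shorter yet only ever contains class sequences that occur in the original; capturing cache and branch behaviour as well would need extra per-instruction statistics.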
Last modified Fri Jan 28 10:42:46 GMT 2005