ISMM'04
Report on Memory Management Technical Program
at co-hosted OOPSLA and ISMM 2004,
October 23 - 28 October
Andy Cheadle
The 2004 International Symposium on Memory Management
October 24-25, 2004 Vancouver, British Columbia, Canada
Sponsored by ACM SIGPLAN
ISMM is a forum for research in management of dynamically
allocated memory. Areas of interest include but are not limited to:
explicit storage allocation and deallocation; garbage collection
algorithms and implementations; compiler analyses to aid memory
management; interactions with languages, operating systems, and
hardware, especially the memory system; and empirical studies of
allocation and referencing behaviour in programs that make
significant use of dynamic memory.
Slides and links to the ACM Digital Library copies of the papers can be
found at
http://www.research.ibm.com/ismm04/.
Saturday, October 23, 2004
Two tutorial sessions were arranged for the day before the technical
paper presentations:
- MMTk: the Java Memory Tool-kit
Steve Blackburn, Australian National University Perry Cheng, IBM T.J. Watson
Research Center
MMTK is a memory management toolkit implemented within the context of the
Jikes RVM. It provides the building blocks for implementing many of the commonly
used memory allocator and garbage collector algorithms in use today. The approach
taken is to provide 'mechanisms', 'policies' and 'plans', the composition
of which allows the creation of hybridized allocators and collectors. The
intent of the toolkit is twofold:
- To provide a framework in which GC research can be performed through
the rapid production of collectors via this compositional approach.
- The production of 'portable' high performance collectors through the
toolkit's virtual machine and memory management interfaces.
Perry presented an overview of the basic GC algorithms, an overview of MMTK
and then a more detailed look at its structure. Steve then presented a walk-through
demo of how one goes about writing a collector from scratch. A stunningly
brave effort to do a code, compile and debug cycle live, and within about
an hour and a half he'd pulled it off with a fully working collector.
The tutorial slides can be found at: http://www-124.ibm.com/developerworks/oss/jikesrvm/info/talks/ISMM04-MMTk.pdf
The second tutorial presented:
- The Boehm-Demers-Weiser Conservative Collector
Hans Boehm, HP Labs
Hans presented the basic structure and use of the Boehm-Demers-Weiser collector,
a conservative garbage collector that can be linked into languages or systems
that lack garbage collection. The garbage collector uses a modified mark-sweep
algorithm. Conceptually it operates roughly in four phases, which are performed
occasionally as part of a memory allocation:
- Preparation: Each object has an associated mark bit. Clear all mark
bits, indicating that all objects are potentially unreachable.
- Mark phase Marks all objects that can be reachable via chains of pointers
from variables. Often the collector has no real information about the
location of pointer variables in the heap, so it views all static data
areas, stacks and registers as potentially containing pointers.
- Sweep phase Scans the heap for inaccessible, and hence unmarked, objects,
and returns them to an appropriate free list for reuse.
- Finalization phase Unreachable objects which had been registered for
finalization are enqueued for finalization outside the collector.
The collector has additional support for incremental and generational (mostly
parallel) GC with several facilities to enhance the processor-scalability and
thread performance of the collector.
The tutorial slides can be found at: http://www.hpl.hp.com/personal/Hans_Boehm/gc/04tutorial.pdf
Sunday, October 24, 2004
The ISMM keynote presentation was given by Steven Woo (Rambus, Inc.) on
DRAM and Memory System Trends.
Steven started by summarising the 'Processor-Memory Performance Gap' and
the characteristics of a 'good' memory system, low latency, high bandwidth,
high capacity and a high bank count (to limit thread contention). He highlighted
the fact that DRAM latency is dramatically increasing from the CPU's point
of view with the total memory system latency even higher. An emerging trend
targeted at significantly reducing this latency is to integrate the memory
controller with the CPU. The remainder of the talk discussed the issue of
rising front side bus speeds fuelling the need for increased memory bandwidth.
This can be achieved by widening the memory bus (64 -> 128 bit) or faster
per-pin signalling. Steven focused on faster per-pin signalling and how the
physical limitations are challenging the abilities of system designers to
maintain good signal integrity and to deliver the bandwidth needed by these
systems. At higher signalling speeds, signal integrity degrades due to interference
caused by the reflection of earlier signals with subsequent signals as they
meet at the 'tee' of each memory module. As a result the number of modules
in the standard 'Multi-drop' or 'Stub-bus' arrangement that can be supported
decreases. In turn, memory capacity in terms of number of DRAMs is shrinking.
The need is for an increase in signal integrity and number of modules at the
same time. The current solution to this problem is to use On-Die terminators
that sit on top of each memory module not just at the end of the bus. Currently
under development is a more scalable solution known as 'Advanced Memory Buffers',
which daisy chain (without tees) to increase the number of modules while using
differential point-to-point signalling to increase signalling integrity and
enable higher speeds. Steven finished by touching on the challenges in shrinking
system form factors and achieving reduced power consumption for mobile and
handheld devices.
The following gives a brief description of each of the technical paper presentations,
the slides for which can be found at: http://www.research.ibm.com/ismm04/
Session I: Concurrency
- Message Analysis-Guided Allocation and Low-Pause Incremental Garbage
Collection in a Concurrent Language
Konstantinos Sagonas, Jesper Wilhelmsson
Wilhelmsson presented a memory management scheme for a concurrent programming
language where communication occurs using message-passing with copying semantics.
The runtime system is built around process-local garbage collected heaps and
a shared memory area where message data is allocated. He presented a low-pause
incremental generational collector for this shared memory area that 'imposes
no overhead on the mutator, requires no costly barrier mechanisms, and has
a relatively small space overhead.' The collector is developed to 'respect
the (soft) real-time requirements of the language'. Measurements across a
range of applications were presented to indicate that the incremental collector
substantially reduces pause times, imposes only a very small overhead on the
total runtime, and achieves a high degree of mutator utilization. It is worth
noting however, that the scheme targets this shared memory area only, not
the process local heaps.
- Write Barrier Elision for Concurrent Garbage Collectors
--
Martin Vechev, David F. Bacon
Vechev presented a study on concurrent garbage collector write barrier elision
-- the conditions under which write barriers are redundant and can be omitted,
and described how these conditions can be applied to both incremental update
or snapshot-at-the-beginning barriers. Martin presented the results of a trace-based
limit study investigating the potential for write barrier elimination, finding
that on average, 54% of incremental barriers and 83% of snapshot barriers
are unnecessary.
- Mostly Concurrent Compaction for Mark-Sweep GC --
Yoav Ossia, Ori Ben-Yitzhak, Marc Segal
Ossia presented a method for reducing the pause time of a compaction phase
applied in the context of a mark-sweep garbage collector. Compaction is used
to reduce fragmentation. However the pauses that occur during the second phase,
after object relocation, where object references are updated to reflect the
new locations, can be lengthy. The technique that is employed allows these
updates to be performed concurrently, 'eliminating a substantial portion of
the pause time hit'. The compaction is implemented on top of the IBM J9 JVM
V2.2 using the existing technique of incremental compaction to reduce the
time taken to move objects, but further enhanced by selection of the optimal
area to compact. Selection is performed after executing the mark and sweep
phases, and is based on their results. Measurements of the effect of compaction
on pause time, throughput, and mutator utilization were presented, demonstrating
that: 'compaction is indeed an efficient fragmentation reduction tool, and
that it improves the performance of a few of the benchmarks used, with very
little increase in the pause time (typically far below the cost of the mark
phase)'.
Session II: New Garbage Collection Algorithms and Strategies
- Garbage-First Garbage Collection --
David Detlefs, Christine Flood, Steven Heller, Tony Printezis
Detlefs presented the concept of the 'Garbage-First' garbage collector --
a server-style garbage collector, targeted for multi-processors with large
memories, that meets a soft real-time goal with high probability, while achieving
high throughput. Whole-heap operations, such as global marking, are performed
concurrently with mutation, to prevent interruptions proportional to heap
or live-data size. Concurrent marking both provides collection "completeness"
and identifies regions ripe for reclamation via compacting evacuation. This
evacuation is performed in parallel on multiprocessors, to increase throughput.
The heap is split into regions, with a remembered set for each region. The
garbage collector operates on those regions found to have the least amount
of live data and correspondingly large amounts of garbage during the marking
phase, i.e. doing the most garbage regions first! Dave and the team have clearly
worked very hard to build a collection algorithm that dots all the 'i's and
crosses all the 't's, and for me, this was the most interesting and inspiring
piece of work at the conference. However by Dave's own admission the collector
is a complex beast that leaves one wondering if widespread adoption is possible.
- Dynamic Selection of Application-Specific Garbage Collectors--
Sunil Soman, Chandra Krintz, David F. Bacon
Soman presented the design, implementation, and empirical evaluation of a
'novel Java Virtual Machine (JVM) extension that facilitates dynamic switching
between a number of very different and popular garbage collectors.' It is
clear that no single garbage collection algorithm yields the best performance
for all programs and all heap sizes and this work attempts to address the
limitation of employing a single (or at least a limited hybrid) collector
to all situations. The garbage collector selection is 'annotation-guided'
utilising information gathered from offline analysis of the best collector
at differing heap sizes across various benchmarks. The annotation is a 'switch-ratio'
specified as the ratio of switch point heap size (size at which to switch
to another collection strategy) and the minimum heap size. At program load
time, the JVM computes the ratio of current maximum heap size to minimum heap
size and compares this value with the annotated ratio. Sunil goes on to discuss
the evaluation of a simple heuristic to investigate the efficacy of switching
automatically. The results show that, on average, the annotation-guided system
introduces less than 4% overhead and improves performance by 24% over the
worst-performing GC (across heap sizes) and by 7% over always using the popular
Generational/Mark-Sweep hybrid.
- Automatic Heap Sizing: Taking Real Memory Into Account --
Ting Yang, Emery D. Berger, Matthew Hertz, Scott F. Kaplan, and J. Eliot B.
Moss
Yang presented an automatic heap-sizing algorithm applicable to different
garbage collectors with only modest changes. It relies on an analytical model
and on detailed information from the virtual memory manager. The model characterizes
the relation between collection algorithm, heap size, and footprint. The virtual
memory manager tracks recent reference behaviour, reporting the current footprint
and allocation to the collector. The collector uses those values as inputs
to its model to compute a heap size that maximizes throughput while minimizing
paging. The results show that our adaptive heap sizing algorithm can substantially
reduce running time over fixed-sized heaps.
Wild and Crazy Ideas
The 'Wild and Crazy Ideas' forum provided an arena for informal presentation
of researcher's 'off the cuff' ideas. The following were presented:
- Minimally Conservative Garbage Collection --
Hans Boehm
Hans presented an idea for a compromise between the instruction overheads
of type accurate garbage collectors (per pointer instruction and safe points)
and the space overheads of conservative collectors. Precise pointer location
information is maintained for heap objects, static data and only for registers
and stack at call sites and for locations that always contain non-pointers
or dead data within an innermost loop. The hope is to guarantee that the probability
of unbounded long term retention is zero. Pointer misidentification occurs
at locations rapidly changing between pointer and non-pointer values. With
random scheduling, the probability of continuously observing the same false
pointer should be zero.
- Ben's Predictions for 2014 --
Ben Zorn
Ben pointed to his presentation at the IBM Invitational Workshop on the Future
of Virtual Execution Environments (VEE"04) - the slides of which can
be found at: http://research.microsoft.com/~zorn/talks/FutureOfVEEWorkshop2004.pdf
Related to the above were his predictions:
- 'Open Source' Biology
Predictive genomics
- Post-biological era
- Large increase in economic investment in safety-critical software
- New generation of OS / runtime required for biological, nano and safety-critical
applications
- Identifying Redundant Pointers --
Dave Detlefs
Dave had us imagine a world where a static analysis exists that is capable
of identifying reachability-redundant pointer fields. They are redundant because
they are reconstructible. The question is how to infer the invariants... Dave
has ideas but is not completely sure... However, he presented the GC benefits
of being able to infer the invariants.
- Users Always Get it Wrong
Richard Jones
Richard reminded us why we advocate Garbage Collection over manual memory
management. He the decided to be provocative by questioning whether users
should have so little control over the collector based on the assumption that
they 'always get it wrong' while GC implementers know better. He then argued
that users do know something and should be able to at least give hints to
the runtime system if they know of expected behaviour. He finished by suggesting
that users need 'better', more powerful tools to guide optimisation.
Monday, October 24, 2004
Session III: Regions, Compiler Support
- Experience with Safe Manual Memory-Management in Cyclone
--
Michael Hicks, Greg Morrisett, Dan Grossman, Trevor Jim Hicks reported on
experiences of trying to integrate and effectively use two previously proposed,
type-safe memory management mechanisms: statically-scoped regions and unique
pointers. The team found that these typing mechanisms could be combined to
build alternative memory-management abstractions, such as reference counted
objects and arenas with dynamic lifetimes, and thus provide a flexible basis.
He further reported that their experience -- porting C programs and building
new applications for resource-constrained systems -- confirms that experts
can use these features to improve memory footprint and sometimes to improve
throughput when used instead of, or in combination with, conservative garbage
collection.
- Region Analysis and Transformation for Java Programs --
Sigmund Cherem, Radu Rugina
Cherem presented a region analysis and transformation framework for Java programs.
Given an input Java program, the compiler automatically translates it into
an equivalent output program with region-based memory management. The generated
program contains statements for creating regions, allocating objects in regions,
removing regions, and passing regions as parameters. As a particular case,
the analysis can enable the allocation of objects on the stack. The algorithm
uses a flow-insensitive and context-sensitive points-to analysis to partition
the memory of the program into regions and to identify points-to relations
between regions. It then performs a flow-sensitive, inter-procedural region
liveness analysis to identify object lifetimes. Finally, it uses the computed
region information to produce the region annotations in the output program.
Their results indicate that, for several of their chosen benchmarks, the transformation
can allocate most of the data on stack or in short-lived regions, and can
yield substantial memory savings.
- Experiments on the Effectiveness of an Automatic Insertion of Memory
Reuses into ML-like Programs --
Oukseh Lee, Kwangkeun Yi
Lee presented experimental results for their previous static analysis and
source-level transformation techniques that add explicit memory-reuse commands
into ML program text. The team find that the analysis and transformation cost
is negligible (1,582 to 29,000 lines per seconds) enough to be used in daily
programming. The payoff is the reduction of memory peaks and the total garbage
collection time. The transformed programs reuse 3.4% to 93.9% of total allocated
memory cells, and the memory peak is reduced by 0.0% to 71.9%. When the memory
peak reduction is large enough to overcome the costs of dynamic flags and
the memory reuse in the generational garbage collection, it speeds up program's
execution by up to 39.1%. Otherwise, the transformation can slowdown programs
by up to 7.3%. The speedup is likely only when the portion of garbage collection
time among the total execution time is more than about 30%.
Session IV: Diverse Topics
- General Adaptive Replacement Policies --
Yannis Smaragdakis
Smaragdakis presented a general scheme for creating adaptive replacement policies
with 'good performance and strong theoretical guarantees'. He shows how to
combine any two existing replacement policies so that the resulting policy
provably can never perform worse than either of the original policies by more
than a small factor. To show that the scheme performs well for real applications,
he derives a virtual memory replacement policy that adapts between LRU, loop
detection, LFU, and MRU-like replacement. The resulting policy is found to
often perform better than all of the policies it adapts over, as well as two
other hand-tuned adaptive policies from the recent literature.
- Memory Accounting Without Partitions --
Adam Wick and Matthew Flatt
Wick presented a technique to provide applications with per-task memory accounting
without the per-task object partitions created by conventional accounting
mechanisms, that 'needlessly needlessly complicate communication among tasks'.
- Field Level Analysis for Heap Space Optimization in Embedded Java
Environments --
Guangyu Chen, Mahmut Kandemir, Narayanan Vijaykrishnan, and Mary Janie Irwin
Chen presented a study of the potential benefits and challenges when heap
memory is managed at a field granularity instead of object for two field-level
analysis techniques. The first of these, field-level lifetime analysis, takes
advantage of the observation that, for a given object instance, not all the
fields have the same lifetime. The second, disjointness analysis, is built
upon the fact that, for a given object, some fields have disjoint lifetimes,
and therefore, they can potentially share the same memory space. Guangyu finished
the assessment of these analyses by presenting experimental results.
Session V: Implementation Techniques
- Barriers: Friend or Foe? --
Stephen Blackburn and Antony Hosking
Blackburn reminded us that: It has been a long-standing untested assumption
that barriers impose significant overhead to garbage-collected applications.
As a result, researchers have devoted effort to development of optimization
approaches for elimination of unnecessary barriers, or proposed new algorithms
for garbage collection that avoid the need for barriers while retaining the
capability for independent collection of heap partitions. On the basis of
the results presented, Blackburn and Hosking attempt to dispel the assumption
that barrier overhead should be a primary motivator for such efforts.
Steve presented a methodology for 'precise' measurement of mutator overheads
for barriers associated with mutator heap accesses and measurements for the
cost of a range of popular barriers used for different garbage collectors
within Jikes RVM. Their results demonstrate that barriers impose 'surprisingly'
low cost on the mutator, though results vary by architecture.
I guess the 'view' of this work surprised me somewhat. It's definitely a valuable
piece of work and very useful to have figures for the mutator overhead of
barriers for Jikes RVM, just as I have presented for GHC and Haskell. However,
the work quantifies overhead solely from the perspective of the mutator. I
don't believe it's unsurprising that barriers can impose low overhead on mutators,
especially in the light of the optimisations that Bacon, Cheng and Rajan perform
within their Java Metronome. What has to be kept in mind when discussing barrier
overhead is also factoring in the associated (space) overheads that accompany
the garbage collector algorithm - the reason for employing the barrier in
the first place!
- Dynamic Object Sampling for Pretenuring --
Maria Jump, Stephen M Blackburn, Kathryn S McKinley
Jump presented a 'low-cost' object sampling technique that is employed within
a generational garbage collector to dynamically determine object lifetimes.
The sampler marks an object and records its allocation site every n bytes
of allocation. The collector then computes per-site nursery survival rates.
Sampling degrades total performance by only 3% on average for sample rates
of 256 bytes in Jikes RVM, a rate at which overall lifetime accuracy compares
well with sampling every object. The team introduce a dynamic pretenuring
mechanism that detects long lived allocation sites and pretenures them, given
sufficient samples. To react to phase changes, it occasionally backsamples.
The paper concludes with presentation of experimental results, claiming the
technique to be an 'efficient sampling mechanism that accurately predicts
lifetimes, but leaves open optimization policies that can exploit this information'.
- Exploring the Barrier to Entry - Incremental Generational Garbage
Collection for Haskell --
Andrew Cheadle, Tony Field, Simon Marlow, Simon Peyton Jones, Lyndon While
I presented our work on the design and implementation of a 'production' incremental
generational garbage collector for GHC 6.2. It builds on our earlier work
(Non-stop Haskell) that exploited GHC's dynamic dispatch mechanism to hijack
object code pointers so that objects in to-space automatically scavenge themselves
when the mutator attempts to 'enter' them. The paper details various optimisations
based on code specialisation that remove the dynamic space, and associated
time, overheads that accompanied our earlier scheme. I discussed important
implementation issues and provided a detailed evaluation of a range of design
alternatives in comparison with Non-stop Haskell and GHC's current generational
collector. I also demonstrated how the same code specialisation techniques
can be used to eliminate the write barrier in a generational collector.
19th Annual ACM Conference on Object-Oriented Programming, Systems, Languages,
and Applications October 24-28, 2004 Vancouver, British Columbia, Canada
Sponsored by ACM SIGPLAN
The OOPSLA technical paper sessions showcase research contributions and empirical
results in object-oriented languages, systems, and applications. Twenty seven
technical papers were presented in nine sessions and covered a range of topics
including: novel software design techniques, advanced programming language
features, analysis tools, run-time systems, and new approaches to validating
and optimizing programs.
The following four papers, relating to memory management, were presented:
- A Unified Theory of Garbage Collection
David F. Bacon, Perry Cheng, V.T. Rajan
Bacon presented a formulation of tracing and reference counting algorithms
that shows that they are in fact duals of each other -- having been 'uniformly
viewed as being different approaches to garbage collection'. Intuitively,
the difference is that tracing operates on live objects, or 'matter', while
reference counting operates on dead objects, or 'anti-matter'. For every operation
performed by the tracing collector, there is a precisely corresponding anti-operation
performed by the reference counting collector.
David went on to show that using this framework, all high-performance collectors
(for example, deferred reference counting and generational collection) are
in fact hybrids of tracing and reference counting. The team have developed
a uniform cost-model for the collectors to quantify the tradeoffs that result
from choosing different hybridizations of tracing and reference counting.
The aim is to allow the correct scheme to be selected based on system performance
requirements and the expected properties of the target application.
- The Garbage Collection Advantage: Improving Program Locality
--
Xianglong Huang, Stephen Blackburn, Kathryn McKinley, J. Eliot B. Moss, Zhenlin
Wang, Perry Cheng
Huang introduced a dynamic, online class analysis for finding and exploiting
locality in a copying collector. The analysis exploits method sampling in
a JIT optimizing compiler. For each 'hot' method, object reordering analysis
marks the class fields that the method accesses as hot. Then at garbage collection
time, the collector copies referents of hot fields together with their parent.
Xianglong goes on to describe enhancements to the basic technique that introduce
heuristics for responding to phase change and static analysis to exclude 'cold'
basic blocks from the reordering analysis. Finally, experimental results were
presented with the key result demonstrating that dynamic class reordering
always matches or improves over the best static ordering since its history-based
copying order tunes memory layout to program traversal.
- MC²: High-Performance Garbage Collection for Memory-Constrained
Environments --
Narendran Sachindran, Eliot Moss, Emery Berger
Sachindran presented MC² (Memory-Constrained Copying) garbage collection
-- a copying, generational garbage collector that 'meets the demands of memory-constrained
devices with soft real-time requirements'. MC² extends generational copying
collection. It divides the heap into two regions, a nursery, and an old generation.
The nursery is identical to a generational copying collectors nursery.
MC² further divides the old generation into a number of equal size subregions
called windows. The collector operates in two phases, a mark phase and a copy
phase. The copy phase is broken down into a number of passes. Each pass copies
data out of a subset of the windows in the old generation. Since the collector
knows the exact amount of live data in each window (as a result of marking),
and the total amount of free space in the heap, it can copy data out of multiple
windows in each pass. After each copy pass, MC² unmaps the pages occupied
by the windows evacuated in that pass, thus limiting the total virtual memory
mapped at any time during the collection.
- An Efficient Parallel Heap Compaction Algorithm --
Diab Abuaiadh, Yoav Ossia, Erez Petrank, Uri Silbershtein
Ossia a heap compaction algorithm appropriate for modern computing environments.
Our algorithm is targeted at SMP platforms. It demonstrates high scalability
when running in parallel but is also extremely efficient when running single-threaded
on a uniprocessor.
Ossia (?) presented a new heap compaction algorithm targeting SMP platforms.
Instead of using the standard forwarding pointer mechanism for updating pointers
to moved objects, the algorithm saves information for a pack of objects (restricting
the memory accesses to a relatively small block table, rather than reading
random locations in the heap). It then does a small computation to process
this information and determine each object's new location. By employing a
'smart' parallel moving strategy, the algorithm achieves '(almost) perfect
compaction in the lower addresses of the heap, whereas previous algorithms
achieved parallelism by compacting within several predetermined segments'.
The team investigate a method that trades compaction quality for a further
reduction in time and space overhead and finally, propose a modern version
of the two-finger compaction algorithm. This algorithm fails, thus, 're-validating
traditional wisdom asserting that retaining the order of live objects significantly
improves the quality of the compaction'.
Problems with this page?
Contact the mm-net webmaster
Last modified Mon Feb 21 10:25:39 GMT 2005