GCs of JVM tuning: PS+PO VS CMS VS G1

When it comes to GCs of JVM tuning, the following important concepts have to be elaborated:

STW(Stop the world)

The GC pause in which all business threads stop to reach to the safe point.

Take dining in a restaurant as an example, the waiters need to wait till customers finish certain meals before collecting those garbage, otherwise, it could mess up things. It’s same here for GC. At certain phrases of garbage collection, we will need the safe point, like customers stop dining, to collect garbage.

Response time

The shorter the STW, the better the response time. Websites and APIs normally need to optimise the response time. CMS and G1 are the GCs for response time first application.

Throughput

throughput = business logic execution time / (business logic execution time + GC time)

Data mining, compute-intensive applications and schedule tasks modules are typical examples of throughput first application. Parallel Scavenge and Parallel Old are for throughput focus application. It has approximately 10-15% higher throughput compared to G1 in certain scenarios.

Before getting in deep of these three typical GCs, let’s start with some basic important concepts.

Reference count and Root-finding algorithms

Reference count(RC) use the number of reference that objects have to determine whether to collect them. The problem is circular reference.

From the above diagram, ObjA, ObjB and ObjC are all have 1 reference count. However, they are not live objects and supposed to be collected. This is the circular reference problem of RC. The GC won’t collect them if it is based upon the condition of only collect if reference count equals to 0.

Root-finding

This algorithm starts from the roots of the application program.

  • GC Roots include
  • Stack Local variables, run-time methods variables
  • static variables and constants
  • class objects referenced by run-time constant pool
  • objects referred by JNI
  • Objects used as synchronized monitor.

From these root, the algorithm will find out all reachable objects, and these objects are uncollectible objects.

Three common GC algorithms

Mark-Sweep

The mark-sweep algorithm marks and collects those collectable blocks.

Pros:

  1. It is effective when there are many live objects and small amount of collectable objects;
  2. It is suitable for tenured region.
  3. CMS uses this algorithm.

Cons:

  1. It is relatively low efficiency as it will need to scan twice for the entire region.
    1. First scan to find out those live objects;
    2. Second scan to find those collectable objects for collection.
  2. It will generate memory fragments.
  3. It is not suitable for Eden region as there are small amount of live objects.

Copy

The copy algorithm divides the target region memory into two part. It copies all live objects to the second part and clean up the first part.

Pros:

  1. It does not generate memory fragments.
  2. It only need to scan once, relatively high efficiency for small amount of live objects.
  3. It is suitable for Eden region.

Cons:

  1. It wastes half of the region memory.
  2. It will need to adjust object references because it moves objects around.

Mark-Compact

The mark-compact algorithm marks and moves all live objects to the beginning of the region.

Pros:

  1. It won’t generate memory fragments.
  2. It is good for objects’ allocation.
  3. It won’t cause the waste of memory.

Cons:

  1. It is relatively low performance, because it needs to scan twice and compacts objects.
  2. Synchronisation is required when multi threads invoked.

Garbage Collectors

Parallel Scavenge and Parallel Old

Parallel Scavenge(PS) is a stop-the-world, copying collector using multiple GC threads.

Parallel Old(PO) is a stop-the-world, compacting collector using multiple GC threads.

They work together to deal with young and old generation respectively.

This combination is good for throughput-prioritised applications.

Common Parameters of PS + PO:

-XX:SurvivorRatio
-XX:PreTenureSizeThreshold
-XX:MaxTenuringThreshold (in JDK 8, the maximum value is 15 because only 4 bits to store this value in class file)
-XX:TargetSurvivorRatio (this is the parameter that actually determines whether objects should promoted to the tenured region or not.)
-XX:+ParallelGCThreads
-XX:+UseAdaptiveSizePolicy
In-depth understanding of XX:TargetSurvivorRatio
uint ageTable::compute_tenuring_threshold(size_t survivor_capacity) {
  size_t desired_survivor_size = (size_t)((((double) survivor_capacity)*TargetSurvivorRatio)/100);
  size_t total = 0;
  uint age = 1;
  assert(sizes[0] == 0, "no objects with age zero should be recorded");
  while (age < table_size) {
    total += sizes[age];
    // check if including objects of age 'age' made us pass the desired
    // size, if so 'age' is the new threshold
    if (total > desired_survivor_size) break;
    age++;
  }
  uint result = age < MaxTenuringThreshold ? age : MaxTenuringThreshold;

From the above C++ code snippet, we can see that the TargetSurvivorRatio flag is the actually parameter that determines whether objects should promoted to the tenured region or not. Basically, it sums up total size between 0 to specific age, that is just understand the ratio threshold and other objects with age above the threshold will be promoted.

ParNew and Concurrent Mark Sweep(CMS)

ParNew is kinda like the new version of PS with extra feature and enhancement. Its enhancement allows it to work with CMS. For example, it satisfies the synchronisation need during the concurrent phases of CMS.

CMS is a revolutionary GC designed to target small amount memory (from hundreds of MB to several GB) with low pause. It realises the concurrent operation of GC and business threads.

ParNew + CMS + Serial Old are worked as a combination.

Card Table

When YGC occurs, scanning the all objects in the tenured region is unnecessary overhead. Therefore, CMS has a the card table mechanism to save the unnecessary overhead.

It a divide and conquer approach. The card table uses cards to mark chunks of the old region. Each card contains multiple objects. When the application starts, the card table works with write barrier to monitor those cross-region reference. The write barrier is not memory fence. It works like AOP in spring. When assignment happens to a reference, it will change the card table, e.g. CARD_TABLE[this address >> 9] = 0. Those cards with cross-region references are marked as dirty so that during YGC, the GC only need to scan those dirty cards rather than all objects in the old region.

Tri-Color Marking Algorithm

Black, grey and white colours are used for marking objects.

Objects marked with black colour means itself and all objects it refer to have been scanned already.

Objects marked with grey colour means only itself have been scanned but not those objects it refer to.

Objects marked with white colour means either they have been scanned yet or there are no references for them. Therefore, at the end of the scan and mark process, GC can collect these white objects.

CMS has the incremental update mechanism. During concurrent mark phrase, if there are changes on the reference associated with black objects, these black objects will be marked as grey and rescan again.

6 phrases of CMS
1. CMS Initial Mark (STW phrase)

The GC only marks root objects and old region objects referred by young generation objects.

2. CMS Concurrent Mark

GC threads work concurrently with business logic threads. Clients would feel a bit slower on request processing but there is no STW here.

Around 80% of the CMS FGC time happens here.

3. CMS Concurrent Preclean

During the second phrase(CMS Concurrent Mark), the business threads are also working. There could be promotions from young generation, new allocations or some modifications that happened on some objects references, and the cards associated with these objects will be marked as dirty.

The concurrent preclean phrase will rescan these objects concurrently, update obsoleted reference information and clean up cards’ dirty state.

This reduces the workload of the CMS Final Remark phrase so that the STW pause can be shorter.

CMS Concurrent Abortable Preclean

The abortable preclean not only cleans up cards, but also manages the start point of next final remark phrase. The ideal start point of the CMS Final Remark phrase is the 50% occupancy of young generation (the middle point between the last YGC’s end time and the next YGC’s start time).

4. CMS Final Remark (STW phrase)

During previous phrases, there could be newly generated garbage objects or some no reference object regained references again. Therefore in this phrase, there will be a STW to reach to the safe point and remark.

5. CMS Concurrent Sweep

The GC cleans up those garbage objects in this phrase. As it is a concurrent phrase, the business threads would generate some new garbage during the sweep process. These newly generated garbage is called floating garbage.

As it uses sweep algorithm, memory fragments would occur.

6. CMS Concurrent Reset

Reset the CMS internal data structure and prepare for the next CMS.

Common Parameters of CMS:

// UseConcurrentMarkSweepGC = ParNew + CMS + Serial Old
-XX:+UseConcMarkSweepGC
-XX:ParallelCMSThreads

-XX:CMSInitiatingOccupancyFraction
// UseCMSInitiatingOccupancyOnly works with CMSInitiatingOccupancyFraction
-XX:+UseCMSInitiatingOccupancyOnly

// CMSIncrementalMode would disable the effect of CMSInitiatingOccupancyFraction 
// stopping the concurrent phase to yield back the processor to the application
// it would cause more frequent CMS FGC
-XX:+CMSIncrementalMode

// default to be 0
-XX:CMSFullGCsBeforeCompaction
// Enable by default. If we don't set the value for CMSFullGCsBeforeCompaction, then every CMS FGC will cause compact, and this affects the performance of the system.
-XX:+UseCMSCompactAtFullCollection

-XX:+CMSClassUnloadingEnabled
-XX:GCTimeRatio

// suggested pause time, the GC will try its best to achieve this goal(e.g. reduce the young generation's size).
-XX:MaxGCPauseMillis
G1 Garbage-First Garbage Collector

G1 is the new super star in the JVM GC world in JDK 9.

In JDK 9, CMS is deprecated and it is removed in JDK 14.

The G1 GC emerged on JDK 7, evolved mature on JDK 8 and became the default GC on JDK 9.

It is especially good for response-time-prioritized application with relatively large memory (8G – 100+G).

The problem of large memory

When memory in scale, the CMS is no longer an idea choice due to its memory model design. Scanning the all objects in the old region can be time-consuming for CMS if you have 16-100+G memory.

G1’s solution

It is a divide and conquer approach.

The G1 GC logically divides the memory into generations but not physically. Once a region is recycle, it can be used by other generation. For example, a chunk of memory that is allocated for Eden can be allocated as Old after recycle.

Garbage first: it prioritises the region with the most garbage that is needed for GC so that it can spend relatively small amount of time to collect them. It uses concurrent and parallel phrases to achieve its target pause time and maintain decent throughput.

It’s also a compacting GC, compacting live objects from one region to another.

CSet(Collection Set)

CSet records a set of collectable regions. It is the collection of the regions with the most recyclable garbage (G1 prioritises cards with the most collectable garbage, so these regions are put into one set to facilitate the GC)

RSet(Remembered Set)

RSet records references from other regions to the current region.

With RSet’s help, the GC don’t have to scan the entire heap to know which objects refer to objects in the current region.

SATB (Snapshot-At-The-Beginning)

G1 pushes the disappeared references to the local stack, so as to ensure that the white objects are still be reachable by the GC
1. G1 uses SATB because there is RSET, which allows the GC to check if there are other objects referring to specific objects, so that the newly created reference black objects to white objects are still reachable by the GC.
2. Unlike CMS having to rescan all grey objects if there are references’ changes during concurrent phrase, G1 only need to scan those reference pushed into the local stack, resulting in performance improvement.

Please notice: RSet impacts reference assignment a bit because of the write barrier.

YGC

When there isn’t sufficient space to use in the Eden region, YGC occurs.

Mixed GC

Phrases of the mixed GC stage in G1 are kinda like the whole set of the CMS system. Both old region and young region will be collected. It follows the garbage first rule we mentioned before. We set the -XX:InitiatingHeapOccupancyPercent to trigger mixed GC to avoid failed evacuation caused FGC.

FGC

We must avoid to trigger the FGC. Before JDK 10, FGC in G1 is the single thread serial old. You can imagine how an old man clean up the entire CBD of a city. It’s time-consuming. Since JDK 10, the FGC for G1 is parallel, but we still need to avoid it.

How to avoid? set -XX:InitiatingHeapOccupancyPercent to trigger mixed GC eariler.

Common Parameters of G1:

-XX:+UseG1GC
// Do not set -Xmn for G1, because G1 will dynamically adjust the size of the Y region
// suggested pause, G1 will adjust the Young region size to achieve this value, default 200ms
-XX:MaxGCPauseMillis
-XX:GCPauseIntervalMillis

// region size, size of 2^x(x >= 0), e.g, 1, 2, 4, 8, 16, 32
// the bigger the size, the longer for garbage to live, the longer the GC time
-XX:G1HeapRegionSize

// the minimum percentage of heap size for the young generation
-XX:G1NewSizePercent
-XX:G1MaxNewSizePercent
-XX:GCTimeRatio
-XX:ConcGCThreads

-XX:InitiatingHeapOccupancyPercent

// Remove duplicate strings
-XX:+UseStringDeduplication