Monday 14 January 2013

A Garbage Collection Analysis of PCGen, the popular Open Source Character Generator


Introduction

I decided to combine two software loves of mine and perform some analysis on PCGen, a popular Java based open source character generator for role-playing games. I used Censum, our (jClarity's) new Garbage Collection log analysis tool to perform the analysis.

This write-up assumes you have a passing familiarity with Garbage Collection (GC) on the JVM. If you're not familiar with GC then I recommend you join our Friends of jClarity programme. We're building up a knowledge base around GC to share freely with the whole Java community, we'd love for you to come and validate this!

The projects

The two projects I'm using are are PCGen (the project I'm doing the analysis on) and Censum (the GC analysis tool).

PCGen


PCGen is a popular character generator and maintenance program for d20 role playing systems such as Star Wars and Dungeons & Dragons. It's a long running project (> 10 years) and consists of a large (~750,000 LOC) Java Swing desktop tool that has a ton of proprietary format data files. Disclaimer: I'm the Chairperson for PCGen.


PCGen is a data intensive tool. In order to drive its rules engine and to meet the requirement of a responsive UI (with lots of detail) much of this data is loaded up front and held in memory. Users have reported the following issues in the past.
  1. Pauses often occur when dealing with multiple characters and/or high level characters.
  2. When creating a character of a high level or if more than 2 characters were created, PCGen simply died. More tech savvy users report that they saw an OutOfMemoryError in the pcgen log file.
Some work has been done to mitigate this poor performance in the latest version of PCGen (6.0) and so I decided to use Censum to determine if those changes had improved matters.

Censum


Censum is jClarity's new Garbage Collection analysis tool. It's focus is to use powerful analytics to crunch through the raw log data and give busy developers (like contributors to PCGen!) plain english answers quickly. Disclaimer: I'm the CTO of jClarity.



Censum is a new product that is free for Open Source projects and of course if you wish to purchase a copy you can get a free trial license (click on Start Your Free Trial) today!

TLDR - The Conclusion

We have good news, some information and bad news.

The Good News


The positive news is that the default heap settings that PCGen starts with (-Xms256m -Xmx512m) are now adequate in terms of being sized well enough to keep PCGen running. Even after creating a 5th complex character, there was no OutOfMemoryError.

Censum shows that once a full GC is run (typically after each new character has been created), a large percentage of the heap is recovered and each the character takes up about 25-50MB of heap space. We can very roughly extrapolate that with a starting (data loaded) point of ~125MB that PCGen can comfortably hold about 10-15 characters open at any one time without crashing. This is perhaps not enough for a GM to have his Goblin horde up and running, but certainly enough for most regular parties!

The Bad News


The slightly more negative news is Censum reporting that PCGen has relatively high pause times, likely triggered by too much premature promotion.

Too much premature promotion pushes memory into the old gen space more quickly that we'd like. This can have the knock-on effect of causing more old gen collections as well as full GCs, naturally leading to more pause times.

See the full analysis section for extra details on high pause times, premature promotion and what PCGen can do about it.

Where to from here?


PCGen could follow Censum's "stop-gap" recommendation to alter the size of the young generational space. By using the -XX:NewSize flag and setting that to ~256M, the high pause times problem is alleviated.

However, the longer term solution is for PCGen to continue reducing the impact of their data structures (in particular the representation of a player character). This is in fact an ongoing project for PCGen today!

The technical setup

PCGen is typically run from a shell script with the default heap settings of -Xms256m and -Xmx512m. The script was altered to provide the minimum set of flags that are required to produce GC log that can be analysed. The flags that were added to the java command were:

-verbose:gc -Xloggc:pcgen_gc.log -XX:+PrintGCDetails -XX:+PrintTenuringDistribution
  • -verbose:gc and -Xloggc:pcgen_gc.log produce a basic log that outputs to a file called pcgen_gc.log.
  • -XX:+PrintGCDetails provides the absolute minimum set of GC allocation information that Censum needs to perform an analysis.
  • Finally -XX:+PrintTenuringDistribution gives useful information on when objects are moving from a young generational space (eden, survivor 1 and survivor 2) to an old generation space (tenured).
All of these options have little to no impact on a running JVM. You should always have these switched on in Production!

PCGen was run with Oracle's Java 7u10 on a MBP running Mac OS 10.7.5, with 8GB of RAM, a 256MB SSD drive and a hyperthreaded Dual Core 2.8Ghz i7 processor.

The PCGen activities

PCGen begins by loading up basic game mode and system files + the basic UI to load data sources. The next step is for the user to select which data sources to load (roleplaying game rule sets). The popular SRD 3.5 with Monsters set was loaded (Dungeons and Dragons 3.5e).

A Character (Karianna) was created level by level into a 20th Wizard with fully loaded spells, equipment, and a Cat familiar (effectively a 2nd character in PCGen). Several more complex characters were added after that, including a Great Wyrm Blue Dragon (loads of data!).

Analysis

I'll cover the initial data loading use case and then general on going usage (character creation).

Data Loading


I was curious about the memory impact of the initial data load. Although I get fast loading times with my SSD drive having PCGen load its data without memory problems is certainly a project goal! Here is what Censum showed in terms of heap allocation after GC's.


As you can see, the initial load of data caused a number of young generational collections and one old (tenured) GC event at the end of the data load. The heap usage climbed to a max of about 325MB, but after garbage was collected, the heap usage fell back to about 100MB. Not too bad for loading about 15 thick rule books worth of data!

However, data loading for PCGen is a little like the start-up period for an web/application server such as Tomcat. In terms of your GC analysis it's generally best to be excluded as part of a one-off start-up as opposed to analysing your running application.

Creating characters


Creating Karianna and advancing her to 20th level involves filling in details of 13 main tabs ~20 sub tabs and a good deal of data! Another 4 characters were created of similar complexity, some friendly (a Cat familiar) and some not (a Great Wyrm Blue Dragon).

A few screenshots of the process follow:

Skills



Equipment



The Embedded Character Sheet


Censum's Analysis

On opening the log file, Censum took me immediately to its Analytics Summary screen, which lets me know at a glance as to how PCGen's garbage collection is going.


The good news


Immediately I know that:
  1. I have the right flags to collect enough GC data for analytics to work properly
  2. My GC throughput is good (PCGen spends it's time running, not garbage collecting)
  3. There are no nasty System.gc() calls (generally not good practice).

The informational news


Memory Utilisation (which a memory leak is a subset of) and Memory Pool Sizes are informational as the log has not yet gathered 24 hours of data (our recommended minimum to see a full working day's cycle for an application).

The bad news


PCGen appears to have high pause times as well as a premature promotion problem. Let's dive into those a bit further!

High Pause Times

High pause times are caused by GC collectors having to pause application threads in order to clean up object references in a safe manner. The more object references that the collectors have to scan over and clean up, the longer the pause! The longest pauses are usually caused by full GC collections, where the entire Java Heap (e.g. Both young and old gen spaces are getting really full) are cleaned up.

As a user I noticed pauses a couple of times, not enough to really disturb me, but I am aware that I have extremely good hardware and that these pause times may be significantly worse for other users.


As Censum points out, the 0.15% time spent paused is not a major concern, it's the 0.666 second pause time that's concerning. However, I remembered that the highest pause time could come from the initial data load in PCGen. To correlate this, Censum provides a graph of pause times.


The data load was the worst offender, but certainly for each character created there was a good ½ second pause around each character creation point due to a full GC. Again ½ a second wasn't too annoying for me in the context of using PCGen, but as Censum shows, full GCs take time and so PCGen should look to reduce the number of full GC's.

In this case we know that we are probably getting more full GC's than we'd like due to the other warning that Censum gives us - too much premature promotion.

Premature Promotion

Premature promotion basically means that objects that should be getting collected in the young gen space are being promoted into the old gen space before their age is up. This 'age' is known as the tenuring threshold, and is based on a combination of software research done in the 1980's and the JVM's runtime heuristics. Premature promotion can occur due to:
  1. The rate of new objects being created overwhelming the young gen space
  2. The size of the objects being created are too large to fit into the young gen space. e.g. Large contiguous blocks of memory.
This has a knock on effect of putting pressure on the old gen space. It fills up more quickly and therefore more old gen collections and eventually full GCs occur, leading to more frequent pause times.


When I go to take a look and see how soon objects should be promoted and how early they are getting promoted, I get an answer immediately. The Tenuring Summary screen shows us that the Tenuring threshold is set to 15 (objects can survive ~15 collections in young gen before being promoted naturally to old gen). Note also that 100% of objects are being prematurely promoted!


I then go and look and see when objects are being promoted. Censum tells me that as majority are being promoted almost as soon as they arrive in young gen (Tenuring Threshold of 1). That value indicates to me that the arriving objects are too large for the current young gen size. However, I double check and make sure object allocation isn't extremely high, Censum tells me that object allocation peaks at about 2.5GB/s, which is only about ⅔ of the capability of my MBP, so we're OK there.


Steps for PCGen to take

The next step for the PCGen team would be to take a look at something like the histogram from jmap to see what large objects are being created. A quick check of jmap output means that I know that we have large (multi-MB) contiguous arrays of String, char and byte.

An option for PCGen at this point is to resize the young gen space in order to fit these large objects into young gen space using the -XX:NewRatio or -XX:NewSize flags. The size to set either of these two values would depend on the results discovered from the jmap usage. In this case setting -XX:NewSize to ~256M is required to alleviate the the number of pauses.

However, that's generally not a great long term solution. The PCGen project is better off reducing the usage of these large contiguous memory allocations. Lead by Tom Parker and James Dempsey, this is exactly what the PCGen team is doing, taking out large memory structures from the monolithic PlayerCharacter class into smaller more flexible facets.

Summary

It's fantastic to see that the incredibly hard work the PCGen volunteers have put in for the 6.0 release has paid real dividends for the end users!

A couple of years ago I was tracking down a 30+ second pause that a PCGen user had reported when loading up her 25th level Kobold Shaman. It took me several days to track down that it was a GC/memory problem as I only had raw GC log files (which are multithreaded, indeterminant beasts). Going forwards I'll definitely be using tooling like Censum to solve these issues more quickly, which means I can get on with adding new features to PCGen!

Don't forget to check out PCGen for your d20 role-playing game needs and Censum if you want to solve thorny GC problems!

Cheers,
Martijn (Chairperson - PCGen, CTO - jClarity)
aka "The Diabolical Developer"

Tuesday 11 December 2012

My 2012 - including a new Garbage Collection Tool

Hi all,

So I haven't written here for over a year!  Why?


TLDR

Well because several amazing things have happened in the last year or so. I am truly, truly blessed to be part of an amazing, vibrant community of hackers, dreamers and doers. If you've been following me on Twitter then hopefully you'll have had an idea of what's been going on.  If not, or if you were after more details then read on!


I'm the CTO of a new Start-up called jClarity

I co-founded a new start-up called jClarity with industry legends Ben Evans and Kirk Pepperdine along with a host of other exceptional engineersWe're focusing on intelligent products to optimise Java/JVM applications and ensure the stability of the Cloud during its infancy. Basically we decided to try science, basing our analytics off empirical science as opposed to guesswork.

Today, existing performance tools give you lots of data, lots of graphs and no real answers. They also impact heavily on your running system and become a major bottleneck themselves. Having suffered on the wrong end of this for years, we decided to do something about it! We're focusing on simple, light-weight and easy to use tools that give answers as opposed to making you scratch around in deep computer science and Java/JVM theory.

Of course community and supporting FOSS is a strong part of this, we're passionate fans of fun tech communities. If you're interested in performance tuning, cloud, virtualisation, the internals of the JVM, how Java really works etc then sign-up to our Friends of jClarity programme and join in the fun!


The London Java Community continues to kick butt!

We won two awards this year for our work in the Java Standards space.  Our two programmes Adopt a JSR and Adopt OpenJDK continue to thrive and make a real difference to the ~9 million Java developers out there today.

For example, you can thank Stephen Colebourne and also Richard Warburton/James Gough for ensuring that Java 8 gets some great new date and time APIs via JSR 310!


I spoke at 10+ conferences

I was privileged to be invited to speak at a large variety of conferences this year, from intimate rooms at regional events in their first year to keynotes at some of the largest Java conferences in the world. It was (as always) a learning curve, dealing with different audience sizes, cultures and areas of interest.

My technical talks focused on Java and the Machine - looking at where the language and the JVM hit the Operating system and the hardware and why that matters again. For those of you who missed the 2012 trend on this, basically multi-core CPUs and virtualisation/cloud means that Java developers have to really think about the hardware they are coding on top of again, no more free lunch!


The Well-Grounded Java Developer was released!

After ~2 years of elapsed time Ben Evans and myself are very proud that our first book The Well-Grounded Java Developer has been released.  So far it's all 5 star reviews in Amazon, some even by people we don't know ;-)!


I was honoured to be counted as a Java Champion

The Java Champions are a group of Java luminaries that I've looked up to my entire career. To be named in their number is still bewildering to me, but there you have it.

We built a solution to figure out your Garbage Collection problems

We've released Censum, a new tool for analysing Garbage Collection logs, telling you exactly what you need to change in order to stop those mysterious pauses and the OutOfMemoryErrors (amongst other things).

Garbage Collection often causes production problems, and is a difficult, specialised area that can be very frustrating for busy developers, devops and operations teams. Censum analyses complex logs and tells you about your GC performance problems in plain English, along with the fix!

Censum is available to download at www.jclarity.com
.

So that's been my year in a nutshell - I hope yours was as much fun and that you got to write more code than I did, something I need to rectify in 2013 :-)