JVM Metaspace Out-of-Memory (OOM)

My client is experiencing metaspace OOM (PDC alert OPS0025).

What kinds of configurations in Pega Platform can cause metaspace OOM, e.g. data pages, report definitions, UI components? Articles say it is related to class metadata, but we want to know which configurations in Pega Platform cause class metadata to be loaded into metaspace, so that we can fix whatever in our recently deployed code started the rapid increase in metaspace usage. Where should we look to find the root cause and a fix?

The OPS0025 alert in PDC points to a section (with a personalized/optimized table layout), but we could not determine whether this was the root cause or whether it simply got caught in the middle of a metaspace that was already full.

Any insight is appreciated.

@Will Cho

Here is an interesting article on Metaspace (at least to me):
https://www.mastertheboss.com/java/solving-java-lang-outofmemoryerror-metaspace-error/?utm_content=cmp-true

It seems the defined max could be too small (it should be tried without a max defined), and it could also be linked to GC behavior.
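One quick way to check the point above is to read the "Metaspace" memory pool over JMX from inside the JVM. This is a minimal plain-Java sketch (not Pega-specific; the class name `MetaspaceCheck` is made up for illustration); a `max` of -1 means no `-XX:MaxMetaspaceSize` cap is set:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MetaspaceCheck {
    /** Returns usage of the HotSpot "Metaspace" pool, or null if not found. */
    static MemoryUsage metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        MemoryUsage u = metaspaceUsage();
        // getMax() == -1 means the JVM was started without -XX:MaxMetaspaceSize
        System.out.printf("Metaspace used=%d committed=%d max=%d%n",
                u.getUsed(), u.getCommitted(), u.getMax());
    }
}
```

Watching `used` climb steadily between node recycles (e.g. logged once a minute) is a cheap way to confirm a metaspace leak rather than a one-off spike.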

I have faced lots of OOMs, but this is the first time I hear of a metaspace OOM.

I am interested in following up on your issue and hope to read about the root cause and solution :)

Regards

Anthony

@Anthony_Gourtay Initially, we increased the metaspace from 2 GB to 3 GB to buy more time, but that did not really address the root cause. Eventually, the culprit caught up, exceeded the 3 GB limit, and caused an OOM again.

This time, we found the root cause, and it was an interesting one. We had a very large and complex section, so complicated that it takes about 2 minutes to open in Dev Studio, and often it would time out and fail to open. This section was referenced from the main home page, so it would run every time a user logged in. That meant it took a long time to assemble the code (First Use Assembly) and loaded a very large amount of bytecode into the metaspace. To make matters worse, the pyExecuteDbSystemPulse job scheduler was turned OFF, so the section was re-assembled every time a user logged in to the main page, stressing and overloading the metaspace. Consequently, metaspace usage increased rapidly and, about every 2-3 days, a metaspace out-of-memory occurred, causing an outage and a node recycle.
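The mechanism described above can be reproduced outside Pega with a toy sketch (the names `MetaspaceDemo`, `Definer`, and `Payload` are made up for illustration). Each fresh ClassLoader that defines the same bytecode gets its own copy of the class metadata in metaspace, which is roughly what repeated re-assembly of the same section does:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class MetaspaceDemo {
    static class Payload {} // class whose bytecode we re-define repeatedly

    /** Reads the compiled bytecode of a class from the classpath. */
    static byte[] classBytes(Class<?> c) throws Exception {
        String path = c.getName().replace('.', '/') + ".class";
        try (InputStream in = c.getClassLoader().getResourceAsStream(path);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            return out.toByteArray();
        }
    }

    /** A throwaway ClassLoader that exposes defineClass. */
    static class Definer extends ClassLoader {
        Class<?> define(String name, byte[] b) {
            return defineClass(name, b, 0, b.length);
        }
    }

    /** Defines n independent copies of Payload, each in its own loader. */
    static List<Class<?>> defineCopies(int n) throws Exception {
        byte[] bytes = classBytes(Payload.class);
        List<Class<?>> keep = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // each new ClassLoader adds a fresh copy of the metadata to metaspace
            keep.add(new Definer().define(Payload.class.getName(), bytes));
        }
        return keep;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("defined " + defineCopies(1000).size() + " class copies");
    }
}
```

As long as the defining loaders (or the Class objects) stay reachable, none of that metadata can be unloaded by GC, so usage only grows; this is why caching the assembled result, as the fix below does, stops the growth.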

Fix

  • As a short-term fix, we turned ON the pyExecuteDbSystemPulse job scheduler so that the troubled section is assembled only once, on first use, and then cached.
  • As a long-term, permanent fix, we are planning to refactor the section to make it simpler and smaller.