Duplicate Interaction History records added abruptly

On a Pega CDH 8.6 platform running on Kubernetes, we noticed a huge number of records being written to Interaction History (IH) over a two-hour window. That window coincided with cluster patching on the environment side, and the only activity during the event was rolling restarts of pods. Reviewing some of the data at a low level shows these are duplicate IH entries from old/existing interactions.
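
For reference, duplicates like this can be surfaced with a grouping query along the following lines (a minimal sketch, assuming a PostgreSQL back end and the standard pegadata.pr_data_ih_fact fact table; the DSN, schema, and column names are placeholders to adjust for your own install):

```python
# Minimal sketch: count IH rows that share the same logical key but were stored
# more than once. Table and column names follow the standard Pega data schema
# and may need adjusting for your install; the DSN and window are placeholders.
from datetime import datetime

import psycopg2

DUPLICATE_QUERY = """
    SELECT pyinteractionid, pysubjectid, pyoutcome, pyoutcometime,
           COUNT(*) AS copies
    FROM pegadata.pr_data_ih_fact
    WHERE pyoutcometime BETWEEN %s AND %s
    GROUP BY pyinteractionid, pysubjectid, pyoutcome, pyoutcometime
    HAVING COUNT(*) > 1
    ORDER BY copies DESC
"""

def find_duplicates(dsn, window_start, window_end):
    """Return (interaction id, subject id, outcome, outcome time, copies) rows."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(DUPLICATE_QUERY, (window_start, window_end))
        return cur.fetchall()

if __name__ == "__main__":
    rows = find_duplicates(
        "host=localhost dbname=pega user=reader",   # placeholder DSN
        datetime(2021, 6, 7, 5, 0),                 # example window start
        datetime(2021, 6, 7, 7, 0),                 # example window end
    )
    for row in rows:
        print(row)
```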

What could have caused these existing interactions to be abruptly re-saved to IH in the database? Appreciate any inputs, thank you!

@shushruthR, can you please confirm that you logged a support ticket for this issue?

I found INC-228193 ("DF_CaptureResponse not processing records, leading to missed IH") reporting this issue.

In the ticket, the user reported a huge number of records written to IH during a two-hour window on 06/07, 5:00-7:00 AM CT.

The ticket was closed on July 25th with the following analysis from our product engineering team:

"Although the root cause can only be speculated, there was a call where troubleshooting approaches were discussed.

It appeared that issues in the stream nodes could have caused the data flow issue. However, the artefacts captured at the time were not conclusive, so this remains only a suspected root cause: a data flow with a stream data set as its source relies on the stream services to determine partitions during the partition assignment process.

Below are two external links for the jattach utility, which can be used on Kubernetes to capture thread and heap dumps.

Troubleshooting OpenJDK applications on Kubernetes at the command line with jattach

GitHub - jattach/jattach: JVM Dynamic Attach utility

Should the issue recur (i.e., the data flow is found in a hung state), a script was provided to capture thread dumps. Please capture the following information for a detailed review:

  1. PegaRULES, ALERT, CLUSTER, GC logs from all nodes where the data flow is running.
  2. PegaRULES, ALERT, CLUSTER, GC logs and kafka diagnostics from Stream nodes.
  3. Screenshots of the JMX statistics shown for each stream node when clicking on the ‘NORMAL’ status.
  4. Screenshots of all tabs of the data flow run, including any stack trace observed in the view warnings. Also capture the partition report and life-cycle events in case you run into similar issues with any non-single-case (batch or real-time) data flows.
  5. Use the script provided in the support ticket to capture continuous thread dumps."
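
For illustration only (this is not the script that was attached to the ticket), a continuous thread-dump capture along the lines of item 5 could look like the sketch below. It assumes jattach is on the PATH inside the pod and that the Pega JVM runs as PID 1, which is typical for single-process containers:

```python
# Minimal sketch: capture a thread dump from the JVM every 30 seconds via jattach.
# PID, interval, dump count, and output directory are illustrative placeholders.
import subprocess
import time
from datetime import datetime

PID = "1"            # JVM process ID inside the container
INTERVAL_SECS = 30   # delay between consecutive dumps
DUMP_COUNT = 20      # number of dumps to collect

for _ in range(DUMP_COUNT):
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    out_file = f"/tmp/threaddump-{stamp}.txt"
    # jattach streams the JVM's thread dump back over the attach socket,
    # so redirecting stdout captures the full dump.
    with open(out_file, "w") as fh:
        subprocess.run(["jattach", PID, "threaddump"], stdout=fh, check=False)
    time.sleep(INTERVAL_SECS)
```

The same loop can be driven from outside the pod with kubectl exec, as long as jattach runs in the same container (and as the same user) as the target JVM.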

Your team was instructed to create a new incident and share all of the requested artefacts if the issue reproduces in the future.

Can you confirm that this answered your question?