We’re running Pega Platform 24.1.3 in on-premises containers on Red Hat OpenShift, with the embedded Search and Stream services.
So far, Kafka diagnostic logs and Kafka data are retained only within the pods. We need a way to persist the logs and Kafka data, both for better application performance and so we can analyze the Kafka logs when something goes wrong.
Could you configure your Stream pods to mount a Persistent Volume via Persistent Volume Claims (PVCs)? That would let the Kafka data (topics, offsets, and logs) survive pod restarts and rescheduling.
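If it helps, the PVCs can be requested directly in the Pega Helm chart's values.yaml as part of the stream tier definition. A minimal sketch, assuming the standard pega chart tier format; verify the key names against your chart version, and the size against your retention needs:

```yaml
# Sketch: per-pod PVCs for the Stream tier via the pega Helm chart.
# The 20Gi size is a placeholder; confirm the exact keys and whether
# you need a storageClassName against your chart version's docs.
global:
  tier:
    - name: "stream"
      nodeType: "Stream"
      replicas: 3
      volumeClaimTemplate:
        resources:
          requests:
            storage: 20Gi   # size for Kafka data plus the retention window
```

With a volumeClaimTemplate in place, each stream pod gets its own PVC, so broker data follows the pod identity across restarts.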
Alternatively, you can externalize the Kafka logs. For example, deploy a logging sidecar container (Filebeat or Fluentd) alongside your Stream pod to ship the Kafka logs to an external log platform such as Kibana or Splunk.
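For the sidecar approach, one possible shape of the pod spec is below. This is a sketch, not a drop-in config: the image tag, mount paths, and the assumption that the stream container writes its Kafka logs to a shareable volume are all placeholders you would adapt to your deployment.

```yaml
# Sketch: Filebeat sidecar tailing Kafka logs from a volume shared with
# the stream container. Paths and the image tag are assumptions.
spec:
  containers:
    - name: stream
      # ... Pega stream container, writing Kafka logs into the shared volume
      volumeMounts:
        - name: kafka-logs
          mountPath: /opt/pega/kafkadata/logs
    - name: filebeat
      image: docker.elastic.co/beats/filebeat:8.13.0
      args: ["-e", "-c", "/etc/filebeat/filebeat.yml"]  # config shipped via ConfigMap
      volumeMounts:
        - name: kafka-logs
          mountPath: /var/log/kafka
          readOnly: true
  volumes:
    - name: kafka-logs
      emptyDir: {}
```

The Filebeat config itself (inputs watching /var/log/kafka, output to Elasticsearch or Logstash) would be mounted from a ConfigMap.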
Here is how to make Kafka logs and data survive pod restarts in Pega 24.1.3 on OpenShift:

- Use the Pega Helm chart to run the stream tier as a StatefulSet with a volumeClaimTemplate, so Kafka writes to a PVC rather than the ephemeral container filesystem. In values.yaml, enable persistent storage for the stream tier and set a storageClass and size; this creates per-pod PVCs.
- Critically, mount the PVC at the actual Kafka log directory used by Pega's stream image. Earlier builds wrote to /opt/pega/kafkadata while charts mounted /opt/pega/streamvol, so verify and align the mount path, or the data will still be ephemeral.
- After confirming the mount, set Kafka retention and cleanup settings via the stream tier properties: log.retention.ms or log.retention.hours, log.segment.bytes, and log.cleanup.policy=delete (or delete,compact where compaction is appropriate).
- If you want to preserve data across redeploys, keep PVCs after helm uninstall by using a Retain reclaim policy.
- For diagnostics, don't store large file logs on the same PVC. Ship container stdout and any Kafka server logs to a central store using cluster logging (e.g., Fluent Bit or Vector into ELK, Loki, or Splunk), and keep the pod volumes focused on broker data.
- If you need stronger SLOs and simpler operations, consider externalizing the stream service: point Pega at a managed Kafka on OpenShift (Red Hat AMQ Streams/Strimzi) or your enterprise Kafka. Pega explicitly recommends externalized Kafka for new deployments. With AMQ Streams/Strimzi, request persistent storage for Kafka and ZooKeeper/KRaft via their CRDs and let that stack handle retention, tiering, and compaction.
- Set topic-level retention overrides only where needed, to avoid over-retention on internal Pega topics.
- Test failover by deleting a stream pod and confirming that leadership and partitions recover with data intact from the PVC.
- Monitor disk usage and segment counts; reduce log.segment.bytes if you need faster log rolling and smaller recovery windows.
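As an illustration of the retention and cleanup settings mentioned above, the broker-side properties might look like the following. The numbers are examples only; tune them to your disk size and recovery objectives, and check how your Pega version passes custom properties to the embedded brokers.

```properties
# Example Kafka retention/cleanup settings; values are placeholders.
log.retention.hours=72                 # keep data for 3 days
log.segment.bytes=268435456            # 256 MiB segments for faster rolling
log.cleanup.policy=delete              # or "compact,delete" for compacted topics
log.retention.check.interval.ms=300000 # how often the cleaner evaluates segments
```

Note that retention is enforced per closed segment, so smaller log.segment.bytes values make the retention window more precise at the cost of more files.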
Document a backup/restore runbook at the storage layer (snapshots of the PV or the storage backend) rather than exporting topics from inside the pods. Finally, keep the search service storage persistent in the same way, but treat it as a rebuildable cache and prioritize Kafka durability first. This approach gives you durable broker data, centralized logs for analysis, and cleaner day-2 operations.
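For the storage-layer backup, if your cluster's CSI driver supports snapshots, a VolumeSnapshot of the stream PVC is one way to capture broker data without touching the pods. The PVC name and snapshot class below are assumptions about your environment:

```yaml
# Sketch: CSI snapshot of a stream PVC. The PVC name and
# volumeSnapshotClassName are placeholders for your cluster.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: stream-data-backup
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: pega-stream-pega-stream-0
```

Restoring is then a matter of creating a new PVC with `dataSource` pointing at the snapshot; quiesce or snapshot all brokers close together in time so partition replicas stay roughly consistent.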
I have already implemented the same method, and I applied a similar approach to retain the Pega application logs as well.
Kafka performance is better than before, but I still sometimes see under-replicated partitions and a com.pega.charlatan.utils.CharlatanException$SessionExpiredException.
CharlatanException$SessionExpiredException typically occurs when a client's session with Charlatan, Pega's internal coordination service for the embedded stream tier, expires because it timed out or lost its connection.
Could you check whether your stream pods have sufficient memory and CPU, so that the service isn't being evicted or restarted frequently? PVC disk throughput can also suffer when multiple brokers share the same backend storage resources.
Also verify that all Kafka brokers are healthy (no frequent restarts or performance bottlenecks) and that they are part of the in-sync replica (ISR) set.
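A few commands that can help with these checks. The namespace, label selector, pod name, and the in-container path to the Kafka scripts are assumptions about your environment; adjust them to match your deployment.

```shell
# Check restart counts and resource pressure on the stream pods
oc get pods -n pega -l app=pega-stream -o wide
oc adm top pods -n pega -l app=pega-stream

# From inside a stream pod, list partitions that are under-replicated.
# Empty output means every partition has a full in-sync replica set.
oc exec -n pega pega-stream-0 -- \
  /opt/pega/kafka/bin/kafka-topics.sh \
    --bootstrap-server localhost:9092 \
    --describe --under-replicated-partitions
```

If the under-replicated list is non-empty only around pod restarts or GC pauses, that points at session timeouts rather than a persistent storage problem.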