Pega Infinity 24.1 pods on Kubernetes failing to start up

I am trying to run Pega Infinity 24.1 on a vanilla Kubernetes cluster which has eight worker nodes.

NAME       STATUS   ROLES           AGE     VERSION
kmaster    Ready    control-plane   24d     v1.30.2
kworker1   Ready    <none>          24d     v1.30.2
kworker2   Ready    <none>          24d     v1.30.2
kworker3   Ready    <none>          5d22h   v1.30.2
kworker5   Ready    <none>          5d21h   v1.30.2
kworker6   Ready    <none>          2d23h   v1.30.2
kworker7   Ready    <none>          3d13h   v1.30.2
kworker8   Ready    <none>          3d      v1.30.2
pegacrm    Ready    <none>          12d     v1.30.2

I am deploying with the Pega Helm chart, configured through a values.yaml file. Two Cassandra pods and three Kafka pods are running.

When I deployed the Pega Helm chart, the Pega search pod started, but the web and batch pods failed to become ready.

NAME                               READY   STATUS    RESTARTS        AGE     IP           NODE
cassandra-0                        1/1     Running   0               32m     10.0.1xx.x   kworker5
cassandra-1                        1/1     Running   0               31m     10.0.xx.x    kworker6
kafka-controller-0                 1/1     Running   1 (87m ago)     2d15h   10.0.xxx.x   kworker1
kafka-controller-1                 1/1     Running   1 (87m ago)     2d15h   10.0.xx.x    kworker2
kafka-controller-2                 1/1     Running   1 (87m ago)     2d15h   10.0.xx.x    kworker3
pegadev-batch-58d9fcdb96-nzgjm     0/1     Running   3 (2m49s ago)   19m     10.0.xxx.x   kworker8
pegadev-search-0                   1/1     Running   0               19m     10.0.xx.x    kworker2
pegadev-web-879dbcc9-47wvf         0/1     Running   3 (2m59s ago)   19m     10.0.xxx.x   kworker7
postgres-deploy-6df65b6bb9-zk2dl   1/1     Running   7 (87m ago)     24d     10.0.xx.x    kworker2

The following resources are allocated to the web pod, which has nodeType: "WebUser":

resources:
  requests:
    memory: "12Gi"
    cpu: 4
  limits:
    memory: "12Gi"
    cpu: 4

The following resources are allocated to the batch pod, which has nodeType: "BackgroundProcessing,Search,Batch,BIX":

resources:
  requests:
    memory: "8Gi"
    cpu: 4
  limits:
    memory: "8Gi"
    cpu: 4

In values.yaml I am using the image search:8.24.0 for the search pod.

The logs of the web pod and the batch pod are attached.

Please advise how this issue can be resolved.

web-node-log.txt (170 KB)

batch-node-log.txt (275 KB)

@DebrajB16819133

I looked at the log files and noticed the following:

The pods are running Eclipse Adoptium 11.0.23.

OpenJDK 11 is deprecated for Infinity 24.1.

OpenJDK 17 is fully supported.

The following error appears in both the web and batch logs, and you will probably need to fix it to get the pods to start up:

{"source_host":"pegadev-web-879dbcc9-47wvf","level":"INFO","thread_name":"main","appender_ref":"PEGA","logger_name":"com.pega.platform.environment.nodeclassification.internal.NodeClassificationImpl","message":"NodeTypes considered on current node = [WebUser] , for given -DNodeType = [ WebUser ]","version":1,"timestamp":"2024-07-15T04:38:10.546Z"}
Located PegaRULES configuration: file:/usr/local/tomcat/webapps/prweb/WEB-INF/classes/prconfig.xml
{"exception":{"stacktrace":"org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 3; The element type "env" must be terminated by the matching end-tag "</env>".\n\tat

{"source_host":"pegadev-batch-58d9fcdb96-nzgjm","level":"INFO","thread_name":"main","appender_ref":"PEGA","logger_name":"com.pega.platform.environment.nodeclassification.internal.NodeClassificationImpl","message":"NodeTypes considered on current node = [BackgroundProcessing, Search, Batch, BIX] , for given -DNodeType = [ BackgroundProcessing,Search,Batch,BIX ]","version":1,"timestamp":"2024-07-15T04:54:05.464Z"}
Located PegaRULES configuration: file:/usr/local/tomcat/webapps/prweb/WEB-INF/classes/prconfig.xml
{"exception":{"stacktrace":"org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 3; The element type "env" must be terminated by the matching end-tag "</env>".\n\tat
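
A well-formedness error like this one can be reproduced and checked outside the pod before redeploying. The following is a minimal Python sketch, not part of any Pega tooling; the sample XML and the setting name example/setting are hypothetical stand-ins for the real prconfig.xml content.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, else a (line, column, message) tuple."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return (line, column, str(err))
    return None


# Hypothetical fragment: the <env> element is opened but never closed,
# which is the kind of mistake the SAXParseException above reports.
broken = """<pegarules>
  <env name="example/setting" value="true">
</pegarules>
"""

# Self-closing the element makes the document well-formed again.
fixed = broken.replace('value="true">', 'value="true"/>')

for label, text in (("broken", broken), ("fixed", fixed)):
    problem = check_well_formed(text)
    print(label, "->", "well-formed" if problem is None else problem)
```

Running the same function on a local copy of the real prconfig.xml should point at the line that needs the closing tag.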

@PhilipShannon

Thanks very much for your analysis and suggestions.

I added the missing closing tag in prconfig.xml, and that error message no longer appears.

My worker nodes are running Microsoft OpenJDK 21.

Node 7:

root@kworker7:/home/kadmin# java -version
openjdk version "21.0.3" 2024-04-16 LTS
OpenJDK Runtime Environment Microsoft-9388422 (build 21.0.3+9-LTS)
OpenJDK 64-Bit Server VM Microsoft-9388422 (build 21.0.3+9-LTS, mixed mode, sharing)
root@kworker7:/home/kadmin#

Node 8:

root@kworker8:/home/kadmin# java -version
openjdk version "21.0.3" 2024-04-16 LTS
OpenJDK Runtime Environment Microsoft-9388422 (build 21.0.3+9-LTS)
OpenJDK 64-Bit Server VM Microsoft-9388422 (build 21.0.3+9-LTS, mixed mode, sharing)
root@kworker8:/home/kadmin#

It seems that the Docker image "platform/pega/24.1.0" provided by Pega bundles OpenJDK 11.0.23+9, so the pods run on that JDK rather than on the one installed on the worker nodes.

Please check and advise.

Thanks.

@DebrajB16819133 Thanks for the additional details. Pega may be in the process of building new Docker images that use OpenJDK 17.

While OpenJDK 11 is deprecated, it should still work fine for this.
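
To compare the versions mentioned here without confusing the legacy "1.x" naming with the current scheme, a small helper can extract the major version from a version string. This is an illustrative Python sketch; the status table only restates what is said in this thread (11 deprecated, 17 fully supported) and is not an official support matrix.

```python
import re

# Status as stated in this thread only; other versions are not discussed here.
STATUS_PER_THREAD = {11: "deprecated", 17: "fully supported"}


def java_major_version(version_string):
    """Extract the major Java version from strings like '11.0.23' or '1.8.0_292'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as "1.<major>".
    if first == "1" and second is not None:
        return int(second)
    return int(first)


for version in ("11.0.23", "21.0.3"):
    major = java_major_version(version)
    status = STATUS_PER_THREAD.get(major, "not discussed in this thread")
    print(f"{version}: major {major}, {status}")
```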

@PhilipShannon

I found the following events on the batch pod:

Events:
  Type     Reason     Age                   From     Message
  Normal   Started    5m35s                 kubelet  Started container pega-web-tomcat
  Warning  Unhealthy  92s (x23 over 5m13s)  kubelet  Startup probe failed: Get "http://10.0.144.1:8080/prweb/PRRestService/monitor/pingService/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

As suggested in support ticket INC-B30516, I added the following lines to values.yaml:

startupProbe:
  initialDelaySeconds: 1100

This resolved the issue.

I built the Kubernetes cluster from old laptops with slow HDDs, which is why the pods take a long time to start up.
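
As a rough way to reason about these settings: the kubelet waits initialDelaySeconds before the first probe, then probes every periodSeconds, and restarts the container once failureThreshold consecutive startup probes have failed. The Python sketch below estimates the resulting time budget. Only the initialDelaySeconds value of 1100 comes from this thread; the period and threshold values are illustrative assumptions, not values read from the Pega chart, and probe timeouts can stretch the real figure somewhat.

```python
def startup_budget_seconds(initial_delay, period, failure_threshold):
    """Approximate time a container has to pass its startup probe.

    Mirrors the Kubernetes probe fields initialDelaySeconds, periodSeconds
    and failureThreshold; per-probe timeouts are ignored for simplicity.
    """
    return initial_delay + period * failure_threshold


# initial_delay is taken from the thread; period and failure_threshold are assumed.
budget = startup_budget_seconds(initial_delay=1100, period=10, failure_threshold=3)
print(f"approximate startup budget: {budget} s")
```

On slow hardware, raising either initialDelaySeconds or failureThreshold increases this budget; a larger failureThreshold has the advantage that a pod which starts quickly is still marked ready as soon as a probe succeeds.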

For the web tier (nodeType: "WebUser") I used the following values:

livenessProbe:
  port: 8081
  initialDelaySeconds: 600
  timeoutSeconds: 120
  failureThreshold: 30
readinessProbe:
  initialDelaySeconds: 600
  failureThreshold: 30
startupProbe:
  initialDelaySeconds: 600

For the batch tier (nodeType: "BackgroundProcessing,Search,BIX,Batch,RealTime,ADM,RTDG") I used the following values:

livenessProbe:
  port: 8081
  initialDelaySeconds: 1100
  timeoutSeconds: 120
  failureThreshold: 30
readinessProbe:
  initialDelaySeconds: 1100
  failureThreshold: 30
startupProbe:
  initialDelaySeconds: 1100

@DebrajB16819133 We are seeing this error for our stream pod: "Startup probe failed: Get "http://xxxxxxx:8080/prweb/PRRestService/monitor/pingService/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

Please see the logs below:

28-Oct-2024 14:03:48.272 INFO [main] org.apache.catalina.core.AprLifecycleListener.lifecycleEvent The Apache Tomcat Native library which allows using OpenSSL was not found on the java.library.path: [/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib]
28-Oct-2024 14:03:48.576 INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["http-nio2-8080"]
28-Oct-2024 14:03:48.616 INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["http-nio2-8081"]
28-Oct-2024 14:03:48.618 INFO [main] org.apache.catalina.startup.Catalina.load Server initialization in [720] milliseconds
28-Oct-2024 14:03:48.671 WARNING [main] org.apache.catalina.users.MemoryUserDatabase.createUser Null or zero length user name specified. The user will be ignored.
28-Oct-2024 14:03:48.672 INFO [main] org.apache.tomcat.util.digester.FactoryCreateRule.begin Error creating object: [Null or zero length user name specified. The user will be ignored.]
28-Oct-2024 14:03:48.675 INFO [main] org.apache.catalina.core.StandardService.startInternal Starting service [Catalina]
28-Oct-2024 14:03:48.675 INFO [main] org.apache.catalina.core.StandardEngine.startInternal Starting Servlet engine: [Apache Tomcat/9.0.88]
28-Oct-2024 14:03:48.682 INFO [main] org.apache.catalina.startup.HostConfig.deployDescriptor Deploying deployment descriptor [/usr/local/tomcat/conf/Catalina/localhost/prweb.xml]
28-Oct-2024 14:03:48.704 WARNING [main] org.apache.catalina.startup.HostConfig.deployDescriptor A docBase [/usr/local/tomcat/webapps/prweb] inside the host appBase has been specified, and will be ignored
28-Oct-2024 14:03:49.045 INFO [main] java.util.ArrayList.forEach Name = PRFileStore Ignoring unknown property: value of "Database-based File Access" for "description" property
28-Oct-2024 14:03:49.070 INFO [main] java.util.ArrayList.forEach Name = AdminPegaRULES Ignoring unknown property: value of "PegaRULES Admin datasource" for "description" property
28-Oct-2024 14:03:49.225 INFO [main] org.apache.jasper.servlet.TldScanner.scanJars At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
28-Oct-2024 14:03:49.383 INFO [main] com.pega.pegarules.internal.bootstrap.PRBootstrapDataSource. Loading bootstrap properties from /prbootstrap.properties
28-Oct-2024 14:03:49.385 INFO [main] com.pega.pegarules.internal.bootstrap.SettingReaderJNDI. Could not find java:comp/env/prbootstrap/ in the local JNDI context, skipping prconfig setting lookup
28-Oct-2024 14:03:49.386 INFO [main] com.pega.pegarules.internal.bootstrap.SettingReaderJNDI. Could not find prbootstrap in the local JNDI context, skipping prconfig setting lookup
28-Oct-2024 14:03:49.397 INFO [main] com.pega.pegarules.internal.bootstrap.PRBootstrapDataSource. Bootstrap datatables schema: PEGA_GENE_DATA_USR
28-Oct-2024 14:06:00.422 SEVERE [main] com.pega.pegarules.internal.bootstrap.PRBootstrapDataSource. Unable to connect to database. Will only use properties from file.
java.sql.SQLException: Cannot create PoolableConnectionFactory (IO Error: The Network Adapter could not establish the connection)

Any thoughts or inputs?
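
The last two log lines point at the database connection rather than at the probe itself: startup stalls from 14:03:49 to 14:06:00 and then reports that the network adapter could not establish the connection. One way to narrow this down, assuming the failure is at the network level, is a plain TCP reachability check run from inside the cluster. The Python sketch below is only an illustration; the host and port in the usage note are hypothetical placeholders for the values in the JDBC URL.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, can_connect("db.example.internal", 1521) returning False from a pod in the same namespace would suggest a DNS, firewall or network policy problem rather than a Pega configuration problem.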