Splitting PDF by Keyword While Preserving Original Page Formatting in Pega and Create new PDF

TharushiK176830171 · April 3, 2026, 11:24am

Hi everyone,

I have a requirement to split a PDF document in Pega based on a specific keyword. When the keyword is found on certain pages, those entire pages should be extracted and compiled into a new PDF document that maintains the original formatting. I am working on Pega Constellation, version 25.1.2.

I have already attempted an approach using GenAI, where I:

Search for the keyword within the document
Extract the relevant page containing the keyword
Generate a new PDF using the Create PDF automation

While this approach works to some extent, it only extracts the text content and does not preserve the original formatting. Additionally, it does not include the full page, only the extracted section.

What I’m looking for is a way to:

Identify the pages containing the keyword
Extract those full pages (not just sections)
Generate a new PDF that preserves the original layout and formatting

Has anyone implemented a similar requirement or can suggest an approach or best practice to achieve this in Pega?

Thanks in advance!

JayachandraSiddipeta · April 3, 2026, 11:46am

Hello @TharushiK176830171

When you said GenAI, I am assuming you tried both GenAIConnect and the Agent rule.

Generally, with well-structured agent instructions and strict guardrails, you could achieve your requirement with an Agent rule. You could try generating the Agent instructions, Guardrails, and Response style & tone using ChatGPT and providing them to the agent.

@RameshSangili Your valuable suggestions are needed here for GenAIConnect and Agent rules.

I have tried a few PDF requirements with the Pega Agent rule and it works well. I will try this requirement as well and keep you posted.

If we have to look at alternatives, you might have to rely on external Java libraries such as iText (com.lowagie) or PDFBox. I have only tried these two; there could be more. As far as I remember, both are already shipped as part of Pega Infinity. You can search for them under Java classes in Admin Studio.

These libraries offer many methods for parsing a PDF page by page, segregating the content, and generating a new PDF. You will have to write a separate function to perform this operation and generate the new PDF.
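
As a rough illustration of the page-matching step in such a function (this is not the code attached later in this thread), here is a JDK-only sketch. It assumes the text of each page has already been extracted into a list, one entry per page; the class and method names (PageMatcher, pagesContaining) are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: decides which pages to keep, given the text of each page.
final class PageMatcher {
    private PageMatcher() {}

    /** Returns the zero-based indexes of pages whose text contains the keyword, ignoring case. */
    static List<Integer> pagesContaining(List<String> pageTexts, String keyword) {
        if (keyword == null || keyword.trim().isEmpty()) {
            throw new IllegalArgumentException("keyword must not be blank");
        }
        String needle = keyword.trim().toLowerCase(Locale.ROOT);
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            // Pages with no extractable text (for example scanned images) are skipped.
            if (text != null && text.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(i);
            }
        }
        return matches;
    }
}
```

With PDFBox, the per-page text would typically come from a PDFTextStripper restricted to one page at a time through setStartPage and setEndPage, and the matching page objects would then be copied into a new PDDocument. Copying whole page objects, rather than re-rendering extracted text, is what keeps the original layout. The exact calls may differ with the PDFBox version shipped in your Pega installation.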

Regards

JC

TharushiK176830171 · April 3, 2026, 12:02pm

Hi @JayachandraSiddipeta ,

Thank you very much for the detailed explanation and suggestions. I will look into those options as well if the GenAI-based approach does not fully meet the requirement.

Please do keep me posted if you get a chance to try out this requirement from your end. I will also continue experimenting and share any updates or findings here.

Really appreciate your support!

Regards,
Tharushi

JayachandraSiddipeta · April 3, 2026, 3:04pm

Hello @TharushiK176830171

I managed to meet your requirement using the PDFBox Java API. Below is the Pega jar file version I have in my Pega Infinity v25.1.2.

The source PDF is created as an EForm file and opened in the activity, and its file source is passed into the function. The function checks for the search string, picks all the matching pages, generates a new PDF as a byte array, and returns it to the activity.

Pega activity

Attached is the function code in the text file

PDFSegregator.txt (3.7 KB)

Attached is the video of the execution of the activity

In the POC, I generated the output PDF and simply viewed it in the browser.

You can make changes according to your requirements in your application.

I will also let you know the update with the Agent approach as well.

Let me know if you have any issues executing the above approach.

Regards

JC

JayachandraSiddipeta · April 7, 2026, 4:37am

Hello @TharushiK176830171

I have done some practical and theoretical analysis to implement your requirement with GenAI and below is my take.

GenAI rules might not support your requirement exactly, but they can help with PDF processing for:

Keyword suggestion
Semantic matching (instead of exact match)

GenAI is not meant for splitting and stitching PDFs based on some logic or decision.

You can use something like this for a Pega Agent:

Agent Instructions

Identify page numbers in the document where the {keyword} appears.
Return only the page numbers in ascending order.
Do not extract or modify document content.
Do not summarize or reformat text.

{keyword} can be your comma-separated search string. You can play around with the search statement as per your need.

Guardrails

Do NOT generate PDF
Do NOT rewrite content
Do NOT extract partial text
Only return metadata (page numbers)

Style & Tone

Deterministic
Structured JSON output:

{
  "matchedPages": [2, 5, 9]
}

Once you get the page numbers, you can pass them to PDFBox, loop through the pages, and fetch only the selected ones to generate the resulting PDF (if you don't want the red highlighting mentioned in my previous reply). This would reduce the dependency and burden on the PDFBox API, and your function code will be lighter.
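
A minimal sketch of that hand-off, assuming the agent returns exactly the JSON shape shown above and that its page numbers are 1-based (the thread does not say which). The names MatchedPages and toPageIndexes are made up for illustration, and a regular expression stands in for a proper JSON parser only to keep the example free of dependencies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns the agent's JSON reply into page indexes for the PDF library.
final class MatchedPages {
    // Captures the contents of the "matchedPages" array, e.g. "2, 5, 9".
    private static final Pattern ARRAY =
            Pattern.compile("\"matchedPages\"\\s*:\\s*\\[([^\\]]*)\\]");

    private MatchedPages() {}

    /**
     * Converts the agent's page numbers (assumed 1-based) into sorted, de-duplicated
     * zero-based indexes, dropping any number outside 1..pageCount.
     */
    static List<Integer> toPageIndexes(String agentJson, int pageCount) {
        Matcher m = ARRAY.matcher(agentJson);
        if (!m.find()) {
            throw new IllegalArgumentException("matchedPages array not found");
        }
        TreeSet<Integer> indexes = new TreeSet<>();
        for (String token : m.group(1).split(",")) {
            String trimmed = token.trim();
            if (trimmed.isEmpty()) {
                continue; // empty array, or a stray comma
            }
            int pageNumber = Integer.parseInt(trimmed);
            // Never trust model output blindly: ignore pages the document does not have.
            if (pageNumber >= 1 && pageNumber <= pageCount) {
                indexes.add(pageNumber - 1);
            }
        }
        return new ArrayList<>(indexes);
    }
}
```

For the sample response above and a 10-page document, this yields the zero-based indexes 1, 4 and 8, which is the form PDFBox's PDDocument.getPage expects. Checking the numbers against the real page count is worth keeping even with strict guardrails.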

Other community members may also give their feedback on this approach or suggest a better one.

Regards

JC

RameshSangili · April 7, 2026, 10:44am

Yes, I totally agree with @JayachandraSiddipeta !

@JayachandraSiddipeta - Great work on Agent instructions & Java API!

Please leverage Gen AI connect or Agent rules to extract the content from the PDF. The Doc API powered by Pega works great for all documents, even handwritten ones. Ask the Agent/Gen AI connect to respond in base64 format, and you can then use attachToPDF to create the PDF file again.
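
If the response does come back as base64, the decoding side could look roughly like this. It is a sketch using only the JDK; PdfBase64 and decodePdf are hypothetical names, and the header check is just a cheap sanity test, not full PDF validation. How the resulting bytes are attached in Pega is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: turns a base64 string into PDF bytes and sanity-checks the result.
final class PdfBase64 {
    private static final byte[] PDF_HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfBase64() {}

    /** Decodes base64 text (line breaks tolerated) and verifies that it starts with a PDF header. */
    static byte[] decodePdf(String base64) {
        // Generated output is often wrapped across lines, so strip whitespace first.
        String compact = base64.replaceAll("\\s+", "");
        byte[] bytes = Base64.getDecoder().decode(compact);
        if (bytes.length < PDF_HEADER.length) {
            throw new IllegalArgumentException("decoded content is too short to be a PDF");
        }
        for (int i = 0; i < PDF_HEADER.length; i++) {
            if (bytes[i] != PDF_HEADER[i]) {
                throw new IllegalArgumentException("decoded content does not start with a PDF header");
            }
        }
        return bytes;
    }
}
```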

TharushiK176830171 · April 8, 2026, 3:01pm

Hi @JayachandraSiddipeta ,

Thank you for the valuable insights and support! I’ll work on these.
