I have a requirement to split a PDF document in Pega based on a specific keyword. When the keyword is found on certain pages, those entire pages should be extracted and compiled into a new PDF document that keeps the original formatting. I am working on Pega Constellation version 25.1.2.
I have already attempted an approach using GenAI, where I:
Search for the keyword within the document
Extract the relevant page containing the keyword
Generate a new PDF using the Create PDF automation
While this approach works to some extent, it extracts only the text content and does not preserve the original formatting. Additionally, it does not include the full page, only the extracted section.
What I’m looking for is a way to:
Identify the pages containing the keyword
Extract those full pages (not just sections)
Generate a new PDF that preserves the original layout and formatting
Has anyone implemented a similar requirement or can suggest an approach or best practice to achieve this in Pega?
When you say GenAI, I am assuming you tried both GenAIConnect and the Agent rule.
Generally, with well-structured agent instructions and strict guardrails, you could achieve your requirement with the Agent rule. You could try generating the Agent instructions, Guardrails, and Response style & tone using ChatGPT and providing them to the agent.
@RameshSangili Your valuable suggestions are needed here for GenAIConnect and Agent rules.
I have tried a few PDF requirements with the Pega Agent rule and it works well. I will try this requirement as well and keep you posted.
If we have to look at alternatives, you might have to rely on external Java packages such as com.lowagie iText or PDFBox. I have only tried these two; there could be more. As far as I remember, both packages are already shipped as part of Pega Infinity. You can search for them under Java classes in Admin studio.
These libraries offer many methods for parsing the content of a PDF page by page, segregating the content, and generating a new PDF. You will have to write a separate function that performs this operation and generates the new PDF for you.
Thank you very much for the detailed explanation and suggestions. I will look into those options as well if the GenAI-based approach does not fully meet the requirement.
Please do keep me posted if you get a chance to try out this requirement from your end. I will also continue experimenting and share any updates or findings here.
The source PDF is created as an EForm file, opened in the activity, and its file source is passed into the function. The function checks each page for the search string, picks all the matching pages, generates a new PDF as a byte array, and returns it to the activity.
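The page-selection step of such a function can be sketched as below. This is only a minimal sketch under stated assumptions: the class and method names are hypothetical, and the PDF library calls are left out so the matching logic stands on its own. With PDFBox, the per-page text would typically come from PDFTextStripper (setStartPage, setEndPage, getText), and the matched pages would be copied into a new PDDocument, for example with importPage, before saving it to a byte array. Which PDFBox version is available depends on your Pega installation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Pega or PDFBox class.
final class KeywordPageFinder {

    /**
     * Returns the 1-based numbers of the pages whose text contains at least
     * one of the keywords (case-insensitive substring match).
     * Assumption: pageTexts.get(i) holds the text extracted from page i + 1.
     */
    static List<Integer> findMatchingPages(List<String> pageTexts, String commaSeparatedKeywords) {
        // Normalise the keywords once: trim, lower-case, drop empty entries.
        List<String> keywords = new ArrayList<>();
        for (String raw : commaSeparatedKeywords.split(",")) {
            String keyword = raw.trim().toLowerCase(Locale.ROOT);
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }

        List<Integer> matched = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            if (text == null) {
                continue; // e.g. an image-only page with no extractable text
            }
            String haystack = text.toLowerCase(Locale.ROOT);
            for (String keyword : keywords) {
                if (haystack.contains(keyword)) {
                    matched.add(i + 1); // report page numbers, not list indexes
                    break;              // one hit is enough to keep the page
                }
            }
        }
        return matched;
    }
}
```

Because whole pages are copied rather than re-rendered from text, the layout of the selected pages should carry over unchanged, which is the part the Create PDF automation could not do.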
I have done some practical and theoretical analysis to implement your requirement with GenAI and below is my take.
GenAI rules might not support your requirement exactly but can facilitate your processing of PDF only for:
Keyword suggestion
Semantic matching (instead of exact match)
GenAI is not meant for splitting and stitching PDFs based on some logic or decision.
You can use something like this for a Pega Agent:
Agent Instructions
Identify page numbers in the document where the {keyword} appears.
Return only the page numbers in ascending order.
Do not extract or modify document content.
Do not summarize or reformat text.
{keyword} can be your comma-separated search string. You can adjust the search statement to suit your needs.
Guardrails
Do NOT generate PDF
Do NOT rewrite content
Do NOT extract partial text
Only return metadata (page numbers)
Style & Tone
Deterministic
Structured JSON output:
{
"matchedPages": [2, 5, 9]
}
Once you have the page numbers, you can pass them to PDFBox, loop through the pages, and copy only the selected ones into the resulting PDF (you can skip the red highlighting mentioned in my previous reply if you don't need it). This reduces the dependency and load on the PDFBox API and keeps the code in your function lighter.
Other community members may also give feedback on this approach or suggest a better one.
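Before handing the agent's answer to PDFBox, it is probably worth validating it, since the model may return duplicates, unsorted values, or page numbers that do not exist. Below is a minimal sketch, assuming the response looks like the JSON shown above. The class and method names are hypothetical, and a small regex is used instead of a JSON library only to keep the example self-contained. It converts the 1-based page numbers into zero-based indexes, which is what PDDocument.getPage expects as far as I know.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Pega or PDFBox class.
final class MatchedPagesParser {

    // Captures the bracketed array that follows the "matchedPages" key.
    private static final Pattern ARRAY =
            Pattern.compile("\"matchedPages\"\\s*:\\s*\\[([^\\]]*)\\]");
    private static final Pattern NUMBER = Pattern.compile("-?\\d+");

    /**
     * Turns the agent response into sorted, de-duplicated, zero-based page
     * indexes. Page numbers outside 1..pageCount are silently dropped.
     */
    static List<Integer> toPageIndexes(String agentJson, int pageCount) {
        Matcher array = ARRAY.matcher(agentJson);
        if (!array.find()) {
            throw new IllegalArgumentException("No matchedPages array in agent response");
        }
        TreeSet<Integer> indexes = new TreeSet<>(); // sorts and removes duplicates
        Matcher number = NUMBER.matcher(array.group(1));
        while (number.find()) {
            int pageNumber = Integer.parseInt(number.group());
            if (pageNumber >= 1 && pageNumber <= pageCount) {
                indexes.add(pageNumber - 1); // 1-based page number to 0-based index
            }
        }
        return new ArrayList<>(indexes);
    }
}
```

For the sample response above and a 10-page document, the call toPageIndexes(json, 10) would give the indexes 1, 4 and 8, which can then drive the loop that copies pages into the new document.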
Please leverage GenAI Connect or Agent rules to extract the content from the PDF. The Doc API powered by Pega works well for all kinds of documents, even handwritten ones. Ask the Agent or GenAI Connect to respond in Base64 format, and you can then use attachToPDF to create the PDF file again.
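If you go this way, a quick sanity check on the Base64 response before attaching it may save some debugging. The sketch below is an assumption-laden example, with a hypothetical class and method name: it only decodes the string and verifies that the bytes start with the standard %PDF- file header. It says nothing about whether the original layout was preserved, and it does not call attachToPDF or any other Pega API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not a Pega class.
final class PdfBase64Check {

    private static final byte[] PDF_HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Decodes a Base64 string and returns the bytes if they look like a PDF.
     * Throws IllegalArgumentException otherwise.
     */
    static byte[] decodePdf(String base64) {
        // The MIME decoder tolerates the line breaks that long Base64 strings often contain.
        byte[] bytes = Base64.getMimeDecoder().decode(base64.trim());
        if (bytes.length < PDF_HEADER.length) {
            throw new IllegalArgumentException("Decoded content is too short to be a PDF");
        }
        for (int i = 0; i < PDF_HEADER.length; i++) {
            if (bytes[i] != PDF_HEADER[i]) {
                throw new IllegalArgumentException("Decoded content does not start with %PDF-");
            }
        }
        return bytes;
    }
}
```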