Hello everyone,
I’m looking for advice / best practices regarding a large‑scale cleanup of files stored in the Pega Repository / File Storage.
Context
During several years of platform usage, files in a specific repository folder were never deleted.
We now need to reorganize it and remove files according to defined business/technical criteria.
From my analysis, the only supported way to delete repository files is via the standard data pages (D_pxDeleteFile, etc.).
Current approach
I’m using D_pxListFiles to list the files in a given repository folder.
Since the data page returns paginated results, I:
Call it the first time with an empty marker
Then keep calling it, passing the returned marker each time, to retrieve the next page
For each page, I iterate over the files and delete those matching the removal criteria
I’ve verified that:
The marker value changes dynamically on each invocation
Therefore, it doesn’t seem feasible to first collect all markers and then process them later
The only viable approach appears to be streaming through D_pxListFiles and deleting files on the fly, roughly as in the sketch below
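For reference, the shape of my current loop as a rough Java sketch. RepositoryClient, FilePage, and FileInfo are hypothetical stand-ins for the D_pxListFiles / D_pxDeleteFile data pages, not actual Pega APIs:

```java
import java.util.List;
import java.util.function.Predicate;

public class StreamingCleanup {

    // Hypothetical stand-ins for the repository data pages; not Pega APIs.
    interface RepositoryClient {
        FilePage listFiles(String folder, String marker); // ~ D_pxListFiles
        boolean deleteFile(String path);                  // ~ D_pxDeleteFile
    }

    record FileInfo(String path, long sizeBytes, long modifiedEpochMs) {}
    record FilePage(List<FileInfo> files, String nextMarker) {}

    static void cleanup(RepositoryClient repo, String folder,
                        Predicate<FileInfo> shouldDelete) {
        String marker = "";                                // first call: empty marker
        do {
            FilePage page = repo.listFiles(folder, marker);
            for (FileInfo f : page.files()) {
                if (shouldDelete.test(f)) {
                    repo.deleteFile(f.path());             // delete on the fly
                }
            }
            // The marker is only valid for the next call, so pages cannot be
            // collected up front and processed later.
            marker = page.nextMarker();
        } while (marker != null && !marker.isEmpty());
    }
}
```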
Problem
If we assume hundreds of thousands or even ~1 million files, this approach runs for hours (or more), even when executed as a batch job.
Question
Is there a more efficient or recommended approach to handle mass deletion / cleanup of repository files in Pega?
Specifically, I’d appreciate guidance on:
Known patterns or best practices for large repository cleanups
Whether a batch / queue‑based / Job Scheduler / async approach is preferred
Any built‑in purge mechanisms, repository‑level tricks, or supported shortcuts
Things to avoid (e.g. long single activities vs chunking)
Real‑world experiences with similar volumes
Thanks in advance for any insights or recommendations.
Absolutely, the challenge is about execution. I’m already trying to use a Job Scheduler, but the executions are still too long.
In particular, the problem I’m facing is that after the first N runs, the first X elements (with X very large) have already been examined, yet each subsequent run must iterate over them again just to check whether they should be deleted. This creates a growing amount of rework as the process progresses.
I was also thinking about a two‑phase approach, separating:
File identification
File deletion
The idea would be to first scan the repository and extract the full list of files (with the relevant metadata), store that information somewhere (for example in a Data Type), and only afterward iterate over that list to perform the actual deletion.
Of course, the first phase alone could run for more than 24 hours, but it would be executed only once. Then the deletion phase would be much more controlled and resumable. Do you think storing something like ~1 million rows in a Data Type could be a reasonable and supported approach in this scenario?
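Roughly what I have in mind for phase 1, again as a Java sketch with hypothetical helper types (TrackingStore would sit on top of the Data Type; nothing here is a Pega API):

```java
import java.util.List;

public class IdentificationPhase {

    // Hypothetical stand-ins; not Pega APIs.
    interface RepositoryClient {
        FilePage listFiles(String folder, String marker); // ~ D_pxListFiles
    }

    // Hypothetical DAO writing rows into the tracking Data Type.
    interface TrackingStore {
        void insertPending(String path, long sizeBytes, long modifiedEpochMs);
    }

    record FileInfo(String path, long sizeBytes, long modifiedEpochMs) {}
    record FilePage(List<FileInfo> files, String nextMarker) {}

    static void identify(RepositoryClient repo, TrackingStore store, String folder) {
        String marker = "";
        do {
            FilePage page = repo.listFiles(folder, marker);
            for (FileInfo f : page.files()) {
                // Persist lightweight metadata only, flagged PENDING, so that
                // the deletion phase is fully resumable and auditable.
                store.insertPending(f.path(), f.sizeBytes(), f.modifiedEpochMs());
            }
            marker = page.nextMarker();
        } while (marker != null && !marker.isEmpty());
    }
}
```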
I’d be very interested in your thoughts or any alternative patterns you’ve seen working in similar cases.
For a cleanup at that scale, I would not recommend one long, single-threaded activity that streams through D_pxListFiles and deletes inline. The supported delete API is still the right mechanism, but the execution pattern should be chunked and asynchronous so you avoid holding one request open for hours and reduce the blast radius if something fails.
You could try the following:
Use D_pxListFiles only to enumerate a small batch of files at a time.
Push each file path, or each batch of file paths, into a queue processor or other async work queue.
Let the async workers call D_pxDeleteFile for the targeted files.
This approach gives you smaller transactions, retry capability, better observability, and the ability to throttle the delete rate.
I’d suggest this structure:
A scheduler starts the cleanup job.
A controller activity reads one page of file names from D_pxListFiles.
For each candidate file, push an item into a queue processor.
The queue processor deletes the file using D_pxDeleteFile.
The controller continues with the next marker and repeats until complete (see the sketch after this list).
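A rough Java sketch of that split, with QueueClient standing in for a queue processor and RepositoryClient for the repository data pages (hypothetical interfaces, not the Pega API):

```java
import java.util.List;
import java.util.function.Predicate;

public class ChunkedCleanup {

    // Hypothetical stand-ins; not Pega APIs.
    interface RepositoryClient {
        FilePage listFiles(String folder, String marker); // ~ D_pxListFiles
        boolean deleteFile(String path);                  // ~ D_pxDeleteFile
    }
    interface QueueClient {
        void enqueue(String filePath);                    // ~ Queue-For-Processing
    }

    record FileInfo(String path) {}
    record FilePage(List<FileInfo> files, String nextMarker) {}

    // Controller: enumerate one page at a time, hand candidates to the queue.
    static void controller(RepositoryClient repo, QueueClient queue,
                           String folder, Predicate<FileInfo> matches) {
        String marker = "";
        do {
            FilePage page = repo.listFiles(folder, marker);
            for (FileInfo f : page.files()) {
                if (matches.test(f)) {
                    queue.enqueue(f.path());   // small, independent work items
                }
            }
            marker = page.nextMarker();
        } while (marker != null && !marker.isEmpty());
    }

    // Worker: runs once per queued item, so each delete is its own small,
    // retryable transaction.
    static void worker(RepositoryClient repo, String filePath) {
        if (!repo.deleteFile(filePath)) {
            throw new IllegalStateException("Delete failed: " + filePath); // let the queue retry
        }
    }
}
```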
This is much closer to how Pega expects large-volume, purge-like operations to be handled: scheduled, chunked, and time-bounded rather than all at once.
There is no general-purpose “purge this repository folder by criteria” wizard equivalent to case purge/archive for repository files. Repository file cleanup is typically done through the repository APIs, with your own orchestration around them.
@VincenzoF1238
I agree you’re heading in the right direction. The first step is to identify the attachments that need to be deleted.
You can retrieve the attachment metadata and repository path from Data‑WorkAttach‑File. Persist the key identifiers (such as attachment key, work object key, repository reference, and file path) into a separate data type/table dedicated to deletion tracking.
Once the attachments are identified:
Store only the required key parameters in this data table
Create appropriate indexes to support efficient querying and batch processing
Storing even millions of records is not a concern, since the table will contain only lightweight metadata and keys.
Using this table as a control list, you can then execute a controlled, batch‑based deletion process to remove the corresponding files from the repository, ensuring proper tracking, retry handling, and auditability.
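To make the batch phase concrete, here is a minimal sketch of one scheduled run over the control table. TrackingStore and RepositoryClient are hypothetical interfaces over the Data Type and the repository data pages, not Pega APIs:

```java
import java.util.List;

public class ControlledDeletion {

    // Hypothetical stand-ins; not Pega APIs.
    interface RepositoryClient {
        boolean deleteFile(String path);                 // ~ D_pxDeleteFile
    }
    interface TrackingStore {
        List<String> claimPending(int batchSize);        // query on an indexed status column
        void markDeleted(String path);
        void markFailed(String path, String reason);
    }

    // One scheduled run processes a bounded batch, keeping each execution
    // time-boxed; the status column makes the whole process resumable.
    static void runBatch(RepositoryClient repo, TrackingStore store, int batchSize) {
        for (String path : store.claimPending(batchSize)) {
            try {
                if (repo.deleteFile(path)) {
                    store.markDeleted(path);
                } else {
                    store.markFailed(path, "delete returned false");
                }
            } catch (RuntimeException e) {
                store.markFailed(path, e.getMessage()); // row is kept for retry and audit
            }
        }
    }
}
```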