We currently have BIX-based data extraction implemented in our Pega application and are exploring the possibility of introducing Kafka for near real-time data streaming. I would like guidance on the recommended approach and best practices.
Current Architecture
We have 52 BIX extract rules, each created for a different class
Data extraction runs daily using a Job Scheduler
The Job Scheduler invokes an OOTB BIX extraction activity
Extracted data is written as files
Files are then moved to an external FTP location (Sterling) using:
File Listener
FTP Server configuration
This is currently a batch-oriented process
New Requirement
We are planning to introduce Kafka to support data streaming / near real-time integration instead of (or along with) the existing batch file-based approach.
@KiranmaiK Kafka is a good fit here because you want incremental, near real-time change events instead of once-a-day files.
In Pega, publish those change events using a Data Flow that writes to a Kafka Data Set, and send them asynchronously via a Queue Processor so online case commits stay fast and reliable.
Use your existing 52 extract definitions as the controlled list of classes to stream, follow the real-time extraction pattern (Extract + Data Flow to Kafka), and map each class to the same message structure.
Standardize topic naming and include a stable key, event type, and timestamp so downstream systems can upsert safely and handle retries without duplicates.
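To make that envelope concrete, here is a minimal sketch. The field names (`eventType`, `eventTs`, `className`, `data`) are illustrative assumptions, not a Pega or Kafka requirement — the point is that every class publishes the same shape with a stable key:

```python
import json
from datetime import datetime, timezone

def build_event(class_name, case_id, event_type, payload):
    """Wrap one change record in a standard envelope.

    All field names here are illustrative; agree on your own contract
    with downstream consumers.
    """
    return {
        "key": case_id,              # stable key: same case always lands on the same partition
        "className": class_name,     # lets consumers route/filter on a shared topic
        "eventType": event_type,     # e.g. CREATE / UPDATE / RESOLVE
        "eventTs": datetime.now(timezone.utc).isoformat(),  # for ordering and reconciliation
        "data": payload,             # the extracted property values
    }

event = build_event("MyOrg-MyApp-Work-Claim", "C-1001", "UPDATE",
                    {"pyStatusWork": "Open"})
print(json.dumps(event))
```

Publishing with the case ID as the Kafka message key keeps all events for a given case in order within one partition, which is what lets consumers upsert safely and ignore duplicate deliveries.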
Keep the nightly BIX run only for full backfill and reconciliation, not for integration, so you can recover cleanly if any consumer falls behind.
Plan capacity around peak change volume by tuning partitions and Stream Service sizing, keeping payloads small, and actively monitoring lag and retries for performance and scalability.
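A rough back-of-the-envelope sketch for that partition sizing (the per-partition throughput figure below is an assumption — benchmark your own cluster before committing to a number):

```python
import math

def partitions_needed(peak_events_per_sec, avg_payload_bytes,
                      per_partition_mb_per_sec=5.0, headroom=2.0):
    """Estimate partition count from peak change volume.

    per_partition_mb_per_sec: assumed sustainable throughput of one
    partition on your cluster (measure it; 5 MB/s is a placeholder).
    headroom: multiplier leaving room for spikes and for consumers
    catching up after downtime.
    """
    peak_mb_per_sec = peak_events_per_sec * avg_payload_bytes / 1_000_000
    return max(1, math.ceil(peak_mb_per_sec * headroom / per_partition_mb_per_sec))

# e.g. 2,000 events/s at ~3 KB each with 2x headroom
print(partitions_needed(2000, 3000))  # → 3
```

Sizing from peak rather than average volume matters because a lagging consumer must drain the backlog faster than new events arrive.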
Thank you for taking the time to provide such a detailed explanation earlier — I really appreciate it.
Since we currently have 52 BIX extracts for 52 different classes (including Work, Data and Index classes), I’d like to clarify a few architectural points before proceeding:
Topic Strategy:
For these 52 classes, would you recommend creating one Kafka topic per class, or using a shared topic and including the class name as part of a standardized event payload?
Kafka Data Set Design:
Should we create 52 Kafka Data Sets aligned to each class, or design a single generic Kafka Data Set with a common JSON structure for all classes?
Real-Time Implementation Approach:
For live streaming, would you suggest creating 52 Declare Triggers (one per class), or implementing a reusable utility activity that dynamically publishes events based on class?
Handling Index Classes:
For the index classes currently used in BIX joins, is it better to:
Stream them as separate Kafka messages, or
Embed the index data within the parent Work class payload (denormalized JSON)?
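For illustration, the two options might contrast like this (class names, property names, and the envelope fields are all hypothetical placeholders):

```python
import json

# Option A: separate messages — each index row is its own event, and
# downstream systems join it to the parent case by the shared key.
index_event = {
    "key": "C-1001",
    "className": "MyOrg-Index-PartyList",   # hypothetical index class
    "data": {"partyRole": "Customer", "partyId": "P-77"},
}

# Option B: denormalized — index rows are embedded in the parent Work
# payload, so each message is self-contained and no downstream join is needed.
work_event = {
    "key": "C-1001",
    "className": "MyOrg-MyApp-Work-Claim",
    "data": {
        "pyStatusWork": "Open",
        "parties": [                         # embedded index rows
            {"partyRole": "Customer", "partyId": "P-77"},
        ],
    },
}
print(json.dumps(work_event))
```

Option A keeps messages small but pushes join logic onto every consumer; Option B makes messages larger but lets consumers upsert a complete record from a single event.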
I am very new to Kafka and still getting familiar with these rules, and the BIX functionality we currently have in the application is quite extensive, so any guidance on these points would be appreciated.