Best Practice to Move from BIX Batch Extracts to Kafka-Based Streaming in Pega

We currently have BIX-based data extraction implemented in our Pega application and are exploring the possibility of introducing Kafka for near real-time data streaming. I would like guidance on the recommended approach and best practices.

Current Architecture

  • We have 52 BIX extract rules, each created for a different class

  • Data extraction runs daily using a Job Scheduler

  • The Job Scheduler invokes an OOTB BIX extraction activity

  • Extracted data is written as files

  • Files are then moved to an external FTP location (Sterling) using:

    • File Listener

    • FTP Server configuration

  • This is currently a batch-oriented process

New Requirement

We are planning to introduce Kafka to support data streaming / near real-time integration instead of (or along with) the existing batch file-based approach.

Additional questions:

  • Is Kafka-based streaming helpful for my use case?

  • What is the recommended Pega approach to publish data to Kafka?

    • Using Kafka Connect / Pega Stream Service

    • Using Queue Processors

    • Using Data Flows

    • Using Real-time event publishing (case processing hooks)

  • Should we:

    • Replace the existing BIX extracts completely?

    • Or keep BIX for batch use cases and introduce Kafka separately for streaming?

  • How can we efficiently handle multiple classes (52 classes) when publishing data to Kafka?

  • Are there any OOTB integrations or best practices for Kafka in recent Pega versions?

  • What are the performance and scalability considerations when moving from file-based BIX extracts to Kafka?

@KiranmaiK Kafka is a good fit here because you want incremental, near real-time change events instead of once-a-day files.
In Pega, publish those change events using a Data Flow that writes to a Kafka Data Set, and send them asynchronously via a Queue Processor so online case commits stay fast and reliable.
Use your existing 52 extract definitions as the controlled list of classes to stream, following the real-time extraction pattern (Extract + Data Flow to Kafka) and map each class to the same message structure.
Standardize topic naming and include a stable key, event type, and timestamp so downstream systems can upsert safely and handle retries without duplicates.
Keep the nightly BIX run only for full backfill and reconciliation, not for integration, so you can recover cleanly if any consumer falls behind.
Plan capacity around peak change volume by tuning partitions and Stream Service sizing, keeping payloads small, and actively monitoring lag and retries for performance and scalability.
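To make the message contract above concrete, here is a minimal sketch outside Pega of what a standardized envelope with a stable key, event type, and timestamp could look like, plus the consumer-side idempotent upsert it enables. The topic name and all field names (`eventType`, `eventTs`, `sourceClass`, etc.) are illustrative assumptions, not Pega-defined names:

```python
import json
from datetime import datetime, timezone

# Example topic naming convention (assumption, not a Pega default)
TOPIC = "pega.case-events.v1"

def build_event(source_class, case_id, event_type, payload):
    """Wrap class-specific data in a shared envelope so all 52 classes
    can publish to the same schema. Field names are illustrative."""
    return {
        "key": f"{source_class}:{case_id}",   # stable key -> consistent partitioning
        "eventType": event_type,              # e.g. CREATE / UPDATE / RESOLVE
        "eventTs": datetime.now(timezone.utc).isoformat(),
        "sourceClass": source_class,          # lets consumers route per class
        "payload": payload,                   # class-specific properties
    }

def upsert(store, event):
    """Consumer-side idempotent upsert: keep only the newest event per key,
    so redeliveries and retries do not create duplicates."""
    current = store.get(event["key"])
    if current is None or event["eventTs"] >= current["eventTs"]:
        store[event["key"]] = event
    return store

store = {}
e1 = build_event("MyCo-Work-Claim", "C-1001", "CREATE", {"status": "Open"})
upsert(store, e1)
upsert(store, e1)          # redelivery of the same event is harmless
print(len(store))          # → 1 (one record per stable key)
print(json.dumps(e1["key"]))
```

The stable key doubles as the Kafka message key, so all events for one case land on the same partition and arrive in order; the timestamp comparison is what makes at-least-once delivery safe for downstream systems.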

Hi @Sairohith

Thank you for taking the time to provide such a detailed explanation earlier — I really appreciate it.

Since we currently have 52 BIX extracts for 52 different classes (including Work, Data and Index classes), I’d like to clarify a few architectural points before proceeding:

  1. Topic Strategy:
    For these 52 classes, would you recommend creating one Kafka topic per class, or using a shared topic and including the class name as part of a standardized event payload?

  2. Kafka Data Set Design:
    Should we create 52 Kafka Data Sets aligned to each class, or design a single generic Kafka Data Set with a common JSON structure for all classes?

  3. Real-Time Implementation Approach:
    For live streaming, would you suggest creating 52 Declare Triggers (one per class), or implementing a reusable utility activity that dynamically publishes events based on class?

  4. Handling Index Classes:
    For the index classes currently used in BIX joins, is it better to:

    • Stream them as separate Kafka messages, or

    • Embed the index data within the parent Work class payload (denormalized JSON)?

I am very new to Kafka and still trying to understand all of these rules, and the BIX functionality we currently have in the application is huge.