@Ratan
Yes, the data is handed over to the auto-generated data flow.
The data flow runs with the support of a stream node; without a stream node, the data flow will not work.
The queue processor (running on a background node) gets its data from the data flow (running with the support of the stream nodes).
More stream nodes scale up the stream tier, giving higher throughput because messages are spread (partitioned) across the nodes.
More background nodes support both horizontal and vertical scaling, giving higher throughput of queue processing.
The number of threads configured for a given queue processor should always match the number of partitions.
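To see why threads should match partitions, here is a toy Python sketch (not Pega's or Kafka's actual code) of the rule that each partition is owned by exactly one consumer thread in a group. Extra threads beyond the partition count get nothing and sit idle; too few threads means some threads own multiple partitions and become a bottleneck.

```python
def assign_partitions(num_partitions, num_threads):
    """Toy model of Kafka's assignment rule: each partition goes to
    exactly one consumer thread; a thread may own several partitions."""
    assignment = {t: [] for t in range(num_threads)}
    for p in range(num_partitions):
        assignment[p % num_threads].append(p)
    return assignment

# 5 partitions, 5 threads: one partition each, fully parallel
print(assign_partitions(5, 5))  # {0: [0], 1: [1], 2: [2], 3: [3], 4: [4]}

# 5 partitions, 7 threads: threads 5 and 6 own nothing and sit idle
print(assign_partitions(5, 7))

# 5 partitions, 3 threads: threads 0 and 1 each own two partitions
print(assign_partitions(5, 3))  # {0: [0, 3], 1: [1, 4], 2: [2]}
```

This is why over-provisioning threads buys nothing once the thread count exceeds the partition count.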
For example: you have a queue processor named Ratan configured to run on 5 threads on each background node, with 2 stream nodes (default partition count of 5) and 4 background nodes configured.
Data is replicated across both stream nodes.
Let's assume each record in the queue takes 1 minute to process.
With this configuration, 20 records can be processed per minute (4 background nodes, 5 threads each, 5*4=20), i.e. 20 parallel data flow runs per minute, speeding up queue execution.
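The throughput arithmetic above can be written out as a quick check (the node and thread counts are the example's assumptions, not fixed Pega defaults):

```python
# Assumed example configuration from the scenario above
background_nodes = 4
threads_per_node = 5          # threads per queue processor per node
seconds_per_record = 60       # each record takes 1 minute to process

# Total records being processed in parallel at any instant
parallel_runs = background_nodes * threads_per_node

# Records completed per minute given 1 record/minute/thread
records_per_minute = parallel_runs * (60 // seconds_per_record)

print(parallel_runs, records_per_minute)  # 20 20
```

Halving the processing time per record (or doubling the background nodes, with enough partitions) would double the per-minute throughput.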
Kafka handles the queue balancing, message replication, dequeuing messages once they are processed, refreshing with the latest messages, and rebalancing.
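The "dequeue once processed" behavior is offset-commit semantics: a message only leaves the queue after the handler succeeds, otherwise it is redelivered. A minimal simulation in plain Python (a toy model, not the real Kafka consumer API):

```python
from collections import deque

def process_queue(messages, handler):
    """Toy model of at-least-once consumption: a message is removed
    from the partition only after the handler succeeds; a failure
    leaves the offset uncommitted so the message is redelivered."""
    queue = deque(messages)
    done = []
    while queue:
        msg = queue[0]
        try:
            handler(msg)
        except Exception:
            continue          # not committed; same message retried
        queue.popleft()       # commit: message is now consumed
        done.append(msg)
    return done
```

A handler that fails transiently simply sees the same message again on the next poll, which is exactly why queue processor activities need to be idempotent.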
The value of external Kafka lies in data security, administration, partitioning, sizing, data resiliency, licensing, and staying current with the latest upgrades, none of which embedded Kafka fully provides. When messages are stuck, unprocessed, or lost, or ZooKeeper needs a restart to rebalance messages, these and many more administrative tasks are simply not doable with embedded Kafka.
For example, if a client has an AWS Kafka enterprise license with the full feature set, why would that client want embedded Kafka, with limited privileges, little or no administrative access, and no control? In the cloud, a pod and its data are not guaranteed. Cloud is pay per use: if the stream nodes are not active and usage falls below the normal thresholds, Kubernetes will take the pods down, and the entire Kafka data directory inside embedded Kafka will be deleted if no persistent volume is attached.
Let's assume your production runs in Cluster 1 in the North America Boston data center, which goes down because of a cyclone. For business continuity you spin up a cluster in the North America Washington data center. The data physically stored on the Cluster 1 embedded node cannot be replicated to Cluster 2, because you have lost the connection.
External Kafka addresses all of this: it performs the replication, rebalancing, dequeue, and enqueue between the active and inactive clusters, so the client's data stays safe, resilient, and ready for business continuity.