I am running a dataflow sourced by a report definition, with a defined partition key.
Let’s assume the partition key can have x possible distinct values, and the dataflow will run on y nodes.
My questions are the following:
In the options configuration panel, before launching the dataflow execution, what is the most appropriate number of requestors to set in order to maximise throughput?
Is it correct to assume that (number of requestors per node) × (number of nodes) must be equal to or slightly greater than the number of possible distinct values of the partition key?
@Phil5873 To maximize throughput when running a dataflow sourced by a report definition with a partition key, set the number of requestors based on the number of nodes and the capacity of each node. The goal is to have all partitions processed in parallel without overloading the system.

Ideally, the total number of requestors across all nodes should be equal to or slightly greater than the number of distinct values of the partition key. For example, with 20 distinct values and 3 nodes that each have capacity for 4 requestors, setting 4 requestors per node (3 nodes × 4 requestors = 12 total) uses the available capacity efficiently; each requestor then processes one or two partitions in sequence.

If the number of distinct values exceeds the total number of requestors, aim for a higher requestor count (capacity permitting) to avoid bottlenecks. Conversely, if there are fewer partition values than requestors, adjust the count downward so that idle requestors do not consume resources unnecessarily. Balancing the total number of requestors against the available nodes and partition key values ensures optimal performance: full parallel processing without overwhelming system resources.
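The sizing arithmetic described above can be sketched in a few lines. This is a minimal illustration, not a Pega API; `requestors_per_node` is a hypothetical helper, and the node capacity cap is an assumption based on the example figures in the answer:

```python
import math

def requestors_per_node(distinct_partition_values: int, nodes: int,
                        max_requestors_per_node: int) -> int:
    """Suggest a per-node requestor count so that the total number of
    requestors covers the distinct partition values, capped by each
    node's capacity. Hypothetical helper for illustration only."""
    # Requestors needed per node if every partition got its own requestor.
    needed = math.ceil(distinct_partition_values / nodes)
    # Never exceed what a single node can actually run.
    return min(needed, max_requestors_per_node)

# Example from the answer: 20 distinct values, 3 nodes, capacity 4/node.
# ceil(20 / 3) = 7, capped at 4 -> 4 requestors per node (12 total),
# so each requestor processes one or two partitions.
print(requestors_per_node(20, 3, 4))
```

With fewer partitions than total capacity (say 6 distinct values on the same 3 nodes), the helper scales the count down to 2 per node, matching the advice to avoid idle requestors.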
So it’s correct to assume that the point is to balance, as closely as possible, the total number of available requestors against the number of distinct values identified by the partition key.