Dataflow Studio is deployed on multiple nodes for reliability and scalability, but it is up to the flow designer (the user creating the flow) to ensure that this scalability and optimization are realized when a flow executes.
How the first processor in a process flow should be scheduled to execute depends on how it accesses data.
| Execution Setting | When to Use | Distribution Setting |
|---|---|---|
| Primary Node | When each execution of the processor could result in reading the same record from the source, thereby resulting in duplicate processing. | Set connections to the "Round robin" load balance strategy |
| All Nodes | When each execution of the processor is guaranteed to read distinct data from the source. | |
Primary Node should be used in cases where the processor may read the same record or data from the source system each time it executes (e.g., running a database query or retrieving record sets from a filesystem). If these processors are set to run on all nodes, every node may retrieve the same records, so duplicates can be retrieved and processed.
All Nodes should be used in cases where the processor is guaranteed to read distinct data or records from the source system, posing no risk of duplicate records or processing.
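The difference between the two settings can be illustrated with a small simulation. This is only a sketch in plain Python, not Dataflow Studio code: the node names, the `run_query` stand-in, and the `ingest` helper are all hypothetical. It assumes a source that returns the same rows on every read, which is the situation described for Primary Node.

```python
from collections import Counter

# Hypothetical source: a query that returns the same rows every time it runs.
SOURCE_ROWS = ["r1", "r2", "r3"]
# Hypothetical cluster of three nodes.
NODES = ["node-1", "node-2", "node-3"]


def run_query():
    """Stand-in for a processor that re-reads the same data on each execution."""
    return list(SOURCE_ROWS)


def ingest(executing_nodes):
    """Collect every record read when the processor runs once on each given node."""
    records = []
    for _node in executing_nodes:
        records.extend(run_query())
    return records


# Execution on all nodes: every node reads every row.
all_nodes = Counter(ingest(NODES))
# Execution on the primary node only: each row is read once.
primary_only = Counter(ingest(NODES[:1]))

print("all nodes:   ", dict(all_nodes))
print("primary only:", dict(primary_only))
```

In this sketch each row is counted three times when the processor runs on all nodes and once when it runs on a single node, which is the duplicate-processing risk that the Primary Node setting avoids.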
To adjust the Execution setting on a processor, open the Configure Processor dialog. On the Scheduling tab, set the Execution value to Primary Node or All Nodes as appropriate, then apply the change.
When a processor runs on the primary node only, the connections that follow it should be set to distribute FlowFiles to all nodes so that the downstream load is spread across the cluster.
To adjust the distribution, open the Configure Connection dialog. Select the Settings tab and set the Load Balance Strategy to "Round robin". Do this for every connection that follows the processor set to run on the primary node.
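As a rough illustration of what a round-robin strategy does in general, the sketch below hands out items to nodes in turn. It is a generic Python example, not a description of how Dataflow Studio implements load balancing; the `round_robin` function, node names, and FlowFile labels are hypothetical.

```python
from itertools import cycle


def round_robin(flowfiles, nodes):
    """Assign each item to the next node in turn, wrapping around at the end."""
    assignment = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment


# Hypothetical cluster and a batch of FlowFiles produced on the primary node.
nodes = ["node-1", "node-2", "node-3"]
flowfiles = [f"flowfile-{i}" for i in range(7)]

assignment = round_robin(flowfiles, nodes)
for node, items in assignment.items():
    print(node, items)
```

With seven items and three nodes, no node in this sketch receives more than one item beyond any other, which is the even spread the "Round robin" setting is meant to provide for downstream processors.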