Dataflow Studio is deployed on three nodes for reliability and scalability, but it is up to the flow designer or the user creating the flow to ensure that the flow actually takes advantage of that scalability when it executes.
The execution scheduling of the first processor in a process flow depends on how that processor accesses data.
Data Access Scenario | Execution Setting | Distribution Setting |
Processor retrieves the same data each time it executes | Primary Node | Set connections to "Round robin" load balance strategy |
Processor receives distinct records each time it executes | All Nodes | |
If a processor retrieves the same data each time it executes, schedule it to execute on the primary node only. If it is allowed to run on multiple nodes simultaneously, each node retrieves the same records, so duplicates could be retrieved and processed. Examples include running a query on a database or retrieving a set of records from a file store such as an S3 bucket.
If the processor receives distinct records each time it executes, it can be scheduled to execute on all nodes.
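The difference between the two scenarios can be illustrated with a small simulation. This is only a sketch in plain Python and does not call any Dataflow Studio API: the node names, `fetch_same_data`, and `run_source` are hypothetical stand-ins for a three-node cluster and a source processor that returns the same rows on every execution.

```python
from collections import Counter

# Hypothetical names for the three nodes in the cluster.
NODES = ["node-1", "node-2", "node-3"]


def fetch_same_data():
    """Stand-in for a source that returns the same rows on every run,
    such as a database query."""
    return ["row-a", "row-b", "row-c"]


def run_source(execution_nodes):
    """Run the source once on each given node and collect all fetched rows."""
    fetched = []
    for _node in execution_nodes:
        fetched.extend(fetch_same_data())
    return fetched


# Execution set to all nodes: every node fetches the same rows.
all_nodes = Counter(run_source(NODES))

# Execution set to the primary node only: the rows are fetched once.
primary_only = Counter(run_source(NODES[:1]))

print("All nodes:   ", dict(all_nodes))
print("Primary node:", dict(primary_only))
```

In this sketch each row is counted three times when the source runs on all nodes and once when it runs on a single node, which is the duplication that the primary node setting is meant to avoid.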
To adjust where a processor executes, open the Configure Processor dialog. On the Scheduling tab, set the Execution value to match the data access scenario and apply the change.
When a processor runs on the primary node only, the connections that follow it should be set to distribute FlowFiles to all nodes so that the load is spread across the cluster.
To adjust the distribution, open the Configure Connection dialog. Select the Settings tab and set the Load Balance Strategy to "Round robin." This should be done for all connections following the processor set to primary node.
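As a rough illustration of what a round robin strategy does with the output of a primary node processor, the sketch below hands out items to nodes in turn. It is plain Python under stated assumptions, not the product's implementation: `round_robin`, the node names, and the FlowFile labels are all hypothetical.

```python
from itertools import cycle


def round_robin(flowfiles, nodes):
    """Assign each item to the next node in turn, wrapping around at the end."""
    assignment = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment


# Hypothetical FlowFiles produced by a processor running on the primary node.
flowfiles = [f"flowfile-{i}" for i in range(1, 8)]
nodes = ["node-1", "node-2", "node-3"]

spread = round_robin(flowfiles, nodes)
for node, items in spread.items():
    print(node, items)
```

With seven items and three nodes, the first node receives three items and the other two receive two each, so no single node carries the whole load.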