
Monday, 30 July 2012

Partitioning and Collecting Data in DataStage

Partitioning of data is what makes parallel processing possible in our jobs, which makes it one of the most important topics to study.

The simplest scenarios do not require delving into the nitty-gritty of the subject, but more complex ones will.

The different partitioning methods available in DataStage are as follows:

For each partitioning method, the way it distributes records, the nature of the resulting partitions, and some important points are listed below.

1. Round Robin
   Method of distribution: The first record goes to the first partition, the second record to the second partition, and so on, wrapping back to the first partition.
   Nature of resulting partitions: Approximately equal-sized.
   Important points: This is the method InfoSphere DataStage normally uses when it initially partitions data. (A conceptual sketch of this and the other keyless methods appears after this list.)

2. Random
   Method of distribution: Records are distributed randomly across the partitions.
   Nature of resulting partitions: Approximately equal-sized.
   Important points: Slightly higher overhead than round robin because of the extra processing required to calculate a random value for each record. Because the assignment is random, an exactly even load is not guaranteed, but for a large enough number of rows the distribution will be close to 1/N rows per node.

3. Same
   Method of distribution: No repartitioning; records stay on the same processing node and are not redistributed.
   Nature of resulting partitions: Not applicable.
   Important points: Fastest. This is normally the method InfoSphere DataStage uses when passing data between stages in your job.

4. Entire
   Method of distribution: Every instance of a stage on every processing node receives the complete data set as input.
   Nature of resulting partitions: Every partition consists of the complete data set.
   Important points: Used when every instance of the operator needs access to the entire input data set, a typical example being lookup reference data.

5. Hash
   Method of distribution: Partitioning is based on a function of one or more columns (the hash partitioning keys) in each record, so records with the same key values always land in the same partition. (See the hash sketch after this list.)
   Nature of resulting partitions: Can be skewed, depending on the distribution of the key values.
   Important points: The data type of a partitioning key can be any data type except raw, subrecord, tagged aggregate, or vector. By default, the hash partitioner does case-sensitive comparison, which means that uppercase strings appear before lowercase strings in a partitioned data set. You can override this default if you want to perform case-insensitive partitioning on string fields.

6. Modulus
   Method of distribution: partition_number = fieldname mod number_of_partitions, where fieldname must be a numeric data field. (See the modulus sketch after this list.)
   Nature of resulting partitions: Can be skewed, depending on the values of the key field.
   Important points: Only one column can be specified as the key, as opposed to the other keyed partitioning methods.

7. Range
   Method of distribution: Divides a data set into approximately equal-sized partitions based on one or more partitioning keys. A range map is required, which can be created with the Write Range Map stage; that stage uses a probabilistic splitting technique to range partition a data set. (See the range sketch after this list.)
   Nature of resulting partitions: Approximately equal-sized.
   Important points: The only keyed partitioning method that creates approximately equal-sized partitions.

8. DB2
   Method of distribution: Partitions an input data set in the same way that DB2 would partition it.
   Nature of resulting partitions: Not applicable.

9. Auto
   Method of distribution: InfoSphere DataStage determines the best partitioning method to use, depending on the type of stage and what the previous stage in the job has done.
   Nature of resulting partitions: Usually round robin or same, so approximately equal-sized.
   Important points: Typically InfoSphere DataStage uses round robin when initially partitioning data, and same for the intermediate stages of a job.
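To make the keyless methods above concrete, here is a minimal Python sketch of round robin, random, and entire partitioning. It is purely illustrative: the function names and the idea of representing the data set as a simple Python list are my own assumptions, not DataStage internals.

```python
import random


def partition_round_robin(records, num_partitions):
    """Deal records out cyclically: record i goes to partition i mod N."""
    partitions = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        partitions[i % num_partitions].append(rec)
    return partitions


def partition_random(records, num_partitions, seed=None):
    """Send each record to a randomly chosen partition; close to 1/N rows
    per node for a large enough number of rows, but not guaranteed."""
    rng = random.Random(seed)
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[rng.randrange(num_partitions)].append(rec)
    return partitions


def partition_entire(records, num_partitions):
    """Every partition receives a full copy of the complete data set."""
    return [list(records) for _ in range(num_partitions)]
```

For example, round robin of 10 records across 4 partitions yields partitions of 3, 3, 2 and 2 records, while entire gives 4 partitions of 10 records each.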
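A similar sketch for hash partitioning on one or more key columns, assuming the records are dict-like. The choice of zlib.crc32 as the hash function and the case_sensitive flag are illustrative assumptions, not the engine's actual implementation; the point is only that equal key values always map to the same partition.

```python
import zlib


def partition_hash(records, num_partitions, keys, case_sensitive=True):
    """Route each record by a hash of its key column values."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        parts = []
        for k in keys:
            v = rec[k]
            if isinstance(v, str) and not case_sensitive:
                v = v.upper()          # fold case before hashing
            parts.append(str(v))
        key_bytes = "|".join(parts).encode("utf-8")
        p = zlib.crc32(key_bytes) % num_partitions
        partitions[p].append(rec)      # same key values -> same partition
    return partitions
```

Because every record with the same key lands in the same partition, the resulting partition sizes follow the key distribution, which is exactly why hash partitioning can be skewed.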
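Modulus partitioning is simply the formula quoted in the list applied to a single numeric key column. A minimal sketch, with a hypothetical key column name passed in by the caller:

```python
def partition_modulus(records, num_partitions, key):
    """partition_number = key value mod number_of_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[rec[key] % num_partitions].append(rec)
    return partitions
```

With 4 partitions, a record whose key value is 64 goes to partition 0, because 64 mod 4 = 0.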
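Range partitioning needs a range map before it can route anything. The sketch below builds boundary values from a random sample of the key, which only loosely mimics the probabilistic splitting idea behind the Write Range Map stage; the sampling approach, the sample_size default, and the function names are assumptions for illustration, not how the stage actually works.

```python
import bisect
import random


def build_range_map(records, num_partitions, key, sample_size=1000, seed=0):
    """Return N-1 boundary key values that split a sorted sample of the
    key into N roughly equal-sized buckets."""
    rng = random.Random(seed)
    sample = sorted(rec[key] for rec in
                    rng.sample(records, min(sample_size, len(records))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def partition_range(records, num_partitions, key, range_map):
    """Assign each record to the bucket its key value falls into."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        p = bisect.bisect_right(range_map, rec[key])
        partitions[p].append(rec)
    return partitions
```

Because the boundaries are chosen so that each bucket holds about the same share of the sampled values, the resulting partitions come out approximately equal-sized even when the key values themselves are unevenly distributed.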

We also need to take care of certain stage-specific properties to handle partitioning effectively; a description of these is given here.


DataStage uses different icons to indicate the kind of partitioning happening inside the stages. I have a detailed explanation of these icons here; it will help us understand the partitioning techniques used in our jobs.