Partitioning of data helps to implement parallel processing
in our jobs. This makes it one of the most important topics of study.
Simplest
scenarios do not require delving into the nitty-gritty of the subject, but
others will.
The
different partitioning methods available in datastage are as follows:
S.No.
|
Partition
|
Method of Distribution
|
Nature of resulting
Partitions
|
Important points
|
||
|
Round Robin
|
1st record
to 1st partition and so on
|
approximately equal-sized
|
This method is the one
normally used when InfoSphere DataStage initially partitions data.
|
||
|
Random
|
Random distribution
|
approximately
equal-sized
|
Slightly higher overhead than round robin because of the extra
processing required to calculate a random value for each record.
|
||
|
Same
|
no repartitioning ,
records stay on the same processing node; that is, they are not redistributed
|
Not Applicable
|
Fastest. This is
normally the method InfoSphere® DataStage® uses when passing data between
stages in your job.
|
||
|
Entire
|
Every instance of a
stage on every processing node receives the complete data set as input.
|
Every partition
consists of the complete dataset.
|
Used when every
instance of the operator needs access to the entire input data set.
|
||
|
Hash
|
Partitioning is based
on a function of one or more columns (the hash partitioning keys) in each
record.
|
Can be skewed
depending on the data distribution.
|
The data type of a
partitioning key might be any data type except raw, subrecord, tagged
aggregate, or vector. By default, the hash partitioner does case-sensitive
comparison. This means that uppercase strings appear before lowercase strings
in a partitioned data set. You can override this default if you want to
perform case insensitive partitioning on string fields.
|
||
|
Modulus
|
partition_number =
fieldname mod number_of_partitions , as is obvious fieldname is a numeric
data field.
|
Can be skewed
depending on the value of the key field.
|
Only one column can be
specified as key as opposed to other keyed partitions.
|
||
|
Range
|
divides a data set into
approximately equal size partitions based on one or more partitioning keys.
In order to use a
range partitioner, we need a range map. This can be done using the Write
Range Map stage.
|
approximately
equal-sized.
|
The only keyed
partitioning method that created equal-sized partitions.
[The write range map
stage uses a probabilistic splitting technique to range partition a data set.]
|
||
|
DB2
|
Partition an input data set in the
same way that DB2® would partition it.
|
Not Applicable.
|
|||
|
Auto
|
We leave it to InfoSphere
DataStage to determine the best partitioning method to use depending on the
type of stage, and what the previous stage in the job has done.
|
Usually round-robin /
same so equi-sized.
|
Typically InfoSphere
DataStage would use round robin when initially partitioning data, and same
for the intermediate stages of a job.
|
We
need to take care of some of the specific stage properties to handle
partitioning effectively.