Tuesday, May 25, 2010

DataStage PX questions,faq's,tutorial

DataStage PX questions


What is difference between server jobs & parallel jobs?

--- Server generates DataStage BASIC, parallel generates Orchestrate shell script (osh) and C++, mainframe generates COBOL and JCL.

--- In server and mainframe you tend to do most of the work in Transformer stage. In parallel you tend to use specific stage types for specific tasks (and the Transformer stage doesn't do lookups). There are many more stage types for parallel than server or mainframe, and parallel stages correspond to Orchestrate operators.

--- Also there's the automatic partitioning and collection of data in the parallel environment, which would have to be managed manually (if at all) in the server environment.

What is orchestrate?

--- Orchestrate is the old name of the underlying parallel execution engine. Ascential re-named the technology "Parallel Extender".

DataStage PX GUI generates OSH (Orchestrate Shell) scripts for the jobs you run.

An OSH script is a quoted string which specifies the operators and connections of a single Orchestrate step. In its simplest form, it is:

Osh “op < in.ds > out.ds”.

Where op – Orchestrate operator.

In.ds - input dataset.

Out.ds – Output dataset.

Orchestrate Vs DataStage Parallel Extender?

What are the types of Parallelism?

--- There are 2 types of Parallel Processing. They are:

Pipeline Parallelism –
It is the ability for a downstream stage to begin processing a row as soon as an upstream stage has finished processing that row (rather than processing one row completely through the job before beginning the next row). In Parallel jobs, it is managed automatically.

For example, consider a job(srcà Tranformer à Tgt) running on a system having three processors:

--- The source stage starts running on one processor, reads the data from the source and starts filling a pipeline with the read data.

--- Simultaneously, the Transformer stage starts running on another processor, processes the data in the pipeline and starts filling another pipeline.

--- At the same time, the target stage starts running on another processor, writes data to the target as soon as the data is available.

Partitioning Parallelism –

Partitioning parallelism means that entire record set is partitioned into small sets and processed on different nodes. That is, several processors can run the same job simultaneously, each handling a separate subset of the total data.

For example if there are 100 records, then if there are 4 logical nodes then each node would process 25 records each. This enhances the speed at which loading takes place.

Is Pipeline parallelism in PX is same what Inter-processes does in Server?

YES. IPC stage is a stage, which helps one passive stage read data from another as soon as data is available. In other words, stages do not have to wait for the entire set of records to be read first and then transferred to the next stage. It means as soon as the data is available between stages( in pipes or links), it can be exchanged between them without waiting for the entire record set to be read.

Note:- Link partitioner and Link collector stages can be used to achieve a certain degree of partitioning parallelism.


What are partitioning methods available in PX?

The Partitioning methods available in PX are:

--- It chooses the best partitioning method depending on:

The mode of execution of the current stage and the preceding stage.
The number of nodes available in the configuration file.
2. Round robin:

--- Here, the first record goes to the first processing node, the second to the second processing node, and so on. This method is useful for resizing partitions of an input dataset that are not equal in size to approximately equal-sized partitions.

Data Stage uses ‘Round robin’ when it partitions the data initially.

--- It implements the Partitioning method same as the one used by the preceding stage. The records stay on the same processing node; that is, data is not redistributed or repartitioned. Same is considered as the fastest Partitioning method.

Data Stage uses ‘Same’ when passing data between stages in a job.

--- It distributes the records randomly across all processing nodes and guarantees that each processing node receives approximately equal-sized partitions.

--- It distributes the complete dataset as input to every instance of a stage on every processing node. It is mostly used with stages that create lookup tables for their input.

--- It distributes all the records with identical key values to the same processing node so as to ensure that related records are in the same partition. This does not necessarily mean that the partitions will be equal in size.

--- When Hash Partitioning, hashing keys that create a large number of partitions should be selected.

Reason: For example, if you hash partition a dataset based on a zip code field, where a large percentage of records are from one or two zip codes, it can lead to bottlenecks because some nodes are required to process more records than other nodes.

--- Partitioning is based on a key column modulo the number of partitions. The modulus partitioner assigns each record of an input dataset to a partition of its output dataset as determined by a specified key field in the input dataset.

--- It divides a dataset into approximately equal-sized partitions, each of which contains records with key columns within a specific range. It guarantees that all records with same partitioning key values are assigned to the same partition.

Note: In order to use a Range partitioner, a range map has to be made using the ‘Write range map’ stage.

--- Partitions an input dataset in the same way that DB2 would partition it.

For example, if this method is used to partition an input dataset containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node.

What is Re-Partitioning? When actually re-partition will occur?

--- Re-Partitioning is the rearranging of data among the partitions. In a job, the Parallel-to-Parallel flow results in Re-Partitioning.

For example, consider the EMP data that is initially processed based on SAL, but now you want to process the data grouped by DEPTNO. Then you will need to Repartition to ensure that all the employees falling under the same DEPTNO are in the same group.

What are IConv () and Oconv () functions and where are they used?
Can we use these functions in PX?

--- ‘Iconv()’ converts a string to an internal storage format.

Syntax:- Iconv(string,code[@VM code]…)

string evaluates to the string to be converted.
Code indicates the conversion code which specifies how the data needs to be formatted for output or internal storage. Like,
MCA – Extracts alphabetic characters from a field.

MCN – Extracts numeric characters from a field.

MCL - Converts Uppercase Letters to Lowercase….

--- ‘Oconv()’ converts an expression to an output format.

Syntax:- Oconv(expression, conversion[@VM conversion]…)

Expression is a string stored in internal format that needs to be converted to output format.
Code indicates the conversion code which specifies how the string needs to be formatted.
MCA – Extracts alphabetic characters from a field.

MCN – Extracts numeric characters from a field.

MCL - Converts Uppercase Letters to Lowercase….

These functions can’t be used directly in PX. The only stage which allows the usage

of Iconv() and Oconv() in PX is ‘Basic Transformer’ stage. It gives access to the

functions supported by the DataStage Server Engine.

Note:- Basic Transformer can be used only on SMP systems but not on MPP or

Cluster syntems.

What does a Configuration File in parallel extender consist of?

The Configuration File consists of all the processing nodes of a parallel system.
It can be defined and edited using the DataStage Manager. It describes every processing node that DataStage uses to run an application. When you run a job, DataStage first reads the Configuration File to determine the available nodes.
When a system is modified by adding or removing processing nodes or by reconfiguring nodes, the DataStage jobs need not be altered or even recompiled. Instead, editing the Configuration file alone will suffice.
The configuration file also gives control over parallelization of a job during development cycle. For example, by editing the Configuration file, first a job can be run on a single processing node, then on two nodes, then four, and so on.

What is difference between file set and data set?

Dataset:- Datasets are operating system files, each referred to by a control file, which has the suffix .ds. PX jobs use datasets to manage data within a job. The data in the Datasets is stored in internal format.

A Dataset consists of two parts:

Descriptor file: It contains metadata and data location.
Data file: It contains the data.
--The Dataset Stage is used to read data from or write data to a dataset. It allows you to store data in persistent form, which can then be used by other jobs.

Fileset:- DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file with extension, .fs. The data files and the file that lists them is called a ‘Fileset’.

A fileset consists of two parts:

Descriptor file: It contains location of raw data files and the meta data.
Individual raw Data files.
-- The Fileset Stage is used to read data from or write data to a fileset.

Lookup Stage :Is it Persistent or non-persistent? (What is happening behind the scene)

--- Lookup stage is non-persistent.

How can we maintain the partitioning in Sort stage?

-- Partitioning in Sort stage can be maintained using the Partitioning method, ‘Same’.

For example, assume you sort a dataset on a system with four processing nodes and store the results to a Dataset stage. The dataset will therefore have four partitions. You then use that dataset as input to a stage executing on a different number of nodes. DataStage automatically repartitions the dataset to spread it out to all the processing nodes. This destroys the sort order of the data.

This can be avoided by specifying the Same Partitioning method so that the original partitions are preserved.

Where we need partitioning (In processing or some where)

--- Partitioning is needed in processing. It means we need Partitioning where we have huge volumes of data to process.

If we use SAME partitioning in the first stage which partitioning method it will take?

--- Data Stage uses ‘Round robin’ when it partitions the data initially.

What is the symbol we will get when we are using round robin partitioning method?


Given below is the list of icons that appear on the link based on the mode of execution, parallel or sequential, of the current stage and the preceding stage, and the type of Partitioning method:

Preceding Stage Current Stage

Sequential mode--à (FAN OUT) --à Parallel mode (Indicates Partitioning)

Parallel mode --à (FAN IN) --à Sequential mode(Indicates Collecting)

Parallel mode --à (BOX) --à Parallel mode (Indicates AUTO method)

Parallel mode --à (BOW TIE)--à Parallel mode (Indicates Repartitioning)

Parallel mode --à (PARALLEL LINES) Parallel mode (Indicates SAME partitioning)


If we check the preserve partitioning in one stage and if we don’t give any partitioning method (Auto) in the next stage which partition method it will use?

-- In this case, the partitioning method used by the preceding stage is used.

-- Preserve Partitioning indicates whether the stage wants to preserve the

partitioning at the next stage of the job. Options in this tab are:

Set – Sets the Preserve partitioning flag.
Clear – Clears the preserve partitioning flag.
Propagate – Sets the flag to Set or Clear depending on the option selected in the previous stage.

Can we give node allocations i.e. for one stage 4 nodes and for next stage 3 nodes?

--- YES. It can be done using the option, ‘Node map constraint’ available in

Stage page -à Advanced Tab.

--- Checking this option allows you to constrain Parallel execution to the nodes defined in a node map. A ‘Node map’ can be defined by either of the below 2 options:

By clicking the browse button to open the available nodes dialog box and selecting nodes from there. (or)
By typing node numbers into the text box.

What is combinability, non-combinability?

--- Using Combinability, DataStage combines the operators that underlie parallel stages so that they run in the same process. It lets the DataStage compiler potentially 'optimize' the number of processes used at runtime by combining operators. This saves significant amount of data copying and preparation in passing data between operators. It has three options to set:

Auto: Use the default combination method.
Combinable: Combine all possible operators.
Don’t combine: Never combine operators.
Usually this setting is left ot its default so that DataStage can tune jobs for performance automatically.


What are schema files?

-- ‘Schema file’ is a plain text file in which meta data for a stage is specified.

--- It is stored outside the DataStage Repository in a document management system or a source code control system. In contrast, Table definitions are stored in DataStage Repository and can be loaded into stages as and when required.

--- Schema is an alternative way to specify column definitions for the data used by the parallel jobs. By default most parallel job stages take their meta data from the columns tab. For some stages you can specify a property that causes the stages to take its meta data from the specified schema file.

--- A Schema consists of a record definition. The following is an example for record schema:



address:nullable string[];



--- Import any table definition, load it into a parallel job, then choose "Show Schema". The schema defines the columns and their characteristics in a data set, and may also contain information about key columns.

20. Why we need datasets rather than sequential files?

--- A Sequential file as the source or target needs to be repartitioned as it is(as name suggests) a single sequential stream of data. A dataset can be saved across nodes using partitioning method selected, so it is always faster when we used as a source or target. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other

DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using datasets wisely can be key to good performance in a set of linked jobs.

21. Is look-up stage returns multi-rows or single rows?

---. Lookup stage returns the rows related to the key values. It can be multiple rows depends on the keys you are mentioning for the lookup in the stage

22. Why we need sort stage other than sort-merge collective method and perform sort option in the stage in advanced properties?

--- Sort Stage is used to perform more complex sort operations which is not possible using stages Advanced tab properties..

Many stages have an optional sort function via the partition tab. This means if you are partitioning your data in a stage you can define the sort at the same time. The sort stage is for use when you don't have any stage doing partitioning in your job but you still want to sort your data, or if you want to sort your data in descending order, or if you want to use one of the sort stage options such as "Allow Duplicates" or "Stable Sort". If you are processing very large volumes and need to sort you will find the sort stage is more flexible then the partition tab sort.

23. For surrogate key generator stage where will be the next value stored?

24. In surrogate key generator stage how it generates the number? (Based on nodes or Based on rows)

--- Based on the nodes we are generating key values, and the input data partitions should be perfectly balanced across the nodes. This can be achieved using round robin partitioning method when your starting point is sequential

25. What is the preserve partioning flag in Advanced tab?

--- It indicates whether the stage wants to preserve partitioning at the next stage of the job. There a three options 1. Set 2.Clear 3.Propagate.

Set. Sets the preserve partitioning flag, this indicates to the next stage in the job that it should preserve existing partitioning if possible.

Clear. Clears the preserve partitioning flag, this indicates that this stage doesn’t care which partitioning method the next stage uses.

Propagate. Set the flag to set or clear depending on what in previous stage of the job has set(or if that is set to propagate the stage before that and so on until a preserve partitioning flag setting is encountered).

26. What is the difference between stages and operators?

--- Stages are generic user interface from where we can read and write from files and databases, trouble shoot and develop jobs, also it's capable of doing processing of data. Different types of stages are

Database. These are stages that read or write data contained in a database. Examples of database stages are the Oracle Enterprise and DB2/UDB Enterprise stages.

Development/Debug. These are stages that help you when you are developing and troubleshooting parallel jobs. Examples are the Peek and Row Generator stages.

File. These are stages that read or write data contained in a file or set of files. Examples of file stages are the Sequential File and Data Set stages.

Processing. These are stages that perform some processing on the data that is passing through them. Examples of processing stages are the Aggregator and Transformer stages.

Real Time. These are the stages that allow Parallel jobs to be made available as RTI services. They comprise the RTI Source and RTI Target stages. These are part of the optional Web Services package.

Restructure. These are stages that deal with and manipulate data containing columns of complex data type. Examples are Make Sub record and Make Vector stages.

--- Operators are the basic functional units of an orchestrate application. In orchestrate framework DataStage stages generates an orchestrate operator directly.

27. Why we need filter, copy and column export stages instead of transformer stage?

--- In parallel jobs we have specific stage types for performing specialized tasks. Filter, copy, column export stages are operator stages. These operators are the basic functional units of an orchestrate application. The operators in your Orchestrate application pass data records from one operator to the next, in pipeline fashion. For example, the operators in an application step might start with an import operator, which reads data from a file and converts it to an Orchestrate data set. Subsequent operators in the sequence could perform various processing and analysis tasks. The processing power of Orchestrate derives largely from its ability to execute operators in parallel on multiple processing nodes. By default, Orchestrate operators execute on all processing nodes in your system. Orchestrate dynamically scales your application up or down in response to system configuration changes, without requiring you to modify your application. Thus using operator stages will increase the speed of data processing applications rather than using transformer stages.

28. Describe the types of Transformers used in DataStage PX for processing and uses?


Basic Transformer

Transformer-: The Transformer stage is a processing stage. Transformer stages allow you to create transformations to apply to your data. These transformations can be simple or complex and can be applied to individual columns in your data. Transformations are specified using a set of functions. Transformer stages can have a single input and any number of outputs. It can also have a reject link that takes any rows, which have not been written to any of the outputs links by reason of a write failure or expression evaluation failure.

Basic Transformer:- The BASIC Transformer stage is a also a processing stage. It is similar in appearance and function to the Transformer stage in parallel jobs. It gives access to BASIC transforms and functions (BASIC is the language supported by the DataStage server engine and available in server jobs). BASIC Transformer stage can have a single input and any number of outputs.

29. What is aggregate cache in aggregator transformation?

--- Aggregate cache is the memory used for grouping operations by the aggregator stage.

30. What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run job?

--- Use wait for file activity stage between job activity stages in job sequencer.

31. How do you rename all of the jobs to support your new File-naming conventions?

--- Create a file with new and old names. Export the whole project as a dsx. Write a script, which can do a simple rename of the strings looking up the file. Then import the new dsx file and recompile all jobs. Be cautious that the name of the jobs has also been changed in your job control jobs or Sequencer jobs. So you have to make the necessary changes to these Sequencers.

32. How do you merge two files in DS?

--- We can merge two files in 3 different ways. Either go for Merge stage, Join Stage or Lookup Stage All these merge or join occurs based on the key values. The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and their requirements for data being input of key columns.

33. How did you handle an 'Aborted' sequencer?

--- By using check point information we can restart the sequence from failure. if u enabled the check point information reset the aborted job and run again.


34. What are Performance tunings you have done in your last project to increase the performance of slowly running jobs?

1. Using Dataset stage instead of sequential files wherever necessary.

2. Use Join stage instead of Lookup stage when the data is huge.

3.Use Operator stages like remove duplicate, Filter, Copy etc instead of transformer stage.

4. Sort the data before sending to change capture stage or remove duplicate stage.

5. Key column should be hash partitioned and sorted before aggregate operation.

6.Filter unwanted records in beginning of the job flow itself.

35. If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?

--- It will result in false output even though job runs successfully. In aggregator key value should be hash partitioned so that identical key values will be in the same node, which gives the desired result and also eases grouping operation.

36. What is Full load & Incremental or Refresh load?

37. Describe cleanup resource and clear status file?

The Cleanup Resources command is to

• View and end job processes

• View and release the associated locks

Clear Status file command is for resetting the status records associated with all stages in that job.

Cleanup Resources command and Clear Status file command is available in Director under Job menu.

38. What is lookup stage? Can you define derivations in the lookup stage output?

Lookup stage is used to perform lookup operations on a data set read into memory from any other Parallel job stage that can output data. It can also perform lookups directly in a DB2 or Oracle database or in a lookup table contained in a Lookup File Set stage.

The most common use for a lookup is to map short codes in the input data set onto expanded information from a lookup table which is then joined to the incoming data and output.

Lookups can also be used for validation of a row. If there is no corresponding entry in a lookup table to the key’s values, the row is rejected.

The Lookup stage can have a reference link, a single input link, a single output link, and a single rejects link. Depending upon the type and setting of the stage(s) providing the look up information, it can have multiple reference links (where it is directly looking up a DB2 table or Oracle table, it can only have a single reference link).

The input link carries the data from the source data set and is known as the primary link. For each record of the source data set from the primary link, the Lookup stage performs a table lookup on each of the lookup tables attached by reference links. The table lookup is based on the values of a set of lookup key columns, one set for each table. The keys are defined on the Lookup stage.

Lookup stages do not require data on the input link or reference links to be sorted. Each record of the output data set contains columns from a source record plus columns from all the corresponding lookup records where corresponding source and lookup records have the same value for the lookup key columns. The lookup key columns do not have to have the same names in the primary and the reference links. The optional reject link carries source records that do not have a corresponding entry in the input lookup tables.

There are some special partitioning considerations for lookup stages. You need to ensure that the data being looked up in the lookup table is in the same partition as the input data referencing it. One way of doing this is to partition the lookup tables using the Entire method. Another way is to partition it in the same way as the input data (although this implies sorting of the data).

Yes. define derivations in the lookup stage output.

39. What is copy stage? When do you use that?

The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification or you can drop or change the order of columns.

Copy stage is useful when we want to make a backup copy of a data set on disk while performing an operation on another copy.

Copy stage with a single input and a single output, needs Force set to be TRUE. This prevents DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job.

40. What is Change Capture stage? Which execution mode would you use when you used for comparison of data?

The Change Capture stage takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set.

The stage produces a change data set, whose table definition is transferred from the after data set’s table definition with the addition of one column: a change code with values encoding the four actions: insert, delete, copy, and edit. The preserve-partitioning flag is set on the change data set.

The compare is based on a set a set of key columns, rows from the two data sets are assumed to be copies of one another if they have the same values in these key columns. You can also optionally specify change values. If two rows have identical key columns, you can compare the value columns in the rows to see if one is an edited copy of the other.

The stage assumes that the incoming data is key-partitioned and sorted in ascending order. The columns the data is hashed on should be the key columns used for the data compare. You can achieve the sorting and partitioning using the Sort stage or by using the built-in sorting and partitioning abilities of the Change Capture stage.

We can use both Sequential as well as parallel modes of execution for change capture stage.

41. What is Dataset Stage?

The Data Set stage is a file stage. It allows you to read data from or write data to a data set. It can be configured to execute in parallel or sequential mode. DataStage parallel extender jobs use data sets to manage data within a job.

The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds.

Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the DataStage Designer, Manager, or Director.

42. How do you drop dataset?

There two ways of dropping a data set, first is by using Data Set Management Utility (GUI) located in the Manager, Director, Designer and second is by using Unix command-line utility orchadmin.

43. How do you eliminate duplicate in a Dataset?

The simplest way to remove duplicate is by the use of Remove Duplicate Stage. The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set.

44. What is Peek Stage? When do you use it?

The Peek stage is a Development/Debug stage. It can have a single input link and any number of output links.

The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets, like the Head stage and the Tail stage.

The Peek stage can be helpful for monitoring the progress of your application or to diagnose a bug in your application.

45. What are the different join options available in Join Stage?

There are four join option available in the join stage.

Inner: Transfers records from input data sets whose key columns contain equal values to the output data set. Records whose key columns do not contain equal values are dropped

Left Outer: Transfers all values from the left data set but transfers values from the right data set and intermediate data sets only where key columns match. The stage drops the key column from the right and intermediate data sets.

Right Outer: Transfers all values from the right data set and transfers values from the left data set and intermediate data sets only where key columns match. The stage drops the key column from the left and intermediate data sets.

Full Outer: Transfers records in which the contents of the key columns are equal from the left and right input data sets to the output data set. It also transfers records whose key columns contain unequal values from both input data sets to the output data set. (Full outer joins do not support more than two input links.)

The default is inner.

46. What is the difference between Lookup, Merge & Join Stage?

These "three Stages" combine two or more input links according to values of user-designated "key" column(s).

They differ mainly in:

Memory usage

Treatment of rows with unmatched key values

Input requirements (sorted, de-duplicated)

The main difference between joiner and lookup is in the way they handle the data and the reject links. In joiner, no reject links are possible. So we cannot get the rejected records directly. Lookup provides a reject link.

Also lookup is used if the data being looked up can fit in the available temporary memory. If the volume of data is quite huge, then it is safe to go for Joiner

Join requires the input dataset to be key partitioned and sorted. Lookup does not have this requirement

Lookup allows reject links. Join does not allow reject links. If the volume of data is huge to be fit into memory you go for join and avoid lookup as paging can occur when lookup is used.

Merge stage allow us to capture failed lookups from each reference input separately. It also requires identically sorted and partitioned inputs and, if more than one reference input, de-duplicated reference inputs

In case of merge stage as part of pre processing step duplicates should be removed from master dataset. If there are more than one update dataset then duplicates should be removed from update datasets as well.

The above-mentioned step is not required for join and lookup stages.

47. What is RCP? How it is implemented?

DataStage is flexible about Meta data. It can cope with the situation where Meta data isn’t fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP).

This can be enabled for a project via the DataStage Administrator, and set for individual links via the Outputs Page Columns tab for most stages, or in the Outputs page General tab for Transformer stages. You should always ensure that runtime column propagation is turned on.

RCP is implemented through Schema File. The schema file is a plain text file contains a record (or row) definition.

48. What is row generator? When do you use it?

The Row Generator stage is a Development/Debug stage. It has no input links, and a single output link.

The Row Generator stage produces a set of mock data fitting the specified meta data.

This is useful where we want to test our job but have no real data available to process. Row

Generator is also useful when we want processing stages to execute at least once in absence of data from the source.
Post a Comment

Receive All Free Updates Via Facebook.