Data Warehousing
Database vs Data Warehouse: Key Differences
What is Database? A database is a collection of related data which represents some elements of the...
Datastage is an ETL tool which extracts data, transform and load data from source to the target. The data sources might include sequential files, indexed files, relational databases, external data sources, archives, enterprise applications, etc. DataStage facilitates business analysis by providing quality data to help in gaining business intelligence.
Datastage is used in a large organization as an interface between different systems. It takes care of extraction, translation, and loading of data from source to the target destination. It was first launched by VMark in mid-90's. With IBM acquiring DataStage in 2005, it was renamed to IBM WebSphere DataStage and later to IBM InfoSphere.
Various version of Datastage available in the market so far was Enterprise Edition (PX), Server Edition, MVS Edition, DataStage for PeopleSoft and so on. The latest edition is IBM InfoSphere DataStage
IBM Information server includes following products,
Datastage has following Capabilities.
In the following sections, we briefly describe the following aspects of IBM InfoSphere DataStage:
InfoSphere DataStage and QualityStage can access data in enterprise applications and data sources such as:
Processing Stage Types
IBM infosphere job consists of individual stages that are linked together. It describes the flow of data from a data source to a data target. Usually, a stage has minimum of one data input and/or one data output. However, some stages can accept more than one data input and output to more than one stage.
In Job design various stages you can use are:
DataStage has four main components namely,
The above image explains how IBM Infosphere DataStage interacts with other elements of the IBM Information Server platform. DataStage is divided into two section, Shared Components, and Runtime Architecture.
Shared | Unified user interface |
|
Common Services |
| |
Common Parallel Processing |
| |
Runtime Architecture | OSH Script |
|
For DataStage, you will require the following setup.
To access DataStage, download and install the latest version of IBM InfoSphere Server. The server supports AIX, Linux, and Windows operating system. You can choose as per requirement.
To migrate your data from an older version of infosphere to new version uses the asset interchange tool.
Installation Files
For installing and configuring Infosphere Datastage, you must have following files in your setup.
For Windows,
For Linux,
This information is used to,
Before you begin with Datastage, you need to setup database. You will create two DB2 databases.
You will also create two tables (Product and Inventory) and populate them with sample data. Then you can test your integration between SQL Replication and Datastage.
Moving forward you will set up SQL replication by creating control tables, subscription sets, registrations and subscription set members. We will learn more about this in details in next section.
Here we will take an example of Retail sales item as our database and create two tables Inventory and Product. These tables will load data from source to target through these sets. (control tables, subscription sets, registrations, and subscription set members.)
Step 1) Create a source database referred to as SALES. Under this database, create two tables product and Inventory.
Step 2) Run the following command to create SALES database.
db2 create database SALES
Step 3) Turn on archival logging for the SALES database. Also, back up the database by using the following commands
db2 update db cfg for SALES using LOGARCHMETH3 LOGRETAIN db2 backup db SALES
Step 4) In the same command prompt, change to the setupDB subdirectory in the sqlrepl-datastage-tutorial directory that you extracted from the downloaded compressed file.
Step 5) Use the following command to create Inventory table and import data into the table by running the following command.
db2 import from inventory.ixf of ixf create into inventory
Step 6) Create a target table. Name the target database as STAGEDB.
Since now you have created both databases source and target, the next step we will see how to replicate it.
The following information can be helpful in setting up ODBC data source.
The image below shows how the flow of change data is delivered from source to target database. You create a source-to-target mapping between tables known as subscription set members and group the members into a subscription.
The unit of replication within InfoSphere CDC (Change Data Capture) is referred to as a subscription.
In this section, you have to do following things,
Use ASNCLP command line program to setup SQL replication
Step 1) Locate the crtCtlTablesCaptureServer.asnclp script file in the sqlrepl-datastage-tutorial/setupSQLRep directory.
Step 2) In the file replace <db2-connect-ID> and "<password>" with your user ID and password for connecting to the SALES database.
Step 3) Change directories to the sqlrepl-datastage-tutorial/setupSQLRep directory and run the script. Use the following command. The command will connect to the SALES database, generate an SQL script for creating the Capture control tables.
asnclp –f crtCtlTablesCaptureServer.asnclp
Step 4) Locate the crtCtlTablesApplyCtlServer.asnclp script file in the same directory. Now replace two instances of <db2-connect-ID> and "<password>" with the user ID and password for connecting to the STAGEDB database.
Step 5) Now in the same command prompt use the following command to create apply control tables.
asnclp –f crtCtlTablesApplyCtlServer.asnclp
Step 6) Locate the crtRegistration.asnclp script files and replace all instances of <db2-connect-ID> with the user ID for connecting to the SALES database. Also, change "<password>" to the connection password.
Step 7) To register the source tables, use following script. As part of creating the registration, the ASNCLP program will create two CD tables. CDPRODUCT AND CDINVENTORY.
asnclp –f crtRegistration.asnclp
The CREATE REGISTRATION command uses the following options:
Step 8) For connecting to the target database (STAGEDB), use following steps.
Step 9) Locate the crtSubscriptionSetAndAddMembers.asnclp script files and do the following changes.
After changes run the script to create subscription set (ST00) that groups the source and target tables. The script also creates two subscription set members, and CCD (consistent change data) in the target database that will store the modified data. This data will be consumed by Infosphere DataStage.
Step 10) Run the script to create the subscription set, subscription-set members, and CCD tables.
asnclp –f crtSubscriptionSetAndAddMembers.asnclp
Various options used for creating subscription set and two members include
Step 11) Due to the defect in the replication administration tools. You have to execute another batch file to set the TARGET_CAPTURE_SCHEMA column in the IBMSNAP_SUBS_SET control table to null.
Before we do replication in next step, we need to connect CCD table with DataStage. In this section, we will see how to connect SQL with DataStage.
For connecting CCD table with DataStage, you need to create Datastage definition (.dxs) files. The .dsx file format is used by DataStage to import and export job definitions. You will use ASNCLP script to create two .dsx files. For example, here we have created two .dsx files.
ASNCLP program automatically maps the CCD column to the Datastage Column format. It is only supported when the ASNCLP runs on Windows, Linux, or Unix Procedure.
Datastage jobs pull rows from CCD table.
To start replication, you will use below steps. When CCD tables are populated with data, it indicates the replication setup is validated. To view the replicated data in the target CCD tables use the DB2 Control Center graphical user interface.
Step 1) Make sure that DB2 is running if not then use db2 start command.
Step 2) Then use asncap command from an operating system prompt to start capturing program. For example.
asncap capture_server=SALES
The above command specifies the SALES database as the Capture server. Keep the command window open while the capture is running.
Step 3) Now open a new command prompt. Then start the APPLY program by using the asnapply command.
asnapply control_server=STAGEDB apply_qual=AQ00
Leave command window open with Apply is running.
Step 4) Now open another command prompt and issue the db2cc command to launch the DB2 Control Center. Accept the default Control Center.
Step 5) Now in the left navigation tree, open All Databases > STAGEDB and then click Tables. Double click on table name ( Product CCD) to open the table. It will look something like this.
Likewise, you can also open CCD table for INVENTORY.
First of all, you will create a Project in DataStage. For that, you must be an InfoSphere DataStage administrator.
Once the Installation and replication are done, you need to create a project. In DataStage, projects are a method for organizing your data. It includes defining data files, stages and build jobs in a specific project.
To create a project in DataStage, follow the following steps.
Step 1) Launch the DataStage and QualityStage Administrator. Then click Start > All programs > IBM Information Server > IBM WebSphere DataStage and QualityStage Administrator.
Step 2) For connecting to the DataStage server from your DataStage client, enter details like Domain name, user ID, password, and server information
Step 3) In the WebSphere DataStage Administration window. Click the Projects tab and then click Add.
Step 4) In the WebSphere DataStage Administration window, enter details like
Each project contains:
We will see how to import replication jobs in Datastage Infosphere.
You will import jobs in the IBM InfoSphere DataStage and QualityStage Designer client. And you execute them in the IBM InfoSphere DataStage and QualityStage Director client.
The designer-client is like a blank canvas for building jobs. It extracts, transform, load, and check the quality of data. It provides tools that form the basic building blocks of a Job. It includes
The stages in the InfoSphere DataStage and QualityStage Designer client are stored in the Designer tool palette.
The following stages are included in InfoSphere QualityStage:
You can create 4 types of Jobs in DataStage infosphere.
Let's see step by step on how to import replication job files.
Step 1) Start the DataStage and QualityStage Designer. Click Start > All programs > IBM Information Server > IBM WebSphere DataStage and QualityStage Designer
Step 2) In the Attach to Project window, enter following details.
Step 3) Now from File menu click import -> DataStage Components.
A new DataStage Repository Import window will open.
Once the job is imported, DataStage will create STAGEDB_AQ00_ST00_sequence job.
Step 4) Follow the same steps to import the STAGEDB_AQ00_ST00_pJobs.dsx file. This import creates the four parallel jobs.
Step 5) Under Designer Repository pane -> Open SQLREP folder. Inside the folder, you will see, Sequence Job and four parallel jobs.
Step 6) To see the sequence job. Go to repository tree, right-click the STAGEDB_AQ00_ST00_sequence job and click Edit. It will show the workflow of the four parallel jobs that the job sequence controls.
Each icon is a stage,
Step 7) To see the parallel jobs. Right-click the STAGEDB_ASN_INVENTORY_CCD and select edit under repository. It will open window as shown below.
Here in above image, you can see that the data from Inventory CCD table and Synch point details from FEEDETL table is rendered to Lookup_6 stage.
Now next step is to build a data connection between InfoSphere DataStage and the SQL Replication target database. It contains the CCD tables.
In DataStage, you use data connection objects with related connector stages to quickly define a connection to a data source in a job design.
Step 1) STAGEDB contains both the Apply control tables that DataStage uses to synchronize its data extraction and the CCD tables from which the data is extracted. Use following commands
db2 catalog tcpip node SQLREP remote ip_address server 50000 db2 catalog database STAGEDB as STAGEDB2 at node SQLREP
Note: IP address of the system where STAGEDB was created
Step 2) Click File > New > Other > Data Connection.
Step 3) You will have a window with two tabs, Parameters, and General.
Step 4) In this step,
Step 5) In Connection parameters table, enter details like
Step 6) In the next window save data connection. Click on 'save' button.
In the previous step, we saw that InfoSphere DataStage and the STAGEDB database are connected. Now, import column definition and other metadata for the PRODUCT_CCD and INVENTORY_CCD tables into the Information Server repository.
In the designer window, follow below steps.
Step 1) Select Import > Table Definitions > Start Connector Import Wizard
Step 2) From connector selection page of the wizard, select the DB2 Connector and click Next.
Step 3) Click load on connection detail page. This will populate the wizard fields with connection information from the data connection that you created in the previous chapter.
Step 4) Click Test connection on the same page. This will prompt DataStage to attempt a connection to the STAGEDB database. You can see the message "connection is successful". Click Next.
Step 5) Make sure on the Data source location page the Hostname and Database name fields are correctly populated. Then click next.
Step 6) On Schema page. Enter the schema of the Apply control tables (ASN) or check that the ASN schema is pre-populated into the schema field. Then click next. The selection page will show the list of tables that are defined in the ASN Schema.
Step 7) The first table from which we need to import metadata is IBMSNAP_FEEDETL, an Apply control table. It has the detail about the synchronization points that allows DataStage to keep track of which rows it has fetched from the CCD tables. Choose IBMSNAP_FEEDETL and click Next.
Step 8) To complete the import of the IBMSNAP_FEEDETL table definition. Click import and then in the open window click open.
Step 9) Repeat steps 1-8 two more times to import the definitions for the PRODUCT_CCD table and then the INVENTORY_CCD table.
NOTE: While importing definitions for the inventory and product, make sure you change the schemas from ASN to the schema under which PRODUCT_CCD and INVENTORY_CCD were created.
Now DataStage has all the details that it requires to connect to the SQL Replication target database.
For each of the four DataStage parallel jobs that we have, it contains one or more stages that connect with the STAGEDB database. You need to modify the stages to add connection information and link to dataset files that DataStage populates.
Stages have predefined properties that are editable. Here we will change some of these properties for the STAGEDB_ASN_PRODUCT_CCD_extract parallel job.
Step 1) Browse the Designer repository tree. Under SQLREP folder select the STAGEDB_ASN_PRODUCT_CCD_extract parallel job. To edit, right-click the job. The design window of the parallel job opens in the Designer Palette.
Step 2) Locate the green icon. This icon signifies the DB2 connector stage. It is used for extracting data from the CCD table. Double-click the icon. A stage editor window opens.
Step 3) In the editor click Load to populate the fields with connection information. To close the stage editor and save your changes click OK.
Step 4) Now return to the design window for the STAGEDB_ASN_PRODUCT_CCD_extract parallel job. Locate the icon for the getSynchPoints DB2 connector stage. Then double-click the icon.
Step 5) Now click load button to populate the fields with connection information.
NOTE: If you are using a database other than STAGEDB as your Apply control server. Then select the option to load the connection information for the getSynchPoints stage, which interacts with the control tables rather than the CCD table.
Step 6) In this step,
Step 7) Now open the stage editor in the design window, and double click on icon insert_into_a_dataset. It will open another window.
Step 8) In this window,
You have now updated all necessary properties for the product CCD table. Close the design window and save all changes.
Step 9) Now locate and open the STAGEDB_ASN_INVENTORY_CCD_extract parallel job from repository pane of the Designer and repeat Steps 3-8.
NOTE:
When DataStage job is ready to compile the Designer validates the design of the job by looking at inputs, transformations, expressions, and other details.
When the job compilation is done successfully, it is ready to run. We will compile all five jobs, but will only run the "job sequence". This is because this job controls all the four parallel jobs.
Step 1) Under SQLREP folder. Select each of the five jobs by (Cntrl+Shift). Then right click and choose Multiple job compile option.
Step 2) You will see five jobs is selected in the DataStage Compilation Wizard. Click Next.
Step 3) Compilation begins and display a message "Compiled successfully" once done.
Step 4) Now start the DataStage and QualityStage Director. Select Start > All programs > IBM Information Server > IBM WebSphere DataStage and QualityStage Director.
Step 5) In the project navigation pane on the left. Click the SQLREP folder. This brings all five jobs into the director status table.
Step 6) Select the STAGEDB_AQ00_S00_sequence job. From the menu bar click Job > Run Now.
Once compilation is done, you will see the finished status.
Now check whether changed rows that are stored in the PRODUCT_CCD and INVENTORY_CCD tables were extracted by DataStage and inserted into the two data set files.
Step 7) Go back to the Designer and open the STAGEDB_ASN_PRODUCT_CCD_extract job. To open the stage editor Double-click the insert_into_a_dataset icon. Then click view data.
Step 8) Accept the defaults in the rows to be displayed window. Then click OK. A data browser window will open to show the contents of the data set file.
In the previous step, we compiled and executed the job. In this section, we will check the integration of SQL replication and DataStage. For that, we will make changes to the source table and see if the same change is updated into the DataStage.
Step 1) Navigate to the sqlrepl-datastage-scripts folder for your operating system.
Step 2) Start SQL Replication by following steps:
Step 3) Now open the updateSourceTables.sql file. For connecting to the SALES database replace <sales-connect-ID> and <sales-password> with the user ID and password.
Step 4) Open a DB2 command window. Change directory to sqlrepl-datastage-tutorial\scripts, and run issue by the given command:
db2 -tvf updateSourceTables.sql
The SQL script will do various operations like Update, Insert and delete on both tables (PRODUCT, INVENTORY) in the Sales database.
Step 5) On the system where DataStage is running. Open the DataStage Director and execute the STAGEDB_AQ00_S00_sequence job. Click Job > Run Now.
When you run the job following activities will be carried out.
You can check that the above steps took place by looking at the data sets.
Step 6) Follow the below steps,
The dataset contains three new rows. The easiest way to check the changes are implemented is to scroll down far right of the Data Browser. Now look at the last three rows (see image below)
The letter I, U and D specifies INSERT, UPDATE and DELETE operation that resulted in each new row.
You can do the same check for Inventory table.
Summary:
What is Database? A database is a collection of related data which represents some elements of the...
{loadposition top-ads-automation-testing-tools} Data integration is the process of combining data...
What is Data Warehouse? A Data Warehouse collects and manages data from varied sources to provide...
Fact Table: A fact table is a primary table in a dimensional model. A Fact Table contains...
{loadposition top-ads-automation-testing-tools} What is Business Intelligence Tool? BUSINESS...
What is Teradata? Teradata is massively parallel open processing system for developing large-scale data...