Data Warehousing
Data Warehouse PDF: Data Warehousing Concepts (Book)
$20.20 $9.99 for today 4.6 (115 ratings) Key Highlights of Data Warehouse PDF 221+ pages eBook...
ETL is a process that extracts the data from different source systems, then transforms the data (like applying calculations, concatenations, etc.) and finally loads the data into the Data Warehouse system. Full form of ETL is Extract, Transform and Load.
It's tempting to think a creating a Data warehouse is simply extracting data from multiple sources and loading into database of a Data warehouse. This is far from the truth and requires a complex ETL process. The ETL process requires active inputs from various stakeholders including developers, analysts, testers, top executives and is technically challenging.
In order to maintain its value as a tool for decision-makers, Data warehouse system needs to change with business changes. ETL is a recurring activity (daily, weekly, monthly) of a Data warehouse system and needs to be agile, automated, and well documented.
In this ETL tutorial, you will learn-
There are many reasons for adopting ETL in the organization:
ETL is a 3-step process
In this step of ETL architecture, data is extracted from the source system into the staging area. Transformations if any are done in staging area so that performance of source system in not degraded. Also, if corrupted data is copied directly from the source into Data warehouse database, rollback will be a challenge. Staging area gives an opportunity to validate extracted data before it moves into the Data warehouse.
Data warehouse needs to integrate systems that have different
DBMS, Hardware, Operating Systems and Communication Protocols. Sources could include legacy applications like Mainframes, customized applications, Point of contact devices like ATM, Call switches, text files, spreadsheets, ERP, data from vendors, partners amongst others.
Hence one needs a logical data map before data is extracted and loaded physically. This data map describes the relationship between sources and target data.
Three Data Extraction methods:
Irrespective of the method used, extraction should not affect performance and response time of the source systems. These source systems are live production databases. Any slow down or locking could effect company's bottom line.
Some validations are done during Extraction:
Data extracted from source server is raw and not usable in its original form. Therefore it needs to be cleansed, mapped and transformed. In fact, this is the key step where ETL process adds value and changes data such that insightful BI reports can be generated.
It is one of the important ETL concepts where you apply a set of functions on extracted data. Data that does not require any transformation is called as direct move or pass through data.
In transformation step, you can perform customized operations on data. For instance, if the user wants sum-of-sales revenue which is not in the database. Or if the first name and the last name in a table is in different columns. It is possible to concatenate them before loading.
Following are Data Integrity Problems:
Validations are done during this stage
Loading data into the target datawarehouse database is the last step of the ETL process. In a typical Data warehouse, huge volume of data needs to be loaded in a relatively short period (nights). Hence, load process should be optimized for performance.
In case of load failure, recover mechanisms should be configured to restart from the point of failure without data integrity loss. Data Warehouse admins need to monitor, resume, cancel loads as per prevailing server performance.
Types of Loading:
There are many Data Warehousing tools are available in the market. Here, are some most prominent one:
1. MarkLogic:
MarkLogic is a data warehousing solution which makes data integration easier and faster using an array of enterprise features. It can query different types of data like documents, relationships, and metadata.
https://www.marklogic.com/product/getting-started/
2. Oracle:
Oracle is the industry-leading database. It offers a wide range of choice of Data Warehouse solutions for both on-premises and in the cloud. It helps to optimize customer experiences by increasing operational efficiency.
https://www.oracle.com/index.html
3. Amazon RedShift:
Amazon Redshift is Datawarehouse tool. It is a simple and cost-effective tool to analyze all types of data using standard SQL and existing BI tools. It also allows running complex queries against petabytes of structured data.
https://aws.amazon.com/redshift/?nc2=h_m1
Here is a complete list of useful Data warehouse Tools.
Following are the best practices for ETL Process steps:
Never try to cleanse all the data:
Every organization would like to have all the data clean, but most of them are not ready to pay to wait or not ready to wait. To clean it all would simply take too long, so it is better not to try to cleanse all the data.
Never cleanse Anything:
Always plan to clean something because the biggest reason for building the Data Warehouse is to offer cleaner and more reliable data.
Determine the cost of cleansing the data:
Before cleansing all the dirty data, it is important for you to determine the cleansing cost for every dirty data element.
To speed up query processing, have auxiliary views and indexes:
To reduce storage costs, store summarized data into disk tapes. Also, the trade-off between the volume of data to be stored and its detailed usage is required. Trade-off at the level of granularity of data to decrease the storage costs.
$20.20 $9.99 for today 4.6 (115 ratings) Key Highlights of Data Warehouse PDF 221+ pages eBook...
{loadposition top-ads-automation-testing-tools} A Data Warehouse is a collection of software tools...
What is OLTP? OLTP is an operational system that supports transaction-oriented applications in a...
What is Teradata? Teradata is an open-source Database Management System for developing large-scale...
{loadposition top-ads-automation-testing-tools} Data integration is the process of combining data...
What is Data Mart? A Data Mart is focused on a single functional area of an organization and...