Data Warehousing
The basic concept of a data warehouse is to provide a single version of the truth for a company's decision making and forecasting. A data warehouse is an information system that contains historical and cumulative data from one or more sources. Data warehouse concepts simplify an organization's reporting and analysis processes.
A data warehouse has the following characteristics: it is subject-oriented, integrated, time-variant, and non-volatile.
A data warehouse is subject-oriented because it offers information about a theme rather than a company's ongoing operations. These subjects can be sales, marketing, distribution, etc.
A data warehouse never focuses on ongoing operations. Instead, it puts emphasis on modeling and analysis of data for decision making. It also provides a simple and concise view of a specific subject by excluding data that is not helpful in supporting the decision process.
In a data warehouse, integration means establishing a common unit of measure for all similar data from dissimilar databases. The data also needs to be stored in the data warehouse in a common and universally acceptable manner.
A data warehouse is developed by integrating data from varied sources such as mainframes, relational databases, and flat files. Moreover, it must keep naming conventions, formats, and coding consistent.
This integration helps in the effective analysis of data. Consistency in naming conventions, attribute measures, encoding structures, etc. has to be ensured. Consider the following example:
In this example, there are three different applications labeled A, B, and C. The information stored in these applications is Gender, Date, and Balance. However, each application stores its data in a different way.
After the transformation and cleaning process, all of this data is stored in a common format in the data warehouse.
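The transformation step described above can be sketched in a few lines of Python. This is a minimal illustration only: the per-application encodings (`GENDER_MAP`, `DATE_FORMATS`), the function `to_warehouse_row`, and the sample values are assumptions made up for this sketch, since the original figure with the actual encodings is not reproduced here.

```python
from datetime import datetime

# Hypothetical source encodings, invented for illustration.
GENDER_MAP = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}
DATE_FORMATS = {"A": "%d-%m-%Y", "B": "%m/%d/%Y", "C": "%Y%m%d"}


def to_warehouse_row(app, gender, date_text, balance_text):
    """Convert one source record into a single common warehouse format."""
    unified_gender = GENDER_MAP[str(gender).strip().lower()]
    # Parse with the format used by the source application, emit ISO 8601.
    unified_date = datetime.strptime(date_text, DATE_FORMATS[app]).date().isoformat()
    # Store balances as integer cents so every source uses the same unit.
    unified_balance = round(float(balance_text.replace(",", "")) * 100)
    return {"gender": unified_gender, "date": unified_date, "balance_cents": unified_balance}


rows = [
    to_warehouse_row("A", "Male", "24-03-2021", "1,250.50"),
    to_warehouse_row("B", "m", "03/24/2021", "1250.5"),
    to_warehouse_row("C", 1, "20210324", "1250.50"),
]
```

Although the three inputs look different, each one ends up as the same warehouse row, which is the point of the integration step.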
The time horizon for a data warehouse is quite extensive compared with that of operational systems. The data collected in a data warehouse is identified with a particular period and offers information from a historical point of view. It contains an element of time, explicitly or implicitly.
One place where data warehouse data displays time variance is in the structure of the record key. Every primary key contained in the data warehouse should have, either implicitly or explicitly, an element of time, such as the day, week, or month.
Another aspect of time variance is that once data is inserted into the warehouse, it cannot be updated or changed.
A data warehouse is also non-volatile, which means that previous data is not erased when new data is entered.
Data is read-only and periodically refreshed. This also helps to analyze historical data and understand what happened and when. It does not require transaction processing, recovery, or concurrency control mechanisms.
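As a rough sketch of these two properties, the following Python example uses an in-memory SQLite table whose primary key includes a date, and exposes only an append-style load and a read. The table name `balance_snapshot`, its columns, and the helper functions `load` and `history` are hypothetical names chosen for illustration, not part of any particular warehouse product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE balance_snapshot (
        account_id    INTEGER NOT NULL,
        snapshot_date TEXT    NOT NULL,  -- the time element of the key
        balance       REAL    NOT NULL,
        PRIMARY KEY (account_id, snapshot_date)
    )
    """
)


def load(rows):
    """Loading: append new snapshots; existing rows are never overwritten."""
    conn.executemany("INSERT INTO balance_snapshot VALUES (?, ?, ?)", rows)
    conn.commit()


def history(account_id):
    """Access: read the full history recorded for one account."""
    cur = conn.execute(
        "SELECT snapshot_date, balance FROM balance_snapshot "
        "WHERE account_id = ? ORDER BY snapshot_date",
        (account_id,),
    )
    return cur.fetchall()


load([(1, "2021-01-31", 100.0)])
load([(1, "2021-02-28", 140.0)])
```

Because the date is part of the key, the February load adds a second row instead of replacing the January one, and an attempt to load the same account and date again is rejected by the primary key constraint.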
Operations such as delete, update, and insert, which are performed in an operational application environment, are omitted in a data warehouse environment. Only two types of data operations are performed in data warehousing: data loading and data access.
Here are some major differences between an operational application and a data warehouse:
Operational Application | Data Warehouse |
Complex programs must be coded to make sure that data update processes maintain high integrity of the final product. | This kind of issue does not arise because data updates are not performed. |
Data is placed in a normalized form to ensure minimal redundancy. | Data is not stored in normalized form. |
The technology needed to support transactions, data recovery, rollback, and deadlock resolution is quite complex. | It offers relative simplicity in technology. |
Data warehouse architecture is complex because it is an information system that contains historical and cumulative data from multiple sources. There are three approaches to constructing data warehouse layers: single-tier, two-tier, and three-tier. These architectures are explained below.
Single-tier architecture
The objective of a single layer is to minimize the amount of data stored by removing data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
Two-tier architecture physically separates the available sources from the data warehouse. This architecture is not expandable and does not support a large number of end users. It also has connectivity problems because of network limitations.
Three-Tier Data Warehouse Architecture
This is the most widely used Architecture of Data Warehouse.
It consists of the Top, Middle and Bottom Tier.
The following sections describe the components of a data warehouse and how they fit into its architecture.
The Data Warehouse is based on an RDBMS server which is a central information repository that is surrounded by some key Data Warehousing components to make the entire environment functional, manageable and accessible.
There are mainly five Data Warehouse Components:
The central database is the foundation of the data warehousing environment. This database is implemented on RDBMS technology. However, this kind of implementation is constrained by the fact that a traditional RDBMS is optimized for transactional database processing, not for data warehousing. For instance, ad hoc queries, multi-table joins, and aggregates are resource intensive and slow down performance.
Hence, alternative approaches to Database are used as listed below-
Data sourcing, transformation, and migration tools perform all the conversions, summarizations, and other changes needed to transform data into a unified format in the data warehouse. They are also called Extract, Transform, and Load (ETL) tools.
Their functionality includes:
These Extract, Transform, and Load tools may generate cron jobs, background jobs, COBOL programs, shell scripts, etc. that regularly update data in the data warehouse. These tools also help maintain the metadata.
ETL tools have to deal with the challenges of database and data heterogeneity.
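To make the extract, transform, and load stages concrete, here is a minimal sketch in Python using only the standard library. The CSV layout in `SOURCE_CSV`, the `orders` table, and the function names are assumptions made for illustration; a real ETL tool would add scheduling, error handling, and metadata capture on top of this.

```python
import csv
import io
import sqlite3

# Hypothetical source extract with inconsistent currency casing.
SOURCE_CSV = "order_id,amount,currency\n1,10.50,usd\n2,7.25,USD\n"


def extract(text):
    """Extract: read raw records from the source system's export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: cast types and unify the currency code format."""
    return [(int(r["order_id"]), float(r["amount"]), r["currency"].upper()) for r in records]


def load(conn, rows):
    """Load: append the cleaned rows to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, text):
    """Run all three stages and return (row count, total amount) as a check."""
    load(conn, transform(extract(text)))
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
```

A job scheduler such as cron could call `run_etl` periodically with a fresh export, which corresponds to the regularly updated loads mentioned above.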
The name metadata suggests some high-level technological data warehousing concept. However, it is quite simple. Metadata is data about data that defines the data warehouse. It is used for building, maintaining, and managing the data warehouse.
In the data warehouse architecture, metadata plays an important role because it specifies the source, usage, values, and features of data warehouse data. It also defines how data can be changed and processed. It is closely connected to the data warehouse.
For example, a line in sales database may contain:
4030 KJ732 299.90
This data is meaningless until we consult the metadata, which tells us what each field represents.
Therefore, metadata is an essential ingredient in the transformation of data into knowledge.
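The role of metadata in interpreting such a line can be sketched as follows. The field names and descriptions in `FIELD_METADATA` are hypothetical placeholders invented for this sketch; they are not the meanings given by the original example, only an illustration of how a metadata definition turns raw values into labeled, typed ones.

```python
from decimal import Decimal

# Hypothetical metadata: names, types, and descriptions are illustrative only.
FIELD_METADATA = [
    {"name": "item_code", "type": int, "description": "hypothetical item identifier"},
    {"name": "agent_code", "type": str, "description": "hypothetical agent identifier"},
    {"name": "amount", "type": Decimal, "description": "hypothetical monetary amount"},
]


def describe(raw_line, metadata=FIELD_METADATA):
    """Attach a name and a type to each whitespace-separated raw value."""
    values = raw_line.split()
    if len(values) != len(metadata):
        raise ValueError("record does not match the metadata definition")
    return {m["name"]: m["type"](v) for m, v in zip(metadata, values)}


record = describe("4030 KJ732 299.90")
```

Without `FIELD_METADATA`, the three values are just strings; with it, each value has a name, a type, and a description that a reporting tool can use.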
Metadata helps to answer the following questions
Metadata can be classified into following categories:
One of the primary objectives of data warehousing is to provide information that businesses can use to make strategic decisions. Query tools allow users to interact with the data warehouse system.
These tools fall into four different categories:
Query and reporting tools can be further divided into reporting tools and managed query tools.
Reporting tools:
Reporting tools can be further divided into production reporting tools and desktop report writer.
Managed query tools:
This kind of access tool shields end users from the complexities of SQL and database structure by inserting a metalayer between users and the database.
Sometimes built-in graphical and analytical tools do not satisfy the analytical needs of an organization. In such cases, custom reports are developed using Application development tools.
Data mining is the process of discovering meaningful new correlations, patterns, and trends by mining large amounts of data. Data mining tools are used to automate this process.
These tools are based on the concepts of a multidimensional database. They allow users to analyse the data using elaborate and complex multidimensional views.
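A multidimensional view can be approximated in plain Python by aggregating a measure over a chosen set of dimensions. The sketch below is an illustration under assumed data: the `FACTS` rows, the dimension names, and the `roll_up` helper are hypothetical and do not correspond to any specific OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions and one measure.
FACTS = [
    {"region": "East", "product": "A", "quarter": "Q1", "sales": 100},
    {"region": "East", "product": "B", "quarter": "Q1", "sales": 50},
    {"region": "West", "product": "A", "quarter": "Q1", "sales": 70},
    {"region": "West", "product": "A", "quarter": "Q2", "sales": 30},
]


def roll_up(facts, dimensions, measure="sales"):
    """Sum the measure for every combination of the requested dimensions."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


by_region = roll_up(FACTS, ["region"])
by_region_quarter = roll_up(FACTS, ["region", "quarter"])
```

Changing the list of dimensions changes the view of the same facts: `by_region` is a coarse summary, while `by_region_quarter` drills down one level further.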
The data warehouse bus determines the flow of data in your warehouse. The data flow in a data warehouse can be categorized as inflow, upflow, downflow, outflow, and meta flow.
When designing a data bus, one needs to consider the dimensions and facts shared across data marts.
A data mart is an access layer that is used to get data out to the users. It is presented as an alternative to a large data warehouse because it takes less time and money to build. However, there is no standard definition of a data mart, and the term's meaning differs from person to person.
Put simply, a data mart is a subsidiary of a data warehouse. A data mart holds a partition of the data that is created for a specific group of users.
Data marts can be created in the same database as the data warehouse or in a physically separate database.
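One simple way to picture a data mart that lives in the same database as the warehouse is a view that exposes only the rows one group of users needs. The sketch below uses an in-memory SQLite database; the `warehouse_sales` table, the `marketing_mart` view, and the sample rows are hypothetical names and data chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table holding data for several departments.
conn.execute("CREATE TABLE warehouse_sales (department TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("marketing", "East", 10.0), ("finance", "East", 20.0), ("marketing", "West", 5.0)],
)

# The data mart: a partition of the warehouse for one group of users.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT region, amount FROM warehouse_sales WHERE department = 'marketing'"
)

mart_rows = conn.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY region"
).fetchall()
```

A physically separate data mart would instead copy these rows into its own database, trading extra storage and load jobs for isolation from the main warehouse.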
To design Data Warehouse Architecture, you need to follow below given best practices: