A Data Lake is a storage repository that can hold large amounts of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limits on account size or file size. It offers high data quantity to improve analytic performance and native integration.

A Data Lake is like a large container, much like a real lake fed by rivers. Just as a lake has multiple tributaries flowing in, a data lake has structured data, unstructured data, machine-to-machine data, and logs flowing through it in real time.

The Data Lake democratizes data and is a cost-effective way to store all of an organization's data for later processing. Research analysts can focus on finding meaningful patterns in the data rather than on the data itself.

Unlike a hierarchical data warehouse, where data is stored in files and folders, a data lake has a flat architecture. Every data element in a Data Lake is given a unique identifier and tagged with a set of metadata.
The main objective of building a data lake is to offer an unrefined view of data to data scientists.
Reasons for using Data Lake are:
The figure shows the architecture of a Business Data Lake. The lower levels represent data that is mostly at rest, while the upper levels show real-time transactional data. This data flows through the system with little or no latency. The following are important tiers in the Data Lake architecture:
The following are key Data Lake concepts that one needs to understand in order to fully grasp the Data Lake architecture.
Data ingestion allows connectors to get data from different data sources and load it into the Data Lake.
Data Ingestion supports:
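As a rough, hypothetical illustration of a batch ingestion step, the sketch below copies a local CSV file into a date-partitioned raw zone of the lake. The bucket name, paths, and the choice of boto3 as the object-store client are illustrative assumptions.

```python
# Minimal batch-ingestion sketch (hypothetical bucket and paths; boto3 assumed installed).
from datetime import date

import boto3  # AWS SDK for Python; any object-store client would do

s3 = boto3.client("s3")

def ingest_csv(local_path: str, dataset: str) -> str:
    """Copy a local CSV into a date-partitioned raw zone of the lake."""
    key = f"raw/{dataset}/ingest_date={date.today():%Y-%m-%d}/{local_path.split('/')[-1]}"
    s3.upload_file(local_path, "example-data-lake", key)  # bucket name is illustrative
    return key

# Example: ingest_csv("exports/orders.csv", "orders")
```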
Data storage should be scalable, offer cost-effective storage, and allow fast access for data exploration. It should support various data formats.
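To make the point about formats concrete, here is a minimal sketch (assuming pandas with pyarrow installed) that stores a columnar Parquet copy next to a raw CSV; the dataset and file names are made up.

```python
import pandas as pd  # pyarrow assumed installed for Parquet support

# A small stand-in for data that arrived as CSV.
df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 7.5, 3.2]})
df.to_csv("orders_raw.csv", index=False)  # row-oriented raw copy

# Store a columnar copy; Parquet is typically cheaper to scan during exploration.
pd.read_csv("orders_raw.csv").to_parquet("orders_curated.parquet", index=False)
```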
Data governance is the process of managing the availability, usability, security, and integrity of the data used in an organization.
Security needs to be implemented in every layer of the Data Lake, starting with storage, then unearthing, and finally consumption. The basic need is to stop access by unauthorized users. The lake should support different tools for accessing data, with easy-to-navigate GUIs and dashboards.
Authentication, Accounting, Authorization and Data Protection are some important features of data lake security.
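As a toy sketch of the authorization idea only (not a real access-control system), the example below checks a user's role against a per-zone permission map before a read is allowed; the roles and zone names are hypothetical.

```python
# Hypothetical role-based check for data lake zones.
PERMISSIONS = {
    "raw": {"data_engineer"},
    "curated": {"data_engineer", "data_scientist", "analyst"},
}

def can_read(role: str, zone: str) -> bool:
    """Return True only if the role is authorized for the given zone."""
    return role in PERMISSIONS.get(zone, set())

assert can_read("analyst", "curated")
assert not can_read("analyst", "raw")  # unauthorized access is denied
```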
Data quality is an essential component of Data Lake architecture. Data is used to extract business value, and extracting insights from poor-quality data leads to poor-quality insights.
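A minimal quality check could look like the sketch below, which reports null rates and duplicate keys for an ingested table before it is used for analysis; the column names are assumptions.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, key: str) -> dict:
    """Basic data-quality metrics: null rate per column and duplicate key count."""
    return {
        "null_rate": df.isna().mean().to_dict(),
        "duplicate_keys": int(df.duplicated(subset=[key]).sum()),
    }

# Example with hypothetical columns:
df = pd.DataFrame({"order_id": [1, 1, 2], "amount": [10.0, None, 7.5]})
print(quality_report(df, key="order_id"))
```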
Data discovery is another important stage before you can begin preparing data or analyzing it. In this stage, tagging is used to express an understanding of the data by organizing and interpreting the data ingested into the Data Lake.
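The tagging idea can be pictured as a lightweight catalog entry per dataset, as in the hypothetical sketch below; a production lake would use a metadata catalog service rather than an in-memory dictionary.

```python
# Toy in-memory catalog: each ingested dataset gets descriptive tags for discovery.
catalog: dict[str, dict] = {}

def tag_dataset(path: str, owner: str, tags: list[str]) -> None:
    catalog[path] = {"owner": owner, "tags": tags}

def discover(tag: str) -> list[str]:
    """Find datasets carrying a given tag."""
    return [path for path, meta in catalog.items() if tag in meta["tags"]]

tag_dataset("raw/orders/", owner="sales", tags=["orders", "pii"])
print(discover("pii"))  # ['raw/orders/']
```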
Two major data auditing tasks are tracking changes to key datasets and capturing how, when, and by whom those changes were made. Data auditing helps to evaluate risk and compliance.
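One simple way to picture auditing is an append-only log of changes to key datasets, as in this hypothetical sketch.

```python
from datetime import datetime, timezone

audit_log: list[dict] = []

def record_change(dataset: str, user: str, action: str) -> None:
    """Append who did what to which dataset, and when."""
    audit_log.append({
        "dataset": dataset,
        "user": user,
        "action": action,
        "at": datetime.now(timezone.utc).isoformat(),
    })

record_change("curated/orders", user="jdoe", action="overwrite")
```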
This component deals with the data's origins: where it moves over time and what happens to it along the way. It eases error correction in a data analytics process, from origin to destination.
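Lineage can be captured by recording, for each derived dataset, its sources and the transformation that produced it. The sketch below is a minimal hypothetical record, not any specific lineage tool.

```python
# Minimal lineage record: where a dataset came from and how it was produced.
lineage = {
    "curated/orders.parquet": {
        "sources": ["raw/orders/orders.csv"],
        "transformation": "csv_to_parquet",
        "produced_by": "nightly_batch_job",
    }
}

def upstream(dataset: str) -> list[str]:
    """Trace a dataset back to its source files."""
    return lineage.get(dataset, {}).get("sources", [])

print(upstream("curated/orders.parquet"))
```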
This is the beginning stage of data analysis. Identifying the right dataset is vital before starting data exploration.
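Once a candidate dataset is identified, an initial exploration pass often starts with simple profiling, as in the sketch below (pandas assumed; the dataset shown is made up, and in practice it would be read from the lake).

```python
import pandas as pd

# Hypothetical dataset pulled from the lake (columns are illustrative).
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [10.0, 7.5, 3.2],
    "country": ["US", "DE", None],
})

# First look: shape, column types, and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.describe(include="all"))
```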
All of these components need to work together so that the data lake can be built, evolved, and explored easily.
The definition of Data Lake maturity stages differs from one textbook to another, though the crux remains the same. The following maturity-stage definitions take a layman's point of view.
This first stage of data maturity involves improving the ability to transform and analyze data. Here, business owners need to find tools that match their skill set for obtaining more data and building analytical applications.

This second stage involves improving the ability to transform and analyze data. In this stage, companies use the tools most appropriate to their skill set. They start acquiring more data and building applications. Here, the capabilities of the enterprise data warehouse and the data lake are used together.

This step involves getting data and analytics into the hands of as many people as possible. In this stage, the data lake and the enterprise data warehouse start to work in unison, each playing its part in analytics.

In this maturity stage, enterprise capabilities are added to the Data Lake: adoption of information governance, information lifecycle management, and metadata management. Very few organizations reach this level of maturity today, but the number will increase in the future.
Challenges of building a data lake:
| Parameters | Data Lake | Data Warehouse |
|---|---|---|
| Data | Data lakes store everything. | Data warehouses focus only on business processes. |
| Processing | Data is mainly unprocessed. | Data is highly processed. |
| Type of Data | Can be unstructured, semi-structured, or structured. | Mostly tabular and structured. |
| Task | Shared data stewardship. | Optimized for data retrieval. |
| Agility | Highly agile; configure and reconfigure as needed. | Less agile than a data lake, with a fixed configuration. |
| Users | Mostly used by data scientists. | Widely used by business professionals. |
| Storage | Designed for low-cost storage. | Uses expensive storage that gives fast response times. |
| Security | Offers lesser control. | Allows better control of the data. |
| Replacement of EDW | Can be a source for the EDW. | Complementary to the EDW (not a replacement). |
| Schema | Schema on read (no predefined schemas). | Schema on write (predefined schemas). |
| Data Processing | Supports fast ingestion of new data. | Introducing new content is time-consuming. |
| Data Granularity | Data at a low level of detail or granularity. | Data at a summary or aggregated level of detail. |
| Tools | Can use open-source tools like Hadoop/MapReduce. | Mostly commercial tools. |
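The schema rows above can be illustrated with a short sketch: under schema-on-read, raw JSON records land in the lake exactly as produced, and a schema (expected columns and types) is imposed only when the data is read for analysis. The file and column names below are assumptions.

```python
import json
import pandas as pd

# Schema-on-read: raw records are stored untouched, whatever shape they arrive in.
raw_records = [
    {"order_id": 1, "amount": "10.0", "country": "US"},
    {"order_id": 2, "amount": "7.5"},  # a missing field is fine at write time
]
with open("orders.jsonl", "w") as f:
    for rec in raw_records:
        f.write(json.dumps(rec) + "\n")

# A schema (expected columns and types) is applied only at read time.
df = pd.read_json("orders.jsonl", lines=True)
df["amount"] = df["amount"].astype(float)
print(df)
```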
Here are some major benefits of using a Data Lake:

Risks of using a Data Lake: