Data integration is the practice of consolidating data that resides in different, disparate sources in order to give users a holistic view of the data. It is considered one of the most important steps in data preprocessing. Data architects design the process that automates connecting and routing data from source systems to target systems. This can be achieved with techniques such as the following:
- Data Replication: copies of the data are maintained in multiple databases, typically for backup purposes.
- Change Data Capture: data changes are identified and captured in source databases in real time and applied to data warehouses and other data storage repositories.
- Extract, Transform and Load (ETL): data is extracted from different sources, transformed and harmonized, and then loaded into a data warehouse or database.
- Extract, Load and Transform (ELT): data is first loaded into a big data system and transformed later.
- Data Virtualization: data from different systems is combined virtually to create a unified view, without physically moving it.
- Streaming Data Integration: data is integrated in real time as it flows from multiple streaming systems.
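To make the ETL pattern above concrete, here is a minimal sketch using only the Python standard library. The source systems, column names, and data are invented for illustration: two CSV "extracts" are parsed, harmonized and joined on a key, and loaded into an in-memory SQLite table standing in for a warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical CSV exports from two source systems (invented for illustration).
CRM_CSV = "customer_id,name\n1,Alice\n2,Bob\n"
BILLING_CSV = "customer_id,total\n1,120.50\n2,80.00\n"

def extract(text):
    """Extract: parse a CSV export into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(customers, invoices):
    """Transform: harmonize types and join the two sources on customer_id."""
    totals = {row["customer_id"]: float(row["total"]) for row in invoices}
    return [
        (int(row["customer_id"]), row["name"], totals.get(row["customer_id"], 0.0))
        for row in customers
    ]

def load(rows, conn):
    """Load: write the harmonized rows into a warehouse table."""
    conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, total REAL)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(CRM_CSV), extract(BILLING_CSV)), conn)
result = conn.execute("SELECT name, total FROM customer ORDER BY id").fetchall()
print(result)  # [('Alice', 120.5), ('Bob', 80.0)]
```

In a real pipeline the extract step would read from files, APIs, or databases, and the load step would target a persistent warehouse, but the three-stage shape stays the same.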
In this post, we will take a basic approach to integrating data using Python. This should give you a basic understanding of how to integrate different datasets during data pre-processing.
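As a first taste of what such integration looks like in Python, here is a small sketch using pandas, a common choice for this kind of work. The two datasets and their column names are made up for illustration: each stands in for an export from a different source system, and a left join produces the unified view.

```python
import pandas as pd

# Two small example datasets standing in for exports from different systems
# (column names and values are illustrative, not from a real source).
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Alice", "Bob", "Carol"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [40.0, 25.0, 10.0]})

# Integrate the two sources into one unified view. A left join keeps every
# customer, even those without any matching orders (their amount becomes NaN).
merged = customers.merge(orders, on="customer_id", how="left")
print(merged)
```

Choosing the join type (`how="left"`, `"inner"`, `"outer"`) is one of the key decisions when consolidating datasets: it determines whether rows that exist in only one source survive the integration.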
This post covers one of the steps involved in data pre-processing. Refer to this post to understand and implement data cleaning, another crucial step in data pre-processing. You can also refer to this post to learn how to identify outliers in a dataset.