Data integration is the process of consolidating data from different (heterogeneous) sources into a unified dataset. It can be a simple step within the data pre-processing phase of machine learning modelling, or a comprehensive task treated as an independent project in its own right. In this post we focus on data integration as a step in data pre-processing. To learn about the comprehensive data integration process, refer to our previous series of posts here.
Introduction to Data Integration
An organization's data is generated and stored in disparate data sources such as business applications, relational databases, non-relational data stores and flat files. More often than not, the business needs a 360-degree view built from the data in all of these sources. This requirement calls for a data integration process. According to Wikipedia, data integration is the process of combining data from different sources into a unified format that can then be used for analytics and decision making.
In the data science life cycle, data integration is usually performed by data architects and data engineers, the professionals responsible for designing and developing data pipelines. In data pre-processing, data integration can be achieved programmatically within the same tool used for modelling, such as the Python programming language.
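As a rough illustration of programmatic integration in Python, the sketch below combines three small in-memory sources into one record per customer. The source names, fields and values (CRM_CSV, SUPPORT_JSON, the orders table) are made up for this example, and only the standard library is used; a real pipeline would more likely read from files or live databases.

```python
import csv
import io
import json
import sqlite3

# Hypothetical source 1: a flat file (CSV) exported from a business application.
CRM_CSV = """customer_id,name
1,Alice
2,Bob
3,Carol
"""

# Hypothetical source 2: JSON records from a non-relational store.
SUPPORT_JSON = '[{"customer_id": 1, "open_tickets": 2}, {"customer_id": 3, "open_tickets": 0}]'


def load_orders_db():
    """Build a hypothetical relational source as an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 120.0), (1, 80.0), (2, 45.5)],
    )
    return conn


def integrate():
    """Combine the three sources into one record per customer."""
    # Start from the CSV rows; customer_id arrives as text, so cast it to int.
    unified = {}
    for row in csv.DictReader(io.StringIO(CRM_CSV)):
        cid = int(row["customer_id"])
        unified[cid] = {"customer_id": cid, "name": row["name"],
                        "total_spent": 0.0, "open_tickets": None}

    # Add order totals aggregated in the relational source.
    conn = load_orders_db()
    query = "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    for cid, total in conn.execute(query):
        if cid in unified:
            unified[cid]["total_spent"] = total
    conn.close()

    # Add ticket counts from the JSON records; customers without one keep None.
    for rec in json.loads(SUPPORT_JSON):
        if rec["customer_id"] in unified:
            unified[rec["customer_id"]]["open_tickets"] = rec["open_tickets"]

    return [unified[cid] for cid in sorted(unified)]


unified_dataset = integrate()
for record in unified_dataset:
    print(record)
```

The customer ID acts as the shared key here. In practice, agreeing on such a key across sources, and on its data type, is often the hardest part of the integration step.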
Factors to Consider When Selecting Data Integration Tools
Different data integration tools have different capabilities, and the choice of a particular tool depends on the organization's data landscape and business use case. Below are general factors to consider when selecting a data integration tool.
- Supported Data Sources. Pulling data from a source system into a target system is a key function of a data integration tool, and different tools support different data sources. While most tools support common sources such as the major relational databases (Oracle, Microsoft SQL Server, IBM DB2, MySQL and PostgreSQL, among others), it is also important to consider newer sources such as those holding unstructured data. An ideal data integration tool supports a wide range of data sources and can be extended to new ones in future. A good data integration tool grows with the organization's data strategy.
- Data Transformation Capabilities. Data transformation is at the core of an ideal data integration tool. A good tool comes with many built-in transformation functions and also allows custom transformation functions to be added.
- Scalability. An organization's data keeps growing, so the data integration tool should be able to scale with it. In this era of big data, organisations are looking for a robust data integration tool that can handle big data for analytical purposes.
- Real-time Integration. Many organisations want to leverage the power of analytics in real time in order to make decisions faster and avert risk in time. A tool that supports both batch and real-time data integration and processing provides more value to the organisation.
- Security. When integrating sensitive data, organisations look for a secure data integration tool.
- Cost. The choice of a data integration tool can be influenced by the cost of acquiring the tool and of the skilled personnel required to maintain the data pipeline. Cost also raises the choice between open-source and proprietary tools, which can in turn depend on how easy the tool is to use. In addition, different data integration architectures such as ETL and ELT require different resources to set up. ETL requires an additional staging environment for data transformation, while ELT transforms the data inside the target system, which can make ELT cheaper to set up.
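To make the ETL and ELT distinction in the last factor more concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the target system. The table names and the RAW_SALES records are hypothetical. In the ETL path the cleaning happens in application code before loading, while in the ELT path the raw rows are loaded first and cleaned with SQL inside the target.

```python
import sqlite3

# Hypothetical raw records extracted from a source system.
RAW_SALES = [("north", " 100 "), ("south", "250"), ("north", "50")]


def run_etl(conn):
    """ETL: transform outside the target, then load only the cleaned rows."""
    # The transform step runs in application code (standing in for a staging area).
    cleaned = [(region.upper(), float(amount.strip())) for region, amount in RAW_SALES]
    conn.execute("CREATE TABLE sales_etl (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def run_elt(conn):
    """ELT: load the raw rows as-is, then transform inside the target with SQL."""
    conn.execute("CREATE TABLE sales_raw (region TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", RAW_SALES)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT UPPER(region) AS region, CAST(TRIM(amount) AS REAL) AS amount "
        "FROM sales_raw"
    )


def totals(conn, table):
    """Return the total amount per region for the given table name."""
    query = f"SELECT region, SUM(amount) FROM {table} GROUP BY region ORDER BY region"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_totals = totals(conn, "sales_etl")
elt_totals = totals(conn, "sales_elt")
print(etl_totals)
print(elt_totals)
```

Both paths end with the same cleaned data. The difference lies in where the transformation work runs, and therefore in which environment has to be provisioned for it.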
The flexibility of a programming language makes it possible to address many of the above factors.
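As one possible illustration of that flexibility, the sketch below shows how support for new data sources and custom transformation functions might be added in plain Python. The READERS registry, the register_reader decorator and the two transformation functions are hypothetical names invented for this example, not part of any particular integration tool.

```python
import csv
import io
import json

# Registry mapping a source type to the function that reads it.
READERS = {}


def register_reader(source_type):
    """Decorator that adds a reader for a new source type to the registry."""
    def decorator(func):
        READERS[source_type] = func
        return func
    return decorator


@register_reader("csv")
def read_csv(text):
    return list(csv.DictReader(io.StringIO(text)))


@register_reader("json")
def read_json(text):
    return json.loads(text)


def normalize_email(record):
    """Custom transformation: trim and lower-case the email field."""
    return {**record, "email": record["email"].strip().lower()}


def add_domain(record):
    """Custom transformation: derive the email domain as a new field."""
    return {**record, "domain": record["email"].split("@")[1]}


def integrate(sources, transforms):
    """Read every (source_type, payload) pair and apply each transform in order."""
    unified = []
    for source_type, payload in sources:
        for record in READERS[source_type](payload):
            for transform in transforms:
                record = transform(record)
            unified.append(record)
    return unified


# Hypothetical payloads standing in for two different source systems.
sources = [
    ("csv", "email\n Alice@Example.com \n"),
    ("json", '[{"email": "BOB@example.org"}]'),
]
result = integrate(sources, [normalize_email, add_domain])
print(result)
```

Supporting a new source then means registering one more reader function, and adding a transformation means appending one more function to the list passed to integrate.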
Conclusion
Data integration is key to bringing together data from different sources, such as business applications, databases and edge devices, into a single dataset that the business can use in decision making. In this post we have briefly looked at data integration and the factors to consider when selecting a data integration tool. In the next post we will look at data transformation. To learn about data cleaning, check our previous post here.