Data integration platforms every developer should understand
Understanding modern data practices and machine learning technologies is vital for software developers who want to create business value
Working with large volumes of data is a lot like developing software. Both require a good understanding of what end-users need, knowledge of how to implement solutions, and agile practices to iterate and improve the results. Software development and data practices both require technology platforms, coding practices, devops methodologies, and nimble infrastructure to be instituted and ready to meet business needs.
Data scientists, dataops engineers, and data engineers share many technologies and practices with software developers, and yet there are many differences. While attending the 2019 Strata Data Conference in New York, I looked at the methodologies, platforms, and solutions presented there through the dual lenses of a software developer and a data engineer.
Getting data ready for consumption
Application developers working with modest amounts of data often implement the upfront data integration, formatting, and storage through scripts, database stored procedures, and other coding options. It’s a straightforward approach to get past the required plumbing and to have data ready to be supported by a microservice, shared through APIs, or consumed by an end-user application.
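The script-style approach described above can be sketched in a few lines. This is a minimal illustration, not a recommended pattern: the CSV feed, table layout, and field names are all hypothetical, and SQLite stands in for whatever datastore the application uses.

```python
import csv
import io
import sqlite3

# Hypothetical inline feed; in practice this would come from a file or an API.
RAW_FEED = """order_id,amount,currency
1001,19.99,USD
1002,5.00,usd
1003,,USD
"""

def load_orders(raw_csv: str, conn: sqlite3.Connection) -> int:
    """Parse, lightly clean, and store rows -- the 'upfront plumbing'."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, currency TEXT)"
    )
    loaded = 0
    for row in csv.DictReader(io.StringIO(raw_csv)):
        if not row["amount"]:  # skip rows missing a required field
            continue
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)",
            (int(row["order_id"]), float(row["amount"]), row["currency"].upper()),
        )
        loaded += 1
    conn.commit()
    return loaded

conn = sqlite3.connect(":memory:")
rows_loaded = load_orders(RAW_FEED, conn)  # 2 of 3 rows survive cleansing
```

This gets data flowing quickly, but every cleansing rule (the empty-amount skip, the currency normalization) is buried in application code, which is exactly the drawback the following sections address.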
However, because delivering applications is the primary business objective and therefore the focus of full-stack developers, the engineering effort applied to ingesting, processing, and storing the data is often minimized.
On the other hand, data scientists and dataops engineers see the world of loading data very differently. First, they understand garbage in, garbage out principles all too well and know that the analytics, data visualizations, machine learning models, and other data products are compromised and potentially useless if the data loaded isn’t cleansed and processed appropriately.
Also, if data isn’t stored optimally, analytics work becomes less efficient and query performance can degrade. Query a relational database through too many joins, or process a large time-series database that isn’t partitioned, and both productivity and performance suffer.
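The partitioning point can be illustrated with plain Python as a stand-in for a real time-series store: an unpartitioned query scans every record, while a day-partitioned layout touches only one bucket. The readings and sensor names below are invented for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sensor readings: (day, sensor_id, value).
readings = [
    (date(2019, 9, 23), "s1", 20.5),
    (date(2019, 9, 23), "s2", 21.1),
    (date(2019, 9, 24), "s1", 19.8),
]

def daily_avg_scan(day):
    """Unpartitioned: every query scans the full dataset."""
    vals = [v for d, _, v in readings if d == day]
    return sum(vals) / len(vals)

# Partitioned by day: each query reads only its own bucket.
partitions = defaultdict(list)
for d, _, value in readings:
    partitions[d].append(value)

def daily_avg_partitioned(day):
    vals = partitions[day]
    return sum(vals) / len(vals)
```

With three rows the difference is invisible, but the scan version does work proportional to the whole table on every query, while the partitioned version does work proportional to one day's data, which is the gap that hurts at scale.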
Data teams will always consider the data source, type of data, and requirements to support the volume, performance, reliability, and security when considering how best to ingest data. They will also consider the types of data cleansing and enrichment that are needed before data is ready for consumption.
In summary, they invest significant efforts to make sure data loading meets today’s requirements, is extendable to support changing data formats, and is scalable to support growing volumes of data.
Data integration platforms are not one size fits all
Like software development platforms, there are many different types of data ingestion platforms. Software developers should be familiar with the following basic types of data integration:
- ETL (extraction, transformation, and loading) platforms have been around for a while and are traditionally used to batch process data movements between different enterprise systems. Informatica, Talend, IBM, Oracle, and Microsoft all offer ETL technologies.
- When data ingestion must be done in real time or near real time for IoT and other applications, platforms such as Kafka and Spark or event-driven architectures such as Vantiq are better options.
- Organizations with many business analysts working with data may use data prep technologies to load spreadsheets and other smaller data sources. Tools from Tableau, Alteryx, and Trifacta all provide self-service data ingestion and processing capabilities that can be used by business users with little or no coding required.
- When companies recognize the need for proactive steps to cleanse data or establish master data records, there are open source platforms such as HoloClean, and data quality and mastering platforms such as Reltio, Tamr, and Ataccama. ETL, data prep, and other data integration platforms also have data quality capabilities.
- Large enterprises and those with many data sources spread across multiple clouds and data centers may look at Cloudera Data Platform, SAP Data Intelligence, or InfoWorks Data Operations and Orchestration System. These platforms work across numerous databases and big data platforms and help to virtualize multiple data sources, ease data integration processes, and implement required data lineage and data governance.
Many of these platforms offer visual programming capabilities so that data pipelines can be developed, managed, and extended. For example, a pipeline might start with IoT sensor data collected by Kafka, join the data with other data sources, ship to a data lake, and then push into analytics platforms.
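The pipeline just described (ingest sensor events, join with reference data, land in a data lake, push to analytics) can be sketched as a sequence of plain Python stages. Everything here is an in-memory stand-in: the lists replace Kafka and object storage, and the sensor registry and 30-degree alert threshold are invented for the example.

```python
# Stand-in for events arriving from a Kafka topic.
sensor_events = [
    {"sensor_id": "s1", "temp_c": 20.5},
    {"sensor_id": "s2", "temp_c": 31.0},
]
# Reference data to join against (hypothetical sensor locations).
sensor_registry = {"s1": "warehouse", "s2": "rooftop"}

def enrich(event):
    """Join an event with its reference data."""
    return {**event, "location": sensor_registry.get(event["sensor_id"], "unknown")}

data_lake = []       # stand-in for object storage
analytics_feed = []  # stand-in for the analytics platform's intake

for event in sensor_events:
    record = enrich(event)
    data_lake.append(record)      # land every enriched record
    if record["temp_c"] > 30:     # push only alert-worthy records downstream
        analytics_feed.append(record)
```

Visual pipeline tools generate and manage stages like these so teams can extend a pipeline without hand-maintaining this glue code.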
Where can everyone find all this data?
As the number of data sources, pipelines, and management platforms increases, it becomes more difficult for IT to manage them and for data consumers to find what they need. Knowing where to search for data sources is just a starting point; data consumers also need the descriptions, metadata, status, usage rights, and subject matter experts for the available data sources.
The data catalog is a capability offered by many data platforms as a centralized resource for business analysts, data scientists, data engineers, and software developers to find data sources and record information about them. These can be compelling enterprise tools to help share and improve data for analytics or to be used in applications.
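A catalog entry is essentially a structured record of the fields mentioned above. The sketch below is purely illustrative; the field names and the keyword search are assumptions, and real catalogs track far richer metadata and lineage.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Illustrative catalog record for one data source."""
    name: str
    description: str
    status: str                 # e.g. "certified", "draft", "deprecated"
    usage_rights: str           # e.g. "internal-only"
    subject_matter_experts: list = field(default_factory=list)

catalog = {}

def register(entry: CatalogEntry):
    catalog[entry.name] = entry

def search(keyword: str):
    """Find entries whose name or description mentions the keyword."""
    keyword = keyword.lower()
    return [e for e in catalog.values()
            if keyword in (e.name + " " + e.description).lower()]

register(CatalogEntry("orders_db", "Daily e-commerce orders", "certified",
                      "internal-only", ["data engineering team"]))
```

Even this toy version shows the value: a developer can discover a source, see whether it is certified for use, and know whom to ask about it before wiring it into an application.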
Using data properly is everyone’s responsibility
All these tools and capabilities represent new responsibilities and opportunities for software developers.
The responsibility begins with using data integration tools as the primary way to bring in new data sources and use them in applications. The simpler approaches of developing scripts and encoding data cleansing operations or business rules directly in application code have significant drawbacks compared to the capabilities of data integration platforms.
There are still plenty of coding opportunities since data integration platforms require extensions, rules, and configurations specific to data sources. However, the platforms offer robust ways to manage this code, handle exceptions, and provide other operating capabilities that challenge any do-it-yourself approach.
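The kind of managed, per-source code described here can be sketched as a registry of transformation rules with centralized exception handling. The source names, the rule, and the exception-queue design are all assumptions made for the example, not any particular platform's API.

```python
transforms = {}  # per-source transformation rules, registered by name

def rule(source):
    """Decorator registering a cleansing rule for one (hypothetical) source."""
    def wrap(fn):
        transforms[source] = fn
        return fn
    return wrap

@rule("crm")
def clean_crm(row):
    # Source-specific logic: normalize the contact email.
    return {"email": row["email"].strip().lower()}

def run(source, rows):
    """Apply the source's rule; route failures to an exception queue
    instead of aborting the whole load."""
    ok, failed = [], []
    for row in rows:
        try:
            ok.append(transforms[source](row))
        except (KeyError, AttributeError) as exc:
            failed.append((row, repr(exc)))
    return ok, failed
```

The point of the pattern is that the custom logic stays small and source-specific, while registration, error routing, and reprocessing are handled once by the framework rather than reinvented in every script.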
The data catalog then offers developers a significant opportunity to use more data sources and analytics or machine learning models in software applications. In fact, one of the primary goals of many organizations is to be data-driven and enable employees, customers, and partners to use appropriate and permissioned data and analytics. What better way than for software development teams to embed the available data, analytics, or the machine learning models directly in applications?
Ask data scientists, and one of their primary objectives is to see their machine learning models integrated into applications. Embedding analytics for the benefit of end-users is how data scientists and software developers collaborate and deliver business value—and that’s a win for everybody.