What is a data pipeline?

28th March 2022 - Reading time: 1 min

A data pipeline is code that moves data from one or more sources to a destination. It may also transform the data along the way to make it suitable for analysis and informed business decisions.

To make a data-driven business decision, an organization first needs to retrieve data from somewhere. It could come from its own production database, a business system such as Salesforce or an Oracle ERP, or any other place that holds data. This is called a source.

The code is written by a software engineer in their preferred programming language. Common choices in the data world are Python, Scala, and Java, though any language can be used. When the code runs, it extracts data from the sources and moves it to the destination.
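As a minimal sketch, the extract step could look something like this in Python. The orders table, its columns, and the file paths are made up for illustration, and the built-in sqlite3 module stands in for a real production database:

```python
import csv
import sqlite3

def extract_orders(source_db: str, out_csv: str) -> None:
    """Extract all rows from the source's orders table into a CSV file."""
    connection = sqlite3.connect(source_db)
    try:
        cursor = connection.execute(
            "SELECT id, customer_id, amount, created_at FROM orders"
        )
        with open(out_csv, "w", newline="") as f:
            writer = csv.writer(f)
            # Write a header row, then stream every row from the cursor.
            writer.writerow([column[0] for column in cursor.description])
            writer.writerows(cursor)
    finally:
        connection.close()
```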

The destination is where the organization wants the data to end up. It could be a data warehouse (Snowflake, BigQuery, or Redshift), cloud file storage, or anywhere else in the data infrastructure. Business intelligence tools (Tableau, QlikView, Looker) usually consume the data from the destination.
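Continuing the sketch, the load step reads the extracted file and writes it into a table at the destination. Here sqlite3 again stands in for a real warehouse such as Snowflake or BigQuery, which would have its own client library; the table and columns are the same hypothetical ones as above:

```python
import csv
import sqlite3

def load_orders(in_csv: str, warehouse_db: str) -> None:
    """Load the extracted CSV into the destination's orders table."""
    connection = sqlite3.connect(warehouse_db)
    try:
        connection.execute(
            "CREATE TABLE IF NOT EXISTS orders "
            "(id INTEGER, customer_id INTEGER, amount REAL, created_at TEXT)"
        )
        with open(in_csv, newline="") as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row written by the extract step
            connection.executemany(
                "INSERT INTO orders VALUES (?, ?, ?, ?)", reader
            )
        connection.commit()
    finally:
        connection.close()
```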

An organization can also write code that transforms the source data before it reaches the destination. Typical transformations include cleaning, filtering, and aggregating. This gets the data into a shape that is easy to understand and act on, which is what makes it useful for analysis and data-driven decisions.
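As a small, self-contained example of those three transformations, the function below cleans out rows with missing amounts, filters out non-positive ones, and aggregates revenue per customer. The field names and the "revenue per customer" metric are invented for illustration:

```python
from collections import defaultdict

def transform(rows: list[dict]) -> dict:
    """Clean, filter, and aggregate raw order rows into revenue per customer."""
    revenue = defaultdict(float)
    for row in rows:
        if row["amount"] is None:   # cleaning: drop rows missing an amount
            continue
        amount = float(row["amount"])
        if amount <= 0:             # filtering: skip refunds and zero orders
            continue
        revenue[row["customer_id"]] += amount  # aggregating: sum per customer
    return dict(revenue)

raw = [
    {"customer_id": 1, "amount": "120.00"},
    {"customer_id": 1, "amount": "-30.00"},  # refund, filtered out
    {"customer_id": 2, "amount": None},      # missing value, cleaned out
    {"customer_id": 2, "amount": "45.50"},
]
print(transform(raw))  # {1: 120.0, 2: 45.5}
```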

Click here to find out how your organization can run, orchestrate, schedule, and monitor your data ecosystem in STOIX. Be up and running in less than 5 minutes!