Cost efficient data infrastructure

Without neglecting developer happiness and efficiency

13th September 2023 - Reading time: 8 min

Many customers and companies we talk to start their data journey off with an exploratory view on analytics such as customer behavior, product usage, sales and marketing. As the complexity increases, costs start adding up. At this stage very few companies spend time to make the process and infrastructure more lean. It's always difficult to calculate and predict return on investment for exploration.

This post includes actionable insights on how to lower costs and will focus on the area of executing data manipulation (statistics), data runtimes (ML) and data migration (often referred to as ETL, ELT, or various other terms). We group all of them into the term data jobs. Let's explore how you can reduce cost while still delivering high value.

Financial books for infrastructure costs.

Jump down to “Recommendations” to get a summary of cost savings!

Interacting with the data

If you're interested in learning more about data concepts and insights, we've written about that in some of our other posts such as “Unlocking Insights” and “Expertise to valuable insights”.

The first step in a data journey begins with understanding three aspects: what is the data source, where is the data going and how should it look at arrival. The source, destination and data contract. From a data engineering perspective it can be a migration job. From the perspective of an analyst it can be a report or theory needing supporting facts. It could also be a machine learning specialist producing or using a model such as ChatGPT to support business decisions or customer products.

If we can manage all the work within a database, the most common approach today is pure SQL. Databases have become powerful enough to handle the range of small to quite large data amounts at reasonable pricing. If either the source or the destination is separate from each other, the concepts of files, pipelines or code are introduced. From a conceptual level very little differs between an SQL and code approach other than where the actual handling of data happens, hardware bottlenecks and pricing. This is a bit oversimplified for certain areas with larger computational or GPU/TPU needs, such as ML or in-depth analytics.

Databases specialize in working with data. When within the domain of your database, SQL is one of the better approaches for both time and cost. At least until you handle a huge amount of data, which most companies won't be experiencing. As for working with code, there are many frameworks and tools but for languages Python, Scala/Java and R are common with Rust slowly gaining momentum.

Infrastructure and hardware

Regardless of approach for your data journey, the common factor is going from concept to execution and actually running the data job. There's a lot of various solutions for both SQL and code with pros and cons depending on data requirements. Processing and storage are areas where costs quickly rack up if a subpar solution or approach is used. Let's explore the various approaches, when to use them and costs associated with the solutions.

Phase 1 - Starting the data journey

In the starting phase of a data journey, a lot of the work is exploratory. The data sets are small and the team is learning the domain as well as data processes and tooling. High velocity and quick iterations are key together with the ability to quickly discard invalid approaches in favor of better alternatives.

From a technical point of view this requires non-rigid processes, allowing quick navigation of changes and an infrastructure that cost-wise makes sense for the size. A common approach is to first load data into the database as a replica of the source structure within legal constraints and retention, and then stay within the database all the way to visualization. It's an approach that can handle quite a large data load and is quick and easy to get up and running. A recommendation is some kind of relational database where you periodically ingest new data from the business. TimescaleDB is a great choice when using time series data, otherwise most commercial database solutions are price worthy, such as PostgreSQL, MySQL, MSSQL.

For ingesting data into the database, look at pre-build solutions for data migration such as Airbyte or Meltano. If those do not cover your case, the most common alternative is to start working with Python and using SDKs against both sides. STOIX Orchestrator is a product handling both the migration and scheduling and we usually recommend that for saving time and cost as well as when growing towards Phase 2. For workloads consuming information, such as ML, the cost aspects to keep in mind are network, CPU, memory and GPU cost. Serverless is a viable option with generous free tiers. You can also look at managed services in the cloud or run it on a local machine. Usually time spent by employees in this phase is more expensive than the hardware and cloud costs.

For SQL, a common tool for interaction and working with data modeling is dbt or Dataform. The community and tooling around dbt is generally more mature while Dataform has good integration with Google Cloud Platform and BigQuery.

Put most of your effort and models centered in the database and SQL to have a high velocity, quick iterations and great initial progress.

Phase 2 - Growing as a data department

You have found valuable insights and efforts in the company around data that are growing, great! With more data and requirements, the complexity and reliability of produced value quickly increases. This is inevitable and brings concerns that changes from employees might impact business negatively, impeding velocity and progress of work. From a cost and efficiency output, it's important to mitigate those risks early on as well as reducing knowledge required around a growing infrastructure and focusing on the important issues at hand - data analysis.

View through glasses to a blurred background of code on screens.

Your initial focus should be continuous deployment of data jobs, versioning of data models and testing to make sure you keep the data contracts. All of these need to be automated and require bare minimum of interaction and knowledge from your team. E.g: By interacting simply through git, you should be able to test the change, produce artifacts and deploy changes. Your team should spend next to zero effort learning tooling to do this, rather focus on producing quality data jobs. With this you save an immense amount of time and therefore become more cost efficient.

During the growth phase, you have a big variation in infrastructure needs. A lot of computational power is required during a select few hours of the day and otherwise time is spent understanding the output. For this you should have an infrastructure that works well with growth in mind. Serverless starts to lean towards high cost for the yield and a recommendation is to look towards more affordable approaches with slower scaling. Scaling is of importance for cutting costs during the hours when work is not being performed.

For computational work such as Spark or your custom made jobs, Kubernetes is a great option. If you have a Kubernetes guru you can work with node and pod scaling. Google Cloud Platform also offers GKE Autopilot which abstracts away the concept and you pay per usage. STOIX Orchestrator works out of box with Kubernetes and scaling to reduce your cost, and if you want to learn more we can talk about your needs over a free consultation session. Keep the artifacts and storage close to the machines to save on network traffic costs. The desired outcome is to have a platform that works with your varied needs and keeps the costs low without interfering with the team.

For SQL our recommendation is to look at BigQuery if you live in the Google Cloud Platform environment, otherwise Snowflake is a great choice. For BigQuery you pay for what you use, so moving jobs that are recurring or moving large volumes of data to code can save you a lot of money. With Snowflake, make sure you configure your machines to scale down between workloads to cut costs for hours you are idle. We wrote a blog post about cutting BigQuery costs you can read more about here.

Phase 3 - Data centered or specialized domain

When you reach huge amounts of data you usually have both an infrastructure team as well as data teams working on both cost savings and developer experience. So we won't dive too deep into this phase and instead list some of our learnings.

Simplify interaction between data teams and datasets. Both governance, access management and contracts. STOIX Orchestrator has dependencies between datasets built in the product to easier collaborate between data teams.
If you're moving a lot of data in BigQuery, look at capacity-based pricing rather than on-demand pricing based on your workloads.
Prefer migrating data jobs you run frequently or that lift heavy amounts of data from SQL to code. Look into tooling made for large data volumes, such as Dataflow.
Store data efficiently and compressed, look at the need to frequently access data and perhaps move historical data to colder storages.
Look over network traffic costs for ingress/egress traffic. A simple factor of migrating artifacts or storage to reduce network costs can make large differences at larger volumes.
Look over business needs of data and try to reduce and filter output or look over retention periods.

Recommendations

When we get the question: “What's your best recommendation for working with data?” We usually ask: What are your business needs and how mature are you in your journey?

Data is a supporting function to the business that keeps changing and adapting over time. And in effect, the needs on infrastructure supporting the business change with it. Above, we make recommendations for companies working from a small to large scale in three phases. The main takeaways are: Start small, stay fast and focus on output and value. Other recommendations are:

Most of your time should be focused on solving the data problems.
If you start spending a large amount of time on supporting parts (building, deploying, running tests, etc) automate it.
Don't spend money on data infrastructure when it's not used.
Don't choose tools for workloads not matching your company. There are tools that are great for small volumes of data that are crazy slow or expensive when subjected to larger volumes. And the other way around, with tooling for larger volumes of data you tend to get slower at exploration and more expensive for small amounts of data.
Learn from others that have done your journey and ask them specifically what they did in your position and how it went. Experience will save you effort and money.

Want to hear more about how your processes or infrastructure can be cost efficient? Contact us !