Data Engineering · Dagster · Infrastructure · Delta Lake · Self-Hosted

Constraint-Driven Data Engineering: Architecting a Modern Data Platform on a Single €6 VM

Why you don't need a cloud budget to do rigorous data engineering. A deep dive into Headless Dagster, Local Delta Lake, and the art of constraint.

The default state of modern data engineering is often "over-provisioned."

For small projects, internal tools, or lean startups, jumping straight into an enterprise-grade setup - like Databricks, Azure Fabric, or a complex AWS architecture - is frequently a mistake. It’s not just about the cost; it’s about the complexity tax. You end up managing permissions, quotas, and vendor-specific quirks instead of shipping value.

If you follow standard tutorials for a "Modern Data Stack," you're often sold a shopping list designed for Fortune 500-scale problems:

  • Orchestration: Managed Airflow or Dagster Cloud ($$$)
  • Compute: Databricks Cluster or Snowflake Warehouse (min $2/hour)
  • Storage: S3/GCS buckets (cheap storage, expensive API calls)
  • Database: Managed Postgres (RDS/CloudSQL, min $30/mo)

Before you've processed a single byte of data, you're burning hundreds of dollars and drowning in infrastructure YAML. For a focused team or a single engineer, this is paralysis.

Real value comes from great engineering, not just buying expensive tools. When you know what you are doing, you can leverage self-hosted principles and self-reliance to achieve maximum value with minimum cost.

"Constraint-Driven Engineering" is my counter-philosophy. It asks: How much engineering rigor can we squeeze into a single, specialized unit of compute?

This case study documents QuantLens, a production-grade financial data platform running entirely on a single €6/month Hetzner Cloud VM. We didn't achieve this by cutting corners - we achieved it by architecting "Headless" patterns that remove the bloat while keeping the power. This isn't just about saving money; it's about Performance-Driven design. Local NVMe is often orders of magnitude faster than distributed storage for datasets under 1TB.

The Architecture: Monolith of Power

Most distributed systems are distributed for no reason. When your dataset is under 1TB (which is 99% of startups in their first years), the latency of network calls between your orchestrator, your compute, and your storage is just wasted entropy.

QuantLens brings it all home to localhost:

| Component | Cloud Standard | QuantLens (Constraint-Driven) |
| --- | --- | --- |
| Hardware | Multiple VMs / K8s Nodes | Hetzner CAX21 (4 vCPU, 8GB RAM, NVMe) |
| Orchestrator | Always-on Webserver + Daemon | Headless Dagster (Library + Cron) |
| Storage | S3 (Network Latency) | Local NVMe (Millions of IOPS) |
| Format | Parquet/Delta Lake | Delta Lake (ACID Transactions) |
| Compute | Spark Clusters | DuckDB + Polars (In-process) |

The result is a system that idles at roughly 0% CPU and under 150MB of RAM, yet bursts to process gigabytes of financial data in seconds.

Innovation 1: Headless Dagster

The biggest resource hog in a self-hosted data stack is the orchestrator itself. A standard Dagster deployment requires running:

  1. dagster-webserver (UI) - ~300MB RAM
  2. dagster-daemon (Scheduler) - ~200MB RAM

On a standard 8GB server, giving up ~500MB of your memory just to wait for work is unacceptable.

The "Just a Library" Pattern

We stripped Dagster down to its core: a Python library for defining assets and dependencies. The standard dagster-daemon isn't just RAM-hungry; a production deployment typically also needs a persistent Postgres database for run storage, adding maintenance overhead and another point of failure.

Instead, we use the operating system's native systemd and cron to handle lifecycle. We trade "platform convenience" for "architectural simplicity."
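
As a sketch of what cron actually invokes, the whole "scheduler" can be a ten-line script that materializes assets in-process. The module path and asset names below are assumptions mirroring the snippets later in this post, not the project's exact layout:

# run_daily.py (hypothetical cron entry point; a plain crontab line runs it)
from dagster import materialize

# Assumed import path; asset names mirror the pipeline snippets in this post.
from data_pipelines.assets import bronze_prices_df, silver_cleaned_prices

if __name__ == "__main__":
    # Runs entirely in-process: no daemon, no webserver, no Postgres.
    result = materialize([bronze_prices_df, silver_cleaned_prices])
    raise SystemExit(0 if result.success else 1)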

The "Trigger Service" is a tiny, 80MB Python process listening on a private port internally (guarded by UFW). When the Next.js frontend wants to run a pipeline, it hits this endpoint:

# infra/ansible/templates/pipeline-trigger-service.py.j2
import subprocess

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/trigger', methods=['POST'])
def trigger_pipeline():
    # Spawns a transient subprocess.
    # No daemon memory leak. No long-running webserver.
    cmd = ["dagster", "job", "execute", "-m", "data_pipelines"]
    subprocess.Popen(cmd)
    return jsonify({"status": "started"})
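
A quick way to exercise it from the box itself (the port number here is an assumption; UFW keeps the port unreachable from outside):

import urllib.request

# Fire an empty POST at the trigger endpoint; expect {"status": "started"} back.
req = urllib.request.Request("http://127.0.0.1:8099/trigger", data=b"", method="POST")
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())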

We get all the benefits of Dagster - type-checked assets, lineage graphs, retries - without the operational overhead. The "Webserver" UI is only spun up locally on my laptop when I'm developing, tunneling to the production DB to view logs. In production, it simply doesn't exist.

Innovation 2: The Local Lakehouse

Everyone loves S3, but have you tried NVMe?

Cloud storage is great for durability, but it introduces latency. Every read/write is an HTTP request. By treating the local filesystem as our "Object Store," we unlock insane I/O performance for free.

We use Delta Lake (via delta-rs) directly on the local disk. This gives us ACID transactions and time-travel, just like Databricks, but faster because there's no network hop.

# data-pipelines/assets/silver.py
from dagster import asset
from deltalake import write_deltalake

@asset
def silver_cleaned_prices(bronze_prices_df):
    # Writes purely to local NVMe.
    # The Delta log ensures transaction safety even if the VM crashes.
    table_path = "/data/silver/prices"
    write_deltalake(table_path, bronze_prices_df, mode="overwrite")
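
Time travel works the same way on local disk. A minimal read sketch with delta-rs (the version number is illustrative):

from deltalake import DeltaTable

# Load an earlier snapshot of the table straight off local NVMe.
dt = DeltaTable("/data/silver/prices", version=3)
df = dt.to_pandas()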

Scalability Note: "But what if the disk fills up?" That's where the Separation of Concerns pays off. Because we use the Delta Lake protocol, migrating to S3 is literally changing one environment variable: DATA_ROOT=s3://my-bucket. The code remains identical. We are choosing locality for performance, not strictly bound to it.
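
As a hedged sketch of that convention (the DATA_ROOT variable and the sample frame are assumptions, not the project's actual config):

import os

import polars as pl
from deltalake import write_deltalake

# Defaults to the local NVMe mount; point it at s3://... to go remote.
DATA_ROOT = os.environ.get("DATA_ROOT", "/data")

df = pl.DataFrame({"ticker": ["ACME"], "close": [101.5]})
write_deltalake(f"{DATA_ROOT}/silver/prices", df.to_arrow(), mode="append")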

Of course, a single VM is a single point of failure. We mitigate that deliberately:

  1. Daily Backups to Object Storage: Critical Delta Lake snapshots are pushed to Hetzner's Object Storage (S3-compatible) on a daily schedule (a sketch follows this list).
    • RPO (Recovery Point Objective): 24 hours.
  2. Ansible-based Disaster Recovery: The entire VM can be reprovisioned from scratch in under 10 minutes. All configuration is code. We lose state, not capability.
    • RTO (Recovery Time Objective): <15 minutes.
    • Bus Factor Mitigation: There is no "hidden knowledge." The Ansible playbooks serve as the living documentation for the entire stack. A new engineer reads the site.yml and understands the system.
  3. Idempotent Pipelines: Dagster assets are designed to be fully recomputable from source. Raw data is immutable. We can rebuild Silver/Gold layers from Bronze.
  4. Monitoring & Alerting: Basic uptime monitoring via external health checks catches issues before they become prolonged outages.
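
To make the backup step concrete, here is a hedged sketch of the nightly snapshot push. The bucket name, endpoint, and paths are assumptions; Hetzner's Object Storage speaks plain S3, so boto3 works against it:

import datetime
import tarfile

import boto3

# Tar up the Delta table directory (data files + _delta_log) into one snapshot.
snapshot = f"/tmp/silver-prices-{datetime.date.today()}.tar.gz"
with tarfile.open(snapshot, "w:gz") as tar:
    tar.add("/data/silver/prices", arcname="prices")

# Push it to an S3-compatible bucket; endpoint and bucket names are illustrative.
s3 = boto3.client("s3", endpoint_url="https://fsn1.your-objectstorage.com")
s3.upload_file(snapshot, "quantlens-backups", snapshot.rsplit("/", 1)[-1])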

The Cost of Rigor

I often hear developers say they use managed services because they "don't want to manage infrastructure." But the result is often a fragile understanding: you don't know why your pipeline is slow because you don't control the compute.

By implementing a "Constraint-Driven" architecture, you are forced to understand:

  • Linux memory management (OOM killing)
  • Docker networking security
  • ACID properties on file systems

This isn't just about saving €195/month. It's about maintaining Engineering Sovereignty. When you own the stack from the metal up, you can optimize constraints that others accept as defaults.

In a world of bloated abstractions, sometimes the most powerful move is to just run a script on a server.

Future Scalability: The Road Ahead

We are pragmatic, not dogmatic. This specific architecture has a shelf life, effectively defined by the size of the NVMe drive and vertical scaling limits.

When to Scale (The "Eject" Button):

  • Dataset Size: Exceeds 2TB (Standard Cloud VM storage limits).
  • Availability: Business requires 99.99% uptime (Cluster required).
  • Team Size: >5 Data Engineers working concurrently (Merge conflicts on local state).

Until then, we scale vertically:

  1. Object Storage Decoupling: The Delta Lake can be migrated to S3-compatible object storage (Hetzner, MinIO, or even AWS S3) when local NVMe capacity is exceeded. DuckDB and Dagster both support remote Delta tables natively.
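
As a hedged illustration of that portability (the table path is the local one from earlier; after a migration it would simply become an s3:// URI):

import duckdb
from deltalake import DeltaTable

# delta-rs resolves the table; DuckDB queries the resulting Arrow dataset in-process.
prices = DeltaTable("/data/silver/prices").to_pyarrow_dataset()
print(duckdb.sql("SELECT count(*) AS n FROM prices").df())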

Pedro

Founder & Principal Engineer