QuantLens Data Platform
A robust financial data platform with observability and orchestration, self-hosted on minimal infrastructure (an 8GB RAM VM). (Under Active Development)
Status: This is an internal research initiative under active development. Current work focuses on leveraging DuckDB for vectorized financial analysis and optimizing self-hosted operational costs.
Overview
QuantLens is an internal R&D initiative at JPR Labs designed to push the boundaries of cost-effective data platform design and data engineering. The goal is to build a robust, self-hosted financial data platform that delivers the capabilities of modern platforms like Databricks or Azure Fabric, engineered to run efficiently on commodity cloud hardware.
This project serves as a reference architecture for lean startups and deep-work technical teams who need powerful data operations without the complexity, overhead, and excessive cost of the typical "Modern Data Stack".

The Philosophy: Control vs. Cost
Enterprise tools like Databricks, Snowflake, or Azure Fabric are standard for large organizations. They solve the problem of scaling to petabytes and managing hundreds of engineers. However, for small teams or deep technical R&D shops, these tools are often overkill. They introduce significant billing overhead and vendor lock-in for workloads that simply do not require distributed clusters.
We believe that for high-performance, focused engineering teams, self-hosted infrastructure provides the ultimate balance: full control, maximum data sovereignty, and extreme cost-effectiveness. You pay for the metal, not the markup.
The Solution
We architected a fully self-managed stack running on a standard Hetzner Cloud VM (CAX21: 4 vCPU, 8GB RAM). This setup costs approximately €6/month, demonstrating that you do not need a five-figure cloud bill to have a professional data lakehouse.
1. "Control Room" Frontend
We built a custom dashboard inspired by the operational clarity of Azure Data Factory's monitoring view.
- Pipeline History: Provides a clear, temporal view of all Dagster runs, allowing for instant debugging of failures.
- Data Explorer: A lightweight query interface allowing users to explore the Delta Lake directly using DuckDB. This brings the "Query Editor" experience of big platforms to local files (see the query sketch below).
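To give a feel for what the Data Explorer does under the hood, here is a minimal sketch of an ad-hoc query against the lake using DuckDB's Delta extension. The `gold/daily_prices` table and its columns are illustrative, not the actual schema.

```python
import duckdb

# Illustrative Gold-layer table; the real lake layout may differ.
GOLD_PRICES = "/data/lake/gold/daily_prices"

con = duckdb.connect()                      # in-process, zero-infrastructure session
con.execute("INSTALL delta; LOAD delta;")   # DuckDB's Delta Lake reader

# The Data Explorer issues queries of roughly this shape against local Delta files.
top_movers = con.execute(f"""
    SELECT ticker,
           trade_date,
           (close - open) / open AS daily_return
    FROM delta_scan('{GOLD_PRICES}')
    ORDER BY daily_return DESC
    LIMIT 10
""").df()
print(top_movers)
```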

2. Data Storage: The Local Lakehouse
We implemented a full Delta Lake architecture, but instead of using expensive cloud object storage (S3), we hosted it locally on the VM's NVMe drive under /data/lake.
- Architecture: A standardized Bronze/Silver/Gold medallion architecture ensures data quality and lineage.
- Performance: By keeping compute (DuckDB) and storage (Delta Tables) on the same machine, we eliminate network latency entirely.
- Protocol: We leverage the open-source Delta Lake protocol for ACID transactions and time-travel, giving us enterprise-grade reliability on a local filesystem (see the sketch after this list).
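As a concrete illustration of the local lakehouse, the sketch below appends to a Bronze table and reads an earlier version back, using the open-source deltalake (delta-rs) Python bindings. The `trades_raw` table and its columns are hypothetical.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Hypothetical Bronze-layer table under the local lake root.
BRONZE_TRADES = "/data/lake/bronze/trades_raw"

batch = pd.DataFrame({
    "ticker": ["AAPL", "MSFT"],
    "price": [189.30, 411.22],
    "ingested_at": [pd.Timestamp.now(tz="UTC")] * 2,
})

# ACID append straight onto the NVMe filesystem -- no object store, no network hop.
write_deltalake(BRONZE_TRADES, batch, mode="append")

# Time travel: re-read the table as of its first commit for reproducible backfills.
as_of_v0 = DeltaTable(BRONZE_TRADES, version=0).to_pandas()

# The transaction log doubles as an audit trail.
print(DeltaTable(BRONZE_TRADES).history())
```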
3. High-Efficiency Engine: Why DuckDB?
A core architectural decision was selecting the compute engine. The industry standard for years has been Apache Spark. However, we identified a shift in the data landscape.
The Spark vs. DuckDB Trade-off
In 2015, vertical scaling was hard and RAM was expensive; if you had 100GB of data, you often needed a cluster. Today, a single node with NVMe drives and reasonable RAM can process massive datasets on its own.
For workloads under 1TB/day (which covers the vast majority of use cases), the complexity of managing a Spark cluster (JVM tuning, worker nodes, shuffling) is an unnecessary tax. We aligned with the industry trend favoring DuckDB for single-node processing: it lets us process millions of rows of financial data in-memory with vectorized execution, avoiding the network overhead of distributed systems entirely.
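To make the single-node argument concrete, here is a sketch of the kind of vectorized aggregation we run, pinned to the VM's memory budget. The Silver-layer `trades` table, its columns, and the spill directory are illustrative assumptions.

```python
import duckdb

con = duckdb.connect()
# Keep DuckDB inside the 8GB machine's budget and spill to NVMe when a query overflows.
con.execute("SET memory_limit = '6GB'")
con.execute("SET threads = 4")
con.execute("SET temp_directory = '/data/tmp'")   # illustrative spill location
con.execute("INSTALL delta; LOAD delta;")

# One vectorized pass over millions of tick rows -- no cluster, no shuffle, no JVM.
daily_bars = con.execute("""
    SELECT ticker,
           date_trunc('day', ts) AS trade_date,
           min(price)  AS low,
           max(price)  AS high,
           sum(volume) AS volume
    FROM delta_scan('/data/lake/silver/trades')
    GROUP BY ALL
""").df()
```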
4. Orchestration & Quality
- Dagster: Manages complex dependency graphs and pipeline execution. We prefer its asset-based model over Airflow's task-based approach (a minimal asset sketch follows this list).
- dbt (Data Build Tool): Handles SQL transformations and automated testing.
- Docker: Containerizes every step to ensure consistent execution environments.
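For flavor, here is a minimal sketch of the asset-based style we prefer, with two hypothetical assets standing in for the real pipeline; the actual graph (including the dbt models) is considerably larger.

```python
from dagster import AssetExecutionContext, Definitions, asset


@asset
def bronze_trades() -> None:
    """Ingest the raw vendor batch into the Bronze Delta table (placeholder body)."""
    ...


@asset(deps=[bronze_trades])
def silver_trades(context: AssetExecutionContext) -> None:
    """Clean and deduplicate Bronze into Silver; Dagster records the lineage."""
    context.log.info("Building silver_trades from bronze_trades")
    ...


# dbt models can join this graph as assets via the dagster-dbt integration (not shown).
defs = Definitions(assets=[bronze_trades, silver_trades])
```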
5. Infrastructure as Code
The entire platform is defined in code.
- Ansible: Automates the provisioning and configuration of the Linux server.
- Self-Healing: The system is designed to recover automatically from failures, minimizing maintenance overhead.
Note (DevOps highlight): Self-healing infrastructure. The platform runs on "ephemeral runners" that automatically reset state between jobs. Using Ansible, we provision the environment from scratch in minutes, ensuring no configuration drift accumulates over months of operation.
Key Outcomes
- High Efficiency: Achieved full operational data capabilities on a standard 8GB RAM VM, reducing infrastructure costs by >90% compared to managed SaaS.
- High Performance: Capable of processing full Nasdaq/NYSE historical datasets in-memory using DuckDB.
- Ownership: Complete data sovereignty with zero vendor lock-in.
QuantLens proves that with the right architecture, creating a "Data Lakehouse" is a matter of smart engineering and choosing the right tool for the scale, not just buying the most expensive enterprise license.
— Pedro
Founder & Principal Engineer