Building an IoT Data Lake: Architecture, Retention & Query
Architecting a data lake for IoT telemetry — bronze/silver/gold zones, Parquet partitioning, retention tiers, and the query patterns that work in 2026.
A data lake is where IoT telemetry goes when the time-series store is too expensive to retain it forever. Designed well, it makes year-over-year analysis cheap and arbitrarily flexible. Designed badly, it becomes the swamp engineers warn you about.
Here is what a working IoT data lake looks like in 2026.
The medallion architecture (bronze → silver → gold)
The pattern that consistently survives is the three-zone architecture, popularised by Databricks but applicable everywhere:
Bronze (raw landing zone)
- Append-only, immutable
- Schema-on-read — data lands in roughly the shape the device sent it
- Partitioned by ingestion date and device ID prefix
- Retention: 1–3 years depending on regulation
Silver (cleaned, conformed)
- Schema-on-write — explicit columns, types, deduplication
- Calibration applied, units conformed, late-arriving data merged
- Partitioned by event date (not ingestion date) and device-class
- Retention: 5–10 years
Gold (analytical, business-ready)
- Aggregated to the grain reports actually need (per-device-per-hour, per-fleet-per-day)
- Joined to dimensional data (customer, product, location)
- Refreshed on a cadence driven by the most frequent report that reads it
Each zone is a different storage path and a different table. Promotion between zones is a documented job, not a side effect.
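In production the bronze-to-silver promotion is a Spark or dbt job writing to an Iceberg table, but the core logic fits in a few lines. A minimal sketch in plain Python, assuming an illustrative schema (`device_id`, epoch-seconds `event_ts`, a `temp`/`unit` pair) rather than any fixed contract:

```python
from datetime import datetime, timezone

def promote_to_silver(bronze_records):
    """Sketch of a bronze -> silver promotion step: deduplicate on
    (device_id, event_ts), conform units, and type the timestamp.
    Field names here are illustrative, not a fixed schema."""
    seen = set()
    silver = []
    for rec in bronze_records:
        key = (rec["device_id"], rec["event_ts"])
        if key in seen:  # drop exact duplicates from at-least-once delivery
            continue
        seen.add(key)
        silver.append({
            "device_id": rec["device_id"],
            "event_ts": datetime.fromtimestamp(rec["event_ts"], tz=timezone.utc),
            # conform Fahrenheit readings to Celsius so silver has one unit
            "temp_c": (rec["temp"] - 32) * 5 / 9 if rec.get("unit") == "F" else rec["temp"],
        })
    return silver
```

The point is that promotion is an explicit, testable transformation — the same shape whether it runs as a dbt model or a Spark merge.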
Format choices in 2026
Parquet is still the default file format. Columnar, well-compressed, supported by every mainstream query engine.
On top of Parquet, two table-format options matter:
- Apache Iceberg — open-source, multi-engine support (Trino, Spark, Snowflake, BigQuery, Athena), ACID transactions, time travel. Increasingly the default for new lakes.
- Delta Lake — Databricks-native, also open-source, mature merge support, strong on Databricks but workable elsewhere.
For a greenfield IoT data lake we default to Iceberg on S3 with Trino or Athena for ad-hoc query, Spark for batch processing, and dbt for transformations between zones.
Partitioning that ages well
Bad partitioning is the most common reason data lakes get slow.
Reasonable partition schema for IoT bronze:
```
s3://lake/bronze/iot-telemetry/
  ingest_date=2026-05-09/
    device_prefix=00/
      hour=14/
        part-00001.parquet
```
For silver, switch to event date and a device class:
```
s3://lake/silver/iot-telemetry/
  event_date=2026-05-09/
    device_class=temperature_sensor_v2/
      part-00001.parquet
```
The trap: partitioning too granularly (e.g. one partition per device) creates millions of small files and kills query planners. Aim for partitions in the 100 MB – 1 GB compressed range.
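Both rules — bounded partition cardinality and the file-size sweet spot — are easy to encode. A sketch with hypothetical helper names, using a two-hex-character device prefix (256 buckets regardless of fleet size) and a 512 MB output-file target:

```python
def bronze_partition_path(ingest_date, device_id, hour):
    """Build the bronze partition prefix described above. device_prefix
    buckets devices by the first two hex chars of their ID, capping
    partition count at 256 regardless of fleet size."""
    prefix = device_id[:2].lower()
    return (f"bronze/iot-telemetry/ingest_date={ingest_date}/"
            f"device_prefix={prefix}/hour={hour:02d}/")

def target_file_count(partition_bytes, target_bytes=512 * 1024 * 1024):
    """How many output files to write so each lands near the
    100 MB - 1 GB compressed sweet spot (512 MB target here)."""
    return max(1, round(partition_bytes / target_bytes))
```

In Spark the second function is what drives a `repartition(n)` before the write; the exact target is tunable, but anything that prevents per-device partitions and kilobyte files is the win.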
Retention tiers
A real IoT data lake has at least three retention tiers:
- Hot — recent data, queryable in seconds. Last 30–90 days. Often kept in the time-series store, not the lake.
- Warm — last 1–3 years in the lake, queryable in minutes for ad-hoc analysis.
- Cold — 3+ years in S3 Glacier or equivalent. Restored on demand for compliance or audit, not for routine queries.
Lifecycle policies move data automatically between tiers. At IoT scale, cold storage is often 10x cheaper per byte than hot.
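On S3 the tier transitions are a lifecycle rule, not a job. A sketch that builds the rule document — the day thresholds are illustrative, and the result is what you would pass to boto3's `put_bucket_lifecycle_configuration`:

```python
def tiering_lifecycle_rule(prefix, warm_after_days=90, cold_after_days=3 * 365):
    """Build an S3 lifecycle rule that moves lake objects to cheaper
    storage classes as they age. Thresholds here are illustrative;
    apply the result via put_bucket_lifecycle_configuration."""
    return {
        "ID": f"tiering-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # warm: infrequent-access pricing, still millisecond reads
            {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},
            # cold: archive pricing, restore-on-demand only
            {"Days": cold_after_days, "StorageClass": "GLACIER"},
        ],
    }
```

GCS and Azure Blob have equivalent lifecycle mechanisms; the principle is the same — tiering is declarative configuration, so nobody has to remember to run it.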
Query patterns that actually work
Three query patterns dominate IoT data lakes:
1. Per-device historical trace “Show me everything device 0xA712 did in the last 90 days.” — typically 50k–500k rows. Hot store wins. Don’t go to the lake.
2. Fleet-wide aggregation “Average temperature across all units of model X by hour for the last quarter.” — millions of rows aggregated to hundreds. Warehouse / lake gold tier wins.
3. Anomaly hunt “Find all devices that crossed threshold Y in the last year.” — hundreds of millions of rows scanned. Lake silver tier with Iceberg + Trino. Expect minute-scale latency, plan accordingly.
Match query patterns to storage tiers. Putting all queries against one tier — usually the hot store — is what makes IoT cloud bills explode.
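That routing decision can even live in code in front of the query layer. A deliberately simple heuristic router for the three patterns above — the thresholds are illustrative starting points, not universal constants:

```python
def route_query(device_count, row_estimate, lookback_days):
    """Map a query to the cheapest tier that can answer it.
    Thresholds are illustrative, tune per deployment."""
    if device_count == 1 and lookback_days <= 90:
        return "hot"          # per-device trace: time-series store
    if row_estimate >= 100_000_000:
        return "lake-silver"  # anomaly hunt: full scan, minute-scale latency
    return "lake-gold"        # fleet aggregation: pre-aggregated tier
```

Even if you never automate the routing, writing the rule down this explicitly is a useful exercise: it forces the team to agree on which tier owns which question.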
The metadata layer
A data lake without a catalog is a swamp. Three components keep it usable:
- Schema registry — the contract for what fields exist in each topic, version-controlled. Confluent Schema Registry, AWS Glue Data Catalog, or Apache Atlas.
- Data catalog — discoverability, lineage, ownership. DataHub, Amundsen, or Atlan are the modern picks.
- Quality monitoring — Great Expectations or Soda checks that data shape matches the schema and freshness contracts hold.
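What those quality checks actually assert is mundane but valuable. A hand-rolled sketch of the two contracts a Great Expectations or Soda suite would encode — schema conformance and freshness — against an illustrative silver schema:

```python
from datetime import datetime, timedelta, timezone

# illustrative silver-tier contract, not a fixed schema
EXPECTED_SCHEMA = {"device_id": str, "event_ts": datetime, "temp_c": float}

def check_batch(records, max_staleness=timedelta(hours=2)):
    """Schema + freshness checks on a batch at a zone transition.
    Returns a list of human-readable failures (empty list = pass)."""
    failures = []
    for i, rec in enumerate(records):
        for field, ftype in EXPECTED_SCHEMA.items():
            if field not in rec:
                failures.append(f"row {i}: missing field {field!r}")
            elif not isinstance(rec[field], ftype):
                failures.append(f"row {i}: {field!r} is not {ftype.__name__}")
    # freshness contract: the newest event must be recent enough
    stamps = [r["event_ts"] for r in records
              if isinstance(r.get("event_ts"), datetime)]
    if stamps and datetime.now(timezone.utc) - max(stamps) > max_staleness:
        failures.append("freshness: newest event exceeds the staleness contract")
    return failures
```

The dedicated tools add reporting, scheduling, and a shared vocabulary on top — but if a check cannot be stated this plainly, it probably is not a contract yet.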
Without these, every analyst will eventually file a ticket asking “what’s in this column?” and someone on the data team will spend a day finding out.
The cost trap
IoT data lakes look cheap (storage is pennies) until you query them.
The expensive query is the one that:
- Scans an unpartitioned table
- Selects every column when it needs three
- Joins dimensional data across the entire history instead of the relevant slice
A single bad analyst query against an IoT silver tier can cost $50–$500 in scan fees on Athena or BigQuery. Multiply across an analytics team that doesn’t know to partition-prune and the lake becomes the most expensive part of the architecture.
Mitigations:
- Use views with built-in date filters that analysts can’t bypass without explicit override
- Set per-user query cost ceilings in the warehouse / query engine
- Encourage Iceberg with hidden partitioning so analysts can’t accidentally scan the whole table
- Educate the analytics team — costs are line items they should care about
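The pruning arithmetic behind those mitigations is worth making concrete. A rough estimator, assuming Athena's published $5-per-TB-scanned on-demand price at the time of writing (check current pricing; BigQuery's model is similar):

```python
def athena_scan_cost(table_tb, partition_fraction=1.0, column_fraction=1.0,
                     price_per_tb=5.00):
    """Estimate an Athena scan cost in dollars.
    partition_fraction: share of partitions a date filter keeps.
    column_fraction: share of columnar data the SELECT actually reads."""
    return table_tb * partition_fraction * column_fraction * price_per_tb

# A year-long SELECT * vs. the same question pruned to one month
# and three of thirty columns, on a hypothetical 40 TB silver tier:
naive = athena_scan_cost(40)                                    # $200.00
pruned = athena_scan_cost(40, partition_fraction=1/12,
                          column_fraction=0.1)                  # ~$1.67
```

Two filters turn a $200 query into under $2 — which is why date-filtered views and hidden partitioning pay for themselves almost immediately.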
What we typically build
For an IoT data lake engagement in 2026:
- Storage: S3 (or GCS / Azure Blob) with lifecycle policies for tier transitions
- Format: Apache Iceberg with Parquet, partitioned by event_date and device_class
- Catalog: AWS Glue Data Catalog or self-hosted Iceberg REST catalog
- Compute: Athena / Trino for ad-hoc, Spark on EMR Serverless or Databricks for batch, Snowflake or BigQuery if the analytics team prefers
- Pipeline: dbt for silver-to-gold transformations, Airflow or Dagster for orchestration
- Quality: Great Expectations checks at every zone transition
- Documentation: data dictionary in the catalog, lineage diagrams in the repo
This stack ages well. Most of the components have been stable for several years and the migration path between them is well-trodden.
If you are starting an IoT data lake or fixing one that has become a swamp, we run these as fixed-scope engagements.