Building an IoT Data Lake: Architecture, Retention & Query
Architecting a data lake for IoT telemetry — bronze/silver/gold zones, Parquet partitioning, retention tiers, and the query patterns that work in 2026.
A data lake is where IoT telemetry goes when the time-series store is too expensive to retain it forever. Designed well, it makes year-over-year analysis cheap and arbitrarily flexible. Designed badly, it becomes the swamp engineers warn you about.
Here is what a working IoT data lake looks like in 2026.
The medallion architecture (bronze → silver → gold)
The pattern that consistently survives is the three-zone architecture, popularised by Databricks but applicable everywhere:
Bronze (raw landing zone)
- Append-only, immutable
- Schema-on-read — data lands in roughly the shape the device sent it
- Partitioned by ingestion date and device ID prefix
- Retention: 1–3 years depending on regulation
Silver (cleaned, conformed)
- Schema-on-write — explicit columns, types, deduplication
- Calibration applied, units conformed, late-arriving data merged
- Partitioned by event date (not ingestion date) and device-class
- Retention: 5–10 years
Gold (analytical, business-ready)
- Aggregated to the grain reports actually need (per-device-per-hour, per-fleet-per-day)
- Joined to dimensional data (customer, product, location)
- Refreshed on a cadence driven by the most frequent report that reads it
Each zone is a different storage path and a different table. Promotion between zones is a documented job, not a side effect.
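In production the bronze-to-silver promotion is a Spark or dbt job writing to an Iceberg table, but the core logic fits in a few lines. A minimal sketch in plain Python, assuming an illustrative schema (`device_id`, epoch-seconds `event_ts`, a `temp`/`unit` pair) rather than any fixed contract:

```python
from datetime import datetime, timezone

def promote_to_silver(bronze_records):
    """Sketch of a bronze -> silver promotion step: deduplicate on
    (device_id, event_ts), conform units, and type the timestamp.
    Field names here are illustrative, not a fixed schema."""
    seen = set()
    silver = []
    for rec in bronze_records:
        key = (rec["device_id"], rec["event_ts"])
        if key in seen:  # drop exact duplicates from at-least-once delivery
            continue
        seen.add(key)
        silver.append({
            "device_id": rec["device_id"],
            "event_ts": datetime.fromtimestamp(rec["event_ts"], tz=timezone.utc),
            # conform Fahrenheit readings to Celsius so silver has one unit
            "temp_c": (rec["temp"] - 32) * 5 / 9 if rec.get("unit") == "F" else rec["temp"],
        })
    return silver
```

The point is that promotion is an explicit, testable transformation — the same shape whether it runs as a dbt model or a Spark merge.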
Format choices in 2026
Parquet is still the default file format. Columnar, well-compressed, supported by every mainstream query engine.
On top of Parquet, two table-format options matter:
- Apache Iceberg — open-source, multi-engine support (Trino, Spark, Snowflake, BigQuery, Athena), ACID transactions, time travel. Increasingly the default for new lakes.
- Delta Lake — Databricks-native, also open-source, mature merge support, strong on Databricks but workable elsewhere.
For a greenfield IoT data lake we default to Iceberg on S3 with Trino or Athena for ad-hoc query, Spark for batch processing, and dbt for transformations between zones.
Partitioning that ages well
Bad partitioning is the most common reason data lakes get slow.
Reasonable partition schema for IoT bronze:
```
s3://lake/bronze/iot-telemetry/
  ingest_date=2026-05-09/
    device_prefix=00/
      hour=14/
        part-00001.parquet
```
For silver, switch to event date and a device class:
```
s3://lake/silver/iot-telemetry/
  event_date=2026-05-09/
    device_class=temperature_sensor_v2/
      part-00001.parquet
```
The trap: partitioning too granularly (e.g. one partition per device) creates millions of small files and kills query planners. Aim for partitions in the 100 MB – 1 GB compressed range.
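Both rules — bounded partition cardinality and the file-size sweet spot — are easy to encode. A sketch with hypothetical helper names, using a two-hex-character device prefix (256 buckets regardless of fleet size) and a 512 MB output-file target:

```python
def bronze_partition_path(ingest_date, device_id, hour):
    """Build the bronze partition prefix described above. device_prefix
    buckets devices by the first two hex chars of their ID, capping
    partition count at 256 regardless of fleet size."""
    prefix = device_id[:2].lower()
    return (f"bronze/iot-telemetry/ingest_date={ingest_date}/"
            f"device_prefix={prefix}/hour={hour:02d}/")

def target_file_count(partition_bytes, target_bytes=512 * 1024 * 1024):
    """How many output files to write so each lands near the
    100 MB - 1 GB compressed sweet spot (512 MB target here)."""
    return max(1, round(partition_bytes / target_bytes))
```

In Spark the second function is what drives a `repartition(n)` before the write; the exact target is tunable, but anything that prevents per-device partitions and kilobyte files is the win.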
Retention tiers
A real IoT data lake has at least three retention tiers:
- Hot — recent data, queryable in seconds. Last 30–90 days. Often kept in the time-series store, not the lake.
- Warm — last 1–3 years in the lake, queryable in minutes for ad-hoc analysis.
- Cold — 3+ years in S3 Glacier or equivalent. Restored on demand for compliance or audit, not for routine queries.
Lifecycle policies move data automatically between tiers. At IoT scale, cold storage is often 10x cheaper per byte than hot.
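On S3 the tier transitions are a lifecycle rule, not a job. A sketch that builds the rule document — the day thresholds are illustrative, and the result is what you would pass to boto3's `put_bucket_lifecycle_configuration`:

```python
def tiering_lifecycle_rule(prefix, warm_after_days=90, cold_after_days=3 * 365):
    """Build an S3 lifecycle rule that moves lake objects to cheaper
    storage classes as they age. Thresholds here are illustrative;
    apply the result via put_bucket_lifecycle_configuration."""
    return {
        "ID": f"tiering-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # warm: infrequent-access pricing, still millisecond reads
            {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},
            # cold: archive pricing, restore-on-demand only
            {"Days": cold_after_days, "StorageClass": "GLACIER"},
        ],
    }
```

GCS and Azure Blob have equivalent lifecycle mechanisms; the principle is the same — tiering is declarative configuration, so nobody has to remember to run it.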
Query patterns that actually work
Three query patterns dominate IoT data lakes:
1. Per-device historical trace “Show me everything device 0xA712 did in the last 90 days.” — typically 50k–500k rows. Hot store wins. Don’t go to the lake.
2. Fleet-wide aggregation “Average temperature across all units of model X by hour for the last quarter.” — millions of rows aggregated to hundreds. Warehouse / lake gold tier wins.
3. Anomaly hunt “Find all devices that crossed threshold Y in the last year.” — hundreds of millions of rows scanned. Lake silver tier with Iceberg + Trino. Expect minute-scale latency, plan accordingly.
Match query patterns to storage tiers. Putting all queries against one tier — usually the hot store — is what makes IoT cloud bills explode.
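That routing decision can even live in code in front of the query layer. A deliberately simple heuristic router for the three patterns above — the thresholds are illustrative starting points, not universal constants:

```python
def route_query(device_count, row_estimate, lookback_days):
    """Map a query to the cheapest tier that can answer it.
    Thresholds are illustrative, tune per deployment."""
    if device_count == 1 and lookback_days <= 90:
        return "hot"          # per-device trace: time-series store
    if row_estimate >= 100_000_000:
        return "lake-silver"  # anomaly hunt: full scan, minute-scale latency
    return "lake-gold"        # fleet aggregation: pre-aggregated tier
```

Even if you never automate the routing, writing the rule down this explicitly is a useful exercise: it forces the team to agree on which tier owns which question.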
The metadata layer
A data lake without a catalog is a swamp. Three components keep it usable:
- Schema registry — the contract for what fields exist in each topic, version-controlled. Confluent Schema Registry, AWS Glue Data Catalog, or Apache Atlas.
- Data catalog — discoverability, lineage, ownership. DataHub, Amundsen, or Atlan are the modern picks.
- Quality monitoring — Great Expectations or Soda checks that data shape matches the schema and freshness contracts hold.
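What those quality checks actually assert is mundane but valuable. A hand-rolled sketch of the two contracts a Great Expectations or Soda suite would encode — schema conformance and freshness — against an illustrative silver schema:

```python
from datetime import datetime, timedelta, timezone

# illustrative silver-tier contract, not a fixed schema
EXPECTED_SCHEMA = {"device_id": str, "event_ts": datetime, "temp_c": float}

def check_batch(records, max_staleness=timedelta(hours=2)):
    """Schema + freshness checks on a batch at a zone transition.
    Returns a list of human-readable failures (empty list = pass)."""
    failures = []
    for i, rec in enumerate(records):
        for field, ftype in EXPECTED_SCHEMA.items():
            if field not in rec:
                failures.append(f"row {i}: missing field {field!r}")
            elif not isinstance(rec[field], ftype):
                failures.append(f"row {i}: {field!r} is not {ftype.__name__}")
    # freshness contract: the newest event must be recent enough
    stamps = [r["event_ts"] for r in records
              if isinstance(r.get("event_ts"), datetime)]
    if stamps and datetime.now(timezone.utc) - max(stamps) > max_staleness:
        failures.append("freshness: newest event exceeds the staleness contract")
    return failures
```

The dedicated tools add reporting, scheduling, and a shared vocabulary on top — but if a check cannot be stated this plainly, it probably is not a contract yet.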
Without these, every analyst will eventually file a ticket asking “what’s in this column?” and someone on the data team will spend a day finding out.
The cost trap
IoT data lakes look cheap (storage is pennies) until you query them.
The expensive query is the one that:
- Scans an unpartitioned table
- Selects every column when it needs three
- Joins dimensional data across the entire history instead of the relevant slice
A single bad analyst query against an IoT silver tier can cost $50–$500 in scan fees on Athena or BigQuery. Multiply across an analytics team that doesn’t know to partition-prune and the lake becomes the most expensive part of the architecture.
Mitigations:
- Use views with built-in date filters that analysts can’t bypass without explicit override
- Set per-user query cost ceilings in the warehouse / query engine
- Encourage Iceberg with hidden partitioning so analysts can’t accidentally scan the whole table
- Educate the analytics team — costs are line items they should care about
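The pruning arithmetic behind those mitigations is worth making concrete. A rough estimator, assuming Athena's published $5-per-TB-scanned on-demand price at the time of writing (check current pricing; BigQuery's model is similar):

```python
def athena_scan_cost(table_tb, partition_fraction=1.0, column_fraction=1.0,
                     price_per_tb=5.00):
    """Estimate an Athena scan cost in dollars.
    partition_fraction: share of partitions a date filter keeps.
    column_fraction: share of columnar data the SELECT actually reads."""
    return table_tb * partition_fraction * column_fraction * price_per_tb

# A year-long SELECT * vs. the same question pruned to one month
# and three of thirty columns, on a hypothetical 40 TB silver tier:
naive = athena_scan_cost(40)                                    # $200.00
pruned = athena_scan_cost(40, partition_fraction=1/12,
                          column_fraction=0.1)                  # ~$1.67
```

Two filters turn a $200 query into under $2 — which is why date-filtered views and hidden partitioning pay for themselves almost immediately.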
What we typically build
For an IoT data lake engagement in 2026:
- Storage: S3 (or GCS / Azure Blob) with lifecycle policies for tier transitions
- Format: Apache Iceberg with Parquet, partitioned by event_date and device_class
- Catalog: AWS Glue Data Catalog or self-hosted Iceberg REST catalog
- Compute: Athena / Trino for ad-hoc, Spark on EMR Serverless or Databricks for batch, Snowflake or BigQuery if the analytics team prefers
- Pipeline: dbt for silver-to-gold transformations, Airflow or Dagster for orchestration
- Quality: Great Expectations checks at every zone transition
- Documentation: data dictionary in the catalog, lineage diagrams in the repo
This stack ages well. Most of the components have been stable for several years and the migration path between them is well-trodden.
If you are starting an IoT data lake or fixing one that has become a swamp, we run these as fixed-scope engagements.