Designing OTA Firmware Updates That Don't Brick Devices
The patterns we use to ship firmware over the air to devices in the field — A/B partitions, rollback, signed images, staged rollouts, and the failure modes that bite if you skip them.
The first OTA push to thousands of customer devices is a stressful Tuesday afternoon. The second one shouldn’t be. Getting from one to the other is a system-design problem, not just a firmware-engineering problem.
Here is the approach we have settled on after shipping OTA pipelines for smart-home, industrial, and medical-adjacent products.
A/B partitions are not optional
Single-image updates — flash a new firmware over the running one and reboot — are how you brick a fleet. The moment a device loses power mid-flash, or the new image has a bug that breaks Wi-Fi, you have a paperweight that costs a service call to recover.
The fix is the A/B (or “ping-pong”) pattern: two firmware partitions, plus a small bootloader that decides which to boot. New firmware is written to the inactive partition, verified, then marked as the next-boot target. If the device fails to call home within a watchdog window after booting the new image, the bootloader falls back to the previous slot.
This costs you double the flash for firmware storage. It saves you ten times that in support cost.
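As a rough illustration, the slot-selection logic can be modeled on a host machine in a few lines of Python. This is a sketch under our own assumptions, not any vendor's bootloader: the `BootState` fields, the function names, and the retry budget of three trial boots are all hypothetical, and a real bootloader would keep this state in a small, power-fail-safe flash region.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BOOT_ATTEMPTS = 3  # assumed retry budget before falling back


@dataclass
class BootState:
    """Metadata a bootloader might persist between resets (hypothetical layout)."""
    active: str = "A"               # last slot known to be good
    pending: Optional[str] = None   # slot staged for a trial boot, if any
    attempts: int = 0               # trial boots of the pending slot so far


def stage_update(state: BootState) -> str:
    """Mark the inactive slot as the next-boot target and return its name."""
    target = "B" if state.active == "A" else "A"
    state.pending = target
    state.attempts = 0
    return target


def select_slot(state: BootState) -> str:
    """Run on every reset: pick the slot to boot."""
    if state.pending is None:
        return state.active
    if state.attempts >= MAX_BOOT_ATTEMPTS:
        # The trial image never confirmed itself: abandon it.
        state.pending = None
        state.attempts = 0
        return state.active
    state.attempts += 1
    return state.pending


def confirm_boot(state: BootState) -> None:
    """Called by the new image once it has called home successfully."""
    if state.pending is not None:
        state.active = state.pending
        state.pending = None
        state.attempts = 0
```

In this model the watchdog simply resets the device when the new image does not call `confirm_boot` in time, so every failed trial passes through `select_slot` again until the budget runs out and the old slot is restored.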
Sign your images and verify on device
Unsigned OTA is an open invitation. An attacker who can MITM your update channel — or simply write a payload that mimics it — can ship arbitrary firmware to your fleet.
The minimum viable answer:
- Sign every release with a key that lives offline (HSM, hardware token, or at minimum an air-gapped CI signer).
- The bootloader verifies the signature against a public key (or a hash of it) burned into the device’s one-time-programmable fuses.
- The verification has to happen before the new image is marked as the next-boot target, not after.
ESP32, STM32, and Nordic SDKs all have well-trodden secure-boot and OTA chains. Use them. Do not invent your own crypto.
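The ordering is the part that is easy to get wrong, so here is a minimal host-side sketch of it. The signature check is deliberately left as a pluggable callable: `Verifier`, `install_image`, and the `slots` dictionary are hypothetical names for illustration, and on real hardware the check would be the vendor's secure-boot routine rather than application code.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical signature check: (image digest, signature) -> bool.
# On a real device this is provided by the vendor secure-boot chain.
Verifier = Callable[[bytes, bytes], bool]


def install_image(image: bytes, signature: bytes, slots: Dict[str, object],
                  inactive: str, verify: Verifier) -> bool:
    """Write to the inactive slot, verify it, and only then mark it bootable."""
    slots[inactive] = image  # the running slot is never touched
    # Hash what actually landed in the slot, not the download buffer.
    digest = hashlib.sha256(slots[inactive]).digest()
    if not verify(digest, signature):
        slots[inactive] = b""  # discard the rejected image
        return False
    slots["next_boot"] = inactive  # reached only after verification passes
    return True
```

A rejected image leaves `next_boot` pointing at the old slot, so a forged or corrupted download never becomes bootable.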
Resume, don’t restart
Devices on flaky networks will lose connection mid-download. Your protocol should support range requests and chunked downloads with hash-per-chunk verification. The device should resume from the last verified chunk, not start over.
For ESP32 with HTTPS OTA this is a matter of using the streaming variant and persisting the download offset across reboots. For LoRa or NB-IoT, where bandwidth is precious, chunking matters for a different reason: you cannot afford to retransmit the whole image every retry.
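A minimal sketch of the resume logic, written as a host-side Python model rather than device code. The names here are our own assumptions: `fetch_range` stands in for an HTTP Range request, `state` stands in for a value kept in non-volatile storage, and the 4096-byte chunk size is arbitrary.

```python
import hashlib
from typing import Callable, List

CHUNK_SIZE = 4096  # assumed; in practice often aligned to the flash erase block


def chunk_hashes(image: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Server side: per-chunk SHA-256 digests to publish in the signed manifest."""
    return [hashlib.sha256(image[i:i + chunk_size]).hexdigest()
            for i in range(0, len(image), chunk_size)]


def resume_download(fetch_range: Callable[[int, int], bytes],
                    manifest: List[str], state: dict, slot: bytearray,
                    chunk_size: int = CHUNK_SIZE) -> bool:
    """Continue from state['next_chunk']; return True once all chunks verify.

    fetch_range(start, end) uses an exclusive end and may raise ConnectionError.
    """
    index = state.get("next_chunk", 0)
    while index < len(manifest):
        start = index * chunk_size
        try:
            data = fetch_range(start, start + chunk_size)
        except ConnectionError:
            return False  # progress so far is kept; retry later
        if hashlib.sha256(data).hexdigest() != manifest[index]:
            return False  # corrupt chunk: do not advance the offset
        slot[start:start + len(data)] = data  # write into the inactive slot
        index += 1
        state["next_chunk"] = index  # persist only after verify and write
    return True
```

Because the offset advances only after a chunk has been verified and written, a dropped connection or a reboot costs at most one chunk of retransmission.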
Stage your rollouts
Pushing a new image to 100% of the fleet at once is how you find out that a corner-case board revision panics in the new code — at the worst possible time.
A staged rollout pattern that works:
- Phase 0: Engineering devices, indoors, supervised. Two days minimum.
- Phase 1: 1% of the fleet. Two days. Watch error rates and crash logs.
- Phase 2: 10%. Two more days.
- Phase 3: 50% if Phase 2 is clean.
- Phase 4: 100%.
The fleet management system needs to honor these cohorts and let you halt mid-rollout from a single button. If your platform cannot do this, you do not have an OTA pipeline yet — you have a firmware download mechanism.
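One common way to implement cohorts is to hash each device ID into a stable bucket and compare it against the current phase's percentage. The sketch below assumes that approach; `device_bucket`, `is_eligible`, and the choice to salt the hash with the release name are our illustration, not a feature of any particular fleet platform. Phase 0 is assumed to be an explicit allowlist of engineering devices and is not modeled.

```python
import hashlib

# Cumulative fleet percentage per phase, taken from the list above.
PHASE_PERCENT = {1: 1, 2: 10, 3: 50, 4: 100}


def device_bucket(device_id: str, release: str) -> int:
    """Map a device to a stable bucket in [0, 100) for this release."""
    digest = hashlib.sha256(f"{release}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_eligible(device_id: str, release: str, phase: int,
                halted: bool = False) -> bool:
    """True if the device should be offered the release in the given phase."""
    if halted or phase not in PHASE_PERCENT:
        return False
    return device_bucket(device_id, release) < PHASE_PERCENT[phase]
```

Because the bucket is fixed per device and release, each phase is a superset of the previous one, and the `halted` flag is the single button that stops everything. Salting with the release name rotates which devices go first; dropping the salt would keep the same canary devices for every release, which some teams prefer.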
Have a kill switch
You will eventually push a bad image. The day you do, you need a way to stop further devices from upgrading and to roll back the ones that did, fast.
This is two distinct mechanisms:
- A server-side flag that pauses the rollout. Devices that have not yet pulled the new image stay on the old one.
- A device-side trigger that forces a rollback. Either a watchdog tied to a key health metric, or a server command sent over the always-on telemetry channel.
Both need to exist before a single device ships. Neither can be added during an incident.
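To make the two mechanisms concrete, here is a small sketch of the decision a server might make each time a device polls for updates or checks in over telemetry. The `RolloutState` fields and the three action strings are hypothetical; a real system would also say which image to roll back to.

```python
from dataclasses import dataclass


@dataclass
class RolloutState:
    target_version: str
    paused: bool = False    # server-side flag: stop offering the new image
    recalled: bool = False  # kill switch: devices on the new image must revert


def next_action(state: RolloutState, device_version: str, eligible: bool) -> str:
    """Return "update", "rollback", or "stay" for one polling device."""
    if state.recalled:
        # A recall overrides everything else, including an active rollout.
        return "rollback" if device_version == state.target_version else "stay"
    if state.paused:
        return "stay"
    if eligible and device_version != state.target_version:
        return "update"
    return "stay"
```

Pausing only protects devices that have not upgraded yet; the recall flag is what reaches the ones that already did.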
Telemetry from the bootloader
After the new image boots, you want fast confirmation that the device is healthy. The minimum signal is “the new image checked in within X minutes of boot.” Anything else (sensor reads, Wi-Fi RSSI, free heap) is bonus.
The trap is that the failure modes you most want to catch (the new firmware does not boot at all, or boots but cannot connect) are exactly the ones that prevent the device from telling you. That is why the rollback lives in the bootloader: the decision to revert never depends on code from the new image running successfully.
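On the server side, the same signal can be turned into a list of devices to worry about. Since a silent device cannot report its own boot time, this sketch uses the time the image was delivered as a proxy. The function name, the dictionary inputs, and the 10-minute window are assumptions for illustration.

```python
from typing import Dict, List

CHECK_IN_WINDOW_S = 600  # assumed value for "X minutes"; tune per product


def missed_check_in(delivered_at: Dict[str, float],
                    checked_in_at: Dict[str, float],
                    now: float,
                    window_s: float = CHECK_IN_WINDOW_S) -> List[str]:
    """Return devices that received the image but did not check in on time.

    delivered_at: device ID -> time the server finished serving the image.
    checked_in_at: device ID -> first check-in reporting the new version.
    """
    missing = []
    for device_id, delivered in delivered_at.items():
        seen = checked_in_at.get(device_id)
        if seen is not None and seen - delivered <= window_s:
            continue  # checked in within the window: healthy
        if now - delivered > window_s:
            missing.append(device_id)  # window expired, late or silent
    return sorted(missing)
```

Devices still inside their window are not flagged, so the list can be polled continuously during a rollout without false alarms from recent deliveries.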
What to test before going live
A non-exhaustive list of OTA tests that have caught real bugs:
- Power cut during download.
- Power cut during write to inactive slot.
- New image fails to boot (corrupt header, missing entry point).
- New image boots but cannot connect to Wi-Fi.
- New image boots, connects, but crashes during application init.
- Server returns wrong content-length header.
- Server returns a partial image (truncated download).
- Two updates queued back-to-back.
The test plan is the deliverable. The OTA pipeline is the artifact. Without the test plan, the pipeline is wishful thinking.
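Several of these cases can be rehearsed on a host machine before touching hardware. The sketch below models only the "power cut during write to inactive slot" case, with a toy device that has two slots and a next-boot pointer. `SimDevice`, `PowerCut`, and `survives_power_cut` are hypothetical names, and image verification is left out to keep the fault injection visible.

```python
class PowerCut(Exception):
    """Raised by the simulated flash when power is lost mid-write."""


class SimDevice:
    """Toy model of an A/B device: two slots plus a next-boot pointer."""

    def __init__(self, firmware: bytes):
        self.slots = {"A": firmware, "B": b""}
        self.boot_slot = "A"

    def write_inactive(self, image: bytes, cut_after=None, block: int = 256) -> None:
        """Write block by block; lose power after `cut_after` bytes if set."""
        target = "B" if self.boot_slot == "A" else "A"
        self.slots[target] = b""
        for offset in range(0, len(image), block):
            if cut_after is not None and offset >= cut_after:
                raise PowerCut(f"power lost at byte {offset}")
            self.slots[target] += image[offset:offset + block]
        self.boot_slot = target  # flipped only after the last block lands


def survives_power_cut(cut_after: int) -> bool:
    """True if the device still boots the old image after an interrupted write."""
    old, new = b"\x01" * 1024, b"\x02" * 1024
    device = SimDevice(old)
    try:
        device.write_inactive(new, cut_after=cut_after)
    except PowerCut:
        pass
    return device.boot_slot == "A" and device.slots["A"] == old
```

Sweeping `cut_after` across every block boundary, for example `all(survives_power_cut(n) for n in range(0, 1024, 256))`, checks that no interruption point leaves the device pointing at a half-written slot.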
What we hand over at the end of an engagement
At the close of an IoT project, the OTA system gets documented as an operations runbook, not a code module. It describes how to cut a release, how to start a staged rollout, how to halt one, how to roll back, and what every red and green metric on the dashboard means. The engineers who replace us read this on day one and ship a release on day two.
If you are about to do your first big OTA push and want a second pair of eyes on the plan, we are happy to review it.