SLO Design for Connected Products: What to Measure & Alert On
How to design Service Level Objectives for IoT products in 2026 — what's different from web SLOs, the metrics that matter, error budgets, and the alert thresholds that work.
SLO design is well-trodden for web services. The same principles apply to IoT, but the metrics, the time horizons, and the failure modes are different enough that copy-pasting web-SLO templates produces SLOs that don’t match the product.
Here is what changes for connected products and how to design SLOs that actually serve operations.
What’s different from web SLOs
Web SLOs typically focus on:
- Request success rate (HTTP 2xx / total)
- Request latency percentiles
- Service availability per region
IoT SLOs care about different things:
- Device connectivity — the percentage of devices online over time
- Telemetry completeness — does data actually arrive when expected
- Command success — when the cloud sends a command, does the device perform it
- OTA success — what fraction of OTA campaigns complete cleanly
- Battery and lifecycle — for battery devices, expected vs actual battery life
The time horizons differ too. A web service measures latency in milliseconds; an IoT system measures connectivity over hours and OTA success over weeks.
The five SLOs that matter
For most connected products in 2026:
1. Device availability SLO
Definition: percentage of devices that have communicated within the last N minutes (typical N: 60 minutes for Wi-Fi devices, 24 hours for battery-powered LoRaWAN devices).
Target example: “99% of devices report telemetry within 24 hours, measured weekly.”
Why it matters: this is the baseline health of the fleet. A drop signals network problems, server problems, or a device-side bug.
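As a rough sketch of how the SLI might be computed: the function below counts devices whose last message falls within the window. The in-memory `last_seen` map is an illustrative assumption, not a specific platform's schema; in practice the timestamps come from your device registry or ingest store.

```python
from datetime import datetime, timedelta, timezone

def device_availability_sli(last_seen: dict[str, datetime],
                            window: timedelta = timedelta(hours=24)) -> float:
    """Fraction of the fleet that has communicated within the window.

    `last_seen` maps device_id -> timestamp of the last message received.
    The 24-hour window matches the battery/LoRaWAN example above; use
    roughly 60 minutes for mains-powered Wi-Fi devices.
    """
    now = datetime.now(timezone.utc)
    if not last_seen:
        return 1.0  # an empty fleet trivially meets the target
    online = sum(1 for ts in last_seen.values() if now - ts <= window)
    return online / len(last_seen)
```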
2. Telemetry completeness SLO
Definition: percentage of expected telemetry messages that actually arrive at the cloud.
Target example: “98% of expected hourly telemetry messages received per device, measured per device per week.”
Why it matters: missing data drives missing alerts and missing analytics. This SLO catches degraded conditions before devices drop offline entirely.
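A minimal way to compute this per device, assuming you know each device's expected message count for the window (168 for hourly telemetry over a week). Real pipelines would pull these counts from the ingest store; the dictionaries here are stand-ins.

```python
def telemetry_completeness_sli(expected: dict[str, int],
                               received: dict[str, int]) -> dict[str, float]:
    """Per-device completeness: received / expected messages in the window."""
    sli = {}
    for device_id, exp in expected.items():
        got = received.get(device_id, 0)
        sli[device_id] = min(got / exp, 1.0) if exp else 1.0
    return sli

# Example: hourly telemetry over one week means 168 expected messages.
weekly = telemetry_completeness_sli({"dev-1": 168, "dev-2": 168},
                                    {"dev-1": 168, "dev-2": 150})
below_target = [d for d, v in weekly.items() if v < 0.98]  # ["dev-2"]
```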
3. Command success SLO
Definition: percentage of commands sent to devices that complete successfully (acknowledged + acted upon).
Target example: “99.5% of commands completed within 30 seconds, measured per command type per day.”
Why it matters: when commands fail silently, customers lose trust. This SLO is often the most user-facing.
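A sketch of the corresponding SLI, assuming each command record carries a sent timestamp, a completion timestamp and a status; adapt the field names to your own command log.

```python
from datetime import timedelta

def command_success_sli(commands: list[dict]) -> float:
    """Fraction of commands acknowledged and acted on within the deadline.

    Each record is assumed to have `sent_at`, `completed_at` (None if the
    command never completed) and `status` fields.
    """
    deadline = timedelta(seconds=30)  # matches the 30-second target above
    if not commands:
        return 1.0
    ok = sum(
        1 for c in commands
        if c["completed_at"] is not None
        and c["status"] == "success"
        and c["completed_at"] - c["sent_at"] <= deadline
    )
    return ok / len(commands)
```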
4. OTA success SLO
Definition: percentage of devices in an OTA campaign that successfully apply the new firmware.
Target example: “98% of devices apply OTA within 7 days of campaign start, with rollback rate under 0.5%.”
Why it matters: OTA is one of the riskiest operational activities. The SLO drives staged-rollout discipline (our OTA post).
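One way to compute both campaign numbers, assuming each device record carries a terminal state such as 'updated', 'rolled_back', 'failed' or 'pending'; the state names are placeholders for whatever your OTA platform reports.

```python
def ota_campaign_slis(devices: list[dict]) -> tuple[float, float]:
    """Success rate and rollback rate for an OTA campaign."""
    total = len(devices)
    if total == 0:
        return 1.0, 0.0
    updated = sum(1 for d in devices if d["state"] == "updated")
    rolled_back = sum(1 for d in devices if d["state"] == "rolled_back")
    return updated / total, rolled_back / total

# Track both against the targets: >= 98% applied within 7 days,
# rollback rate under 0.5%.
```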
5. Time-to-acknowledge incident SLO
Definition: time from incident detection (alert fires) to human acknowledgement.
Target example: “P0 incidents acknowledged within 15 minutes; P1 within 1 hour; measured per quarter.”
Why it matters: this is an operational SLO, not a system SLO. It drives staffing, on-call practices, and alerting quality.
Error budgets
Error budgets are the operational use of SLOs. If your availability SLO is 99%, the total budget is 1% unavailability; at 99.5% actual availability you have consumed half of it, leaving 0.5 percentage points of budget.
Practical use:
- Error budget at full → free to ship features quickly, take more deployment risk
- Error budget low → slow deployments, focus on reliability work
- Error budget exhausted → freeze feature work until budget recovers
Each SLO has its own error budget. The product owner allocates engineering time based on which budgets are healthy.
For IoT specifically, the error budget often gets consumed by fleet-wide events (a bad OTA, a broker outage, a customer’s Wi-Fi network change). Plan for these — error budget burndown isn’t always linear.
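For concreteness, the budget arithmetic from the example above as a small helper. The availability figures are the ones from the text; the function itself is generic.

```python
def error_budget_remaining(slo_target: float, actual: float) -> float:
    """Fraction of the error budget still unspent for the period.

    A 99% availability SLO allows 1% 'bad' time; at 99.5% actual,
    half the budget is spent and half remains.
    """
    budget = 1.0 - slo_target          # e.g. 0.01
    spent = max(1.0 - actual, 0.0)     # e.g. 0.005
    if budget <= 0:
        return 0.0
    return max(1.0 - spent / budget, 0.0)

print(error_budget_remaining(0.99, 0.995))  # ~0.5, half the budget left
```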
What makes a good SLO
Three properties:
- Customer-meaningful — the SLO measures something the customer would notice if violated. Internal performance metrics that customers don’t see make poor SLOs.
- Achievable — the SLO target is realistic for the system as designed. A 99.99% availability SLO on a single-region system is wishful; the architecture has to support the target.
- Actionable — when the SLO is at risk, the on-call team has clear actions to take. SLOs that fire without remediation paths waste people’s time.
The trap: SLOs based on fleet averages
A fleet-wide average SLO can hide localised problems. “99% of devices are online” sounds good — until you realise that one customer’s 50 devices are all offline.
Defence: SLOs measured at the right granularity. Per-customer SLOs catch customer-specific issues. Per-region SLOs catch regional problems. Combine them.
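A sketch of the per-customer cut, assuming you have a device-to-customer mapping; the same grouping works per region or per firmware version.

```python
from collections import defaultdict

def availability_by_customer(online: dict[str, bool],
                             customer_of: dict[str, str]) -> dict[str, float]:
    """Availability computed per customer instead of fleet-wide.

    `online` maps device_id -> whether it reported within the SLO window;
    `customer_of` maps device_id -> customer_id (assumed mappings).
    """
    totals, up = defaultdict(int), defaultdict(int)
    for device_id, is_online in online.items():
        cust = customer_of[device_id]
        totals[cust] += 1
        up[cust] += is_online
    return {c: up[c] / totals[c] for c in totals}

# A 99% fleet-wide figure can coexist with one customer sitting at 0% here.
```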
Aligning SLOs with contracts
For B2B IoT products, SLOs often align with customer Service Level Agreements:
- The SLA promises the customer 99% uptime
- The SLO is internal and tighter — say 99.5% — to leave headroom
- The error budget is the gap between SLA promise and SLO target
Alignment matters. SLOs that don’t translate to SLA performance are operational theatre. SLOs tighter than SLAs preserve customer trust during partial outages.
Alert design from SLOs
The mature pattern: alerts based on error budget burn rate, not absolute error rate.
- “Telemetry loss rate is 1% right now” — possibly a transient spike, possibly a real issue
- “Telemetry loss rate is burning the monthly error budget at 14x normal speed” — definitely an issue worth waking someone up for
This nuance matters because IoT systems have varied baselines. A 1% loss rate during a known network event might be fine; the same rate during a calm period is alarming.
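A simplified burn-rate check, loosely following the multi-window pattern: page only when both a short and a long window are burning the budget fast. The 14x threshold echoes the example above; the 98% completeness target and the window pairing are illustrative choices, not fixed rules.

```python
def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is burning.

    A burn rate of 1.0 would exhaust the budget exactly at period end.
    """
    budget = 1.0 - slo_target
    return bad_fraction / budget if budget > 0 else float("inf")

def should_page(short_window_bad: float, long_window_bad: float,
                slo_target: float = 0.98) -> bool:
    """Require both windows (say 1h and 6h) to burn fast, to avoid flapping."""
    return (burn_rate(short_window_bad, slo_target) >= 14.0
            and burn_rate(long_window_bad, slo_target) >= 14.0)

# Against a 98% completeness SLO, 1% loss is a 0.5x burn (no page);
# 30% loss sustained over both windows is a 15x burn (page).
print(should_page(0.30, 0.30))  # True
print(should_page(0.01, 0.01))  # False
```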
For broader observability practice see our fleet observability post.
What we typically build
For an SLO engagement on a connected product:
- SLO framework document — definitions, targets, measurement methodology
- SLI implementation — the queries / metrics that compute the SLOs in practice
- Error budget dashboards — burn rate per SLO, time remaining at current rate
- Alerting rules — burn-rate-based alerts at multiple sensitivity levels
- Quarterly SLO review process — adjust targets based on observed reality
- Customer-facing status page — selective public exposure of relevant SLOs
SLOs only work when the team uses them. The technical implementation is straightforward; the cultural integration is the work.
If you are designing SLOs for an IoT product, or operating one without explicit SLOs in place: this is a combination we have shipped across multiple products.