SLO Design for Connected Products: What to Measure & Alert On
How to design Service Level Objectives for IoT products in 2026 — what's different from web SLOs, the metrics that matter, error budgets, and the alert thresholds that work.
SLO design is well-trodden for web services. The same principles apply to IoT, but the metrics, the time horizons, and the failure modes are different enough that copy-pasting web-SLO templates produces SLOs that don’t match the product.
Here is what changes for connected products and how to design SLOs that actually serve operations.
What’s different from web SLOs
Web SLOs typically focus on:
- Request success rate (HTTP 2xx / total)
- Request latency percentiles
- Service availability per region
IoT SLOs care about different things:
- Device connectivity — the percentage of devices online over time
- Telemetry completeness — does data actually arrive when expected
- Command success — when the cloud sends a command, does the device perform it
- OTA success — what fraction of OTA campaigns complete cleanly
- Battery and lifecycle — for battery devices, expected vs actual battery life
The time horizons differ too. A web service measures latency in milliseconds; an IoT system measures connectivity over hours and OTA success over weeks.
The five SLOs that matter
For most connected products in 2026:
1. Device availability SLO
Definition: percentage of devices that have communicated within the last N minutes (typical N: 60 minutes for Wi-Fi devices, 24 hours for battery-powered LoRaWAN devices).
Target example: “99% of devices report telemetry within 24 hours, measured weekly.”
Why it matters: this is the baseline health of the fleet. A drop signals network problems, server problems, or a device-side bug.
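As a rough sketch of how the SLI might be computed: the function below counts devices whose last message falls within the window. The in-memory `last_seen` map is an illustrative assumption, not a specific platform's schema; in practice the timestamps come from your device registry or ingest store.

```python
from datetime import datetime, timedelta, timezone

def device_availability_sli(last_seen: dict[str, datetime],
                            window: timedelta = timedelta(hours=24)) -> float:
    """Fraction of the fleet that has communicated within the window.

    `last_seen` maps device_id -> timestamp of the last message received.
    The 24-hour window matches the battery/LoRaWAN example above; use
    roughly 60 minutes for mains-powered Wi-Fi devices.
    """
    now = datetime.now(timezone.utc)
    if not last_seen:
        return 1.0  # an empty fleet trivially meets the target
    online = sum(1 for ts in last_seen.values() if now - ts <= window)
    return online / len(last_seen)
```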
2. Telemetry completeness SLO
Definition: percentage of expected telemetry messages that actually arrive at the cloud.
Target example: “98% of expected hourly telemetry messages received per device, measured per device per week.”
Why it matters: missing data drives missing alerts and missing analytics. This SLO catches degraded conditions before devices drop offline entirely.
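A minimal way to compute this per device, assuming you know each device's expected message count for the window (168 for hourly telemetry over a week). Real pipelines would pull these counts from the ingest store; the dictionaries here are stand-ins.

```python
def telemetry_completeness_sli(expected: dict[str, int],
                               received: dict[str, int]) -> dict[str, float]:
    """Per-device completeness: received / expected messages in the window."""
    sli = {}
    for device_id, exp in expected.items():
        got = received.get(device_id, 0)
        sli[device_id] = min(got / exp, 1.0) if exp else 1.0
    return sli

# Example: hourly telemetry over one week means 168 expected messages.
weekly = telemetry_completeness_sli({"dev-1": 168, "dev-2": 168},
                                    {"dev-1": 168, "dev-2": 150})
below_target = [d for d, v in weekly.items() if v < 0.98]  # ["dev-2"]
```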
3. Command success SLO
Definition: percentage of commands sent to devices that complete successfully (acknowledged + acted upon).
Target example: “99.5% of commands completed within 30 seconds, measured per command type per day.”
Why it matters: when commands fail silently, customers lose trust. This SLO is often the most user-facing.
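A sketch of the corresponding SLI, assuming each command record carries a sent timestamp, a completion timestamp and a status; adapt the field names to your own command log.

```python
from datetime import timedelta

def command_success_sli(commands: list[dict]) -> float:
    """Fraction of commands acknowledged and acted on within the deadline.

    Each record is assumed to have `sent_at`, `completed_at` (None if the
    command never completed) and `status` fields.
    """
    deadline = timedelta(seconds=30)  # matches the 30-second target above
    if not commands:
        return 1.0
    ok = sum(
        1 for c in commands
        if c["completed_at"] is not None
        and c["status"] == "success"
        and c["completed_at"] - c["sent_at"] <= deadline
    )
    return ok / len(commands)
```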
4. OTA success SLO
Definition: percentage of devices in an OTA campaign that successfully apply the new firmware.
Target example: “98% of devices apply OTA within 7 days of campaign start, with rollback rate under 0.5%.”
Why it matters: OTA is one of the riskiest operational activities. The SLO drives staged-rollout discipline (our OTA post).
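One way to compute both campaign numbers, assuming each device record carries a terminal state such as 'updated', 'rolled_back', 'failed' or 'pending'; the state names are placeholders for whatever your OTA platform reports.

```python
def ota_campaign_slis(devices: list[dict]) -> tuple[float, float]:
    """Success rate and rollback rate for an OTA campaign."""
    total = len(devices)
    if total == 0:
        return 1.0, 0.0
    updated = sum(1 for d in devices if d["state"] == "updated")
    rolled_back = sum(1 for d in devices if d["state"] == "rolled_back")
    return updated / total, rolled_back / total

# Track both against the targets: >= 98% applied within 7 days,
# rollback rate under 0.5%.
```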
5. Time-to-acknowledge incident SLO
Definition: time from incident detection (alert fires) to human acknowledgement.
Target example: “P0 incidents acknowledged within 15 minutes; P1 within 1 hour; measured per quarter.”
Why it matters: this is an operational SLO, not a system SLO. It drives staffing, on-call practices, and alerting quality.
Error budgets
Error budgets are the operational use of SLOs. If your availability SLO is 99%, the total budget is 1% unavailability; at 99.5% actual availability you have consumed half of it, leaving 0.5 percentage points of budget.
Practical use:
- Error budget at full → free to ship features quickly, take more deployment risk
- Error budget low → slow deployments, focus on reliability work
- Error budget exhausted → freeze feature work until budget recovers
Each SLO has its own error budget. The product owner allocates engineering time based on which budgets are healthy.
For IoT specifically, the error budget often gets consumed by fleet-wide events (a bad OTA, a broker outage, a customer’s Wi-Fi network change). Plan for these — error budget burndown isn’t always linear.
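For concreteness, the budget arithmetic from the example above as a small helper. The availability figures are the ones from the text; the function itself is generic.

```python
def error_budget_remaining(slo_target: float, actual: float) -> float:
    """Fraction of the error budget still unspent for the period.

    A 99% availability SLO allows 1% 'bad' time; at 99.5% actual,
    half the budget is spent and half remains.
    """
    budget = 1.0 - slo_target          # e.g. 0.01
    spent = max(1.0 - actual, 0.0)     # e.g. 0.005
    if budget <= 0:
        return 0.0
    return max(1.0 - spent / budget, 0.0)

print(error_budget_remaining(0.99, 0.995))  # ~0.5, half the budget left
```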
What makes a good SLO
Three properties:
- Customer-meaningful — the SLO measures something the customer would notice if violated. Internal performance metrics that customers don’t see make poor SLOs.
- Achievable — the SLO target is realistic for the system as designed. A 99.99% availability SLO on a single-region system is wishful; the architecture has to support the target.
- Actionable — when the SLO is at risk, the on-call team has clear actions to take. SLOs that fire without remediation paths waste people’s time.
The trap: SLOs based on fleet averages
A fleet-wide average SLO can hide localised problems. “99% of devices are online” sounds good — until you realise that one customer’s 50 devices are all offline.
Defence: SLOs measured at the right granularity. Per-customer SLOs catch customer-specific issues. Per-region SLOs catch regional problems. Combine them.
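A sketch of the per-customer cut, assuming you have a device-to-customer mapping; the same grouping works per region or per firmware version.

```python
from collections import defaultdict

def availability_by_customer(online: dict[str, bool],
                             customer_of: dict[str, str]) -> dict[str, float]:
    """Availability computed per customer instead of fleet-wide.

    `online` maps device_id -> whether it reported within the SLO window;
    `customer_of` maps device_id -> customer_id (assumed mappings).
    """
    totals, up = defaultdict(int), defaultdict(int)
    for device_id, is_online in online.items():
        cust = customer_of[device_id]
        totals[cust] += 1
        up[cust] += is_online
    return {c: up[c] / totals[c] for c in totals}

# A 99% fleet-wide figure can coexist with one customer sitting at 0% here.
```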
Aligning SLOs with contracts
For B2B IoT products, SLOs often align with customer Service Level Agreements:
- The SLA promises the customer 99% uptime
- The SLO is internal and tighter — say 99.5% — to leave headroom
- The error budget is the gap between SLA promise and SLO target
Alignment matters. SLOs that don’t translate to SLA performance are operational theatre. SLOs tighter than SLAs preserve customer trust during partial outages.
Alert design from SLOs
The mature pattern: alerts based on error budget burn rate, not absolute error rate.
- “Telemetry loss rate is 1% right now” — possibly a transient spike, possibly a real issue
- “Telemetry loss rate is burning the monthly error budget at 14x normal speed” — definitely an issue worth waking someone up for
This nuance matters because IoT systems have varied baselines. A 1% loss rate during a known network event might be fine; the same rate during a calm period is alarming.
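A simplified burn-rate check, loosely following the multi-window pattern: page only when both a short and a long window are burning the budget fast. The 14x threshold echoes the example above; the 98% completeness target and the window pairing are illustrative choices, not fixed rules.

```python
def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is burning.

    A burn rate of 1.0 would exhaust the budget exactly at period end.
    """
    budget = 1.0 - slo_target
    return bad_fraction / budget if budget > 0 else float("inf")

def should_page(short_window_bad: float, long_window_bad: float,
                slo_target: float = 0.98) -> bool:
    """Require both windows (say 1h and 6h) to burn fast, to avoid flapping."""
    return (burn_rate(short_window_bad, slo_target) >= 14.0
            and burn_rate(long_window_bad, slo_target) >= 14.0)

# Against a 98% completeness SLO, 1% loss is a 0.5x burn (no page);
# 30% loss sustained over both windows is a 15x burn (page).
print(should_page(0.30, 0.30))  # True
print(should_page(0.01, 0.01))  # False
```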
For broader observability practice see our fleet observability post.
What we typically build
For an SLO engagement on a connected product:
- SLO framework document — definitions, targets, measurement methodology
- SLI implementation — the queries / metrics that compute the SLOs in practice
- Error budget dashboards — burn rate per SLO, time remaining at current rate
- Alerting rules — burn-rate-based alerts at multiple sensitivity levels
- Quarterly SLO review process — adjust targets based on observed reality
- Customer-facing status page — selective public exposure of relevant SLOs
SLOs only work when the team uses them. The technical implementation is straightforward; the cultural integration is the work.
If you are designing SLOs for an IoT product, or operating one without explicit SLOs in place: this is a combination we have shipped across multiple products.