Edge-Cloud Reconciliation Pattern

Context

The FlexGalaxy.AI platform manages fleets of edge devices (robots, PDAs) that operate in environments with intermittent connectivity. While a device is offline, the platform cannot observe its actual state, yet it must keep making decisions about the rest of the fleet. When the device reconnects, its actual state may differ from what the platform predicted.

This creates three problems:

State divergence — the platform’s projected state and the device’s actual state may not match
Decision integrity — decisions made during the offline period were based on projections that may be wrong
Fleet coordination — other devices may have been replanned around the projected state of the offline device

Without a formal reconciliation process, the platform either freezes (waiting for all devices to be online) or makes uncoordinated decisions (ignoring offline devices entirely). Neither is acceptable.

Pattern

State Lifecycle

Edge-cloud reconciliation is built on the three-state model defined in the Execution Contract pattern:

Live ──→ Projected ──→ Reconciled ──→ Live
         (offline)     (reconnects)
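
The lifecycle above can be read as a small state machine. The following is a minimal sketch, assuming only the three transitions drawn in the diagram are legal. The names `DeviceState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical and are not part of the platform's API.

```python
from enum import Enum


class DeviceState(Enum):
    LIVE = "live"
    PROJECTED = "projected"
    RECONCILED = "reconciled"


# Allowed transitions, mirroring the lifecycle diagram.
# Assumption: no other transitions exist (the source does not describe any).
ALLOWED_TRANSITIONS = {
    DeviceState.LIVE: {DeviceState.PROJECTED},
    DeviceState.PROJECTED: {DeviceState.RECONCILED},
    DeviceState.RECONCILED: {DeviceState.LIVE},
}


def transition(current: DeviceState, target: DeviceState) -> DeviceState:
    """Return the target state if the move is allowed, else raise ValueError."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Rejecting anything outside the diagram keeps a device from skipping the Reconciled step, which is where divergence is processed.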

Each transition triggers specific platform behavior:

Live → Projected (Connection Lost)

When telemetry stops arriving from a device:

The Execution Manager marks the device as Projected
The platform begins estimating the device’s state based on its active execution contract — expected position, task progress, battery drain rate
The device’s contracted tasks are locked — they cannot be reassigned to other devices
The Planner replans for connected devices only, treating the offline device’s tasks as unavailable
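
One way to picture the estimation step is a linear extrapolation from the last known values. This is only a sketch under strong assumptions: progress and battery drain are treated as linear in time, and `ContractSnapshot` with all of its fields is hypothetical. The platform's actual projection model is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractSnapshot:
    """Last known values when telemetry stopped (hypothetical fields)."""
    task_progress: float          # fraction complete, 0.0 to 1.0
    expected_task_seconds: float  # contracted duration of the whole task
    battery_pct: float            # last reported battery level
    drain_pct_per_hour: float     # assumed constant drain rate


def project_state(snapshot: ContractSnapshot, offline_seconds: float) -> dict:
    """Linearly extrapolate task progress and battery over the offline period."""
    progress_gain = offline_seconds / snapshot.expected_task_seconds
    progress = min(1.0, snapshot.task_progress + progress_gain)
    drained = snapshot.drain_pct_per_hour * offline_seconds / 3600.0
    battery = max(0.0, snapshot.battery_pct - drained)
    return {"task_progress": progress, "battery_pct": battery}
```

For example, a device that was 50% through a 20-minute task and has been offline for 6 minutes would be projected at roughly 80% progress under this model.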

Projected → Reconciled (Device Reconnects)

When the device comes back online and reports its actual state:

The Execution Manager compares projected state vs actual state
Divergence is classified:

Divergence Type	Example	Impact
Position	Robot projected at Rack B-12, actually at charging station	Medium — task incomplete
Task completion	Projected 60% done, actually 100% done	Low — better than expected
Task failure	Projected in-progress, actually failed and holding	High — needs recovery
Battery	Projected 45%, actually 18%	High — may need immediate charging
No divergence	Projected state matches actual	None — resume as Live

The device enters Reconciled state while the platform processes the differences
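
The classification table can be sketched as a comparison of two snapshots. The code below is illustrative only: `DeviceSnapshot`, `classify`, `overall_impact`, and the 10-point battery tolerance are assumptions, and only the divergence types listed in the table are covered.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class DeviceSnapshot:
    location: str
    task_status: str      # "in_progress", "completed", or "failed"
    task_progress: float  # fraction complete, 0.0 to 1.0
    battery_pct: float


BATTERY_TOLERANCE_PCT = 10.0  # assumed threshold, not taken from the source


def classify(projected: DeviceSnapshot, actual: DeviceSnapshot) -> list:
    """Return (divergence_type, Impact) pairs; an empty list means no divergence."""
    findings = []
    if actual.task_status == "failed" and projected.task_status != "failed":
        findings.append(("task_failure", Impact.HIGH))
    elif actual.task_progress > projected.task_progress:
        findings.append(("task_completion", Impact.LOW))
    if actual.location != projected.location:
        findings.append(("position", Impact.MEDIUM))
    if projected.battery_pct - actual.battery_pct > BATTERY_TOLERANCE_PCT:
        findings.append(("battery", Impact.HIGH))
    return findings


def overall_impact(findings: list) -> Impact:
    """The highest impact among the findings, or Impact.NONE if there are none."""
    return max((impact for _, impact in findings),
               key=lambda impact: impact.value, default=Impact.NONE)
```

Taking the maximum impact gives the platform a single severity to act on even when several divergence types occur together.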

Reconciled → Live (Divergence Resolution)

The reconciliation outcome depends on the severity and type of divergence:

Scenario	Resolution
No divergence	Transition to Live immediately
Minor divergence (position off, task complete)	Update state, transition to Live
Major divergence (task failed, battery critical)	Trigger replanning, consult Policy Service for recovery

When major divergence is detected:

The Execution Manager consults the Policy Service for the applicable failure recovery strategy
The Planner runs a global replan that accounts for the corrected state of the reconnected device
Other devices may receive updated contracts if the replanning affects their assignments
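
The resolution table and the major-divergence steps can be sketched as a single dispatch function. Everything named here is hypothetical: `resolve`, the `MAJOR` set, and the two callbacks standing in for the Policy Service and the Planner. The sketch also assumes the device returns to Live once recovery has been requested, which the source does not state explicitly.

```python
from typing import Callable, List

# Divergence types treated as major (assumed labels, following the table).
MAJOR = {"task_failure", "battery_critical"}


def resolve(
    divergences: List[str],
    policy_lookup: Callable[[str], str],
    request_replan: Callable[[], None],
) -> List[str]:
    """Return the ordered list of actions taken for a reconnected device."""
    actions = []
    major = [d for d in divergences if d in MAJOR]
    if major:
        # Consult the policy for each major divergence, then replan once.
        for divergence in major:
            actions.append(f"policy:{policy_lookup(divergence)}")
        request_replan()
        actions.append("global_replan")
    elif divergences:
        # Minor divergence: correct the stored state only.
        actions.append("update_state")
    actions.append("transition_to_live")
    return actions
```

Passing the Policy Service and Planner in as callables keeps the recovery strategy outside the dispatch logic, in line with the policy-driven approach described in this pattern.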

Fleet-Wide Reconciliation

When multiple devices reconnect simultaneously (e.g., after a network outage is resolved):

Device A reconnects ──┐
Device B reconnects ──┤──→ Batch reconciliation ──→ Single global replan
Device C reconnects ──┘

The platform batches reconciliation events within a short window to avoid triggering multiple sequential replans. A single global replan incorporates all reconciled states at once.
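
A minimal sketch of the batching idea follows. It assumes a fixed window that opens on the first reconnect and closes after `window_seconds`. The class name, the 5-second default, and the explicit `now` arguments (used instead of a real clock to keep the example deterministic) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReconciliationBatcher:
    window_seconds: float = 5.0  # assumed value; the source only says "short window"
    _pending: List[str] = field(default_factory=list)
    _window_start: Optional[float] = None

    def add(self, device_id: str, now: float) -> None:
        """Record a reconnect; the first one in a batch opens the window."""
        if self._window_start is None:
            self._window_start = now
        self._pending.append(device_id)

    def flush_if_due(self, now: float) -> List[str]:
        """Return the batch once the window has elapsed, else an empty list."""
        if self._window_start is None:
            return []
        if now - self._window_start < self.window_seconds:
            return []
        batch, self._pending = self._pending, []
        self._window_start = None
        return batch
```

A caller would run one global replan per non-empty batch returned by `flush_if_due`, so devices A, B, and C reconnecting within the same window lead to a single replan.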

Consequences

Benefits

No frozen fleet — the platform continues operating with connected devices while offline devices hold their contracts
No double-assignment — projected state prevents the platform from reassigning work that an offline device may still be executing
Graceful correction — when projections are wrong, the platform corrects course rather than failing
Policy-driven recovery — divergence resolution follows application-defined policies, not hardcoded logic

Trade-offs

Projection accuracy degrades over time — the longer a device is offline, the less reliable the projected state; highly dynamic environments (busy warehouses) diverge faster than predictable ones (cleaning routes)
Reconciliation storms — a large-scale network recovery can produce many simultaneous reconciliation events, requiring batch processing and potentially expensive replanning
Replanning cascades — correcting one device’s state may ripple into reassignments for other devices, especially in tightly coupled environments like warehouse operations

Design Decisions

Projected tasks are locked, not reassigned — this is a deliberate choice. Reassigning a projected task risks duplication when the original device reconnects and completes the work. The cost of a delayed task is lower than the cost of conflicting execution.
Edge planner handles local adaptation — the two-tier planning model means the edge device can react to local obstacles without needing to reconcile with the cloud. Only strategic-level divergence (task completion, failure, position) triggers cloud reconciliation.
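
The first decision, locking rather than reassigning, can be sketched as a small registry that the Planner would consult before assigning work. `TaskLockRegistry` and its methods are hypothetical and only illustrate the rule that tasks held by a Projected device are excluded from assignment until that device has been reconciled.

```python
from typing import Dict, List


class TaskLockRegistry:
    """Tracks tasks held by Projected devices (hypothetical helper)."""

    def __init__(self) -> None:
        self._owner_by_task: Dict[str, str] = {}

    def lock(self, device_id: str, task_ids: List[str]) -> None:
        """Lock the contracted tasks of a device that has gone offline."""
        for task_id in task_ids:
            self._owner_by_task[task_id] = device_id

    def release(self, device_id: str) -> None:
        """Drop all locks held by a device, e.g. after it has been reconciled."""
        self._owner_by_task = {
            task: owner
            for task, owner in self._owner_by_task.items()
            if owner != device_id
        }

    def assignable(self, task_ids: List[str]) -> List[str]:
        """Filter out tasks that are locked by an offline device."""
        return [task for task in task_ids if task not in self._owner_by_task]
```

Filtering at assignment time means a delayed task stays delayed instead of being executed twice, which is the trade this design decision accepts.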

Examples

WES: Multi-Step Task Divergence

An AMR is picking items in a warehouse and loses connectivity after completing 2 of 4 picks:

Projected state: AMR heading to Rack C-3 for pick 3 (estimated based on travel time)
Actual state: AMR completed pick 3 but encountered a blocked aisle, holding at Rack C-3

On reconnection:

Reconciliation detects: position matches the projection (Rack C-3), but pick 3 is already complete and the device is holding because of the blocked aisle
Policy Service returns: reassign pick 4 to another AMR, send this AMR to charging
Planner replans remaining picks across available AMRs

ClearJanitor: Completed Ahead of Projection

A cleaning robot loses connectivity during a nightly cleaning route:

Projected state: 70% of Floor 3 cleaned (based on elapsed time and route)
Actual state: 100% of Floor 3 cleaned (robot moved faster than projected)

On reconnection:

Reconciliation detects: a low-impact task-completion divergence, because the robot finished ahead of the projection
State updated, robot transitions to Live
If additional floors are queued, the Scheduler may assign the next job earlier than planned