Edge-Cloud Reconciliation Pattern¶
Context¶
The FlexGalaxy.AI platform manages fleets of edge devices (robots, PDAs) that operate in environments with intermittent connectivity. When a device goes offline, the platform cannot observe its actual state — but must continue making decisions about the rest of the fleet. When the device reconnects, its actual state may differ from what the platform predicted.
This creates three problems:
- **State divergence** — the platform’s projected state and the device’s actual state may not match
- **Decision integrity** — decisions made during the offline period were based on projections that may be wrong
- **Fleet coordination** — other devices may have been replanned around the projected state of the offline device
Without a formal reconciliation process, the platform either freezes (waiting for all devices to be online) or makes uncoordinated decisions (ignoring offline devices entirely). Neither is acceptable.
Pattern¶
State Lifecycle¶
Edge-cloud reconciliation is built on the three-state model defined in the Execution Contract pattern:
```
Live ──→ Projected ──→ Reconciled ──→ Live
   (offline)      (reconnects)
```
Each transition triggers specific platform behavior:
Live → Projected (Connection Lost)¶
When telemetry stops arriving from a device:
- The Execution Manager marks the device as Projected
- The platform begins estimating the device’s state based on its active execution contract — expected position, task progress, battery drain rate
- The device’s contracted tasks are locked — they cannot be reassigned to other devices
- The Planner replans for connected devices only, treating the offline device’s tasks as unavailable
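As a rough sketch of this transition, assuming a hypothetical `Device` record, an `on_telemetry_lost` hook, and a simple linear battery-drain model (none of these names are the platform’s actual API):

```python
import time
from dataclasses import dataclass, field
from enum import Enum

class DeviceState(Enum):
    LIVE = "live"
    PROJECTED = "projected"
    RECONCILED = "reconciled"

@dataclass
class Device:
    device_id: str
    state: DeviceState = DeviceState.LIVE
    battery_pct: float = 100.0
    locked_tasks: set = field(default_factory=set)
    offline_since: float = 0.0

def on_telemetry_lost(device: Device, contract_task_ids: list) -> None:
    """Live -> Projected: lock the contracted tasks so the Planner
    cannot reassign them while the device is offline."""
    device.state = DeviceState.PROJECTED
    device.offline_since = time.time()
    device.locked_tasks.update(contract_task_ids)

def projected_battery(device: Device, drain_pct_per_min: float,
                      now: float = None) -> float:
    """Estimate remaining battery from the contract's expected drain rate
    (a linear model for illustration; real estimation would use task load)."""
    now = time.time() if now is None else now
    offline_min = (now - device.offline_since) / 60.0
    return max(0.0, device.battery_pct - drain_pct_per_min * offline_min)
```

The key property is that locking happens at the moment telemetry stops, not later — the Planner only ever sees the locked task set.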
Projected → Reconciled (Device Reconnects)¶
When the device comes back online and reports its actual state:
- The Execution Manager compares projected state vs actual state
- Divergence is classified:

| Divergence Type | Example | Impact |
|---|---|---|
| Position | Robot projected at Rack B-12, actually at charging station | Medium — task incomplete |
| Task completion | Projected 60% done, actually 100% done | Low — better than expected |
| Task failure | Projected in-progress, actually failed and holding | High — needs recovery |
| Battery | Projected 45%, actually 18% | High — may need immediate charging |
| No divergence | Projected state matches actual | None — resume as Live |

- The device enters Reconciled state while the platform processes the differences
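The classification step could be sketched like this; the field names (`task_status`, `battery_pct`, `position`, `progress`) and the 20-point battery threshold are assumptions for illustration, not the platform’s schema:

```python
from enum import Enum

class Severity(Enum):
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def classify(projected: dict, actual: dict) -> tuple:
    """Classify projected-vs-actual divergence, checking the
    highest-impact conditions first (failure, deep battery divergence)."""
    if actual["task_status"] == "failed":
        return "task_failure", Severity.HIGH
    if projected["battery_pct"] - actual["battery_pct"] > 20:  # assumed threshold
        return "battery", Severity.HIGH
    if actual["position"] != projected["position"]:
        return "position", Severity.MEDIUM
    if actual["progress"] > projected["progress"]:
        return "task_completion", Severity.LOW
    return "none", Severity.NONE
```

Ordering matters: a device can diverge on several axes at once, so the classifier returns the most severe match.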
Reconciled → Live (Divergence Resolution)¶
The reconciliation outcome depends on the severity and type of divergence:
| Scenario | Resolution |
|---|---|
| No divergence | Transition to Live immediately |
| Minor divergence (position off, task complete) | Update state, transition to Live |
| Major divergence (task failed, battery critical) | Trigger replanning, consult Policy Service for recovery |
When major divergence is detected:
- The Execution Manager consults the Policy Service for the applicable failure recovery strategy
- The Planner triggers a global replan that accounts for the corrected state of the reconnected device
- Other devices may receive updated contracts if the replanning affects their assignments
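A sketch tying the resolution table to the major-divergence steps; `PolicyService` and `Planner` here are stand-in stubs for the real services, and the return values are illustrative:

```python
from enum import Enum

class Severity(Enum):
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3

class PolicyService:
    """Stub: a real Policy Service looks up application-defined recovery."""
    def recovery_strategy(self, device_id: str, divergence_type: str) -> str:
        return "reassign_remaining_tasks"

class Planner:
    """Stub: a real Planner runs a global replan over corrected states
    and may emit updated contracts for other devices."""
    def global_replan(self, corrected_states: dict) -> dict:
        return {dev_id: f"contract-v2/{dev_id}" for dev_id in corrected_states}

def reconcile(device_id, divergence_type, severity, fleet_states,
              policy=PolicyService(), planner=Planner()):
    """Map severity to the resolution table: no divergence -> Live,
    minor -> update then Live, major -> policy lookup + global replan."""
    if severity is Severity.NONE:
        return "live", None, {}
    if severity in (Severity.LOW, Severity.MEDIUM):
        return "live_after_update", None, {}
    strategy = policy.recovery_strategy(device_id, divergence_type)
    contracts = planner.global_replan(fleet_states)
    return "replanned", strategy, contracts
```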
Fleet-Wide Reconciliation¶
When multiple devices reconnect simultaneously (e.g., after a network outage is resolved):
```
Device A reconnects ──┐
Device B reconnects ──┤──→ Batch reconciliation ──→ Single global replan
Device C reconnects ──┘
```
The platform batches reconciliation events within a short window to avoid triggering multiple sequential replans. A single global replan incorporates all reconciled states at once.
Consequences¶
Benefits¶
- **No frozen fleet** — the platform continues operating with connected devices while offline devices hold their contracts
- **No double-assignment** — projected state prevents the platform from reassigning work that an offline device may still be executing
- **Graceful correction** — when projections are wrong, the platform corrects course rather than failing
- **Policy-driven recovery** — divergence resolution follows application-defined policies, not hardcoded logic
Trade-offs¶
- **Projection accuracy degrades over time** — the longer a device is offline, the less reliable the projected state; highly dynamic environments (busy warehouses) diverge faster than predictable ones (cleaning routes)
- **Reconciliation storms** — a large-scale network recovery can produce many simultaneous reconciliation events, requiring batch processing and potentially expensive replanning
- **Replanning cascades** — correcting one device’s state may ripple into reassignments for other devices, especially in tightly coupled environments like warehouse operations
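The first trade-off can be made concrete with a simple exponential confidence model, where a smaller time constant models a more dynamic environment (the model and constants are illustrative, not part of the platform):

```python
import math

def projection_confidence(offline_minutes: float, tau_minutes: float) -> float:
    """Confidence in a projected state decays with offline time;
    a smaller tau (more dynamic environment) means faster decay."""
    return math.exp(-offline_minutes / tau_minutes)

# e.g. a busy warehouse (tau = 10 min) vs a predictable cleaning route
# (tau = 60 min): after 30 minutes offline, the warehouse projection is
# far less trustworthy than the cleaning-route projection.
```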
Design Decisions¶
- **Projected tasks are locked, not reassigned** — this is a deliberate choice. Reassigning a projected task risks duplication when the original device reconnects and completes the work. The cost of a delayed task is lower than the cost of conflicting execution.
- **Edge planner handles local adaptation** — the two-tier planning model means the edge device can react to local obstacles without needing to reconcile with the cloud. Only strategic-level divergence (task completion, failure, position) triggers cloud reconciliation.
Examples¶
WES — Multi-Step Task Divergence¶
An AMR is picking items in a warehouse and loses connectivity after completing 2 of 4 picks:
- **Projected state:** AMR heading to Rack C-3 for pick 3 (estimated based on travel time)
- **Actual state:** AMR completed pick 3 but encountered a blocked aisle, holding at Rack C-3
On reconnection:
- Reconciliation detects: position matches projection, but task progress is further than expected and the device is holding due to a blockage
- Policy Service returns: reassign pick 4 to another AMR, send this AMR to charging
- Planner replans remaining picks across available AMRs
ClearJanitor — Completed Ahead of Projection¶
A cleaning robot loses connectivity during a nightly cleaning route:
- **Projected state:** 70% of Floor 3 cleaned (based on elapsed time and route)
- **Actual state:** 100% of Floor 3 cleaned (robot moved faster than projected)
On reconnection:
- Reconciliation detects: favorable divergence — the robot did better than projected, so nothing needs recovery
- State updated, robot transitions to Live
- If additional floors are queued, the Scheduler may assign the next job earlier than planned