Edge-Cloud Reconciliation Pattern

Context

The FlexGalaxy.AI platform manages fleets of edge devices (robots, PDAs) that operate in environments with intermittent connectivity. When a device goes offline, the platform cannot observe its actual state, yet it must continue making decisions about the rest of the fleet. When the device reconnects, its actual state may differ from what the platform predicted.

This creates three problems:

  1. State divergence — the platform’s projected state and the device’s actual state may not match

  2. Decision integrity — decisions made during the offline period were based on projections that may be wrong

  3. Fleet coordination — other devices may have been replanned around the projected state of the offline device

Without a formal reconciliation process, the platform either freezes (waiting for all devices to be online) or makes uncoordinated decisions (ignoring offline devices entirely). Neither is acceptable.

Pattern

State Lifecycle

Edge-cloud reconciliation is built on the three-state model defined in the Execution Contract pattern:

Live ──→ Projected ──→ Reconciled ──→ Live
         (offline)     (reconnects)
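
As an illustration, the lifecycle above can be sketched as a small state machine. This is a minimal sketch, not the platform's implementation: the names DeviceState, ALLOWED_TRANSITIONS, and transition are hypothetical, and the sketch encodes only the three arrows shown in the diagram.

```python
from enum import Enum


class DeviceState(Enum):
    LIVE = "live"
    PROJECTED = "projected"
    RECONCILED = "reconciled"


# Hypothetical transition table: only the three arrows from the diagram.
ALLOWED_TRANSITIONS = {
    DeviceState.LIVE: {DeviceState.PROJECTED},        # connection lost
    DeviceState.PROJECTED: {DeviceState.RECONCILED},  # device reconnects
    DeviceState.RECONCILED: {DeviceState.LIVE},       # divergence resolved
}


def transition(current: DeviceState, target: DeviceState) -> DeviceState:
    """Return the target state, or raise if the lifecycle does not allow the move."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Rejecting every other move (for example Live directly to Reconciled) keeps the lifecycle explicit; whether the real platform permits additional transitions is not stated here.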

Each transition triggers specific platform behavior:

Live → Projected (Connection Lost)

When telemetry stops arriving from a device:

  1. The Execution Manager marks the device as Projected

  2. The platform begins estimating the device’s state based on its active execution contract — expected position, task progress, battery drain rate

  3. The device’s contracted tasks are locked — they cannot be reassigned to other devices

  4. The Planner replans for connected devices only, treating the offline device’s tasks as unavailable
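
The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the Contract, Device, and Fleet types, their field names, and the linear projection formula are all hypothetical, chosen only to make the steps concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    task_ids: list[str]
    progress_at_loss: float       # assumed: fraction of the contract done when telemetry stopped
    battery_pct_at_loss: float    # assumed: last reported battery level
    drain_pct_per_min: float      # assumed: expected battery drain rate
    expected_duration_min: float  # assumed: expected total contract duration


@dataclass
class Device:
    device_id: str
    contract: Contract
    state: str = "live"


@dataclass
class Fleet:
    devices: dict[str, Device] = field(default_factory=dict)
    locked_tasks: set[str] = field(default_factory=set)

    def on_connection_lost(self, device_id: str) -> None:
        device = self.devices[device_id]
        device.state = "projected"                          # step 1: mark as Projected
        self.locked_tasks.update(device.contract.task_ids)  # step 3: lock contracted tasks

    def plannable_devices(self) -> list[str]:
        # Step 4: the Planner only considers devices that are still Live.
        return [d.device_id for d in self.devices.values() if d.state == "live"]


def project(contract: Contract, minutes_offline: float) -> dict[str, float]:
    """Step 2: estimate state from the contract (assumes linear progress and drain)."""
    progress = min(1.0, contract.progress_at_loss
                   + minutes_offline / contract.expected_duration_min)
    battery = max(0.0, contract.battery_pct_at_loss
                  - contract.drain_pct_per_min * minutes_offline)
    return {"task_progress": progress, "battery_pct": battery}
```

A real projection would also estimate position from the planned route; that is omitted here to keep the sketch short.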

Projected → Reconciled (Device Reconnects)

When the device comes back online and reports its actual state:

  1. The Execution Manager compares projected state vs actual state

  2. Divergence is classified:

Divergence Type   Example                                                      Impact
---------------   ----------------------------------------------------------   ----------------------------------
Position          Robot projected at Rack B-12, actually at charging station   Medium (task incomplete)
Task completion   Projected 60% done, actually 100% done                       Low (better than expected)
Task failure      Projected in-progress, actually failed and holding           High (needs recovery)
Battery           Projected 45%, actually 18%                                  High (may need immediate charging)
No divergence     Projected state matches actual                               None (resume as Live)

  3. The device enters Reconciled state while the platform processes the differences
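
The comparison and classification steps could look roughly like the sketch below. It assumes a hypothetical DeviceSnapshot type and a hypothetical battery tolerance of 10 percentage points; the impact levels follow the classification table above. Cases the table does not cover (for example, actual progress behind projection) are deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class DeviceSnapshot:
    position: str
    task_status: str      # assumed values: "in_progress", "completed", "failed"
    task_progress: float  # 0.0 to 1.0
    battery_pct: float


def classify(projected: DeviceSnapshot, actual: DeviceSnapshot,
             battery_tolerance_pct: float = 10.0) -> list[tuple[str, Impact]]:
    """Compare projected vs actual state; an empty list means no divergence."""
    findings = []
    if actual.task_status == "failed" and projected.task_status != "failed":
        findings.append(("task_failure", Impact.HIGH))
    elif actual.task_progress > projected.task_progress:
        findings.append(("task_completion", Impact.LOW))
    if actual.position != projected.position:
        findings.append(("position", Impact.MEDIUM))
    # Only a battery level well below the projection is flagged.
    if projected.battery_pct - actual.battery_pct > battery_tolerance_pct:
        findings.append(("battery", Impact.HIGH))
    return findings


def overall_impact(findings: list[tuple[str, Impact]]) -> Impact:
    """The most severe impact among the findings, or NONE if there are none."""
    return max((impact for _, impact in findings),
               key=lambda impact: impact.value, default=Impact.NONE)
```

For instance, a projection of 45% battery against an actual 18% yields a single "battery" finding with High impact, matching the table row.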

Reconciled → Live (Divergence Resolution)

The reconciliation outcome depends on the severity and type of divergence:

Scenario                                            Resolution
-------------------------------------------------   -------------------------------------------------------
No divergence                                       Transition to Live immediately
Minor divergence (position off, task complete)      Update state, transition to Live
Major divergence (task failed, battery critical)    Trigger replanning, consult Policy Service for recovery

When a major divergence is detected:

  1. The Execution Manager consults the Policy Service for the applicable failure recovery strategy

  2. The Planner triggers a global replan that accounts for the corrected state of the reconnected device

  3. Other devices may receive updated contracts if the replanning affects their assignments
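
A minimal sketch of this resolution logic is shown below. The Policy Service and Planner are stood in for by plain callables (policy_lookup and global_replan), and the divergence labels and the Resolution type are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed labels for the divergences treated as major.
MAJOR_DIVERGENCES = {"task_failure", "battery_critical"}


@dataclass
class Resolution:
    state_updated: bool
    recovery_strategy: Optional[str]
    replanned: bool


def resolve(device_id: str,
            divergences: set[str],
            policy_lookup: Callable[[str, set[str]], str],
            global_replan: Callable[[], None]) -> Resolution:
    """Decide what happens before the device returns to Live."""
    if not divergences:
        # No divergence: transition to Live immediately.
        return Resolution(state_updated=False, recovery_strategy=None, replanned=False)
    major = divergences & MAJOR_DIVERGENCES
    if not major:
        # Minor divergence: update state, then transition to Live.
        return Resolution(state_updated=True, recovery_strategy=None, replanned=False)
    # Major divergence: consult the policy stand-in, then replan globally.
    strategy = policy_lookup(device_id, major)
    global_replan()
    return Resolution(state_updated=True, recovery_strategy=strategy, replanned=True)
```

Passing the policy lookup in as a parameter mirrors the policy-driven design: the recovery strategy comes from outside the reconciliation code rather than being hardcoded in it.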

Fleet-Wide Reconciliation

When multiple devices reconnect simultaneously (e.g., after a network outage is resolved):

Device A reconnects ──┐
Device B reconnects ──┤──→ Batch reconciliation ──→ Single global replan
Device C reconnects ──┘

The platform batches reconciliation events within a short window to avoid triggering multiple sequential replans. A single global replan incorporates all reconciled states at once.
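
One way to picture the batching is to group reconnect events by arrival time, as in the sketch below. The function name batch_reconnects and the window length are assumptions; the length of the "short window" is not specified here, so the value in the usage lines is purely illustrative.

```python
def batch_reconnects(events: list[tuple[float, str]],
                     window_s: float) -> list[list[str]]:
    """Group (timestamp_seconds, device_id) reconnect events into batches.

    A batch opens with the first event and collects every event that arrives
    within window_s seconds of that opening event. Each batch would then feed
    a single global replan.
    """
    batches: list[list[str]] = []
    window_start = None
    for timestamp, device_id in sorted(events):
        if window_start is None or timestamp - window_start > window_s:
            batches.append([])        # open a new batch
            window_start = timestamp
        batches[-1].append(device_id)
    return batches


# Illustrative usage with a hypothetical 5-second window.
events = [(0.0, "A"), (1.5, "B"), (2.0, "C"), (30.0, "D")]
batches = batch_reconnects(events, window_s=5.0)
# Devices A, B, and C share one batch; D arrives later and gets its own.
```

A fixed window anchored at the first event bounds how long a reconnected device waits for its replan; a sliding window would batch more aggressively but could delay reconciliation during a long trickle of reconnects.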

Consequences

Benefits

  • No frozen fleet — the platform continues operating with connected devices while offline devices hold their contracts

  • No double-assignment — projected state prevents the platform from reassigning work that an offline device may still be executing

  • Graceful correction — when projections are wrong, the platform corrects course rather than failing

  • Policy-driven recovery — divergence resolution follows application-defined policies, not hardcoded logic

Trade-offs

  • Projection accuracy degrades over time — the longer a device is offline, the less reliable the projected state; highly dynamic environments (busy warehouses) diverge faster than predictable ones (cleaning routes)

  • Reconciliation storms — a large-scale network recovery can produce many simultaneous reconciliation events, requiring batch processing and potentially expensive replanning

  • Replanning cascades — correcting one device’s state may ripple into reassignments for other devices, especially in tightly coupled environments like warehouse operations

Design Decisions

  • Projected tasks are locked, not reassigned — this is a deliberate choice. Reassigning a projected task risks duplication when the original device reconnects and completes the work. The cost of a delayed task is lower than the cost of conflicting execution.

  • Edge planner handles local adaptation — the two-tier planning model means the edge device can react to local obstacles without needing to reconcile with the cloud. Only strategic-level divergence (task completion, failure, position) triggers cloud reconciliation.

Examples

WES — Multi-Step Task Divergence

An AMR is picking items in a warehouse and loses connectivity after completing 2 of 4 picks:

  • Projected state: AMR heading to Rack C-3 for pick 3 (estimated based on travel time)

  • Actual state: AMR completed pick 3 but encountered a blocked aisle, holding at Rack C-3

On reconnection:

  1. Reconciliation detects: position matches projection, but task progress is further along than projected (pick 3 is already complete) and the device is holding due to a blockage

  2. Policy Service returns: reassign pick 4 to another AMR, send this AMR to charging

  3. Planner replans remaining picks across available AMRs

ClearJanitor — Completed Ahead of Projection

A cleaning robot loses connectivity during a nightly cleaning route:

  • Projected state: 70% of Floor 3 cleaned (based on elapsed time and route)

  • Actual state: 100% of Floor 3 cleaned (robot moved faster than projected)

On reconnection:

  1. Reconciliation detects a low-impact task-completion divergence: the robot finished Floor 3 ahead of projection

  2. State updated, robot transitions to Live

  3. If additional floors are queued, the Scheduler may assign the next job earlier than planned