(Additional contributors, alphabetically: Laz Kopits, Ryan Leong, Sheel Taskar, Daryl Yang, William Zang)

Building robust robotic manipulation policies requires diverse, high-quality demonstration data at scale. The challenge lies not just in collecting data, but in doing so reliably across multiple devices while maintaining temporal synchronization and handling the inevitable network disruptions of real-world deployments. This post details the system architecture behind the SF Fold dataset, a large-scale robotics manipulation dataset collected using UMI-style grippers.

Our Goal: Take the UMI concept from research prototype to production-ready hardware, a device anyone can pick up and use without specialized expertise, seamlessly integrated into cloud-based post-processing pipelines for scalable data collection.

Dataset: huggingface.co/datasets/zeroshotdata/sf_fold


Hardware Setup

The data collection system builds on the UMI (Universal Manipulation Interface) approach, using UMI-style still grippers equipped with OAK-D Wide cameras for 6-DoF pose tracking via ORB-SLAM3 in stereo mode. Throughout this post, we refer to the data collection devices as “puppets”. The setup supports bimanual manipulation tasks through a multi-puppet configuration: a Left Puppet, a Right Puppet, and an Ego Puppet.

Component            Specification
Gripper              UMI-Style Still Gripper
Camera               OAK-D Wide
IMU/Hall Sensors     200Hz sampling rate
Camera Frame Rate    30Hz
Control Interface    Physical button + LED feedback

Button Controls: Long press creates/closes datasets, short press starts recording episodes. Color-coded LEDs display current system status.

OAK-D Wide Camera

The OAK-D Wide provides stereo vision for 6-DoF pose tracking using ORB-SLAM3. The camera is mounted pointing upward on each gripper to avoid occlusions during manipulation tasks.

Interactive 3D model of the OAK-D Wide camera.
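As a concrete reference for how the stereo pair might be read out, the sketch below uses the DepthAI Python API to stream synchronized left/right mono frames from an OAK-D device at 30Hz. The pipeline settings (sockets, queue sizes, frame rate handling) are illustrative assumptions, not the production configuration.

import depthai as dai

pipeline = dai.Pipeline()

# Left/right mono cameras of the OAK-D Wide stereo pair, at the 30Hz frame rate.
left = pipeline.create(dai.node.MonoCamera)
right = pipeline.create(dai.node.MonoCamera)
left.setBoardSocket(dai.CameraBoardSocket.LEFT)
right.setBoardSocket(dai.CameraBoardSocket.RIGHT)
left.setFps(30)
right.setFps(30)

# Stream both feeds to the host.
xout_left = pipeline.create(dai.node.XLinkOut)
xout_right = pipeline.create(dai.node.XLinkOut)
xout_left.setStreamName("left")
xout_right.setStreamName("right")
left.out.link(xout_left.input)
right.out.link(xout_right.input)

with dai.Device(pipeline) as device:
    q_left = device.getOutputQueue("left", maxSize=4, blocking=False)
    q_right = device.getOutputQueue("right", maxSize=4, blocking=False)
    while True:
        frame_l = q_left.get()    # ImgFrame; .getCvFrame() / .getTimestamp()
        frame_r = q_right.get()   # would be handed to the SLAM frontend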


System Architecture

The system follows a distributed architecture where multiple puppets coordinate through a central MQTT broker running on the Ego Puppet. This enables synchronized multi-view data collection while maintaining robustness against network failures.

Puppet Communication Architecture

Architecture showing Right, Ego, and Left Puppets communicating via the central EgoBroker. Components include MQTTCommandListener, ButtonNode, and sensor_app.

Puppet Communication

Puppet interaction follows a state machine: ON → DATASET_ACTIVE → RECORDING → UPLOADING

Each puppet runs three core components:

  • MQTTCommandListener: Receives start/stop commands, triggers local recording scripts
  • ButtonNode: Handles physical button input on Right Puppet, broadcasts commands
  • sensor_app: Publishes sensor data and system status

When a user presses the button on the Right Puppet, the ButtonNode publishes to the Ego Broker, which distributes synchronized commands to all puppets, ensuring simultaneous state transitions and episode alignment.
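A minimal sketch of what an MQTTCommandListener-style component could look like, assuming paho-mqtt with 1.x-style callbacks. The topic name, payload fields, and script paths are hypothetical placeholders, not the actual implementation.

import json
import subprocess
import paho.mqtt.client as mqtt

LOCAL_BROKER = "192.168.1.10"        # local Mosquitto broker (placeholder IP)
COMMAND_TOPIC = "puppets/commands"   # hypothetical command topic

def on_connect(client, userdata, flags, rc):
    client.subscribe(COMMAND_TOPIC)

def on_message(client, userdata, msg):
    command = json.loads(msg.payload)
    # A button press on the Right Puppet arrives here as a broadcast command.
    if command.get("action") == "start_episode":
        subprocess.Popen(["/home/puppet/scripts/start_recording.sh"])   # placeholder path
    elif command.get("action") == "stop_episode":
        subprocess.Popen(["/home/puppet/scripts/stop_recording.sh"])    # placeholder path

client = mqtt.Client()               # paho-mqtt 1.x-style callback API assumed
client.on_connect = on_connect
client.on_message = on_message
client.connect(LOCAL_BROKER, 1884, keepalive=60)
client.loop_forever()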

MQTT Broker Setup

The system implements a dual-broker architecture balancing reliability with discovery:

Broker                 Port    Purpose
Cloud (Google Cloud)   1883    Discovery only
Local (Mosquitto)      1884    All internal communications

Key Design: The cloud broker serves only as a discovery mechanism. All data recording and episode management happens through the local broker, allowing fully offline operation after initial setup.

Cloud Broker Details
  • Runs on Google Cloud infrastructure
  • Each puppet publishes system metrics to <GROUP_NAME>/<PUPPET_SIDE>/system_metrics
  • Metrics include broker status and IP address
  • New puppets subscribe to discover their local broker IP
  • Connection uses standard port 1883 with admin credentials
  • Automatic fallback to searching mode if connection fails
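To make the discovery step concrete, here is a sketch of a puppet subscribing to the cloud broker's system_metrics topics to learn its local broker IP. It assumes paho-mqtt with 1.x-style callbacks, and the payload field names (broker_running, ip_address), group name, and host are illustrative.

import json
import time
import paho.mqtt.client as mqtt

CLOUD_BROKER = "mqtt.example.com"   # placeholder for the cloud broker host
GROUP_NAME = "sf_fold_rig_01"       # hypothetical group name

def on_connect(client, userdata, flags, rc):
    # Listen to every puppet's metrics in this group.
    client.subscribe(f"{GROUP_NAME}/+/system_metrics")

def on_message(client, userdata, msg):
    metrics = json.loads(msg.payload)
    # The Ego Puppet's metrics advertise its local broker status and IP address.
    if metrics.get("broker_running"):
        userdata["local_broker_ip"] = metrics["ip_address"]

found = {}
client = mqtt.Client(userdata=found)           # paho-mqtt 1.x-style callbacks
client.username_pw_set("admin", "<password>")  # admin credentials, placeholder
client.on_connect = on_connect
client.on_message = on_message
client.connect(CLOUD_BROKER, 1883, keepalive=60)
client.loop_start()

while "local_broker_ip" not in found:          # real code falls back to searching mode on timeout
    time.sleep(1)
print("Local broker at:", found["local_broker_ip"], "port 1884")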
Local Broker Details
  • Mosquitto MQTT broker instance
  • Handles all internal communications:
    • Message routing between system components
    • Episode transitions and sensor data
    • State change notifications
  • Maintains message order and delivery during disconnections
  • Functions independently of internet connectivity
  • All system events pass through this broker

Data Collection Flow

The data flow spans from physical demonstration through cloud storage, with multiple checkpoints for reliability.

Data Flow Sequence Diagram

Complete data flow between Demonstrator, Puppet, MQTT Brokers, Flask API, Cloud SQL, GCS, and Pub/Sub.

Status Publishing

Each puppet continuously publishes status (IP, health metrics) to the Cloud MQTT Broker, enabling system monitoring and discovery.
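The publishing side might look like the following sketch, which periodically pushes a small JSON status payload to the cloud broker. The host, reporting interval, and payload fields are illustrative assumptions; only the topic pattern comes from the section above.

import json
import socket
import time
import paho.mqtt.client as mqtt

GROUP_NAME = "sf_fold_rig_01"        # hypothetical group name
PUPPET_SIDE = "left"                 # left / right / ego

client = mqtt.Client()
client.connect("mqtt.example.com", 1883, keepalive=60)   # cloud broker placeholder
client.loop_start()

while True:
    payload = {
        "ip_address": socket.gethostbyname(socket.gethostname()),
        "broker_running": PUPPET_SIDE == "ego",   # only the Ego Puppet hosts Mosquitto
        "timestamp": time.time(),
    }
    client.publish(f"{GROUP_NAME}/{PUPPET_SIDE}/system_metrics", json.dumps(payload))
    time.sleep(10)                                # reporting interval is illustrative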

Acquisition Flow

When the demonstrator presses the start/stop button:

  1. Button press publishes command to local MQTT broker
  2. Command triggers acquisition scripts on all connected puppets
  3. Each puppet records sensor data with precise timestamps
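A sketch of step 1, the button-to-broker publish, assuming a Raspberry Pi GPIO button via gpiozero and paho-mqtt. The pin number, topic, and payload are placeholders, and the long-press dataset handling from the Button Controls note is omitted for brevity.

import json
import time
from signal import pause
from gpiozero import Button          # Raspberry Pi GPIO helper (assumed)
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("192.168.1.10", 1884, keepalive=60)   # local Mosquitto broker, placeholder IP
client.loop_start()

recording = False

def on_press():
    global recording
    recording = not recording
    action = "start_episode" if recording else "stop_episode"
    # A single publish fans out to every connected puppet via the Ego broker.
    client.publish("puppets/commands", json.dumps({"action": action, "ts": time.time()}))

button = Button(17)                  # GPIO pin number is a placeholder
button.when_pressed = on_press
pause()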

Upload Mechanism

After recording completes, uploads follow a secure, verifiable sequence:

Step   Endpoint                       Action
1      POST /puppet/request_upload    Request upload, log to Cloud SQL
2      -                              Flask API generates signed URL
3      PUT to GCS                     Upload recording with signed URL
4      POST /puppet/confirm_upload    Confirm completion in Cloud SQL
5      Pub/Sub                        Trigger post-processing pipeline
Upload Flow Details
  1. Request Upload: Puppet initiates via POST /puppet/request_upload to Flask API, which logs “Upload Requested” in Cloud SQL

  2. Signed URL Generation: Flask API generates a signed URL with time-limited write access to the GCS bucket

  3. Data Upload: Puppet uploads directly to GCS using PUT with the signed URL

  4. Confirm Upload: Puppet confirms via POST /puppet/confirm_upload, updating Cloud SQL to “Upload Confirmed”

  5. Trigger Processing: Completion publishes to Pub/Sub topics (RawDataForCalibration or RawData) for downstream pipelines

This mechanism ensures data integrity through explicit confirmation steps and enables tracking through Cloud SQL status entries.
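From the puppet's side, the handshake can be sketched with plain HTTP calls. The endpoint paths match the table above, while the base URL, request fields, and the response field holding the signed URL are assumptions.

import requests

API = "https://<flask-api-host>"            # placeholder Flask API base URL
recording = "/data/episode_0001.tar.gz"     # placeholder local recording path

# 1. Request an upload slot; the API logs "Upload Requested" in Cloud SQL.
resp = requests.post(f"{API}/puppet/request_upload",
                     json={"puppet": "left", "file": recording})
signed_url = resp.json()["signed_url"]      # 2. time-limited write URL for the GCS bucket

# 3. Upload the recording directly to GCS via the signed URL.
with open(recording, "rb") as f:
    requests.put(signed_url, data=f,
                 headers={"Content-Type": "application/octet-stream"})

# 4. Confirm completion; the API updates Cloud SQL to "Upload Confirmed" and
#    publishes to Pub/Sub (RawDataForCalibration or RawData) for post-processing.
requests.post(f"{API}/puppet/confirm_upload",
              json={"puppet": "left", "file": recording})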


Reliability and Synchronization

Collecting synchronized multi-view data across distributed devices requires robust handling of network failures.

Reconnection Mechanism

TCP connections can become stale without detection, causing silent disconnections. The system employs two mechanisms:

Mechanism         Function
TCP Keepalive     Periodic health checks; detects broken connections even when idle
Socket Timeouts   5-second max wait; prevents indefinite hangs

Failure Behavior: When connections fail, puppets continue recording locally while tracking disconnection events. LED shows CONNECTING_TO_BROKER (pulsing white) during recovery.

TCP Keepalive Details
  • Activates periodic connection health checks
  • Automatically sends small test messages when connection is idle
  • Detects broken connections even when no data is being sent
  • Configured when each puppet first connects to the broker
Socket Timeout Details
  • Sets maximum waiting time of 5 seconds for all network operations
  • If sending/receiving data exceeds threshold, connection marked as failed
  • Prevents system from hanging indefinitely during communication attempts
  • Configured at initial connection establishment
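Both safeguards boil down to a handful of standard socket options. The sketch below shows one way to apply them on Linux; the specific keepalive intervals are illustrative, and with paho-mqtt the underlying socket could be obtained via client.socket() once connected.

import socket

def harden_socket(sock: socket.socket) -> None:
    # 5-second maximum wait on any blocking operation, so calls never hang indefinitely.
    sock.settimeout(5.0)
    # Enable TCP keepalive probes so idle-but-dead connections get detected.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)   # idle seconds before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before the connection drops

# With paho-mqtt this would run once, right after client.connect(...):
#   harden_socket(client.socket())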

Episode Synchronization

When a puppet disconnects during recording, it misses episode transitions, causing numbering to become unsynchronized. The reconnection mechanism operates in three phases:

Phase                Action
1. Disconnect        Store episode number in /tmp/episode_at_disconnect, log to episode_timestamps.txt
2. Reconnection      Obtain current system-wide episode from broker, compare with saved value
3. Synchronization   If mismatch: broadcast sync_episode command, align all counters
Phase Details

1. Disconnect Phase

  • Current episode number stored in /tmp/episode_at_disconnect
  • Disconnection timestamp recorded in episode_timestamps.txt
  • Format: # DISCONNECTED FROM BROKER at: [timestamp]
  • Recording continues in current episode despite disconnection

2. Reconnection Phase

  • Current system-wide episode number obtained from broker
  • Reconnection timestamp recorded in episode_timestamps.txt
  • Format: # RECONNECTED TO BROKER at: [timestamp]
  • Saved episode number compared with current system episode

3. Synchronization Phase

  • If episode numbers match: continue recording in current episode
  • If mismatch detected:
    • System broadcasts a sync_episode command
    • All puppets process this as standard episode transition
    • New episode timestamp added to episode_timestamps.txt
    • Episode counters aligned across all puppets
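Putting the three phases together, a simplified reconnection handler might look like the sketch below. The file paths come from the description above, while the broadcast_sync callable (e.g., an MQTT publish of sync_episode) and the exact log handling are assumptions.

import time
from pathlib import Path

SAVED = Path("/tmp/episode_at_disconnect")   # episode number saved at disconnect
LOG = Path("episode_timestamps.txt")         # per-puppet episode log

def log_line(text: str) -> None:
    with LOG.open("a") as f:
        f.write(text + "\n")

def on_reconnect(current_system_episode: int, broadcast_sync) -> int:
    """Realign this puppet's episode counter after reconnecting to the broker."""
    now = time.time()
    log_line(f"# RECONNECTED TO BROKER at: {now:.9f}")
    saved = int(SAVED.read_text()) if SAVED.exists() else current_system_episode
    if saved == current_system_episode:
        return saved                          # still in sync: keep recording as-is
    # Mismatch: broadcast sync_episode; every puppet treats it as a standard
    # episode transition, realigning counters (intermediate episodes are skipped).
    broadcast_sync(current_system_episode)
    log_line(f"Episode {current_system_episode} started at: {now:.9f}")
    return current_system_episode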
Example: Synchronized Timestamps

When properly synchronized, all puppets maintain identical episode logs:

ego/episode_timestamps.txt:
Episode 1 started at: 1740348759.123456789
Episode 2 started at: 1740348852.987654321
Episode 3 started at: 1740348912.456789123
Episode 4 started at: 1740348975.321654987
Episode 5 started at: 1740349032.789456123
Episode 6 started at: 1740349102.654789321
Example: Disconnection Scenarios

The left/episode_timestamps.txt below shows two disconnection events:

Episode 1 started at: 1740348759.123456789
# DISCONNECTED FROM BROKER at: 1740348800.123456789
# RECONNECTED TO BROKER at: 1740348920.654321987
Episode 3 started at: 1740348925.789123456
Episode 4 started at: 1740348975.321654987
# DISCONNECTED FROM BROKER at: 1740349000.123456789
# RECONNECTED TO BROKER at: 1740349040.654321987
Episode 5 started at: 1740349045.789123456
Episode 6 started at: 1740349102.654789321

First Disconnection (Skipping Episodes):

  1. All puppets start Episode 1 together
  2. Left puppet disconnects at timestamp 1740348800
  3. While disconnected, ego and right transition to Episodes 2 and 3
  4. Left reconnects at 1740348920, others are in Episode 3
  5. Left syncs to Episode 3, completely skipping Episode 2
  6. Timestamp shows Episode 3 starting shortly after reconnection (1740348925)

Second Disconnection (Single Episode Skip):

  1. All puppets record Episode 4 together
  2. Left disconnects at 1740349000
  3. While disconnected, ego and right transition to Episode 5
  4. Left reconnects at 1740349040, others are in Episode 5
  5. Left syncs to Episode 5 at 1740349045
  6. All puppets transition to Episode 6 together

Timing and Clock Synchronization

Precise timing is crucial for multi-device synchronization. When a new episode starts, the system records the exact timestamp to episode_timestamps.txt. Episode boundaries are determined by comparing each data point’s timestamp with recorded start times.

Sensor Type   Sample Rate   Transition Gap   Missing Samples
IMU/Hall      200Hz         ~10μs            1-2 samples
Camera        30Hz          ~20ms            0-1 frames

Clock Sync: All devices use Chrony with public NTP servers, providing sub-millisecond accuracy for multi-device coordination.

Timing Details

Episode Boundary Detection:

  • Data points with timestamps before episode start → previous episode
  • Data points with timestamps after or equal → new episode

IMU/Hall Sensors (200Hz):

  • Typical gap around episode transition: ~10 microseconds
  • Usually missing 1-2 samples at transition
  • Last sample typically ~5μs before transition
  • First new sample typically ~5μs after transition

Camera Data (30Hz):

  • Typical gap around episode transition: ~20 milliseconds
  • Usually missing 0-1 frames at transition
  • Last frame typically ~10ms before transition
  • First new frame typically ~10ms after transition

Clock Synchronization:

  • Chrony installed on each Raspberry Pi host
  • Continuously synchronizes with reliable external NTP sources
  • Provides sub-millisecond accuracy sufficient for multi-device coordination
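The boundary rule can be expressed as a small lookup over the recorded start times. The sketch below assumes episode_timestamps.txt has already been parsed into (episode, start_time) pairs and reuses timestamps from the earlier left-puppet example.

import bisect

def assign_episode(sample_ts: float, episode_starts: list[tuple[int, float]]) -> int:
    """Map a sensor timestamp to an episode number.

    Samples with timestamps before an episode's start time belong to the
    previous episode; samples at or after the start time belong to the new one.
    """
    starts = [t for _, t in episode_starts]
    idx = bisect.bisect_right(starts, sample_ts) - 1
    return episode_starts[max(idx, 0)][0]

# Using the left puppet's log from the earlier example:
episodes = [(1, 1740348759.123456789), (3, 1740348925.789123456)]
assert assign_episode(1740348800.0, episodes) == 1                  # still Episode 1
assert assign_episode(1740348925.789123456, episodes) == 3          # boundary sample -> Episode 3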

Sample Data Visualization

The dataset includes interactive 3D visualizations using Rerun, allowing exploration of recorded episodes with synchronized multi-view camera feeds, gripper poses, and sensor data.

View Sample Episodes in Rerun (10 samples)
Sample   Link
1        Rerun Viewer #1
2        Rerun Viewer #2
3        Rerun Viewer #3
4        Rerun Viewer #4
5        Rerun Viewer #5
6        Rerun Viewer #6
7        Rerun Viewer #7
8        Rerun Viewer #8
9        Rerun Viewer #9
10       Rerun Viewer #10

Dataset and Usage

The SF Fold dataset is available on Hugging Face:

Download: huggingface.co/datasets/zeroshotdata/sf_fold

The dataset focuses on folding and manipulation tasks, providing synchronized multi-view recordings suitable for:

Application          Description
Imitation Learning   Training policies from human demonstrations
Diffusion Policy     Learning manipulation behaviors through diffusion models
Multi-view Fusion    Algorithms leveraging multiple camera perspectives
Temporal Modeling    Models reasoning about action sequences over time
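To get started, the repository can be fetched with the Hugging Face Hub client as sketched below. Whether the episodes also load directly through a higher-level dataset loader depends on the on-disk format, which is not described here.

from huggingface_hub import snapshot_download

# Downloads the full dataset repo into the local Hugging Face cache.
local_dir = snapshot_download(repo_id="zeroshotdata/sf_fold", repo_type="dataset")
print("Dataset downloaded to:", local_dir)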

For more details on the original UMI approach that inspired this work, see the paper and project page.