Skip to main content

Distributed Architecture

YuLan-OneSim supports large-scale simulations using Master-Worker architecture with gRPC communication.

Master-Worker Pattern

Master Node

  • Agent Registry: Maintains mapping of agents to workers
  • Global State Management: Coordinates simulation state across workers
  • Load Balancing: Distributes agents optimally across workers
  • Event Routing: Routes events between agents on different workers
  • Health Monitoring: Tracks worker status and handles failures

Worker Node

  • Agent Hosting: Executes agent logic and decision-making
  • Local Processing: Handles agent interactions within the worker
  • Resource Management: Manages compute and memory resources
  • Heartbeat Mechanism: Reports status to master node
  • Event Processing: Processes events for hosted agents

Node Configuration

Initialize Master Node

from onesim.distribution.node import initialize_node

# Master node configuration
master_config = {
"listen_port": 10051,
"expected_workers": 3,
"health_check_interval": 1800, # seconds
"worker_timeout": 3600
}

master = await initialize_node("master_001", "master", master_config)

Initialize Worker Node

# Worker node configuration
worker_config = {
"master_address": "192.168.1.100",
"master_port": 10051,
"listen_port": 10052,
"worker_address": "192.168.1.101", # Optional: explicit worker address
"heartbeat_interval": 300,
"heartbeat_max_retries": 5
}

worker = await initialize_node("worker_001", "worker", worker_config)

Intelligent Agent Allocation

Community-Based Allocation

The system uses advanced algorithms to optimize agent placement:

  • Relationship Analysis: Groups agents with strong interactions
  • Type Diversity: Ensures different agent types can interact efficiently
  • Load Balancing: Distributes computational load evenly
  • Topology Awareness: Considers network topology for optimization

Allocation Strategies

  • Community Detection: Groups highly connected agents
  • Cross-Type Interaction: Prioritizes different agent types working together
  • Load Constraints: Respects maximum load per worker

Connection Management & Communication

Sharded Connection Pool

  • Connection Sharding: Reduces lock contention with hash-based sharding
  • Connection Reuse: Maintains persistent gRPC connections
  • Automatic Cleanup: Removes idle connections after timeout
  • Circuit Breaker: Protects against cascading failures

Fault Tolerance Features

  • Circuit Breaker Pattern: Prevents requests to failed services
  • Automatic Reconnection: Recovers from network failures
  • Heartbeat Monitoring: Detects and handles worker failures
  • Graceful Degradation: Continues operation with reduced capacity
# Circuit breaker configuration
circuit_config = {
"failure_threshold": 5, # Failures before circuit opens
"recovery_timeout": 30.0, # Time before retry attempt
"half_open_success": 2 # Successes needed to close circuit
}

Distributed Synchronization

Distributed Locks

Ensures data consistency across the distributed system:

from onesim.distribution.distributed_lock import get_lock

# Acquire distributed lock
async with await get_lock("shared_resource", timeout=30.0) as lock:
# Critical section - guaranteed exclusive access
await update_shared_data(key, value)

Lock Features

  • Automatic Timeout: Prevents deadlocks with configurable timeouts
  • Node Failure Recovery: Releases locks when nodes become unavailable
  • Mode-Aware: Uses local locks in single-node mode for efficiency

Performance Optimizations

Connection Optimization

  • Pre-connection: Establishes connections before simulation starts
  • Connection Pooling: Reuses connections to minimize overhead
  • Batch Processing: Groups multiple operations for efficiency
  • Asynchronous I/O: Non-blocking communication patterns