Distributed Architecture

YuLan-OneSim supports large-scale simulations using Master-Worker architecture with gRPC communication.

Master-Worker Pattern

Master Node

Agent Registry: Maintains mapping of agents to workers
Global State Management: Coordinates simulation state across workers
Load Balancing: Distributes agents optimally across workers
Event Routing: Routes events between agents on different workers
Health Monitoring: Tracks worker status and handles failures

Worker Node

Agent Hosting: Executes agent logic and decision-making
Local Processing: Handles agent interactions within the worker
Resource Management: Manages compute and memory resources
Heartbeat Mechanism: Reports status to master node
Event Processing: Processes events for hosted agents

Node Configuration

Initialize Master Node

from onesim.distribution.node import initialize_node

# Master node configuration
master_config = {
    "listen_port": 10051,
    "expected_workers": 3,
    "health_check_interval": 1800,  # seconds
    "worker_timeout": 3600
}

master = await initialize_node("master_001", "master", master_config)

Initialize Worker Node

# Worker node configuration
worker_config = {
    "master_address": "192.168.1.100",
    "master_port": 10051,
    "listen_port": 10052,
    "worker_address": "192.168.1.101",  # Optional: explicit worker address
    "heartbeat_interval": 300,
    "heartbeat_max_retries": 5
}

worker = await initialize_node("worker_001", "worker", worker_config)

Intelligent Agent Allocation

Community-Based Allocation

The system uses advanced algorithms to optimize agent placement:

Relationship Analysis: Groups agents with strong interactions
Type Diversity: Ensures different agent types can interact efficiently
Load Balancing: Distributes computational load evenly
Topology Awareness: Considers network topology for optimization

Allocation Strategies

Community Detection: Groups highly connected agents
Cross-Type Interaction: Prioritizes different agent types working together
Load Constraints: Respects maximum load per worker

Connection Management & Communication

Sharded Connection Pool

Connection Sharding: Reduces lock contention with hash-based sharding
Connection Reuse: Maintains persistent gRPC connections
Automatic Cleanup: Removes idle connections after timeout
Circuit Breaker: Protects against cascading failures

Fault Tolerance Features

Circuit Breaker Pattern: Prevents requests to failed services
Automatic Reconnection: Recovers from network failures
Heartbeat Monitoring: Detects and handles worker failures
Graceful Degradation: Continues operation with reduced capacity

# Circuit breaker configuration
circuit_config = {
    "failure_threshold": 5,      # Failures before circuit opens
    "recovery_timeout": 30.0,    # Time before retry attempt
    "half_open_success": 2       # Successes needed to close circuit
}

Distributed Synchronization

Distributed Locks

Ensures data consistency across the distributed system:

from onesim.distribution.distributed_lock import get_lock

# Acquire distributed lock
async with await get_lock("shared_resource", timeout=30.0) as lock:
    # Critical section - guaranteed exclusive access
    await update_shared_data(key, value)

Distributed Architecture

Master-Worker Pattern

Master Node

Worker Node

Node Configuration

Initialize Master Node

Initialize Worker Node

Intelligent Agent Allocation

Community-Based Allocation

Allocation Strategies

Connection Management & Communication

Sharded Connection Pool

Fault Tolerance Features

Distributed Synchronization

Distributed Locks

Lock Features

Performance Optimizations

Connection Optimization

Master-Worker Pattern​

Master Node​

Worker Node​

Node Configuration​

Initialize Master Node​

Initialize Worker Node​

Intelligent Agent Allocation​

Community-Based Allocation​

Allocation Strategies​

Connection Management & Communication​

Sharded Connection Pool​

Fault Tolerance Features​

Distributed Synchronization​

Distributed Locks​

Lock Features​

Performance Optimizations​

Connection Optimization​

Master-Worker Pattern

Master Node

Worker Node

Node Configuration

Initialize Master Node

Initialize Worker Node

Intelligent Agent Allocation

Community-Based Allocation

Allocation Strategies

Connection Management & Communication

Sharded Connection Pool

Fault Tolerance Features

Distributed Synchronization

Distributed Locks

Lock Features

Performance Optimizations

Connection Optimization