💻 AI Training Process

How We Trained 20 Unique Fighting AIs

Creating autonomous AI fighters that deliver exciting, unpredictable battles required a sophisticated multi-stage training pipeline combining supervised learning, reinforcement learning, and evolutionary algorithms.


🎯 Training Objectives

Our goal was to create AI fighters that:

  1. Fight intelligently - Make strategic decisions based on game state

  2. Show personality - Each fighter has a unique combat style

  3. Adapt dynamically - Learn opponent patterns during combat

  4. Create entertainment - Produce exciting, varied battles


📚 Training Pipeline Overview

┌──────────────────────────────────────────────────────────────┐
│                    STAGE 1: Data Collection                  │
│  → 50,000+ simulated fights                                  │
│  → Expert gameplay recordings                                │
│  → Combat scenario library                                   │
└──────────────────────────┬───────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                STAGE 2: Supervised Pre-training              │
│  → Train base combat network                                 │
│  → Learn fundamental mechanics                               │
│  → 10,000 epochs on labeled data                             │
└──────────────────────────┬───────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│              STAGE 3: Reinforcement Learning                 │
│  → Self-play against trained agents                          │
│  → Reward shaping for combat effectiveness                   │
│  → 100,000+ training iterations                              │
└──────────────────────────┬───────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│            STAGE 4: Personality Specialization               │
│  → Evolutionary algorithms for diversity                     │
│  → Fine-tune each fighter's behavior                         │
│  → 20 unique combat strategies                               │
└──────────────────────────┬───────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                STAGE 5: Tournament Testing                   │
│  → Round-robin evaluation                                    │
│  → Balance adjustments                                       │
│  → Performance optimization                                  │
└──────────────────────────────────────────────────────────────┘

🔬 Stage 1: Data Collection

Simulation Framework

We built a high-speed simulator capable of running 1,000+ fights per hour with accelerated game logic (10x speed mode).

Data Collected:

  • State vectors: Position, velocity, health, orientation (every frame)

  • Action sequences: Button inputs (A/D/W/F) with timestamps

  • Outcomes: Win/loss, damage dealt, survival time

  • Strategic patterns: Distance management, attack timing, dodge success rate
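
For a concrete picture, a single logged frame and fight could be represented as sketched below; the field names and schema are illustrative, not the exact storage format of our dataset.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FrameRecord:
    """One logged simulation frame (hypothetical schema)."""
    tick: int                      # frame index in the accelerated simulation
    position: Tuple[float, float]  # (x, y) of this fighter
    velocity: Tuple[float, float]  # (vx, vy)
    health: float                  # remaining HP
    orientation: float             # facing angle in radians
    action: str                    # "A", "D", "W", "F", or "idle"
    timestamp_ms: int              # input timestamp

@dataclass
class FightLog:
    """Complete record of one simulated fight."""
    frames: List[FrameRecord] = field(default_factory=list)
    winner: str = ""               # fighter id, empty for a draw
    damage_dealt: float = 0.0
    survival_time_s: float = 0.0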

Expert Demonstrations

  • 500+ manually played fights by skilled players

  • Labeled "optimal actions" for training scenarios

  • Edge case handling (wall collisions, simultaneous hits)

Scenario Library

Created 1,000+ unique combat scenarios:

  • Close-range brawls

  • Long-range positioning

  • Low-health survival situations

  • Aggressive rushdown vs defensive play

  • Counter-attack opportunities

Total Dataset Size: 2.3TB of fight data


🧠 Stage 2: Supervised Pre-Training

Base Neural Network Architecture

Strategic Network (Transformer):

Model: GPT-4o-mini (fine-tuned)
Input: Game state description (natural language + structured data)
Output: Strategic decision (aggressive/defensive/tactical/adaptive)

Fine-tuning Details:
- Base model: GPT-4o-mini-2024-07-18
- Training samples: 50,000 fight scenarios
- Context: "You are an expert fighting AI strategist..."
- Output format: Action + reasoning
- Training time: 72 hours on 8x A100 GPUs
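
For illustration, a single fine-tuning sample in chat format might look roughly like the record below; the prompt wording and state fields are a sketch, not the actual training data.

# Hypothetical fine-tuning record (chat format); exact prompts differ.
sample = {
    "messages": [
        {"role": "system",
         "content": "You are an expert fighting AI strategist..."},
        {"role": "user",
         "content": "State: own_hp=62, enemy_hp=35, distance=180, "
                    "enemy_last_action=attack, wall_behind=False"},
        {"role": "assistant",
         "content": "Decision: aggressive. Reasoning: health lead and the "
                    "opponent just committed to an attack, so close the "
                    "distance and punish before they recover."},
    ]
}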

Tactical Network (Feedforward):

Architecture:
  Input(784) → Dense(256, ReLU) → Dropout(0.3)
           → Dense(128, ReLU) → Dropout(0.2)
           → Dense(64, ReLU)
           → Dense(5, Softmax)

Loss Function: Categorical Cross-Entropy
Optimizer: Adam (lr=0.001, β1=0.9, β2=0.999)
Batch Size: 256
Epochs: 10,000

Training Results:
- Final accuracy: 87.3% on validation set
- Loss: 0.342
- Inference time: 8ms average
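
A rough PyTorch equivalent of the tactical network above, written as a sketch for clarity (the production implementation may differ in framework and details):

import torch
import torch.nn as nn

class TacticalNet(nn.Module):
    """784-d state vector in, 5-way action distribution out."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 64),  nn.ReLU(),
            nn.Linear(64, 5),    # logits; softmax is folded into the loss
        )

    def forward(self, x):
        return self.layers(x)

model = TacticalNet()
criterion = nn.CrossEntropyLoss()  # categorical cross-entropy over logits
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))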

Training Process

Phase 1: Imitation Learning

  • Learn from expert demonstrations

  • Supervised learning on labeled fight data

  • Goal: Achieve 80%+ action prediction accuracy

Phase 2: Behavior Cloning

  • Clone successful fighting patterns

  • Train on high-win-rate combat sequences

  • Regularization to prevent overfitting

Results:

  • Base network achieved 85% win rate vs random agent

  • Demonstrated understanding of fundamental mechanics

  • Ready for reinforcement learning phase


🎮 Stage 3: Reinforcement Learning

Self-Play Training Loop

We used Proximal Policy Optimization (PPO) for stable training:

Hyperparameters:
- Learning rate: 3e-4 (cosine decay)
- Discount factor (γ): 0.99
- GAE lambda (λ): 0.95
- Clip epsilon: 0.2
- Entropy coefficient: 0.01
- Value loss coefficient: 0.5
- Max gradient norm: 0.5
- Mini-batch size: 64
- Update epochs: 4
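
For reference, here is how those values map onto a common off-the-shelf PPO implementation (Stable-Baselines3). This is a sketch rather than our actual training code, and FightEnv stands in for a Gym-style wrapper around the fight simulator.

import math
from stable_baselines3 import PPO

env = FightEnv()  # hypothetical Gym-style wrapper around the simulator

# Cosine decay: progress_remaining goes 1.0 -> 0.0 over training.
def cosine_lr(progress_remaining: float) -> float:
    return 3e-4 * 0.5 * (1 + math.cos(math.pi * (1 - progress_remaining)))

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=cosine_lr,
    gamma=0.99,        # discount factor
    gae_lambda=0.95,   # GAE lambda
    clip_range=0.2,    # clip epsilon
    ent_coef=0.01,     # entropy coefficient
    vf_coef=0.5,       # value loss coefficient
    max_grad_norm=0.5,
    batch_size=64,     # mini-batch size
    n_epochs=4,        # update epochs per rollout
)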

Reward Function Design

The reward function is critical for learning effective combat:

reward = (
    damage_dealt * 10.0           # Hitting opponent
    - damage_received * 8.0        # Taking damage
    + survival_time * 0.1          # Staying alive
    + distance_optimal * 2.0       # Good positioning
    - wall_proximity * 1.5         # Avoiding corners
    + attack_connected * 5.0       # Successful hits
    + dodge_success * 7.0          # Avoiding attacks
    + combo_bonus * 15.0           # Consecutive hits
    + knockout_bonus * 100.0       # Winning the fight
)

Training Infrastructure

Hardware:

  • 4x NVIDIA A100 GPUs (80GB VRAM each)

  • 128 CPU cores for parallel simulation

  • 512GB RAM for experience buffer

Training Stats:

  • Total training time: 14 days

  • Total fights simulated: 2.8 million

  • Experience buffer size: 1M transitions

  • Policy updates: 450,000 iterations

Learning Curves:

Iteration      Win Rate    Avg Damage    Avg Survival
─────────────────────────────────────────────────────
1,000         52.3%        45.2          32.1s
10,000        71.8%        78.3          48.7s
50,000        84.2%        92.1          67.3s
100,000       89.7%        105.8         78.9s
200,000       92.3%        118.4         85.2s
450,000       94.8%        127.9         91.4s (FINAL)

Opponent Modeling

During training, each AI learned to:

  1. Track opponent patterns: Attack frequency, movement tendencies

  2. Predict next actions: Anticipate attacks based on distance/stance

  3. Exploit weaknesses: Adapt strategy mid-fight

  4. Counter-adapt: Respond when opponent changes strategy
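
A minimal sketch of what this per-fight tracking could look like (class and method names are illustrative, and "F" is assumed to be the attack input):

from collections import Counter, deque

class OpponentModel:
    """Rolling estimate of the opponent's recent tendencies."""
    def __init__(self, window: int = 120):
        self.recent = deque(maxlen=window)   # last `window` observed actions

    def observe(self, action: str) -> None:
        self.recent.append(action)

    def attack_frequency(self) -> float:
        """Fraction of recent frames spent attacking."""
        if not self.recent:
            return 0.0
        return sum(a == "F" for a in self.recent) / len(self.recent)

    def likely_next_action(self) -> str:
        """Crude prediction: the most common recent action."""
        if not self.recent:
            return "idle"
        return Counter(self.recent).most_common(1)[0][0]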


🧬 Stage 4: Personality Specialization

To create 20 unique fighters, we used evolutionary algorithms to diversify behavior.

Genetic Algorithm for Diversity

Genome Encoding:

fighter_dna = {
    'aggression': 0.0-1.0,           # Attack frequency
    'risk_tolerance': 0.0-1.0,       # Willingness to trade damage
    'patience': 0.0-1.0,             # Wait for openings vs rush
    'adaptability': 0.0-1.0,         # Strategy switching speed
    'defensive_bias': 0.0-1.0,       # Blocking vs dodging preference
    'combo_focus': 0.0-1.0,          # Single hits vs combos
    'movement_style': 0.0-1.0,       # Aggressive vs evasive
    'distance_preference': 0.0-1.0,  # Close-range vs mid-range
}

Evolution Process

  1. Initial Population: 100 random DNA variations

  2. Fitness Evaluation: Each variant fights 50 tournaments

  3. Selection: Top 20% by entertainment value (not just win rate!)

  4. Crossover: Combine traits from successful fighters

  5. Mutation: 10% random trait variation

  6. Repeat: 20 generations

Fitness Function (maximizes entertainment):

fitness = (
    win_rate * 0.3                  # Still needs to be competitive
    + fight_duration_variety * 0.2  # Varied fight lengths
    + action_diversity * 0.3        # Uses all moves
    + comeback_potential * 0.2      # Can win from behind
)
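
A compact sketch of steps 1, 4, and 5 over the genome above (function names and the mutation scale are our illustrative choices; we read the 10% mutation rate as a per-trait probability):

import random

TRAITS = ['aggression', 'risk_tolerance', 'patience', 'adaptability',
          'defensive_bias', 'combo_focus', 'movement_style',
          'distance_preference']

def random_dna() -> dict:
    """Step 1: one member of the initial random population."""
    return {t: random.random() for t in TRAITS}

def crossover(parent_a: dict, parent_b: dict) -> dict:
    """Step 4: each trait is inherited from a randomly chosen parent."""
    return {t: random.choice((parent_a[t], parent_b[t])) for t in TRAITS}

def mutate(dna: dict, rate: float = 0.10, scale: float = 0.15) -> dict:
    """Step 5: with 10% probability per trait, nudge it and clamp to [0, 1]."""
    return {t: min(1.0, max(0.0, v + random.uniform(-scale, scale)))
            if random.random() < rate else v
            for t, v in dna.items()}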

Final Fighter Roster

After evolution, we selected 20 fighters with distinct personalities:

Aggressive Types:

  • Morpheus (Aggression: 0.85) - Relentless pressure, high combo focus

  • Saint (Aggression: 0.78) - Calculated aggression, punishes mistakes

Defensive Types:

  • GhostHash (Defensive: 0.82) - Elusive movement, counter-attack

  • TheMiner (Defensive: 0.75) - Patient, wall positioning

Balanced Types:

  • NeoNode (Balanced) - Adaptive, reads opponent

  • CipherKid (Balanced) - Technical, frame-perfect execution

Specialist Types:

  • BitSamurai (Combo: 0.91) - Chain attacks, high damage

  • DarkWallet (Risk: 0.88) - High-risk/high-reward plays


🔧 Stage 5: Tournament Testing & Optimization

Balance Adjustments

We ran extensive tournament testing to ensure:

  1. No dominant strategy: All 20 fighters have a 45-55% overall win rate

  2. Rock-paper-scissors dynamics: Counter-matchups exist

  3. Skill expression: Better AI wins more consistently
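
For illustration, the round-robin balance check can be expressed as below, where simulate_fight is a hypothetical helper that returns the winner's name:

import itertools

def round_robin_win_rates(fighters, simulate_fight, fights_per_pair=100):
    """Overall win rate per fighter from an all-pairs round robin."""
    wins = {f: 0 for f in fighters}
    games = {f: 0 for f in fighters}
    for a, b in itertools.combinations(fighters, 2):
        for _ in range(fights_per_pair):
            winner = simulate_fight(a, b)
            wins[winner] += 1
            games[a] += 1
            games[b] += 1
    return {f: wins[f] / games[f] for f in fighters}

def balance_outliers(rates: dict) -> dict:
    """Flag any fighter outside the target 45-55% band for rebalancing."""
    return {f: r for f, r in rates.items() if not 0.45 <= r <= 0.55}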

Performance Optimization

Model Compression:

  • Pruned tactical network: 784-256-128-64-5 → 512-128-32-5

  • Quantization: FP32 → FP16 (50% size reduction)

  • Knowledge distillation: Compress strategic model

  • Result: 3x faster inference, 95% accuracy retained

Inference Optimization:

  • TensorRT compilation for GPU inference

  • ONNX export for cross-platform compatibility

  • Batch processing for multi-fight simulations

  • Result: 8ms decision latency (down from 45ms)
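
A hedged sketch of that export path, reusing the TacticalNet sketch from Stage 2 (the file name and opset version here are illustrative):

import torch

model = TacticalNet().eval()
dummy_state = torch.randn(1, 784)

# ONNX export for cross-platform inference; a TensorRT engine can then be
# built from the exported .onnx file for GPU deployment.
torch.onnx.export(
    model, dummy_state, "tactical_net.onnx",
    input_names=["state"], output_names=["action_logits"],
    opset_version=17,
)

# FP32 -> FP16 weights roughly halve model size on the PyTorch runtime path.
model_fp16 = model.half()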


📊 Training Results & Validation

Performance Metrics

Against Random AI:

  • Win rate: 98.7%

  • Average fight duration: 12.3s

  • Damage efficiency: 11.2 hits per knockout on average

Against Each Other (Round-Robin):

  • Most balanced fighter: Symbol (49.8% win rate)

  • Most aggressive: Morpheus (52.1% win rate)

  • Most defensive: GhostHash (47.9% win rate)

  • Most entertaining: BitSamurai (highest action variety)

Validation Tests

Robustness:

  • ✅ Handles edge cases (corner traps, simultaneous hits)

  • ✅ Recovers from disadvantage (low HP comebacks)

  • ✅ Adapts to opponent changes mid-fight

Entertainment Value:

  • ✅ Average fight duration: 67 seconds (target: 45-90s)

  • ✅ Action variety score: 8.7/10

  • ✅ Comeback rate: 23% of fights decided in the final 20% of the match


🚀 Future Improvements

Planned Enhancements

  1. Meta-Learning: AI learns from past tournament results

  2. Human Feedback: Incorporate viewer preferences

  3. Seasonal Updates: Retrain with new strategies

  4. Community Champions: User-submitted AI variants

Research Directions

  • Multi-Agent Learning: Train fighters against full roster simultaneously

  • Curriculum Learning: Progressive difficulty in training scenarios

  • Hierarchical RL: More sophisticated strategy layers

  • Transfer Learning: Apply to other game genres
