💻 AI Training Process
How We Trained 20 Unique Fighting AIs
Creating autonomous AI fighters that deliver exciting, unpredictable battles required a sophisticated multi-stage training pipeline combining supervised learning, reinforcement learning, and evolutionary algorithms.
🎯 Training Objectives
Our goal was to create AI fighters that:
Fight intelligently - Make strategic decisions based on game state
Show personality - Each fighter has a unique combat style
Adapt dynamically - Learn opponent patterns during combat
Create entertainment - Produce exciting, varied battles
📚 Training Pipeline Overview
┌──────────────────────────────────────────────┐
│ STAGE 1: Data Collection                     │
│  → 50,000+ simulated fights                  │
│  → Expert gameplay recordings                │
│  → Combat scenario library                   │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ STAGE 2: Supervised Pre-training             │
│  → Train base combat network                 │
│  → Learn fundamental mechanics               │
│  → 10,000 epochs on labeled data             │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ STAGE 3: Reinforcement Learning              │
│  → Self-play against trained agents          │
│  → Reward shaping for combat effectiveness   │
│  → 100,000+ training iterations              │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ STAGE 4: Personality Specialization          │
│  → Evolutionary algorithms for diversity     │
│  → Fine-tune each fighter's behavior         │
│  → 20 unique combat strategies               │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ STAGE 5: Tournament Testing                  │
│  → Round-robin evaluation                    │
│  → Balance adjustments                       │
│  → Performance optimization                  │
└──────────────────────────────────────────────┘
🔬 Stage 1: Data Collection
Simulation Framework
We built a high-speed simulator capable of running 1000+ fights per hour with accelerated game logic (10x speed mode).
Data Collected:
State vectors: Position, velocity, health, orientation (every frame)
Action sequences: Button inputs (A/D/W/F) with timestamps
Outcomes: Win/loss, damage dealt, survival time
Strategic patterns: Distance management, attack timing, dodge success rate
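As a concrete illustration, each per-frame record can be captured with a small structure like the sketch below. The FrameRecord name and field layout are illustrative, not our exact logging schema:
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class FrameRecord:
    # Hypothetical per-frame log entry; field names are illustrative.
    tick: int                  # frame index
    position: List[float]      # (x, y) of the fighter
    velocity: List[float]      # (vx, vy)
    health: float              # remaining HP
    orientation: float         # facing angle in radians
    action: str                # one of 'A', 'D', 'W', 'F', or 'idle'
    timestamp_ms: int          # wall-clock time of the input

def log_frame(buffer: list, record: FrameRecord) -> None:
    # Append a plain dict so the buffer can be dumped to JSON/Parquet later.
    buffer.append(asdict(record))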
Expert Demonstrations
500+ manually played fights by skilled players
Labeled "optimal actions" for training scenarios
Edge case handling (wall collisions, simultaneous hits)
Scenario Library
Created 1000+ unique combat scenarios:
Close-range brawls
Long-range positioning
Low-health survival situations
Aggressive rushdown vs defensive play
Counter-attack opportunities
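Each scenario is essentially a small configuration object. The sketch below shows two hypothetical entries; keys and values are illustrative rather than the production schema:
# Hypothetical scenario specs; keys are illustrative.
close_range_brawl = {
    "name": "close_range_brawl",
    "spawn_distance": 1.5,        # fighters start almost in contact
    "fighter_a_health": 100,
    "fighter_b_health": 100,
    "time_limit_s": 90,
    "tags": ["close-range", "aggressive"],
}

low_health_survival = {
    "name": "low_health_survival",
    "spawn_distance": 6.0,
    "fighter_a_health": 15,       # forces defensive, comeback-oriented play
    "fighter_b_health": 100,
    "time_limit_s": 90,
    "tags": ["survival", "defensive"],
}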
Total Dataset Size: 2.3TB of fight data
🧠 Stage 2: Supervised Pre-Training
Base Neural Network Architecture
Strategic Network (Transformer):
Model: GPT-4o-mini (fine-tuned)
Input: Game state description (natural language + structured data)
Output: Strategic decision (aggressive/defensive/tactical/adaptive)
Fine-tuning Details:
- Base model: GPT-4o-mini-2024-07-18
- Training samples: 50,000 fight scenarios
- Context: "You are an expert fighting AI strategist..."
- Output format: Action + reasoning
- Training time: 72 hours on 8x A100 GPUs
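To make the input/output contract concrete, here is a minimal sketch of how a game state could be serialized into the strategist prompt, assuming the OpenAI chat-completions client. The system prompt is truncated as in the source, and the fine-tuned model ID and message layout are placeholders:
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are an expert fighting AI strategist..."  # truncated as in the source

def strategic_decision(state: dict) -> str:
    # Serialize structured game state into the natural-language format the
    # fine-tuned model was trained on (layout here is illustrative).
    user_msg = (
        f"My HP: {state['my_hp']}, opponent HP: {state['opp_hp']}\n"
        f"Distance: {state['distance']:.1f}m, opponent stance: {state['opp_stance']}\n"
        "Respond with one of: aggressive / defensive / tactical / adaptive, plus reasoning."
    )
    response = client.chat.completions.create(
        model="ft:gpt-4o-mini-2024-07-18:your-org:fighter-strategist:xxxx",  # placeholder ID
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_msg},
        ],
    )
    return response.choices[0].message.content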
Tactical Network (Feedforward):
Architecture:
Input(784) → Dense(256, ReLU) → Dropout(0.3)
→ Dense(128, ReLU) → Dropout(0.2)
→ Dense(64, ReLU)
→ Dense(5, Softmax)
Loss Function: Categorical Cross-Entropy
Optimizer: Adam (lr=0.001, β1=0.9, β2=0.999)
Batch Size: 256
Epochs: 10,000
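For reference, here is a minimal PyTorch sketch of the tactical network and its supervised update step. The source does not name the framework, so treat this as one possible realization of the spec above:
import torch
import torch.nn as nn

tactical_net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 64),  nn.ReLU(),
    nn.Linear(64, 5),                    # logits for the 5 tactical actions
)

# Adam with the hyperparameters listed above; CrossEntropyLoss applies the
# softmax internally, so the final layer emits raw logits.
optimizer = torch.optim.Adam(tactical_net.parameters(), lr=1e-3, betas=(0.9, 0.999))
criterion = nn.CrossEntropyLoss()

def train_step(states: torch.Tensor, actions: torch.Tensor) -> float:
    # One supervised update on a batch of (state vector, expert action) pairs.
    optimizer.zero_grad()
    loss = criterion(tactical_net(states), actions)
    loss.backward()
    optimizer.step()
    return loss.item()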
Training Results:
- Final accuracy: 87.3% on validation set
- Loss: 0.342
- Inference time: 8ms average
Training Process
Phase 1: Imitation Learning
Learn from expert demonstrations
Supervised learning on labeled fight data
Goal: Achieve 80%+ action prediction accuracy
Phase 2: Behavior Cloning
Clone successful fighting patterns
Train on high-win-rate combat sequences
Regularization to prevent overfitting
Results:
Base network achieved 85% win rate vs random agent
Demonstrated understanding of fundamental mechanics
Ready for reinforcement learning phase
🎮 Stage 3: Reinforcement Learning
Self-Play Training Loop
We used Proximal Policy Optimization (PPO) for stable training:
Hyperparameters:
- Learning rate: 3e-4 (cosine decay)
- Discount factor (γ): 0.99
- GAE lambda (λ): 0.95
- Clip epsilon: 0.2
- Entropy coefficient: 0.01
- Value loss coefficient: 0.5
- Max gradient norm: 0.5
- Mini-batch size: 64
- Update epochs: 4
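For context, these hyperparameters feed into the standard PPO clipped surrogate objective. The sketch below is a generic formulation of that loss, not our exact training code:
import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    # Clipped surrogate objective: limit how far the updated policy can move
    # from the policy that collected the experience.
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value function regression toward empirical returns.
    value_loss = (values - returns).pow(2).mean()

    # Entropy bonus keeps exploration alive early in training.
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()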
Reward Function Design
The reward function is critical for learning effective combat:
reward = (
    damage_dealt * 10.0        # Hitting opponent
    - damage_received * 8.0    # Taking damage
    + survival_time * 0.1      # Staying alive
    + distance_optimal * 2.0   # Good positioning
    - wall_proximity * 1.5     # Avoiding corners
    + attack_connected * 5.0   # Successful hits
    + dodge_success * 7.0      # Avoiding attacks
    + combo_bonus * 15.0       # Consecutive hits
    + knockout_bonus * 100.0   # Winning the fight
)
Training Infrastructure
Hardware:
4x NVIDIA A100 GPUs (80GB VRAM each)
128 CPU cores for parallel simulation
512GB RAM for experience buffer
Training Stats:
Total training time: 14 days
Total fights simulated: 2.8 million
Experience buffer size: 1M transitions
Policy updates: 450,000 iterations
Learning Curves:
Iteration    Win Rate    Avg Damage    Avg Survival
─────────────────────────────────────────────────────
1,000        52.3%       45.2          32.1s
10,000       71.8%       78.3          48.7s
50,000       84.2%       92.1          67.3s
100,000      89.7%       105.8         78.9s
200,000      92.3%       118.4         85.2s
450,000      94.8%       127.9         91.4s  (FINAL)
Opponent Modeling
During training, each AI learned to:
Track opponent patterns: Attack frequency, movement tendencies
Predict next actions: Anticipate attacks based on distance/stance
Exploit weaknesses: Adapt strategy mid-fight
Counter-adapt: Respond when opponent changes strategy
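One lightweight way to realize this kind of opponent modeling is an exponential moving average over observed behavior. The OpponentModel class below is an illustrative sketch, not the production tracker:
class OpponentModel:
    """Hypothetical online tracker of an opponent's tendencies."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha          # smoothing factor: higher = adapt faster
        self.attack_rate = 0.0      # estimated probability of attacking this frame
        self.avg_distance = 5.0     # estimate of preferred engagement distance

    def update(self, opponent_attacked: bool, distance: float) -> None:
        # Exponential moving averages adapt mid-fight as the opponent changes style.
        self.attack_rate += self.alpha * (float(opponent_attacked) - self.attack_rate)
        self.avg_distance += self.alpha * (distance - self.avg_distance)

    def expects_attack(self, distance: float) -> bool:
        # Simple heuristic: anticipate an attack when the opponent is aggressive
        # and we are inside its preferred range.
        return self.attack_rate > 0.4 and distance <= self.avg_distance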
🧬 Stage 4: Personality Specialization
To create 20 unique fighters, we used evolutionary algorithms to diversify behavior.
Genetic Algorithm for Diversity
Genome Encoding:
fighter_dna = {
    'aggression': 0.0-1.0,           # Attack frequency
    'risk_tolerance': 0.0-1.0,       # Willingness to trade damage
    'patience': 0.0-1.0,             # Wait for openings vs rush
    'adaptability': 0.0-1.0,         # Strategy switching speed
    'defensive_bias': 0.0-1.0,       # Blocking vs dodging preference
    'combo_focus': 0.0-1.0,          # Single hits vs combos
    'movement_style': 0.0-1.0,       # Aggressive vs evasive
    'distance_preference': 0.0-1.0,  # Close-range vs mid-range
}
Evolution Process
Initial Population: 100 random DNA variations
Fitness Evaluation: Each variant fights 50 tournaments
Selection: Top 20% by entertainment value (not just win rate!)
Crossover: Combine traits from successful fighters
Mutation: 10% random trait variation
Repeat: 20 generations
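The selection, crossover, and mutation steps above can be sketched as follows. Trait names match the fighter_dna genome, while the fitness_fn argument stands in for the tournament-based entertainment score:
import random

TRAITS = ['aggression', 'risk_tolerance', 'patience', 'adaptability',
          'defensive_bias', 'combo_focus', 'movement_style', 'distance_preference']

def random_dna():
    return {t: random.random() for t in TRAITS}

def crossover(parent_a, parent_b):
    # Uniform crossover: each trait is inherited from one parent at random.
    return {t: random.choice((parent_a[t], parent_b[t])) for t in TRAITS}

def mutate(dna, rate=0.10):
    # With 10% probability per trait, resample that trait.
    return {t: (random.random() if random.random() < rate else v)
            for t, v in dna.items()}

def evolve(fitness_fn, population_size=100, generations=20, elite_frac=0.2):
    population = [random_dna() for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness_fn, reverse=True)
        elites = ranked[:int(elite_frac * population_size)]   # top 20% survive
        children = [mutate(crossover(*random.sample(elites, 2)))
                    for _ in range(population_size - len(elites))]
        population = elites + children
    return sorted(population, key=fitness_fn, reverse=True)[:20]  # final roster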
Fitness Function (maximizes entertainment):
fitness = (
    win_rate * 0.3                  # Still needs to be competitive
    + fight_duration_variety * 0.2  # Varied fight lengths
    + action_diversity * 0.3        # Uses all moves
    + comeback_potential * 0.2      # Can win from behind
)
Final Fighter Roster
After evolution, we selected 20 fighters with distinct personalities:
Aggressive Types:
Morpheus (Aggression: 0.85) - Relentless pressure, high combo focus
Saint (Aggression: 0.78) - Calculated aggression, punish mistakes
Defensive Types:
GhostHash (Defensive: 0.82) - Elusive movement, counter-attack
TheMiner (Defensive: 0.75) - Patient, wall positioning
Balanced Types:
NeoNode (Balanced) - Adaptive, reads opponent
CipherKid (Balanced) - Technical, frame-perfect execution
Specialist Types:
BitSamurai (Combo: 0.91) - Chain attacks, high damage
DarkWallet (Risk: 0.88) - High-risk/high-reward plays
🔧 Stage 5: Fine-Tuning & Optimization
Balance Adjustments
We ran extensive tournament testing to ensure:
No dominant strategy: All 20 fighters have 45-55% overall win rate
Rock-paper-scissors dynamics: Counter-matchups exist
Skill expression: Better AI wins more consistently
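A sketch of the kind of round-robin harness behind these checks; the simulate_fight argument is a placeholder for the real fight simulator:
from itertools import combinations
from collections import defaultdict

def round_robin(fighters, simulate_fight, fights_per_pair=100):
    """Estimate each fighter's overall win rate from pairwise matches.

    simulate_fight(a, b) is assumed to return the winner's name; it stands in
    for the actual simulator.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for a, b in combinations(fighters, 2):
        for _ in range(fights_per_pair):
            winner = simulate_fight(a, b)
            wins[winner] += 1
            games[a] += 1
            games[b] += 1
    return {f: wins[f] / games[f] for f in fighters}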
Performance Optimization
Model Compression:
Pruned tactical network: 784-256-128-64-5 → 512-128-32-5
Quantization: FP32 → FP16 (50% size reduction)
Knowledge distillation: Compress strategic model
Result: 3x faster inference, 95% accuracy retained
Inference Optimization:
TensorRT compilation for GPU inference
ONNX export for cross-platform compatibility
Batch processing for multi-fight simulations
Result: 8ms decision latency (down from 45ms)
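The FP16 and ONNX steps can be sketched with standard PyTorch tooling (TensorRT compilation is omitted here). The network is recreated so the snippet is self-contained; in practice the trained weights would be loaded first:
import torch
import torch.nn as nn

tactical_net = nn.Sequential(          # same 784-256-128-64-5 stack as Stage 2
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 64),  nn.ReLU(),
    nn.Linear(64, 5),
)
# tactical_net.load_state_dict(...)  # load trained weights here

# FP16 conversion halves model size; run on GPU, where half precision is supported.
if torch.cuda.is_available():
    tactical_net_fp16 = tactical_net.to("cuda").half().eval()

# ONNX export for cross-platform inference (done in FP32 here for portability;
# the batch dimension is left dynamic).
tactical_net_fp32 = tactical_net.to("cpu").float().eval()
dummy_state = torch.zeros(1, 784)
torch.onnx.export(
    tactical_net_fp32,
    dummy_state,
    "tactical_net.onnx",
    input_names=["state"],
    output_names=["action_logits"],
    dynamic_axes={"state": {0: "batch"}, "action_logits": {0: "batch"}},
)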
📊 Training Results & Validation
Performance Metrics
Against Random AI:
Win rate: 98.7%
Average fight duration: 12.3s
Damage efficiency: 11.2 hits to kill
Against Each Other (Round-Robin):
Most balanced fighter: Symbol (49.8% win rate)
Most aggressive: Morpheus (52.1% win rate)
Most defensive: GhostHash (47.9% win rate)
Most entertaining: BitSamurai (highest action variety)
Validation Tests
Robustness:
✅ Handles edge cases (corner traps, simultaneous hits)
✅ Recovers from disadvantage (low HP comebacks)
✅ Adapts to opponent changes mid-fight
Entertainment Value:
✅ Average fight duration: 67 seconds (target: 45-90s)
✅ Action variety score: 8.7/10
✅ Comeback rate: 23% of fights decided in final 20%
🚀 Future Improvements
Planned Enhancements
Meta-Learning: AI learns from past tournament results
Human Feedback: Incorporate viewer preferences
Seasonal Updates: Retrain with new strategies
Community Champions: User-submitted AI variants
Research Directions
Multi-Agent Learning: Train fighters against full roster simultaneously
Curriculum Learning: Progressive difficulty in training scenarios
Hierarchical RL: More sophisticated strategy layers
Transfer Learning: Apply to other game genres