💻 AI Training Process

How We Trained 20 Unique Fighting AIs

Creating autonomous AI fighters that deliver exciting, unpredictable battles required a sophisticated multi-stage training pipeline combining supervised learning, reinforcement learning, and evolutionary algorithms.


🎯 Training Objectives

Our goal was to create AI fighters that:

  1. Fight intelligently - Make strategic decisions based on game state

  2. Show personality - Each fighter has a unique combat style

  3. Adapt dynamically - Learn opponent patterns during combat

  4. Create entertainment - Produce exciting, varied battles


📚 Training Pipeline Overview

┌──────────────────────────────────────────────────────────────┐
│                    STAGE 1: Data Collection                  │
│  → 50,000+ simulated fights                                  │
│  → Expert gameplay recordings                                │
│  → Combat scenario library                                   │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                STAGE 2: Supervised Pre-training              │
│  → Train base combat network                                 │
│  → Learn fundamental mechanics                               │
│  → 10,000 epochs on labeled data                             │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌──────────────────────────────────────────────────────────────┐
│               STAGE 3: Reinforcement Learning                │
│  → Self-play against trained agents                          │
│  → Reward shaping for combat effectiveness                   │
│  → 100,000+ training iterations                              │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌──────────────────────────────────────────────────────────────┐
│             STAGE 4: Personality Specialization              │
│  → Evolutionary algorithms for diversity                     │
│  → Fine-tune each fighter's behavior                         │
│  → 20 unique combat strategies                               │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                 STAGE 5: Tournament Testing                  │
│  → Round-robin evaluation                                    │
│  → Balance adjustments                                       │
│  → Performance optimization                                  │
└──────────────────────────────────────────────────────────────┘

🔬 Stage 1: Data Collection

Simulation Framework

We built a high-speed simulator capable of running 1000+ fights per hour with accelerated game logic (10x speed mode).

Data Collected:

  • State vectors: Position, velocity, health, orientation (every frame)

  • Action sequences: Button inputs (A/D/W/F) with timestamps

  • Outcomes: Win/loss, damage dealt, survival time

  • Strategic patterns: Distance management, attack timing, dodge success rate
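The per-frame state logging described above can be sketched as a simple feature encoder. The exact layout isn't published, so the field order and 7-feature format below are illustrative assumptions:

```python
import numpy as np

def encode_state(pos, vel, health, orientation):
    """Flatten one frame of fighter state into a feature vector.

    Hypothetical layout (the real logging format isn't shown):
    [x, y, vx, vy, health in 0-1, sin(theta), cos(theta)].
    Encoding orientation as sin/cos avoids the wrap-around at 2*pi.
    """
    return np.array([
        pos[0], pos[1],
        vel[0], vel[1],
        health,
        np.sin(orientation), np.cos(orientation),
    ], dtype=np.float32)

# One frame: fighter at (120, 48), moving right, 80% HP, facing up.
frame = encode_state(pos=(120.0, 48.0), vel=(3.5, 0.0),
                     health=0.8, orientation=np.pi / 2)
```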

Expert Demonstrations

  • 500+ manually played fights by skilled players

  • Labeled "optimal actions" for training scenarios

  • Edge case handling (wall collisions, simultaneous hits)

Scenario Library

Created 1000+ unique combat scenarios:

  • Close-range brawls

  • Long-range positioning

  • Low-health survival situations

  • Aggressive rushdown vs defensive play

  • Counter-attack opportunities

Total Dataset Size: 2.3TB of fight data


🧠 Stage 2: Supervised Pre-Training

Base Neural Network Architecture

Strategic Network (Transformer): a sequence model over recent game-state history, responsible for higher-level decisions such as spacing and when to commit to an attack.

Tactical Network (Feedforward): a per-frame network mapping the current state vector to immediate button actions (A/D/W/F).
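The layer definitions themselves aren't reproduced above. As a minimal sketch of the tactical feedforward network, here is a NumPy forward pass assuming the pre-compression 784-256-128-64-5 sizes quoted in the optimization section, ReLU hidden activations, and 5 output logits (the A/D/W/F buttons plus idle — that split is an assumption):

```python
import numpy as np

# Layer sizes taken from the pre-compression network described later
# (784-256-128-64-5); activations and init are illustrative choices.
SIZES = [784, 256, 128, 64, 5]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
           for n_in, n_out in zip(SIZES[:-1], SIZES[1:])]
biases = [np.zeros(n_out) for n_out in SIZES[1:]]

def tactical_forward(x):
    """ReLU MLP forward pass; returns 5 raw action logits."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ W + b, 0.0)       # hidden layers: ReLU
    return x @ weights[-1] + biases[-1]      # output layer: logits

logits = tactical_forward(rng.standard_normal(784))
```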

Training Process

Phase 1: Imitation Learning

  • Learn from expert demonstrations

  • Supervised learning on labeled fight data

  • Goal: Achieve 80%+ action prediction accuracy

Phase 2: Behavior Cloning

  • Clone successful fighting patterns

  • Train on high-win-rate combat sequences

  • Regularization to prevent overfitting
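The imitation/cloning objective above amounts to cross-entropy against the expert's action labels, with argmax agreement giving the action-prediction accuracy targeted in Phase 1. A small NumPy sketch (loss and metric names are ours, not from the training code):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def cloning_loss(logits, expert_actions):
    """Mean cross-entropy between predicted action logits and expert labels."""
    probs = softmax(logits)
    picked = probs[np.arange(len(expert_actions)), expert_actions]
    return float(-np.log(picked + 1e-9).mean())

def action_accuracy(logits, expert_actions):
    """Fraction of frames where the argmax action matches the expert's."""
    return float((logits.argmax(axis=-1) == expert_actions).mean())

# Two frames, three candidate actions; the network agrees with the expert.
logits = np.array([[2.0, 0.1, -1.0], [0.2, 1.5, 0.0]])
labels = np.array([0, 1])
acc = action_accuracy(logits, labels)   # -> 1.0
```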

Results:

  • Base network achieved 85% win rate vs random agent

  • Demonstrated understanding of fundamental mechanics

  • Ready for reinforcement learning phase


🎮 Stage 3: Reinforcement Learning

Self-Play Training Loop

We used Proximal Policy Optimization (PPO) for stable training:
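The stabilizing idea in PPO is its clipped surrogate objective: policy updates are limited by clipping the new/old probability ratio. A self-contained sketch of just that loss term (the full training loop with value function and GAE is omitted):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)].

    `ratio` is pi_new(a|s) / pi_old(a|s); clipping removes the incentive
    to move the policy more than eps away from the old policy per update.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return float(-np.minimum(unclipped, clipped).mean())

# A ratio far above 1+eps is clipped to 1.2, capping the update size.
loss_big = ppo_clip_loss(np.array([2.0]), np.array([1.0]))   # -> -1.2
loss_ok  = ppo_clip_loss(np.array([1.1]), np.array([1.0]))   # -> -1.1
```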

Reward Function Design

The reward function is critical for learning effective combat:
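The exact reward terms and weights aren't listed here; the following is a hedged sketch of the kind of shaping described — rewarding damage output and wins, penalizing damage taken, with a small survival bonus. All weights are illustrative, not the shipped values:

```python
def shaped_reward(damage_dealt, damage_taken, won, survived_step,
                  w_dmg=1.0, w_taken=1.0, w_win=10.0, w_alive=0.01):
    """Hypothetical shaped reward for one step of combat.

    The small per-step survival bonus keeps agents from learning purely
    suicidal trades; the large terminal win bonus anchors the objective.
    """
    r = w_dmg * damage_dealt - w_taken * damage_taken
    if survived_step:
        r += w_alive
    if won:
        r += w_win
    return r

# One step: landed 5 damage, took 2, fight still ongoing.
r = shaped_reward(damage_dealt=5.0, damage_taken=2.0,
                  won=False, survived_step=True)   # -> 3.01
```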

Training Infrastructure

Hardware:

  • 4x NVIDIA A100 GPUs (80GB VRAM each)

  • 128 CPU cores for parallel simulation

  • 512GB RAM for experience buffer

Training Stats:

  • Total training time: 14 days

  • Total fights simulated: 2.8 million

  • Experience buffer size: 1M transitions

  • Policy updates: 450,000 iterations

Learning Curves:

Opponent Modeling

During training, each AI learned to:

  1. Track opponent patterns: Attack frequency, movement tendencies

  2. Predict next actions: Anticipate attacks based on distance/stance

  3. Exploit weaknesses: Adapt strategy mid-fight

  4. Counter-adapt: Respond when opponent changes strategy
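The pattern-tracking idea in steps 1-2 can be illustrated with a simple exponential moving average over opponent behavior. In the real system this is learned rather than hand-coded, and the smoothing rate below is an arbitrary choice:

```python
class OpponentModel:
    """Track opponent attack frequency with an exponential moving average."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.attack_rate = 0.0   # running estimate of P(attack this frame)

    def observe(self, opponent_attacked):
        # Move the estimate a fraction alpha toward the latest observation.
        self.attack_rate += self.alpha * (float(opponent_attacked) - self.attack_rate)

    def expects_attack(self, threshold=0.5):
        return self.attack_rate > threshold

model = OpponentModel()
for _ in range(50):              # opponent attacks 50 frames in a row...
    model.observe(True)
aggressive = model.expects_attack()   # ...so the estimate nears 1.0
```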


🧬 Stage 4: Personality Specialization

To create 20 unique fighters, we used evolutionary algorithms to diversify behavior.

Genetic Algorithm for Diversity

Genome Encoding:
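The encoding itself isn't reproduced here. A plausible sketch is a flat vector of personality traits in [0, 1]; the trait names below mirror the roster stats listed later (aggression, defense, combo focus, risk), and all values other than Morpheus's 0.85 aggression are illustrative:

```python
from dataclasses import dataclass

@dataclass
class FighterDNA:
    """Hypothetical genome: one trait per behavioral axis, each in [0, 1]."""
    aggression: float       # pressure vs. patience
    defense: float          # evasion and counter-attack weighting
    combo_focus: float      # preference for chained attacks
    risk_tolerance: float   # willingness to take high-risk trades

    def as_vector(self):
        """Flatten to the list form a genetic algorithm would operate on."""
        return [self.aggression, self.defense,
                self.combo_focus, self.risk_tolerance]

# Morpheus's 0.85 aggression comes from the roster; other values are made up.
morpheus = FighterDNA(aggression=0.85, defense=0.40,
                      combo_focus=0.60, risk_tolerance=0.50)
```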

Evolution Process

  1. Initial Population: 100 random DNA variations

  2. Fitness Evaluation: Each variant fights 50 tournaments

  3. Selection: Top 20% by entertainment value (not just win rate!)

  4. Crossover: Combine traits from successful fighters

  5. Mutation: 10% random trait variation

  6. Repeat: 20 generations
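The six steps above can be sketched as a compact genetic algorithm over trait vectors — top 20% selection, uniform crossover, 10% per-gene mutation, 20 generations. The toy fitness function here is purely illustrative (the real one scores entertainment, as described next):

```python
import random
random.seed(42)

def evolve(population, fitness, generations=20, elite_frac=0.2, mut_rate=0.10):
    """Evolve trait vectors: select elites, crossover, mutate, repeat."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elites = ranked[:max(2, int(elite_frac * len(ranked)))]  # top 20%
        children = []
        while len(children) < len(population) - len(elites):
            a, b = random.sample(elites, 2)
            child = [random.choice(pair) for pair in zip(a, b)]  # crossover
            child = [min(1.0, max(0.0, g + random.gauss(0, 0.1)))
                     if random.random() < mut_rate else g
                     for g in child]                             # 10% mutation
            children.append(child)
        population = elites + children
    return max(population, key=fitness)

# 100 random 4-trait genomes; toy fitness prefers balanced traits near 0.5.
pop = [[random.random() for _ in range(4)] for _ in range(100)]
best = evolve(pop, fitness=lambda g: -sum((x - 0.5) ** 2 for x in g))
```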

Fitness Function (maximizes entertainment):
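The shipped fitness function isn't reproduced here; a hedged sketch blending the entertainment signals reported in the validation section (balanced win rate, action variety, comeback rate, fight duration near the observed 67 s average) might look like this, with purely illustrative weights:

```python
def entertainment_fitness(win_rate, action_variety, comeback_rate,
                          avg_duration, target_duration=67.0):
    """Hypothetical entertainment score in [0, 1].

    Peaks at a 50% win rate (no dominant strategy), rewards action variety
    and comebacks, and prefers fights near the target duration.
    """
    balance = 1.0 - abs(win_rate - 0.5) * 2.0        # 1.0 at exactly 50%
    pacing = max(0.0, 1.0 - abs(avg_duration - target_duration) / target_duration)
    return (0.3 * balance + 0.3 * action_variety
            + 0.2 * comeback_rate + 0.2 * pacing)

# A perfectly balanced fighter with the reported variety/comeback numbers.
score = entertainment_fitness(win_rate=0.50, action_variety=0.87,
                              comeback_rate=0.23, avg_duration=67.0)
```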

Final Fighter Roster

After evolution, we selected 20 fighters with distinct personalities:

Aggressive Types:

  • Morpheus (Aggression: 0.85) - Relentless pressure, high combo focus

  • Saint (Aggression: 0.78) - Calculated aggression, punish mistakes

Defensive Types:

  • GhostHash (Defensive: 0.82) - Elusive movement, counter-attack

  • TheMiner (Defensive: 0.75) - Patient, wall positioning

Balanced Types:

  • NeoNode (Balanced) - Adaptive, reads opponent

  • CipherKid (Balanced) - Technical, frame-perfect execution

Specialist Types:

  • BitSamurai (Combo: 0.91) - Chain attacks, high damage

  • DarkWallet (Risk: 0.88) - High-risk/high-reward plays


🔧 Stage 5: Fine-Tuning & Optimization

Balance Adjustments

We ran extensive tournament testing to ensure:

  1. No dominant strategy: All 20 fighters have 45-55% overall win rate

  2. Rock-paper-scissors dynamics: Counter-matchups exist

  3. Skill expression: Better AI wins more consistently
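Criterion 1 is easy to check mechanically: flag any fighter whose round-robin win rate falls outside the 45-55% band. A minimal sketch (fighter names and rates taken from the results section; the helper itself is ours):

```python
def balance_outliers(win_rates, lo=0.45, hi=0.55):
    """Return fighters whose overall win rate is outside the target band."""
    return [name for name, wr in win_rates.items() if not lo <= wr <= hi]

# Round-robin rates reported in the validation results.
rates = {"Symbol": 0.498, "Morpheus": 0.521, "GhostHash": 0.479}
outliers = balance_outliers(rates)   # -> [] : all three are in band
```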

Performance Optimization

Model Compression:

  • Pruned tactical network: 784-256-128-64-5 → 512-128-32-5

  • Quantization: FP32 → FP16 (50% size reduction)

  • Knowledge distillation: Compress strategic model

  • Result: 3x faster inference, 95% accuracy retained
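The FP32 → FP16 halving is simple to demonstrate in NumPy (the production pipeline presumably used its framework's quantization tooling rather than a raw cast):

```python
import numpy as np

# One weight matrix from the compressed 512-128-32-5 network shape.
weights_fp32 = np.random.standard_normal((512, 128)).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)   # post-training quantization

# Half the bytes, with only a small per-weight rounding error.
reduction = 1 - weights_fp16.nbytes / weights_fp32.nbytes      # -> 0.5
max_err = float(np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max())
```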

Inference Optimization:

  • TensorRT compilation for GPU inference

  • ONNX export for cross-platform compatibility

  • Batch processing for multi-fight simulations

  • Result: 8ms decision latency (down from 45ms)


📊 Training Results & Validation

Performance Metrics

Against Random AI:

  • Win rate: 98.7%

  • Average fight duration: 12.3s

  • Damage efficiency: 11.2 hits to kill

Against Each Other (Round-Robin):

  • Most balanced fighter: Symbol (49.8% win rate)

  • Most aggressive: Morpheus (52.1% win rate)

  • Most defensive: GhostHash (47.9% win rate)

  • Most entertaining: BitSamurai (highest action variety)

Validation Tests

Robustness:

  • ✅ Handles edge cases (corner traps, simultaneous hits)

  • ✅ Recovers from disadvantage (low-HP comebacks)

  • ✅ Adapts to opponent changes mid-fight

Entertainment Value:

  • ✅ Average fight duration: 67 seconds (target: 45-90s)

  • ✅ Action variety score: 8.7/10

  • ✅ Comeback rate: 23% of fights decided in final 20%


🚀 Future Improvements

Planned Enhancements

  1. Meta-Learning: AI learns from past tournament results

  2. Human Feedback: Incorporate viewer preferences

  3. Seasonal Updates: Retrain with new strategies

  4. Community Champions: User-submitted AI variants

Research Directions

  • Multi-Agent Learning: Train fighters against full roster simultaneously

  • Curriculum Learning: Progressive difficulty in training scenarios

  • Hierarchical RL: More sophisticated strategy layers

  • Transfer Learning: Apply to other game genres
