1. Introduction: The Problem with Pure RL on 2048
2048 is deceptively difficult. Unlike chess or Go, where a chosen move leads to a deterministic successor state, 2048 introduces two sources of stochasticity: (1) the player chooses an action, but (2) the game engine randomly spawns a new tile (90% chance of 2, 10% chance of 4) into an empty cell. The board is fully observable, so this is not a POMDP; rather, it is a stochastic Markov decision process in which every player action is coupled with an uncontrolled chance node.
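The chance node is easy to make concrete. Below is a minimal sketch of the spawn rule (the `spawn_tile` helper and list-of-lists board are illustrative; the real engine implements this in C++):

```python
import random

def spawn_tile(board, rng=random):
    """Chance node: place a 2 (90%) or 4 (10%) in a uniformly random empty cell."""
    empty = [(r, c) for r in range(4) for c in range(4) if board[r][c] == 0]
    if not empty:
        return board  # full board: no spawn possible
    r, c = rng.choice(empty)
    board[r][c] = 2 if rng.random() < 0.9 else 4
    return board
```

Expectimax handles this node by averaging over every (cell, tile-value) outcome weighted by its probability, rather than sampling it.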
Pure reinforcement learning struggles with 2048 for three fundamental reasons:
- **Sparse Terminal Rewards:** The only meaningful reward signal is reaching 2048—a goal the agent may never achieve during training. Without dense shaping rewards, value functions collapse to near-zero for most states.
- **High-Variance Rollouts:** A single 2048 game spans 500–1500 moves. The multiplicative nature of tile merges means small errors compound exponentially, making Monte Carlo returns extremely noisy.
- **Horizon Effects:** The agent must learn to reason about consequences 100+ moves ahead, yet discount factors exponentially downweight distant rewards.
Our core thesis: Hand-crafted heuristics are brittle. They encode human intuitions that break under distribution shift (e.g., unusual board configurations). Learned value functions from Masked PPO, combined with the probabilistic lookahead of Expectimax, create a robust, emergent strategy that outperforms both pure RL (0% win rate) and pure heuristics.
2. The Mathematical Core: RL Setup
State Representation and Reward Shaping
The board state stores tile exponents: 0 for an empty cell, and $k$ where $2^k$ is the actual tile value for $k \ge 1$. We use a log-normalized embedding rather than a raw one-hot encoding to compress the exponential tile range into a dense representation.
The immediate reward for a merge is log-scaled:
```python
# From twenty_forty_eight_ai/env/reward.py (conceptual)
import numpy as np

def get_reward(merge_score: int) -> float:
    if merge_score <= 0:
        return 0.0
    return np.log2(merge_score)  # e.g., 1024+1024 → log2(2048) = 11
```
This transformation is critical:
- Raw merge scores grow exponentially: merging two 1024 tiles yields +2048
- Log scores compress to a smooth linear scale: the same merge yields $\log_2(2048) = 11$
The discounted return becomes $G_t = \sum_{k \ge 0} \gamma^k \log_2(\text{merge}_{t+k})$, reducing variance in value function estimates without distorting the relative ordering of good vs. bad merges.
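A quick numeric check makes the compression concrete: raw merge scores span three orders of magnitude while their log-scaled counterparts stay in a narrow linear range (the values here are standard 2048 merge scores, not figures from the training logs):

```python
import math

# Raw merge scores vs. log-scaled rewards for successively larger merges
merges = [4, 64, 1024, 2048]                 # points awarded when two equal tiles merge
log_scaled = [math.log2(m) for m in merges]  # compressed to a linear scale: 2, 6, 10, 11
```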
Custom CNN Architecture: Why Specialized Convolutions?
Standard MLPs treat the 4×4 grid as flat input, ignoring the game’s inherent structure. We implemented a three-pathway depthwise separable CNN (architecture.py:27-94) that explicitly captures row, column, and local grid patterns:
```python
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class CustomCNN(BaseFeaturesExtractor):
    def __init__(self, observation_space, features_dim=256):
        super().__init__(observation_space, features_dim)
        # Tile embedding: 17 types (0=empty, 1=2, 2=4, ..., 16=65536)
        self.embedding = nn.Embedding(num_embeddings=17, embedding_dim=128)
        # Three parallel pathways with depthwise separable convolutions
        self.row_pathway = nn.Sequential(
            DepthwiseSeparableConv(128, 128, kernel_size=(1, 4)),  # ← 1×4 kernel
            nn.ReLU(), nn.Flatten(),
        )
        self.col_pathway = nn.Sequential(
            DepthwiseSeparableConv(128, 128, kernel_size=(4, 1)),  # ← 4×1 kernel
            nn.ReLU(), nn.Flatten(),
        )
        self.grid_pathway = nn.Sequential(
            DepthwiseSeparableConv(128, 128, kernel_size=(2, 2)),  # ← 2×2 local
            nn.ReLU(), nn.Flatten(),
        )
```
Why these kernel sizes?
- `(1, 4)` row kernel: Captures horizontal tile patterns within a row (e.g., `[2, 2, 4, 8]` → merge opportunity). The entire row is processed in one convolution.
- `(4, 1)` column kernel: Symmetrically captures vertical patterns.
- `(2, 2)` grid kernel: Captures local 2×2 spatial interactions (e.g., corner formations, 2×2 squares).
Standard 3×3 kernels would blend these patterns inefficiently. The (1, 4) and (4, 1) kernels are translation-invariant within rows/columns—a pattern learned at row 0 is automatically recognized at row 3, providing parameter sharing that dramatically improves sample efficiency.
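The `DepthwiseSeparableConv` block itself is not shown above. A minimal sketch of one common formulation (a depthwise convolution with one filter per channel, followed by a 1×1 pointwise channel mix; not necessarily the repository's exact implementation) is:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one spatial filter per channel) + 1x1 pointwise channel mix."""
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # groups=in_channels makes each filter see only its own channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```

For the `(1, 4)` row pathway, an input of shape `(batch, 128, 4, 4)` comes out as `(batch, 128, 4, 1)`: one feature vector per row.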
Action Masking: Preventing Invalid Actions
Standard PPO would waste capacity learning invalid moves (e.g., “move left” when tiles are already left-aligned). We use invalid action masking via sb3_contrib.MaskablePPO:
```python
# During environment step
masks = environment.action_masks()  # Returns bool[4], True=valid
model.predict(observation, action_masks=masks)
```
This constrains the policy to only valid actions, ensuring 100% of learning signal is directed toward meaningful decisions.
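Under the hood, a mask can be computed by testing whether each move would change the board at all. A simplified pure-Python sketch (the helpers here are illustrative; the real environment delegates this check to the C++ `move_valid_LUT`):

```python
def slide_left(row):
    """Apply the 2048 left-move rule to one row: compact, merge equal neighbors, pad."""
    tiles = [v for v in row if v != 0]
    merged = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # each tile merges at most once per move
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))

def action_masks(board):
    """Return [left, right, up, down] validity: a move is valid iff it changes the board."""
    transpose = [list(col) for col in zip(*board)]
    views = [board,                              # left
             [row[::-1] for row in board],       # right = mirror, then slide left
             transpose,                          # up = transpose, then slide left
             [row[::-1] for row in transpose]]   # down = transpose + mirror
    return [any(slide_left(row) != row for row in view) for view in views]
```

For example, a lone 2 in the top-left corner admits only "right" and "down".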
3. The Search Engine: C++ Optimization
Why C++? The Python Tree Search Bottleneck
At Depth-3 Expectimax, the search explores approximately 3,000–8,000 leaf nodes per decision. Each leaf requires a forward pass through the CNN (GPU) or the value head (CPU). If we naively call Python for every node:
- Python interpreter overhead: ~50–100 μs per function call
- CUDA kernel launch latency: ~200–500 μs per kernel
- Total: 3,000 × 500 μs = 1.5 seconds per move (unacceptable)
The solution: C++ core with batched leaf evaluation.
Bitboard Representation and O(1) Lookup Tables
Fast2048.cpp:182-220 precomputes all 65,536 possible row configurations at startup:
```cpp
void Fast2048::init_LUT() {
    for (int i = 0; i < 65536; i++) {
        // Pack 4 tiles into 16-bit integer: 4 bits per tile
        // index = tile[0] | (tile[1] << 4) | (tile[2] << 8) | (tile[3] << 12)
        for (int j = 0; j < 4; j++) {
            original_row[j] = (i >> (j * 4)) & 0xF;
            row[j] = original_row[j];
        }
        // Stack → Merge → Stack (2048 rules)
        // ...
        move_row_LUT.push_back(row);                    // Resulting row after move
        move_reward_LUT.push_back(reward);              // Merge reward
        move_valid_LUT.push_back(original_row != row);  // Did the move change anything?
    }
}
```
Three static LUTs (lines 266–268):
```cpp
static std::vector<std::array<int, 4>> move_row_LUT;  // ~800KB
static std::vector<int> move_reward_LUT;              // ~256KB
static std::vector<bool> move_valid_LUT;              // ~65KB
```
Move execution becomes O(1) per row (4 rows = O(4)):
```cpp
// Fast2048.cpp:81-126 - move_simulated()
int index = row_to_number(row);         // Pack row into 16-bit int
merge_score += move_reward_LUT[index];  // O(1) reward lookup
row = move_row_LUT[index];              // O(1) table lookup
```
`row_to_number()` (lines 262–264):

```cpp
int row_to_number(const std::array<int, 4> &row) const {
    return row[0] | (row[1] << 4) | (row[2] << 8) | (row[3] << 12);
}
```
Each tile value occupies 4 bits, enabling single-instruction packing via bitwise OR.
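The packing scheme is easy to sanity-check from Python (a direct port of `row_to_number` plus its inverse; remember that tiles are stored as exponents, so a 2048 tile is the nibble 11):

```python
def row_to_number(row):
    """Pack four 4-bit tile exponents into one 16-bit LUT index."""
    return row[0] | (row[1] << 4) | (row[2] << 8) | (row[3] << 12)

def number_to_row(index):
    """Inverse: unpack a 16-bit LUT index back into four tile exponents."""
    return [(index >> (4 * j)) & 0xF for j in range(4)]
```

Because the packing is a bijection on 16-bit values, every reachable row maps to exactly one LUT slot and back.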
Transposition Table: Memoizing Subtrees
The same board state can be reached via different move sequences. ExpectimaxSearcher.cpp:76-108 caches results using a transposition table:
```cpp
float max_node_substitute(const Board& board, int depth,
                          const std::map<Board, float>& leaf_cache) {
    if (depth == 0) {
        return leaf_cache.count(board) ? leaf_cache.at(board) : -100.0f;
    }
    TranspositionKey key = {board, depth};  // (state, depth) pair
    if (transposition_table.count(key)) {
        return transposition_table[key];    // ← Cache hit
    }
    // ... search logic ...
    transposition_table[key] = result;      // ← Cache miss: store
    return result;
}
```
Observed cache hit rates:
- Early-game (low entropy): ~80%
- Late-game (high entropy): ~40–60%
This effectively prunes the search tree by 2×–3×, enabling Depth-3 search to complete within the 150ms/move latency budget.
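The caching pattern itself is language-agnostic. A stripped-down Python analogue (`make_memoized_search` and `expand` are hypothetical names; `expand` stands in for the real max/chance recursion) shows the (state, depth) keying and the hit/miss accounting:

```python
def make_memoized_search(expand):
    """Wrap a (state, depth) -> value search with a transposition table.

    `expand(state, depth, recurse)` computes a node's value, calling
    `recurse(child, depth - 1)` for subtrees; repeated (state, depth)
    pairs are served from the table instead of being re-searched.
    """
    table = {}
    stats = {"hits": 0, "misses": 0}

    def recurse(state, depth):
        key = (state, depth)           # same state at same depth => same subtree value
        if key in table:
            stats["hits"] += 1
            return table[key]
        stats["misses"] += 1
        value = expand(state, depth, recurse)
        table[key] = value
        return value

    return recurse, stats
```

Whenever two move sequences converge on the same board, the second arrival is a table hit and its whole subtree is skipped.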
Batched Leaf Evaluation: Amortizing CUDA Kernels
The critical optimization is decoupling leaf collection from neural evaluation (ExpectimaxSearcher.cpp:110-148):
```cpp
int ExpectimaxSearcher::find_best_move(const Board& board, int depth,
                                       const BatchEvalFunc& batch_eval_func) {
    transposition_table.clear();
    // Phase 1: BFS traversal — collect ALL unique leaf boards
    std::vector<Board> leaves_to_evaluate;
    std::map<Board, bool> visited_leaves;
    gather_leaves(board, depth, leaves_to_evaluate, visited_leaves);
    // Phase 2: Single batched call to Python/ML framework
    std::vector<float> evaluations;
    if (!leaves_to_evaluate.empty()) {
        evaluations = batch_eval_func(leaves_to_evaluate);  // One Python call
    }
    // Phase 3: Build cache for tree search
    std::map<Board, float> leaf_cache;
    for (size_t i = 0; i < leaves_to_evaluate.size(); ++i)
        leaf_cache[leaves_to_evaluate[i]] = evaluations[i];
    // Phase 4: Search with cached leaf evaluations
    // ... compute best move using cached values ...
}
```
Batch sizes by depth:
| Depth | Approximate Leaves | Without Batching | With Batching |
|---|---|---|---|
| 1 | 50–100 | ~25–50ms | ~5ms |
| 2 | 500–1,000 | ~250–500ms | ~15ms |
| 3 | 3,000–8,000 | ~1.5–4s | ~80ms |
Batching amortizes the Python interpreter overhead and CUDA kernel launch cost across thousands of states. On GPU, `torch.nn.functional.linear()` processes an 8,000×16 batch (one flattened 4×4 board per row) in a single kernel invocation, at roughly the same cost as evaluating a single state.
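On the Python side, the batched callback handed to C++ can be a single vectorized forward pass. A hypothetical sketch (`make_batch_eval` and `value_net` are illustrative names, not the repository's API):

```python
import torch

def make_batch_eval(value_net, device="cpu"):
    """Build a callback that scores a whole list of 4x4 boards in one forward pass."""
    def batch_eval(boards):
        with torch.no_grad():  # inference only: no autograd bookkeeping
            # list of N 4x4 grids -> one (N, 16) tensor -> one kernel launch
            x = torch.tensor(boards, dtype=torch.float32, device=device)
            x = x.reshape(len(boards), -1)
            return value_net(x).squeeze(-1).tolist()
    return batch_eval
```

However many leaves the search gathers, the interpreter and kernel-launch overhead is paid once per move rather than once per leaf.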
4. Bridging Theory and Results: The Ablation Study
Benchmark Results
We evaluated 100 episodes per configuration:
| Configuration | Avg Score | 2048+ Win Rate | Max Tile (Freq) | Avg Moves |
|---|---|---|---|---|
| Raw PPO Policy | 7,995.6 ± 3,502.7 | 0% | 1024 (18%) | 541 |
| + Expectimax (d=1) | 5,127.3 ± 2,482.2 | 0% | 1024 (4%) | 372 |
| + Expectimax (d=2) | 14,014.1 ± 6,496.2 | 13% | 2048 (13%) | 822 |
| + Expectimax (d=3) | 26,523.0 ± 12,749.8 | 58% | 4096 (8%) | 1,393 |
The Shallow Search Trap (Depth 1 Degradation)
Counter-intuitively, Depth-1 search degrades performance (7,995 → 5,127). Why?
The value function learned by PPO contains local noise—imperfect estimates of long-term value. A 1-step lookahead overfits to these noisy estimates, propagating errors without the averaging effect of deeper search.
Our policy network has learned robust action priors through 200M timesteps of training, implicitly regularizing noise. Depth-1 search bypasses these priors, replacing them with noisy evaluations at the immediate successor state.
Search as Regularization (Depth 3 Noise Filtering)
Depth-3 search acts as Monte Carlo averaging over the value estimates at thousands of leaf nodes. By aggregating across diverse board configurations, Expectimax filters out noise in the value function.
Mathematically, Depth-3 Expectimax computes

$$V_d(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\, V_{d-1}(s'), \qquad V_0(s) = V_\theta(s),$$

where $V_\theta$ is evaluated by the learned network and the sum at each chance node averages over all possible tile spawns.
This produces a 3.3× score improvement (7,995 → 26,523) and enables strategic play reaching the 4096 tile—a configuration the raw policy never achieves.
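Stripped of engine details, the max/expectation alternation behind these results fits in a few lines. A sketch with hypothetical helpers (`valid_moves`, `apply_move`, and `spawn_outcomes` stand in for the C++ engine):

```python
def expectimax(board, depth, value_fn, valid_moves, apply_move, spawn_outcomes):
    """Alternate max (player) and expectation (tile spawn) nodes down to depth 0."""
    if depth == 0:
        return value_fn(board)
    moves = valid_moves(board)
    if not moves:
        return value_fn(board)  # terminal: no legal move remains
    best = float("-inf")
    for move in moves:
        after = apply_move(board, move)
        # Chance node: average over (probability, child) spawn outcomes
        expected = sum(p * expectimax(child, depth - 1, value_fn,
                                      valid_moves, apply_move, spawn_outcomes)
                       for p, child in spawn_outcomes(after))
        best = max(best, expected)
    return best
```

At depth 3, each leaf's noisy $V_\theta$ estimate is only one term in a large weighted average, which is exactly the noise-filtering effect described above.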
Conceptual Parallel: Expectimax and Bayesian Optimization
Our Depth-3 search mirrors the exploration-exploitation trade-off in GP-UCB (Krause et al., 2009):
- GP-UCB: A surrogate model is updated from past evaluations; UCB balances $\mu(x)$ (mean prediction) with $\beta^{1/2}\sigma(x)$ (epistemic uncertainty)
- Expectimax: The learned value function acts as a surrogate for long-term return; Expectimax reduces epistemic uncertainty by averaging over possible futures at chance nodes
Optuna’s MedianPruner (tune.py:101-104) implements a similar idea for hyperparameter search:
```python
pruner = optuna.pruners.MedianPruner(
    n_startup_trials=pruner_config['n_startup_trials'],
    n_warmup_steps=pruner_config['n_warmup_steps']
)
```
Trials that perform worse than the median are pruned early—analogous to UCB cutting off high-uncertainty branches.
5. Conclusion and Future Work
The Engineering Triumph
AI 2048 demonstrates the power of hybrid systems: combining Python's ML ecosystem (PyTorch, Stable-Baselines3, Optuna) with C++'s computational efficiency. The C++ engine evaluates 8,000 leaf nodes in ~80ms—a 50× speedup over naive Python—while the Python layer handles neural network training and hyperparameter optimization.
Key engineering decisions:
- Lookup Tables trade memory (~1 MB) for computation (O(1) moves)
- Transposition Tables exploit state redundancy, reducing effective branching factor by 2×–3×
- Batched Evaluation amortizes Python/CUDA overhead across thousands of states
Future Work
- **Distributional Value Networks:** Replace the point estimate $V_\theta(s)$ with a distribution $Z_\theta(s)$ (quantile regression). Use its variance to implement risk-sensitive Expectimax, penalizing leaf nodes with high epistemic uncertainty.
- **AlphaZero-Style MCTS:** Replace fixed-depth Expectimax with Monte Carlo Tree Search guided by the learned policy $\pi_\theta(a \mid s)$. This enables:
  - Adaptive depth (shallow for obvious moves, deep for critical decisions)
  - Prioritized node expansion using $\pi_\theta$ as move ordering
  - Natural exploration via UCB-style bonuses
- **Transfer to Other Stochastic Games:** The hybrid RL + search framework generalizes to Threes, 2048 variants (5×5 grid), and slot-based puzzle games with similar uncertainty structures.
Key Files Referenced:
- `cpp_src/Fast2048.cpp:182-220` — LUT initialization
- `cpp_src/ExpectimaxSearcher.cpp:110-148` — Batched search orchestration
- `twenty_forty_eight_ai/agent/architecture.py:27-94` — Custom CNN
- `scripts/tune.py:59-91` — Optuna objective function