Abstract
While Transformer-based models have achieved state-of-the-art prediction performance,
their internal attention mechanisms remain opaque. We present a spatial attention
visualization framework that maps abstract Transformer attention weights onto
bird’s-eye-view (BEV) traffic scenes, providing the first spatially grounded
interpretation of attention in trajectory prediction. Built upon MTR-Lite
(8.48M parameters) trained on the Waymo Open Motion Dataset, our framework employs a novel
spatial token bookkeeping mechanism. We discover that cyclists receive
up to 73% less self-attention than vehicles
at equivalent distances — a safety blind spot. We further introduce
counterfactual attention analysis to isolate the causal effect of
individual scene elements on model attention.
- 73% cyclist attention deficit
- 4.7% minADE degradation from distance masking
- 8.48M MTR-Lite parameters
- 88.1% cyclist miss rate (vs. 54.0% for vehicles)
- 2.314 m minADE@6 (54.4% better than the constant-velocity baseline)
Method
Framework Overview
Our visualization framework consists of three components built on top of an MTR-Lite
Transformer trained on the Waymo Open Motion Dataset (20% subset, ~17,800 scenes):
- Attention-Capture Layers — custom Transformer layers that extract per-head attention weight matrices from every encoder and decoder layer without altering predictions
- Spatial Token Bookkeeping — a bidirectional mapping between abstract token indices and physical BEV coordinates, enabling projection of attention weights onto the traffic scene
- Three Visualization Types — space-attention BEV heatmaps (where the model looks), time-attention refinement diagrams (how attention evolves across layers), and lane-token activation maps (which road structures guide prediction)
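The first two components can be sketched together in a few lines. This is an illustrative numpy version, not the framework's actual PyTorch code: a single-head scaled dot-product attention that records its weight matrix, plus a toy token-index-to-BEV-coordinate table (all names and coordinates here are our assumptions).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CapturedAttention:
    """Scaled dot-product attention that stores its weight matrix
    for later BEV projection (illustrative sketch, single head)."""

    def __init__(self):
        self.last_weights = None  # (num_query_tokens, num_key_tokens)

    def __call__(self, q, k, v):
        d = q.shape[-1]
        w = softmax(q @ k.T / np.sqrt(d))
        self.last_weights = w  # capture without altering the output
        return w @ v

# Spatial token bookkeeping: bidirectional token index <-> BEV (x, y) map.
token_to_xy = {i: (float(i), 0.0) for i in range(5)}  # toy coordinates
xy_to_token = {xy: i for i, xy in token_to_xy.items()}
```

Given a captured matrix, `last_weights[i, j]` can then be splatted onto `token_to_xy[j]` to produce a BEV heatmap for query token `i`.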
MTR-Lite Architecture
A lightweight Motion Transformer variant with 4 encoder layers (global self-attention over
32 agent + 64 map tokens), 4 decoder layers (agent cross-attention + map cross-attention),
and 64 intention queries refined into K=6 output modes via NMS. Trained for 60 epochs
on Waymo with AdamW, cosine annealing, and mixed-precision training.
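The token budget described above can be written down as a small configuration sketch (the field names are ours, not the authors'):

```python
# Hypothetical configuration mirroring the MTR-Lite description above.
MTR_LITE_CFG = {
    "encoder_layers": 4,
    "decoder_layers": 4,
    "agent_tokens": 32,
    "map_tokens": 64,
    "intention_queries": 64,
    "output_modes": 6,  # K=6 modes kept after NMS
}

def encoder_token_count(cfg):
    """Encoder self-attention is global over agent + map tokens jointly."""
    return cfg["agent_tokens"] + cfg["map_tokens"]
```

With 32 agent and 64 map tokens, each encoder layer attends over 96 tokens in total.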
Counterfactual Experiments
By directly editing scene dictionaries (removing agents, flipping traffic signals, injecting
pedestrians at varying distances), we perform the first counterfactual attention
analysis for trajectory prediction — enabling causal (not just correlational)
claims about how scene elements influence model reasoning.
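An intervention of this kind can be sketched as a pure function over a scene dictionary. The field names (`agents`, `signals`, `id`, `state`) are illustrative assumptions, not the dataset's actual schema:

```python
import copy

def remove_agent(scene, agent_id):
    """Return a copy of the scene with one agent removed
    (counterfactual intervention; the original scene is untouched)."""
    edited = copy.deepcopy(scene)
    edited["agents"] = [a for a in edited["agents"] if a["id"] != agent_id]
    return edited

def flip_signal(scene, signal_id):
    """Flip a traffic signal between red and green in a scene copy."""
    edited = copy.deepcopy(scene)
    for sig in edited["signals"]:
        if sig["id"] == signal_id:
            sig["state"] = "green" if sig["state"] == "red" else "red"
    return edited
```

Running the model on the original and the edited scene, then differencing the captured attention maps, isolates the edited element's causal contribution.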
Animated Attention Demonstrations
These animations show how attention evolves across encoder layers and over time for different driving scenarios.
Attention Evolution Across Encoder Layers
Encoder layer progression with cyclist — note the attention deficit on the cyclist token.
Dense traffic: attention distributes broadly across multiple agents.
Pedestrian scenario: how attention evolves through 4 encoder layers.
General driving: progressive attention narrowing through layers (5.64 → 5.36 bits).
Temporal Attention Dynamics
9-second temporal attention dynamics with cyclist interaction.
Temporal attention dynamics in pedestrian crossing scenario.
Spatial Attention Analysis
Spatial attention patterns across diverse driving scenarios. BEV heatmaps show where the model "looks" when predicting trajectories.
Non-monotonic entropy evolution reveals layer specialization: Layers 0-2 narrow focus (5.64→5.36 bits), Layer 3 broadens to map attention (5.92 bits).
Tunnel vision: failed predictions show concentrated attention (entropy 5.72 vs 5.94 bits) with 40% higher self-attention.
Cyclists receive 73% less self-attention than vehicles, leading to 88.1% miss rate — a critical safety blind spot.
Attention head specialization: map-focused heads (93.3%) vs agent-focused heads (58.8%) reveal functional decomposition.
Far-range attention ablation: even mild distance masking (α=0.05) degrades minADE by 4.7%, proving far-range context is essential.
Dynamic scene-type adaptation: 42.3% agent attention in dense traffic vs 18.4% in sparse scenes.
Mode-specific attention: left-turn vs straight trajectory predictions attend to different scene elements.
Counterfactual: removing an agent redistributes 0.012 attention to the next-nearest agent, revealing a learned priority hierarchy.
Lane attention selectivity: top 2 lanes capture >60% of decoder map attention.
Decoder attention refinement: self-attention decreases 23.8% across decoder layers as predictions are finalized.
Key Findings
1. Non-Monotonic Layer Specialization
Entropy evolves non-monotonically: 5.64 → 5.50 → 5.36 → 5.92 bits (Layers 0–3).
Layers 0–2 progressively narrow focus on agents (agent share: 49.7% → 55.1% → 62.4%),
while Layer 3 reverses to broad map attention (63.6% map tokens). This reveals a hierarchical
strategy: agent interaction first, then map-conditioned planning.
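The entropy figures quoted above are Shannon entropies in bits over a layer's attention distribution; a minimal sketch of the computation:

```python
import numpy as np

def attention_entropy_bits(weights, eps=1e-12):
    """Shannon entropy (bits) of an attention distribution over tokens."""
    p = np.asarray(weights, dtype=np.float64)
    p = p / p.sum()  # renormalise in case weights don't sum to exactly 1
    return float(-(p * np.log2(p + eps)).sum())
```

For context, a uniform distribution over the 96 encoder tokens (32 agent + 64 map) gives the ceiling log2(96) ≈ 6.58 bits, so the reported 5.36–5.92 bits reflect moderately concentrated attention.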
2. Tunnel Vision Failure Mode
Failed predictions show lower entropy (5.72 vs 5.94 bits in successes),
with self-attention 40% higher in failures (0.049 vs 0.035) and max single-token
weight 49% higher (0.058 vs 0.039). The model “locks onto” the ego
agent instead of distributing attention — a tunnel vision failure pattern.
3. Far-Range Attention is Essential
Distance masking at α=0.05 causes a +4.7% minADE degradation: even mild far-range suppression hurts accuracy. Far-range agents (30–50 m) still receive 28.6% of attention despite their distance. The distance–attention correlation is r = −0.681 (moderate, not extreme), confirming that far-range context is essential for accurate predictions.
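One plausible form of the masking intervention is sketched below. The exact scheme used in the ablation is not specified here, so the down-weight-and-renormalise formulation, the 30 m radius, and the function name are all our assumptions:

```python
import numpy as np

def distance_mask(weights, distances, alpha=0.05, radius=30.0):
    """Suppress attention to tokens beyond `radius` metres by a factor
    (1 - alpha), then renormalise. alpha=0.05 is the mild setting
    reported above; the masking form itself is an assumption."""
    w = np.asarray(weights, dtype=np.float64).copy()
    far = np.asarray(distances) > radius
    w[far] *= (1.0 - alpha)
    return w / w.sum()
```

Re-running prediction with masked attention and comparing minADE against the unmasked run quantifies how much far-range context the model relies on.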
4. Dynamic Scene-Type Adaptation
Dense traffic: 42.3% agent attention, entropy 6.11 bits.
Sparse scenes: 18.4% agent attention, entropy 5.33 bits.
Highway driving shows longer attention reach (21.4m mean top-5 distance),
while intersections produce shorter, broader attention (17.0m, entropy 6.10 bits).
The model dynamically adapts its attention strategy to scene complexity.
Citation
@article{zhou2026spatial,
  title={Spatial Attention Visualization for Interpretable Trajectory Prediction in Autonomous Driving: Discovering Safety Blind Spots Through Counterfactual Analysis},
  author={Zhou, Xingnan and Alecsandru, Ciprian},
  year={2026},
  note={In Preparation}
}