Abstract
While Transformer-based models have achieved state-of-the-art prediction performance,
their internal attention mechanisms remain opaque. We present a spatial attention
visualization framework that maps abstract Transformer attention weights onto
bird’s-eye-view (BEV) traffic scenes, providing the first spatially grounded
interpretation of attention in trajectory prediction. Built upon MTR-Lite
(8.48M parameters) trained on the Waymo Open Motion Dataset, our framework employs a novel
spatial token bookkeeping mechanism. We discover that cyclists receive
up to 73% less self-attention than vehicles
at equivalent distances — a safety blind spot. We further introduce
counterfactual attention analysis to isolate the causal effect of
individual scene elements on model attention.
- 73% cyclist attention deficit
- 4.7% minADE degradation from distance masking
- 8.48M MTR-Lite parameters
- 88.1% cyclist miss rate (vs. 54.0% for vehicles)
- 2.314 m minADE@6 (54.4% better than the constant-velocity baseline)
Method
Framework Overview
Our visualization framework consists of three components built on top of an MTR-Lite
Transformer trained on the Waymo Open Motion Dataset (20% subset, ~17,800 scenes):
- Attention-Capture Layers — custom Transformer layers that extract per-head attention weight matrices from every encoder and decoder layer without altering predictions
- Spatial Token Bookkeeping — a bidirectional mapping between abstract token indices and physical BEV coordinates, enabling projection of attention weights onto the traffic scene
- Three Visualization Types — space-attention BEV heatmaps (where the model looks), time-attention refinement diagrams (how attention evolves across layers), and lane-token activation maps (which road structures guide prediction)
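The first two components can be sketched together in a few lines. This is an illustrative numpy version, not the framework's actual PyTorch code: a single-head scaled dot-product attention that records its weight matrix, plus a toy token-index-to-BEV-coordinate table (all names and coordinates here are our assumptions).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CapturedAttention:
    """Scaled dot-product attention that stores its weight matrix
    for later BEV projection (illustrative sketch, single head)."""

    def __init__(self):
        self.last_weights = None  # (num_query_tokens, num_key_tokens)

    def __call__(self, q, k, v):
        d = q.shape[-1]
        w = softmax(q @ k.T / np.sqrt(d))
        self.last_weights = w  # capture without altering the output
        return w @ v

# Spatial token bookkeeping: bidirectional token index <-> BEV (x, y) map.
token_to_xy = {i: (float(i), 0.0) for i in range(5)}  # toy coordinates
xy_to_token = {xy: i for i, xy in token_to_xy.items()}
```

Given a captured matrix, `last_weights[i, j]` can then be splatted onto `token_to_xy[j]` to produce a BEV heatmap for query token `i`.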
MTR-Lite Architecture
A lightweight Motion Transformer variant with 4 encoder layers (global self-attention over
32 agent + 64 map tokens), 4 decoder layers (agent cross-attention + map cross-attention),
and 64 intention queries refined into K=6 output modes via NMS. Trained for 60 epochs
on Waymo with AdamW, cosine annealing, and mixed-precision training.
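The token budget described above can be written down as a small configuration sketch (the field names are ours, not the authors'):

```python
# Hypothetical configuration mirroring the MTR-Lite description above.
MTR_LITE_CFG = {
    "encoder_layers": 4,
    "decoder_layers": 4,
    "agent_tokens": 32,
    "map_tokens": 64,
    "intention_queries": 64,
    "output_modes": 6,  # K=6 modes kept after NMS
}

def encoder_token_count(cfg):
    """Encoder self-attention is global over agent + map tokens jointly."""
    return cfg["agent_tokens"] + cfg["map_tokens"]
```

With 32 agent and 64 map tokens, each encoder layer attends over 96 tokens in total.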
Counterfactual Experiments
By directly editing scene dictionaries (removing agents, flipping traffic signals, injecting
pedestrians at varying distances), we perform the first counterfactual attention
analysis for trajectory prediction — enabling causal (not just correlational)
claims about how scene elements influence model reasoning.
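An intervention of this kind can be sketched as a pure function over a scene dictionary. The field names (`agents`, `signals`, `id`, `state`) are illustrative assumptions, not the dataset's actual schema:

```python
import copy

def remove_agent(scene, agent_id):
    """Return a copy of the scene with one agent removed
    (counterfactual intervention; the original scene is untouched)."""
    edited = copy.deepcopy(scene)
    edited["agents"] = [a for a in edited["agents"] if a["id"] != agent_id]
    return edited

def flip_signal(scene, signal_id):
    """Flip a traffic signal between red and green in a scene copy."""
    edited = copy.deepcopy(scene)
    for sig in edited["signals"]:
        if sig["id"] == signal_id:
            sig["state"] = "green" if sig["state"] == "red" else "red"
    return edited
```

Running the model on the original and the edited scene, then differencing the captured attention maps, isolates the edited element's causal contribution.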
Animated Attention Demonstrations
These animations show how attention evolves across encoder layers and over time for different driving scenarios.
Attention Evolution Across Encoder Layers
Encoder layer progression with cyclist — note the attention deficit on the cyclist token.
Dense traffic: attention distributes broadly across multiple agents.
Pedestrian scenario: how attention evolves through 4 encoder layers.
General driving: progressive attention narrowing through layers (5.64 → 5.36 bits).
Temporal Attention Dynamics
9-second temporal attention dynamics with cyclist interaction.
Temporal attention dynamics in pedestrian crossing scenario.
Spatial Attention Analysis
Spatial attention patterns across diverse driving scenarios. BEV heatmaps show where the model "looks" when predicting trajectories.
Non-monotonic entropy evolution reveals layer specialization: Layers 0-2 narrow focus (5.64→5.36 bits), Layer 3 broadens to map attention (5.92 bits).
Tunnel vision: failed predictions show concentrated attention (entropy 5.72 vs 5.94 bits) with 40% higher self-attention.
Cyclists receive 73% less self-attention than vehicles, leading to 88.1% miss rate — a critical safety blind spot.
Attention head specialization: map-focused heads (93.3%) vs agent-focused heads (58.8%) reveal functional decomposition.
Far-range attention ablation: even mild distance masking (α=0.05) degrades minADE by 4.7%, proving far-range context is essential.
Dynamic scene-type adaptation: 42.3% agent attention in dense traffic vs 18.4% in sparse scenes.
Mode-specific attention: left-turn vs straight trajectory predictions attend to different scene elements.
Counterfactual: removing an agent redistributes 0.012 attention to the next-nearest agent, revealing a learned priority hierarchy.
Lane attention selectivity: top 2 lanes capture >60% of decoder map attention.
Decoder attention refinement: self-attention decreases 23.8% across decoder layers as predictions are finalized.
Key Findings
1. Non-Monotonic Layer Specialization
Entropy evolves non-monotonically: 5.64 → 5.50 → 5.36 → 5.92 bits (Layers 0–3).
Layers 0–2 progressively narrow focus on agents (agent share: 49.7% → 55.1% → 62.4%),
while Layer 3 reverses to broad map attention (63.6% map tokens). This reveals a hierarchical
strategy: agent interaction first, then map-conditioned planning.
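The entropy figures quoted above are Shannon entropies in bits over a layer's attention distribution; a minimal sketch of the computation:

```python
import numpy as np

def attention_entropy_bits(weights, eps=1e-12):
    """Shannon entropy (bits) of an attention distribution over tokens."""
    p = np.asarray(weights, dtype=np.float64)
    p = p / p.sum()  # renormalise in case weights don't sum to exactly 1
    return float(-(p * np.log2(p + eps)).sum())
```

For context, a uniform distribution over the 96 encoder tokens (32 agent + 64 map) gives the ceiling log2(96) ≈ 6.58 bits, so the reported 5.36–5.92 bits reflect moderately concentrated attention.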
2. Tunnel Vision Failure Mode
Failed predictions show lower entropy (5.72 vs 5.94 bits in successes),
with self-attention 40% higher in failures (0.049 vs 0.035) and max single-token
weight 49% higher (0.058 vs 0.039). The model “locks onto” the ego
agent instead of distributing attention — a tunnel vision failure pattern.
3. Far-Range Attention is Essential
Distance masking at α=0.05 causes a +4.7% minADE degradation: even mild far-range suppression hurts accuracy. Far-range agents (30–50 m) still receive 28.6% of attention despite their distance. The distance–attention correlation is r = −0.681 (moderate, not extreme), confirming that far-range context is essential for accurate predictions.
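One plausible form of the masking intervention is sketched below. The exact scheme used in the ablation is not specified here, so the down-weight-and-renormalise formulation, the 30 m radius, and the function name are all our assumptions:

```python
import numpy as np

def distance_mask(weights, distances, alpha=0.05, radius=30.0):
    """Suppress attention to tokens beyond `radius` metres by a factor
    (1 - alpha), then renormalise. alpha=0.05 is the mild setting
    reported above; the masking form itself is an assumption."""
    w = np.asarray(weights, dtype=np.float64).copy()
    far = np.asarray(distances) > radius
    w[far] *= (1.0 - alpha)
    return w / w.sum()
```

Re-running prediction with masked attention and comparing minADE against the unmasked run quantifies how much far-range context the model relies on.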
4. Dynamic Scene-Type Adaptation
Dense traffic: 42.3% agent attention, entropy 6.11 bits.
Sparse scenes: 18.4% agent attention, entropy 5.33 bits.
Highway driving shows longer attention reach (21.4m mean top-5 distance),
while intersections produce shorter, broader attention (17.0m, entropy 6.10 bits).
The model dynamically adapts its attention strategy to scene complexity.
Citation
@article{zhou2026spatial,
  title={Spatial Attention Visualization for Interpretable Trajectory Prediction in Autonomous Driving: Discovering Safety Blind Spots Through Counterfactual Analysis},
  author={Zhou, Xingnan and Alecsandru, Ciprian},
  year={2026},
  note={In Preparation}
}