Spatial Attention Visualization for Interpretable Trajectory Prediction in Autonomous Driving: Discovering Safety Blind Spots Through Counterfactual Analysis

Xingnan Zhou and Ciprian Alecsandru
Concordia University, Montreal, QC, Canada
In Preparation, 2026

Abstract

While Transformer-based models have achieved state-of-the-art prediction performance, their internal attention mechanisms remain opaque. We present a spatial attention visualization framework that maps abstract Transformer attention weights onto bird’s-eye-view (BEV) traffic scenes, providing the first spatially grounded interpretation of attention in trajectory prediction. Built upon MTR-Lite (8.48M parameters) trained on the Waymo Open Motion Dataset, our framework employs a novel spatial token bookkeeping mechanism. We discover that cyclists receive up to 73% less self-attention than vehicles at equivalent distances — a safety blind spot. We further introduce counterfactual attention analysis to isolate the causal effect of individual scene elements on model attention.
Key results at a glance:
- 73% cyclist attention deficit
- 4.7% minADE degradation from distance masking
- 8.48M MTR-Lite parameters
- 17.8K Waymo scenes
- 88.1% cyclist miss rate (vs. 54.0% for vehicles)
- 2.314 m minADE@6 (54.4% better than the constant-velocity (CV) baseline)

Method

Framework Overview

Our visualization framework consists of three components built on top of an MTR-Lite Transformer trained on the Waymo Open Motion Dataset (20% subset, ~17,800 scenes):

MTR-Lite Architecture

A lightweight Motion Transformer variant with 4 encoder layers (global self-attention over 32 agent + 64 map tokens), 4 decoder layers (agent cross-attention + map cross-attention), and 64 intention queries refined into K=6 output modes via NMS. Trained for 60 epochs on Waymo with AdamW, cosine annealing, and mixed-precision training.
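The hyperparameters above can be collected into a small configuration sketch. This is an illustrative layout, not the authors' code; `MTR_LITE_CONFIG` and `scene_token_count` are hypothetical names:

```python
# Hypothetical configuration sketch for the MTR-Lite variant described above.
# Values come from the text; the dict layout itself is an assumption.
MTR_LITE_CONFIG = {
    "encoder_layers": 4,       # global self-attention over all scene tokens
    "decoder_layers": 4,       # agent cross-attention + map cross-attention
    "agent_tokens": 32,
    "map_tokens": 64,
    "intention_queries": 64,   # refined into K=6 output modes via NMS
    "output_modes": 6,
    "epochs": 60,
    "optimizer": "AdamW",
    "lr_schedule": "cosine_annealing",
    "precision": "mixed",
}

def scene_token_count(cfg):
    """Total number of tokens the encoder self-attends over per scene."""
    return cfg["agent_tokens"] + cfg["map_tokens"]
```

With 32 agent and 64 map tokens, each encoder layer attends over 96 scene tokens, which is the token budget the attention-entropy numbers later in this page are measured against.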

Counterfactual Experiments

By directly editing scene dictionaries (removing agents, flipping traffic signals, injecting pedestrians at varying distances), we perform the first counterfactual attention analysis for trajectory prediction — enabling causal (not just correlational) claims about how scene elements influence model reasoning.
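A minimal sketch of what such scene-dictionary edits might look like, assuming a simple schema with `"agents"` and `"signals"` lists (the field names and schema are our assumptions, not the paper's actual data format):

```python
import copy

def remove_agent(scene, agent_id):
    """Counterfactual edit: return a copy of the scene with one agent deleted."""
    edited = copy.deepcopy(scene)  # never mutate the original scene
    edited["agents"] = [a for a in edited["agents"] if a["id"] != agent_id]
    return edited

def flip_signal(scene, signal_id, new_state):
    """Counterfactual edit: return a copy with one traffic signal's state changed."""
    edited = copy.deepcopy(scene)
    for sig in edited["signals"]:
        if sig["id"] == signal_id:
            sig["state"] = new_state
    return edited
```

The causal effect on attention is then the difference between attention maps computed on the original and edited scenes, holding everything else fixed.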

Animated Attention Demonstrations

These animations show how attention evolves across encoder layers and over time for different driving scenarios.

Attention Evolution Across Encoder Layers

[Figure] Encoder layer progression with cyclist — note the attention deficit on the cyclist token.
[Figure] Dense traffic: attention distributes broadly across multiple agents.
[Figure] Pedestrian scenario: how attention evolves through the 4 encoder layers.
[Figure] General driving: progressive attention narrowing through layers (5.64 → 5.36 bits).

Temporal Attention Dynamics

[Figure] 9-second temporal attention dynamics with cyclist interaction.
[Figure] Temporal attention dynamics in a pedestrian-crossing scenario.

Spatial Attention Analysis

[Figure] Spatial attention patterns across diverse driving scenarios: BEV heatmaps show where the model "looks" when predicting trajectories.
[Figure] Non-monotonic entropy evolution reveals layer specialization: Layers 0–2 narrow focus (5.64 → 5.36 bits); Layer 3 broadens to map attention (5.92 bits).
[Figure] Tunnel vision: failed predictions show concentrated attention (entropy 5.72 vs. 5.94 bits) with 40% higher self-attention.
[Figure] Cyclist attention deficit: cyclists receive 73% less self-attention than vehicles, leading to an 88.1% miss rate — a critical safety blind spot.
[Figure] Attention head specialization: map-focused heads (93.3%) vs. agent-focused heads (58.8%) reveal a functional decomposition.
[Figure] Far-range attention ablation: even mild distance masking (α = 0.05) degrades minADE by 4.7%, indicating that far-range context is essential.
[Figure] Dynamic scene-type adaptation: 42.3% agent attention in dense traffic vs. 18.4% in sparse scenes.
[Figure] Mode-specific attention: left-turn vs. straight trajectory predictions attend to different scene elements.
[Figure] Counterfactual analysis: removing an agent redistributes 0.012 attention to the next-nearest agent, revealing a learned priority hierarchy.
[Figure] Lane attention selectivity: the top 2 lanes capture >60% of decoder map attention.
[Figure] Decoder attention refinement: self-attention decreases 23.8% across decoder layers as predictions are finalized.

Key Findings

1. Non-Monotonic Layer Specialization

Entropy evolves non-monotonically: 5.64 → 5.50 → 5.36 → 5.92 bits (Layers 0–3). Layers 0–2 progressively narrow focus on agents (agent share: 49.7% → 55.1% → 62.4%), while Layer 3 reverses to broad map attention (63.6% map tokens). This reveals a hierarchical strategy: agent interaction first, then map-conditioned planning.
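The per-layer entropy and agent-share statistics above can be computed directly from a row of attention weights. A minimal NumPy sketch (the function names are ours):

```python
import numpy as np

def attention_entropy_bits(weights):
    """Shannon entropy of an attention distribution, in bits."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # renormalize defensively
    nz = w[w > 0]                        # treat 0 * log(0) as 0
    return float(-(nz * np.log2(nz)).sum())

def agent_attention_share(weights, is_agent):
    """Fraction of total attention mass landing on agent tokens (vs. map tokens)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(w[np.asarray(is_agent, dtype=bool)].sum())
```

For scale: uniform attention over the 96 scene tokens gives log2(96) ≈ 6.58 bits, so the observed 5.36–5.92 bits reflect genuinely concentrated attention.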

2. Tunnel Vision Failure Mode

Failed predictions show lower entropy (5.72 vs 5.94 bits in successes), with self-attention 40% higher in failures (0.049 vs 0.035) and max single-token weight 49% higher (0.058 vs 0.039). The model “locks onto” the ego agent instead of distributing attention — a tunnel vision failure pattern.
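The two failure-mode statistics quoted above (ego self-attention weight and max single-token weight) can be read off one attention row. A sketch, assuming access to the normalized attention row for the ego query (the function name is ours):

```python
import numpy as np

def tunnel_vision_diagnostics(attn_row, ego_index):
    """Return (self-attention weight, max single-token weight) for one query row.

    attn_row: attention weights from the ego query over all scene tokens.
    Elevated values of both, alongside low entropy, match the failure
    signature reported above (0.049/0.058 in failures vs. 0.035/0.039
    in successes).
    """
    w = np.asarray(attn_row, dtype=float)
    w = w / w.sum()
    return float(w[ego_index]), float(w.max())
```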

3. Far-Range Attention is Essential

Distance masking at α = 0.05 causes a +4.7% minADE degradation; even mild far-range suppression hurts accuracy. Far-range agents (30–50 m) receive 28.6% of attention despite being numerous. The distance–attention correlation is moderate rather than extreme (r = −0.681), indicating that far-range context is essential for accurate predictions.
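One plausible form of the masking ablation is to scale each token's weight by an exponential decay in distance with strength α before renormalizing. The paper does not spell out its exact masking function, so this sketch is an assumption:

```python
import numpy as np

def mask_far_attention(attn, distances, alpha):
    """Suppress far-range attention and renormalize.

    Each token's weight is scaled by exp(-alpha * distance); the exact
    masking form used in the paper's ablation is not specified here,
    so exponential decay is an illustrative choice.
    """
    w = np.asarray(attn, dtype=float)
    d = np.asarray(distances, dtype=float)
    w = w * np.exp(-alpha * d)
    return w / w.sum()
```

At α = 0.05, a token at 45 m keeps only exp(-2.25) ≈ 0.11 of its raw weight, so after renormalization attention mass shifts markedly toward near tokens.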

4. Dynamic Scene-Type Adaptation

Dense traffic: 42.3% agent attention, entropy 6.11 bits. Sparse scenes: 18.4% agent attention, entropy 5.33 bits. Highway driving shows longer attention reach (21.4m mean top-5 distance), while intersections produce shorter, broader attention (17.0m, entropy 6.10 bits). The model dynamically adapts its attention strategy to scene complexity.
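The "attention reach" statistic above (21.4 m on highways vs. 17.0 m at intersections) is a mean top-5 attention distance. A minimal sketch, using our own naming:

```python
import numpy as np

def mean_topk_attention_distance(attn, distances, k=5):
    """Mean distance of the k most-attended tokens ("attention reach")."""
    a = np.asarray(attn, dtype=float)
    d = np.asarray(distances, dtype=float)
    idx = np.argsort(a)[-k:]             # indices of the k largest weights
    return float(d[idx].mean())
```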

Citation

@article{zhou2026spatial,
  title={Spatial Attention Visualization for Interpretable Trajectory Prediction in Autonomous Driving: Discovering Safety Blind Spots Through Counterfactual Analysis},
  author={Zhou, Xingnan and Alecsandru, Ciprian},
  year={2026},
  note={In Preparation}
}