Summary
Poutine (Rowe et al., 2025) demonstrated that an unmodified
vision-language model (Qwen2.5-VL) can achieve near-human driving performance on the Waymo
End-to-End benchmark through text-encoded trajectories and reinforcement learning.
Building on this approach, CTL-Drive implements VLT pre-training (Stage 1a + 1b) on a
single RTX 4090 through a complete pipeline — Waymo data extraction,
CoVLA-style Gemini annotation, and QLoRA fine-tuning of Qwen3-VL-4B — achieving
ADE 1.28m (3s) / 2.99m (5s) and RFS 7.70, closely approaching the #1 entry (NTR).
The result ranks #15 on the Waymo E2E Driving Challenge, trained on a single consumer GPU
with no reinforcement learning; GRPO is still to come.
Waymo E2E Driving Challenge
Ranked #15 — Single GPU, No RL, Closing in on #1
Trained on a single RTX 4090 with QLoRA, our model achieves
ADE 1.28m / 2.99m (3s/5s) and
RFS 7.70 on the
Waymo E2E Driving Challenge.
The #1 entry (NTR) scores 1.17m / 2.63m —
we are within 0.11m ADE at 3s without any reinforcement learning.
Comparison with Leaderboard #1
| Metric |
#1 NTR |
Ours (#15) |
Gap |
| ADE @ 3s |
1.17m |
1.28m |
0.11m |
| ADE @ 5s |
2.63m |
2.99m |
0.36m |
| RFS |
8.05 |
7.70 |
0.35 |
| RL Stage |
— |
None |
|
| Training GPU |
Multi-GPU cluster |
Single RTX 4090 |
|
Competitive Landscape
Leaderboard standings after de-duplicating by team (each team's best submission only).
| Method | RFS | ADE 3s | ADE 5s |
|---|---|---|---|
| NTR | 8.05 | 1.17m | 2.63m |
| RAP | 8.04 | 1.17m | — |
| Poutine | 7.99 | 1.21m | — |
| … 8 entries … | | | |
| CTL-Drive (ours) | 7.70 | 1.28m | 2.99m |
Key takeaway: CTL-Drive reaches #15 on the official leaderboard
using a single consumer GPU and no reinforcement learning.
GRPO RL on our Google TRC allocation is the next step to close the remaining gap.
What's Already Done
The #15 submission was trained entirely on a single RTX 4090 with QLoRA:
Stage 1a CoVLA pre-training (228K frames) followed by Stage 1b WOD-E2E fine-tuning (90K samples,
front-camera only, turn-balanced sampling). The model uses intent conditioning and a turn-aware
fallback mechanism. No reinforcement learning was applied.
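Turn-balanced sampling re-weights frames so that turning scenarios are not swamped by straight driving. A sketch with inverse-frequency weights (the category proportions here are illustrative, not our dataset's actual distribution):

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical frame pool: straight driving dominates the raw data.
frames = ["straight"] * 900 + ["mild_turn"] * 80 + ["sharp_turn"] * 20

# Inverse-frequency weights so each category is drawn roughly equally.
counts = Counter(frames)
weights = [1.0 / counts[f] for f in frames]

sample = random.choices(frames, weights=weights, k=3000)
print(Counter(sample))
```

Without re-weighting, sharp turns would make up ~2% of training batches; with it, each category contributes about a third.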
Meanwhile, we are scaling annotation to 795K frames via
Google TPU Research Cloud (TRC).
- 795K total frames being annotated
- TPU v4-32 pods via Google TRC
- 8× parallel VLM annotation workers
- Additional TPU v5e · v6e pods for training
What's Coming Next
With TPU cluster access secured, three major improvements are actively in progress:
- Full-Scale Annotation (In Progress) — Expanding CoVLA-style teacher annotation
to the complete 795K-frame dataset using 8 parallel Qwen 2.5-VL instances on Google TPU v4-32 pods.
WOD-E2E annotation is 99% complete; CoVLA annotation is on track to finish within hours.
- Stage 1b Supervised Fine-Tuning — Re-train on the full 795K annotated
dataset with extended fine-tuning, targeting improved performance on under-represented scenarios
(sharp turns, complex intersections).
- GRPO Reinforcement Learning (Stage 2) — Apply Group Relative Policy
Optimization with trajectory-quality rewards on TPU v5e/v6e pods. This is the stage where
Poutine demonstrated near-human performance (RFS 7.99 vs. 8.13 for a human expert).
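The group-relative update at the heart of GRPO can be sketched in a few lines: each scene is rolled out several times, and every sampled trajectory's reward is normalized against its own group's statistics instead of a learned critic. The reward values below are placeholders, not our actual trajectory-quality reward:

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std,
    as in Group Relative Policy Optimization (no learned critic)."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: 4 sampled trajectories for one driving scene,
# scored by a hypothetical trajectory-quality reward.
rewards = [0.9, 0.4, 0.7, 0.4]
advs = grpo_advantages(rewards)
print([round(a, 3) for a in advs])
```

Trajectories that beat their group's mean get a positive advantage and are reinforced; the advantages of each group sum to zero by construction.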
Why this matters: Reaching #15 with only Stage 1 pre-training on a single consumer GPU
establishes a strong baseline. With Google TRC providing TPU v4/v5e/v6e cluster access, we now have
the compute to execute the full Poutine pipeline — the current ranking represents a floor, not a ceiling.
What is Poutine?
Poutine is a vision-language model (VLM) approach to end-to-end autonomous driving developed
at Mila and Université de Montréal. Instead of designing complex perception-prediction-planning
pipelines, it fine-tunes an off-the-shelf VLM (Qwen2.5-VL-3B) to directly output future
trajectory waypoints as text tokens from camera images.
Near-Human Driving from a Language Model
On the Waymo End-to-End Challenge, Poutine achieved a Route Following Score of
7.99 — compared to 8.13 for a human expert driver.
The key insight: text-encoded trajectories combined with large-scale pre-training and
GRPO reinforcement learning can outperform complex multi-module architectures.
Training Pipeline
CTL-Drive follows a multi-stage VLM training pipeline, adapted for single-GPU training.
VLT pre-training is complete; GRPO RL is the next milestone.
1. 📦 Data Extraction: 415K WOD-E2E + CoVLA frames (Done)
2. 🏷 Teacher Annotation: Gemini 3 Flash (Done)
3. 🧠 VLT Pre-training: 16 iterations (Done)
4. 🎯 GRPO RL: needs multi-GPU (Next)
Training Configuration
- Model: Qwen3-VL-4B Instruct (4.57B parameters, DeepStack ViT fusion, Interleaved-MRoPE, 262K context window)
- LoRA configuration: QLoRA 4-bit NF4, rank 128, alpha 256, 132M trainable params (2.89%)
- Training data: WOD-E2E (20K annotated frames) + CoVLA (379K driving frames), 50/50 mixed training ratio
- Hardware: 1× RTX 4090 (24 GB), ~10 GB VRAM usage (QLoRA), ~10h training time per iteration
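A configuration like the one above could be expressed with Hugging Face `peft` and `bitsandbytes`; this is a sketch, not our actual training script, and the `target_modules` list in particular is an assumption:

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization for the frozen base model (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

# LoRA adapters: rank 128, alpha 256 (~132M trainable params, 2.89%).
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```

With the base weights quantized to 4 bits, only the rank-128 adapters are trained, which is what keeps peak VRAM near 10 GB on a 24 GB card.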
Chain-of-Thought Output Format
Following Poutine, the model produces structured chain-of-thought reasoning before outputting
trajectory waypoints. This grounds the driving decision in explicit perception and planning.
[System] You are a driving assistant. Analyze the scene and predict the ego trajectory.
[User] <image> <image> <image>
Current speed: 8.2 m/s. Predict the next 5 seconds of ego trajectory.
[Assistant]
// Perception
The ego vehicle is on a two-lane urban road approaching a signalized intersection.
Detected: 2 passenger vehicles ahead, 1 pedestrian on crosswalk, traffic light is green.
// Prediction & Planning
Lead vehicle decelerating. Maintain lane, gradual deceleration to 5 m/s.
Command: LANE_FOLLOW
// Trajectory (5s @ 2Hz = 10 waypoints)
[[0.8, 0.0], [1.5, 0.0], [2.1, -0.1], [2.6, -0.1], [3.0, -0.1],
[3.3, -0.1], [3.5, -0.1], [3.7, -0.1], [3.8, -0.1], [3.9, -0.1]]
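Because waypoints are emitted as plain text, evaluation has to parse them back out of the assistant's reply. A minimal parser for the format above (our production parser is more defensive):

```python
import ast
import re

def parse_trajectory(reply: str) -> list[tuple[float, float]]:
    """Extract the final [[x, y], ...] waypoint list from a
    chain-of-thought reply like the example above."""
    match = re.search(r"\[\s*\[.*\]\s*\]", reply, re.DOTALL)
    if match is None:
        raise ValueError("no waypoint list found in model output")
    return [tuple(p) for p in ast.literal_eval(match.group(0))]

reply = """// Trajectory (5s @ 2Hz = 10 waypoints)
[[0.8, 0.0], [1.5, 0.0], [2.1, -0.1], [2.6, -0.1], [3.0, -0.1],
 [3.3, -0.1], [3.5, -0.1], [3.7, -0.1], [3.8, -0.1], [3.9, -0.1]]"""
waypoints = parse_trajectory(reply)
print(len(waypoints), waypoints[0])
```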
Results
We evaluate our best model (v16, trained with 50% CoVLA + 20K WOD-E2E) against the base
Qwen3-VL-4B model on held-out WOD-E2E validation frames, stratified by turning behavior.
Average Displacement Error (ADE) measures the mean L2 distance between predicted and
ground-truth waypoints over the 5-second horizon.
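Concretely, for predicted and ground-truth trajectories sampled at the same timestamps, ADE is the mean Euclidean error over paired waypoints. A straightforward reference implementation (the example waypoints are illustrative):

```python
import math

def ade(pred, gt):
    """Average Displacement Error: mean L2 distance between
    predicted and ground-truth waypoints, paired by timestep."""
    assert len(pred) == len(gt)
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

pred = [(0.8, 0.0), (1.5, 0.1), (2.1, -0.2)]
gt   = [(1.0, 0.0), (1.5, 0.0), (2.0, -0.1)]
print(round(ade(pred, gt), 3))
```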
ADE by Scenario Type
| Scenario Type | Base VLM ADE (m) | Our Model ADE (m) | Improvement |
|---|---|---|---|
| Straight | 4.24 | 2.91 | −31.4% |
| Mild Turn | 5.07 | 3.33 | −34.4% |
| Sharp Turn | 7.81 | 7.63 | −2.3% |
| Overall | 4.87 | 3.99 | −18.0% |
Trajectory Predictions
Sample trajectory overlays from the validation set. Green: ground truth. Red: model prediction.
The model produces smooth, directionally accurate trajectories for straight and mild-turn
scenarios, while sharp turns remain challenging.
Highway straight — strong tracking
Left turn — captures intent
Highway curve — smooth curvature
Urban driving — lane following
Key insight: On the Waymo E2E Challenge, our model achieves ADE 1.28m (3s) and 2.99m (5s),
within 0.11m of the #1 entry (NTR) at the 3-second horizon — without any reinforcement learning.
GRPO RL is expected to close the remaining gap and push the ranking higher.
Intersection-Type Analysis
Using our WayGraph toolkit, we
classify 56,797 Waymo scenarios by intersection topology (T-junction, 4-way cross,
multi-leg, roundabout, etc.). This enables the first intersection-type-stratified evaluation
of E2E driving models — revealing where models fail that aggregate metrics hide.
Research direction: By cross-referencing WayGraph intersection labels with
model performance, we can identify systematic failure modes (e.g., poor left-turn prediction
at unsignalized T-junctions) that inform targeted training data collection and reward design.
This intersection-type-stratified evaluation is a novel contribution to the E2E driving
evaluation literature.
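The stratified evaluation itself is simple once each scenario carries a topology label: group per-scenario errors by label and aggregate. A sketch with hypothetical labels and ADE values (the real labels come from the WayGraph toolkit):

```python
from collections import defaultdict
from statistics import fmean

# (intersection_type, per-scenario ADE in meters) — hypothetical values.
results = [
    ("t_junction", 4.1), ("t_junction", 5.3),
    ("four_way", 3.2), ("four_way", 3.0),
    ("roundabout", 6.8),
]

# Group errors by intersection topology, then take the mean per group.
by_type = defaultdict(list)
for topo, err in results:
    by_type[topo].append(err)

stratified = {topo: fmean(errs) for topo, errs in by_type.items()}
for topo, mean_ade in sorted(stratified.items(), key=lambda kv: -kv[1]):
    print(f"{topo:12s} ADE {mean_ade:.2f} m")
```

Sorting by per-type ADE surfaces exactly the failure modes that a single aggregate number hides.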
Compute Infrastructure
This project spans two compute tiers: initial prototyping on a consumer GPU,
then scaling to Google Cloud TPU pods for production training and annotation.
Google TPU Research Cloud (TRC)
Supported by Google's TPU Research Cloud program,
providing access to Cloud TPU v4, v5e, and v6e pods across US and EU regions.
Currently running 8 parallel Qwen 2.5-VL annotation workers on TPU v4-32 pods,
with v5litepod-64 and v6e-64 pods allocated for full fine-tuning and GRPO RL.
Active TPU Fleet
288 TPU chips across 6 VMs, spanning three generations of Google's custom AI accelerators.
Combined: ~7 TB HBM memory and ~152 PFLOPS bf16 compute —
roughly 920× the compute of the RTX 4090 used for the initial #15 submission.
| TPU VM | Chips | HBM | bf16 Compute | Role |
|---|---|---|---|---|
| 2× TPU v4-32 | 32 | 1,024 GB | 8.8 PFLOPS | Annotation (8× VLM workers) |
| 2× v5litepod-64 | 128 | 2,048 GB | 25.2 PFLOPS | Training (EU-west4 + US-central1) |
| 2× v6e-64 | 128 | 4,096 GB | 117.6 PFLOPS | GRPO RL (US-east1) |
| Total | 288 | ~7 TB | ~152 PFLOPS | |
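The totals can be cross-checked from the per-VM rows; the ~920× figure additionally assumes roughly 165 TFLOPS of dense bf16 compute for the RTX 4090:

```python
# Per-VM rows from the fleet table: (name, chips, HBM in GB, bf16 PFLOPS).
fleet = [
    ("2x TPU v4-32",     32, 1024,   8.8),
    ("2x v5litepod-64", 128, 2048,  25.2),
    ("2x v6e-64",       128, 4096, 117.6),
]

chips = sum(row[1] for row in fleet)
hbm_gb = sum(row[2] for row in fleet)
pflops = sum(row[3] for row in fleet)

rtx4090_tflops = 165  # dense bf16, assumed for the comparison
print(chips, f"{hbm_gb / 1024:.0f} TB", f"{pflops:.1f} PFLOPS",
      f"{1000 * pflops / rtx4090_tflops:.0f}x")
```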
- ~920× more compute vs. the RTX 4090
- ~291× more memory vs. the RTX 4090
- 3 regions: US-central, US-east, EU-west4
- 3 generations: TPU v4, v5e, v6e (Trillium)
Scaling path: The #15 ranking was achieved on a single RTX 4090 (24 GB, QLoRA).
With Google TRC providing ~152 PFLOPS across 288 TPU chips, we now have the compute to execute
full-parameter fine-tuning and GRPO reinforcement learning at the scale described in the original
Poutine paper.