Urban driving scene with trajectory overlay

CTL-Drive: VLM-Based End-to-End Driving on a Single GPU

Xingnan Zhou, Ciprian Alecsandru
Concordia University, Montreal · 2026
#15
Waymo E2E Challenge
795K
Frames Being Annotated
6 TPU Pods
Google TRC
Stage 1
RL Coming Next

Summary

Poutine (Rowe et al., 2025) demonstrated that an unmodified vision-language model (Qwen2.5-VL) can achieve near-human driving performance on the Waymo End-to-End benchmark through text-encoded trajectories and reinforcement learning. Building on this approach, CTL-Drive implements VLT pre-training (Stage 1a + 1b) on a single RTX 4090, with a complete pipeline covering Waymo data extraction, CoVLA-style Gemini annotation, and QLoRA fine-tuning of Qwen3-VL-4B. The resulting model achieves ADE 1.28m (3s) / 2.99m (5s) and RFS 7.70, closely approaching the #1 entry (NTR), and ranks #15 on the Waymo E2E Driving Challenge: trained on a single consumer GPU, no reinforcement learning, GRPO still to come.

#15
Waymo E2E Challenge
single RTX 4090
1.28m
ADE @ 3s
(#1 NTR: 1.17m)
2.99m
ADE @ 5s
(#1 NTR: 2.63m)
7.70
RFS
(#1 NTR: 8.05)

Waymo E2E Driving Challenge

Ranked #15 — Single GPU, No RL, Closing in on #1

Trained on a single RTX 4090 with QLoRA, our model achieves ADE 1.28m / 2.99m (3s/5s) and RFS 7.70 on the Waymo E2E Driving Challenge. The #1 entry (NTR) scores 1.17m / 2.63m — we are within 0.11m ADE at 3s without any reinforcement learning.

7.70
RFS (#1: 8.05)

Comparison with Leaderboard #1

Metric       | #1 NTR            | Ours (#15)      | Gap
ADE @ 3s     | 1.17m             | 1.28m           | 0.11m
ADE @ 5s     | 2.63m             | 2.99m           | 0.36m
RFS          | 8.05              | 7.70            | 0.35
RL Stage     | —                 | None            | —
Training GPU | Multi-GPU cluster | Single RTX 4090 | —

Competitive Landscape

Leaderboard standings after de-duplicating by team (each team's best submission only).

Method           | RFS  | ADE 3s | ADE 5s
NTR              | 8.05 | 1.17m  | 2.63m
RAP              | 8.04 | 1.17m  | —
Poutine          | 7.99 | 1.21m  | —
… 8 entries …
CTL-Drive (ours) | 7.70 | 1.28m  | 2.99m
Key takeaway: CTL-Drive reaches #15 on the official leaderboard using a single consumer GPU and no reinforcement learning. GRPO RL on our Google TRC allocation is the next step to close the remaining gap.

What's Already Done

The #15 submission was trained entirely on a single RTX 4090 with QLoRA: Stage 1a CoVLA pre-training (228K frames) followed by Stage 1b WOD-E2E fine-tuning (90K samples, front-camera only, turn-balanced sampling). The model uses intent conditioning and a turn-aware fallback mechanism. No reinforcement learning was applied. Meanwhile, we are scaling annotation to 795K frames via Google TPU Research Cloud (TRC).
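The turn-balanced sampling used in Stage 1b can be sketched as inverse-frequency weighted draws over turn-labeled frames. This is an illustrative sketch, not the exact implementation; the label names and weighting scheme are assumptions.

```python
import random
from collections import Counter

# Illustrative turn-balanced sampler: frames labeled by turning behavior
# are drawn with inverse-frequency weights, so rare sharp-turn frames are
# not drowned out by the straight-driving majority.

def balanced_sample(frames, labels, k, seed=0):
    """Draw k frames with probability inversely proportional to the
    frequency of each frame's turn label."""
    counts = Counter(labels)
    weights = [1.0 / counts[label] for label in labels]
    rng = random.Random(seed)
    return rng.choices(frames, weights=weights, k=k)
```

With two labels, each class ends up with roughly equal expected share of the sampled batch regardless of its base rate.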

795K
Total frames
being annotated
v4-32
TPU pods
via Google TRC
Parallel VLM
annotation workers
v5e · v6e
Additional TPU
for training

What's Coming Next

With TPU cluster access secured, three major improvements are actively in progress:

  1. Full-Scale Annotation (In Progress) — Expanding CoVLA-style teacher annotation to the complete 795K-frame dataset using 8 parallel Qwen 2.5-VL instances on Google TPU v4-32 pods. WOD-E2E annotation is 99% complete; CoVLA annotation on track to finish within hours.
  2. Stage 1b — Supervised Fine-Tuning — Re-train on the full 795K annotated dataset with extended fine-tuning, targeting improved performance on under-represented scenarios (sharp turns, complex intersections).
  3. GRPO Reinforcement Learning (Stage 2) — Apply Group Relative Policy Optimization with trajectory-quality rewards on TPU v5e/v6e pods. This is the key stage where Poutine demonstrated near-human performance with GRPO (RFS 7.99 vs 8.13 human).
Why this matters: Reaching #15 with only Stage 1 pre-training on a single consumer GPU establishes a strong baseline. With Google TRC providing TPU v4/v5e/v6e cluster access, we now have the compute to execute the full Poutine pipeline — the current ranking represents a floor, not a ceiling.
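The trajectory-quality reward for Stage 2 is still being designed, but the core of GRPO is fixed: sample a group of rollouts per scene, score each, and normalize rewards within the group instead of learning a value baseline. A minimal sketch of that group-relative advantage computation:

```python
# Core of GRPO (Group Relative Policy Optimization): advantages are
# computed relative to a group of G rollouts from the same prompt,
# replacing a learned value function. The reward choice is a placeholder.

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group's rewards to zero mean and unit std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. rewards could be the negative ADE of each sampled trajectory
advantages = group_relative_advantages([-1.2, -0.8, -2.1, -0.9])
```

Each advantage then weights the policy-gradient update for its trajectory's tokens.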

What is Poutine?

Poutine is a vision-language model (VLM) approach to end-to-end autonomous driving developed at Mila and Université de Montréal. Instead of designing complex perception-prediction-planning pipelines, it fine-tunes an off-the-shelf VLM (Qwen2.5-VL-3B) to directly output future trajectory waypoints as text tokens from camera images.

Near-Human Driving from a Language Model

On the Waymo End-to-End Challenge, Poutine achieved a Route Following Score of 7.99 — compared to 8.13 for a human expert driver. The key insight: text-encoded trajectories combined with large-scale pre-training and GRPO reinforcement learning can outperform complex multi-module architectures.
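The "text-encoded trajectories" idea is simple enough to sketch: waypoints are serialized into plain text so the VLM emits them as ordinary tokens, and the text is parsed back into floats at inference time. The exact format and precision below are illustrative.

```python
import ast

# Sketch of trajectory-as-text encoding: (x, y) waypoints become a plain
# text list the model can generate token by token. Precision is a choice.

def encode_trajectory(waypoints, precision=1):
    """Serialize (x, y) waypoints into a compact text sequence."""
    return "[" + ", ".join(
        f"[{x:.{precision}f}, {y:.{precision}f}]" for x, y in waypoints
    ) + "]"

def decode_trajectory(text):
    """Parse the model's text output back into float waypoint tuples."""
    return [tuple(p) for p in ast.literal_eval(text)]
```

The round trip is lossless up to the chosen precision, which is what lets an unmodified language model head act as a trajectory decoder.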

7.99
RFS (human: 8.13)

Training Pipeline

CTL-Drive follows a multi-stage VLM training pipeline, adapted for single-GPU training. VLT pre-training is complete; GRPO RL is the next milestone.

Done
📦
Data Extraction
415K WOD-E2E + CoVLA
Done
🏷
Teacher Annotation
Gemini 3 Flash
Done
🧠
VLT Pre-training
16 iterations
Next
🎯
GRPO RL
Needs multi-GPU

Training Configuration

Model

Qwen3-VL-4B Instruct

4.57B parameters, DeepStack ViT fusion

Interleaved-MRoPE, 262K context window

LoRA Configuration

Method: QLoRA 4-bit NF4

Rank: 128 · Alpha: 256

Trainable: 132M params (2.89%)

Training Data

WOD-E2E: 20K annotated frames

CoVLA: 379K driving frames

50/50 mixed training ratio

Hardware

GPU: 1× RTX 4090 (24 GB)

VRAM usage: ~10 GB (QLoRA)

Training time: ~10h per iteration
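As a sanity check on the configuration above: a LoRA adapter on a d_in × d_out weight adds rank × (d_in + d_out) trainable parameters, and the reported 132M trainable parameters are indeed ≈2.89% of 4.57B. The matrix dimensions below are illustrative; actual Qwen3-VL-4B layer shapes vary.

```python
# Sanity-check the QLoRA numbers above.

def lora_params(d_in, d_out, rank):
    """A LoRA adapter adds two matrices: A (d_in x r) and B (r x d_out)."""
    return rank * (d_in + d_out)

per_matrix = lora_params(4096, 4096, rank=128)  # 1,048,576 params per adapted weight
pct_trainable = 100 * 132e6 / 4.57e9            # ≈ 2.89%
```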

Chain-of-Thought Output Format

Following Poutine, the model produces structured chain-of-thought reasoning before outputting trajectory waypoints. This grounds the driving decision in explicit perception and planning.

[System] You are a driving assistant. Analyze the scene and predict the ego trajectory.

[User] <image> <image> <image>
Current speed: 8.2 m/s. Predict the next 5 seconds of ego trajectory.

[Assistant]
// Perception
The ego vehicle is on a two-lane urban road approaching a signalized intersection.
Detected: 2 passenger vehicles ahead, 1 pedestrian on crosswalk, traffic light is green.

// Prediction & Planning
Lead vehicle decelerating. Maintain lane, gradual deceleration to 5 m/s.
Command: LANE_FOLLOW

// Trajectory (5s @ 2Hz = 10 waypoints)
[[0.8, 0.0], [1.5, 0.0], [2.1, -0.1], [2.6, -0.1], [3.0, -0.1],
 [3.3, -0.1], [3.5, -0.1], [3.7, -0.1], [3.8, -0.1], [3.9, -0.1]]
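A response in this format can be parsed in a few lines. The section markers and `Command:` field follow the template above; the parsing logic itself is just a sketch, not the production parser.

```python
import ast
import re

# Illustrative parser for the structured chain-of-thought output: extract
# the driving command and the waypoint list from the "// Trajectory" section.

def parse_response(text):
    """Return (command, waypoints) from a model response in the CoT format."""
    command = re.search(r"Command:\s*(\w+)", text).group(1)
    traj_block = text.split("// Trajectory", 1)[1]
    waypoints = ast.literal_eval(
        re.search(r"\[\[.*\]\]", traj_block, re.S).group(0)
    )
    return command, waypoints
```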

Results

We evaluate our best model (v16, trained with 50% CoVLA + 20K WOD-E2E) against the base Qwen3-VL-4B model on held-out WOD-E2E validation frames, stratified by turning behavior. Average Displacement Error (ADE) measures the mean L2 distance between predicted and ground-truth waypoints over the 5-second horizon.
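For reference, ADE as used here, under the convention that predicted and ground-truth waypoints are index-matched at the same timestamps:

```python
import math

def ade(pred, gt):
    """Average Displacement Error: mean L2 distance between
    index-matched predicted and ground-truth (x, y) waypoints."""
    assert len(pred) == len(gt) and pred
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)
```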

ADE by Scenario Type

Scenario Type | Base VLM ADE (m) | Our Model ADE (m) | Improvement
Straight      | 4.24             | 2.91              | −31.4%
Mild Turn     | 5.07             | 3.33              | −34.4%
Sharp Turn    | 7.81             | 7.63              | −2.3%
Overall       | 4.87             | 3.99              | −18.0%

Trajectory Predictions

Sample trajectory overlays from the validation set. Green: ground truth. Red: model prediction. The model produces smooth, directionally accurate trajectories for straight and mild-turn scenarios, while sharp turns remain challenging.

Highway straight trajectory
Highway straight — strong tracking
Left turn trajectory
Left turn — captures intent
Highway curve trajectory
Highway curve — smooth curvature
Urban driving trajectory
Urban driving — lane following
Key insight: On the Waymo E2E Challenge, our model achieves ADE 1.28m (3s) and 2.99m (5s), within 0.11m of the #1 entry (NTR) at the 3-second horizon — without any reinforcement learning. GRPO RL is expected to close the remaining gap and push the ranking higher.

Intersection-Type Analysis

Using our WayGraph toolkit, we classify 56,797 Waymo scenarios by intersection topology (T-junction, 4-way cross, multi-leg, roundabout, etc.). This enables the first intersection-type-stratified evaluation of E2E driving models — revealing where models fail that aggregate metrics hide.
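WayGraph's actual classification rules are richer than this, but the core idea of typing a junction from lane-graph topology can be sketched by counting the road legs meeting at a node. This is a hypothetical simplification, not the toolkit's implementation.

```python
# Hypothetical sketch of topology-based intersection typing: assign a
# coarse type from the number of road legs meeting at a junction node.
# Real rules also consider geometry, signals, and lane connectivity.

def classify_intersection(num_legs):
    """Map a leg count to a coarse intersection type."""
    if num_legs <= 2:
        return "no_intersection"   # straight road or simple curve
    if num_legs == 3:
        return "t_junction"
    if num_legs == 4:
        return "four_way_cross"
    return "multi_leg"
```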

Intersection Type Distribution in WOMD
Distribution of intersection types across 56,797 WOMD scenarios, classified by WayGraph lane graph topology analysis. The dataset is dominated by "no intersection" (straight road) scenarios, with T-junctions and 4-way crossings as the most common intersection types.
Research direction: By cross-referencing WayGraph intersection labels with model performance, we can identify systematic failure modes (e.g., poor left-turn prediction at unsignalized T-junctions) that inform targeted training data collection and reward design. This intersection-type-stratified evaluation is a novel contribution to the E2E driving evaluation literature.

Compute Infrastructure

This project spans two compute tiers: initial prototyping on a consumer GPU, scaling to Google Cloud TPU pods for production training and annotation.

Google TPU Research Cloud (TRC)

Supported by Google's TPU Research Cloud program, providing access to Cloud TPU v4, v5e, and v6e pods across US and EU regions. Currently running 8 parallel Qwen2.5-VL annotation workers on TPU v4-32 pods, with v5litepod-64 and v6e-64 pods allocated for full fine-tuning and GRPO RL.

~152
PFLOPS bf16 total

Active TPU Fleet

288 TPU chips across 6 VMs, spanning three generations of Google's custom AI accelerators. Combined: ~7 TB HBM memory and ~152 PFLOPS bf16 compute — roughly 920× the compute of the RTX 4090 used for the initial #15 submission.

TPU VM          | Chips | HBM      | bf16 Compute  | Role
2× TPU v4-32    | 32    | 1,024 GB | 8.8 PFLOPS    | Annotation (8× VLM workers)
2× v5litepod-64 | 128   | 2,048 GB | 25.2 PFLOPS   | Training (EU-west4 + US-central1)
2× v6e-64       | 128   | 4,096 GB | 117.6 PFLOPS  | GRPO RL (US-east1)
Total           | 288   | ~7 TB    | ~152 PFLOPS   | —
920×
More compute
vs RTX 4090
291×
More memory
vs RTX 4090
3 regions
US-central, US-east
EU-west4
3 gens
TPU v4 · v5e
v6e (Trillium)
Scaling path: The #15 ranking was achieved on a single RTX 4090 (24 GB, QLoRA). With Google TRC providing ~152 PFLOPS across 288 TPU chips, we now have the compute to execute full-parameter fine-tuning and GRPO reinforcement learning at the scale described in the original Poutine paper.
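The headline ratios follow from simple arithmetic, assuming ~165 TFLOPS of bf16-class peak throughput for the RTX 4090 (an assumption; the exact figure depends on accumulation mode) and the rounded ~7 TB HBM total:

```python
# Arithmetic behind the ~920x compute and ~291x memory comparisons.
# The 165 TFLOPS figure for the RTX 4090 is an assumed bf16-class peak.

tpu_pflops = 8.8 + 25.2 + 117.6          # 151.6 PFLOPS across the fleet
gpu_tflops = 165.0                        # assumed RTX 4090 bf16 peak
compute_ratio = tpu_pflops * 1000 / gpu_tflops  # ≈ 919x

tpu_hbm_gb = 7000                         # rounded ~7 TB fleet total
mem_ratio = tpu_hbm_gb / 24               # ≈ 292x vs the 4090's 24 GB
```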