Urban driving scene with trajectory overlay

CTL-Drive: VLM-Based End-to-End Driving on a Single GPU

Xingnan Zhou, Ciprian Alecsandru
Concordia University, Montreal · 2026
#15
Waymo E2E Challenge
795K
Frames Being Annotated
6 TPU Pods
Google TRC
Stage 1
RL Coming Next

Summary

Poutine (Rowe et al., 2025) demonstrated that an unmodified vision-language model (Qwen2.5-VL) can achieve near-human driving performance on the Waymo End-to-End benchmark through text-encoded trajectories and reinforcement learning. Building on this approach, CTL-Drive implements VLT pre-training (Stage 1a + 1b) on a single RTX 4090, with a complete pipeline covering Waymo data extraction, CoVLA-style Gemini annotation, and QLoRA fine-tuning of Qwen3-VL-4B. The resulting model achieves ADE 1.28m (3s) / 2.99m (5s) and RFS 7.70, closely approaching the #1 entry (NTR), and ranks #15 on the Waymo E2E Driving Challenge: trained on a single consumer GPU, no reinforcement learning, GRPO still to come.

#15
Waymo E2E Challenge
single RTX 4090
1.28m
ADE @ 3s
(#1 NTR: 1.17m)
2.99m
ADE @ 5s
(#1 NTR: 2.63m)
7.70
RFS
(#1 NTR: 8.05)

Waymo E2E Driving Challenge

Ranked #15 — Single GPU, No RL, Closing in on #1

Trained on a single RTX 4090 with QLoRA, our model achieves ADE 1.28m / 2.99m (3s/5s) and RFS 7.70 on the Waymo E2E Driving Challenge. The #1 entry (NTR) scores 1.17m / 2.63m — we are within 0.11m ADE at 3s without any reinforcement learning.

7.70
RFS (#1: 8.05)

Comparison with Leaderboard #1

Metric       | #1 NTR            | Ours (#15)      | Gap
ADE @ 3s     | 1.17m             | 1.28m           | 0.11m
ADE @ 5s     | 2.63m             | 2.99m           | 0.36m
RFS          | 8.05              | 7.70            | 0.35
RL Stage     | —                 | None            | —
Training GPU | Multi-GPU cluster | Single RTX 4090 | —

Competitive Landscape

Leaderboard standings after de-duplicating by team (each team's best submission only).

Method           | RFS  | ADE 3s | ADE 5s
NTR              | 8.05 | 1.17m  | 2.63m
RAP              | 8.04 | 1.17m  | —
Poutine          | 7.99 | 1.21m  | —
… 8 entries …
CTL-Drive (ours) | 7.70 | 1.28m  | 2.99m
Key takeaway: CTL-Drive reaches #15 on the official leaderboard using a single consumer GPU and no reinforcement learning. GRPO RL on our Google TRC allocation is the next step to close the remaining gap.

What's Already Done

The #15 submission was trained entirely on a single RTX 4090 with QLoRA: Stage 1a CoVLA pre-training (228K frames) followed by Stage 1b WOD-E2E fine-tuning (90K samples, front-camera only, turn-balanced sampling). The model uses intent conditioning and a turn-aware fallback mechanism. No reinforcement learning was applied. Meanwhile, we are scaling annotation to 795K frames via Google TPU Research Cloud (TRC).
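The turn-balanced sampling used in Stage 1b can be sketched as inverse-frequency weighted draws over turn-labeled frames. This is an illustrative sketch, not the exact implementation; the label names and weighting scheme are assumptions.

```python
import random
from collections import Counter

# Illustrative turn-balanced sampler: frames labeled by turning behavior
# are drawn with inverse-frequency weights, so rare sharp-turn frames are
# not drowned out by the straight-driving majority.

def balanced_sample(frames, labels, k, seed=0):
    """Draw k frames with probability inversely proportional to the
    frequency of each frame's turn label."""
    counts = Counter(labels)
    weights = [1.0 / counts[label] for label in labels]
    rng = random.Random(seed)
    return rng.choices(frames, weights=weights, k=k)
```

With two labels, each class ends up with roughly equal expected share of the sampled batch regardless of its base rate.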

795K
Total frames
being annotated
v4-32
TPU pods
via Google TRC
Parallel VLM
annotation workers
v5e · v6e
Additional TPU
for training

What's Coming Next

With TPU cluster access secured, three major improvements are actively in progress:

  1. Full-Scale Annotation (In Progress) — Expanding CoVLA-style teacher annotation to the complete 795K-frame dataset using 8 parallel Qwen 2.5-VL instances on Google TPU v4-32 pods. WOD-E2E annotation is 99% complete; CoVLA annotation on track to finish within hours.
  2. Stage 1b — Supervised Fine-Tuning — Re-train on the full 795K annotated dataset with extended fine-tuning, targeting improved performance on under-represented scenarios (sharp turns, complex intersections).
  3. GRPO Reinforcement Learning (Stage 2) — Apply Group Relative Policy Optimization with trajectory-quality rewards on TPU v5e/v6e pods. This is the key stage where Poutine demonstrated near-human performance with GRPO (RFS 7.99 vs 8.13 human).
Why this matters: Reaching #15 with only Stage 1 pre-training on a single consumer GPU establishes a strong baseline. With Google TRC providing TPU v4/v5e/v6e cluster access, we now have the compute to execute the full Poutine pipeline — the current ranking represents a floor, not a ceiling.
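The trajectory-quality reward for Stage 2 is still being designed, but the core of GRPO is fixed: sample a group of rollouts per scene, score each, and normalize rewards within the group instead of learning a value baseline. A minimal sketch of that group-relative advantage computation:

```python
# Core of GRPO (Group Relative Policy Optimization): advantages are
# computed relative to a group of G rollouts from the same prompt,
# replacing a learned value function. The reward choice is a placeholder.

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group's rewards to zero mean and unit std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. rewards could be the negative ADE of each sampled trajectory
advantages = group_relative_advantages([-1.2, -0.8, -2.1, -0.9])
```

Each advantage then weights the policy-gradient update for its trajectory's tokens.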

What is Poutine?

Poutine is a vision-language model (VLM) approach to end-to-end autonomous driving developed at Mila and Université de Montréal. Instead of designing complex perception-prediction-planning pipelines, it fine-tunes an off-the-shelf VLM (Qwen2.5-VL-3B) to directly output future trajectory waypoints as text tokens from camera images.

Near-Human Driving from a Language Model

On the Waymo End-to-End Challenge, Poutine achieved a Route Following Score of 7.99 — compared to 8.13 for a human expert driver. The key insight: text-encoded trajectories combined with large-scale pre-training and GRPO reinforcement learning can outperform complex multi-module architectures.
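The "text-encoded trajectories" idea is simple enough to sketch: waypoints are serialized into plain text so the VLM emits them as ordinary tokens, and the text is parsed back into floats at inference time. The exact format and precision below are illustrative.

```python
import ast

# Sketch of trajectory-as-text encoding: (x, y) waypoints become a plain
# text list the model can generate token by token. Precision is a choice.

def encode_trajectory(waypoints, precision=1):
    """Serialize (x, y) waypoints into a compact text sequence."""
    return "[" + ", ".join(
        f"[{x:.{precision}f}, {y:.{precision}f}]" for x, y in waypoints
    ) + "]"

def decode_trajectory(text):
    """Parse the model's text output back into float waypoint tuples."""
    return [tuple(p) for p in ast.literal_eval(text)]
```

The round trip is lossless up to the chosen precision, which is what lets an unmodified language model head act as a trajectory decoder.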

7.99
RFS (human: 8.13)

Training Pipeline

CTL-Drive follows a multi-stage VLM training pipeline, adapted for single-GPU training. VLT pre-training is complete; GRPO RL is the next milestone.

Done
📦
Data Extraction
415K WOD-E2E + CoVLA
Done
🏷
Teacher Annotation
Gemini 3 Flash
Done
🧠
VLT Pre-training
16 iterations
Next
🎯
GRPO RL
Needs multi-GPU

Training Configuration

Model

Qwen3-VL-4B Instruct

4.57B parameters, DeepStack ViT fusion

Interleaved-MRoPE, 262K context window

LoRA Configuration

Method: QLoRA 4-bit NF4

Rank: 128 · Alpha: 256

Trainable: 132M params (2.89%)

Training Data

WOD-E2E: 20K annotated frames

CoVLA: 379K driving frames

50/50 mixed training ratio

Hardware

GPU: 1× RTX 4090 (24 GB)

VRAM usage: ~10 GB (QLoRA)

Training time: ~10h per iteration
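As a sanity check on the configuration above: a LoRA adapter on a d_in × d_out weight adds rank × (d_in + d_out) trainable parameters, and the reported 132M trainable parameters are indeed ≈2.89% of 4.57B. The matrix dimensions below are illustrative; actual Qwen3-VL-4B layer shapes vary.

```python
# Sanity-check the QLoRA numbers above.

def lora_params(d_in, d_out, rank):
    """A LoRA adapter adds two matrices: A (d_in x r) and B (r x d_out)."""
    return rank * (d_in + d_out)

per_matrix = lora_params(4096, 4096, rank=128)  # 1,048,576 params per adapted weight
pct_trainable = 100 * 132e6 / 4.57e9            # ≈ 2.89%
```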

Chain-of-Thought Output Format

Following Poutine, the model produces structured chain-of-thought reasoning before outputting trajectory waypoints. This grounds the driving decision in explicit perception and planning.

[System] You are a driving assistant. Analyze the scene and predict the ego trajectory.

[User] <image> <image> <image>
Current speed: 8.2 m/s. Predict the next 5 seconds of ego trajectory.

[Assistant]
// Perception
The ego vehicle is on a two-lane urban road approaching a signalized intersection.
Detected: 2 passenger vehicles ahead, 1 pedestrian on crosswalk, traffic light is green.

// Prediction & Planning
Lead vehicle decelerating. Maintain lane, gradual deceleration to 5 m/s.
Command: LANE_FOLLOW

// Trajectory (5s @ 2Hz = 10 waypoints)
[[0.8, 0.0], [1.5, 0.0], [2.1, -0.1], [2.6, -0.1], [3.0, -0.1],
 [3.3, -0.1], [3.5, -0.1], [3.7, -0.1], [3.8, -0.1], [3.9, -0.1]]
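A response in this format can be parsed in a few lines. The section markers and `Command:` field follow the template above; the parsing logic itself is just a sketch, not the production parser.

```python
import ast
import re

# Illustrative parser for the structured chain-of-thought output: extract
# the driving command and the waypoint list from the "// Trajectory" section.

def parse_response(text):
    """Return (command, waypoints) from a model response in the CoT format."""
    command = re.search(r"Command:\s*(\w+)", text).group(1)
    traj_block = text.split("// Trajectory", 1)[1]
    waypoints = ast.literal_eval(
        re.search(r"\[\[.*\]\]", traj_block, re.S).group(0)
    )
    return command, waypoints
```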

Results

We evaluate our best model (v16, trained with 50% CoVLA + 20K WOD-E2E) against the base Qwen3-VL-4B model on held-out WOD-E2E validation frames, stratified by turning behavior. Average Displacement Error (ADE) measures the mean L2 distance between predicted and ground-truth waypoints over the 5-second horizon.
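For reference, ADE as used here, under the convention that predicted and ground-truth waypoints are index-matched at the same timestamps:

```python
import math

def ade(pred, gt):
    """Average Displacement Error: mean L2 distance between
    index-matched predicted and ground-truth (x, y) waypoints."""
    assert len(pred) == len(gt) and pred
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)
```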

ADE by Scenario Type

Scenario Type | Base VLM ADE (m) | Our Model ADE (m) | Improvement
Straight      | 4.24             | 2.91              | −31.4%
Mild Turn     | 5.07             | 3.33              | −34.4%
Sharp Turn    | 7.81             | 7.63              | −2.3%
Overall       | 4.87             | 3.99              | −18.0%

Trajectory Predictions

Sample trajectory overlays from the validation set. Green: ground truth. Red: model prediction. The model produces smooth, directionally accurate trajectories for straight and mild-turn scenarios, while sharp turns remain challenging.

Highway straight trajectory
Highway straight — strong tracking
Left turn trajectory
Left turn — captures intent
Highway curve trajectory
Highway curve — smooth curvature
Urban driving trajectory
Urban driving — lane following
Key insight: On the Waymo E2E Challenge, our model achieves ADE 1.28m (3s) and 2.99m (5s), within 0.11m of the #1 entry (NTR) at the 3-second horizon — without any reinforcement learning. GRPO RL is expected to close the remaining gap and push the ranking higher.

Intersection-Type Analysis

Using our WayGraph toolkit, we classify 56,797 Waymo scenarios by intersection topology (T-junction, 4-way cross, multi-leg, roundabout, etc.). This enables the first intersection-type-stratified evaluation of E2E driving models — revealing where models fail that aggregate metrics hide.
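WayGraph's actual classification rules are richer than this, but the core idea of typing a junction from lane-graph topology can be sketched by counting the road legs meeting at a node. This is a hypothetical simplification, not the toolkit's implementation.

```python
# Hypothetical sketch of topology-based intersection typing: assign a
# coarse type from the number of road legs meeting at a junction node.
# Real rules also consider geometry, signals, and lane connectivity.

def classify_intersection(num_legs):
    """Map a leg count to a coarse intersection type."""
    if num_legs <= 2:
        return "no_intersection"   # straight road or simple curve
    if num_legs == 3:
        return "t_junction"
    if num_legs == 4:
        return "four_way_cross"
    return "multi_leg"
```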

Intersection Type Distribution in WOMD
Distribution of intersection types across 56,797 WOMD scenarios, classified by WayGraph lane graph topology analysis. The dataset is dominated by "no intersection" (straight road) scenarios, with T-junctions and 4-way crossings as the most common intersection types.
Research direction: By cross-referencing WayGraph intersection labels with model performance, we can identify systematic failure modes (e.g., poor left-turn prediction at unsignalized T-junctions) that inform targeted training data collection and reward design. This intersection-type-stratified evaluation is a novel contribution to the E2E driving evaluation literature.

Compute Infrastructure

This project spans two compute tiers: initial prototyping on a consumer GPU, scaling to Google Cloud TPU pods for production training and annotation.

Google TPU Research Cloud (TRC)

Supported by Google's TPU Research Cloud program, providing access to Cloud TPU v4, v5e, and v6e pods across US and EU regions. Currently running 8 parallel Qwen2.5-VL annotation workers on TPU v4-32 pods, with v5litepod-64 and v6e-64 pods allocated for full fine-tuning and GRPO RL.

~152
PFLOPS bf16 total

Active TPU Fleet

288 TPU chips across 6 VMs, spanning three generations of Google's custom AI accelerators. Combined: ~7 TB HBM memory and ~152 PFLOPS bf16 compute — roughly 920× the compute of the RTX 4090 used for the initial #15 submission.

TPU VM          | Chips | HBM      | bf16 Compute  | Role
2× TPU v4-32    | 32    | 1,024 GB | 8.8 PFLOPS    | Annotation (8× VLM workers)
2× v5litepod-64 | 128   | 2,048 GB | 25.2 PFLOPS   | Training (EU-west4 + US-central1)
2× v6e-64       | 128   | 4,096 GB | 117.6 PFLOPS  | GRPO RL (US-east1)
Total           | 288   | ~7 TB    | ~152 PFLOPS   | —
920×
More compute
vs RTX 4090
291×
More memory
vs RTX 4090
3 regions
US-central, US-east
EU-west4
3 gens
TPU v4 · v5e
v6e (Trillium)
Scaling path: The #15 ranking was achieved on a single RTX 4090 (24 GB, QLoRA). With Google TRC providing ~152 PFLOPS across 288 TPU chips, we now have the compute to execute full-parameter fine-tuning and GRPO reinforcement learning at the scale described in the original Poutine paper.
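The headline ratios follow from simple arithmetic, assuming ~165 TFLOPS of bf16-class peak throughput for the RTX 4090 (an assumption; the exact figure depends on accumulation mode) and the rounded ~7 TB HBM total:

```python
# Arithmetic behind the ~920x compute and ~291x memory comparisons.
# The 165 TFLOPS figure for the RTX 4090 is an assumed bf16-class peak.

tpu_pflops = 8.8 + 25.2 + 117.6          # 151.6 PFLOPS across the fleet
gpu_tflops = 165.0                        # assumed RTX 4090 bf16 peak
compute_ratio = tpu_pflops * 1000 / gpu_tflops  # ≈ 919x

tpu_hbm_gb = 7000                         # rounded ~7 TB fleet total
mem_ratio = tpu_hbm_gb / 24               # ≈ 292x vs the 4090's 24 GB
```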