Metrics and Benchmarks for Automated Driving

Abstract

Deep neural networks (DNNs) are conquering ever more of the autonomous vehicle driving stack, up to fully end-to-end trained systems. Numerous challenges and benchmarks exist, inviting researchers to push metrics ever higher. But does this directly translate into safer or better automated vehicle operation? The practitioner asks: How good is good enough, and are we even measuring the right things?

Our workshop invites both scientific and industrial approaches to better understand and overcome what we call the “crisis of metrics”. We invite contributions addressing the following questions: Which metrics are the right ones for which problems? What are the dos and don’ts? Do individual metrics, e.g. of perception outputs, still have a place in end-to-end trained systems? How should metrics be combined? Are metrics sufficiently invariant to diverse system outputs that could all be considered safe and reasonable operation? Should metrics be designed to take downstream applications into account, rather than being fully independent, to prevent silo thinking across research groups or departments?

In the age of generative approaches, the challenges regarding metrics go even further. How can we measure quality without ground truth, or usefulness for a specific purpose? Must generated videos/images be realistic overall to be useful, or is it enough if the specific aspects that downstream applications (such as object detectors) focus on are generated accurately?
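
One common way to quantify generated-image quality without paired ground truth is to compare feature distributions of real and generated data, as in the Fréchet Inception Distance (FID). The sketch below computes the Fréchet distance between two feature sets; it assumes the features have already been extracted with an embedding network (e.g. Inception-v3), and the function name is our own illustrative choice:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between two feature sets of shape (N, D).

    Compares only the first two moments of the real and generated
    feature distributions, so no paired ground truth is required.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; numerical error can
    # introduce a small imaginary component, which we discard.
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Whether such distribution-level realism actually predicts downstream usefulness, e.g. the accuracy of an object detector trained on the generated data, is precisely the kind of open question raised above.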

What are the right losses in DNN training, and how do they interact with the right metrics? Are intermediate losses useful in end-to-end learning systems, or should we focus solely on the system’s final outcome? Should metrics and losses be developed that are invariant with respect to multiple “good” solutions? One such loss family is sketched below.
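
As one illustration of a loss that tolerates multiple “good” solutions, consider the winner-takes-all (minimum-over-hypotheses) loss often used in multi-modal trajectory prediction, where only the best of K predicted trajectories is penalized. The following PyTorch sketch uses illustrative tensor shapes and names of our own choosing:

```python
import torch

def winner_takes_all_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Minimum-over-hypotheses (winner-takes-all) trajectory loss.

    pred:   (B, K, T, 2) - K candidate trajectories of T waypoints each
    target: (B, T, 2)    - the single observed future trajectory
    """
    # Average displacement error of every hypothesis: shape (B, K)
    err = (pred - target.unsqueeze(1)).norm(dim=-1).mean(dim=-1)
    # Penalize only the closest hypothesis per sample, leaving the
    # remaining, possibly equally plausible, modes untouched.
    return err.min(dim=1).values.mean()
```

Because only the closest hypothesis receives gradient, the network is free to keep several distinct yet plausible futures; evaluation metrics such as minADE follow the same minimum-over-hypotheses principle.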

With this workshop, we bring practitioners and researchers together to discuss and potentially advance the state of the art of metrics and benchmarks for automated driving and to reach a common understanding of the challenges ahead.

We invite contributions on the following topics:

Metrics and Benchmarks

  • Metrics for End-to-End Driving Systems
  • Alignment between single function metrics and system performance
  • Novel performance metrics and loss functions
  • Evaluation of un- and self-supervised learning
  • Plausibility checks of generative AI
  • Alignment of loss functions and system performance
  • Quantitative and qualitative comparison of Automated Driving stacks

Best Practices

  • Suitability of Generative AI data for training, test and validation
  • Best practices for fair performance comparisons
  • Novel test and validation data sets and procedures
  • Minimal performance requirements for reliable and safe AD systems
  • Online monitoring of function and system performance
  • Challenges and pitfalls of existing metrics

Organizers Info

  • Corina Apachite, Continental AG
  • Holger Caesar, TU Delft
  • Tim Fingscheidt, TU Braunschweig
  • Christian Hubschneider, FZI
  • Ulrich Kreßel, Mercedes-Benz AG
  • Thomas Monninger, Mercedes-Benz RDNA Inc.
  • Jörg Reichardt, Continental AG
  • Ömer Sahin Tas, FZI
  • Andrei Vatavu, Mercedes-Benz RDNA Inc.
  • Marius Zöllner, FZI

Dates & Agenda

  • March 15, 2025: Workshop Paper Submission
  • March 30, 2025: Note of Acceptance
  • April 25, 2025: Final Paper Submission
  • June 22, 2025: Workshop Date

Time / Agenda Item / Speaker
12:30-13:30  Lunch break and handover from the workshop Ensuring and Validating Safety for Automated Vehicles
Workshop: Metrics and Benchmarks for Automated Driving
13:30-14:10  On Benchmarks and Annotations
This talk surveys several benchmarks commonly used for object detection and motion prediction. It highlights, in particular, datasets recently (co)developed by our Intelligent Vehicles group at TU Delft: the EuroCity Persons 2.0 dataset – an image dataset for person detection, tracking, and prediction, collected across 29 cities in 11 European countries – as well as the View-of-Delft and RaDelft datasets, which are multi-sensor datasets designed for urban driving in a European city center and include 4D radar data. The talk also explores different annotation strategies (fully manual, semi-supervised, and self-supervised), and discusses their effectiveness and efficiency.
Dariu M. Gavrila
Delft University of Technology
14:10-14:50  Metrics for Automated Driving: Challenges and Pitfalls
The talk highlights challenges and pitfalls of commonly used evaluation metrics in modular automated driving stacks, in End-To-End stacks, as well as in generative AI. How good is good enough, and are we even measuring the right things? How useful are intermediate metrics in their current form in the overall system context and how to come up with metrics that really matter in the end?
Matthias Schreier
Technical University of Applied Sciences Würzburg-Schweinfurt
14:50-15:40  Paper Presentations (15 min presentation + 5 min discussion each)
14:50-15:10  A Generalized Waypoint Loss for End-to-End Autonomous Driving
Malte Stelzer, Timo Bartels, Jan Bickerdt, Volker Patricio Schomerus, Jan Piewek, Thorsten Bagdonat, Tim Fingscheidt
15:10-15:30  Empirical Spatial Error Bounds for Reliable Semantic Segmentation of Pedestrians and Riders
Timo Bartels, Malte Stelzer, Jan Bickerdt, Volker Patricio Schomerus, Jan Piewek, Thorsten Bagdonat, Tim Fingscheidt
15:30-16:00  Coffee Break
16:00-16:50  Two Invited Talks
16:00-16:25  MAN TruckScenes: a truckload of data to fill research gaps
Dive into MAN TruckScenes, the first public large-scale multimodal dataset for autonomous trucking! It addresses various challenges such as trailer occlusions, novel sensor perspectives, long-range perception, and diverse weather conditions. Comprising 747 scenes, it features sensor data from cameras, lidars, and 4D radars, along with annotations for tracked bounding boxes and scene tags. MAN TruckScenes thereby not only provides a foundation for exploring new perception solutions, but also serves as a valuable resource for developing benchmarks and metrics to tackle these challenges!
Fabian Kuttenreich
MAN Truck & Bus SE
16:25-16:50  Bridging the Reality Gap: Simulation-Based Evaluation of Driving Models
Real-world evaluation of driving models is expensive, difficult to scale, and lacks controllability. Simulation offers a scalable alternative, enabling reproducible testing in both open-loop and closed-loop settings. This talk examines methods for assessing the domain gap between the real world and simulation and explores the role of scene reconstruction models in generating high-fidelity virtual environments.
Jannik Zürn
Wayve Technologies Ltd.
16:50-17:30  Panel Session with conclusions and closing remarks
Workshop Chairs and Speakers

Keynote Speakers

  • Dariu Gavrila, TU Delft
  • Fabian Kuttenreich, MAN Truck & Bus SE
  • Matthias Schreier, Technical University of Applied Sciences Würzburg-Schweinfurt
  • Jannik Zürn, Wayve Technologies Ltd.

IEEE IV 2025
