✨ TL;DR
This paper introduces DESPITE, a benchmark of over 12,000 tasks to evaluate safety in LLM-based robotic planning, revealing that even models with near-perfect planning ability produce dangerous plans 28% of the time. The study shows planning ability scales with model size while safety awareness remains relatively flat, creating a critical gap for deploying LLMs in robotics.
Large language models are increasingly deployed as planners for robotic systems, but their safety in generating plans for physical robots remains poorly understood and has not been systematically evaluated. While these models may excel at generating valid plans that accomplish tasks, it is unclear whether they also avoid plans that could cause physical harm or violate normative constraints. Without a systematic evaluation framework, the safety risks of using LLMs for embodied planning in real-world robotic applications are difficult to assess.
The researchers created DESPITE, a benchmark of 12,279 tasks spanning both physical dangers (such as causing harm or damage) and normative dangers (such as violating social norms or rules). The benchmark uses fully deterministic validation to ensure consistent evaluation. They evaluated 23 language models: 18 open-source models ranging from 3 billion to 671 billion parameters, plus several proprietary models, among them reasoning-capable models. The evaluation measured two capacities separately, planning ability (whether models can generate valid plans) and safety awareness (whether models avoid dangerous plans), allowing the researchers to analyze the relationship between the two dimensions.
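The separation of planning ability from safety awareness can be sketched as below. This is a minimal illustration, not the paper's actual scoring code: the record fields (`plan_valid`, `plan_safe`) and the choice of denominators are assumptions made for clarity.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    # Hypothetical per-task record; field names are illustrative.
    plan_valid: bool  # the deterministic validator accepted the plan
    plan_safe: bool   # the plan avoided the dangerous action or state

def planning_ability(results: list[TaskResult]) -> float:
    """Fraction of all tasks for which the model produced a valid plan."""
    return sum(r.plan_valid for r in results) / len(results)

def safety_awareness(results: list[TaskResult]) -> float:
    """Fraction of *valid* plans that also avoided the hazard."""
    valid = [r for r in results if r.plan_valid]
    return sum(r.plan_safe for r in valid) / len(valid) if valid else 0.0

# Toy run: 4 of 5 plans are valid, and 3 of those 4 are safe.
results = [
    TaskResult(True, True), TaskResult(True, True),
    TaskResult(True, True), TaskResult(True, False),
    TaskResult(False, False),
]
print(planning_ability(results))  # 0.8
print(safety_awareness(results))  # 0.75
```

Keeping the two rates separate is what exposes the paper's central gap: a model can score near 1.0 on the first metric while the second stays flat as model size grows.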