✨ TL;DR
This paper benchmarks cloud and local large language models on two System Dynamics tasks: extracting causal loop diagrams and providing interactive coaching. The best local models match mid-tier cloud performance on diagram extraction (77%) but struggle with long-context error-fixing tasks; backend implementation choices matter more than quantization level.
System Dynamics modeling requires AI assistants that can extract structured causal relationships from text and provide interactive coaching on model building. While cloud-based LLMs offer strong performance, practitioners need to know whether locally hosted open-source models can provide comparable assistance, especially under privacy, cost, and deployment constraints. Existing evaluations have not systematically compared cloud and local LLM performance on domain-specific System Dynamics tasks, nor examined how technical implementation choices (backend frameworks, quantization levels, model architectures) affect practical performance on these specialized tasks.
The authors created two purpose-built benchmarks: the CLD Leaderboard, with 53 tests for structured causal loop diagram extraction, and the Discussion Leaderboard, which evaluates interactive model discussion, feedback explanation, and coaching. They systematically evaluated multiple LLM families, spanning proprietary cloud APIs and locally hosted open-source models. For the local models, they ran extensive parameter sweeps across inference backends (llama.cpp GGUF vs. mlx_lm MLX), quantization levels (Q3, Q4_K_M, MLX-3bit, MLX-4bit, MLX-6bit), model architectures (reasoning vs. instruction-tuned), and sampling parameters (temperature, top-p, top-k). All experiments ran on Apple Silicon hardware with models ranging from 67B to 123B parameters; timing data was carefully documented, and stuck requests were excluded.
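A sweep of this shape can be sketched as a small configuration grid. This is an illustrative sketch, not the authors' harness: the specific sampling values and the pairing of quantization levels to backends are assumptions for demonstration; only the sweep dimensions themselves come from the paper.

```python
from itertools import product

# Sweep dimensions described in the paper; the concrete values below
# (sampling settings, quant-to-backend pairing) are illustrative assumptions.
BACKENDS = {
    "llama.cpp-GGUF": ["Q3", "Q4_K_M"],
    "mlx_lm-MLX": ["3bit", "4bit", "6bit"],
}
TEMPERATURES = [0.0, 0.7]   # hypothetical sampling values
TOP_PS = [0.9, 1.0]

def sweep_configs():
    """Yield one run configuration per backend/quant/sampling combination."""
    for backend, quants in BACKENDS.items():
        for quant, temp, top_p in product(quants, TEMPERATURES, TOP_PS):
            yield {
                "backend": backend,
                "quant": quant,
                "temperature": temp,
                "top_p": top_p,
            }

configs = list(sweep_configs())
# 2 GGUF quants * 4 sampling combos + 3 MLX quants * 4 = 20 configurations
print(len(configs))
```

Enumerating the full cross-product up front makes it easy to log timing per configuration and to drop stuck requests without losing track of which cells of the grid have been covered.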