Stanford researchers have discovered that large language models (LLMs) demonstrate high consistency on neutral topics but show significant variability when addressing controversial issues, according to a new study examining AI system reliability.
A research team led by doctoral candidate Jared Moore tested multiple LLMs with 8,000 questions across 300 topics, revealing that larger models like GPT-4 and Claude consistently outperform smaller models in response reliability. However, their consistency notably decreases when handling controversial subjects.
"You can't really declare that a large language model is biased if it gives different answers when a question is rephrased, nuanced, or translated into other languages," Moore told the Stanford Institute for Human-Centered Artificial Intelligence (HAI). The study tested questions in multiple languages, including English, Chinese, German, and Japanese.
The research uncovered distinct patterns in AI responses. On neutral topics like Thanksgiving, LLMs showed remarkable consistency, sometimes exceeding that of human respondents. On contentious issues like euthanasia, however, their responses became more varied and less predictable.
"We found that, in general, large models answers are pretty consistent across these different measures," noted Diyi Yang, Stanford University computer science professor and the study's senior author, in an interview with HAI. "Sometimes they were even more consistent than human participants."
The researchers argue that this inconsistency on controversial topics undercuts claims that the models hold fixed values or biases. "With our particular methodology, we show that these models are actually incredibly inconsistent on controversial topics. So we shouldn't be ascribing these kinds of values to them," Moore explained to HAI.
Looking ahead, Moore is investigating why models show varying consistency across different topics. He suggests that "value pluralism," designing models to reflect a range of perspectives rather than enforcing rigid consistency, could address concerns about bias.
"Often, we don't want perfect consistency. We don't want models to always express the same positions. You want them to represent a distribution of ideas," Moore told HAI.