Researchers at the Massachusetts Institute of Technology (MIT) and Penn State University have discovered that artificial intelligence models, if used in home surveillance systems, could recommend calling the police even when no criminal activity is evident. The study, which will be presented at the AAAI Conference on AI, Ethics, and Society, highlights significant inconsistencies in how these AI models apply social norms to similar activities captured on video.
The research team, led by graduate student Shomik Jain from MIT's Institute for Data, Systems, and Society (IDSS), tested three prominent LLMs – GPT-4, Gemini, and Claude – using real surveillance videos from Amazon Ring cameras. They asked the models two key questions: whether a crime was occurring in the video and whether they would recommend calling the police.
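The article describes this prompting setup only at a high level, but the basic idea is easy to picture. The sketch below is a hypothetical illustration of how a vision-capable model could be asked the study's two questions about a short clip; it is not the researchers' actual pipeline. It assumes OpenAI's Python client and GPT-4o's image-input format, samples a few frames with OpenCV, and uses a made-up file name, none of which are details taken from the study.

```python
# Hypothetical sketch only: NOT the researchers' pipeline. Illustrates posing
# the study's two questions to a vision-capable model (OpenAI's GPT-4o here).
import base64
import cv2  # OpenCV, used to sample frames from a video file
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTIONS = [
    "Is a crime happening in this footage? Answer yes, no, or unclear.",
    "Would you recommend calling the police based on this footage? Answer yes or no.",
]

def sample_frames(video_path: str, n_frames: int = 4) -> list[str]:
    """Grab evenly spaced frames and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n_frames)
        ok, frame = cap.read()
        if not ok:
            continue
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            frames.append(base64.b64encode(buf).decode("utf-8"))
    cap.release()
    return frames

def ask_model(video_path: str) -> list[str]:
    """Ask both study questions about one clip and return the raw answers."""
    images = [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
        for b64 in sample_frames(video_path)
    ]
    answers = []
    for question in QUESTIONS:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user",
                       "content": [{"type": "text", "text": question}, *images]}],
        )
        answers.append(response.choices[0].message.content)
    return answers

if __name__ == "__main__":
    print(ask_model("ring_clip_001.mp4"))  # hypothetical local file name
```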
Key findings from the study include:
1. Inconsistent crime detection: The models frequently failed to flag criminal activity, answering that no crime was occurring in most videos, even though 39% of the footage showed criminal acts.
2. Overactive police recommendations: Despite rarely identifying crimes, the models suggested calling the police for 20% to 45% of the videos.
3. Demographic bias: Some models were less likely to recommend police intervention in predominantly white neighborhoods, even when controlling for other factors.
4. Terminology differences: In majority-white areas, models used terms like "delivery workers," while in neighborhoods with more residents of color, they used phrases like "burglary tools" or "casing the property."
5. Inconsistent decisions: Models often disagreed with each other about whether to call the police for the same video.
The researchers noted that while the AI models didn't show significant bias based on the skin tone of individuals in the videos, they exhibited biases related to neighborhood demographics. This finding suggests that current efforts to mitigate AI bias address only some forms of bias, and that where residual biases will surface is difficult to predict.
The study raises important questions about the reliability and fairness of AI systems in high-stakes decision-making processes. As these technologies continue to advance, the researchers stress the need for more comprehensive testing and transparency in AI development.