Google has announced DataGemma, a new set of open AI models designed to reduce hallucination in large language models (LLMs) by grounding them in real-world statistical data from Google's Data Commons.
DataGemma aims to address a critical challenge in generative AI: the tendency of LLMs to confidently present inaccurate information. By integrating vast amounts of publicly available, trustworthy data, these models seek to enhance the factuality and reasoning capabilities of AI systems.
The new models leverage two distinct approaches. The first, Retrieval-Interleaved Generation (RIG), enhances Google's Gemma 2 language model by having it proactively query Data Commons and fact-check its generated statistics against the results; Data Commons is a knowledge graph containing more than 240 billion data points from trusted organisations such as the UN, WHO, and CDC.
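To make the interleaving concrete, here is a minimal sketch assuming a hypothetical marker format in which the model tags each statistic with a Data Commons place and variable; a post-processing step then resolves those markers with the `datacommons` Python client and overwrites the model's own guess. The `[DC(...) => ...]` syntax and the `resolve_rig_spans` helper are illustrative inventions, not DataGemma's actual token format.

```python
import re

import datacommons as dc  # official Data Commons Python client

# Hypothetical marker format for illustration: the model interleaves its draft
# with [DC(<place_dcid>, <stat_var>) => <model_guess>] spans. DataGemma's real
# format differs; this only shows the resolve-and-replace idea behind RIG.
DC_SPAN = re.compile(
    r"\[DC\((?P<place>[^,]+),\s*(?P<var>[^)]+)\)\s*=>\s*(?P<guess>[^\]]+)\]"
)

def resolve_rig_spans(draft: str) -> str:
    """Replace each interleaved query span with the value Data Commons reports."""
    def substitute(match: re.Match) -> str:
        place = match.group("place").strip()
        var = match.group("var").strip()
        # The authoritative retrieved value wins over the model's own guess.
        return str(dc.get_stat_value(place, var))
    return DC_SPAN.sub(substitute, draft)

draft = "The US population is [DC(country/USA, Count_Person) => about 320 million]."
print(resolve_rig_spans(draft))
```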
The second approach, Retrieval-Augmented Generation (RAG), retrieves relevant statistical context from Data Commons and incorporates it into Gemini 1.5 Pro's long context window before a response is generated, reducing hallucinations and improving accuracy.
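A minimal sketch of that retrieve-then-generate flow, again assuming the `datacommons` client: statistics are fetched first and prepended to the prompt, which would then be passed to a long-context model such as Gemini 1.5 Pro. The `build_rag_prompt` helper and the hard-coded statistical variables are illustrative assumptions; DataGemma's actual pipeline derives its Data Commons queries from the user's question.

```python
import datacommons as dc  # official Data Commons Python client

def build_rag_prompt(question: str, place: str, stat_vars: list[str]) -> str:
    """Fetch statistics from Data Commons and prepend them as grounding context.

    Variable selection is hard-coded here for illustration; a production
    pipeline would choose the queries based on the question itself.
    """
    lines = [
        f"- {var} for {place}: {dc.get_stat_value(place, var)}"
        for var in stat_vars
    ]
    return (
        "Answer using only the statistics below.\n"
        "Statistics from Data Commons:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )

prompt = build_rag_prompt(
    "How many people live in the United States?",
    place="country/USA",
    stat_vars=["Count_Person"],
)
# `prompt` would then go to a long-context model such as Gemini 1.5 Pro.
print(prompt)
```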
Preliminary findings show notable improvements in the accuracy of language models when handling numerical facts. Google's research paper reports a significant reduction in hallucinations across use cases such as research and decision-making.
The tech giant is making these models available to researchers and developers immediately, with plans to integrate the enhanced functionality into both Gemma and Gemini models through a phased, limited-access approach.
The development of DataGemma underscores the importance of grounding AI systems in factual data, an advance that promises to make LLMs more reliable and trustworthy. Google encourages researchers and developers to explore DataGemma through the provided quickstart notebooks for both the RIG and RAG approaches, furthering the collective effort to improve AI reliability.
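For those who want to go beyond the notebooks, the checkpoints can also be loaded directly with Hugging Face transformers. The sketch below assumes the RIG variant is published under the model ID shown and that the Gemma licence has been accepted on the Hub; a 27B model additionally needs substantial GPU memory.

```python
# A minimal loading sketch with Hugging Face transformers. The model ID below
# is an assumption based on the Hub listing, not an official quickstart.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/datagemma-rig-27b-it"  # assumed Hub ID for the RIG variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "What share of US adults have a bachelor's degree or higher?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```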