StarCoder2: Advancing Generative AI

ServiceNow, Hugging Face, and NVIDIA have joined forces to release StarCoder2, a family of open-access large language models (LLMs) designed for code generation. These models, created in partnership with the BigCode Community, promise to set new standards in performance, transparency, and cost-effectiveness.

StarCoder2 is trained on 619 programming languages. This breadth allows the models to be further fine-tuned and embedded in enterprise applications for tasks such as application source code generation, workflow generation, and text summarisation. Developers can use StarCoder2's capabilities, including code completion, advanced code summarisation, and code snippet retrieval.

The release includes three variants: a 3-billion-parameter model trained by ServiceNow, a 7-billion-parameter model trained by Hugging Face, and a 15-billion-parameter model built by NVIDIA using their NeMo framework.
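For readers who want to try code completion with one of the three sizes, the following is a minimal sketch rather than an official recipe. It assumes the models are published on the Hugging Face Hub under the IDs `bigcode/starcoder2-3b`, `bigcode/starcoder2-7b`, and `bigcode/starcoder2-15b`, and that the `transformers` and `torch` packages are installed; the helper names `checkpoint_for` and `complete` are hypothetical and chosen only for illustration.

```python
# Assumed Hugging Face Hub IDs for the three variants; not stated in the article.
CHECKPOINTS = {
    "3b": "bigcode/starcoder2-3b",
    "7b": "bigcode/starcoder2-7b",
    "15b": "bigcode/starcoder2-15b",
}


def checkpoint_for(size: str) -> str:
    """Map a size label such as '3B' to its assumed Hub checkpoint ID."""
    key = size.strip().lower()
    if key not in CHECKPOINTS:
        raise ValueError(
            f"unknown size {size!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return CHECKPOINTS[key]


def complete(prompt: str, size: str = "3b", max_new_tokens: int = 48) -> str:
    """Return the prompt followed by the model's continuation (hypothetical helper)."""
    # Imported lazily so checkpoint_for works without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = checkpoint_for(size)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Tokenise the code prefix and let the model extend it.
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Calling `complete("def fibonacci(n):")` would download the 3-billion-parameter weights on first use and return the prompt extended with generated code. The larger variants can be selected with `size="7b"` or `size="15b"`, at a correspondingly higher memory cost.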

Harm de Vries, lead of ServiceNow's StarCoder2 development team and co-lead of BigCode, emphasised the significance of this release, stating, "StarCoder2 stands as a testament to the combined power of open scientific collaboration and responsible AI practices with an ethical data supply chain. The state-of-the-art open-access model improves on prior generative AI performance to increase developer productivity and provides developers equal access to the benefits of code generation AI, which in turn enables organisations of any size to more easily meet their full business potential."

Trained on a broader and deeper body of code, StarCoder2 can take repository context into account, enabling more accurate, context-aware predictions. These advancements benefit both seasoned software engineers and citizen developers.

Responsible innovation lies at the core of BigCode's purpose, as demonstrated by its open governance, transparent supply chain, use of open-source software, and the ability for developers to opt data out of training. StarCoder2 was built using responsibly sourced data under licence from the digital commons of Software Heritage, hosted by Inria.

Roberto Di Cosmo, director at Software Heritage, commended the collaboration, stating, "StarCoder2 is the first code generation AI model developed using the Software Heritage source code archive and built to align with our policies for responsible development of models for code. The collaboration of ServiceNow, Hugging Face, and NVIDIA exemplifies a shared commitment to ethical AI development, advancing technology for the greater good."

The open-access nature of the models, available under the BigCode OpenRAIL-M licence and hosted on various platforms, encourages widespread adoption, experimentation, and innovation.

As Jonathan Cohen, vice president of applied research at NVIDIA, noted, "Since every software ecosystem has a proprietary programming language, code LLMs can drive breakthroughs in efficiency and innovation in every industry."

The collaborative spirit and commitment to responsible AI practices demonstrated by the teams behind StarCoder2 serve as an example of how open scientific collaboration can drive technological progress while prioritising ethics and transparency.