Meta, determined to make a significant impact in the competitive field of generative AI, is embracing open source development.
Following the release of AI models for text generation, language translation, and audio creation, the company has now open-sourced Code Llama.
This machine learning system can generate code and explain it in natural language (specifically English).
Similar in nature to GitHub Copilot and Amazon CodeWhisperer, as well as open-source AI-powered code generators like StarCoder, StableCode, and PolyCoder, Code Llama can complete and debug code across various programming languages such as Python, C++, Java, PHP, TypeScript, C#, and Bash.
Meta emphasizes the importance of an open approach to AI models, particularly those dedicated to coding, for both innovation and safety.
By releasing code-specific models like Code Llama to the public, the entire community can evaluate their capabilities, identify issues, and address vulnerabilities.
Code Llama, available in several versions, including ones specialized for Python and for following instructions, is based on the Llama 2 text-generating model that Meta open-sourced earlier.
While Llama 2 could generate code, its quality was not on par with purpose-built models like Copilot.
To train Code Llama, Meta used the same dataset as Llama 2, which included publicly available sources from the web.
However, the model was given more time to learn the relationships between code and natural language by emphasizing the subset of training data that contained code.
Each Code Llama model, ranging in size from 7 billion to 34 billion parameters, was trained with 500 billion tokens of code and code-related data.
The Python-specific version underwent further fine-tuning with 100 billion tokens of Python code, while the instruction-understanding version was fine-tuned using feedback from human annotators to generate helpful and safe responses to questions.
Code Llama models can insert generated code into existing code (fill-in-the-middle completion) and accept approximately 100,000 tokens of code as input.
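Meta's release describes the fill-in-the-middle capability in terms of a prompt containing the code before and after the insertion point. As a rough sketch of how such a prompt could be assembled, the helper below uses the `<PRE>`/`<SUF>`/`<MID>` sentinel markers from the published infilling format; the exact sentinel strings are an assumption here and may differ per tokenizer.

```python
def build_infill_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt for a code model.

    The <PRE>/<SUF>/<MID> markers follow the infilling format described
    in Meta's Code Llama release; the exact sentinel strings are an
    illustrative assumption, not verified against a specific tokenizer.
    The model is expected to generate the code that belongs between
    prefix and suffix after the <MID> marker.
    """
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

# Example: ask the model to fill in a function body between a
# signature (prefix) and a return statement (suffix).
prefix = "def fibonacci(n):\n    "
suffix = "\n    return result"
print(build_infill_prompt(prefix, suffix))
```

The long context window means the prefix and suffix can include substantial surrounding code, roughly up to the ~100,000-token limit.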
The 7 billion-parameter model can run on a single GPU, while the 34 billion-parameter model, claimed to be the best-performing and largest by parameter count among open-source code generators, requires more powerful hardware.
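A back-of-the-envelope estimate illustrates why the larger model needs heavier hardware: this sketch counts only the memory to hold the weights at 16-bit precision, ignoring activations, the KV cache, and framework overhead, so real-world usage is higher.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate GPU memory needed just to hold the model weights.

    bytes_per_param=2 assumes 16-bit (fp16/bf16) weights; activations,
    KV cache, and runtime overhead are ignored, so actual usage is higher.
    """
    return params_billions * 1e9 * bytes_per_param / 1e9

for size in (7, 34):
    print(f"{size}B parameters -> ~{weight_memory_gb(size):.0f} GB of weights")
```

By this estimate, the 7B model's weights fit comfortably on a single modern GPU, while the 34B model's weights alone exceed the memory of most single consumer cards.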
The appeal of code-generating tools to programmers and non-programmers is substantial. GitHub reports that over 400 organizations currently use Copilot, with developers within those organizations coding 55% faster than before.
Additionally, a recent Stack Overflow survey found that 70% of respondents are already using or planning to use AI coding tools, citing increased productivity and accelerated learning.
However, like all generative AI, coding tools come with risks. They can inadvertently introduce security vulnerabilities, and there are concerns about intellectual property infringement.
Code-generating models may also be exploited for malicious purposes.
Meta acknowledges that Code Llama may generate inaccurate or objectionable responses and urges developers to perform safety testing and tuning tailored to their specific applications.
While the model was internally red-teamed, further audits from third parties are necessary.
Despite these risks, Meta places minimal restrictions on how developers can deploy Code Llama, as long as they agree not to use it maliciously.
The company hopes that Code Llama will inspire others to leverage Llama 2 to create innovative tools for research and commercial products, supporting software engineers across various sectors.
Source: TechCrunch