Programming languages: This open-source AI code generator is very good at writing in C
Publish Time: 07 Mar, 2022

Researchers from Carnegie Mellon University have released PolyCoder, an automated code generator model that was trained on multiple programming languages, which they say is particularly good at writing code in C.

The researchers hope their open-source PolyCoder can democratize research into AI code generation, a field so far dominated by well-funded companies such as Alphabet-owned DeepMind and OpenAI.

"Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs... are not publicly available, leaving many questions about their model and data design decisions," the researchers said.

The researchers point out that OpenAI's Codex, unveiled in August 2021, is available through Microsoft-owned GitHub's Copilot tool, but note that it provides only "non-free access" to the model's output through black-box API calls, while the model's weights and training data remain unavailable.

The idea behind auto code generation is that it can save developers time, assuming the output is accurate and doesn't introduce security flaws. DeepMind claimed its recently unveiled AlphaCode code generator ranked in the top 54.3% of human participants in programming competitions. But training the model required "hundreds of petaFLOPS days" in Google's data centers. 

"Despite the great success of large language models of code, the strongest models are not publicly available," the researchers note. "This prevents the application of these models outside of well-resourced companies and limits research in this field for low-resourced organizations."

To address this, the researchers have released their own model, trained on code from multiple programming languages, which they call "PolyCoder".

The researchers explained: "We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, that was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex." 

The model was trained on data from GitHub repositories covering 12 popular programming languages: C, C#, C++, Go, Java, JavaScript, PHP, Python, Ruby, Rust, Scala and TypeScript. The unfiltered dataset totaled 631GB of data and 38.9 million files. The researchers chose the GPT-2 architecture for PolyCoder because of budget constraints.
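
For readers who want to experiment with the released weights, below is a minimal sketch of how such a checkpoint might be prompted once loaded into Hugging Face's transformers library. The checkpoint identifier and the C prompt are illustrative assumptions, not details from the paper; substitute the identifier from the official PolyCoder release if it differs.

    # Minimal sketch: prompting a GPT-2-style code model (such as PolyCoder) for a C completion.
    # The checkpoint id below is an assumption -- replace it with the id from the official release.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "NinedayWang/PolyCoder-2.7B"  # hypothetical/community-converted checkpoint

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Give the model the start of a C function and let it complete the body.
    prompt = "int binary_search(int *arr, int n, int target) {\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,   # length of the generated completion
        do_sample=True,      # sample instead of greedy decoding
        temperature=0.2,     # low temperature keeps completions conservative
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))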

The researchers claimed success in some areas, particularly in C, although Codex still outperformed PolyCoder in other languages.

"Notably, PolyCoder outperforms Codex and all other models in the C language. Comparing the open-source models only, PolyCoder performs better than the similarly sized GPT-Neo 2.7B in C, JavaScript, Rust, Scala and TypeScript," the researchers note.

"In the other 11 languages other than C, all other open-source models, including ours, are significantly worse (higher perplexity) than Codex.
