[1] Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models
[2] GenLine and GenForm: Two Tools for Interacting with Generative Language Models in a Code Editor
[3] Evaluating Large Language Models Trained on Code
[5] Implicit Representations of Meaning in Neural Language Models
[6] Measuring Coding Challenge Competence With APPS
[7] The Power of Scale for Parameter-Efficient Prompt Tuning
[8] GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow
[9] On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
[10] CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
[11] Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks
[12] Extracting Training Data from Large Language Models
[13] Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks
[14] PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers
[15] Deep Just-In-Time Inconsistency Detection Between Comments and Source Code
[16] BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration
[17] You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion
[18] Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data
[19] Where Should I Comment My Code? A Dataset and Model for Predicting Locations That Need Comments
[20] DreamCoder: Growing Generalizable, Interpretable Knowledge with Wake–Sleep Bayesian Program Learning
[21] Unsupervised Translation of Programming Languages
[22] Language Models are Few-Shot Learners
[23] Graph-based, Self-Supervised Program Repair from Diagnostic Feedback
[24] IntelliCode Compose: Code Generation Using Transformer
[25] Hoppity: Learning Graph Transformations to Detect and Fix Bugs in Programs
[26] Global Relational Models of Source Code
[27] LambdaNet: Probabilistic Type Inference Using Graph Neural Networks
[28] OptTyper: Probabilistic Type Inference by Optimising Logical and Natural Constraints
[29] Code Prediction by Feeding Trees to Transformers
[30] Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
[31] CodeBERT: A Pre-Trained Model for Programming and Natural Languages
[32] Learning to Represent Programs with Property Signatures
[33] Learning and Evaluating Contextual Embedding of Source Code
[34] TypeWriter: Neural Type Prediction with Search-Based Validation
[35] Learning to Fix Build Errors with Graph2Diff Neural Networks
[36] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[37] CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
[38] SPoC: Search-Based Pseudocode to Code
[39] Write, Execute, Assess: Program Synthesis with a REPL
[40] MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
[41] SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair
[42] Deep Learning Type Inference
[43] Execution-Guided Neural Program Synthesis
[44] Automatic Program Synthesis of Long Programs with a Learned Garbage Collector
[45] Mapping Language to Code in Programmatic Context
[46] SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing
[47] NAPS: Natural Program Synthesis Dataset
[48] DeepBugs: A Learning Approach to Name-Based Bug Detection
[49] code2vec: Learning Distributed Representations of Code
[50] Deep Contextualized Word Representations
[51] Universal Language Model Fine-tuning for Text Classification
[52] Learning to Represent Programs with Graphs
[53] A Survey of Machine Learning for Big Code and Naturalness
[55] Attention Is All You Need
[56] RobustFill: Neural Program Learning under Noisy I/O
[57] Neural Sketch Learning for Conditional Program Generation
[58] DeepCoder: Learning to Write Programs
[59] Probabilistic Model for Code with Decision Trees
[60] Hybrid Computing Using a Neural Network with Dynamic External Memory
[61] Latent Predictor Networks for Code Generation
[62] A Convolutional Attention Network for Extreme Summarization of Source Code
[63] Exploring the Limits of Language Modeling
[64] Automatic Patch Generation by Learning Correct Code
[65] Neural GPUs Learn Algorithms
[66] Neural Random Access Machines
[67] Semi-supervised Sequence Learning
[68] Predicting Program Properties from "Big Code"
[71] Phrase-Based Statistical Translation of Programming Languages
[72] Towards a Big Data Curated Benchmark of Inter-project Code Clones
[73] Code Completion with Statistical Language Models
[74] Learning Natural Coding Conventions
[75] Structured Generative Models of Natural Source Code
[76] Growing Solver-Aided Languages with Rosette
[77] Syntax-Guided Synthesis
[78] Lexical Statistical Machine Translation for Language Migration
[79] Mining Source Code Repositories at Massive Scale Using Language Modeling
[80] On the Naturalness of Software
[81] Generating Text with Recurrent Neural Networks
[82] Automating String Processing in Spreadsheets Using Input-Output Examples
[83] Combinatorial Sketching for Finite Programs
[84] On the Synthesis of a Reactive Module
[85] A Methodology for LISP Program Construction from Examples
[86] Inferring LISP Programs From Examples
[87] Knowledge and Reasoning in Program Synthesis
[88] Toward Automatic Program Synthesis
[89] PROW: A Step Toward Automatic Program Writing
[90] The FORTRAN Automatic Coding System
[91] Prefix-Tuning: Optimizing Continuous Prompts for Generation
[92] A Large-Scale Benchmark for Few-Shot Program Induction and Synthesis
[93] Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks
[94] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[95] Language Models are Unsupervised Multitask Learners
[96] Learning Libraries of Subroutines for Neurally-Guided Bayesian Program Induction
[97] Improving Language Understanding by Generative Pre-Training
[98] A Syntactic Neural Model for General-Purpose Code Generation
[99] GenProg: A Generic Method for Automatic Software Repair
[100] Recurrent Neural Network Based Language Model
[103] Alan Turing's Electronic Brain: The Struggle to Build the ACE, the World's Fastest Computer