1. Measuring the Impact of Programming Language Distribution
2. CCTEST: Testing and Repairing Code Completion Systems
3. CoditT5: Pretraining for Source Code and Natural Language Editing
4. Grounded Copilot: How Programmers Interact with Code-Generating Models
5. Code Translation with Compiler Representations
6. XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence
7. OPT: Open Pre-trained Transformer Language Models
8. Natural Language to Code Translation with Execution
9. InCoder: A Generative Model for Code Infilling and Synthesis
10. PaLM: Scaling Language Modeling with Pathways
11. Training Compute-Optimal Large Language Models
12. MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
13. A Systematic Evaluation of Large Language Models of Code
14. Synchromesh: Reliable Code Generation from Pre-trained Language Models
15. NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
16. P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
17. Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy
18. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
19. AVATAR: A Parallel Corpus for Java-Python Program Translation
20. Program Synthesis with Large Language Models
21. Measuring Coding Challenge Competence with APPS
22. The Power of Scale for Parameter-Efficient Prompt Tuning
23. Unified Pre-training for Program Understanding and Generation
24. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
25. GraphCodeBERT: Pre-training Code Representations with Data Flow
26. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
27. Unsupervised Translation of Programming Languages
28. Language Models are Few-Shot Learners
29. CodeBERT: A Pre-Trained Model for Programming and Natural Languages
30. Data Augmentation Using Back-Translation for Context-Aware Neural Machine Translation
31. Improving Neural Machine Translation Robustness via Data Augmentation: Beyond Back-Translation
32. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
33. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
34. SPoC: Search-based Pseudocode to Code
35. A Study of BFLOAT16 for Deep Learning Training
36. The Curious Case of Neural Text Degeneration
37. Introducing MathQA - A Math-Aware Question Answering System
38. Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow
39. Tree-to-tree Neural Networks for Program Translation
40. Decoupled Weight Decay Regularization
41. Attention Is All You Need
42. DeepFix: Fixing Common C Language Errors by Deep Learning
43. Probabilistic Model for Code with Decision Trees
44. Using Machine Translation for Converting Python 2 to Python 3 Code
45. Phrase-Based Statistical Translation of Programming Languages
46. Lexical Statistical Machine Translation for Language Migration
47. Mining Source Code Repositories at Massive Scale Using Language Modeling
48. WordNet: A Lexical Database for English
49. A Conversational Paradigm for Program Synthesis
50. Improving Automatically Generated Code from Codex via Automated Program Repair
51. Zhu et al. (2022) introduce XLCoST, a new dataset which is parallel across 7 programming languages.
52. In addition, researchers have proposed various ways of improving code generation models. For example, Poesia et al. (2022) propose Target Similarity Tuning for code retrieval augmentation and Constrained Semantic Decoding to constrain generation to valid programs.
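As an illustration, here is a minimal sketch of the retrieval step behind such augmentation, with a generic text-embedding function standing in for the similarity model that Poesia et al. (2022) actually fine-tune; `embed` and the prompt template are assumptions, not the paper's method:

    import numpy as np

    def cosine(a, b):
        # Cosine similarity between two 1-D embedding vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def build_prompt(task, pool, embed, k=3):
        # pool: list of (description, code) pairs; embed: text -> 1-D numpy array.
        # Retrieve the k examples most similar to the new task description
        # and prepend them as few-shot examples.
        q = embed(task)
        scored = sorted(pool, key=lambda ex: cosine(embed(ex[0]), q), reverse=True)
        shots = "\n\n".join(f"# {desc}\n{code}" for desc, code in scored[:k])
        return f"{shots}\n\n# {task}\n"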
53. GraphCodeBERT (Guo et al., 2021) improves upon CodeBERT by leveraging AST and data flow information.
54. … Jangda. A scalable and extensible approach to benchmarking NL2Code for 18 programming languages, 2022.
55. Shi et al. (2022) introduce execution-result-based minimum Bayes risk decoding, which selects the sampled program whose execution results agree most closely with those of the other samples.
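A minimal sketch of this selection rule, assuming a hypothetical helper `run` that executes a candidate program on a single test input and returns its output (or None on failure):

    def mbr_exec_select(candidates, test_inputs, run):
        # Execute every candidate on the shared test inputs.
        results = [tuple(run(c, x) for x in test_inputs) for c in candidates]

        def agreement(i):
            # How many other candidates produce exactly the same outputs.
            return sum(results[i] == results[j]
                       for j in range(len(candidates)) if j != i)

        # Keep the candidate whose outputs agree with the most other samples.
        best = max(range(len(candidates)), key=agreement)
        return candidates[best]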
56. Szafraniec et al. (2022) extended the dataset to the Go and Rust languages.
57. … (2022), and CodeGen (Nijkamp et al., 2022).
58. Prefix-Tuning: Optimizing Continuous Prompts for Generation
59. … (2021) presented a method generation dataset in Python based on …
60. Lu et al. (2021) composed token-level and line-level completion tasks.
61. MBPP: Python. Note that we convert the original MBPP dataset (Austin et al., 2021), which has a slightly different format, into the HumanEval format (Chen et al., 2021), with a function signature and docstring.
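As an illustration, here is a minimal sketch of such a conversion, assuming MBPP-style records with "text" (task description) and "code" (reference solution) fields; the signature extraction via `ast` is our own simplification, not the paper's exact procedure:

    import ast

    def to_humaneval_prompt(record):
        # Parse the reference solution and locate its function definition.
        tree = ast.parse(record["code"])
        fn = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
        args = ", ".join(a.arg for a in fn.args.args)
        # Emit a HumanEval-style prompt: signature followed by a docstring.
        return f'def {fn.name}({args}):\n    """{record["text"]}"""\n'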
62. In this setup, we use a complete function in Python as the input prompt. The TransCoder model then generates a complete function in Java or C++.
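A minimal sketch of how such a translation prompt could be assembled; the template below is an assumption for illustration, not the paper's verbatim format:

    def translation_prompt(python_fn: str, target_signature: str, target_lang: str) -> str:
        # The complete Python function is given as context, and the model
        # is asked to complete a function body in the target language.
        return (f"# Python source\n{python_fn}\n\n"
                f"// {target_lang} translation\n{target_signature}\n")

    prompt = translation_prompt(
        "def min_cost(cost, m, n):\n    ...",
        "public static int minCost(int[][] cost, int m, int n) {",
        "Java",
    )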
63. Roziere et al. (2020a) collected a corpus of parallel functions in C++, Java, and Python.
64. … (2019) (e.g., paraphrasing "create a function" to "write one function").
65. …Error at test case 2"
    end
    x = min_cost…

66. Error at 3rd assert statement

67. for (let j = 0; j <= n; j++) {
        dp…

68. …12)){} else { throw 'Error at 2nd assert statement. Value = ' + JSON.stringify(x) }

69. …if(compare(x, 8)){} else { throw 'Error at 1st assert statement…

70. var arg01 : Int = 2
    var arg02 : Int = 2
    var x0 : Int = minCost(cost : arg00, m : arg01, n : arg02)
    var v0 : Int = 8
    assert(x0 == v0)
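The excerpts above come from language-specific test harnesses that raise an error naming the failing assert statement. A minimal sketch of how such harnesses could be emitted with ordinal error messages ("1st", "2nd", "3rd", ...); the helper names are hypothetical and the emitted template mimics the JavaScript excerpts:

    def ordinal(n: int) -> str:
        # 1 -> "1st", 2 -> "2nd", 3 -> "3rd", 11 -> "11th", ...
        suffix = ("th" if 10 <= n % 100 <= 20
                  else {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th"))
        return f"{n}{suffix}"

    def js_asserts(cases):
        # cases: list of (call expression, expected value) pairs.
        lines = []
        for i, (call, expected) in enumerate(cases, start=1):
            lines.append(
                f"if(compare({call}, {expected})){{}} else {{ "
                f"throw 'Error at {ordinal(i)} assert statement. "
                f"Value = ' + JSON.stringify({call}) }}"
            )
        return "\n".join(lines)

    print(js_asserts([("minCost(cost, 2, 2)", 8), ("minCost(cost2, 4, 5)", 12)]))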
71. # Write a function to find the minimum cost path to reach (m, n) from (0, 0) for the given cost matrix cost…

72. Exception -- test case 1 did not pass

73. You are an expert Perl programmer, and here is your task: …