HIGHLIGHTS Program Synthesis with Large Language Models (Jacob Austin, Augustus Odena et al) (summarized by Rohin): Can we use large language models to solve programming problems? In order to answer this question, this paper builds the Mostly Basic Python Programming (MBPP) dataset. The authors asked crowd workers to provide a short problem statement, a Python function that solves the problem, and three test cases checking correctness. On average across the 974 programs, the reference solution has 7 lines of code, suggesting the problems are fairly simple. (This is partly because you can use library functions.) They also edit a subset of 426 problems to improve their quality, for example by making the problem statement less ambiguous or making the function signature more normal. They evaluate pretrained language models on this dataset across a range of model sizes from 0.244B to 137B parameters. (This largest model is within a factor of 2 of GPT-3.) They consider both few-shot and finetuned models. Since we have test cases that can be evaluated automatically, we can boost performance by generating lots of samples (80 in this case), evaluating them on the test cases, and then keeping the ones that succeed. They count a problem as solved if any sample passes all the test cases, and report as their primary metric the fraction of problems solved according to this definition. Note however that the test cases are not exhaustive: when they wrote more exhaustive tests for 50 of the problems, they found that about 12% of the so-called “solutions” did not pass the new tests (but conversely, 88% did). They also look at the fraction of samples which solve the problem, as a metric of the reliability or confidence of the model for a given problem. Some of their findings: 1. Performance increases approximately log-linearly with model size. The trend is clearer and smoother by the primary metric (fraction of problems solved by at least one sample) compared to the secondary metric (fraction of samples that solve their problem). 2. Finetuning provides a roughly constant boost across model sizes. An exception: at the largest model size, finetuning provides almost no benefit, though this could just be noise. 3. It is important to provide at least one test case to the model (boosts problems solved from 43% to 55%) but after that additional test cases don’t make much of a difference (an additional two examples per problem boosts performance to 59%). 4. In few-shot learning, the examples used in the prompt matter a lot. In a test of 15 randomly selected prompts for the few-shot 137B model, the worst one got ~1%, while the best one got ~59%, with the others distributed roughly uniformly between them. Ensembling all 15 prompts boosts performance to 66%. 5. In rare cases, the model overfits to the test cases. For example, in a question about checking whether the input is a Woodall number, there is only one test checking an actual Woodall number (383), and the model generates a program that simply checks whether the input is 383. 6. When choosing the best of multiple samples, you want a slightly higher temperature, in order to have more diversity of possible programs to check. 7. It is important to have high quality problem descriptions as input for the model. The 137B model solves 79% of problems in the edited dataset, but only solves 63% of the original (unedited) versions of those problems. The authors qualitatively analyze the edits on the problems that switched from unsolved to solved and find a variety of things that you would generally expect to help. Now for the controversial question everyone loves to talk about: does the model understand the meaning of the code, or is it “just learning statistical correlations”? One way to check this is to see whether the model can also execute code. Specifically, we provide the ground truth code for one of the problems in the MBPP dataset along with one of the test case inputs and ask the model to predict the output for that test case. Even after finetuning for this task, the 137B model gets only 21% right. This can be boosted to 27% by also providing example test cases for the code before predicting the output for a new test case. Overall, this suggests that the model doesn’t “understand” the code yet. We can take the model finetuned for execution and see how well it does on program synthesis. (We can do this because there are different prompts for execution and synthesis.) For the 8B model, the finetuning makes basically no difference: it’s equivalent to the original few-shot setting. However, for the 137B model, finetuning on execution actually leads to a small but non-trivial improvement in performance (from ~59% to ~63%, I think). This is true relative to either the few-shot or finetuned-for-synthesis setting, since they performed near-identically for the 137B model. So in fact the 137B model finetuned on execution is actually the strongest model, according to synthesis performance. So far we’ve just been looking at how our model performs when taking the best of multiple samples. However, if our goal is to actually use models for program synthesis, we aren’t limited to such simple tricks. Another approach is to have a human provide feedback in natural language when the model’s output is incorrect, and then have the model generate a new program. This feedback is very informal, for example, “Close, but you need to replace the underscore with an empty string”. This provides a huge performance boost: the 137B solves ~31% of problems on its first sample; adding just a single piece of human feedback per problem boosts performance to ~55%, and having four rounds of human feedback gets you to over 65%. The authors also introduce the MathQA-Python dataset, which provides arithmetic word problems and asks models to write programs that would output the correct answer to the problem. They only run a few experiments on this dataset, so I’ve mostly ignored it. The main upshot is that a finetuned 137B parameter model can solve 83.8% of problems with some sample. They don’t report metrics with a single sample, which seems like the more relevant metric for this dataset, but eyeballing other graphs I think it would be around 45%, which you could probably boost a little bit by decreasing the sampling temperature. |