Notes toward an empirical study of programming language effectiveness

Posted on May 19, 2016

I’ve decided to do an empirical study of the effects of programming language choice. As a PL enthusiast, I’d like to know what languages are actually helpful in practice and whether the ones I like are actually any good. The existing research is, well, not great. In this post, I’ll describe my current thoughts. If you have feedback on the design, please leave a comment - I make strong statements below, but I’m sure I’ve gotten some things wrong. If you’re interested in collaborating, get in touch too. This is big and complicated and I’d love to work with others on it.

A randomized controlled trial would be nice if I had infinite money and time

An observational study is the only way to go. Experiments either have to use trivial tasks with small n or become infeasibly expensive, even with institutional backing. Consider a typical software project: we’re talking hundreds of person-hours over a period of at least months. Projects done in a lab over a few days aren’t similar enough to be comparable. Readability, types, testing, documentation, etc. matter much less when it’s all fresh in your mind and you don’t have to explain it to anyone. Refactoring is unlikely to happen at all. Let’s make up some numbers: a smallish project over three months with three developers working full time. We’ll assume they’re cheap and put them at $24/hr (the 25th percentile). $24/hr * 40 hrs/week * 4 weeks/month * 3 months * 3 developers is $34,560. Per case. For a sample size that gets you significance, we’re talking hundreds of thousands to millions of dollars. Maybe more, since I expect the outcome standard deviation to be high.
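To make the arithmetic concrete, here’s the same back-of-the-envelope calculation as a tiny script. The hourly rate, team size, and candidate sample sizes are just the assumptions above, not real figures.

```python
# Back-of-the-envelope cost of running the experiment; every number here is an
# assumption from the paragraph above, not a measured figure.
HOURLY_RATE = 24       # USD/hr, roughly the 25th-percentile developer wage
HOURS_PER_WEEK = 40
WEEKS_PER_MONTH = 4
MONTHS = 3
DEVELOPERS = 3

cost_per_case = HOURLY_RATE * HOURS_PER_WEEK * WEEKS_PER_MONTH * MONTHS * DEVELOPERS
print(f"cost per case: ${cost_per_case:,}")        # $34,560

# Total cost at a few hypothetical sample sizes.
for n in (30, 100, 300):
    print(f"n = {n:>3}: ${n * cost_per_case:,}")
```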

Given the decision to do an observational study, i.e. collect data on projects that already happened, we need to control for confounders. Confounding variables aside, collecting anything that influences outcomes will give us more accurate predictions and allow us to detect smaller effects. The easy/obvious ones are project age, start year, team size, topic area, commercial vs. noncommercial, and company age at project start. It’d be best if I could also measure team skill, but I don’t have a test for that, and even if I did I couldn’t administer it. Experience level would be a decent proxy, but I also won’t be able to measure that. I just have to hope the measurable stuff is an adequate proxy. There are probably some I’m missing.
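As a rough sketch of what I’d record per project, something like the following. The field names are mine and purely illustrative, not a finalized schema.

```python
# Hypothetical per-project record for the observational dataset.
# Field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ProjectRecord:
    name: str
    start_year: int
    age_years: float
    team_size: int
    topic_area: str                   # e.g. "compiler", "web app", "game"
    commercial: bool
    company_age_at_start: float       # years; 0.0 for noncommercial projects
    languages_by_component: dict = field(default_factory=dict)  # e.g. {"renderer": "C++"}
```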

How to judge outcomes

People make various claims about what advantages various programming languages have. Java is good for large enterprise projects, Ruby lets you get something working fast, C lets you write fast programs, Haskell makes it harder to write bugs, etc, etc. In the interest of simplicity I’ve decided to skip anything more precise than user satisfaction. Everything else is difficult to measure consistently and only instrumental to the actual point of the software. I know it’s weird that I’m advocating a subjective measure in the interest of consistency, but more specific things like bug rate, performance and feature count are subjective too and difficult to compare across projects and categories. What counts as a bug? Is this one bug or five? Is this rendering engine faster than this compiler? Is CSS3 support one feature or lots? Etc.

So we’ll survey users. “Have you used a program whose name starts with $RANDOM_LETTER in the last week?” If they haven’t, try again until they have. “What was it? How satisfied are you with the program?” The randomization is necessary: if we ask for the last program used, all the responses will be whatever program led them to fill out the survey (Twitter, email, SurveyMonkey, Firefox); if we pick a specific random program, many respondents won’t have interacted with it, or we’ll only be able to ask about very popular ones. Maybe there’s a standard way to deal with this? Let me know.
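Here’s a minimal sketch of that prompt flow, just to pin down the logic. It’s a toy command-line loop, not an actual survey tool.

```python
# Toy sketch of the random-letter survey flow described above.
import random
import string

def survey_prompt():
    """Keep drawing letters until the respondent reports a matching program."""
    while True:
        letter = random.choice(string.ascii_uppercase)
        used = input(f"Have you used a program whose name starts with '{letter}' "
                     f"in the last week? (y/n) ").strip().lower()
        if used == "y":
            program = input("What was it? ")
            satisfaction = int(input("How satisfied are you with it (1-5)? "))
            return letter, program, satisfaction
```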

It’s possible that respondents’ opinions on the programming language(s) used affect their evaluations. I could collect their opinions for control, but I’m not convinced it’s a real problem, and it’d make the survey take longer and we’d consequently get fewer responses.

How to collect explanatory variables

We need to collect the languages used in the projects, what components they’re used for, confounding variables and any other useful predictors. For open source software this is relatively easy - I can probably write a tool to gather language and component information and maybe even topic; age and contributor count are in source control. For commercial projects we may have to survey people who work there. I expect some programmers would be willing to volunteer: many of us are interested in the topic. Maybe I’m incorrectly assuming that people are like me though.
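For the source-control part, something like the following would probably do. It assumes a local clone and git on the PATH, and leaves the harder part, language and component detection, out.

```python
# Rough sketch: pull project age and contributor count from a local git clone.
# Assumes `git` is on PATH and `repo_path` is an already-cloned repository.
import subprocess

def repo_stats(repo_path):
    def git(*args):
        return subprocess.check_output(["git", "-C", repo_path, *args], text=True).strip()

    # Oldest commit's author date, ISO 8601.
    first_commit_date = git("log", "--reverse", "--format=%aI").splitlines()[0]
    # One line per distinct author.
    contributor_count = len(git("shortlog", "-sn", "HEAD").splitlines())
    return {"first_commit": first_commit_date, "contributors": contributor_count}
```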

If gathering data on proprietary projects proves too difficult, we can exclude them for now although that will mean throwing out a lot of data.

Social factors may also be useful. Finding out whether Linus’ verbal abuse hurts the project would be interesting. Sentiment analysis of mailing list and IRC archives would work. Checking whether a code of conduct is in place is also doable. This is pretty complicated, so I’ll hold off until the easier, more obviously relevant stuff is done.
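If I do get to it, the sentiment pass might look something like this sketch using NLTK’s VADER analyzer. Loading the archives is left out, and the negativity threshold is arbitrary.

```python
# Sketch: score mailing-list messages for negativity with NLTK's VADER analyzer.
# Assumes the lexicon has been fetched once via nltk.download("vader_lexicon").
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def hostility_score(messages):
    """Fraction of messages with strongly negative sentiment (threshold is arbitrary)."""
    analyzer = SentimentIntensityAnalyzer()
    negative = sum(1 for m in messages if analyzer.polarity_scores(m)["compound"] < -0.5)
    return negative / len(messages) if messages else 0.0
```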

The model

This is the complicated bit, and the one where I’m most dissatisfied with my answers.

Every substantial project involves more than one language. This website involves Haskell and a tiny bit of JavaScript. GHC involves Haskell, C, Make, Python, shell, C-- and Perl. Every non-Node web application involves at least two.

The simplest solution is to only consider the “majority” language. I threw that one away. The point is to offer advice about what to choose and in real projects you often have to choose more than one.

So we have the constraint that the model must handle multiple languages and that different projects may have different numbers of languages. Additionally, we’d like to account for which language is used in which component. Maybe it’s bad to write your compiler in C, but good to write your RTS in C. Or Lua could be good for writing game logic but bad for renderers.

The variable number of features and the fact that they have no obvious numerical mapping puts us into the realm of machine learning algorithms. In particular, I intend to use an SVM. The non-empty set of component-language pairs leads straightforwardly to a similarity metric: projects with the same components in the same languages are more similar than projects with the same components in different languages, which in turn are more similar than projects with different components in different languages. I even found a paper on bag-of-tuples kernels. The other features have simple interpretations as dummies and reals.
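To make that concrete, here’s a toy sketch of the idea: represent each project as a set of (component, language) pairs, compute a similarity (Gram) matrix with a simple overlap-counting kernel, and hand it to scikit-learn’s SVR as a precomputed kernel. The kernel below is my own simplification for illustration, not the bag-of-tuples kernel from the paper, and the projects and scores are made up.

```python
# Toy sketch: SVM regression over sets of (component, language) pairs
# via a precomputed similarity matrix. The kernel is a simplification
# for illustration, not the bag-of-tuples kernel from the paper.
import numpy as np
from sklearn.svm import SVR

def pair_kernel(a, b):
    """Exact (component, language) matches count twice; the same component in a
    different language counts once. Both terms are set-intersection counts, so
    the kernel stays positive semidefinite."""
    exact = len(a & b)
    same_component = len({c for c, _ in a} & {c for c, _ in b})
    return exact + same_component

projects = [
    {("compiler", "Haskell"), ("rts", "C")},             # made-up example projects
    {("compiler", "C"), ("rts", "C")},
    {("frontend", "JavaScript"), ("backend", "Ruby")},
]
satisfaction = np.array([4.2, 3.5, 3.9])                  # made-up outcome scores

gram = np.array([[pair_kernel(p, q) for q in projects] for p in projects], dtype=float)
model = SVR(kernel="precomputed").fit(gram, satisfaction)
```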

Using an SVM instead of a more typical statistical technique involves some substantial sacrifices. First, complexity: I’m designing something in a field I just started learning about, and there are plenty of opportunities to get it wrong. Second is interpretability: I won’t be able to say things like “using Go makes programs 30% better than Java on average”. Such statements are pretty vacuous anyway, though. We’re limited to specific hypotheticals. Third is statistical significance: SVMs don’t give p-values or confidence intervals.

I think I have decent answers to the first and third problems. Posting things like this and using cross-validation will help prevent me from fooling myself with a bad model. Bootstrapping can provide confidence intervals, though there are several methods to choose from. The second problem is, as I said, kind of dumb. However, for this to be useful it has to give advice. Coming up with some hypotheticals manually and generating others dynamically from user input would be nice, but the latter may be too computationally expensive if we’re running 10,000 bootstrap samples per query.
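Continuing the toy example above, a percentile bootstrap for one specific hypothetical comparison might look like this. The pair_kernel, projects, and satisfaction names are the made-up pieces from the previous sketch, and the two scenarios are hypotheticals I picked for illustration.

```python
# Toy sketch: percentile-bootstrap confidence interval for one hypothetical
# comparison, reusing pair_kernel / projects / satisfaction from the sketch above.
import numpy as np
from sklearn.svm import SVR

scenario_a = {("compiler", "Haskell"), ("rts", "C")}   # hypothetical project
scenario_b = {("compiler", "C"), ("rts", "C")}         # same project, compiler in C instead

rng = np.random.default_rng(0)
n = len(projects)
diffs = []
for _ in range(10_000):
    idx = rng.integers(0, n, n)                        # resample projects with replacement
    sample = [projects[i] for i in idx]
    gram = np.array([[pair_kernel(p, q) for q in sample] for p in sample], dtype=float)
    model = SVR(kernel="precomputed").fit(gram, satisfaction[idx])
    row_a = np.array([[pair_kernel(scenario_a, q) for q in sample]], dtype=float)
    row_b = np.array([[pair_kernel(scenario_b, q) for q in sample]], dtype=float)
    diffs.append(model.predict(row_a)[0] - model.predict(row_b)[0])

low, high = np.percentile(diffs, [2.5, 97.5])          # 95% percentile interval
print(f"estimated satisfaction difference: [{low:.2f}, {high:.2f}]")
```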

Conclusion

As I said in the introduction, I’d appreciate feedback. I’d like to be proven wrong as soon as possible. Leave a comment or email me, especially if you’re interested in collaborating.

As an aside, I’m looking for programming work. If you or your company is looking for someone, particularly if the work involves programming language design and implementation, get in touch. It needs to be in Portland, remote, or cool enough that I’d want to move. My resume is here.
