Sunday, August 30, 2020

An overview of the science on function length

This is the first part of a blog series where I examine programming concepts from a scientific perspective. In this part I dig up every study relating to function length I could find, fill in some gaps with original research, and examine what we can learn.

Highlights:

  • We find that in older studies (pre-2000) short functions correlate with higher defect density
  • There seem to be no post-2000 studies exclusively focusing on function length, but through original research we find that modern code bases exhibit similar behavior
  • We also find that in empirical experiments short functions make code slower to debug, with weaker evidence suggesting they also slow down adding new features while speeding up modifications to existing code

Introduction

In programming lore it’s hard to find sources glorifying long functions. This is codified, for example, in the widely quoted book ‘Clean Code’ which states:

The first rule of functions is that they should be small. The second rule of functions is that they should be smaller than that. Functions should not be 100 lines long. Functions should hardly ever be 20 lines long.

Favoring short functions isn’t by any means a new practice – a 1986 study, “An Empirical Study of Software Design Practices”, lists “keeping modules small” as a good design practice. The authors also quote a book from 1974, stating that “many programming standards limit module size to one page (or 50-60 source lines of code)”.

What is a small function?

What was meant by a short function in the 80s is different from what counts as a small function today. In a study from 1991 ([1]) the line between small and large functions is drawn at 142 lines, staggeringly long by modern standards. In the same vein, a study from 1984 groups functions into buckets in increments of 50 lines ([2]).

However, nowadays the vast majority of functions are under 50 lines. A quick analysis of Eclipse, a popular open source IDE, reveals it averages about 8.6 lines per method in its source code. The above-mentioned book “Clean Code”, published in 2008, states that functions should rarely be more than 20 lines, and its author has said in other contexts that he often prefers single-line functions and that about half of the functions he writes in Ruby are one-liners.
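
For the curious, here’s a minimal sketch of how such an average can be computed from the per-file metrics of the Eclipse bug dataset ([7]); the file name and column names below are assumptions about the schema and may need adjusting, not the exact ones used in my analysis:

    import pandas as pd

    # Per-file metrics from the Eclipse bug dataset [7]. Assumed columns:
    # MLOC_sum = total method lines of code per file, NOM_sum = number of
    # methods per file. Adjust names/separator to the actual files.
    files = pd.read_csv("eclipse-metrics-files-3.0.csv")

    avg_method_length = files["MLOC_sum"].sum() / files["NOM_sum"].sum()
    print(f"Average method length: {avg_method_length:.1f} lines")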

This shift in function sizes is perhaps partially due to changes in programming languages. In the 80s a Fortran “module” was commonly understood to be a function plus some variables (see e.g. https://www.tutorialspoint.com/fortran/fortran_modules.htm), and the function was the basic building block of software, whereas nowadays most Java or C++ programmers would define a “module” as a class consisting of multiple functions.

Many studies analyze module sizes without specifying what a module is, so I’ve only included studies which specifically state they analyze functions. This, together with the change in what we consider the basic building block of software and the historical shift in function sizes, makes it hard to properly analyze and aggregate the studies, but we can certainly try.

Function length and defect density – pre-2000

The earliest studies concerning the length of functions I could find are from the early 80s. Nowadays most studies examine code at the class and package level, whereas in the 80s object-oriented programming was still quite rare and functions were the main building blocks of software, so it made a lot of sense to examine functions and their error-proneness.

The early research on function length and defect density found that very small functions tended to have higher defect densities. This has been cited in “Code Complete”, where Steve McConnell says: “The routine should be allowed to grow organically up to 100-200 lines; decades of evidence say that routines of such length are no more error prone than shorter routines.” Further, he cites a study that says routines 65 lines or longer are cheaper to develop. (See the appendix for a more extensive take on the studies used in Code Complete.)

To dive further, a study from 1984 ([2]) grouped functions into buckets in increments of 50 lines and measured the error density for each bucket. They found that smaller functions contained, on average, more defects per line:

Table outlining function length and defect density in [2]

A follow-up study ([3]) examined five Fortran projects and states that “no significant relationship was found between module size and fault rate”, even though, in the same paragraph, the authors also state that “nevertheless, [small modules] exhibited the highest average fault rate because a small module with even a single fault will show a very high fault rate”. It’s not clear why the authors claim that module size and fault rate have no significant relationship and then state that the average fault rate is highest for small modules. Another conclusion from [3] is that larger functions are cheaper to develop.

A third empirical study ([1]) analyzed 450 routines and found that “small” routines (those with fewer than 143 source statements, including comments) had 23 percent more errors per line of code than larger routines, but were 2.4 times less expensive to fix. This is interesting since it contrasts with some of the experimental results we’ll see later.

So far it appears short functions are correlated with higher defect density, but due to the large sizes of the examined functions you’d be forgiven for doubting the relevance of these studies to modern software engineering.

Modern defect prediction literature

Modern defect prediction approaches are characterized by large data sets and machine learning methods applied to them. They mainly concentrate on correlating features such as “average method length” and “lines of code” with defects in an attempt to create defect prediction models.

Several such studies have found a correlation between method size and defects. For example, [4] found that the size of the longest method in a class correlates positively with post-release defects. Does this mean we should refactor our long methods into short ones to avoid defects?

The maximum function length in a class correlates with defects in [4]

It turns out the answer is no. In those same studies, the number of methods also correlates with defects:

The number of functions in a class also correlates with defects. [4]

To put it plainly, if we take a long function and split it into smaller ones, we’re not removing a source of defects; we’re simply switching from one to another.

The same effect can be seen in other defect prediction studies. For example, in [5] we see that both “Weighted Methods per Class” and “Average Method Complexity” correlate with defects, where the former is defined as the “number of methods in the class” and the latter as “the average method size for each class”.

In both studies, both metrics have a statistically significant correlation with defects, and this appears to be the trend across many other studies in the defect prediction literature.

The literature doesn’t provide a straightforward way to measure which feature (method length or the number of methods) is more significant when predicting defects. It’s quite possible that measures such as “average method complexity” and “number of methods in a class” simply act as second-order estimators for the number of lines, in which case we’d merely be comparing which measure correlates better with the underlying metric.
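
As a rough illustration (not taken from any of the studies), one way to probe this would be to correlate each metric both with defects and with total class size: if a metric tracks total lines about as strongly as it tracks defects, it may mostly be acting as a proxy for size. The column names below are assumptions, not the schema of any particular dataset:

    import pandas as pd
    from scipy.stats import spearmanr

    # Hypothetical per-class metrics: avg_method_loc, num_methods, loc, bugs.
    classes = pd.read_csv("class_metrics.csv")

    for metric in ["avg_method_loc", "num_methods"]:
        rho_bugs, p_bugs = spearmanr(classes[metric], classes["bugs"])
        rho_loc, p_loc = spearmanr(classes[metric], classes["loc"])
        print(f"{metric}: vs bugs rho={rho_bugs:.2f} (p={p_bugs:.3f}), "
              f"vs total LOC rho={rho_loc:.2f} (p={p_loc:.3f})")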

In addition, we’re interested in examining the length of functions at a more granular level. Finding out that the number of methods and the length of methods both correlate with defects doesn’t give us anything tangible, and this sort of class- or module-level correlation analysis doesn’t get us very far.

However, the datasets used in many of these studies are open, so we can download them and try to examine the relationship between function length and defects ourselves.

Data set analysis

Time to scare the Wikipedia editors among you and do some original research. All the code and data can be found at https://github.com/softwarebyscience/function-length/.

I took two datasets which contain the features we are interested in, such as the number of methods and lines of code per class. The first dataset I worked with is simply called the “Eclipse Bug Database” and contains data from three major Eclipse releases ([7]).

In addition, I used the “Bug Prediction Dataset” from http://bug.inf.usi.ch/download.php as it seemed larger than the Eclipse Bug Database. The data was collected between 2005 and 2009 from 5 different Java projects – for a more detailed explanation, see [6].

The results from the first dataset are quite straightforward:

Here smaller methods clearly have a higher defect density, with the minimum defect density found roughly in classes where the methods average around 10-15 lines. What’s perhaps hard to see from the chart is that defect density starts to increase again very slightly towards the right.
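
The actual analysis code lives in the repository linked above; the following is only a rough sketch of the kind of aggregation behind such a chart, with assumed column names (total_loc, num_methods, bugs) rather than the real dataset schema:

    import pandas as pd

    # Hypothetical per-class data with total lines, method count and bug count.
    classes = pd.read_csv("class_metrics.csv")
    classes = classes[classes["num_methods"] > 0].copy()

    # Average method length per class, rounded to whole lines for bucketing.
    classes["avg_method_loc"] = classes["total_loc"] / classes["num_methods"]
    classes["bucket"] = classes["avg_method_loc"].round().astype(int)

    # Defect density per bucket, expressed as defects per 1000 lines of code.
    per_bucket = classes.groupby("bucket").agg(bugs=("bugs", "sum"),
                                               loc=("total_loc", "sum"))
    per_bucket["defects_per_kloc"] = 1000 * per_bucket["bugs"] / per_bucket["loc"]
    print(per_bucket["defects_per_kloc"])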

The results from the second dataset are a bit more interesting (moving average in red):

Perhaps most surprisingly, the distribution is quite different compared to the first dataset, even though I would’ve expected the distributions to look very similar. I have no good theories as to why this might be the case (apart from a bug in my code, although I did check it several times).

There are a few other interesting details here. The average error density in the second dataset is higher, likely because we consider all bugs over the ~4-year period, whereas the first dataset only considers the last 6 months and a specific type of bug (see [6] and [7] for details). In addition, there are some zero-length functions. These are likely methods of abstract classes, meant to be overridden. When analyzing post-release defects we see a roughly similar picture, if noisier (the graph can be found in the appendix).

At this point it’s worth noting that both datasets are in Java, which is the promised language of one-liner methods. Java classes often have one-line getters and setters which are usually very simple (sometimes even automatically generated), only returning or setting a private variable, and I’d expect this to skew the results in favor of smaller functions.

However, from both datasets we can see that classes with average method length of under 5 lines tend to have increased defect densities.

Practical effects

To move beyond defects and to better understand the effects of short functions, we’ll examine some related experiments. I was able to find 3 experiments which we can use to measure the practical effects of different function sizes.

The oldest one is from 1988 ([8]). It examines the effect of comments and procedures on a 73-line blob of code, presented both as a single monolithic procedure and split into smaller procedures, with some 150 students as test subjects. The study found that students could answer questions best with the monolithic version of the program when comments were added, but when comments were removed the monolithic version became the worst-performing one.

Another experiment was performed in 2016 ([9]). A program had been modified following advice from “Clean Code” by Robert C. Martin, for example by extracting code into functions with descriptive names. Out of 10 participants, 5 were assigned to work on the refactored version and 5 on the unrefactored one. The participants were then asked to complete a series of three tasks: adding new functionality, changing existing functionality, and fixing a bug.

The author doesn’t provide average line counts before and after refactoring, but from the examples provided the refactored methods largely fall between 5-15 non-blank lines, while the original methods range from 23 to 65.

The author concludes that the participants working with longer functions were faster at debugging and at implementing new features, but slower when modifying existing features. However, there appear to be no statistical significance tests, which would have been helpful given the low sample size and the close results on the task of adding new functionality.

Average time taken per task in [9]. Blue represents the average time taken when working with unrefactored code, red with refactored. Task 1 is adding new features, task 2 is modifying existing features, and task 3 is debugging.

The third study is from 2015 ([10]), where professional developers performed a bug-fixing task on refactored and unrefactored code they were unfamiliar with. Since the authors conducted several experiments, I’ve only taken the one where the class structure was not changed by introducing new classes, and the refactoring was done purely by changing the methods of the class. In this experiment “[the authors] extracted several helper methods so that the method names could act as helpful descriptions of the steps in the process”, which is similar to the previous experiment. Unfortunately the authors don’t go into more detail about the type of refactoring and don’t, for example, provide function lengths pre- and post-refactoring. For now, however, I’ll assume the function lengths are similar to the previous experiment, which seems reasonable considering that both experiments reference the same source of programming best practices (“Clean Code” by Robert C. Martin).

In the experiment the bug fix took less time in the original, non-refactored version (roughly 8.5 minutes vs 14.5 minutes), and the result was statistically significant. The authors suggest that this is due to developers being used to the old conventions, but it can also be taken to support the results of the previous experiment, where bug-fixing times were longer with shorter functions.

The above research and interpretations have caveats, but because two of the studies seem to support the idea that shorter functions (under 20 lines long) are associated with longer bug-fix times, I’m willing to accept this idea until further research comes along. A case can also be made for smaller functions being better for modifying existing functionality and worse for implementing new functionality, but the evidence there is weaker.

Finally, it’s surprising how well longer functions do with code comprehension – it’s hard to conclude from any of the studies that longer functions lose out to short functions when it comes to understanding the code.

Analysis of Results

All of the studies measuring defect density found increased defect density for smaller functions. One possible explanation was suggested by [2], who proposed that the increase in errors was due to what they called “interface errors” – that is, “those that were associated with structures existing outside the module’s local environment but which the module used.” This would include errors such as calling the wrong function.

On a more general level, a similar relationship between component size and defect density was found in an overview by [11]. While they didn’t directly examine function length, their explanation is interesting: the authors claimed that the optimal size of a component follows from the human short-term memory limit of 7 ± 2 items. Following this logic, it’s possible that any ‘optimal’ average function length (if it exists) is determined by the limits of short-term memory. This theory becomes interesting when we look back at our original research, where classes with average method lengths of 5-7 lines appeared to have the fewest defects in the second dataset, which is very similar to the 7 ± 2 range. However, I wouldn’t read too heavily into this – as I mentioned earlier, the nature of Java perhaps makes the “optimal” average method size appear smaller than it actually is.

Another model is presented by [12], where it’s found that the expected number of defects in a module (an arbitrary software module, not necessarily a function) is best modelled with a function of the form ax + b rather than just ax. This essentially says that creating a module carries a cost in and of itself, and it’s plausible that creating functions always carries some overhead too. Functions will, for example, always introduce an additional indirection and rely on the programmer to think of a good name, and these might add up to bugs when another programmer calls the wrong function or misinterprets the function name. Such a model might explain the U-shaped curve we saw in our second dataset analysis.
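
To spell out why a fixed per-module cost would show up as elevated defect density in small modules (my own back-of-the-envelope reasoning, not a result stated in [12]): if the expected number of defects in a module of x lines is

    defects(x) = ax + b

then the defect density is

    density(x) = defects(x) / x = a + b/x

which shrinks towards a as x grows, so the smallest modules show the highest density even when b is small.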

Criticism of defect density/function size connection

The above studies aren’t without criticism. A study from 1999 ([13]) points out that the relationship between size and defects isn’t causal – the number of lines simply correlates with defects rather than directly causing them. This, of course, means we cannot reliably predict defects based on it.

In addition, they criticize the practice of fitting curves to extremely noisy data, and point out that while we can fit a curve relating module size and defect density, it doesn’t mean we can predict defects with much reliability.

While I don’t fundamentally disagree with any of the above criticism, I believe that, for example, examining averages still has its place alongside the more complex models the authors propose. This is especially true when advice favoring small functions is so prevalent and seemingly not based on any science.

Naturally, as with any study, there are many other variables we have possibly missed. For example, maybe small functions are more common in classes or packages which get changed very often, and since changes are associated with increased defects, so are smaller functions. However, examining these theories is beyond the scope of this post.

Conclusions

Considering that short functions tend to lead to longer debug times, and that very short functions tend to have higher defect densities (in both historical and modern datasets), the case for using very short functions becomes weak.

As we could see from the historical studies, the definition of ‘short’ has changed over time. Still, if we restrict ourselves to post-2000 data and studies and look at functions 1-3 lines long (short enough to be classified as short in any study), the evidence is lopsided: there are few reasons for using them, and many reasons for not using them. As such, software developers should be wary of breaking their code into too-small pieces, and actively avoid introducing very short (1-3 line) functions when given the choice. At the very least, unnecessary single-line functions (i.e. excluding getters, setters, etc.) should be all but banned.

For longer functions, the picture is less clear. Pre-2000 studies don’t appear to show any increase in defect density for longer functions, but our dataset exploration somewhat contradicts this. In the experiments, longer functions appeared to bring improvements at least to debug times. However, it’s hard to define a threshold for ‘long’ functions, as many studies don’t provide any figures on that.

Appendix

Welcome to the boring section. This section contains things like disclaimers, studies that weren’t included in the above post, and some bad science.

First of all, I have tried hard to separate my views from those expressed in the papers. However, this is not a scientific paper or a peer reviewed study, and I’ve drawn conclusions based on somewhat incomplete data using my personal judgement. I believe this is crucial if we’re going to make programming more grounded in good science, especially since there are many people making claims with zero scientific evidence.

Why did I exclude this or that study?

For this post I’ve waded through probably twice as many studies as I’ve ended up referencing. Most studies that seemed relevant turned out not to be on closer examination. For older studies, a major reason was that many examined the relationship between “modules” and defects, not functions/methods/routines and defects. Some studies (such as [2]) did define a module to be a subroutine, in which case I’ve included them here, but others simply didn’t define the term or defined a module to be something else entirely.

It even appears that some of the studies cited in Code Complete in relation to function length actually examine things that are not functions. Specifically, Code Complete says “another study found that routine size was not correlated with errors, even though structural complexity and amount of data were correlated with errors”, referencing [12]. However, this seems to be a misinterpretation when looking at what the authors mean by ‘module’:

Programs are composed of separately compilable subprograms called “modules;” typically, each module supports one or more system functions.

“Identifying Error-Prone Software – An Empirical Study”, Shen et al, 1985

Further, the modules the study examines are too large to be functions, even for the time – their average size ranges from 250 to 1000 lines, depending on the project. Perhaps this will be fixed in the next edition of Code Complete.

As for defect prediction, even though most defect prediction studies turn out to be inapplicable to our use case, I’ve covered them above along with my reasoning, since even some scientific papers (such as “How Do Java Methods Grow” from 2015) have drawn mistaken conclusions about method sizes based on the studies examined here.

I’ve excluded the criticism from Rosenberg’s 1997 paper “Some Misconceptions About Lines of Code”, which fails to note that defect density is the main metric used to criticize small functions, as pointed out by Malaiya et al. in 2000.

The criticism from “Quantitative analysis of faults and failures in a complex software system” also doesn’t focus on function sizes.

While two studies appeared potentially relevant, I was unable to access them. These are “An empirical study of the effects of modularity on program modifiability” by Korson and Vaishnavi from 1986 and “Some factors affecting program repair maintenance: an empirical study” by Vessey and Weber from 1983.

When analyzing [10], the conclusion wouldn’t have changed even if I had taken all the tasks into account, as 3 of the 5 tasks favored the unrefactored code, 1 task favored the refactored one, and 1 task showed no statistically significant difference.

The average number of lines for Eclipse is based on the Eclipse bug dataset ([7]).

Post-release defects in the “Bug Prediction Dataset”:

Graphing the average method length against the defect density of a class for the Bug Prediction Dataset gives us the following graph:

I believe it roughly matches the graph of overall defects, as the minimum defect density is found in classes where the average method length is 5-10 lines, but it is quite a bit noisier due to the smaller amount of data.

Bibliography

[1] Analyzing Error-Prone System Structure, Selby and Basili, 1991

[2] Software Errors and Complexity: An Empirical Study, Basili and Perricone, 1984

[3] An Empirical Study of Software Design Practices, Card et al., 1986

[4] Mining Metrics to Predict Component Failures, Nagappan et al., 2005

[5] Significance of Different Software Metrics in Defect Prediction, Jureczko, 2011

[6] An Extensive Comparison of Bug Prediction Approaches, Lanza et al., 2010

[7] Predicting Defects for Eclipse, Zimmermann, Premraj and Zeller, 2007

[8] Program Readability: Procedures Versus Comments, Terry, 1988

[9] Effects of Clean Code on Understandability, Henning Grimeland Koller, Master’s Thesis, 2016

[10] Old habits die hard – Why refactoring for understandability does not give immediate benefits, Ammerlaan et al., 2015

[11] Reexamining the Fault Density – Component Size Connection, Hatton, 1997

[12] Identifying Error-Prone Software – An Empirical Study, Shen et al., 1985

[13] A Critique of Software Defect Prediction Models, Fenton and Neil, 1999


