Showing posts with label code complexity. Show all posts

Tuesday, February 6, 2018

The IRS made me a better programmer

We US taxpayers have opinions of the IRS, the government agency tasked with the collection of taxes. Those opinions tend to be strong and tend to fall on the "not favorable" side. Yet the IRS did me a great favor and helped me become a better programmer.

The assistance I received was not through employment at the IRS, nor did they send me a memo entitled "How to be a better programmer". They did give me some information, not related to programming, yet it turned out to be the most helpful advice on programming in my career.

That advice was the simple philosophy: One operation at a time.

The IRS uses this philosophy when designing the forms for tax returns. There are a lot of forms, and some cover rather complex notions and operations, and all must be understandable by the average taxpayer. I've looked at these forms (and used a number of them over the years) and while I may dislike our tax laws, I must admit that the forms are as easy and understandable as tax law permits. (Tax law can be complex with intricate concepts, and we can consider this complexity to be "essential" -- it will be present in any tax form no matter how well you design it.)

Back to programming. How does the philosophy of "one operation at a time" change the way I write programs?

A lot, as it turns out.

The philosophy of "one operation at a time" is directly applicable to programming. Well, my programming, at least. I had, over the years, developed a style of combining operations onto a single line.

Here is a simplified example of my code, using the "multiple operations" style:

Foo harry = y.elements().iterate().select('harry')

It is concise, putting several activities on a single line. This style makes for shorter programs, but not necessarily more understandable programs. Shorter programs are better when the shortness is measured in operations, not raw lines. Packing a bunch of operations -- especially unrelated operations -- onto a single line is not simplifying a program. If anything, it is making it more complex, as we tend to assume that operations on the same line are somehow connected.

I changed my style. I shifted from multi-operation lines to single operation lines, and I was immediately pleased with the result.

Here's the example from above, but with the philosophy of one operation per line:

elements = y.elements()
Foo harry = nil
elements.each do |element|
  harry = element if element.name == 'harry'
end

I have found two immediate benefits from this new style.

The first benefit is a better experience when debugging. When stepping through the code with the debugger, I can examine intermediate values. Debuggers are line-oriented, and execute the single-line version all in one go. (While there are ways to force the debugger to execute each function separately, there are no variables to hold the intermediate results.)
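
Here is that difference as a sketch in Python (illustrative names, not the original Ruby-style example):

```python
elements = [{"name": "tom"}, {"name": "harry"}]

# Multiple operations on one line: the debugger executes it in one go,
# leaving no intermediate values to inspect.
harry_chained = next((e for e in elements if e["name"] == "harry"), None)

# One operation at a time: each step leaves a named value that the
# debugger can display while stepping.
matches = [e for e in elements if e["name"] == "harry"]
harry = matches[0] if matches else None

print(harry == harry_chained)  # True
```

In the second form, `matches` is available for inspection before the final assignment runs.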

The second benefit is that it is easier to identify duplicate code. By splitting operations onto multiple lines, I find it easier to identify duplicate sequences. Sometimes the code is not an exact duplicate, but the structure is the same. Sometimes portions of the code is the same. I can refactor the duplicated code into functions, which simplifies the code (fewer lines) and consolidates common logic in a single place (one point of truth).
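
As a small illustration (in Python, with invented names and numbers), here is the kind of sequence that, once visible as discrete steps, turns out to appear in two places and can be hoisted into a function:

```python
def total(amounts, rate):
    # The once-duplicated sequence: scale each amount, accumulate the sum.
    result = 0.0
    for amount in amounts:
        result += amount * rate
    return result

# Two call sites that each previously contained their own copy of the loop.
subtotal = total([10.0, 20.0], 1.0)
taxed = total([10.0, 20.0], 1.07)
print(subtotal)  # 30.0
```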

Looking back, I can see that my code is somewhat longer, in terms of lines. (Refactoring common logic reduces it somewhat, but not enough to offset the expansion of multiline operations.)

Yet the longer code is easier to read, easier to explain to others, and easier to fix. And since the programs I am writing are much smaller than the computer's capabilities, there is little expense in slightly longer programs. I suspect that compilers (for languages that use them) optimize a lot of my "one at a time" operations and condense them, perhaps better than I can. The executables produced are about the same size as before. Interpreters, too, seem to have little problem with multiple simple statements, and run the "one operation" version of a program just as fast as the "multiple operations" version. (This is my perception; I have not conducted formal time trials of the two versions.)

Simpler code, easier to debug, and easier to explain to others. What's not to like?

Sunday, April 10, 2016

Complexity of programming languages

Are programming languages becoming more or less complex?

It is a simple question, and like many simple questions, the answer is not so simple.

Let's look at a simple program in some languages. The simple program will print the numbers from 1 to 10. Here are programs in FORTRAN, BASIC, Pascal, C, C++, C#, and Python:

FORTRAN 66 (1966)

      DO 100 I = 1, 10
100   WRITE (6, 200) I
200   FORMAT (I5)
      STOP
      END

BASIC (1965)

10 FOR I = 1 TO 10
20 PRINT I
30 NEXT I
99 END

Pascal (1970)

program numbers;
var
  i: integer;
begin
  for i := 1 to 10 do
    WriteLn(i);
end.

C (1978)

#include <stdio.h>

int main(void)
{
    int i;

    for (i = 1; i <= 10; i++)
        printf("%d\n", i);
}

C++ (1983)

#include <iostream>

int main()
{
    for (unsigned int i = 1; i <= 10; i++)
        std::cout << i << std::endl;

    return 0;
}

C# (2000)

using System;

class Program
{
    public static void Main(string[] args)
    {
        for (int i = 1; i <= 10; i++)
            Console.WriteLine(i);
    }
}

Python (2000)

for i in range(1, 11):
    print(i)

From this small sampling, a few things are apparent.

First, the programs vary in length. Python is the shortest with only two lines. It's also a recent language, so are languages becoming more terse over time? Not really, as FORTRAN and BASIC are the next shortest languages (with 5 and 4 lines, respectively) and C#, a contemporary of Python, requires 10 lines.

Second, the formatting of program statements has changed. FORTRAN and BASIC, the earliest languages in this set, have strong notions about columns and lines. FORTRAN limits line length to 72 characters, reserves the first five columns for statement labels, and uses a sixth column for continuation characters (which allow statements to exceed the 72-character limit). BASIC relaxed the column restrictions but added the requirement of a line number on each line. Pascal, C, C++, and C# care nothing about columns and lines, relying on tokens in the code to separate statements. Python relies on indentation to define blocks.

Some languages (BASIC, C#) can print things by simply mentioning them. Other languages need format specifications: FORTRAN has FORMAT statements to specify the exact layout of the output, and C has a printf() function that needs similar formatting information. I find the mechanisms of BASIC and C# easier to use than those of FORTRAN and C.

Let's consider a somewhat more complex program, one that lists a set of prime numbers. We'll look at BASIC and Lua, which span the computing age.

Lua

local N = 100
local M = 10
function PRIME()  -- PROCEDURE DECLARATION;
  local X, SQUARE, I, K, LIM, PRIM -- DECLARATION OF VARIABLES;
  local P, V = {}, {}
  P[1] = 2 -- ASSIGNMENT TO FIRST ELEMENT OF p;
  print(2) -- OUTPUT A LINE CONTAINING THE NUMBER 2;
  X = 1
  LIM = 1
  SQUARE = 4
  for I = 2, N do -- LOOP. I TAKES ON 2, 3, ... N;
    repeat -- STOPS WHEN "UNTIL" CONDITION IS TRUE;
      X = X + 2
      if SQUARE <= X then
        V[LIM] = SQUARE
        LIM = LIM + 1
        SQUARE = P[LIM] * P[LIM]
      end
      local K = 2
      local PRIM = true
      while PRIM and K < LIM do
        if V[K] < X then
          V[K] = V[K] + P[K]
        end
        PRIM = X ~= V[K]
        K = K + 1
      end
    until PRIM -- THIS LINE CLOSES THE REPEAT
    P[I] = X
    print(X)
  end
end
PRIME()


BASIC

100 LET N = 100
110 LET M = 10
200 DIM P(100), V(100)
300 LET P(1) = 2
310 PRINT P(1)
320 LET X = 1
330 LET L = 1
340 LET S = 4
350 FOR I = 2 TO N
360  REM    repeat -- STOPS WHEN "UNTIL" CONDITION IS TRUE;
370   LET X = X + 2
380   IF S > X THEN 420
390    LET V(L) = S
400    LET L = L + 1
410    LET S = P(L)^2
420   REM
430   LET K = 2
440   LET P = 1
450   REM while PRIM and K < LIM do
455   IF P <> 1 THEN 520
460   IF K >= L THEN 520
470    IF V(K) >= X THEN 490
480     LET V(K) = V(K) + P(K)
490    REM
500    LET P = 0
501    IF X = V(K) THEN 510
502    LET P = 1
510    LET K = K + 1
515   GOTO 450
520  IF P <> 1 THEN 360
530  LET P(I) = X
540  PRINT X
550 NEXT I
999 END

Both programs work, and they produce identical output. The Lua version may be a bit easier to read, given that its variable names can be longer than a single letter.

They are about the same size and complexity. Two versions of a program, one from today and one from the early years of computing, yet they have similar complexity.

Programming languages encode operations into pseudo-English instructions. If the measure of a programming language's capability is its capacity to represent operations in a minimal number of steps, then this example shows that programming languages have not changed over the past five decades.

Caution is advised. These examples (printing a sequence of numbers and calculating prime numbers) may be poor representatives of typical programs. Perhaps we should withhold judgement until we consider more (and larger) programs. After all, very few people in 2016 use BASIC; there must be a reason programmers have selected modern languages.

Perhaps it is better to keep asking the question and examining our criteria for programming languages.

Sunday, April 6, 2014

How to untangle code: Make loops smaller

Loops all too often do many things. Some programmers pack as much as possible into a single loop, possibly out of a desire to optimize, to reduce the work performed by the machine. The thought is that one loop is more efficient than several, because 1) there is only one loop to "set up" and "take down", and 2) the computer can perform several tasks on each item as it walks along an array of data structures. This is not necessarily so.

Perhaps the code is:

for (unsigned int i = 0; i < num_items; i++)
{
    member_a[i] = some_value(i);
    member_b[i] = something + some_other_value(i);
    member_c[i] = member_a[i] + member_b[i];
}

The "optimization" of packing loops full of functionality is not necessarily a true optimization. The notion that loops are expensive is an old one, dating back to the mainframe era (when it was true). Since that time, we have designed faster processors, created more capable instruction sets, and improved compilers.

Packing loops full of instructions has a cost: the code is more complex. Being more complex, it is harder to read. (My example is simple, of course. Real code is more complicated. But I think the idea holds.)

I change this (admittedly simple) single loop to a set of smaller loops:

for (unsigned int i = 0; i < num_items; i++)
{
    member_a[i] = some_value(i);
}
for (unsigned int i = 0; i < num_items; i++)
{
    member_b[i] = something + some_other_value(i);
}
for (unsigned int i = 0; i < num_items; i++)
{
    member_c[i] = member_a[i] + member_b[i];
}

The revised code looks longer (and to some, terribly inefficient) but look again. Each loop is simple and can be easily understood. Each loop performs one task, and one task only.

Moreover, languages that support vector operations (and there are a few such languages) can see their code simplified further:

for (unsigned int i = 0; i < num_items; i++)
{
    member_a[i] = some_value(i);
}
for (unsigned int i = 0; i < num_items; i++)
{
    member_b[i] = something + some_other_value(i);
}
member_c = member_a + member_b;

Using smaller loops isolates the steps in the loop. The smaller loops can be optimized independently.

If the functions 'some_value()' and 'some_other_value()' can be changed to return vectors of values, the code can be simplified further:

member_a = some_values();
member_b = something + some_other_values();
member_c = member_a + member_b;

Doesn't that look simpler than the earlier versions of the code?

Languages without vector operations can approach the brevity of vector operations. Assuming an object-oriented language (without operator overloading), one could write:

member_a = some_values();
member_b = Numbers.Add(something, some_other_values());
member_c = Numbers.Add(member_a, member_b);

Assuming you had the functions:

double[] Numbers.Add(double value, double[] values);
double[] Numbers.Add(double[] values1, double[] values2);

and these functions are not that hard to write.
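
In Python terms (a sketch; the C#-style signatures above are the author's), the two overloads reduce to a pair of short helpers:

```python
def add_scalar(value, values):
    # Mirrors Numbers.Add(double, double[]): add a scalar to each element.
    return [value + v for v in values]

def add_vectors(values1, values2):
    # Mirrors Numbers.Add(double[], double[]): element-wise sum.
    return [a + b for a, b in zip(values1, values2)]

member_a = [1.0, 2.0, 3.0]
member_b = add_scalar(10.0, [0.5, 1.5, 2.5])
member_c = add_vectors(member_a, member_b)
print(member_c)  # [11.5, 13.5, 15.5]
```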

Code can be complex, sometimes because we think we are "helping" the computer. It's often better to help ourselves, to write programs that are simple and easy to understand.

Thursday, November 14, 2013

Instead of simplicity, measure complexity

The IEEE Computer Society devoted their November magazine issue to "Simplicity in IT". Simplicity is a desirable trait, but I have found that one cannot measure it. Instead, one must measure its opposite: complexity.

Some qualities cannot be measured. I learned this lesson as a sysadmin, managing disk space for multiple users and groups. We had large but finite disk resources (resources are always finite), shared by different teams. Despite the large disk resources, the combined usage of the teams exceeded our resources -- in other words, we "ran out of free space". My job was to figure out "where the space had gone".

I quickly learned that "where the space had gone" was the wrong question. It is impossible to answer, because space doesn't "go" anywhere. I substituted new metrics: who is using space, how much, and how does that compare to their usage last week? These were possible to measure, and more useful. A developer using more than four times the space of the next-heaviest user, and more than ten times that of the average developer, is (probably) working inefficiently.

The metric "disk space used by developer" is measurable. The metric "change in usage from last week" is also measurable. In contrast, the metric "where did the unallocated space go" is not.

The measure of simplicity is similar. Instead of measuring simplicity, measure the opposite: complexity. Instead of asking "why is our application (or code, or UI, or database schema) not simple?", ask instead "where is the complexity?"

Complexity in source code can be easily measured. There are a number of commercial tools, a number of open source tools, and I have written a few tools for my own use. Anyone who wants to measure the complexity of their system has tools available to them.

Measuring the change in complexity (such as the change from one week to the next) involves taking measurements at one time and storing them, then taking measurements at a later time and comparing them against the earlier measurements. That is a little more involved than merely taking measurements, but not much.
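
A sketch of that bookkeeping in Python (the metric names and values are invented for illustration):

```python
def diff_metrics(previous, current):
    # Change in each metric since the stored measurement; a metric that is
    # new this week is treated as starting from zero.
    return {name: current[name] - previous.get(name, 0) for name in current}

last_week = {"lines": 1200, "cyclomatic": 270}
this_week = {"lines": 1350, "cyclomatic": 170}
print(diff_metrics(last_week, this_week))  # {'lines': 150, 'cyclomatic': -100}
```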

Identifying the complex areas of your system gives you an indicator. It shows you the sections you must change to achieve simplicity. That work may be easy, or may be difficult; a measure of complexity merely points to the problem areas.

* * * *

When I measure code, I measure the following:

  • Lines of code
  • Source lines of code (non-comments)
  • Cyclomatic complexity
  • Boolean constants
  • Number of directly referenced classes
  • Number of indirectly referenced classes
  • Number of directly dependent classes
  • Number of indirectly dependent classes
  • Class interface complexity (a count of member variables and public functions)

I find that these metrics let me quickly identify the "problem classes" -- the classes that cause the most defects. I can work on those classes and simplify the system.
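
The first two metrics in the list are the easiest to compute. A minimal sketch in Python (handling only '#'-style comments, for illustration):

```python
def count_lines(source):
    # Returns (total lines, source lines); a source line is non-blank
    # and not a comment.
    lines = source.splitlines()
    sloc = [line for line in lines
            if line.strip() and not line.strip().startswith("#")]
    return len(lines), len(sloc)

code = "# setup\n\nx = 1\ny = x + 1\n"
print(count_lines(code))  # (4, 2)
```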

Tuesday, September 17, 2013

When programming, think like a computer

When programming, it is best to think like a computer. It is tempting to think like a human. But humans think very differently than computers (if we allow that computers think), and thinking like a human leads to complex programs.

This was brought home to me while reading William Conley's "Computer Optimization Techniques" which discusses the solutions to Integer Programming problems and related problems. Many of these problems can be solved with brute-force calculations, evaluating every possible solution and identifying the most profitable (or least expensive).

The programs for these brute-force methods are short and simple. Even in FORTRAN, they run less than fifty lines. Their brevity is due to their simplicity. There is no clever coding, no attempt to optimize the algorithm. The programs take advantage of the computer's strength of fast computation.

Humans think very differently. They tire quickly of routine calculations. They can identify patterns and have insights into shortcuts for algorithms. They can take creative leaps to solutions. These are all survival skills, useful for dealing with an uncertain environment and capable predators. But they are quite difficult to encode into a computer program. So hard that it is often more efficient to use brute-force calculations without insights and creative leaps. The time spent making the program "smart" is larger than the time saved by the improved program.
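
The flavor of such a brute-force program, sketched here in Python rather than FORTRAN (the profit and weight numbers are invented): enumerate every combination, keep the best one that satisfies the constraint.

```python
from itertools import product

def best_plan(profits, weights, capacity):
    # Evaluate every 0/1 selection; no shortcuts, no insight -- just
    # the computer's strength of fast computation.
    best_profit, best_choice = 0, None
    for choice in product((0, 1), repeat=len(profits)):
        weight = sum(c * w for c, w in zip(choice, weights))
        profit = sum(c * p for c, p in zip(choice, profits))
        if weight <= capacity and profit > best_profit:
            best_profit, best_choice = profit, choice
    return best_profit, best_choice

print(best_plan([60, 100, 120], [10, 20, 30], 50))  # (220, (0, 1, 1))
```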

Brute-force is not always the best method for calculations. Sometimes you need a smart program, because the number of computations is staggering. In those cases, it is better to invest the time in improvements. (To his credit, Conley shows techniques to reduce the computations, sometimes by increasing the complexity of the code.)

Computing efficiency (that is, "smart" programs) has been a concern since the first computing machines were made. Necessary at first, the need for efficiency drops over time. Mainframe computers became faster, which allowed for "sloppy" programs ("sloppy" meaning "anything less than maximum efficiency").

Minicomputers were slower than mainframes, significantly less expensive, and another step away from the need for optimized, "smart" programs. PCs were another step. Today, smart phones have more computing power than PCs of a few years ago, at a fraction of the price. Cloud computing, a separate branch in the evolution of computing, offers cheap, readily-available computing power.

I won't claim that computing power is (or will ever be) "too cheap to meter". But it is cheap, and it is plentiful. And with cheap and plentiful computing power, we can build programs that use simple methods.

When writing a computer program, think like a computer. Start with a simple algorithm, one that is not clever. Chances are, it will be good enough.

Thursday, September 5, 2013

Measure code complexity

We measure many things on development projects, from the cost to the time to user satisfaction. Yet we do not measure the complexity of our code.

One might find this surprising. After all, complexity of code is closely tied to quality (or so I like to believe) and also an indication of future effort (simple code is easier to change than complicated code).

The problem is not in the measurement of complexity. We have numerous techniques and tools, spanning the range from "lines of code" to function points. There are commercial tools and open source tools that measure complexity.

No, the problem is not in techniques or tools.

It is a matter of will. We don't measure complexity because, in short, we don't want to.

I can think of a few reasons that discourage the measurement of source code complexity.

- The measurement of complexity is a negative one. That is, more complexity is worse. A result of 170 is better than a result of 270, and this inverted scale is awkward. We are trained to like positive measurements, like baseball scores. (Perhaps the golf enthusiasts would see more interest if they changed their scoring system.)

- There is no direct way to connect complexity to cost. While we understand that a complicated code base is harder to maintain than a simple one, we have no way of converting that extra complexity into dollars. If we reduce our complexity from 270 to 170 (a reduction of 37 percent), do we reduce the cost of development by the same percentage? Why or why not? (I suspect that there is a lot to be learned in this area. Perhaps several Masters theses can be derived from it.)

- Not knowing the complexity shifts risk from managers to developers. In organizations with antagonistic relations between managers and developers, a willful ignorance of code complexity pushes risk onto developers. Estimates, if made by managers, will ignore complexity. Estimates made by developers may be optimistic (or pessimistic) but may be adjusted by managers. In either case, schedule delays will be the fault of the developer, not the manager.

- Developers (in shops with poor management relations) may avoid the use of any metrics, fearing that they will be used for performance evaluations.

Looking forward, I can see a time when we do measure code complexity.

- A company considering the acquisition of software (including the source code), may want an unbiased opinion of the code. They may not completely trust the seller (who is biased towards the sale) and they may not trust their own people (who may be biased against 'outside' software).

- A project team may want to identify complex areas of their code, to identify high-risk areas.

- A development team may wish to estimate the effort for maintaining code, and may include the complexity as a factor in that effort.

The tools are available.

I believe that we will, eventually, consider complexity analysis a regular part of software development. Perhaps it will start small, like the adoption of version control and automated testing. Both of those techniques were at one time considered new and unproven. Today, they are considered 'best practices'.

Wednesday, October 19, 2011

Engineering vs. craft

Some folks consider the development of software to be a craft; others claim that it is engineering.

As much as I would like for software development to be engineering, I consider it a craft.

Engineering is a craft that must work within measurable constraints, and must optimize some measurable attributes. For example, bridges must support a specific, measurable load, and minimize the materials used in construction (again, measurable quantities).

We do not do this for software.

We manage not software but software development. That is, we measure the cost and time of the development effort, but we do not measure the software itself. (The one exception is measuring the quality of the software, but that is a difficult measurement and we usually measure the number and severity of defects, which is a negative measure.)

If we are to engineer software, then we must measure the software. (We can -- and should -- measure the development effort. Those are necessary measurements. But they are not, by themselves, sufficient for engineering.)

What can we measure in software? Here are some suggestions:

 - Lines of code
 - Number of classes
 - Number of methods
 - Average size of classes
 - Complexity (cyclomatic, McCabe, or whatever metric you like)
 - "Boolean complexity" (the number of boolean constants used within code that are not part of initialization)
 - The fraction of classes that are immutable
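
Several of these are simple text measures, and even cyclomatic complexity can be approximated crudely. A sketch in Python (a token count, not a real parser; the keyword list and snippet are illustrative):

```python
import re

DECISION_TOKENS = ("if", "for", "while", "case", "&&", "||")

def cyclomatic_estimate(source):
    # Rough cyclomatic complexity: 1 + the number of decision points.
    tokens = re.findall(r"\w+|&&|\|\|", source)
    return 1 + sum(tokens.count(t) for t in DECISION_TOKENS)

snippet = "if (a && b) { while (c) { d(); } }"
print(cyclomatic_estimate(snippet))  # 4
```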

Some might find the notion of measuring lines of code abhorrent. I will argue that it is not the metric that is evil, it is the use of it to rank and rate programmers. The misuse of metrics is all too easy and can lead to poor code. (You get what you measure and reward.)

Why do we not measure these things? (Or any other aspect of code?)

Probably because there is no way to connect these metrics to project cost. In the end, project cost is what matters. Without a translation from lines of code (or any other metric) to cost, the metrics are meaningless. The code may be one class of ten thousand lines, or one hundred classes of one hundred lines each; without a conversion factor, the cost of each design is the same. (And the cost of each design is effectively zero, since we cannot convert design decisions into costs.)

Our current capabilities do not allow us to assign cost to design, or code size, or code complexity. The only costs we can measure are the development costs: number of programmers, time for testing, and number of defects.

One day in the future we will be able to convert complexity to cost. When we do, we will move from craft to engineering.