Showing posts with label untangling code. Show all posts

Sunday, June 15, 2014

Untangle code with automated testing

Of all of the tools and techniques for untangling code, the most important is automated testing.

What does automated testing have to do with the untangling of code?

Automated testing provides insurance. It provides a back-stop against which developers can make changes.

The task of untangling code, of making code readable, often requires changes across multiple modules and multiple classes. While a few improvements can be made to single modules (or classes), most require changes in multiple modules. Improvements can require changes to the methods exposed by a class, or remove access to member variables. These changes ripple through other classes.

Moreover, the improvement of tangled code often requires a re-thinking of the organization of the code. You move functions from one class to another. You rename variables. You split classes into smaller classes.

These are significant changes, and they can have significant effects on the operation of the code. Of course, while you want to change the organization of the code, you want the results of calculations to remain unchanged. That's how automated tests can help.

Automated tests verify that your improvements have no effect on the calculations.

The tests must be automated. Manual tests are expensive: they require time and attention. Manual tests are easy to skip. Manual tests are easy to "flub". Manual tests can be difficult to verify. Automated tests are consistent, accurate, and most of all, cheap. They do not require attention or effort. They are complete.

Automated tests let programmers make significant improvements to the code base and have confidence that their changes are correct. That's how automated tests help you untangle code.
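A minimal sketch of such a test, assuming a hypothetical total_expense() function that is about to be reorganized. Plain assert() stands in for whatever test framework the project uses; the function name and the sample figures are illustrative only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical calculation whose internals are about to be reorganized.
double total_expense(const std::vector<double>& expenses)
{
    double total = 0.0;
    for (double expense : expenses)
    {
        total += expense;
    }
    return total;
}

// Pins the current result. Run it before and after each refactoring step;
// if the reorganized code changes the answer, the assertion fails.
void test_total_expense_is_unchanged()
{
    const std::vector<double> expenses{100.0, 200.0, 300.0};
    assert(total_expense(expenses) == 600.0);
}
```

Because the test costs nothing to run, it can be repeated after every rename, move, or split.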

Sunday, June 8, 2014

Untangle code with small classes

If you want to simplify code, build small classes.

I have written (for different systems) classes for things such as ZIP Codes, account numbers, weights, years, year-month combinations, and file names.

These are small, simple classes, usually equipped with a constructor, comparison operators, and a "to string" operator. Sometimes they have other operators. For example, the YearMonth class has next_month() and previous_month() functions.
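A minimal sketch of what such a YearMonth class might look like in C++. Only next_month() and previous_month() come from the description above; the member names, the validity rule, and the "YYYY-MM" text format are assumptions made for illustration.

```cpp
#include <stdexcept>
#include <string>

class YearMonth
{
public:
    // The constructor rejects months outside 1..12 (assumed rule).
    YearMonth(int year, int month) : year_(year), month_(month)
    {
        if (month < 1 || month > 12)
        {
            throw std::invalid_argument("month must be in the range 1..12");
        }
    }

    // Each operation returns a new object and rolls over the year boundary.
    YearMonth next_month() const
    {
        return (month_ == 12) ? YearMonth(year_ + 1, 1) : YearMonth(year_, month_ + 1);
    }

    YearMonth previous_month() const
    {
        return (month_ == 1) ? YearMonth(year_ - 1, 12) : YearMonth(year_, month_ - 1);
    }

    // Comparison operators.
    bool operator==(const YearMonth& other) const
    {
        return year_ == other.year_ && month_ == other.month_;
    }

    bool operator<(const YearMonth& other) const
    {
        return year_ < other.year_ || (year_ == other.year_ && month_ < other.month_);
    }

    // "To string" operation; the YYYY-MM format is an assumption.
    std::string to_string() const
    {
        return std::to_string(year_) + (month_ < 10 ? "-0" : "-") + std::to_string(month_);
    }

private:
    int year_;
    int month_;
};
```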

Why create a class for something so simple? After all, a year can easily be represented by an int (or an unsigned int, if you prefer). A file name can be held in a string. Why have a separate class for them?

Small classes provide a number of benefits.

Check for validity The constructor can check for the validity of the contents. With the proper checks in place, you know that every instance of the class is a valid instance. With primitive types (such as a string to hold a ZIP Code), you are never sure.

Consolidate redundant code A class can hold the logic that is duplicated in the main code. The Year class can tell if a year is a leap year, instead of repeating if (year % 4 == 0) in the code. It is easier (and more readable) to have code say if (year.is_leap_year()).

Consistent operations Our Year class performs the proper calculation for leap years (not the simple one listed above). Using the Year class for all instances of a year means that the calculations for leap year are consistent (and correct).

Clear names for operations Our Year class has operations named next_year() and previous_year() which give clear meaning to the operations year + 1 and year - 1.

Limit operations Custom classes provide the operations you specify and no others. The standard library provides classes with lots of operations, some of which may be inappropriate for your needs.

Add operations Our YearMonth class has operations next_month() and previous_month(), operations which are not supplied in the standard library's Date class. (Yes, one can add a TimeSpan object, provided one gets the right number of days in the TimeSpan, but the code is more complex.) Also, our YearMonth class can calculate the quarter of the year, something we need for our computations.

Prevent accidental use An object of a specific class cannot be used accidentally. If passed to a function or class, the target must be ready to accept the class. Our Year class cannot be carelessly passed to another function. If we stored our years in ints, those ints could be passed to any function that expected an int.

These benefits simplify the main code. Custom classes for small data elements let you ensure that objects are complete and internally consistent. They let you consolidate logic into a single place. They let you tailor the operations to your needs. They prevent accidental use or assignment.
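Several of these benefits can be seen in a minimal sketch of a Year class. The accepted range (1 through 9999) and the member names are assumptions; the leap-year test is the standard Gregorian rule.

```cpp
#include <stdexcept>
#include <string>

class Year
{
public:
    // "explicit" prevents a bare int from being silently used as a Year.
    explicit Year(int value) : value_(value)
    {
        // Validity check; the accepted range is an assumption.
        if (value < 1 || value > 9999)
        {
            throw std::out_of_range("year must be in the range 1..9999");
        }
    }

    // Full Gregorian rule, kept in one place instead of repeated in callers.
    bool is_leap_year() const
    {
        return (value_ % 4 == 0 && value_ % 100 != 0) || (value_ % 400 == 0);
    }

    // Named operations in place of year + 1 and year - 1.
    Year next_year() const { return Year(value_ + 1); }
    Year previous_year() const { return Year(value_ - 1); }

    bool operator==(const Year& other) const { return value_ == other.value_; }
    bool operator<(const Year& other) const { return value_ < other.value_; }

    std::string to_string() const { return std::to_string(value_); }

private:
    int value_;
};
```

There is no operator* or operator/ here, so a Year cannot be multiplied or divided by accident, and a function that expects an int will not accept one.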

Simplifying the main code means that the main code becomes, well, simpler. By moving low-level operations to low-level classes, your "mainline" code focusses on higher-level concepts. You spend less of your time worrying about low-level things and more of your time thinking about high-level (that is, application-level) ideas.

If you want to simplify code, build small classes.

Sunday, May 18, 2014

How to untangle code: Use variables for only one purpose

Early languages (COBOL, FORTRAN, BASIC, Pascal, C, and others) forced the declarations of variables into a single section of code. COBOL was the strictest taskmaster of all, with data declarations in sections of code separate from the procedural section.

With limited memory, it was often necessary to re-use variables. FORTRAN assisted in the efficient use of memory with the 'EQUIVALENCE' directive which let one specify variables that used the same memory locations.

Today, the situation has changed. Memory is cheap and plentiful. It is no longer necessary to use variables for more than one purpose. Our languages no longer have EQUIVALENCE statements -- something for which I am very grateful. Modern languages (including C++, C#, Java, Perl, Python, Ruby, and even the later versions of C) allow us to declare variables when we need them; we are not limited to declaring them in a specific location.

Using variables for more than one purpose is still tempting, but not necessary. Modern languages allow us to declare variables as we need them, and use different variables for different purposes.

Suppose we have code that calculates the total expenses and total revenue in a system.

Instead of this code:

void calc_total_expense_and_revenue()
{
    int i;
    double amount;

    amount = 0;
    for (i = 0; i < 10; i++)
    {
        amount += calc_expense(i);
    }

    store_expense(amount);

    amount = 0;

    for (i = 0; i < 10; i++)
    {
        amount += calc_revenue(i);
    }

    store_revenue(amount);
}

we can use this code:

void calc_total_expense_and_revenue()
{
    double expense_amount = 0;
    for (unsigned int i = 0; i < 10; i++)
    {
        expense_amount += calc_expense(i);
    }

    store_expense(expense_amount);

    double revenue_amount = 0;
    for (unsigned int i = 0; i < 10; i++)
    {
        revenue_amount += calc_revenue(i);
    }

    store_revenue(revenue_amount);
}

I much prefer the second version. Why? Because the second version cleanly separates the calculation of expenses and revenue. In fact, the separation is so good we can break the function into two smaller functions:

void calc_total_expense()
{
    double expense_amount = 0;
    for (unsigned int i = 0; i < 10; i++)
    {
        expense_amount += calc_expense(i);
    }

    store_expense(expense_amount);
}

void calc_total_revenue()
{
    double revenue_amount = 0;
    for (unsigned int i = 0; i < 10; i++)
    {
        revenue_amount += calc_revenue(i);
    }

    store_revenue(revenue_amount);
}

Two small functions are better than one large function. Small functions are easier to read and easier to maintain. Using a variable for more than one purpose can tie those functions together. Using separate variables (or one variable for each purpose) allows us to separate functions.

Sunday, April 6, 2014

How to untangle code: Make loops smaller

Loops all too often do many things. Some programs pack as much as possible into a single loop. The cause is possibly a desire to optimize, to reduce the work performed by the machine. The thought is that one loop is more efficient than several loops, because 1) there is only one loop to "set up" and "take down" and 2) the computer can perform tasks on multiple items as it "goes along" an array of data structures. This is not necessarily so.

Perhaps the code is:

for (unsigned int i = 0; i < num_items; i++)
{
    member_a[i] = some_value(i);
    member_b[i] = something + some_other_value(i);
    member_c[i] = member_a[i] + member_b[i];
}

The "optimization" of packing loops full of functionality is not necessarily a true optimization. The notion that loops are expensive is an old one, dating back to the mainframe era (when it was true). Since that time, we have designed faster processors, created more capable instruction sets, and improved compilers.

Packing loops full of instructions has a cost: the code is more complex. Being more complex, it is harder to read. (My example is simple, of course. Real code is more complicated. But I think the idea holds.)

I change this (admittedly simple) single loop to a set of smaller loops:

for (unsigned int i = 0; i < num_items; i++)
{
    member_a[i] = some_value(i);
}
for (unsigned int i = 0; i < num_items; i++)
{
    member_b[i] = something + some_other_value(i);
}
for (unsigned int i = 0; i < num_items; i++)
{
    member_c[i] = member_a[i] + member_b[i];
}

The revised code looks longer (and to some, terribly inefficient) but look again. Each loop is simple and can be easily understood. Each loop performs one task, and one task only.

Moreover, languages that support vector operations (and there are a few such languages) can see their code simplified further:

for (unsigned int i = 0; i < num_items; i++)
{
    member_a[i] = some_value(i);
}
for (unsigned int i = 0; i < num_items; i++)
{
    member_b[i] = something + some_other_value(i);
}
member_c = member_a + member_b;

Using smaller loops isolates the steps in the loop. The smaller loops can be optimized independently.

If the functions 'some_value()' and 'some_other_value()' can be changed to return vectors of values, the code can be simplified further:

member_a = some_values();
member_b = something + some_other_values();
member_c = member_a + member_b;

Doesn't that look simpler than the earlier versions of the code?
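C++ itself offers one such facility: std::valarray, from the standard library, supports element-wise arithmetic. A minimal sketch, assuming the values have already been gathered into valarrays; the Members struct and the compute_members() name are hypothetical.

```cpp
#include <valarray>

// Hypothetical holder for the three arrays in the example.
struct Members
{
    std::valarray<double> a;
    std::valarray<double> b;
    std::valarray<double> c;
};

Members compute_members(const std::valarray<double>& values,
                        double something,
                        const std::valarray<double>& other_values)
{
    // Each line is one whole-array operation; no index variable is needed.
    std::valarray<double> a(values);
    std::valarray<double> b(something + other_values);
    std::valarray<double> c(a + b);
    return Members{a, b, c};
}
```

The two input arrays are assumed to have the same length; std::valarray does not check this for you.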

Languages without vector operations can approach the brevity of vector operations. Assuming an object-oriented language (without operator overloading), one could write:

member_a = some_values();
member_b = Numbers.Add(something, some_other_values());
member_c = Numbers.Add(member_a, member_b);

Assuming you had the functions:

double[] Numbers.Add(double value, double[] values);
double[] Numbers.Add(double[] values1, double[] values2);

and these functions are not that hard to write.
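As a rough illustration, here is one way the two helpers might be written in C++, with std::vector<double> standing in for double[] and a namespace standing in for the Numbers class. Throwing on mismatched lengths is an assumption; the text above does not say how that case should be handled.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

namespace Numbers
{
    // Adds a single value to every element.
    std::vector<double> Add(double value, const std::vector<double>& values)
    {
        std::vector<double> result;
        result.reserve(values.size());
        for (double element : values)
        {
            result.push_back(value + element);
        }
        return result;
    }

    // Adds two arrays element by element.
    std::vector<double> Add(const std::vector<double>& values1,
                            const std::vector<double>& values2)
    {
        if (values1.size() != values2.size())
        {
            throw std::invalid_argument("arrays must have the same length");
        }
        std::vector<double> result;
        result.reserve(values1.size());
        for (std::size_t i = 0; i < values1.size(); i++)
        {
            result.push_back(values1[i] + values2[i]);
        }
        return result;
    }
}
```

The loops still exist, but they are written once and live in one place rather than in every caller.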

Code can be complex, sometimes because we think we are "helping" the computer. It's often better to help ourselves, to write programs that are simple and easy to understand.

Sunday, March 30, 2014

How to untangle code: Start at the bottom

Messy code is cheap to make and expensive to maintain. Clean code is not so cheap to create but much less expensive to maintain. If you can start with clean code and keep the code clean, you're in a good position. If you have messy code, you can reduce your maintenance costs by improving your code.

But where to begin? The question is difficult to answer, especially on a large code base. Some ideas are:
  • Re-write the entire code
  • Re-write logical sections of code (vertical slices)
  • Re-write layers of code (horizontal slices)
  • Make small improvements everywhere
All of these ideas have merit -- and risk. For very small code sets, a complete re-write is possible. For a system larger than "small", though, a re-write entails a lot of risk.

Slicing the system (either vertically or horizontally) has the appeal of independent teams. The idea is to assign a number of teams to the project, with each team working on an independent section of code. Since the code sets are independent, the teams can work independently. This is an appealing idea but not always practical. It is rare that a system is composed of independent systems. More often, the system is composed of several mutually-dependent systems, and adjustments to any one sub-system will ripple throughout the code.

One can make small improvements everywhere, but this has its limits. The improvements tend to be narrow in scope and systems often need high-level revisions.

Experience has taught me that improvements must start at the "bottom" of the code and work upwards. Improvements at the bottom layer can be made with minimal changes to higher layers. Note that there are some changes to higher layers -- in most systems there are some effects that ripple "upwards". Once the bottom layer is "clean", one can move upwards to improve the next-higher level.

How to identify the bottom layer? In object-oriented code, the process is easy: classes that can stand alone are the bottom layer. Object-oriented code consists of different classes, and some (usually most) classes depend on other classes. (A "car system" depends on various subsystems: "drive train", "suspension", "electrical", etc., and those subsystems in turn depend on smaller components.)

No matter how complex the hierarchy, there is a bottom layer. Some classes are simple enough that they do not include other classes. (At least not other classes that you maintain. They may contain framework-provided classes such as strings and lists and database connections.)

These bottom classes are where I start. I make improvements to these classes, often making them immutable (so they can hold state but they cannot change state). I change their public methods to use consistent names. I simplify their code. When these "bottom" classes are complex (when they hold many member variables) I split them into multiple classes.

The result is a set of simpler, cleaner code that is reliable and readable.
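A minimal sketch of an immutable "bottom" class, using a hypothetical Weight class; the name, the unit, and the validity rule are all assumptions. The state is set once in the constructor, every member function is const, and an operation that would change the value returns a new object instead.

```cpp
#include <stdexcept>

class Weight
{
public:
    explicit Weight(double kilograms) : kilograms_(kilograms)
    {
        if (kilograms < 0.0)
        {
            throw std::invalid_argument("weight cannot be negative");
        }
    }

    // Reports state; cannot change it.
    double kilograms() const { return kilograms_; }

    // Returns a new Weight rather than modifying this one.
    Weight plus(const Weight& other) const
    {
        return Weight(kilograms_ + other.kilograms_);
    }

private:
    const double kilograms_;  // const member: fixed for the object's lifetime
};
```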

Most of these changes affect the other parts of the system. I make changes gradually, introducing one or two and then re-building the system and fixing broken code. I create unit tests for the revised classes. I share changes with other members of the team and ask for their input.

I don't stop with just these "bottom" classes. Once cleaned, I move up to the next level of code: the classes that depend only on framework and the newly-cleaned classes. With a solid base of clean code below, one can improve the next layer of classes. The improvements are the same: make classes immutable, use consistent names for functions and variables, and split complex classes into smaller classes.

Using this technique, one works from the bottom of the code to the top, cleaning all of the code and ensuring that the entire system is maintainable.
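The layering can be worked out mechanically once you have a list of which classes depend on which. The sketch below is one possible way to do it, not a tool described in this post: the Dependencies type and the build_layers() name are hypothetical, and the dependency list is assumed to cover only the classes you maintain.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each class name to the names of the classes it depends on.
using Dependencies = std::map<std::string, std::set<std::string>>;

// Layer 0 holds classes that depend on nothing in the map, layer 1 holds
// classes that depend only on layer 0, and so on. Classes caught in a
// dependency cycle are never "ready" and are left out of the result.
std::vector<std::set<std::string>> build_layers(const Dependencies& depends_on)
{
    std::vector<std::set<std::string>> layers;
    std::set<std::string> placed;

    while (placed.size() < depends_on.size())
    {
        std::set<std::string> layer;
        for (const auto& entry : depends_on)
        {
            if (placed.count(entry.first) != 0)
            {
                continue;  // already assigned to a lower layer
            }
            bool ready = true;
            for (const auto& dependency : entry.second)
            {
                if (placed.count(dependency) == 0)
                {
                    ready = false;
                    break;
                }
            }
            if (ready)
            {
                layer.insert(entry.first);
            }
        }
        if (layer.empty())
        {
            break;  // nothing more can be placed: a cycle or an unlisted class
        }
        placed.insert(layer.begin(), layer.end());
        layers.push_back(layer);
    }
    return layers;
}
```

The first layer returned is the place to start cleaning; each later layer becomes available once the layers below it are done.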

This method is not without drawbacks. Sometimes there are cyclic dependencies between classes and there is no clear "bottom" class. (Good judgement and re-factoring can usually resolve that issue.) The largest challenge is not technical but political -- large code bases with large development teams often have developers with egos, developers who think that they own part of the code. They are often reluctant to give up control of "their" code. This is a management issue, and much has been written on "egoless programming".

Despite the difficulties, this method works. It is the only method that I have found to work. The other approaches too often run into the problem of doing too much at once. The "bottom up" method allows for small, gradual changes. It reduces risk, but cannot eliminate it. It lets the team work at a measured pace, and lets the team measure their progress (how many classes cleaned).

Sunday, March 23, 2014

How to untangle code: Limit functions to void or const

Over time, software can become messy. Systems that start with clean and readable code often degrade with hastily-made changes to code that is hard to understand. I untangle that code, restoring it to a state that is easy to understand.

One technique I use with object-oriented code is to limit member functions to 'void' or 'const'. That is, a function may change the state of an object or it may report a value contained in the object, but it cannot do both.

Dividing functions into two types - mutation functions and reporter functions - reduces the logic which modifies the object. Isolating the changes of state is a good way to simplify the code. Most difficulties with code occur with changes of state. (Functional programming avoids this problem by using immutable objects which can never change once they are initialized.)

Separating the logic to change an object from the logic to report on an object's state also frees the use of reporting functions. A combination function, one that reports a value and also changes the object's state, can be called only when a change to the state is desired. But a 'const' function can be called at any time, and any number of times, on the object because it does not change the object's state. Thus you can refactor the code that calls 'const' functions and change the sequence of their calls (and the frequency of the calls) with confidence.

Here's a simple example. The 'combined' function:

double foo::update_totals ( double new_value )
{
    my_total += new_value;
    return my_total;
}

can be separated into two:

void foo::update_totals ( double new_value )
{
    my_total += new_value;
}

double foo::total ( void ) const
{
    return my_total;
}

These two functions perform the same operations as the single combined function, but now you are free to call the second function (total()) as many times as you like. Notice that total() is marked as const. It cannot change the state of the object.

Your original calling code also changes:

{
    foo my_foo;

    // some code

    double total = my_foo.update_totals( 100.0 );
}

becomes:

{
    foo my_foo;

    // some code

    my_foo.update_totals( 100.0 );
    double total = my_foo.total();
}
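The fragments above omit the class declaration. A compilable sketch of the separated version might look like the following; the constructor and the exceeds_limit() helper are assumptions added for illustration.

```cpp
class foo
{
public:
    foo() : my_total(0.0) {}

    // Mutation function: changes state, returns nothing.
    void update_totals(double new_value)
    {
        my_total += new_value;
    }

    // Reporter function: returns a value, changes nothing.
    double total() const
    {
        return my_total;
    }

private:
    double my_total;
};

// Hypothetical reporting helper. Because it receives a const reference,
// the compiler lets it call total() but rejects any call to update_totals().
bool exceeds_limit(const foo& totals, double limit)
{
    return totals.total() > limit;
}
```

Code that only reports on an object can take it by const reference, and the compiler then enforces the split between the two kinds of function.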

An additional benefit of separating mutation logic from reporting logic is that functions are smaller. Smaller functions are easier to comprehend. (Yes, the calling function is slightly longer, but the gains from the simpler class outweigh the increase in the calling code.)

Messy code is ... messy. You can make it less messy by separating mutation functions from reporting functions, and ensuring that all functions are either one or the other.

Sunday, March 16, 2014

How to untangle code: make member variables private

Tangled code is hard to read, hard to understand, and hard to change. (Well, hard to change and get right. Tangled code is easy to change and get the change wrong.)

I use a number of techniques to untangle code. Once untangled, code is easy to read, easy to understand, and easy to change (correctly).

One technique I use is to make member variables of classes private. In object-oriented programming languages, member variables can be marked as public, protected, or private. Public variables are open to all, and any part of the code can change them. Private variables are walled off from the rest of the code; only the "owning" object can make changes. (Protected variables live in an in-between state, available to some objects but not all. I tend to avoid the "protected" option.)

The benefits of private variables are significant. With public variables, an object's member variables are available to all parts of the code. It is the equivalent of leaving all of your belongings on your front lawn; anyone passing by can take things, leave new things, or just re-arrange your stuff. Private variables, in contrast, are not available to other parts of the code. Only the object that contains the variable can modify it. (The technical term for this isolation is "encapsulation".)

But one cannot simply change the designation of member variables from "public" to "private". Doing so often breaks existing code, because sections of the code have been built with the ability to access those variables.

The process of "privatizing" member variables is slow and gradual. I start with a general survey of the code and then select one class and change its member variables to private. The class I select is not picked at random. I pick a class that is "elementary", one that has no other dependencies. These "elementary" classes are easier to "privatize", since they can be modified without reliance on other classes. They also tend to be simpler and smaller than most classes in the system.

But while they do not depend on other classes, their changes may affect other parts of the code. These low-level "elementary" classes are used by other classes in the system, and those dependent classes often break with the changes. Making member variables private means that those other classes cannot simply "reach into" the elementary class anymore.

To fix these problems, I create special accessor functions. These functions let other classes read (or write) the member variables. Often I find that only the "read" accessor is necessary. Sometimes the "write" accessor is necessary.

After I make the member variables private, I create the accessor functions. Then I modify the dependent code to use not the member variable but the accessor function.
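A minimal sketch of the result, using a hypothetical Customer class; the class, its members, and the can_order() function are invented for illustration. Both variables have become private. Each gets a "read" accessor, and only one needs a "write" accessor.

```cpp
#include <string>

class Customer
{
public:
    Customer(const std::string& name, double credit_limit)
        : name_(name), credit_limit_(credit_limit)
    {
    }

    // "Read" accessors: const, so they cannot change the object.
    const std::string& name() const { return name_; }
    double credit_limit() const { return credit_limit_; }

    // "Write" accessor, added only because some caller needs it.
    void set_credit_limit(double credit_limit) { credit_limit_ = credit_limit; }

private:
    // Formerly public; other classes can no longer reach in directly.
    std::string name_;
    double credit_limit_;
};

// Dependent code, changed from customer.credit_limit_ to the accessor.
bool can_order(const Customer& customer, double order_amount)
{
    return order_amount <= customer.credit_limit();
}
```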

These are simple changes; they do not change the semantics of the code. The compiler helps you; once you make the member variables private it gleefully points out the other parts of the code that want access to the now-private variables. You know that you have corrected all of the dependent code when the compiler stops complaining.

Making member variables private is one step in untangling code, but not the only step. I will share more of my techniques in future posts.

Sunday, March 9, 2014

How to untangle code: Remove the tricks

We all have our specialties. Mine is the un-tangling of code. That is, I can re-factor messy code to make it readable (and therefore maintainable).

The process is sometimes complex and sometimes tedious. I have found (discovered?, identified?) a set of practices that allow me to untangle code. As practices, they are imprecise and subject to judgement. Yet they can be useful.

The first practice is to get rid of the tricks. "Tricks" are the neat little features of the language.

In C++, two common types of tricks are pointers and preprocessor macros. (And sometimes they are combined.)

Pointers are to be avoided because they can often cause unintended operations. In C, one must use pointers; in C++ they are to be used only when necessary. One can pass a reference to an object instead of a pointer (or better yet, a reference to a const object). The reference is bound to an object and cannot be changed; a pointer, on the other hand, can be changed to point to something else (if you are very disciplined, that something else will be another instance of the same class).

We use pointers in C (and in early C++) to manage elements in a data structure such as a list or a tree. While we can use references, it is better to use the containers of the C++ STL (or the Boost library). These containers handle memory allocation and de-allocation. I have successfully untangled programs and eliminated all "new" and "delete" calls from the code.
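A minimal sketch of both ideas together: a standard container in place of a hand-built list, and a reference to a const object in place of a pointer. The function names and the sample values are hypothetical.

```cpp
#include <numeric>
#include <vector>

// Builds the collection without "new" or "delete"; std::vector
// allocates and releases the memory itself.
std::vector<double> collect_values(unsigned int count)
{
    std::vector<double> values;
    for (unsigned int i = 0; i < count; i++)
    {
        values.push_back(i * 1.5);  // placeholder data
    }
    return values;
}

// Takes a reference to a const object: it cannot be null, cannot be
// re-pointed at something else, and cannot be used to modify the caller's data.
double sum_values(const std::vector<double>& values)
{
    return std::accumulate(values.begin(), values.end(), 0.0);
}
```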

The other common trick of C++ is the preprocessor. The preprocessor macros are powerful constructs that let one perform all sorts of mischief including changing function names, language keywords, and constant values. Simple macro definitions such as

#define PI 3.1415

can be written in C++ or C# (or in Java, with 'static final' in place of 'const') as

const double PI = 3.1415;

so one does not really need the preprocessor for those definitions.

More sophisticated macros such as

#define array_value(x, y) (((y) < 100) ? (x)[(y)] : (x)[0])

let you check the bounds of an array, but the STL std::vector<> container performs this checking for you when you use its at() member function (operator[] does not check).

The preprocessor also lets one construct function calls at compile time:

#define call_func(x, y, a1, a2) func_##x##y(a1, a2)

to convert this code

call_func(stats, avg, v1, v2);

to this

func_statsavg(v1, v2);

Mind you, the source code contains only the unconverted line, never the converted line. Your debugger does not know about the post-processed line either. In a sense, #define macros are lies that we programmers tell ourselves.

Worse, they are specific to C++ (and possibly C, depending on their use of object-oriented notations). When you write code that invokes the C++ preprocessor, you lock the code into that language. Java has no preprocessor at all, and C# supports only conditional-compilation directives, not macros like these.

So when un-tangling code (sometimes with the objective of moving code to another language), one of the first things I do is get rid of the tricks.