Showing posts with label debugging. Show all posts

Tuesday, July 20, 2021

Debugging

Development consists of several tasks: analysis, design, coding, testing, and deployment are the typical tasks listed for development. There is one more: debugging, and that is the task I want to talk about.

First, let me observe that programmers, as a group, like to improve their processes. Programmers write the compilers and editors and operating systems, and they build tools to make tasks easier.

Over the years, programmers have built tools to assist in the tasks of development. Programmers were unhappy with machine coding, so they wrote assemblers which converted text codes to numeric codes. They were unhappy with those early assemblers because they still had to compute locations for jump targets, so they wrote symbolic assemblers that did that work.

Programmers wrote compilers for higher-level languages, starting with FORTRAN and FLOW-MATIC and COBOL. We've created lots of languages, and lots of compilers, since.

Programmers created editors to allow for creation and modification of source code. Programmers have created lots of editors, from simple text editors that can run on a paper-printing terminal to the sophisticated editors in today's IDEs.

Oh, yes, programmers created IDEs (integrated development environments) too.

And tools for automated testing.

And tools to simplify deployment.

Programmers have made lots of tools to make the job easier, for every aspect of development.

Except debugging. Debugging has not changed in decades.

There are three techniques for debugging, and they have not changed in decades.

Desk check: Not used today. Used in the days of mainframe and batch processing, prior to interactive programming. To "desk check" a program, one looks at the source code (usually on paper) and checks it for errors.

This technique was replaced by tools such as lint and techniques such as code reviews and pair programming.

Logging: Modify the code to print information to a file for later examination. Also known as "debug by printf()".

This technique is in use today.

Interactive debugging: This technique has been around since the early days of Unix. It was available in 8-bit operating systems such as CP/M (the DDT program). The basic idea: run the program under the debugger, pausing execution at some point. The debugger keeps the program loaded in memory, and one can examine or modify data. Some debuggers allow you to modify the code (typically with interpreted languages).

This technique is in use today. Modern IDEs such as Visual Studio and PyCharm provide interactive debuggers.

Those are the three techniques. They are fairly low-level technologies, and require the programmer to keep a lot of knowledge in his or her head.

These techniques gave us Kernighan's quote:

"Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?"

— The Elements of Programming Style, 2nd edition, chapter 2

These debugging techniques are the equivalent of assemblers. They allow programmers to do the job, but they put a lot of work on the programmers. They assist with the mechanical aspects of the task, but not the functional aspects. A programmer working on a defect with a debugger usually follows this procedure:

- understand the defect
- load the program in the debugger
- place some breakpoints in the source code, to pause execution at points that seem close to the error
- start the program running, wait for a breakpoint
- examine the state of the program (variables and their contents)
- step through the program, one line at a time, to see which decisions are made ('if' statements)

This process requires the programmer to keep a model of the program inside his or her head. It requires concentration, and interruptions or distractions can destroy that model, requiring the programmer to start again.

I think that we are ready for a breakthrough in debugging. A new approach that will make it easier for the programmer.

That new approach, I think, will be innovative. It will not be an incremental improvement on the interactive debuggers of today. (Those debuggers are the result of 30-odd years of incremental improvements, and they still require lots of concentration.)

The new debugger may be something completely new, such as running two (slightly different) versions of the same program and identifying the points in the code where execution varies.
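As a toy sketch of that idea (everything here is invented for illustration; no such tool is implied to exist): record a simple trace of each version of a computation, then report the first point at which the traces diverge.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy sketch: run two versions of a computation, record a trace of
// each, and report the first step at which they diverge. The two
// "versions" and their traces are hypothetical examples.
using Trace = std::vector<std::string>;

Trace version_a(int n)
{
    Trace t;
    int x = n;
    t.push_back("x=" + std::to_string(x));
    x = x * 2;
    t.push_back("x=" + std::to_string(x));
    return t;
}

Trace version_b(int n)
{
    Trace t;
    int x = n;
    t.push_back("x=" + std::to_string(x));
    x = x * 2 + 1; // the (deliberate) behavioral difference
    t.push_back("x=" + std::to_string(x));
    return t;
}

// Returns the index of the first differing trace entry, or -1 if the
// traces agree everywhere.
int first_divergence(const Trace &a, const Trace &b)
{
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i)
        if (a[i] != b[i])
            return static_cast<int>(i);
    return a.size() == b.size() ? -1 : static_cast<int>(n);
}
```

A real tool would capture the trace automatically rather than by hand, but the comparison step would look much like this.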

Or possibly new techniques for visualizing the data of the program. Today's debuggers show us everything, with limited ways to specify items of interest (and other items that we don't care about and don't want to see).

Or possibly visualization of the program's state, which would be a combination of variables and executed statements.

I will admit that the effort to create a debugger (especially a new-style debugger) is hard. I have written two debuggers in my career: one for 8080 assembly language and another for an interpreter for BASIC. Both were challenges, and I was not happy with the results for either of them. I suspect that to write a debugger, one must be twice as clever as when writing the compiler or interpreter.

Yet I am hopeful that we will see a new kind of debugger. It may start as a tool specific to one language. It may be for an established language, but I suspect it will be for a newer one. Possibly a brand-new language with a brand-new debugger. (I suspect that it will be an interpreted language.) Once people see the advantages of it, the idea will be adopted by other language teams.

The new technique may be so different that we don't call it a debugger. We may give it a new name. So it may be that the new debugger is not a debugger at all.

Monday, June 26, 2017

The humble PRINT statement

One of the first statements we learn in any language is the "print" statement. It is the core of the "Hello, world!" program. We normally learn it and then discard it, focusing our efforts on database and web service calls.

But the lowly "print" statement has its uses, as I was recently reminded.

I was working on a project with a medium-sized C++ application. We needed information to resolve several problems, information that would normally be available from the debugger and from the profiler. But the IDE's debugger was not usable (executing under the debugger would require a run time of about six hours) and the IDE did not have a profiler.

What to do?

For both cases, the PRINT statement (the "fprintf()" call, actually, as we were using C++) was the thing we needed. A few carefully placed statements allowed us to capture the necessary information, make decisions, and resolve the problems.

The process wasn't that simple, of course. We needed several iterations, adding and removing PRINT statements in various locations. We also captured counts and timings of various functions.

The effort was worth it.

PRINT statements (or "printf()", or "print()", or "puts()", whatever you use) are useful tools. Here's how they can help:
  • They can capture values of internal variables and state when the debugger is not available.
  • They can capture lots of values of variables and state, for analysis at a level higher than the interactive level of the debugger. (Consider viewing several thousand values for trends in a debugger.)
  • They can capture performance when a profiler is not available.
  • They can extract information from the "release" version of software, because sometimes the problem doesn't occur in "debug" mode.
They may be simple, but they are useful. Keep PRINT statements in your toolbox.
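As an illustration of the profiling use, a call can be bracketed with a clock and an fprintf(). The function and numbers below are invented for the sketch, not taken from the project described above.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical workload standing in for a function we want to time.
static double work(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += static_cast<double>(i) * i;
    return sum;
}

// Capture a timing with fprintf() when no profiler is available:
// read the clock before and after, and log the elapsed time.
double timed_work(int n, std::FILE *log)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    const double result = work(n);
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::microseconds>(clock::now() - start);
    std::fprintf(log, "work(%d) took %lld us\n", n,
                 static_cast<long long>(elapsed.count()));
    return result;
}
```

Collecting these lines over a full run, and tallying counts and times per function, gives a poor man's profile.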

* * * * *

I was uncertain about the title for this column. I considered the C/C++ form of the generic statement ('printf()'). I also considered the general form used by other languages ('print()', 'puts()', 'WriteLine()'). I settled on BASIC's form of PRINT -- all capitals, no parentheses. All popular languages have such a statement; in the end, I suspect it matters little. Use what is best for you.

Monday, March 27, 2017

The provenance of variables

Just about every programming language has the concept of a 'variable', a container of a value. Variables are named 'variable' because their contents can vary -- as opposed to constants. The statement

a = 10

assigns the variable 'a' the value 10, denoted in the program as a constant.

(There are some languages which allow for the values of constants to be changed, so one can assign a new value to a constant. It leads to unusual results and is often considered a defect. But I digress.)

The nice thing about variables is that they can vary. The problem with variables is that they vary.

More specifically, when examining a program (say with a debugger), one can see the contents of a variable but one does not know how that value was calculated. Was it assigned? Was the variable incremented? Did the value come from a constant, or was it calculated? When was it assigned?

Here is an idea: Retain the source of values. Modify the notion of a variable. Instead of being a simple container for a value, hold the value and additional information.

For example, a small program:

file f = open("filename")
a = f.read()
b = f.read()
c = (a - b) / 100

Let's assume that the file contains the text "20 4", which is the number 20 followed by a space and then the number 4, all in text format.

In today's programming languages, the variables a, b, and c contain values, and nothing else. The variable 'a' contains 20, the variable 'b' contains 4, and the variable 'c' contains 0.16. Yet they contain no information about how those values were derived.

For a small program such as this example, we can easily look at the code and identify the source of the values. But larger programs are a different story. They are often complex, and the source of a value is not obvious.

With provenance, the variables a, b, and c still contain values, and in addition contain information about those values.

The variable 'a' contains the value 20 and the value 'filename', as that was the source of the value. It could also retain more information about the file, such as a creation date, a version number (for filesystems that support version numbers), and the position within the file. It could even record the line number of the assignment, giving the programmer easy access to the source.

The variable 'b' contains similar information.

The variable 'c' contains information about the variables 'a' and 'b', along with their states at the time of assignment. Consider the revised program:

file f = open("filename")
a = f.read()
b = f.read()
c = (a - b) / 100
... more code
a = 0
b = 1
... more code
d = c * 20

In this program, the variable 'd' is assigned the value 3.2 (assuming that the same file is read) and at that assignment, the variable 'c' holds information about 'a' and 'b' with their initial values of 20 and 4, not their current values of 0 and 1. Thus, a developer can examine the assignment to 'd' and understand the value of 'c'.

In addition to developers, provenance may be useful for financial auditors and examiners. Anyone who cares about the origins of a specific value will find provenance helpful.

Astute readers will already be thinking of the memory requirements for such a scheme. Retaining provenance requires memory -- a lot of memory. A simple variable holding an integer requires four bytes (on many modern systems). With provenance, a 'simple' integer would require the four bytes for the value plus as many bytes as are needed to hold its history. Instead of four bytes, it may require 40, or 400.

Clearly, provenance is not free. It costs memory. It also costs time. Yet the benefits, I think, are clear. So, how to implement it? Some ideas:

- Provenance is needed only when debugging, not during production. Enable it as part of the normal debug information and remove it for 'release' mode.
- Provenance can be applied selectively, to a few variables and not to others.
- Provenance can be implemented selectively. Perhaps one needs only a few pieces of information, such as line number of assignment. Less information requires less memory.
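A minimal sketch of the selective approach: a wrapper type that carries a value plus a short history of how it was produced, applied only to the few variables one cares about. All the names here (Traced, history, example_c) are invented for illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

// A provenance-carrying value: the value itself plus a record of the
// operations and origins that produced it.
struct Traced
{
    double value;
    std::vector<std::string> history;

    Traced(double v, std::string origin)
        : value(v), history{std::move(origin)} {}
};

// Combining two traced values merges their histories, so the result
// remembers the state of its operands at the time of the operation,
// even if the operands are later reassigned.
Traced operator-(const Traced &a, const Traced &b)
{
    Traced r(a.value - b.value, "subtraction");
    r.history.insert(r.history.end(), a.history.begin(), a.history.end());
    r.history.insert(r.history.end(), b.history.begin(), b.history.end());
    return r;
}

Traced operator/(const Traced &a, double k)
{
    Traced r(a.value / k, "division by constant");
    r.history.insert(r.history.end(), a.history.begin(), a.history.end());
    return r;
}

// Mirrors the example above: a and b come from "filename" (here
// hard-coded rather than read), and c = (a - b) / 100.
Traced example_c()
{
    Traced a(20.0, "filename");
    Traced b(4.0, "filename");
    return (a - b) / 100.0;
}
```

After example_c() runs, the result holds the value 0.16 along with entries recording the division, the subtraction, and the 'filename' origin of both operands -- the kind of record a debugger could display.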

Our computing capacity continues to grow. Processor capabilities, memory size, and storage size are all increasing faster than program size. That is, our computers are getting bigger, and they are getting bigger faster than our programs are. All of that 'extra' space should do something for us, right?

Thursday, November 11, 2010

Improve code with logging

I recently used a self-made logging class to improve my (and others') code. The improvements to code were a pleasant side-effect of the logging; I had wanted more information from the program, information that was not visible in the debugger, and wrote the logging class to capture and present that information. During my use of the logging class, I found the poorly structured parts of the code.

A logging class is a simple thing. My class has four key methods (Enter(), Exit(), Log(), and ToString() ) and a few auxiliary methods. Each method writes information to a text file. (The text file being specified by one of the auxiliary methods.) Enter() is used to capture the entry into a function; Exit() captures the return from the function; Log() adds an arbitrary message to the log file, including variable values; and ToString() converts our variables and structures to plain text. Combined, these methods let us capture the data we need.
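A minimal sketch of such a class follows. The four key method names come from the description above; the implementation details (the SetFile() auxiliary method, the output format) are my own guesses at one reasonable shape.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Sketch of the logging class described above. Enter(), Exit(), Log(),
// and ToString() match the post; everything else is a guess.
class Logger
{
public:
    // Auxiliary method: direct output to a text file ("w" truncates).
    static void SetFile(const char *path) { file_ = std::fopen(path, "w"); }

    static void Enter(const char *fn) { std::fprintf(file_, "** Enter: %s\n", fn); }
    static void Exit(const char *fn)  { std::fprintf(file_, "** Exit: %s\n", fn); }

    static void Log(const char *label, const std::string &text)
    {
        std::fprintf(file_, "%s%s\n", label, text.c_str());
    }

    // Convert a value to plain text.
    static std::string ToString(double v)
    {
        std::ostringstream out;
        out << v;
        return out.str();
    }

    // Convert a whole array -- the case the debugger handles poorly.
    static std::string ToString(const double *a, int n)
    {
        std::ostringstream out;
        out << '[';
        for (int i = 0; i < n; ++i)
            out << (i ? " " : "") << a[i];
        out << ']';
        return out.str();
    }

private:
    static std::FILE *file_;
};

std::FILE *Logger::file_ = stderr; // default until SetFile() is called
```

The array overload of ToString() is the workhorse: it turns a double* of a hundred values into one readable log line.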

I use the class to capture information about the flow of a program. Some of this information is available in the debugger but some is not. We're using Microsoft's Visual Studio, a very capable IDE, but some run-time information is not available. The problem is due, in part, to our program and the data structures we use. The most common is an array of doubles, allocated by 'new' and stored in a double*. The debugger can see the first value but none of the rest. (Oh, it can if we ask for x[n], where 'n' is a specific number, but there is no way to see the whole array, and repeating the request for an array of 100 values is tiresome.)

Log files provide a different view of the run-time than the debugger. The debugger can show you values at a point in time, but you must run the program and stop at the correct point in time. And once there, you can see only the values at that point in time. The log file shows the desired values and messages in sequence, and it can extract the 100-plus values of an array into a readable form. A typical log file would be:

** Enter: foo()
i == 0
my_vars = [10 20 22 240 255 11 0]
** Enter: bar()
r_abs == 22
** Exit: bar()
** Exit: foo()

The log file contains the text that I specify, and nothing beyond it.

Log files give me a larger view than the debugger. The debugger shows values for a single point in time; the log file shows me the values over the life of the program. I can see trends much more easily with the log files.

But enough of the direct benefits of log files. Beyond showing me the run-time values of my data, they help me build better code.

Log files help me with code design by identifying the code that is poorly structured. I inject the logging methods into my code, instrumenting it. The function

double Foobar::square(double value)
{
    return (value * value);
}

becomes

double Foobar::square(double value)
{
    Logger::Enter("Foobar::square(double)");
    Logger::Log("value: ", Logger::ToString(value));
    double result = value * value;
    Logger::Exit("Foobar::square(double)");
    return result;
}

A bit verbose, and perhaps a little messy, but it gets the job done. The log file will contain lines for every invocation of Foobar::square().

Note that each instrumented function has a pair of calls: Enter() and Exit(). It's useful to know when each function starts and ends.

For the simple function above, one Enter() and one Exit() is needed. But for more complex functions, multiple Exit() calls are needed. For example:

double Foobar::square_root(double value)
{
    if (value < 0.0)
        return 0.0;
    if (value == 0.0)
        return 0.0;
    return (pow(value, 0.5));
}

The instrumented version of this function must include not one but three calls to Exit(), one before each return statement.

double Foobar::square_root(double value)
{
    Logger::Enter("Foobar::square_root(double)");
    Logger::Log("value: ", Logger::ToString(value));
    if (value < 0.0)
    {
        Logger::Exit("Foobar::square_root(double)");
        return 0.0;
    }
    if (value == 0.0)
    {
        Logger::Exit("Foobar::square_root(double)");
        return 0.0;
    }
    Logger::Exit("Foobar::square_root(double)");
    return (pow(value, 0.5));
}

Notice all of the extra work needed to capture the multiple exits of this function. This extra work is a symptom of poorly designed code.

In the days of structured programming, the notion of simplified subroutines was put forward. It stated that each subroutine ("function" or "method" in today's lingo) should have only one entry point and only one exit point. This rule seems to have been dropped.

At least the "only one exit point" portion of the rule. Modern-day languages allow only one entry point into a method, but they allow multiple exit points, and this lets us write poor code. A better (uninstrumented) version of the square root method is:

double Foobar::square_root(double value)
{
    double result = 0.0;

    if (is_rootable(value))
    {
        result = pow(value, 0.5);
    }

    return result;
}

bool Foobar::is_rootable(double value)
{
    return (value > 0.0);
}

This code is longer but more readable. Instrumenting it is less work, too.
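Worth noting: in C++ specifically, the mechanical burden of the repeated Exit() calls can also be reduced with an RAII guard whose destructor logs the exit on every return path. This is a sketch of my own, not from the logging class described above.

```cpp
#include <cmath>
#include <cstdio>

// RAII guard: logs entry in its constructor and exit in its
// destructor, so every return path is covered automatically.
class ScopeLogger
{
public:
    explicit ScopeLogger(const char *fn) : fn_(fn)
    {
        std::fprintf(stderr, "** Enter: %s\n", fn_);
    }
    ~ScopeLogger() { std::fprintf(stderr, "** Exit: %s\n", fn_); }

private:
    const char *fn_;
};

double square_root(double value)
{
    ScopeLogger guard("square_root(double)");
    if (value <= 0.0)
        return 0.0;          // exit is logged by the guard's destructor
    return std::pow(value, 0.5); // ...and here as well
}
```

The guard covers every exit mechanically, though it does not change the larger point: a function with many return statements is still a design smell worth noticing.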

One can visually examine the code for the "extra return" problem, but instrumenting the code with my logging class made the problems immediately visible.