Thursday, June 4, 2009

Grok around the clock

IBM has a product called "System Grokker", which has a lot of interesting ramifications.

The "Grokker" is a tool that analyzes source code, picks it apart, and presents higher level views. You can read more about it on the IBM web site.

This is similar to a project I was working on, for pretty much the same objective: a better understanding of code by viewing abstract models generated from the code. We built a small (and clumsy) analysis tool that parsed source code, identified classes, found references to classes, and built an affinity graph showing the relationships between classes.
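To make the idea concrete, here is a minimal sketch in Python of building such a graph. It is an illustration, not the tool we built: it assumes one class per source file, it takes a hypothetical dictionary of class name to source text, and it matches identifier tokens instead of doing a real parse, so a class name that appears in a comment or string literal would count as a reference.

```python
import re

# Matches identifier-like tokens; a crude stand-in for a real parser.
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")

def build_affinity_graph(sources):
    """Map each class name to the set of other known classes its source mentions.

    `sources` is a hypothetical dict of class name -> source text,
    assuming one class per source file.
    """
    known = set(sources)
    graph = {}
    for name, text in sources.items():
        tokens = set(IDENTIFIER.findall(text))
        # Keep only tokens that name a known class, and drop self-references.
        graph[name] = (tokens & known) - {name}
    return graph

# Made-up example input, not taken from any real code base.
sources = {
    "Order": "class Order { Customer owner; List<Item> items; }",
    "Customer": "class Customer { String name; }",
    "Item": "class Item { Order parent; }",
}
graph = build_affinity_graph(sources)
for name in sorted(graph):
    print(name, "->", sorted(graph[name]))
```

Types that are not in the input, such as String and List in the example, are ignored, so the graph contains only classes from the code base being analyzed.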

We limited our tool to C, C++, C#, and Java. The parsing task was fairly easy, although each language needed its own parser. The C++ parser was the hardest, which says something about the language. (Templates were the worst part of it, and we ignored generics in C# and Java. We could get away with that -- the code base did not use them.)

The IBM product creates abstract views of the code. An interesting idea, and well worth examining. The ability to view the system "from a height" is important and underrated. Systems are too large and too complex to view in chunks of twenty lines at a time. An abstract view lets one pull back from the detailed code and see a bigger picture.

But simply pulling back was not all we could do. We built a tree of dependencies (or maybe the term is "graph"). I had to put some "breakers" in place, since the graph can be cyclic (class A refers to class B, which refers to class C, which refers to A). But once I had de-cycled the graph, I could calculate the "threat" and "vulnerability" of classes, based on their position in the graph. Classes with no dependencies were not vulnerable to problems from other classes; classes at the top of the graph were vulnerable to changes not just within the class itself but also in their immediate supporting classes and their indirect supporting classes.
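One plausible way to score vulnerability is to count the classes that a class depends on, directly or indirectly. The Python sketch below assumes that definition, which is my reading of the description above and not a record of the original calculation. It also swaps the "breakers" for a visited set, which keeps the traversal from looping on a cycle without removing any edges. The function names and the example graph are made up.

```python
def reachable(graph, start):
    """Return every class reachable from `start`, excluding `start` itself."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; this is what stops us looping on a cycle
        seen.add(node)
        stack.extend(graph.get(node, ()))
    seen.discard(start)  # a cycle can lead back to the starting class
    return seen

def vulnerability(graph):
    """Score each class by the number of direct and indirect supporting classes."""
    return {cls: len(reachable(graph, cls)) for cls in graph}

# Hypothetical graph: A -> B -> C -> A is a cycle, D sits on top, E stands alone.
graph = {"A": {"B"}, "B": {"C"}, "C": {"A"}, "D": {"A"}, "E": set()}
scores = vulnerability(graph)
for cls in sorted(scores, key=lambda c: (-scores[c], c)):
    print(cls, scores[cls])
```

In this example D scores highest, since it depends on everything in the cycle, and E scores zero because it depends on nothing.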

We inverted the graph and, running a similar analysis, created a list of "threatening" classes. That is, we identified the classes that were most referenced, the ones that supported the most classes.
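The inversion can be sketched in a few lines of Python. As before, this is an assumption-laden illustration: it scores "threat" as the number of classes that depend on a class, directly or indirectly, and the names and example graph are hypothetical.

```python
def invert(graph):
    """Flip every edge: map each class to the set of classes that refer to it."""
    inverted = {cls: set() for cls in graph}
    for cls, deps in graph.items():
        for dep in deps:
            inverted.setdefault(dep, set()).add(cls)
    return inverted

def count_reachable(graph, start):
    """Count the classes reachable from `start`, not counting `start` itself."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    seen.discard(start)
    return len(seen)

def threat(graph):
    """Score each class by how many classes it supports, directly or indirectly."""
    dependents = invert(graph)
    return {cls: count_reachable(dependents, cls) for cls in dependents}

# Hypothetical graph: Report uses Order and Formatter; Order uses Customer.
graph = {
    "Report": {"Order", "Formatter"},
    "Order": {"Customer"},
    "Customer": set(),
    "Formatter": set(),
}
scores = threat(graph)
for cls in sorted(scores, key=lambda c: (-scores[c], c)):
    print(cls, scores[cls])
```

Customer comes out as the most threatening class here: a change to it can ripple into Order and then into Report.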

With the list of vulnerable and threatening classes, we could focus our efforts on redesign. We also used the list to guide our testing effort, putting more effort on the problem classes.

We also generated a list of cyclic dependencies. We used the list as an aid when re-thinking our design.
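A list like that can come from a depth-first search that watches for references back to a class still on the current path. The Python sketch below is one simple way to do it, not necessarily how our tool did it. It reports one cycle per back edge, so in a heavily tangled graph it will not enumerate every possible cycle. The example graph is made up.

```python
def find_cycles(graph):
    """Return one cycle, as a list of class names, for each back edge found."""
    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {cls: UNSEEN for cls in graph}
    path = []    # classes on the current depth-first path
    cycles = []

    def visit(cls):
        state[cls] = ON_PATH
        path.append(cls)
        for dep in sorted(graph.get(cls, ())):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == ON_PATH:
                # `dep` is already on the path, so the path from it closes a cycle.
                cycles.append(path[path.index(dep):] + [dep])
            elif dep_state == UNSEEN:
                visit(dep)
        path.pop()
        state[cls] = DONE

    for cls in sorted(graph):
        if state[cls] == UNSEEN:
            visit(cls)
    return cycles

# Hypothetical graph: A refers to B, B to C, and C back to A; D only uses A.
graph = {"A": {"B"}, "B": {"C"}, "C": {"A"}, "D": {"A"}}
for cycle in find_cycles(graph):
    print(" -> ".join(cycle))  # prints "A -> B -> C -> A"
```

The recursion is fine for a sketch, but a large code base could exceed Python's recursion limit, so a real tool would likely use an explicit stack.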

But all was not wonderful. The big challenge looming on the horizon is dynamically typed languages. Perl, Python, and Ruby don't have the syntactic clues for this kind of analysis. The analysis works for statically typed languages, where the language provides the information that one needs to build the models. Analyzing dynamically typed languages will require a different method of collecting class references, possibly through run-time analysis.
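As a rough illustration of what run-time collection might look like, the Python sketch below inspects live objects and records which classes their attributes point to. This is speculation on my part, not something we built. It assumes you can get hold of a representative set of live objects, it sees only references held in instance attributes at that moment, and the function and class names are invented for the example.

```python
def runtime_references(objects):
    """Build a class-reference graph from the attributes of live objects."""
    # Only classes with at least one observed instance count as "known".
    known = {type(obj).__name__ for obj in objects}
    graph = {name: set() for name in known}
    for obj in objects:
        owner = type(obj).__name__
        # Objects without a __dict__ (for example, those using __slots__) are skipped.
        for value in getattr(obj, "__dict__", {}).values():
            ref = type(value).__name__
            if ref in known and ref != owner:
                graph[owner].add(ref)
    return graph

# Made-up classes with no type declarations for a static tool to read.
class Customer:
    def __init__(self, name):
        self.name = name

class Order:
    def __init__(self, owner):
        self.owner = owner

ann = Customer("Ann")
order = Order(ann)
graph = runtime_references([ann, order])
for name in sorted(graph):
    print(name, "->", sorted(graph[name]))
```

The obvious weakness is coverage: a reference that never occurs during the observed run never shows up in the graph, which is exactly the trade-off that static parsing avoids.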

I wish the folks at IBM luck with their project. It is a bold step forward, one that is needed for our craft to improve.

Say, do you think that they are hiring?
