in scala programming debugging ~ read.

Losing track (Part 1)

Code has bugs. If you are lucky you spend only half of your life as a coder on debugging code. Most kind of bugs are easy to find and simple to fix. However, there are kinds of bugs that are hidden deeply under a stack of indirection. In that case, you try to track it down using the hopefully helpful tools of your run time environment.

Failure modes may differ wildly depending on the way a program is written and which kind of execution engine is used to run it. There's a particular kind of bug I want to present here which I've encountered in several forms. The thing I want to examine is how "debuggability" can differ quite a lot given the way you wrote the program.

An obvious problem

Here's the problem in its simplest form:

def a: Int = b  
def b: Int = a

a  

This is what you get if you run this program:

java.lang.StackOverflowError  
  at example.Simple$.a(0simple.scala:4)
  at example.Simple$.b(0simple.scala:5)
  at example.Simple$.a(0simple.scala:4)
  at example.Simple$.b(0simple.scala:5)
  at example.Simple$.a(0simple.scala:4)
  at example.Simple$.b(0simple.scala:5)
  at example.Simple$.a(0simple.scala:4)
  at example.Simple$.b(0simple.scala:5)
  at example.Simple$.a(0simple.scala:4)
  at example.Simple$.b(0simple.scala:5)
  at example.Simple$.a(0simple.scala:4)

The problem is clear. What we have here is a simple calculation where two methods call each other leading to a cycle. I could have shown the even simpler form of def a: Int = a but I guess it's easier to extrapolate from two methods to the realistic case where the cycle will span lots of methods so the problem will not be as obvious any more.

Tail calls

In the simple case above, the program terminates with a StackOverflowException. But let's talk about tail-calls. A tail-call is a method call which is the last thing done before a method returns itself to its caller. This means it is a method call whose return value will also be the return value of the method that made the method call.

For example, given

def f(x: Int): Int = x + 12  
def g(): Int = f(5)

def main(): Unit = {  
  g()
}

the call to f will be a tail-call in g because nothing is left to be done in g when f returns.

def f(x: Int): Int = x + 12  
def h(): Int = 12 * f(5)  

It is not a tail-call in h, however, because the result of h still needs to be multiplied by 12 after f returns.

If a call is a tail-call some languages and run times can do tail-call elimination to remove the intermediate stack frame when tail-calling a method. This means that you call the new method but leave no trace of the method that was executing before.

Without tail-call elimination, while executing f the stack would look like this:

main -> g -> f  

With tail-call elimination, the stack goes directly from

main -> g  

to

main -> f  

so that it will afterwards directly return from f to main.

What's the advantage of doing tail-call optimization? Each stack frame needs memory on the stack so removing a frame will save stack space. This is especially important to prevent that recursive calls will eat up all stack space for deeply nested data structures. Its main reason is to avoid a StackOverflowException.

But think back to our original bug. How did it manifest itself? By throwing a StackOverflowException. So, tail-call elimination and bug-finding may work against each other.

In the original case above we got a helpful StackOverflowException because the JVM doesn't do tail-calls (yet?). However, the Scala compiler is clever enough to optimize recursive tail-calls. Let's see what happens if we rewrite the program so that the Scala compiler can apply its tail-call optimization.

A first complication

We now make the example slightly more complex by 1) bringing it in a form that allows tail-call elimination to happen and 2) which shows that the problem gets harder if the cycle is dynamic in nature.

def f(argument: Int): Int =  
  argument match {
    case 42 ⇒ f(argument - 19)
    case 23 ⇒ f(argument * 2 - 4)
    case x  ⇒ x + 5
  }

def main(args: Array[String]): Unit = {  
  f(0) // everything's fine
  f(42)
}

What's the result of this program?

In fact, this program runs only for its side-effect which is ... heating your home indefinitely. It will not terminate but will spin one of your CPU cores with 100%. To find out what's going on, let's try my number one tool of choice for debugging and profiling : jstack, the command line tool bundled with the JDK which allows getting a stack trace for another Java process.

I usually run jstack at least twice as a kind of poor man's sampling profiler. This way I get 2 samples of what the program is doing (which is already kind of a statistic) and it often is already enough to see what the program is (probably) spending most of its time with.

Let's do this:

> jps
17452 TailCalls

> jstack 17452
[...]
"main"
   java.lang.Thread.State: RUNNABLE
    at example.TailCalls$.f(1tailcalls.scala:7)
    at example.TailCalls$.main(1tailcalls.scala:13)
    at example.TailCalls.main(1tailcalls.scala)
[...]

> jstack 17452
[...]
"main"
   java.lang.Thread.State: RUNNABLE
    at example.TailCalls$.f(1tailcalls.scala:6)
    at example.TailCalls$.main(1tailcalls.scala:13)
    at example.TailCalls.main(1tailcalls.scala)
[...]

Here, we got two almost equal stacktraces. In both cases, the main thread runs function f. Can you spot the difference?

We were lucky here, because when taking the stack traces the program was executing two different lines (7 vs. 6) in the two cases. The lines directly pinpoint the issue because they correspond to the lines containing case 42 => and case 23 => respectively.

Again, we have a cycle that is only triggered if the method is called with the right argument. Also, we don't get a StackOverflowException anymore because in this case scalac's tail-call elimination kicked in.

Fortunately, finding the mistake here was still simple because the cycle could be observed in the stack trace by looking at the differing line numbers. In more practical cases it may be much harder because the stack trace may not directly point to different branches of a conditional expression (if there is one at all) or the hot lines (which are most likely to turn up in the stack trace) are in greater distance to the actual problem.

To summarize, the obstacles for debugging this issue are these:

  • the program does not fail but it never terminates
  • the stack trace does not show the argument to the f call
  • the stack trace does not show the complete "trace" from the beginning because tail-call elimination removed the stack frames when doing the recursive call

This shows that stack "traces" may not always be helpful. In this example, we have already seen that tail-call elimination removed parts of the trace.

I put "trace" in quotes because "trace" seems to imply that the program leaves a track from the program start to the current program position.

However, it is not the purpose of the stack to keep the history of a thread. Actually, it's quite the opposite: the primary purpose of the stack is to keep all the necessary state to continue the execution of the thread. That is, it contains return addresses and local variable values which are needed when execution returns from currently open calls. The stack contains the future of a thread.

In simple cases, the history and future of a thread luckily agree, so we get useful debugging information. In other cases, however, information that is not needed anymore for continuing the program, can and will be thrown away and that is exactly what tail-call elimination is doing here.

One reason, that the JVM still has no tail-calls is the related concern that meaningful stack traces are lost when tail calls are eliminated naively. As we have seen this is certainly a valid practical argument.

Tail call elimination is not the only reason stack traces may be not useful. You can always convert recursive algorithms into iterative ones and sometimes it is required to do that for algorithms to complete in bounded space. In those cases, a stack trace does not contain the necessary information to debug a code cycle issue because important state like iteration variable values is either already lost (i.e. overwritten in memory) or not easily accessible because the stack trace contains only static information and not dynamic information like local variable values.

But this is only the start. In part 2 of this post, I'll try to make things more interesting by introducing more complicated control flow (a popular complication these days is running things concurrently...).

I'll leave you spinning here for a while and will return shortly with the second part of this story.

Thanks for reading!

(If you dare so you may try to register yourself in my private discourse instance and leave a comment. Everything is quite fresh so please expect road bumps and occasional downtime.)

Continue reading with part 2.