
Losing track (part 2)

In part 1, I introduced the problem of debugging cyclic code using stack traces. We have seen that we already lose debugging information when tail-call optimization kicks in.

Let’s see what happens if the control flow isn’t linear but there are concurrently running threads of execution.

Let’s run concurrently

Here’s an example that uses reciprocally dependent lazy vals that are evaluated concurrently:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object A {
  lazy val a: Int = {
    Thread.sleep(100)
    B.b
  }
}
object B {
  lazy val b: Int = {
    Thread.sleep(100)
    A.a
  }
}

def main(args: Array[String]): Unit = {
  val fa = Future(A.a)
  val fb = Future(B.b)

  val sum: Future[Int] =
    for {
      a <- fa
      b <- fb
    } yield a + b

  println("Waiting for completion....")
  Thread.sleep(10000)
  sum.value match {
    case Some(x) => println(s"Got result: $x")
    case None    => println("Future was never completed.")
  }
}

Again, we’ve got the same bug: A.a depends on the value of B.b and vice versa. In the simplest case, we would just access A.a and see what happens. Mildly surprisingly, it would again fail with a StackOverflowError because the initializers of A.a and B.b would call each other recursively (huh, recursive evaluation of a val?). But this is not the case I’m interested in here.

Instead, we are interested in what happens if A.a and B.b are evaluated concurrently from different threads. To make that happen we use Future.apply to run the initial evaluation of those lazy vals on different threads. I also added the sleep calls to make it more likely that the lazy val initializers are actually executed concurrently.

The Future results of those computations are combined in a for-comprehension into a final Future value. We sleep in the main thread for 10 seconds and then check whether we got a result in the meantime.

Here’s what happens:

[info] Running example.CyclicLazyValsComplex 
Waiting for completion....
Future was never completed.

It turns out that we do not get any result (but also no CPU time is spent). What’s going on? Once again, jstack to the rescue, we find that indeed all threads are blocked:

> jps
7480 CyclicLazyValsComplex

> jstack 7480

"ForkJoinPool-1-worker-9"
   java.lang.Thread.State: BLOCKED (on object monitor)
	at example.CyclicLazyValsComplex$A$.a$lzycompute(3complexLazyVals.scala:10)
	- waiting to lock <0x00000007ade6c800> (a example.CyclicLazyValsComplex$A$)
	at example.CyclicLazyValsComplex$A$.a(3complexLazyVals.scala:10)
	at example.CyclicLazyValsComplex$B$.b$lzycompute(3complexLazyVals.scala:18)
	- locked <0x00000007adfb6e80> (a example.CyclicLazyValsComplex$B$)
	at example.CyclicLazyValsComplex$B$.b(3complexLazyVals.scala:16)
	[...]
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"ForkJoinPool-1-worker-13"
   java.lang.Thread.State: BLOCKED (on object monitor)
	at example.CyclicLazyValsComplex$B$.b$lzycompute(3complexLazyVals.scala:16)
	- waiting to lock <0x00000007adfb6e80> (a example.CyclicLazyValsComplex$B$)
	at example.CyclicLazyValsComplex$B$.b(3complexLazyVals.scala:16)
	at example.CyclicLazyValsComplex$A$.a$lzycompute(3complexLazyVals.scala:12)
	- locked <0x00000007ade6c800> (a example.CyclicLazyValsComplex$A$)
	at example.CyclicLazyValsComplex$A$.a(3complexLazyVals.scala:10)
	[...]
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"main" prio=10 tid=0x00007f6ebc00a000 nid=0x1d3d waiting on condition [0x00007f6ec3e6a000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at example.CyclicLazyValsComplex$.main(3complexLazyVals.scala:33)
	at example.CyclicLazyValsComplex.main(3complexLazyVals.scala)

[...]

Found one Java-level deadlock:
=============================
"ForkJoinPool-1-worker-9":
  waiting to lock monitor 0x00007f6e74006218 (object 0x00000007ade6c800, a example.CyclicLazyValsComplex$A$),
  which is held by "ForkJoinPool-1-worker-13"
"ForkJoinPool-1-worker-13":
  waiting to lock monitor 0x00007f6e60003c88 (object 0x00000007adfb6e80, a example.CyclicLazyValsComplex$B$),
  which is held by "ForkJoinPool-1-worker-9"

Java stack information for the threads listed above:
===================================================
"ForkJoinPool-1-worker-9":
	at example.CyclicLazyValsComplex$A$.a$lzycompute(3complexLazyVals.scala:10)
	- waiting to lock <0x00000007ade6c800> (a example.CyclicLazyValsComplex$A$)
	at example.CyclicLazyValsComplex$A$.a(3complexLazyVals.scala:10)
	at example.CyclicLazyValsComplex$B$.b$lzycompute(3complexLazyVals.scala:18)
	- locked <0x00000007adfb6e80> (a example.CyclicLazyValsComplex$B$)
	at example.CyclicLazyValsComplex$B$.b(3complexLazyVals.scala:16)
	[...]
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
"ForkJoinPool-1-worker-13":
	at example.CyclicLazyValsComplex$B$.b$lzycompute(3complexLazyVals.scala:16)
	- waiting to lock <0x00000007adfb6e80> (a example.CyclicLazyValsComplex$B$)
	at example.CyclicLazyValsComplex$B$.b(3complexLazyVals.scala:16)
	at example.CyclicLazyValsComplex$A$.a$lzycompute(3complexLazyVals.scala:12)
	- locked <0x00000007ade6c800> (a example.CyclicLazyValsComplex$A$)
	at example.CyclicLazyValsComplex$A$.a(3complexLazyVals.scala:10)
    [...]
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Found 1 deadlock.

Very helpfully, in this case, the program ran itself into a stuck state which is again observable from stack traces, and the JVM even provides a nice deadlock analysis that actually points to the problem: we have two lazily evaluated values that reciprocally need to access each other to initialize themselves. Luckily, access was guarded by locks which again show up in the stack trace, so it was easy to figure out what the problem was.

It may not be clear why this is a realistic example. A lazy val is a mechanism often used to implement a cache: a computation is delayed until its result is first needed. A subsequent access waits for the completion of the initial computation, if necessary, and then returns the result.
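
Here’s a minimal sketch of that pattern (the object and value names are made up for illustration):

object ExpensiveConfig {
  // computed at most once, on first access; the lock taken during
  // initialization makes concurrent accesses wait until the value is ready
  lazy val settings: Map[String, String] = {
    Thread.sleep(1000) // stand-in for an expensive computation
    Map("host" -> "localhost", "port" -> "8080")
  }
}

// the first access pays the cost, subsequent accesses return the cached value
ExpensiveConfig.settings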

This is a useful pattern that is used in some form or other quite often. In more practical cases than this one the cycle may be much less obvious, but you may still run into a similar deadlock. Unfortunately, these kinds of situations are more likely to happen under load, when threads are preempted to do other work, which introduces delays similar to the artificial delays introduced by the sleep(100) lines.

As seen here, caching based on lazy vals relies on blocking calls to wait for the result of the computation. Nowadays it’s more fashionable (and more reasonable, of course) to use an asynchronous API that lets you subscribe to the result of a lazy computation. That way you are notified about the completion of a computation without blocking a thread.
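
With Scala’s Future API, such a subscription looks roughly like this (a minimal sketch):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val result: Future[Int] = Future {
  Thread.sleep(1000) // stand-in for a lazy, expensive computation
  42
}
// subscribe to the eventual result; no thread is blocked while waiting
result.onComplete(res => println(s"Got $res"))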

A Future-based cache

spray-caching contains a clever caching solution (introduced to spray by Mathias a few years ago) that doesn’t store the result of a computation but the eventual result of the computation, i.e. a Future of the result. This is ingenious because it allows putting computations into the cache that are not yet complete. It sidesteps the so-called thundering herd problem in a very elegant way. The thundering herd problem describes what happens when lots of similar requests arrive at a cold cache (e.g. after a server restart). A simplistic cache may not notice that it has already started computing the value for a key and may spawn redundant calculations for the same key until finally one of those redundant calculations puts a value into the cache. In the worst case this may lead to another breakdown of the server because of too much load.
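
For contrast, here’s a minimal sketch of such a simplistic cache (the class name and structure are mine, not spray-caching’s). Between the lookup and the insertion, every concurrent caller for the same key misses and starts its own computation:

import scala.collection.concurrent.TrieMap
import scala.concurrent.{ ExecutionContext, Future }

// stores only *completed* results and is thus vulnerable to the thundering herd
class NaiveCache(f: Int => Int)(implicit ec: ExecutionContext) {
  private val cache = TrieMap.empty[Int, Int]

  def apply(key: Int): Future[Int] =
    cache.get(key) match {
      case Some(value) => Future.successful(value) // cache hit
      case None => // cache miss: all concurrent callers for `key` land here
        Future {
          val value = f(key)
          cache.put(key, value) // only the finished result is cached
          value
        }
    }
}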

The Future-based solution, on the other hand, can immediately put a result Future into the cache, to which subsequent requests for the same key can subscribe. This prevents redundant computations and also keeps the cache unconcerned with subscriptions to an ongoing computation because that is naturally handled by the Future.

Here’s a super simple implementation of such a cache:

import scala.concurrent.{ ExecutionContext, Future, Promise }

object Cache {
  def apply(f: Int => Future[Int])(implicit ec: ExecutionContext): Int => Future[Int] = {
    var cache = Map.empty[Int, Future[Int]]
    val lock = new Object

    key =>
      lock.synchronized {
        cache.get(key) match {
          // cache hit
          case Some(result) => result
          // cache miss
          case None =>
            val result = Promise[Int]()
            Future { // run `f` concurrently
              result.completeWith(f(key))
            }
            cache += key -> result.future
            result.future
        }
      }
  }
}

The Cache constructor takes a function Int => Future[Int] and returns a function of the same type. The resulting function runs the underlying function only once per key and puts the resulting Future into a Map. Calls with the same key will instantly return the cached Future.

A more practical implementation would use something more scalable than a simple synchronized block; spray-caching uses a ConcurrentLinkedHashMap, for instance. It would also need to enforce some limits on cache size, etc. This simple implementation, though, should serve its demonstration purpose here.

Let’s put it into use:

object CyclicFutureCache {
  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global

  val f: Int => Future[Int] = Cache(calculateF)
  // the underlying heavy computation
  def calculateF(i: Int): Future[Int] =
    i match {
      case 42 => f(i - 19)
      case 23 => f(i * 2 - 4)
      case x  => Future.successful(123)
    }

  def main(args: Array[String]): Unit = {
    val sum: Future[Int] =
      for {
        aRes <- f(23)
        bRes <- f(42)
      } yield aRes + bRes

    println("Waiting for completion....")

    sum.onComplete { res =>
      println(s"Program was completed with $res")
      sys.exit()
    }

    // sleep while waiting for completion
    while (true) Thread.sleep(1)
  }
}

We have an expensive recursive function calculateF (similar to the one shown in part 1). For efficiency reasons we want to cache the results of this function, so we put the Cache around it, which gives us a new, cached function f. This is also what we call for the recursive invocations. Again, we have introduced a cyclic dependency bug similar to the ones seen in the previous examples.

Here’s the output of the program:

Waiting for completion....

Once again, the futures are never completed and the program is stuck. Let’s see what jstack says:

> jps
7918 CyclicFutures

> jstack 7918

"ForkJoinPool-1-worker-13" #10 daemon prio=5 os_prio=0 tid=0x00007f60642a3000 nid=0x1ee9 waiting on condition [0x00007f601b2d3000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x000000076db30338> (a scala.concurrent.forkjoin.ForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

[...]

"main" #1 prio=5 os_prio=0 tid=0x00007f606400a000 nid=0x1ebb waiting on condition [0x00007f606a856000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at example.CyclicFutureCache$.main(4cyclicFutures.scala:54)
	at example.CyclicFutureCache.main(4cyclicFutures.scala)

[...]

This time we are completely left in the dark. Two threads are just waiting and there’s no sign on the stack that a computation is not yet complete. The program is stuck but we cannot see why.

What’s going on? Remember that a stack trace is a representation of the future of a thread or, in other words, the stack represents the continuation of a thread. The point of an asynchronous API like the Future-based one from Scala, however, is to detach the continuation, the “what to do after a computation is complete”, from the thread which starts or runs the original computation.

Therefore, no thread stack contains any trace of the continuation. Instead, the continuation is explicitly managed on the heap: when you use onComplete (or any of the other Future combinators) you pass a function value to be invoked when the original computation is complete. This function value is put into a list of callbacks kept by the Promise that implements the Future.
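
To make that concrete, here’s a toy model of the idea (not scala.concurrent’s actual implementation): the continuation lives in a callback list on the heap, invisible to any stack trace:

class ToyPromise[T] {
  private var result: Option[T] = None
  private var callbacks: List[T => Unit] = Nil

  def onComplete(callback: T => Unit): Unit = synchronized {
    result match {
      case Some(value) => callback(value)        // already completed: run now
      case None        => callbacks ::= callback // pending: remember on the heap
    }
  }

  def success(value: T): Unit = synchronized {
    result = Some(value)
    callbacks.foreach(_.apply(value)) // run the stored continuations
    callbacks = Nil
  }

  // subscribe this promise to the eventual result of another one
  def completeWith(other: ToyPromise[T]): Unit = other.onComplete(success)
}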

In our case, the cache created two Promises, one for the result of f(23) and one for the result of f(42). It then called promise.completeWith(...) which makes the promise subscribe to the Future given as an argument. The call sequence was roughly like this:

val f23Promise = Promise[Int]()
val f42Promise = Promise[Int]()
f23Promise.completeWith(f42Promise.future)
f42Promise.completeWith(f23Promise.future)

So both Promises subscribed to each other, but neither could make any progress.

In summary, when working with Futures, we neither keep track of the history of the computation, nor can we easily find out what the future of the computation is, because we cannot inspect the list of still-waiting callbacks. Also, a Future consumer usually doesn’t track what it is currently waiting on. Instead, it just registers a callback with the producing Future. So there’s a reference from the producer to the consumer but not the other way round.

I don’t have any ready-made solutions at this point. What is needed is better support for tracing Future call chains and execution. At some point I started the better-future-exceptions experiment to see what it would take to improve the Future tracing infrastructure. akka-stream faced similar problems, and @drewhk implemented a graph output showing who waits for what in the streaming scenario. It would be cool if we had something like this for Futures as well. If you have ideas, please share them in the comments!

You can get the examples from this post in a project at github. They also include an example I have not discussed here which replicates a lazy calculation with actors, but there is not much insight to be gained compared to the Future example.

Thanks for reading.

Losing track (part 1)

Code has bugs. If you are lucky you spend only half of your life as a coder debugging code. Most kinds of bugs are easy to find and simple to fix. However, there are kinds of bugs that are hidden deep under a stack of indirection. In that case, you try to track them down using the hopefully helpful tools of your runtime environment.

Failure modes may differ wildly depending on the way a program is written and which kind of execution engine is used to run it. There’s a particular kind of bug I want to present here which I’ve encountered in several forms. The thing I want to examine is how “debuggability” can differ quite a lot depending on the way you wrote the program.

An obvious problem

Here’s the problem in its simplest form:

def a: Int = b
def b: Int = a

a

This is what you get if you run this program:

java.lang.StackOverflowError
  at example.Simple$.a(0simple.scala:4)
  at example.Simple$.b(0simple.scala:5)
  at example.Simple$.a(0simple.scala:4)
  at example.Simple$.b(0simple.scala:5)
  at example.Simple$.a(0simple.scala:4)
  at example.Simple$.b(0simple.scala:5)
  at example.Simple$.a(0simple.scala:4)
  at example.Simple$.b(0simple.scala:5)
  at example.Simple$.a(0simple.scala:4)
  at example.Simple$.b(0simple.scala:5)
  at example.Simple$.a(0simple.scala:4)

The problem is clear. What we have here is a simple calculation where two methods call each other, leading to a cycle. I could have shown the even simpler form of def a: Int = a, but I guess it’s easier to extrapolate from two methods to the realistic case where the cycle spans lots of methods and the problem is not as obvious anymore.

Tail calls

In the simple case above, the program terminates with a StackOverflowError. But let’s talk about tail-calls. A tail-call is a method call which is the last thing a method does before it returns to its caller. This means it is a method call whose return value will also be the return value of the calling method.

For example, given

def f(x: Int): Int = x + 12
def g(): Int = f(5)

def main(): Unit = {
  g()
}

the call to f will be a tail-call in g because nothing is left to be done in g when f returns.

def f(x: Int): Int = x + 12
def h(): Int = 12 * f(5)

It is not a tail-call in h, however, because the result of f still needs to be multiplied by 12 after f returns.

If a call is a tail-call, some languages and runtimes can do tail-call elimination, removing the intermediate stack frame when tail-calling a method. This means that you call the new method but leave no trace of the method that was executing before.

Without tail-call elimination, while executing f the stack would look like this:

main -> g -> f

With tail-call elimination, the stack goes directly from

main -> g

to

main -> f

so that it will afterwards directly return from f to main.

What’s the advantage of doing tail-call elimination? Each stack frame needs memory on the stack, so removing a frame saves stack space. This is especially important to prevent recursive calls from eating up all stack space on deeply nested data structures. The main reason for it is to avoid a StackOverflowError.
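
In Scala you can make that expectation explicit with the @tailrec annotation. Here’s a small made-up example where scalac turns the recursion into a loop that reuses a single stack frame:

import scala.annotation.tailrec

@tailrec
def sum(xs: List[Int], acc: Int = 0): Int =
  xs match {
    case Nil          => acc
    case head :: tail => sum(tail, acc + head) // tail position: frame is reused
  }

sum(List.fill(1000000)(1)) // fine with elimination, would overflow without it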

But think back to our original bug. How did it manifest itself? By throwing a StackOverflowError. So tail-call elimination and bug-finding may work against each other.

In the original case above we got a helpful StackOverflowError because the JVM doesn’t do tail-call elimination (yet?). However, the Scala compiler is clever enough to optimize self-recursive tail-calls. Let’s see what happens if we rewrite the program so that the Scala compiler can apply its tail-call optimization.

A first complication

We now make the example slightly more complex by 1) bringing it into a form that allows tail-call elimination to happen and 2) making the cycle dynamic in nature, which shows that the problem gets harder.

def f(argument: Int): Int =
  argument match {
    case 42 => f(argument - 19)
    case 23 => f(argument * 2 - 4)
    case x  => x + 5
  }

def main(args: Array[String]): Unit = {
  f(0) // everything's fine
  f(42)
}

What’s the result of this program?

In fact, this program runs only for its side effect, which is … heating your home indefinitely. It will not terminate but will spin one of your CPU cores at 100%. To find out what’s going on, let’s try my number one tool of choice for debugging and profiling: jstack, the command line tool bundled with the JDK which gets a stack trace from another running Java process.

I usually run jstack at least twice as a kind of poor man’s sampling profiler. This way I get two samples of what the program is doing (which is already kind of a statistic), and that is often enough to see what the program is (probably) spending most of its time on.

Let’s do this:

> jps
17452 TailCalls

> jstack 17452
[...]
"main"
   java.lang.Thread.State: RUNNABLE
	at example.TailCalls$.f(1tailcalls.scala:7)
	at example.TailCalls$.main(1tailcalls.scala:13)
	at example.TailCalls.main(1tailcalls.scala)
[...]

> jstack 17452
[...]
"main"
   java.lang.Thread.State: RUNNABLE
	at example.TailCalls$.f(1tailcalls.scala:6)
	at example.TailCalls$.main(1tailcalls.scala:13)
	at example.TailCalls.main(1tailcalls.scala)
[...]

Here, we got two almost identical stack traces. In both cases, the main thread runs function f. Can you spot the difference?

We were lucky here, because when the stack traces were taken the program was executing two different lines (7 vs. 6). The lines directly pinpoint the issue because they correspond to the lines containing case 42 => and case 23 => respectively.

Again, we have a cycle that is only triggered if the method is called with the right argument. Also, we don’t get a StackOverflowError anymore because this time scalac’s tail-call elimination kicked in.

Fortunately, finding the mistake here was still simple because the cycle could be observed in the stack trace by looking at the differing line numbers. In more practical cases it may be much harder because the stack trace may not directly point to different branches of a conditional expression (if there is one at all), or the hot lines (which are most likely to turn up in the stack trace) may be at a greater distance from the actual problem.

To summarize, the obstacles for debugging this issue are these:

  • the program does not fail but it never terminates
  • the stack trace does not show the argument to the f call
  • the stack trace does not show the complete “trace” from the beginning because tail-call elimination removed the stack frames when doing the recursive call

This shows that stack “traces” may not always be helpful. In this example, we have already seen that tail-call elimination removed parts of the trace.

I put “trace” in quotes because “trace” seems to imply that the program leaves a track from the program start to the current program position.

However, it is not the purpose of the stack to keep the history of a thread. Actually, it’s quite the opposite: the primary purpose of the stack is to keep all the necessary state to continue the execution of the thread. That is, it contains return addresses and local variable values which are needed when execution returns from currently open calls. The stack contains the future of a thread.

In simple cases, the history and the future of a thread luckily agree, so we get useful debugging information. In other cases, however, information that is not needed anymore for continuing the program can and will be thrown away, and that is exactly what tail-call elimination does here.

One reason the JVM still has no tail-calls is the related concern that meaningful stack traces are lost when tail-calls are eliminated naively. As we have seen, this is certainly a valid practical argument.

Tail-call elimination is not the only reason stack traces may not be useful. You can always convert recursive algorithms into iterative ones, and sometimes that is required for an algorithm to complete in bounded space. In those cases, a stack trace does not contain the information necessary to debug a cyclic-code issue because important state like iteration variable values is either already lost (i.e. overwritten in memory) or not easily accessible, since the stack trace contains only static information and not dynamic information like local variable values.
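
As a sketch, here’s the cyclic f from the earlier example rewritten iteratively. A stack trace taken while the loop spins shows a single frame, and each previous value of the iteration variable has already been overwritten:

def fIterative(argument: Int): Int = {
  var current = argument
  // the recursion turned into a loop; `current` is overwritten on every pass,
  // so the “history” of the computation is lost immediately
  while (current == 42 || current == 23)
    current = if (current == 42) current - 19 else current * 2 - 4
  current + 5
}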

But this is only the start. In part 2 of this post, I’ll try to make things more interesting by introducing more complicated control flow (a popular complication these days is running things concurrently…).

I’ll leave you spinning here for a while and will return shortly with the second part of this story.

Thanks for reading!

(If you dare, you may try to register yourself in my private discourse instance and leave a comment. Everything is quite fresh, so please expect road bumps and occasional downtime.)
