
Losing Track (Part 2)

In part 1, I introduced the problem of debugging cyclic code using stack traces. We saw that we already lose debugging information when tail call optimization kicks in.

Let's see what happens if the control flow isn't linear but there are concurrently running threads of execution.

Let's run concurrently

Here's an example that uses reciprocally dependent lazy vals that are evaluated concurrently:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object CyclicLazyValsComplex {
  object A {
    lazy val a: Int = {
      Thread.sleep(100)
      B.b
    }
  }
  object B {
    lazy val b: Int = {
      Thread.sleep(100)
      A.a
    }
  }

  def main(args: Array[String]): Unit = {
    val fa = Future(A.a)
    val fb = Future(B.b)

    val sum: Future[Int] =
      for {
        a ← fa
        b ← fb
      } yield a + b

    println("Waiting for completion....")
    Thread.sleep(10000)
    sum.value match {
      case Some(x) ⇒ println(s"Got result: $x")
      case None    ⇒ println("Future was never completed.")
    }
  }
}

Again, we've got the same bug: A.a depends on the value of B.b and vice versa. In the simplest case, we would just access A.a and see what happens. Mildly surprisingly, it would again fail with a StackOverflowError because the initializers of A.a and B.b would call each other recursively (huh, recursive evaluation of a val?). But this is not the case I'm interested in here.
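For completeness, that single-threaded failure mode is easy to reproduce. Here's a minimal standalone sketch (with Scala 2's lazy val encoding, the same-thread reentrant access recurses until the stack is exhausted):

```scala
// single-threaded sketch: accessing one of two cyclic lazy vals blows the stack
// (a standalone snippet for illustration, separate from the concurrent example)
object A { lazy val a: Int = B.b }
object B { lazy val b: Int = A.a }

try println(A.a)
catch { case _: StackOverflowError => println("caught a StackOverflowError") }
```

Because the initializer never completes, the lazy val is not marked as initialized and every further access fails the same way.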

Instead, we are interested in what happens if A.a and B.b are both evaluated concurrently from different threads. To make that happen we use Future.apply to run the initial evaluation of those lazy vals on different threads. I also added the sleep calls to make it more likely that the lazy val initializers are actually executed concurrently.

The future results of those computations are combined in a for-comprehension into a final future value. We sleep in the main thread for 10 seconds and then check whether we got a result in the meantime.

Here's what happens:

[info] Running example.CyclicLazyValsComplex 
Waiting for completion....  
Future was never completed.  

It turns out that we do not get any result (but also no CPU time is spent). What's going on? Once again with jstack to the rescue, we find that indeed all threads are blocked:

> jps
7480 CyclicLazyValsComplex

> jstack 7480

"ForkJoinPool-1-worker-9"
   java.lang.Thread.State: BLOCKED (on object monitor)
    at example.CyclicLazyValsComplex$A$.a$lzycompute(3complexLazyVals.scala:10)
    - waiting to lock <0x00000007ade6c800> (a example.CyclicLazyValsComplex$A$)
    at example.CyclicLazyValsComplex$A$.a(3complexLazyVals.scala:10)
    at example.CyclicLazyValsComplex$B$.b$lzycompute(3complexLazyVals.scala:18)
    - locked <0x00000007adfb6e80> (a example.CyclicLazyValsComplex$B$)
    at example.CyclicLazyValsComplex$B$.b(3complexLazyVals.scala:16)
    [...]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"ForkJoinPool-1-worker-13"
   java.lang.Thread.State: BLOCKED (on object monitor)
    at example.CyclicLazyValsComplex$B$.b$lzycompute(3complexLazyVals.scala:16)
    - waiting to lock <0x00000007adfb6e80> (a example.CyclicLazyValsComplex$B$)
    at example.CyclicLazyValsComplex$B$.b(3complexLazyVals.scala:16)
    at example.CyclicLazyValsComplex$A$.a$lzycompute(3complexLazyVals.scala:12)
    - locked <0x00000007ade6c800> (a example.CyclicLazyValsComplex$A$)
    at example.CyclicLazyValsComplex$A$.a(3complexLazyVals.scala:10)
    [...]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"main" prio=10 tid=0x00007f6ebc00a000 nid=0x1d3d waiting on condition [0x00007f6ec3e6a000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at example.CyclicLazyValsComplex$.main(3complexLazyVals.scala:33)
    at example.CyclicLazyValsComplex.main(3complexLazyVals.scala)

[...]

Found one Java-level deadlock:  
=============================
"ForkJoinPool-1-worker-9":
  waiting to lock monitor 0x00007f6e74006218 (object 0x00000007ade6c800, a example.CyclicLazyValsComplex$A$),
  which is held by "ForkJoinPool-1-worker-13"
"ForkJoinPool-1-worker-13":
  waiting to lock monitor 0x00007f6e60003c88 (object 0x00000007adfb6e80, a example.CyclicLazyValsComplex$B$),
  which is held by "ForkJoinPool-1-worker-9"

Java stack information for the threads listed above:  
===================================================
"ForkJoinPool-1-worker-9":
    at example.CyclicLazyValsComplex$A$.a$lzycompute(3complexLazyVals.scala:10)
    - waiting to lock <0x00000007ade6c800> (a example.CyclicLazyValsComplex$A$)
    at example.CyclicLazyValsComplex$A$.a(3complexLazyVals.scala:10)
    at example.CyclicLazyValsComplex$B$.b$lzycompute(3complexLazyVals.scala:18)
    - locked <0x00000007adfb6e80> (a example.CyclicLazyValsComplex$B$)
    at example.CyclicLazyValsComplex$B$.b(3complexLazyVals.scala:16)
    [...]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
"ForkJoinPool-1-worker-13":
    at example.CyclicLazyValsComplex$B$.b$lzycompute(3complexLazyVals.scala:16)
    - waiting to lock <0x00000007adfb6e80> (a example.CyclicLazyValsComplex$B$)
    at example.CyclicLazyValsComplex$B$.b(3complexLazyVals.scala:16)
    at example.CyclicLazyValsComplex$A$.a$lzycompute(3complexLazyVals.scala:12)
    - locked <0x00000007ade6c800> (a example.CyclicLazyValsComplex$A$)
    at example.CyclicLazyValsComplex$A$.a(3complexLazyVals.scala:10)
    [...]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Found 1 deadlock.  

Very helpfully, in this case, the program ran itself into a stuck state that is again observable from stack traces, and the JVM even provides a nice deadlock analysis that points right at the problem: we have two lazily evaluated values that reciprocally need to access each other to initialize themselves. Luckily, access was guarded by locks, which is again visible in the stack trace, so it was easy to figure out what the problem was.

It may not be clear why this is a realistic example. A lazy val is a mechanism often used to implement a cache. A computation is delayed until its result is first needed. A subsequent access waits for the completion of the initial computation, if necessary, and then returns the result.

This is a useful pattern that is used in some form or other quite often. In more practical cases than this one the cycle may be much less obvious, but you may still run into a similar deadlock. Unfortunately, these kinds of situations are more likely to happen under load, when threads are preempted to do other work, which introduces delays similar to the artificial delays introduced by the sleep(100) lines.
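The caching pattern described above can be sketched in a few lines (ExpensiveStep is a hypothetical name, the sleep stands in for real work):

```scala
// lazy val as a blocking cache: the first access runs the computation,
// every later access (from any thread) reuses the stored result
object ExpensiveStep {
  lazy val result: Int = {
    println("computing once...")
    Thread.sleep(100) // stand-in for the heavy computation
    42
  }
}

println(ExpensiveStep.result) // triggers the computation, prints 42
println(ExpensiveStep.result) // served from the initialized field, prints 42
```

A second thread that accesses `ExpensiveStep.result` while the initializer is still running blocks on the object's monitor until it finishes, which is exactly the blocking behavior discussed next.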

As seen here, caching based on lazy vals relies on blocking calls to wait for the result of the computation. Nowadays it's more fashionable (and more reasonable, of course) to use an asynchronous API that lets you subscribe to the result of a lazy computation, so you can be notified about its completion without blocking a thread.

A Future-based cache

spray-caching contains a clever caching solution (introduced to spray by Mathias a few years ago) that doesn't store the results of a computation but rather the eventual result of the computation, i.e. a Future of the result. This is ingenious because it allows putting computations into the cache that are not yet complete. It sidesteps the so-called thundering herd problem in a very elegant way. The thundering herd problem describes what happens when lots of similar requests arrive at a cold cache (e.g. after a server restart): a simplistic cache may not note that it has already started computing the value for a key and may spawn redundant calculations for the same key until finally one of them puts a value into the cache. In the worst case this may lead to another breakdown of a server because of too much load.

The Future-based solution, on the other hand, can immediately put a result Future into the cache, to which subsequent requests for the same key can subscribe. This prevents redundant computations and also keeps the cache unconcerned with subscribing to an ongoing computation because that is naturally handled by the Future.

Here's a super simple implementation of such a cache:

import scala.concurrent.{ExecutionContext, Future, Promise}

object Cache {
  def apply(f: Int ⇒ Future[Int])(implicit ec: ExecutionContext): Int ⇒ Future[Int] = {
    var cache = Map.empty[Int, Future[Int]]
    val lock = new Object

    key ⇒
      lock.synchronized {
        cache.get(key) match {
          // cache hit
          case Some(result) ⇒ result
          // cache miss
          case None ⇒
            val result = Promise[Int]()
            Future { // run `f` concurrently
              result.completeWith(f(key))
            }
            cache += key -> result.future
            result.future
        }
      }
  }
}

The cache constructor takes a function Int => Future[Int] and returns a function of the same type. The resulting function runs the underlying function only once per key and puts the resulting Future into a Map. Subsequent calls with the same key instantly return the cached Future.

A more practical implementation would use something more scalable than a simple synchronized block; spray-caching uses a ConcurrentLinkedHashMap, for instance. It would also need to enforce some limits on cache size, etc. This simple implementation, though, should serve its demonstration purpose here.
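To see the thundering-herd protection in action, here's a small sketch that fires many identical requests at a cold cache and counts how often the underlying computation actually runs (the Cache is repeated so the snippet is self-contained; `slow`, `runs`, and the key 21 are made up for the demo):

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// the Cache from above, repeated here so the sketch is self-contained
object Cache {
  def apply(f: Int => Future[Int])(implicit ec: ExecutionContext): Int => Future[Int] = {
    var cache = Map.empty[Int, Future[Int]]
    val lock = new Object

    key =>
      lock.synchronized {
        cache.get(key) match {
          case Some(result) => result // cache hit
          case None => // cache miss: start the computation, cache its future
            val result = Promise[Int]()
            Future(result.completeWith(f(key)))
            cache += key -> result.future
            result.future
        }
      }
  }
}

// count how often the underlying computation actually runs
val runs = new AtomicInteger(0)
val slow: Int => Future[Int] = i => Future {
  runs.incrementAndGet()
  Thread.sleep(50) // stand-in for real work
  i * 2
}
val cached = Cache(slow)

// a "herd" of identical requests arrives at a cold cache...
val herd = Future.sequence(List.fill(100)(cached(21)))
println(Await.result(herd, 5.seconds).distinct) // List(42)
// ...but the computation ran only once
println(runs.get) // 1
```

All 100 callers get the same Future; 99 of them subscribe to a computation that is already in flight instead of spawning their own.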

Let's put it into use:

object CyclicFutureCache {  
  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global

  val f: Int ⇒ Future[Int] = Cache(calculateF)
  // the underlying heavy computation
  def calculateF(i: Int): Future[Int] =
    i match {
      case 42 ⇒ f(i - 19)
      case 23 ⇒ f(i * 2 - 4)
      case x  ⇒ Future.successful(123)
    }

  def main(args: Array[String]): Unit = {
    val sum: Future[Int] =
      for {
        aRes ← f(23)
        bRes ← f(42)
      } yield aRes + bRes

    println("Waiting for completion....")

    sum.onComplete { res ⇒
      println(s"Program was completed with $res")
      sys.exit()
    }

    // sleep while waiting for completion
    while (true) Thread.sleep(1)
  }
}

We have an expensive recursive function calculateF (similar to the one shown in part 1). For efficiency reasons we want to cache results of this function, so we put the Cache around it, which gives us a new cached function f. This is also what we call for the recursive invocations. Again, we introduced a cyclic dependency bug similar to the ones seen in the previous examples.

Here's the output of the program:

Waiting for completion....  

Once again, the futures are never completed and the program is stuck. Let's see what jstack says:

> jps
7918 CyclicFutures

> jstack 7918

"ForkJoinPool-1-worker-13" #10 daemon prio=5 os_prio=0 tid=0x00007f60642a3000 nid=0x1ee9 waiting on condition [0x00007f601b2d3000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x000000076db30338> (a scala.concurrent.forkjoin.ForkJoinPool)
    at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

[...]

"main" #1 prio=5 os_prio=0 tid=0x00007f606400a000 nid=0x1ebb waiting on condition [0x00007f606a856000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at example.CyclicFutureCache$.main(4cyclicFutures.scala:54)
    at example.CyclicFutureCache.main(4cyclicFutures.scala)

[...]

This time we are completely left in the dark. Two threads are just waiting and there's no sign on the stack that a computation is not yet complete. The program is stuck but we cannot see why.

What's going on? Remember that a stack trace is a representation of the future of a thread or, in other words, the stack represents the continuation of a thread. The point of an asynchronous API like the Future-based one in Scala, however, is to detach the continuation, the "what to do after a computation is complete", from the thread which starts or runs the original computation.

Therefore, no thread or its stack contains any trace of the continuation. Instead, the continuation is explicitly managed on the heap. When you use onComplete (or any of the other Future combinators) you pass a function value to be invoked when the original computation is complete. This function value is put into a list of callbacks kept by the Promise that implements the Future.
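This heap-managed continuation is easy to observe directly. In the sketch below (a made-up snippet, not from the article's repo), the callback sits invisibly in the promise's callback list until `success` is called; a second promise is used only to observe that it ran:

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val p = Promise[Int]()

// the continuation is not on any thread's stack; it lives in the
// promise's callback list on the heap until the promise is completed
val notified = Promise[String]()
p.future.onComplete(res => notified.success(s"continuation ran with: $res"))

p.success(42) // completing the promise invokes the stored callback
println(Await.result(notified.future, 1.second)) // continuation ran with: Success(42)
```

A thread dump taken between `onComplete` and `success` would show nothing about this pending work, which is exactly the visibility problem described above.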

In our case, the cache created two Promises, one for the result of f(23) and one for the result of f(42). It then called promise.completeWith(...) which makes the promise subscribe to the Future given as an argument. The call sequence was roughly like this:

val f23Promise = Promise[Int]()  
val f42Promise = Promise[Int]()  
f23Promise.completeWith(f42Promise.future)  
f42Promise.completeWith(f23Promise.future)  

So, both Promises subscribed to each other, but neither could ever make progress.
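A runnable version of that call sequence confirms that neither future ever completes, and does so without leaving any trace on a stack:

```scala
import scala.concurrent.Promise

val f23Promise = Promise[Int]()
val f42Promise = Promise[Int]()

// each promise subscribes to the other's future -- a subscription cycle
f23Promise.completeWith(f42Promise.future)
f42Promise.completeWith(f23Promise.future)

Thread.sleep(200) // plenty of time; nothing will ever happen
println(f23Promise.future.isCompleted) // false
println(f42Promise.future.isCompleted) // false
```

Unlike the lazy val deadlock, no thread is blocked here, so the JVM's deadlock detector has nothing to report.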

In summary, when working with Futures, we neither keep track of the history of a computation nor can we easily find out what its future is, because we cannot inspect the list of still-waiting callbacks. Also, a Future consumer usually doesn't track what it is currently waiting on; instead, it just registers a callback with the producing Future. So there's a reference from the producer to the consumer but not the other way round.

I don't have any ready-made solutions at this point. What is needed is better support for tracing Future call chains and execution. At some point I started the better-future-exceptions experiment to see what it would take to improve the Future tracing infrastructure. akka-stream faced similar problems, and @drewhk implemented a graph output showing who waits for what in the streaming scenario. It would be cool if we had something like this for Futures as well. If you have ideas, please share them in the comments!

You can get the examples from this project on github. They also include an example I have not discussed here, which replicates a lazy calculation with actors, but there is not much insight gained compared to the Future example.

Thanks for reading.