Bit rot is real!

Published on 2013-11-16
Tagged: debugging

Time for a story!

Last week at work, I was tracking down a very odd bug in our fork of V8. When running the Octane benchmark, V8 would crash about once every four runs. The crash always occurred because of a bad pointer, usually encountered by the garbage collector in memory that was allocated but not actively being used. The weird thing was that this bad pointer would show up in any part of memory: it could be in any kind of object at any position. Completely inconsistent. In some cases, I could figure out from the context what the pointer was supposed to be pointing to. In those cases, sometimes the pointer was mostly right and only one or two bits were wrong. Mysterious.
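For the cases where I could guess the intended target, checking how close the bad pointer was comes down to XORing the two values and looking at which bits are set. Here is a minimal sketch in Python; the function name and the two addresses are made up for illustration and are not values from the actual crash.

```python
def flipped_bits(observed, expected):
    """Return the bit positions (0 = least significant) where the two values differ."""
    diff = observed ^ expected
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# Hypothetical addresses, for illustration only.
expected = 0x4A3F1C20
observed = 0x4A3F1C60
print(flipped_bits(observed, expected))  # prints [6]
```

A short list like this points to a flipped bit or two. A pointer that was overwritten by unrelated data would typically differ in many positions.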

I tried bisecting and disabling various features to narrow down the root cause. Memory bugs are hard to isolate if you don't know where to look. However, the crash was infrequent enough that this testing was very time-consuming, and I couldn't really be sure whether disabling a feature actually prevented the crash.

A breakthrough finally came. I was running Octane for ten runs, and V8 reported a syntax error in one of the subtests on the fourth run and every run thereafter. How could that possibly happen? I downloaded the script for that subtest from the device and compared it with the script I had originally uploaded. Indeed there was an error: a semicolon had changed to a 'y' in the middle of the script! I had not modified the script on the device, and the file's timestamp was unchanged.
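In ASCII, ';' is 0x3B and 'y' is 0x79, so the two bytes differ in exactly two bits (XOR mask 0x42). That is consistent with the bad pointers that were off by one or two bits. The comparison I did can be sketched roughly like this in Python; the helper name and the sample script contents are invented for illustration, not the real Octane file.

```python
def diff_bytes(original, suspect):
    """Yield (offset, original_byte, suspect_byte, xor_mask) for each mismatching byte.

    Only the common prefix length is compared; a size difference is not reported.
    """
    for offset, (a, b) in enumerate(zip(original, suspect)):
        if a != b:
            yield offset, a, b, a ^ b


# Made-up stand-ins for the uploaded and downloaded copies of the script.
uploaded = b"var x = 1;\n"
downloaded = b"var x = 1y\n"

for offset, a, b, mask in diff_bytes(uploaded, downloaded):
    flipped = bin(mask).count("1")
    print(f"offset {offset}: {chr(a)!r} -> {chr(b)!r}, "
          f"xor mask {mask:#04x}, {flipped} bit(s) flipped")
```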

I immediately unplugged the device and rebooted it. When I downloaded the script again, the semicolon was back, and the syntax error was gone!

The only explanation I could come up with is that something was wrong with the device's physical RAM, which caused random bytes to occasionally get corrupted. The syntax error occurred because this corruption mutated the byte holding the semicolon in the kernel's file cache. After rebooting, the script was loaded into a different area of RAM where the problem didn't occur, so it appeared correct. This corruption must have also been the cause of the random bad pointers.
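One way to watch for this kind of silent change is to fingerprint a file between runs and flag the case where the contents differ but the modification time does not. The following is only a sketch under my own assumptions: the helper names are hypothetical, and it assumes repeated reads are served from the same cached copy, which the operating system does not guarantee.

```python
import hashlib
import os


def file_fingerprint(path):
    """Return (sha256 hex digest of the contents, modification time in nanoseconds)."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest, os.stat(path).st_mtime_ns


def changed_without_write(before, after):
    """True if the contents differ even though the timestamp says nobody wrote the file."""
    return before[0] != after[0] and before[1] == after[1]
```

Taking a fingerprint before the first benchmark run and another after each subsequent run would have flagged the mutated script without waiting for a syntax error. It cannot say where the corruption happened, only that the bytes read back are no longer the bytes that were written.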

I tested this explanation by borrowing another test device and running Octane with the same V8 binary. No problems after 40 runs. Unfortunately, I don't know of any equivalent of memtest86+ for Android, so I couldn't confirm the diagnosis directly. I sure won't use that device for stability testing anymore though.

I don't think I've encountered a bug before where I could confidently blame faulty hardware as the root cause. It makes me wonder how many of the weird, unreproducible crash reports we see are due to problems like this.