MPS issue job003359

TitleAmbiguous interior pointers do not keep objects alive
Assigned userRichard Brooksby
DescriptionOn Mac OS X 10.8.2 (x86-64), in the example Scheme interpreter from the MPS kit [1], the following program crashes:

MPS Toy Scheme Example
7944, 0> (let ((a (gc))) 0)
Segmentation fault: 11
AnalysisSome notes on the envelope of the bug:
1. The crash only occurs when optimization is turned on (-O).
2. Adding -fno-strict-aliasing makes no difference.
3. Adding -fno-inline-functions makes the crash go away.
4. Adding "assert(body);" before line 1287 makes the crash go away.

GDB shows that the crash is occurring in eval_body which has been inlined into entry_let:

Program received signal SIGSEGV, Segmentation fault.
entry_let (env=0x1002fb130, op_env=0x1002fb148, operator=0x1002fac38, operands=<value temporarily unavailable, due to optimizations>) at scheme.c:1287
1287 if (TYPE(body) != TYPE_PAIR)
(gdb) bt
#0 entry_let (env=0x1002fb130, op_env=0x1002fb148, operator=0x1002fac38, operands=<value temporarily unavailable, due to optimizations>) at scheme.c:1287
#1 0x0000000100002e6e in eval (env=<value temporarily unavailable, due to optimizations>, op_env=<value temporarily unavailable, due to optimizations>, exp=<value temporarily unavailable, due to optimizations>) at scheme.c:1154
#2 0x0000000100001ebc in start (p=<value temporarily unavailable, due to optimizations>, s=<value temporarily unavailable, due to optimizations>) at scheme.c:2893
#3 0x000000010000d739 in ProtTramp (resultReturn=0x7fff5fbffa00, f=<value temporarily unavailable, due to optimizations>, p=0x1002fe138, s=4298084400) at protix.c:132
#4 0x00000001000018cc in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at scheme.c:2995
(gdb) list
1282 */
1284 static obj_t eval_body(obj_t env, obj_t op_env, obj_t operator, obj_t body)
1285 {
1286 for (;;) {
1287 if (TYPE(body) != TYPE_PAIR)
1288 error("%s: illegal expression list", operator->;
1289 if (CDR(body) == obj_empty)
1290 return eval_tail(env, op_env, CAR(body));
1291 (void)eval(env, op_env, CAR(body));

This code has been inlined into entry_let line 1512, where the surrounding code looks like this:

1492 static obj_t entry_let(obj_t env, obj_t op_env, obj_t operator, obj_t operands)
1493 {
1494 obj_t inner_env, bindings;
1495 unless(TYPE(operands) == TYPE_PAIR &&
1496 TYPE(CDR(operands)) == TYPE_PAIR)
1497 error("%s: illegal syntax", operator->;
1498 inner_env = make_pair(obj_empty, env); /* TODO: common with interpret */
1499 bindings = CAR(operands);
1500 while(TYPE(bindings) == TYPE_PAIR) {
1501 obj_t binding = CAR(bindings);
1502 unless(TYPE(binding) == TYPE_PAIR &&
1503 TYPE(CAR(binding)) == TYPE_SYMBOL &&
1504 TYPE(CDR(binding)) == TYPE_PAIR &&
1505 CDDR(binding) == obj_empty)
1506 error("%s: illegal binding", operator->;
1507 define(inner_env, CAR(binding), eval(env, op_env, CADR(binding)));
1508 bindings = CDR(bindings);
1509 }
1510 if(bindings != obj_empty)
1511 error("%s: illegal bindings list", operator->;
1512 return eval_body(inner_env, op_env, operator, CDR(operands));
1513 }

DL and I stared at the disassembly and we believe that the following sequence of events occurs:

1. The value 'operands' is initially in rbx.
2. At around line 1500 the program executes the instruction "lea 0x10(%rbx),%r12" putting the address of CDR(operands) in r12.
3. The compiler knew that 'operands' will no longer be needed, so it issued an instruction to overwrite rbx (at around line 1503 with CAR(binding), as it happens).
4. Now r12 is the only thing keeping 'operands' alive, but this points to the interior of the pair object.
5. The garbage collection runs and finds that 'operands' is unreachable (because an ambiguous interior pointer does not keep an object in AMC alive) so it gets collected.

I added an interior pointer test to amcFixInPlace, using this function:

/* amcRefInterior -- test if 'ref' is an interior pointer in 'seg'. */

static int amcRefInterior(Pool pool, Seg seg, Ref ref)
  Addr p, limit;
  Format format;

  format = pool->format;
  p = AddrAdd(SegBase(seg), format->headerSize);
  if(ref < p) return 0;
  if(SegBuffer(seg) != NULL) {
    limit = BufferScanLimit(SegBuffer(seg));
  } else {
    limit = SegLimit(seg);
  limit = AddrAdd(limit, format->headerSize);
  while(p < limit) {
    if(p == ref) return 0;
    p = (*format->skip)(p);
    if(ref < p) return 1;
  return 0;

This showed that amcFixInPlace is in fact being asked to fix an interior pointer. (I note that RHSK identified the same problem in [3].)

My first attempt to fix it, in change 180370 [2], failed to take into account format auto-header. I ran the amcsshe test case and got this assertion:

MPS ASSERTION FAILURE: limit <= SegLimit(seg)

The problem is that I copied amcNailRangeIsUnmarked somewhat blindly from amcNailRangeIsMarked but in fact it's called in a different context. The latter takes base pointers and converted them to client pointers, but the former needs to take client pointers and convert them to base pointers before checking the nailboard (this also will cause nails in the header to be respected, which I think we want).

This change needs further work as it won't have the necessary performance:

1. Instead of calling out to BTIsResetRange, we should be inlining the nailboard bit-range check into amcScanNailedOnce and amcReclaimNailed. (By having a cursor passing over the nailboard in parallel to the pointer passing over the segment.)

2. Have better representation for nailboard. Do we really need to allocate a whole segment worth of nails just because of one ambiguous pointer? Perhaps some kind of sparse tree representation would be more efficient? (But see below.)

3. Special handling for large objects. If there's only one object in a segment, we only need one nail.

Also worth noting that Boehm has a "GC_reachable_here" macro that "Explicitly tell the collector that an object is reachable at a particular program point. This prevents the argument pointer from being optimized away, even it is otherwise no longer needed."


Just thinking about the scale:

* On a 32-bit platform, a page is 4096 bytes = 1024 words. There are 1024 bits in the nailboard, (32 words).

* On a 64-bit platform, a page is 4096 bytes = 512 words. There are 512 bits in the nailboard (8 words).

Running the Scheme interpreter, I found that nailed segments tended to have multiple nails: 31 nails were common (no doubt a consequence of the way the Scheme interpreter lays out its environment).

Given these numbers, I don't think it makes any sense to have any kind of multi-layer tree structure for the nailboard.
How foundautomated_test
Evidence[1] <>
[2] <>
[3] <>
Observed in1.110.0
Created byGareth Rees
Created on2012-11-05 10:57:01
Last modified byGareth Rees
Last modified on2014-04-07 14:23:19
History2012-11-05 GDR Created.
2013-03-12 GDR Nailboard representation analysis
2013-05-25 RB Downgraded to essential. This does not prevent us making useful releases.


Change Effect Date User Description
185285 closed 2014-04-07 14:23:18 Gareth Rees Merge branch/2014-01-15/nailboard into the master sources.
180941 open 2013-02-08 15:36:50 Gareth Rees Undo changes 180370 and 180390 (which supported interior pointers by checking all nails in an object), as these changes are not suitable for release. Work around job003359 for the example Scheme interpreter by setting the compilation option -fno-inline-functions.
180390 open 2012-11-07 17:17:05 Gareth Rees amcNailRangeIsUnmarked needs to take arguments as client pointers and look at the nails for the corresponding base pointers.
180370 open 2012-11-06 15:47:21 Gareth Rees Support ambiguous interior pointers in AMC by checking to see if any grain in an object is nailed (not just the first grain).
AMCSegDescribe shows which grains are nailed.