Notes on MPS job001548: assertion in trace.c: RefSetSub(ss.unfixedSummary, SegSummary(seg))

This document contains incomplete and informal notes concerning the investigation of MPS job001548: assertion in trace.c: RefSetSub(ss.unfixedSummary, SegSummary(seg)).

Not confidential. Readership: MPS developers.

Introduction

Imagine that the segment is a box containing some refs; the box has a lid (the MPS Shield), so we should know whenever a new ref is put in the box. We should keep the label on the box (the summary) correct, except at certain defined times (e.g. while a seg scan is in progress?).

Something must have gone wrong at one of these three steps:

  1. We look at what's in the box (or default to UNIV), put a label on the box, and put the lid on (write-protect).

  2. Over some time, we take the lid off and put it back on, some new refs get put into the box, and we try to keep the label correct. The writes can be from:

    1. mutator
    2. fix that updates a ref
    3. preserve-by-copy

  3. (during scan) we look at what was in the box, and check it against the label.

In a picture: DSC00687.JPG.

The final check is failing.

(For the rest of this document, stick to MPS terminology: 'label' = summary; 'box' = seg or zone usually; 'lid' = shield.)
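To make the three steps concrete, here is a minimal toy model in C. It is only a sketch: ToyBox, ToyRefSet and the toy_box_* functions are hypothetical names invented for illustration, not MPS code, and the one-bit-per-zone summary is an assumption of the toy. It shows how a single unmaintained write in step 2 makes the check in step 3 fail.

```c
#include <limits.h>
#include <stddef.h>

typedef unsigned long ToyRefSet;   /* toy summary: one bit per zone */

enum { TOY_BOX_CAP = 8 };          /* arbitrary capacity for the toy */

typedef struct ToyBox {
  unsigned zones[TOY_BOX_CAP];     /* zone number of each ref in the box */
  size_t n;                        /* number of refs currently in the box */
  ToyRefSet label;                 /* the summary written on the box */
} ToyBox;

/* Map a zone number to its bit in the toy summary. */
static ToyRefSet toy_box_zone_bit(unsigned zone)
{
  return 1UL << (zone % (sizeof(ToyRefSet) * CHAR_BIT));
}

/* Union of the zones of all refs actually in the box. */
static ToyRefSet toy_box_contents(const ToyBox *box)
{
  ToyRefSet rs = 0;
  size_t i;
  for (i = 0; i < box->n; ++i)
    rs |= toy_box_zone_bit(box->zones[i]);
  return rs;
}

/* Step 1: look at what's in the box and write the label. */
static void toy_box_label(ToyBox *box)
{
  box->label = toy_box_contents(box);
}

/* Step 2: put a new ref in the box.  If maintain is nonzero the label
 * is updated too; if zero, this models a write that slipped past the
 * lid.  Returns 1 on success, 0 if the box is full. */
static int toy_box_put(ToyBox *box, unsigned zone, int maintain)
{
  if (box->n >= TOY_BOX_CAP)
    return 0;
  box->zones[box->n++] = zone;
  if (maintain)
    box->label |= toy_box_zone_bit(zone);
  return 1;
}

/* Step 3: check the contents against the label.  Returns 1 if every
 * ref in the box is covered by the label, else 0. */
static int toy_box_check(const ToyBox *box)
{
  return (toy_box_contents(box) & ~box->label) == 0;
}
```

In this toy, labelling an empty box, putting a ref with maintain set, and checking succeeds; a later put with maintain clear makes toy_box_check() return 0, which corresponds to the failing final check.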

Questions

.q.valid
There ARE certain times when the summary is allowed to be wrong. What times are those? (Don't know).
.q.gen
Is the code that generated the summary correct?
.q.check
Is the code that finally checks the summary correct? At the time we check it, is the summary supposed to be valid?
.q.shield
Is the shield code keeping the seg write-protected? For any time the shield is down we need guarantees about what refs might get written into the seg.
.q.maintain
Is the code that unions a newly-added ref into the current summary correct?

Tricky situations:

.sit.preserve-into
seg we are preserve-by-copying into. If we preserve into a seg we are currently scanning, the newly-preserved object must be scanned in *this* scan (there is no mechanism for putting the seg back on the grey-list).
.sit.multipage
seg that spans several OS pages
.sit.zone-boundary
seg that straddles a zone boundary
.sit.nailed
nailed seg

Approach

Special circumstances?

The relevant code hasn't changed much in a while, and the failures aren't very common. Both of these suggest that the code fails only in a fairly unusual combination of circumstances. So it's worth looking at data at the time of failure, to see if some circumstances (e.g. nailed seg) are always present.

It's easy to make this programmatic: hack in if (!assert-cond) { Describe(); printf data; etc } before the assert, and run to crash several times.

See out-segdesc01.txt for a sample.
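The shape of that hack, as a self-contained sketch rather than the code actually used: ToyRefSet, toy_refset_sub() and toy_report_summary_violation() are hypothetical stand-ins, and the bitset representation of a ref set is an assumption of the toy. The idea is to evaluate the assert condition first, dump whatever data might reveal a pattern, and only then let the real assert fire.

```c
#include <stdio.h>

typedef unsigned long ToyRefSet;   /* toy ref set: one bit per zone */

/* True iff every zone in a is also in b (a is a subset of b). */
static int toy_refset_sub(ToyRefSet a, ToyRefSet b)
{
  return (a & ~b) == 0;
}

/* Hypothetical diagnostic to run just before the assert.  Prints the
 * two summaries, the zones that are in unfixed but missing from the
 * seg summary, and one "special circumstance" flag (nailed).
 * Returns 1 if a violation was reported, 0 if the condition holds. */
static int toy_report_summary_violation(FILE *out, ToyRefSet unfixed,
                                        ToyRefSet segSummary, int nailed)
{
  if (toy_refset_sub(unfixed, segSummary))
    return 0;
  fprintf(out,
          "summary violation: unfixed=%#lx seg=%#lx extra=%#lx nailed=%d\n",
          unfixed, segSummary, unfixed & ~segSummary, nailed);
  return 1;
}
```

Collecting this output over several crashing runs makes it easy to see whether a flag such as nailed is set every time.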

Also: could output telemetry. But that's not human friendly, and it would take me ages to wade through it :-(

Write a General-purpose Check Function

On the other hand, we don't have a general purpose "CheckAllSummariesNow()" function. Writing one would help here, and also catch other present or future defects.

Use the Source...

Thinking about the issues, and learning the source, is really useful for me. Not necessarily fastest, but loads of genuine extra benefit. See "How the code is supposed to work", below.

What is mpsicv doing anyway?

mpsicv is an internal test, that can go inside mps.h. Perhaps it's just doing something illegal? Better have a look inside. And add lots of printfs as mpsicv goes along.

mpsicv successfully completes its "for(200000 objects)" loop, with the 30-or-so collections that print out "Collection %u, %lu objects".

Failure happens when mpsicv then calls arena_commit_test(), which allocates memory until it hits the commit limit, forcing full collections, which sometimes trigger the assert. See a1f and a1g1stFull.txt (the sixth ASSERT in a1g... shows nPolls is not always 1.000).

A general purpose CheckAllSummariesNow() function

Even though I don't know all the invariants, or all the times when the seg summary is valid, I can still write a CheckThisSummary() function, and run it at various known-good times, such as ArenaEnter/Leave.

How hard can it be? Should I use pool->scan or pool->walk? Scan should only see grey things. Walk should only see black things. Hmmm, in AMCWalk:

"/* NB, segments containing a mix of colours (i.e., nailed segs) are not handled properly: No objects are walked @@@@ */"

Using scan it would be:

  ScanStateInit()
  replace ss->fix
  ShieldExpose()
  PoolScan()
  ShieldCover()

Also see ArenaFormattedObjectsWalk() [walk.c]
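The "replace ss->fix" idea can be sketched in a toy model. Everything here is hypothetical (ToySeg, toy_scan(), toy_accumulate_fix(), the 1 MiB zone size) and none of it is the MPS interface; the real version would also need the ShieldExpose()/ShieldCover() bracketing listed above, which the toy omits. The point is only that the checking fix never changes a ref: it just accumulates the zones it sees, and the result is compared against the recorded summary.

```c
#include <limits.h>
#include <stddef.h>

typedef unsigned long ToyRefSet;          /* toy summary: one bit per zone */

#define TOY_ZONE_SHIFT 20u                /* assumed 1 MiB zones, illustrative */

/* Toy fix method: is handed each ref slot found by the scan. */
typedef void (*ToyFixFn)(void *closure, unsigned long *refIO);

typedef struct ToySeg {
  unsigned long *refs;                    /* the refs stored in the seg */
  size_t nrefs;
  ToyRefSet summary;                      /* the recorded summary */
} ToySeg;

/* Map a ref (an address) to the bit for its zone. */
static ToyRefSet toy_addr_zone_bit(unsigned long ref)
{
  return 1UL << ((ref >> TOY_ZONE_SHIFT) % (sizeof(ToyRefSet) * CHAR_BIT));
}

/* Toy scan: present every ref slot in the seg to the fix method. */
static void toy_scan(ToySeg *seg, ToyFixFn fix, void *closure)
{
  size_t i;
  for (i = 0; i < seg->nrefs; ++i)
    fix(closure, &seg->refs[i]);
}

/* Replacement fix: leaves the ref alone and unions its zone into the
 * ToyRefSet that closure points at. */
static void toy_accumulate_fix(void *closure, unsigned long *refIO)
{
  ToyRefSet *seen = closure;
  *seen |= toy_addr_zone_bit(*refIO);
}

/* Returns 1 if every ref in the seg is covered by seg->summary. */
static int toy_check_this_summary(ToySeg *seg)
{
  ToyRefSet seen = 0;
  toy_scan(seg, toy_accumulate_fix, &seen);
  return (seen & ~seg->summary) == 0;
}
```

Because the replacement fix has no side effects on the seg, a check of this shape could in principle be called at any point where the summary is believed to be valid.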

How the code is supposed to work

Here are some notes on the parts of code I have studied while investigating the defect.

Partial scans

One tricky issue is partial scans of a segment: a seg may be part grey (must scan) and part white (should not scan).

I have worked out in my head how this ought to work. See http://info.ravenbrook.com/mail/2006/12/15/11-42-40/0.txt "keeping summaries during partial scans".

I wrote an abstract walk-through of a trace: example-abstract-trace.txt. Some further notes follow:

When a collection trace ends (and we reclaim all white objects) we can replace the old summary with the summary of black-for-this-trace objects. Arbitrarily calling this trace "1" (one), I call this summary "t1b".

What do we encounter during scan? We find *all* refs in all *grey* objects (and, optionally, in black objects too, though that's a waste).

We encounter five types of ref:

  1. obviously non-white: refs that aren't in the white zoneset;
  2. non-white (but in a zone that has some white objs);
  3. white becomes grey, but ref unchanged because object is preserved in place;
  4. old white that needs replacement (broken heart, weak);
  5. new grey replacement for old white (snapped-out, or splatted).

unfixedSummary is the accumulated summary of 1, 2, 3, and 4.

t1b is the accumulated summary of 1, 2, 3, and 5.
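The two definitions can be written down as a small C sketch. The names (ToyRef, ToyRefKind, toy_summaries()) are hypothetical, and classifying each ref by an explicit kind tag is a simplification made for illustration: real scan and fix code has no such tag. The kinds are numbered to match the list above.

```c
#include <stddef.h>

typedef unsigned long ToyRefSet;   /* toy summary: one bit per zone */

/* The five types of ref, numbered as in the list above. */
enum ToyRefKind {
  TOY_REF_NONWHITE_ZONE = 1,       /* not in the white zoneset */
  TOY_REF_NONWHITE = 2,            /* non-white, in a zone with white objs */
  TOY_REF_PRESERVED = 3,           /* white becomes grey, ref unchanged */
  TOY_REF_OLD_WHITE = 4,           /* old white that needs replacement */
  TOY_REF_REPLACEMENT = 5          /* new grey replacement for old white */
};

typedef struct ToyRef {
  enum ToyRefKind kind;
  unsigned zone;                   /* assumed less than the width of ToyRefSet */
} ToyRef;

/* Accumulate both summaries over the refs encountered during a scan:
 * unfixed takes kinds 1, 2, 3 and 4; t1b takes kinds 1, 2, 3 and 5. */
static void toy_summaries(const ToyRef *refs, size_t n,
                          ToyRefSet *unfixedOut, ToyRefSet *t1bOut)
{
  ToyRefSet unfixed = 0, t1b = 0;
  size_t i;
  for (i = 0; i < n; ++i) {
    ToyRefSet bit = 1UL << refs[i].zone;
    if (refs[i].kind != TOY_REF_REPLACEMENT)
      unfixed |= bit;
    if (refs[i].kind != TOY_REF_OLD_WHITE)
      t1b |= bit;
  }
  *unfixedOut = unfixed;
  *t1bOut = t1b;
}
```

With one ref of each kind in zones 0 to 4 respectively, unfixed covers zones 0, 1, 2, 3 and t1b covers zones 0, 1, 2, 4: the two summaries differ exactly where an old white ref is swapped for its replacement.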

What does the current scan and fix code actually do?

Shield

See new notes at design.mps.shield.

B. Document History

  2006-12-18  RHSK  Created.
  2006-12-18  RHSK  Approaches.  How current code works.
  2006-12-18  RHSK  Link out-segdesc01.txt.  What's mpsicv doing?
  2006-12-21  RHSK  Link design/shield
  2007-01-04  RHSK  Three steps to wrong summary: link to picture.
  2007-01-04  RHSK  Fails in arena_commit_test.