MPS issue job001548

Title: MPS assertion in trace.c: RefSetSub(ss.unfixedSummary, SegSummary(seg))
Status: closed
Priority: essential
Assigned user: Richard Kistruck
Organization: Ravenbrook
Description: MPS assertion in trace.c: RefSetSub(ss.unfixedSummary, SegSummary(seg))

Repeatable: not always. (Originally it was not repeatable; it now appears
to be repeatable on MacTel: see comments dated 2007-03-09.)
Recurrence: 5% or more of mpsicv runs with random seeds
Platforms: w3i3mv, xcppgc
Varieties: CI, HI, II
Age bounds: 1.106.0:Yes

Related Jobs:
  job001543: mpsicv on Mac OS X does not use reg&stack scanner.

RHSK 2006-12-13
__w3i3mv__
Assert fires:
        MPS ASSERTION FAILURE: RefSetSub(ss.unfixedSummary, SegSummary(seg))
        .\trace.c
        1111
in mpsicv on w3i3mv platform:

Using master/...@161206, CI (cool) build.

Also in version/1.106/...@155175 == release/1.106.0, CI (cool).

Also in master/...@161213, both CI and HI builds (hot: which now means with AVERs but without DEBUG = DIAGNOSTICS = STATISTICs = METERs -- see job001545 & job001546).

Repeatable: No. Re-running with the same random seed doesn't make it fail again. Example seeds that have each failed once: 2715, 7909, 23634, 24772, 18186.

Recurrence: Yes (<1 hour). Now seen about 10 times, from w3i3mv CI or HI mpsicv. Looping tests produce a failure fairly often. Typical successful iterations-before-failure: 0, 42, 5, 14.

RHSK 2006-12-14
__xcppgc__
mpsicv on xcppgc (<= 1.107.0) is a bit different: the reg&stack scanner is not used. See job001543: mpsicv on Mac OS X does not use reg&stack scanner.

version/1.107/...@161223 == release/1.107.0:
xcppgc\ci\mpsicv gave 0 failures in 7 runs
xcppgc\hi\mpsicv gave 2 failures in 7 runs; seeds: 27271, 27283.
xcppgc\ti\mpsicv gave 0 failures in 7 runs
xcppgc\ii\mpsicv gave 3 failures in 7 runs; seeds: 28869, 28896, 28923
and with fixed seed 23954, 0 failures in 12 runs (3 each ci, hi, ti, ii)

HI and II have AVERs and checking at CheckLevelMINIMAL, but no DIAGNOSTICS = STATISTICs = METERs.
Analysis:
RHSK 2006-12-13
This assert reports that the old SegSummary(seg) was incomplete. Imagine that the segment is a box containing some refs; the box has a lid (the MPS Shield) so we should know when any new ref is put in the box; we should keep the label on the box (the summary) correct, except at certain defined times (eg. while seg-scan is in progress?).

We have just (totally or partially) scanned the seg, accumulating the summary of all refs-before-fix ("unfixed") into ss.unfixedSummary. SegSummary should have had these already.

(We are about to update SegSummary with the summary of refs-after-fix, at least if it was a total scan).

(Hmmm... the assert only makes sense if unfixedSummary was inited to Empty at the start of scanning *this* segment. If not, it might have picked up some zone bits from other (previous) segments, in which case it's not surprising that it's not a subset of SegSummary() for *this* seg.)
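
To make the point about initialisation concrete, here is a minimal sketch in C. It is not the MPS code: it assumes a RefSet can be modelled as a bitmask with one bit per zone, and the names ref_set_t, ref_set_of, ref_set_sub, summarise, ZONE_SHIFT and ZONE_COUNT are all hypothetical. It shows the subset test that the assert performs, and why a summary accumulated across more than one segment need not be a subset of the last segment's own summary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an MPS RefSet: one bit per zone. */
typedef uint32_t ref_set_t;
#define REF_SET_EMPTY ((ref_set_t)0)

/* Assumed zone geometry, chosen only for this illustration. */
enum { ZONE_SHIFT = 20, ZONE_COUNT = 32 };

/* The single-zone set containing the zone of one reference. */
static ref_set_t ref_set_of(uintptr_t ref)
{
    return (ref_set_t)1 << ((ref >> ZONE_SHIFT) % ZONE_COUNT);
}

/* Subset test: true when every zone bit in sub is also set in super. */
static int ref_set_sub(ref_set_t sub, ref_set_t super)
{
    return (sub & ~super) == 0;
}

/* Accumulate the zones of n refs on top of an initial summary. */
static ref_set_t summarise(ref_set_t init, const uintptr_t *refs, int n)
{
    ref_set_t s = init;
    for (int i = 0; i < n; ++i)
        s |= ref_set_of(refs[i]);
    return s;
}
```

In this model, calling summarise with REF_SET_EMPTY over a segment's refs always yields a subset of that segment's true summary, whereas starting from a summary left over from a previous segment can add zone bits that the current segment never contained, which would make ref_set_sub report failure.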

RHSK 2006-12-18
See detailed analysis: http://www.ravenbrook.com/project/mps/doc/2006-12-18/job001548-summary/
See development branch: http://info.ravenbrook.com/project/mps/branch/2006-12-15/unfixed-summary/

DRJ 2007-03-01
Can't reproduce on master/...@161872 using OS X on Intel.
Two loops of:
while : ; do ./xci3gc/hi/mpsicv || break; done
gave up on first loop after 29 successful runs; second one gave
up at 15 successful runs.
Note: There is no stack scanner on this configuration (yet).

DRJ 2007-03-09

After implementing a reg scanner for Intel Darwin (change 161877) and then
a proper protection module (change 161902) I can now reproduce this on
my Intel MacBook (Intel Darwin).

The first time I tried it, the loop:
while : ; do ./xci3gc/hi/mpsicv || break; done
stops with:
MPS ASSERTION FAILURE: RefSetSub(ss.unfixedSummary, SegSummary(seg))
trace.c
1111
Abort trap
After 76 runs (seed was 3670). Hmm, maybe I should've just left it to run
longer earlier.

Moreover, right now on the master sources change level 161907 on
my MacBook the 3670 seed makes the failure repeatable.

./xci3gc/hi/mpsicv 3670

always fails.

So does: ./xci3gc/hi/mpsicv 10259

DRJ 2007-03-09

Also fails on lii4gc. But not always repeatably.

Sometimes with seeds: 14742, 14884, 15025

It appears to be very easy to make it fail, though: usually only a few
different trials are needed before one fails. And some seeds, like 15025,
appear to fail > 50% of the time.

Yay! gdb works on this platform so I can catch an example failure in
the debugger. Yumm.


RHSK 2007-03-19
Failure appears to be during emergency tracing (xcppgc/hi/mpsicv).
See http://info.ravenbrook.com/project/mps...12-15/unfixed-summary/code/a1oEmerg.txt
See http://info.ravenbrook.com/project/mps...12-15/unfixed-summary/code/a1pEmerg.txt


RHSK 2007-04-18
If a pool causes MPS_FIX1() to be applied to the same ref *twice* in the same scan, then ss.unfixedSummary becomes 'polluted' with new (fixed) refs and therefore not an accurate statement of the seg's summary before this scan started.

The only problem this causes is to trip the .verify.segsummary AVER in trace.c.

This may happen when a pool class cannot remember whether it has already fixed the ref. In the case of the AMC pool class, this happens by design when scanning a boarded segment under emergency tracing, and a new mark is made on the segment:
http://info.ravenbrook.com/mail/2007/03/24/11-05-17/0.txt
http://info.ravenbrook.com/mail/2007/03/26/17-05-58/0.txt
(For interest, note that in a fwd-buffered mobile seg, MPS_FIX1 may get applied to the same ref more than once, but not in the same scan.)

The fix is to detect the rare circumstances where this re-fix in the same scan may have occurred. As far as we know at the moment, this is only the AMC boarded seg under ET case. In these circumstances, deal with the polluted unfixedSummary by moving it into fixedSummary, and clearing unfixedSummary.
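
The defect and the correction can be illustrated with a much-simplified model in C. This is a sketch under stated assumptions, not the code from the change: scan_state_t, scan_pass and reset_after_loop are hypothetical names, zones are plain small integers, and "fixing" a ref is modelled as moving it from one zone to another. A second pass over the same slots in one scan sees the already-fixed ref, so the unfixed summary picks up a zone that was never in the segment's old summary.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ref_set_t; /* hypothetical: one bit per zone */

/* Hypothetical, much-simplified scan state. */
typedef struct {
    ref_set_t unfixed_summary; /* zones of refs as seen before fixing */
    ref_set_t fixed_summary;   /* zones of refs as left after fixing */
} scan_state_t;

/* Subset test: true when every zone bit in sub is also set in super. */
static int ref_set_sub(ref_set_t sub, ref_set_t super)
{
    return (sub & ~super) == 0;
}

/* One pass over a segment's ref slots. zones[i] is the zone that slot i
 * currently points into. Fixing a ref into from_zone is modelled as the
 * referent moving, so the slot is updated to point into to_zone. */
static void scan_pass(scan_state_t *ss, int *zones, int n,
                      int from_zone, int to_zone)
{
    for (int i = 0; i < n; ++i) {
        ss->unfixed_summary |= (ref_set_t)1 << zones[i];
        if (zones[i] == from_zone)
            zones[i] = to_zone;
        ss->fixed_summary |= (ref_set_t)1 << zones[i];
    }
}

/* Correction in the spirit of the text above: once a scan has looped,
 * the unfixed summary can no longer be trusted, so fold it into the
 * fixed summary and clear it. */
static void reset_after_loop(scan_state_t *ss)
{
    ss->fixed_summary |= ss->unfixed_summary;
    ss->unfixed_summary = 0;
}
```

In this model the subset check holds after one pass, fails after a second pass over the same slots, and holds again once reset_after_loop has been applied. The resulting fixed summary is conservative: it may include zones the segment no longer refers to, which is safe for a summary but less precise.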

For an "alternative correction", and other checks we should do, see:
http://info.ravenbrook.com/mail/2007/03/26/17-28-37/0.txt
How found: unknown
Evidence: master/...@161206, w3i3mv, CI, mpsicv
http://info.ravenbrook.com/mail/2006/12/12/12-07-19/0.txt
http://info.ravenbrook.com/mail/2006/12/13/10-55-06/0.txt
Observed in: 1.106.2
Created by: Richard Kistruck
Created on: 2006-12-13 16:48:00
Last modified by: Gareth Rees
Last modified on: 2014-04-12 22:05:52
History:
2006-12-13 RHSK Created; made critical.
2006-12-14 RHSK Also on Mac OS X PowerPC.
2006-12-14 RHSK Summarise occurrence.
2006-12-18 RHSK Analysis: link to doc for detailed analysis.
2006-12-18 RHSK Analysis: link to development branch unfixed-summary.
2007-03-01 DRJ Can't reproduce on MacTel.
2007-03-09 DRJ Reproduced on MacTel. And Repeatable.
2007-03-09 DRJ Reproduced on Linux
2007-03-19 RHSK Failure appears to be during emergency tracing. (fix link)
2007-04-18 RHSK Solved on 2007-03-24. Describe defect and fix.

Fixes

Change  Effect  Date                 User              Description
162001  closed  2007-03-25 17:05:50  Richard Kistruck  MPS br/unfixed-summary: amcScanNailed: Show how summaries change
        when amcScanNailed loops. Highlight cases that would (previously)
        have failed .verify.segsummary. Count the loops. Show whether it
        wasTotal.
        AMCSegSketch: correct it to show stalo and neo the right way round.
162000  closed  2007-03-25 15:59:05  Richard Kistruck  MPS br/unfixed-summary: if amcScanNailed looped, ss.unfixedSummary is
        not accurate, so move all of the ScanStateSummary into ss.fixedSummary,
        so that <impl/trace/#verify.segsummary> does not erroneously fail.
        See also log file a2nNailedLoopReset.txt.