MPS issue job001989

Title	MPS _gc_start messages may cause assert or infinite loop
Status	closed
Priority	essential
Assigned user	Richard Kistruck
Organization	Ravenbrook
Description	MPS _gc_start messages may cause assert or infinite loop Related jobs: job001570 "MPS _gc_start messages may change while client reads them, or be silently skipped" If the client enables _gc_start messages with: mps_message_type_enable(arena, mps_message_type_gc_start()); and a collection starts before the client has collected (with mps_message_get) the _gc_start message from the preceding collection, then the message MPS queue becomes corrupted. This causes an assert, or (in varieties without asserts) prevents the message from being removed from the queue. This unremovable message is likely to cause the client to loop forever if it attempts to Get the message, and will cause such a loop if the client calls mps_arena_destroy. A typical assert is: MPS ASSERTION FAILURE: !RingIsSingle(_old) message.c 164 Mitigation: clients that do not enable _gc_start messages are unaffected; a client that enables them and then frequently checks for and collects them can apparently prevent the defect occurring.
Analysis	RHSK 2008-11-28 Discovered when looking at the code. There is one GCSTART message, statically allocated, per pre-allocated TraceStruct. If it's been Posted but not yet Got, then it's on the arena message Ring. When the trace ends, the TraceStruct is Del'd from arena->busyTraces, even though the MessageStruct et al is still in use. When a new trace starts, it re-uses that TraceStruct, treating it as uninitialised memory, calls TraceCreate, which calls MessageInit, which calls RingInit, thus corrupting the arena message Ring. Now it's right and proper that GCSTART messages are pre-allocated, in accordance with the principle that we avoid allocating when starting a trace. But it's just wrong to statically allocate them: they should be dynamically allocated with ControlAlloc, like other messages. Yes: if the client enables GCSTART and never collects the messages, then they will consume more & more memory. This is just not a problem: the client should not fail to collect messages it has enabled! Surprisingly, the message queue limps on for a long time after corruption with a single RingInited GCSTART message. Appending new messages (eg. finalization messages) works, and heals the next-direction of the ring, which allows these messages to then be Got as normal. Re-adding the same GCSTART message, when already present, also 'works'. But RingRemove simply leaves the ring unchanged; this causes an infinite loop. RHSK 2008-12-19 Fixed with a complete redesign of GC message lifecycle.
How found	inspection
Evidence	none
Observed in	1.108.2
Introduced in	1.107.0
Created by	Richard Kistruck
Created on	2008-11-28 17:24:45
Last modified by	Gareth Rees
Last modified on	2014-04-12 22:05:22
History	2008-11-28 RHSK Created. 2008-12-19 RHSK Fixed with a complete redesign of GC message lifecycle.

Fixes

Change	Effect	Date	User	Description
166995	closed	2008-12-19 16:25:20	Richard Kistruck	MPS br/timing design/message-gc: Complete re-write, for new GC message lifecycle. See job001989.
166993	closed	2008-12-19 14:27:48	Richard Kistruck	MPS br/timing: rename z001989a.c as zmess.c
166919	closed	2008-12-11 16:08:47	Richard Kistruck	MPS br/timing z001989a.c: oops, re-enable test of delayeed getting of messages. Demonstrates fix for job001989.
166918	closed	2008-12-11 16:06:14	Richard Kistruck	MPS br/timing TraceIdMessages: Create them at mps_arena_create time, re-create them immediately after a trace sends its last message. Store them, one per TraceId, in ArenaStruct. Cope correctly if they cannot be created (ControlAlloc fails). Destroy them correctly on mps_arena_destroy. See job001989 and design/message-gc#lifecycle (not yet written).
166881	open	2008-12-05 22:14:56	Richard Kistruck	MPS br/timing z001989a: get collection begin/end messages. FAILS: thereby demonstrating defect job001989.