MPS issue job003640

Title: Can't cope with stack overflows on W3I6MV
Status: closed
Priority: essential
Assigned user: David Lovemore
Organization: Ravenbrook
Description: W3I6MV uses span.c [1], which means that it can hit a stack overflow in the middle of an MPS operation, with the arena lock held, and the heap possibly in an inconsistent state. This is disastrous: the client program cannot continue with an inconsistent heap.
Analysis: The MPS needs to have a definition of which operations are atomic. We have to be able to complete these operations, otherwise the heap may be left in an inconsistent state. "Any operation that holds the arena lock" is a conservative first definition of "atomic operation", but later we may be able to do better than this.

Two ideas for fixing the problem:

1. Implement a "stack probe" on W3I6MV (but see Q1 below).

For the client program to be able to handle a stack overflow exception in the middle of an MPS operation, we need to document how much stack the MPS needs on top of their exception handler. (This matters if their handler interacts with the MPS in any way, e.g. provoking incremental garbage collection by touching a page with a barrier.) We need some way of computing exception handler + MPS deepest stack + format methods.

2. Use another stack for the MPS.

Questions to answer:

Q1. "Stack Probe" is probably not the best name. What we really mean is, "ensure that there is enough stack space for the operation we are about to do." So is there a better way to ensure there's enough space? If so, we should do that instead.

Q2. What's the default Windows behaviour with respect to setting up guard pages? How does the MPS behave in the default case? What advice, if any, do we need to give to programmers who are configuring the guard page situation?

Q3. What happens on Unix?

Q4. When we probe the stack, what do we expect to happen if we hit a guard page? In general, someone else's exception handler will get called (typically the client program's). What range of behaviours are we going to design the MPS to cope with? For example, one possibility is (i) add more stack and continue; but some programs might want to (ii) unwind the stack and do something else (consider an interactive interpreter, for example, which will want to bail out to the prompt).

Q5. What about allocation point protocol? What if you get a stack overflow at some point in reserve or commit, and don't continue? Could this leave the allocation point in an inconsistent state?

Some answers:

A1. The "proper" way to do a stack probe on Windows with x64 is to use _chkstk, which is a stack probe that works even if the size is greater than a page. However, this isn't a C function (it takes RAX as a parameter, not RCX). We have managed to avoid building with assembler on x64, and there is no inline assembler for x64 on Windows. I have discovered that using alloca() performs a stack check that does not get compiled out when optimisation is turned on.
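
The alloca() approach in A1 might look something like the following minimal sketch. The name stack_probe, the PROBE_PAGE constant, and the explicit per-page writes are assumptions made for illustration and are not taken from the MPS sources. On Windows the stack check itself comes from _alloca(); the volatile writes are added here only so that the sketch also touches the memory on platforms where alloca() does not probe.

```c
#include <stddef.h>
#if defined(_WIN32)
#include <malloc.h>            /* _alloca */
#define probe_alloca _alloca
#else
#include <alloca.h>            /* alloca */
#define probe_alloca alloca
#endif

#define PROBE_PAGE 4096u       /* assumed page size, for illustration only */

/* Hypothetical helper: make sure `depth` bytes of stack are usable
 * before starting an operation that must not be interrupted.
 * If the stack cannot be extended, the overflow is raised here,
 * before the operation has modified any state. */
static int stack_probe(size_t depth)
{
    volatile char *p;
    size_t i;

    if (depth == 0)
        return 1;

    p = (volatile char *)probe_alloca(depth);

    /* Touch one byte per page, highest address first, so that guard
     * pages are hit in order as the stack grows downwards. */
    i = depth;
    while (i > 0) {
        p[i - 1] = 0;
        i = (i > PROBE_PAGE) ? i - PROBE_PAGE : 0;
    }
    p[0] = 0;                  /* lowest address of the probed region */
    return 1;
}
```

A caller would invoke stack_probe(n) on entry to the operation, where n is an estimate of the deepest stack the operation can use; choosing n is the open problem described under idea 1.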

A2. For how the stack is set up, see [4]. There is some documentation for guard page behaviour under the documentation for _resetstkoflw [2]. Normally there is one spare guard page, which is unprotected when a stack overflow occurs. The reserved space can be extended by using SetThreadStackGuarantee [3]. If the exception is handled and the stack unwound, it is possible to call _resetstkoflw() to restore the guard. If the guard isn't restored and another stack overflow occurs, the program will generate an access violation. I don't think we need to give any special advice for configuring guard pages, but see A5.

A3. On UNIX if you overflow the stack you get a SIGSEGV. The handler can execute on an alternate stack, if set up using sigaltstack. It should be possible to update the thread state so that execution proceeds from higher up the stack, but I don't have an example of this.
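
As a rough illustration of the sigaltstack arrangement in A3, the sketch below installs a SIGSEGV handler that runs on an alternate stack. The names install_overflow_handler and on_segv, and the 64 KiB stack size, are assumptions chosen for the example. The handler here only reports and exits; it does not attempt the "resume from higher up the stack" recovery, for which no example is known.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define ALT_STACK_SIZE (64 * 1024)   /* assumed size, for illustration */
static char alt_stack[ALT_STACK_SIZE];

/* Runs on the alternate stack, so it still works when the normal
 * stack is exhausted. Only async-signal-safe calls are used. */
static void on_segv(int sig, siginfo_t *info, void *context)
{
    static const char msg[] = "SIGSEGV (possible stack overflow)\n";
    ssize_t n;

    (void)sig; (void)info; (void)context;
    n = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)n;
    _exit(1);
}

/* Hypothetical helper: register the alternate stack for the calling
 * thread and install the handler. Returns 0 on success, -1 on error. */
static int install_overflow_handler(void)
{
    stack_t ss;
    struct sigaction sa;

    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt_stack;
    ss.ss_size = sizeof alt_stack;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;  /* SA_ONSTACK selects the alternate stack */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Note that the alternate stack is a per-thread setting, so a multi-threaded client would need to call sigaltstack in each thread.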

A4. Guard page behaviour in general is documented here [5]. When we hit a stack guard page, we should certainly cope with adding more stack and continuing. We may also want to cope with aborting the current operation at the point the stack overflow occurs and resuming from higher up the stack. For this to work, every operation needs to leave behind a consistent state whenever we call ArenaEnter (which is what calls the stack probe). The operations that don't use ArenaEnter also need to be consistent at any point where a stack overflow may occur.
A cursory look at mpsi.c suggests that most operations will abort fine. However, we need to examine allocation points, stack frame allocation and SAC allocation more thoroughly.

A5. The allocation point protocol can be left in an inconsistent state after an aborted allocation. A reserve without a commit leaves the alloc and init pointers out of sync. (One suggestion is to set alloc = init + size on reserve, so that the operation can be repeated without getting out of sync.)
However, a failure is less likely with a functional stack probe, because the probe makes it likely that any stack overflow occurs at the arena entry in either mps_buffer_fill or mps_buffer_trip. But see job003644 <https://info.ravenbrook.com/infosys/cgi/perfbrowse.cgi?%40job+job003644>.
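
The suggestion in A5 can be illustrated with a toy model. The toy_ap structure and all function names below are hypothetical and are not the MPS allocation point implementation; the model simply assumes, for the sake of the example, that the existing reserve advances alloc by size. It compares that behaviour with setting alloc = init + size when a reserve is abandoned and then repeated.

```c
#include <stddef.h>

/* Toy allocation point (hypothetical, not the MPS structure). */
typedef struct {
    char *init;    /* end of committed (initialized) objects */
    char *alloc;   /* end of reserved space */
    char *limit;   /* end of the buffer */
} toy_ap;

/* Assumed existing behaviour: reserve advances alloc by size. */
static char *reserve_incremental(toy_ap *ap, size_t size)
{
    if (size > (size_t)(ap->limit - ap->alloc))
        return NULL;
    ap->alloc += size;
    return ap->init;
}

/* Suggested behaviour: reserve recomputes alloc from init, so
 * repeating an abandoned reserve gives the same state. */
static char *reserve_from_init(toy_ap *ap, size_t size)
{
    if (size > (size_t)(ap->limit - ap->init))
        return NULL;
    ap->alloc = ap->init + size;
    return ap->init;
}

static void commit(toy_ap *ap)
{
    ap->init = ap->alloc;
}

/* Reserve 16 bytes, abandon the attempt (as if a stack overflow
 * unwound the caller), reserve again, then commit. Returns the
 * number of bytes the commit consumed. */
static size_t bytes_used_after_retry(char *(*reserve)(toy_ap *, size_t))
{
    static char buffer[64];
    toy_ap ap;

    ap.init = buffer;
    ap.alloc = buffer;
    ap.limit = buffer + sizeof buffer;

    (void)reserve(&ap, 16);    /* first attempt, never committed */
    (void)reserve(&ap, 16);    /* retry after the aborted attempt */
    commit(&ap);
    return (size_t)(ap.init - buffer);
}
```

In this model bytes_used_after_retry(reserve_incremental) yields 32, because the abandoned reservation is swept into the commit, while bytes_used_after_retry(reserve_from_init) yields 16.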
How found: inspection
Evidence: [1] <http://www.ravenbrook.com/project/mps/master/code/w3i6mv.nmk>
[2] <http://msdn.microsoft.com/en-us/library/89f73td2.aspx>
[3] <http://msdn.microsoft.com/en-us/library/ee388307.aspx>
[4] <http://msdn.microsoft.com/en-us/library/windows/desktop/ms686774.aspx>
[5] <http://msdn.microsoft.com/en-us/library/windows/desktop/aa366549.aspx>
Introduced in: 1.110.0
Created by: Gareth Rees
Created on: 2013-10-29 16:31:26
Last modified by: Gareth Rees
Last modified on: 2014-10-24 12:48:47
History: 2013-10-29 GDR Created.
2013-11-04 DL Added answers to analysis.
2013-11-05 DL More analysis.

Fixes

Change Effect Date User Description
183921 closed 2014-01-10 12:28:01 Gareth Rees Merge branch/2013-11-04/cet-i6-stack-probe to custom CET mainline.