MPS issue job004104

Title	fotest fails with "arena commit limit exceeded"
Status	closed
Priority	nice
Assigned user	apt
Organization	Ravenbrook
Description	In the cool variety, on platform xci6ll, and seed 558333490, fotest fails: $ noaslr code/xci6ll/cool/fotest 558333490 code/xci6ll/cool/fotest: randomize(): resetting initial state (v3) to: 558333490. stress MVFF: COMMIT_LIMIT: arena commit limit exceeded
Analysis	Adapted from APT's email `https://info.ravenbrook.com/mail/2018/09/25/16-54-36/0/` . Perforce change number 194948 (inferred from commit message). The same failure occurs in the master sources, so this isn’t the fault of the spare-fraction branch and the specific change is irrelevant, but I'll be quoting line numbers of that version. The test is failing while allocating memory. In a sense, this is entirely reasonable behaviour, although it is quite unlucky to succeed in allocating a big chunk from the arena and then to fail to allocate a 16-byte control structure. So what is the failover behaviour for? The design document does not explain the motivation clearly, but it can be inferred from the use of class Failover < `https://info.ravenbrook.com/project/mps/master/design/failover#overview` >. The purpose is to be able to combine two lands with different properties: with a CBS for the primary and a Freelist for the secondary, operations are fast so long as there is memory to allocate new nodes in the CBS, but operations can continue using the Freelist when memory is low. Viewed from this perspective, the test has perhaps found some undesirable behaviour; the benefit of wrapping `freeCBSStruct` in a Failover instance is largely lost because the `totalCBSStruct` has no fail-over behaviour. As far as I can see we have two options: 1. We could decide that this is a rare corner case and we don't care. This implies that the test is at fault, and we need to add some more logic to `oomAlloc()` to somehow distinguish the two CBS structs. 2. We could decide that inserting a range into `totalCBSStruct` must not fail. In that case we probably shouldn't be using a CBS structure for it. Perhaps we should instead record the allocations by putting a Ring node at the beginning of each block of memory that the MVFFPool claims from the arena, as MFSPool does? It would complicate `MVFFReduce()`. Adapted from GDR's email `https://info.ravenbrook.com/mail/2018/09/28/11-20-47/0/` . The reason that the MVFF’s free memory needs to be stored in a FailoverLand, is because the signature for mps_free has no way of reporting an error to the caller. This means that mps_free must not fail, and that requirement propagates down to MVFFFree. This requirement does not exist for MVFF's “total” memory, since mps_alloc returns a result code and so can report failure to the caller, and so we can represent this using a CBS, which is simpler. Hence my opinion is that the MPS's behaviour is correct, and the test case is in error. Your proposal [2. above] seems reasonable. A quick hack that might work would be to have a global variable which would signal to oomAlloc when it is safe to fail, and then set this before the call to mps_free and clear it afterwards. Made branch 2018-10-08/fotest-flag for the fix.
How found	automated_test
Evidence	[1] `https://travis-ci.org/Ravenbrook/mps/jobs/415371217` Line numbers are relative to perforce change 194948. We failed to reproduce this on Linux. My guess (not confirmed) is that when we ask the OS for some memory, Linux gives us space that is contiguous with space we already have, but Mac OS gives us isolated space. Perhaps we should have a configuration that over-allocates in order to tickle the code paths for non-contiguous allocations a bit more? Anyway... DL helped me to reproduce this on Mac OS. By single stepping, David found that the fotest.c fails at line 136 with k=9 and i=107: res = make(&obj, ap, ss[i]); Internally, the MVFF pool has already failed over to its secondary free list structure, as the test intended. `make()` is asking for a block of just under 2K. The MVFF pool finds that it cannot satisfy the allocation from any existing free space, so it asks the arena for more memory (poolmvff.c line 203), which succeeds: res = ArenaAlloc(&base, MVFFLocusPref(mvff), allocSize, pool); The MVFFPool then tries to record in `mvff->totalCBSStruct` that it has allocated this arena memory (poolmvff.c line 215), which fails: res = LandInsert(&coalescedRange, totalLand, &range); This result propagates up to the `make()` call in the test and is reported as a failure. (The result code is entirely spurious; it is chosen at random by `oomAlloc()`, and with this seed it happens to be `COMMIT_LIMIT`.)
Created by	Gareth Rees
Created on	2018-08-13 12:00:32
Last modified by	Gareth Rees
Last modified on	2019-02-07 10:21:32
History	2018-08-13 GDR Created. 2018-10-01 APT Analysis and evidence. 2018-10-08 APT Add branch name of fix.

Fixes

Change	Effect	Date	User	Description
195974	closed	2019-02-07 10:21:32	Gareth Rees	Add a flag to fotest.c so that oomAlloc knows when to return error codes, avoiding confusion between the test's pool and the MPS's own pool.