MPS issue job004104

Titlefotest fails with "arena commit limit exceeded"
Assigned userapt
DescriptionIn the cool variety, on platform xci6ll, and seed 558333490, fotest fails:

    $ noaslr code/xci6ll/cool/fotest 558333490
    code/xci6ll/cool/fotest: randomize(): resetting initial state (v3) to: 558333490.

    stress MVFF: COMMIT_LIMIT: arena commit limit exceeded
AnalysisAdapted from APT's email .

Perforce change number 194948 (inferred from commit message).
The same failure occurs in the master sources, so this isn’t the
fault of the spare-fraction branch and the specific change is
irrelevant, but I'll be quoting line numbers of that version.

The test is failing while allocating memory. In a sense, this is
entirely reasonable behaviour, although it is quite unlucky to
succeed in allocating a big chunk from the arena and then to
fail to allocate a 16-byte control structure.

So what is the failover behaviour for? The design document does
not explain the motivation clearly, but it can be inferred from
the use of class Failover
< >.

The purpose is to be able to combine two lands with different
properties: with a CBS for the primary and a Freelist for the
secondary, operations are fast so long as there is memory to
allocate new nodes in the CBS, but operations can continue
using the Freelist when memory is low.

Viewed from this perspective, the test has perhaps found some
undesirable behaviour; the benefit of wrapping `freeCBSStruct`
in a Failover instance is largely lost because the
`totalCBSStruct` has no fail-over behaviour.

As far as I can see we have two options:

 1. We could decide that this is a rare corner case and we don't
 care. This implies that the test is at fault, and we need to
 add some more logic to `oomAlloc()` to somehow distinguish the
 two CBS structs.

 2. We could decide that inserting a range into `totalCBSStruct`
 must not fail. In that case we probably shouldn't be using a
 CBS structure for it. Perhaps we should instead record the
 allocations by putting a Ring node at the beginning of each
 block of memory that the MVFFPool claims from the arena, as
 MFSPool does? It would complicate `MVFFReduce()`.

Adapted from GDR's email .

The reason that the MVFF’s free memory needs to be stored in a
FailoverLand, is because the signature for mps_free has no way
of reporting an error to the caller. This means that mps_free
must not fail, and that requirement propagates down to MVFFFree.

This requirement does not exist for MVFF's “total” memory, since
mps_alloc returns a result code and so can report failure to the
caller, and so we can represent this using a CBS, which is

Hence my opinion is that the MPS's behaviour is correct, and the
test case is in error. Your proposal [2. above] seems
reasonable. A quick hack that might work would be to have a
global variable which would signal to oomAlloc when it is safe
to fail, and then set this before the call to mps_free and clear
it afterwards.

Made branch 2018-10-08/fotest-flag for the fix.
How foundautomated_test

Line numbers are relative to perforce change 194948.

We failed to reproduce this on Linux. My guess (not confirmed)
is that when we ask the OS for some memory, Linux gives us space
that is contiguous with space we already have, but Mac OS gives
us isolated space. Perhaps we should have a configuration that
over-allocates in order to tickle the code paths for
non-contiguous allocations a bit more? Anyway...

DL helped me to reproduce this on Mac OS. By single stepping,
David found that the fotest.c fails at line 136 with k=9 and

      res = make(&obj, ap, ss[i]);

Internally, the MVFF pool has already failed over to its
secondary free list structure, as the test intended. `make()` is
asking for a block of just under 2K. The MVFF pool finds that it
cannot satisfy the allocation from any existing free space, so
it asks the arena for more memory (poolmvff.c line 203), which

  res = ArenaAlloc(&base, MVFFLocusPref(mvff), allocSize, pool);

The MVFFPool then tries to record in `mvff->totalCBSStruct` that
it has allocated this arena memory (poolmvff.c line 215), which

  res = LandInsert(&coalescedRange, totalLand, &range);

This result propagates up to the `make()` call in the test and
is reported as a failure. (The result code is entirely spurious;
it is chosen at random by `oomAlloc()`, and with this seed it
happens to be `COMMIT_LIMIT`.)
Created byGareth Rees
Created on2018-08-13 12:00:32
Last modified byGareth Rees
Last modified on2019-02-07 10:21:32
History2018-08-13 GDR Created.
2018-10-01 APT Analysis and evidence.
2018-10-08 APT Add branch name of fix.


Change Effect Date User Description
195974 closed 2019-02-07 10:21:32 Gareth Rees Add a flag to fotest.c so that oomAlloc knows when to return error codes, avoiding confusion between the test's pool and the MPS's own pool.