MPS issue job003539

Title: MPS pause times are not well regulated
Status: closed
Priority: optional
Assigned user: Richard Brooksby
Organization: Ravenbrook
Description: The MPS is overperforming on incremental short pause times on modern processors: it causes hundreds or thousands of short pauses per second. This isn't required for good user interaction and is generally inefficient, due to the incremental bookkeeping overhead and the context switching into and out of the MPS.

Some preliminary hacks indicate that there could be a significant performance improvement from regulating pause times better <https://info.ravenbrook.com/mail/2013/06/18/18-22-24/0/>. The "phasers on stun" hack used the RDTSC timer to avoid returning to the mutator before 100 ms of CPU time had passed (or the collection had completed). However, this simplistic hack didn't regulate the gaps between pauses, so it probably isn't sufficient for production.
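A minimal sketch of the idea behind the hack, assuming the collector's work can be expressed as repeated quanta: keep working until a minimum pause has elapsed or the collection completes, and only then return to the mutator. Every name here (collect_for_at_least, clock_fn, step_fn, struct sim) is hypothetical and none is taken from the MPS sources. The clock is passed in as a function rather than read from RDTSC, so that the loop can be exercised with a simulated clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback types; not part of the MPS interface. */
typedef uint64_t (*clock_fn)(void *closure); /* current time in ms */
typedef bool (*step_fn)(void *closure);      /* do one quantum of work;
                                                true if collection is done */

/* Run work quanta until at least min_pause_ms have elapsed or the
   collection completes. Returns true if the collection completed. */
static bool collect_for_at_least(uint64_t min_pause_ms,
                                 clock_fn now, void *clock_closure,
                                 step_fn step, void *step_closure)
{
    uint64_t start;

    assert(now != NULL && step != NULL);
    start = now(clock_closure);
    do {
        if (step(step_closure))
            return true;                 /* finished early: stop pausing */
    } while (now(clock_closure) - start < min_pause_ms);
    return false;                        /* time budget used up */
}

/* Simulated collector for experiments: each quantum costs a fixed
   amount of simulated time. */
struct sim {
    uint64_t time_ms;    /* simulated clock */
    uint64_t quantum_ms; /* simulated cost of one quantum */
    unsigned remaining;  /* quanta left before the collection completes */
    unsigned steps;      /* quanta performed so far */
};

static uint64_t sim_now(void *p)
{
    return ((struct sim *)p)->time_ms;
}

static bool sim_step(void *p)
{
    struct sim *s = p;

    s->time_ms += s->quantum_ms;
    s->steps++;
    if (s->remaining > 0)
        s->remaining--;
    return s->remaining == 0;
}
```

Like the hack described above, this sketch only sets a lower bound on the length of each pause. It does nothing to regulate the gaps between pauses.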

Non-interactive programs like Clasp don't want to pay the cost of incremental collection and barrier hits at all: they need maximum throughput instead. See [3].
Analysis: High-level design in <https://info.ravenbrook.com/mail/2013/07/05/00-46-27/0/>.

Göran is keen for "phasers on stun" (pause time regulation) in CVM and wants that to go ahead.

DL cannot currently reproduce this speedup. At the moment the frequency of collections is determined by ArenaPollALLOCTIME, and there is a calculation of how much work to do in a TraceQuantum, which currently works out at around 1Mb. Setting ArenaPollALLOCTIME to a larger value doesn't appear to improve performance. Completing a collection in TraceQuantum also doesn't appear to affect performance. I just can't reproduce it at present.

DL did a further experiment, reducing ALLOCTIME to 4096 (from 65536), and I did get a 2-3% slowdown. It also seems to run marginally slower with a much larger value (655360). I think this parameter is fairly well tuned. A large value possibly saves 0.5% on startup time, but this is hard to measure accurately.
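To put the three values tried above in proportion, here is a rough model of allocation-driven polling. It assumes the parameter is the number of bytes allocated between polls of the collector, and that a poll resets the threshold relative to the current allocation total. That is an assumption made for illustration, not a description of the MPS implementation, and polls_for is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not MPS code: count how many times the collector
   would be polled while allocating `count` chunks of `chunk` bytes,
   polling whenever at least `poll_interval` bytes have been allocated
   since the previous poll. */
static unsigned long polls_for(size_t poll_interval, size_t chunk,
                               size_t count)
{
    size_t allocated = 0;             /* total bytes allocated so far */
    size_t next_poll = poll_interval; /* allocation total that triggers
                                         the next poll */
    unsigned long polls = 0;
    size_t i;

    assert(poll_interval > 0 && chunk > 0);
    for (i = 0; i < count; i++) {
        allocated += chunk;
        if (allocated >= next_poll) {
            polls++;
            next_poll = allocated + poll_interval;
        }
    }
    return polls;
}
```

Under this model, allocating 1 MiB in 1 KiB chunks triggers 256 polls with an interval of 4096, 16 polls with 65536, and 1 poll with 655360. The poll count alone does not predict run time: as reported above, both the smaller and the much larger value measured slower than 65536.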

I guess the 25% speedup seen was related to the performance problems we were having before we made some improvements (see job003536). I recommend we do not proceed with this at present.

GDR 2014-05-15: After discussion with DL and RB this is back on the agenda. In the light of the conflicting findings reported above it's important to figure out what's going on. Maybe the experiments were carried out on different operating systems with different overheads for context switching and barrier hits?
How found: unknown
Evidence: [1] "Not the CET and MPS status report" <https://info.ravenbrook.com/mail/2013/06/19/16-24-16/0/>
[2] MPS strategy discussion <https://info.ravenbrook.com/mail/2014/05/15/19-19-13/0/>
[3] <https://info.ravenbrook.com/mail/2014/08/20/13-48-51/0/>
Created by: Richard Brooksby
Created on: 2013-07-08 13:01:55
Last modified by: Richard Brooksby
Last modified on: 2016-03-15 06:34:37
History: 2013-07-08 RB Created to resolve job003534.
2013-11-22 DL Added to analysis. Can't repeat.
2013-12-10 DL Further test, can't repeat, but can get slow down.
2013-12-19 GDR Moved to MPS project and set to optional.

Fixes

Change  Effect  Date                 User              Description
190053  closed  2016-03-15 06:31:08  Richard Brooksby  Merging branch/2016-03-12/pause into the master sources.