MPS issue job001150

Title: MPS doesn't provide enough feedback information about what it is doing or what it has done
Status: closed
Priority: optional
Assigned user: Gareth Rees
Organization: Ravenbrook
Description: The MPS provides no feedback information that allows a client program to find out what it is doing or what it has done, or any other statistical information such as counts of various operations or total bytes allocated. This kind of information is needed to provide reassurance that the MPS is working, but also to help a programmer make good use of the MPS (not tuning, exactly, but simply avoiding abuse).

Some sample bits of data we might want are:

1. numbers of faults, to indicate how well barriers are working
2. allocation and collection statistics
3. current state: what's in progress?
4. feedback to help with chain size settings

See also [1] from GR.

Related job: job000666 "Difficult to tell what collections are happening"
Analysis: RB 2005-03-09: We need to gather a better list of information that might be required. We need to design a generic interface for such information requests. We need to implement the key information operations. Possibly we need to throw out the idea of "telemetry" and replace it with always-on statistics gathering.

RHSK 2005-12-09: I looked at the message mechanism [2] and telemetry [3].

GR 2004-12-02 [4]: "I remember us talking about some improved reporting facilities from MPS. For my current situation it would be useful to get more info like:

Number of pages used by MPS
Max number of pages used
Average number of pages used

For getting a better picture of what is going on in my virtual address space it would be useful with some way to enumerate the pages used by MPS (or page ranges).

It would be nice also to get a callback on the start of a bigger collection, or some other interface to measure time spent in collection. Our microTime() function typically has a resolution of 10 uS so it should be possible.

Simply a little bit less of a black box!"

RHSK 2006-05-22: Essence is: communicating with the MPS-client developer:

* what information is helpful for them?
* must synthesise information: do this where? and into what 'language'/model?
* how do we export and present this info?

RHSK 2006-05-22: DRJ did some work on this under job000666 [5]. This work was in two parts:

* synthesise descriptions, for example, annotate each ArenaStartCollect with a "why" code;
* various new message types to export this info.

RHSK 2006-05-25: As an example from Lisp GC, see [6].

GDR 2018-06-20: The telemetry subsystem has not been useful to anyone so far. The problem is that the telemetry data is low-level (comprehensive and detailed), but this is not useful by itself without a high-level picture from which an engineer can start to investigate. We imagined that users would develop queries of the telemetry data, for example by processing the output of mpseventtxt, or using mpseventsql to put the data into a database from which they could query it using SQL or standard database-browsing interfaces. But developing these queries is too onerous and so the data goes unused.

So my plan is, first, to develop an application that consumes a telemetry stream (either a recorded stream or a live stream) and extracts a set of time series from it, and plots a subset of the time series on a graph. An example of such a time series would be the number of bytes allocated from the arena to a pool. Then running this application is likely to suggest questions that need investigation (why does this time series have this shape?) and which can be used to guide the next steps of development.
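
As an illustration of the kind of time series described above, here is a minimal sketch that accumulates the bytes allocated from the arena to each pool. The `(time, name, pool, size)` tuple layout and the pool names are assumptions made for the example; they are not the MPS telemetry format, and this is not the monitor's actual code.

```python
from collections import defaultdict


def pool_allocation_series(events):
    """Build a step-wise series of bytes held by each pool.

    `events` is an iterable of (time, name, pool, size) tuples. This layout
    is a hypothetical simplification, not the real MPS event format.
    """
    totals = defaultdict(int)    # pool -> bytes currently allocated
    series = defaultdict(list)   # pool -> [(time, bytes), ...]
    for time, name, pool, size in events:
        if name == "ArenaAlloc":
            totals[pool] += size
        elif name == "ArenaFree":
            totals[pool] -= size
        else:
            continue  # events that do not affect this series are skipped
        series[pool].append((time, totals[pool]))
    return dict(series)


# Hypothetical decoded events, in time order.
events = [
    (0.0, "ArenaAlloc", "pool-A", 4096),
    (0.5, "ArenaAlloc", "pool-B", 8192),
    (1.0, "ArenaAlloc", "pool-A", 4096),
    (2.0, "ArenaFree", "pool-A", 4096),
]
series = pool_allocation_series(events)
```

Each point records the running total at the moment it changes, so a plot of such a series would move in steps rather than interpolate between points.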

Something worth noting is that no-one uses the telemetry system and it does not meet any customer requirements. Accordingly I will feel free to change it as necessary to meet new requirements, without worrying too much about backwards compatibility. For example, the ArenaFree event did not include the pool which was freeing the memory back to the arena. In theory this was unnecessary since you could use the base address of the freed memory to find the corresponding ArenaAlloc, but in practice this would force a consuming application to maintain a model of the whole address space.
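
To make the cost concrete, the following sketch shows the bookkeeping a consumer would need if a free event carried only a base address. The `AddressMap` class and its methods are hypothetical. It also assumes that every free matches an earlier allocation exactly; partial frees would need an interval structure, which is the "model of the whole address space" mentioned above.

```python
class AddressMap:
    """Remember which pool owns each allocated range, keyed by base address."""

    def __init__(self):
        self._owner = {}  # base address -> (pool, size)

    def alloc(self, base, size, pool):
        # A later allocation at the same base replaces the earlier record,
        # since memory can be freed and then reused at the same address.
        self._owner[base] = (pool, size)

    def free(self, base):
        """Return the (pool, size) recorded for the range starting at base."""
        return self._owner.pop(base)


amap = AddressMap()
amap.alloc(0x1000, 4096, "pool-A")
amap.alloc(0x2000, 8192, "pool-B")
freed_pool, freed_size = amap.free(0x1000)  # pool recovered by lookup
```

With the pool included in the event itself, none of this state is needed: the consumer can update the per-pool total directly.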

To-do:

* Better handling of large volumes of data: compact representation via NumPy arrays? compress time series by averaging old points?
* Is there a better approach to getting event definitions?
* User documentation.
* Tree of arenas/pools/timeseries using QTreeWidget.
* Include ArenaAccess in the mark/space ratio computation.
* Labels near the top of the window are not readable.
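
One possible shape for the "compress time series by averaging old points" idea in the to-do list is sketched below in plain Python. The function name and its parameters are assumptions for illustration, and a NumPy version would look different.

```python
def compress_old_points(points, keep_recent, bucket):
    """Average each run of `bucket` old points into a single point.

    `points` is a list of (time, value) pairs in time order. The newest
    `keep_recent` points are kept at full resolution.
    """
    if keep_recent < 0 or bucket < 1:
        raise ValueError("keep_recent must be >= 0 and bucket >= 1")
    split = max(len(points) - keep_recent, 0)
    old, recent = points[:split], points[split:]
    compressed = []
    for i in range(0, len(old), bucket):
        chunk = old[i:i + bucket]
        mean_time = sum(t for t, _ in chunk) / len(chunk)
        mean_value = sum(v for _, v in chunk) / len(chunk)
        compressed.append((mean_time, mean_value))
    return compressed + recent


points = [(0, 10), (1, 20), (2, 30), (3, 40), (4, 50), (5, 60)]
smaller = compress_old_points(points, keep_recent=2, bucket=2)
```

Averaging smooths out the steps of an accumulator, so this trade-off would only suit series where the older detail no longer matters.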

Done:

* Fix compilation on GCC [7].
* The monitor program doesn't need the .py extension.
* Default to MPS_TELEMETRY_FILENAME or mpsio.log if unset.
* Accumulators should move in steps (not linearly interpolate).
* Determine labels for the time series using the Label events.
* Use the EventSync events to convert the time values to seconds.
* Check the version number in the monitor.
* Record the syncing issue (job004080).
* Linear interpolation between each pair of sync events.
* Objects can be destroyed and then recreated at the same addresses.
* Endianness.
* Get it working live with animation.
* Embed in Qt.
* Select time series to display.
* Show colours in checkbox panel.
* Keep colours the same.
* Command-W to close the window.
* Window title.
* Only linearly interpolate each batch once.
* Record the issue of cycle count wrap-around.
* Scrollbar for list of time series.
* Support zooming during live animation.
* Clock values can wrap around (on some platforms) even though TSC can't.
* Read and ignore unrecognized event codes.
* Organize the code and document it.
* Percent-CPU in the MPS (using ArenaPoll events).
* Represent the units of each TimeSeries.
* Draw two different scales using Axes.twinx [8].
* Running on Windows.
* Avoid dummy axes when only one kind of unit is being drawn.
* "The monitor didn't tell me when I pointed it at telemetry generated by an older version of the MPS. It just brought up a blank window."
* Tooltips for the lines.
* If an axis is hidden and then you click on a line, you get "no figure set when check if mouse is on line". Then clicking away from any line gets "AttributeError: 'NoneType' object has no attribute 'add_artist'" and then "ValueError: list.remove(x): x not in list"
* Time series for size of segments referencing each generation.
* If annotation points at a line which is then removed, the annotation should be removed too.
* Keep checkboxes in order by name.
* Label the arena's top gen and default generation chain.
* Total of client pool allocation.
* Per-trace mortality as well as average mortality.
* When a trace of a generation begins and ends.
* When traces begin and end, report condemned memory.
* Support line style — for some time series we may just want to plot the points.
* Some measure of the rate of barrier hits.
* Handle more than two different kinds of unit.
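
The items above about using EventSync events and interpolating linearly between each pair of sync events can be sketched as follows. The `(clock, seconds)` pair layout and the function names are assumptions for the example, not the monitor's code. The sketch assumes strictly increasing clock values, so it does not handle the wrap-around case noted in the list.

```python
from bisect import bisect_right


def make_clock_converter(syncs):
    """Return a function that maps event clock values to seconds.

    `syncs` is a list of (clock, seconds) pairs sorted by strictly
    increasing clock, as might be derived from EventSync events.
    """
    if len(syncs) < 2:
        raise ValueError("need at least two sync points")
    clocks = [c for c, _ in syncs]

    def to_seconds(clock):
        # Choose the pair of sync points that brackets `clock`, clamping to
        # the first or last pair so out-of-range values are extrapolated.
        i = min(max(bisect_right(clocks, clock) - 1, 0), len(syncs) - 2)
        (c0, s0), (c1, s1) = syncs[i], syncs[i + 1]
        return s0 + (clock - c0) * (s1 - s0) / (c1 - c0)

    return to_seconds


# Hypothetical sync points: the clock rate changes after the second one.
to_seconds = make_clock_converter([(1000, 0.0), (3000, 1.0), (4000, 2.0)])
```

Interpolating per pair rather than fitting one line across the whole run lets the conversion follow a clock whose rate drifts between sync events.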
How found: customer
Evidence: [1] http://info.ravenbrook.com/mail/2004/12/02/07-53-32/0.txt
[2] http://info.ravenbrook.com/user/rhsk/mps/working/job/job001150/1150msgs.txt
[3] http://info.ravenbrook.com/project/mps/doc/2006-05-11/telemetry-log-events/
[4] http://info.ravenbrook.com/mail/2004/12/02/07-53-32/0.txt
[5] https://info.ravenbrook.com/infosys/cg...ject/mps/branch/2003-02-17/gcgenmsg/...
[6] http://info.ravenbrook.com/mail/2006/05/25/17-59-05/0.txt
[7] https://travis-ci.org/Ravenbrook/mps/builds/395007947
[8] https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.twinx.html
[9] https://info.ravenbrook.com/mail/2005/03/01/11-16-25/0/
[10] https://info.ravenbrook.com/mail/2005/03/01/15-08-46/0/
[11] https://info.ravenbrook.com/mail/2005/03/01/15-31-13/0/
[12] https://info.ravenbrook.com/mail/2005/03/02/15-03-33/0/
Raw notes from Configura workshop <http://info.ravenbrook.com/mail/2005/02/28/12-40-49/0.txt>.
The thread "Side-by-side comparison: MPS vs dumb stop-and-copy" [9] [10] [11] [12] shows where we had to measure things using other means and shouldn't've had to.
Observed in: 1.105.0
Created by: Richard Brooksby
Created on: 2005-03-09 18:11:35
Last modified by: Gareth Rees
Last modified on: 2018-10-13 12:22:40
History: 2005-03-09 RB Created.
2005-03-10 RHSK Link to Configura mail.
2005-03-10 RHSK Subsumes job000666 (no longer).
2005-12-08 RHSK Use phrase "feedback information"
2005-12-09 RHSK Analysis: part of trace message lifecycle.
2006-05-22 RHSK more analysis; link to DRJ's branch; remove notes on workings of messages.
2006-05-25 RHSK link to example of Lisp GC feedback
2006-12-27 RHSK Unsubsume job000666, which is the initial _gc_start message implementation.
2013-06-16 RB Downgraded to "optional" since telemetry and message feedback is much improved.
2018-06-20 GDR More analysis, new plan.

Fixes

Change  Effect  Date                 User              Description
195235  closed  2018-10-13 12:22:24  Gareth Rees       Merge branch/2018-06-20/monitor into the master sources.
158620  open    2006-05-11 18:05:14  Richard Kistruck  MPS: telemetry log events: notes, for job001150; add link in Doc index to new telemetry log events document
158619  open    2006-05-11 17:55:47  Richard Kistruck  MPS: telemetry log events: notes (for job001150): define Terminology, and give an Overview
158616  open    2006-05-11 16:59:59  Richard Kistruck  MPS: telemetry log events: notes, for job001150