MPS issue job003503

TitleTelemetry system calls mps_io_flush() without mps_io_create()
Assigned userGareth Rees
DescriptionBruce Mitchener reported on IRC that the amcssth test case was hanging:

    brucem: any chance there's an infinite loop in amcssth?
    11:04 (on FreeBSD x86)
    11:05 (or a deadlock)
    11:05 err, sorry, Linux x86
AnalysisI ran this test case in GDB on lii6gc. I was able to get it to hang with the following backtrace in thread 2:

    #0 0x00007ffff76039cb in ?? () from /lib/x86_64-linux-gnu/
    #1 0x00007ffff757f3c0 in ?? () from /lib/x86_64-linux-gnu/
    #2 0x00007ffff757e7f8 in ?? () from /lib/x86_64-linux-gnu/
    #3 0x000000000043baa0 in mps_io_flush (mps_io=<optimized out>)
        at mpsioan.c:79
    #4 0x00000000004269f8 in EventSync () at event.c:149
    #5 0x0000000000426ac6 in EventFlush (kind=4) at event.c:90
    #6 0x0000000000437280 in amcFixInPlace (pool=<optimized out>,
        seg=0x7ffff7f7f978, ss=0x7ffff7500b60, refIO=<optimized out>)
        at poolamc.c:1715

(Line 79 of mpsioan.c is the call to fflush(f). Not sure why the name "fflush" is missing from the backtrace.)

This is the backtrace in thread 1:

    #0 0x00007ffff75387a4 in sigsuspend () from /lib/x86_64-linux-gnu/
    #1 0x0000000000435583 in suspendSignalHandler (sig=<optimized out>,
        info=<optimized out>, context=0x7fffffffdf00) at pthrdext.c:99
    #2 <signal handler called>
    #3 0x00007ffff75e892d in write () from /lib/x86_64-linux-gnu/
    #4 0x00007ffff757b883 in _IO_file_write ()
        from /lib/x86_64-linux-gnu/
    #5 0x00007ffff757b74a in ?? () from /lib/x86_64-linux-gnu/
    #6 0x00007ffff757ceb5 in _IO_do_write () from /lib/x86_64-linux-gnu/
    #7 0x00007ffff757bdc0 in _IO_file_sync ()
        from /lib/x86_64-linux-gnu/
    #8 0x00007ffff7570cfb in fflush () from /lib/x86_64-linux-gnu/
    #9 0x0000000000401e5a in test (arg=<optimized out>, s=<optimized out>)
        at amcssth.c:271
    #10 0x000000000043680d in ProtTramp (resultReturn=0x7fffffffe570,
        f=0x401b22 <test>, p=0x7ffff7ff6000, s=0) at protix.c:132
    #11 0x000000000040613c in mps_tramp (r_o=0x7fffffffe570, f=0x401b22 <test>,
        p=0x7ffff7ff6000, s=0) at mpsi.c:1378
    #12 0x00000000004020f2 in main (argc=<optimized out>, argv=0x7fffffffe678)
        at amcssth.c:334

So it looks as if there's deadlock here between the two simultaneous fflush() calls. How can that be possible? POSIX I/O is supposed to be thread-safe.

But looking at the argument to mps_io_flush(), it's clear that something has gone wrong. eventIO has not been initialized (because event logging was not requested) but mps_io_flush() is being called anyway:

    (gdb) frame 4
    #4 0x00000000004269f8 in EventSync () at event.c:149
    149 (void)mps_io_flush(eventIO);
    (gdb) p eventIO
    $2 = (mps_io_t) 0x0
    (gdb) p eventIOInited
    $3 = 0
    (gdb) p EventKindControl
    $4 = 0

If the environment variable MPS_TELEMETRY_CONTROL is not set then no events need to be written to the telemetry stream, and so mps_io_create() is never called, and so eventIO is never initialized. But EventSync() always calls mps_io_flush(). See event.c [1]. So we need to guard the call to mps_io_flush().

So why does fflush(NULL) deadlock against fflush(stdout) on Linux but not on OS X? Well, fflush(NULL) flushes all open files.
How foundautomated_test
Evidence[1] //
Observed in1.111.0
Created byGareth Rees
Created on2013-05-30 14:53:19
Last modified byGareth Rees
Last modified on2013-06-04 15:19:42
History2013-05-30 GDR Created.


Change Effect Date User Description
182322 closed 2013-05-30 15:24:41 Gareth Rees Guard the calls to mps_io_flush so that it can only be called after mps_io_create.