2007-06-04 17:37:35

by Matt Mackall

Subject: Interesting interaction between lguest and CFS

With 2.6.22-rc3-mm1, I've got a long-running video transcoding going
on. In other windows, I'm compiling, reading email, etc. with no
noticeable problems.

If I fire up lguest and leave it sitting at a shell prompt for a
couple moments, when I return to type something at the prompt, it can
take 2-3 seconds for my input to echo. Once it starts responding,
typing latency disappears. Suspend the other app and the latency
disappears.

The system isn't swapping and has basically no I/O load.

Ok, my transcoding just finished as I was writing this. So I've
reproduced the problem with this Python script that I had handy:

memload.py:
#!/usr/bin/python
a = "a" * 16 * 1024 * 1024
while 1:
    b = a[1:] + "b"
    a = b[1:] + "c"

--
Mathematics is the supreme nostalgia of our time.


2007-06-04 17:54:57

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Matt Mackall <[email protected]> wrote:

> With 2.6.22-rc3-mm1, I've got a long-running video transcoding going
> on. In other windows, I'm compiling, reading email, etc. with no
> noticeable problems.
>
> If I fire up lguest and leave it sitting at a shell prompt for a
> couple moments, when I return to type something at the prompt, it can
> take 2-3 seconds for my input to echo. Once it starts responding,
> typing latency disappears. Suspend the other app and the latency
> disappears.
>
> The system isn't swapping and has basically no I/O load.

could you send me the /proc/PID/sched file of that task which shows this
2-3 seconds delay? (if it's a shell and a graphical terminal then please
send the sched-debug file of both.) I suspect the latency goes away if
you strace -f the shell task(s)? If not then please send the strace
output too.
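
For anyone wanting to reproduce this, the per-task stats can be collected with a small parser. A sketch (assumes CONFIG_SCHED_DEBUG is enabled so /proc/PID/sched exists; field names vary between kernel versions, and parse_sched is an illustrative helper, not an existing tool):

```python
# Parse the "key : value" lines of a /proc/PID/sched file.
# Field names differ across kernel versions, so parse defensively.
def parse_sched(lines):
    stats = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats

# Typical use on a live system:
#   with open("/proc/%d/sched" % pid) as f:
#       stats = parse_sched(f)
```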

Ingo

2007-06-04 18:12:16

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Matt Mackall <[email protected]> wrote:

> With 2.6.22-rc3-mm1, I've got a long-running video transcoding going
> on. In other windows, I'm compiling, reading email, etc. with no
> noticeable problems.
>
> If I fire up lguest and leave it sitting at a shell prompt for a
> couple moments, when I return to type something at the prompt, it can
> take 2-3 seconds for my input to echo. Once it starts responding,
> typing latency disappears. Suspend the other app and the latency
> disappears.

btw., does this only happen with lguest, or with other idle shells too?

perhaps lguest does something unusual when it idles? Does it fiddle
sched_clock() in any way?

Ingo

2007-06-04 18:41:43

by Matt Mackall

Subject: Re: Interesting interaction between lguest and CFS

On Mon, Jun 04, 2007 at 07:54:36PM +0200, Ingo Molnar wrote:
>
> * Matt Mackall <[email protected]> wrote:
>
> > With 2.6.22-rc3-mm1, I've got a long-running video transcoding going
> > on. In other windows, I'm compiling, reading email, etc. with no
> > noticeable problems.
> >
> > If I fire up lguest and leave it sitting at a shell prompt for a
> > couple moments, when I return to type something at the prompt, it can
> > take 2-3 seconds for my input to echo. Once it starts responding,
> > typing latency disappears. Suspend the other app and the latency
> > disappears.
> >
> > The system isn't swapping and has basically no I/O load.
>
> could you send me the /proc/PID/sched file of that task which shows this
> 2-3 seconds delay? (if it's a shell and a graphical terminal then please
> send the sched-debug file of both.) I suspect the latency goes away if
> you strace -f the shell task(s)? If not then please send the strace
> output too.

Not sure why the shell task (bash) is interesting here?

gnome-terminal(4177)->bash(4409)->lg(24934)->lguest(24936)->lguest(24944)

(lg is a shell script to launch lguest)

/proc/4177/sched
wait_start : 0
wait_start_fair : 0
exec_start : 0
sleep_start : 53594838106004537
sleep_start_fair : 53575815099165186
block_start : 0
sleep_max : 57476665627
block_max : 18014060106626075
exec_max : 4021419
wait_max : 557161172
last_ran : 0
wait_runtime : 23552233
wait_runtime_overruns : 1864716
wait_runtime_underruns : 310
sum_exec_runtime : 678350336100
clock-delta : 94

cat /proc/4409/sched
wait_start : 0
wait_start_fair : 0
exec_start : 0
sleep_start : 53593602868366272
sleep_start_fair : 53575146154577161
block_start : 0
sleep_max : 2006065218683
block_max : 457256091
exec_max : 3995741
wait_max : 107832482
last_ran : 0
wait_runtime : 23634554
wait_runtime_overruns : 316
wait_runtime_underruns : 9
sum_exec_runtime : 289263702
clock-delta : 83

/proc/24936/sched
wait_start : 0
wait_start_fair : 0
exec_start : 0
sleep_start : 53594765858891273
sleep_start_fair : 53575748785282138
block_start : 0
sleep_max : 21492554
block_max : 27044576
exec_max : 4008057
wait_max : 1253670288
last_ran : 0
wait_runtime : 23955448
wait_runtime_overruns : 270678
wait_runtime_underruns : 1
sum_exec_runtime : 36978464446
clock-delta : 81

/proc/24944/sched
wait_start : 0
wait_start_fair : 0
exec_start : 0
sleep_start : 53594628265104224
sleep_start_fair : 53575623092295018
block_start : 0
sleep_max : 223179035512
block_max : 0
exec_max : 3960054
wait_max : 4334838886
last_ran : 0
wait_runtime : 23959470
wait_runtime_overruns : 1243
wait_runtime_underruns : 15
sum_exec_runtime : 17872418
clock-delta : 81

strace -f doesn't kill the latency, here's an strace:

24936 13:39:44.469329 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:44.469572 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:44.469677 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:44.469765 read(7, 0xbfe6c010, 8) = -1 EINTR (Interrupted system call)
24936 13:39:48.699490 --- SIGUSR1 (User defined signal 1) @ 0 (0) ---
24936 13:39:48.699655 sigreturn() = ? (mask now [])
24936 13:39:48.699842 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.699939 select(4, [0 3], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
24936 13:39:48.700063 write(7, "\1\0\0\0\0\0\0\0", 8) = 5675552
24936 13:39:48.700168 readv(0, [{"sdfsdfsdfsdfsdon\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 256}], 1) = 14
24936 13:39:48.700360 write(7, "\2\0\0\0\3\0\0\0", 8) = 0
24936 13:39:48.700451 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.700547 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.700635 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.700872 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.700962 writev(1, [{"s", 1}], 1) = 1
24936 13:39:48.701200 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.701297 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.701385 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.701495 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.701584 writev(1, [{"d", 1}], 1) = 1
24936 13:39:48.701773 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.701870 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.701957 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.702061 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.702149 writev(1, [{"f", 1}], 1) = 1
24936 13:39:48.702336 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.702431 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.702519 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.702622 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.702710 writev(1, [{"s", 1}], 1) = 1
24936 13:39:48.702898 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.702993 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.703157 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.703264 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.703352 writev(1, [{"d", 1}], 1) = 1
24936 13:39:48.703541 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.703636 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.703724 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.703828 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.703916 writev(1, [{"f", 1}], 1) = 1
24936 13:39:48.704103 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.704198 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.704285 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.704389 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.704476 writev(1, [{"s", 1}], 1) = 1
24936 13:39:48.704662 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.704758 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.704845 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.704948 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.705036 writev(1, [{"d", 1}], 1) = 1
24936 13:39:48.705222 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.705317 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.705405 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.705508 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.705595 writev(1, [{"f", 1}], 1) = 1
24936 13:39:48.705781 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.705877 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.705964 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.706068 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.706155 writev(1, [{"s", 1}], 1) = 1
24936 13:39:48.706341 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.706483 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.706571 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.706675 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.706763 writev(1, [{"d", 1}], 1) = 1
24936 13:39:48.706950 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.707045 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.707221 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.707325 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.707414 writev(1, [{"f", 1}], 1) = 1
24936 13:39:48.707600 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.707695 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.707800 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.707904 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.707992 writev(1, [{"s", 1}], 1) = 1
24936 13:39:48.708177 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.708273 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.708360 read(7, "P\216\304\2\0\0\0\0", 8) = 8
24936 13:39:48.708463 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.708551 writev(1, [{"d", 1}], 1) = 1
24936 13:39:48.708736 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.708832 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.708919 read(7, 0xbfe6c010, 8) = -1 EINTR (Interrupted system call)
24936 13:39:48.951555 --- SIGUSR1 (User defined signal 1) @ 0 (0) ---
24936 13:39:48.951697 sigreturn() = ? (mask now [])
24936 13:39:48.951884 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.951983 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
24936 13:39:48.952092 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
24936 13:39:48.952184 read(7, <unfinished ...>



--
Mathematics is the supreme nostalgia of our time.

2007-06-04 18:42:48

by Matt Mackall

Subject: Re: Interesting interaction between lguest and CFS

On Mon, Jun 04, 2007 at 08:11:51PM +0200, Ingo Molnar wrote:
>
> * Matt Mackall <[email protected]> wrote:
>
> > With 2.6.22-rc3-mm1, I've got a long-running video transcoding going
> > on. In other windows, I'm compiling, reading email, etc. with no
> > noticeable problems.
> >
> > If I fire up lguest and leave it sitting at a shell prompt for a
> > couple moments, when I return to type something at the prompt, it can
> > take 2-3 seconds for my input to echo. Once it starts responding,
> > typing latency disappears. Suspend the other app and the latency
> > disappears.
>
> btw., does this only happen with lguest, or with other idle shells too?

Only noticed with lguest.

> perhaps lguest does something unusual when it idles? Does it fiddle
> sched_clock() in any way?

In the host? That'd be weird.

--
Mathematics is the supreme nostalgia of our time.

2007-06-04 19:12:47

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Matt Mackall <[email protected]> wrote:

> > btw., does this only happen with lguest, or with other idle shells
> > too?
>
> Only noticed with lguest.

ah, so both the shell and the 'competing' CPU hog were running within the
same lguest instance?

Ingo

2007-06-04 19:28:24

by Matt Mackall

Subject: Re: Interesting interaction between lguest and CFS

On Mon, Jun 04, 2007 at 09:12:29PM +0200, Ingo Molnar wrote:
>
> * Matt Mackall <[email protected]> wrote:
>
> > > btw., does this only happen with lguest, or with other idle shells
> > > too?
> >
> > Only noticed with lguest.
>
> ah, so both the shell and the 'competing' CPU hog were running within the
> same lguest instance?

No.

The CPU hog is running in the host. A single instance of lguest using
busybox is started in another shell. When it reaches a shell prompt,
-that- shell is occasionally unresponsive for long stretches.

For the purposes of CFS, of course, the shell running inside lguest is
invisible.

Also, for comparison, running the same kernel and disk image inside
qemu doesn't show any lag, despite being substantially slower (no
kernel/VT acceleration).

--
Mathematics is the supreme nostalgia of our time.

2007-06-04 19:40:56

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Matt Mackall <[email protected]> wrote:

> > ah, so both the shell and the 'competing' CPU hog were running within
> > the same lguest instance?
>
> No.
>
> The CPU hog is running in the host. A single instance of lguest using
> busybox is started in another shell. When it reaches a shell prompt,
> -that- shell is occasionally unresponsive for long stretches.

so the shell within lguest is affected by the CPU hog outside? Could you
send me the sched-debug stats of the CPU hog and of the lguest host
process as well? (or were those amongst the ones you already sent?)

Ingo

2007-06-04 20:00:29

by Matt Mackall

Subject: Re: Interesting interaction between lguest and CFS

On Mon, Jun 04, 2007 at 09:40:38PM +0200, Ingo Molnar wrote:
>
> * Matt Mackall <[email protected]> wrote:
>
> > > ah, so both the shell and the 'competing' CPU hog were running within
> > > the same lguest instance?
> >
> > No.
> >
> > The CPU hog is running in the host. A single instance of lguest using
> > busybox is started in another shell. When it reaches a shell prompt,
> > -that- shell is occasionally unresponsive for long stretches.
>
> so the shell within lguest is affected by the CPU hog outside? Could you
> send me the sched-debug stats of the CPU hog and of the lguest host
> process as well? (or were those amongst the ones you already sent?)

I sent you the stats for:

- both lguest processes
- the bash which started lguest (which I suspect you don't actually
want)
- the gnome-terminal that lguest was running inside of
(keystrokes to the shell in lguest go through that terminal)

The last is probably not very interesting either, as gnome-terminal is
a single process with multiple windows and none of the other windows
were affected.

So I sent you everything but the stats for the CPU hog.

I suggest you try this yourself - lguest is incredibly easy to get up
and running. It's also quite useful: I can test-boot kernels with it
in less than a second, or about 10x faster than basic qemu, and 100x
faster than a real boot. And as it uses a pty as console, you can do
things like pipe it through grep.

You'll need the following tab-damaged patch for rc3-mm1:

--- mm.orig/arch/i386/kernel/sched-clock.c	2007-06-04 12:15:51.000000000 -0500
+++ mm/arch/i386/kernel/sched-clock.c	2007-06-04 12:15:57.000000000 -0500
@@ -133,7 +133,7 @@ static void resync_freq(void *arg)
 	struct sc_data *sc = &__get_cpu_var(sc_data);
 
 	sc->sync_base = jiffies;
-	if (!cpu_has_tsc) {
+	if (!cpu_has_tsc || tsc_disable) {
 		sc->unstable = 1;
 		return;
 	}

--
Mathematics is the supreme nostalgia of our time.

2007-06-04 20:13:21

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Matt Mackall <[email protected]> wrote:

> So I sent you everything but the stats for the CPU hog.

ok, thx.

> I suggest you try this yourself [...]

yep, i know how to use lguest and i'm already running it :) But i cannot
trigger the problem, that's why i was curious about your host-side
stats.

Ingo

2007-06-04 20:15:42

by James Morris

Subject: Re: Interesting interaction between lguest and CFS

On Mon, 4 Jun 2007, Ingo Molnar wrote:

> perhaps lguest does something unusual when it idles? Does it fiddle
> sched_clock() in any way?

One of the lguest processes calls nice(2), which seems like a long shot,
but also not impossibly related. (Note that newer patches in the lguest
queue remove this, so it'll depend on exactly which patches you have
applied).


- James
--
James Morris
<[email protected]>

2007-06-04 20:21:04

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Ingo Molnar <[email protected]> wrote:

> > I suggest you try this yourself [...]
>
> yep, i know how to use lguest and i'm already running it :) But i
> cannot trigger the problem, that's why i was curious about your
> host-side stats.

i can run a hundred CPU hogs on the host and the lguest instance is
still usable under CFS:

@%@%> time ls
bin home lost+found proc sys var
dev lib mnt root tmp
etc linuxrc opt sbin usr
real 0m 0.33s
user 0m 0.00s
sys 0m 0.00s

i'm wondering whether the delay goes away if you boot with maxcpus=1?
(if you have multiple cpus/cores on your box)

Ingo

2007-06-04 20:24:24

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* James Morris <[email protected]> wrote:

> On Mon, 4 Jun 2007, Ingo Molnar wrote:
>
> > perhaps lguest does something unusual when it idles? Does it fiddle
> > sched_clock() in any way?
>
> One of the lguest processes calls nice(2), which seems like a long
> shot, but also not impossibly related. (Note that newer patches in
> the lguest queue remove this, so it'll depend on exactly which patches
> you have applied).

i've got 2.6.22-rc3-mm1's lguest (with a trivial boot-time crash fix in
sched_clock that i ran into). But nice(2) should only cause a gentle,
20% smaller CPU bandwidth to be allocated by CFS to that thread, nothing
more drastic like user-visible delays.
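
As a back-of-the-envelope check of that "gentle" effect (this uses the roughly 1.25x-per-nice-level weight step as an approximation of the kernel's prio_to_weight table; the function name is illustrative, not kernel code):

```python
# Rough model of CFS nice weighting: each nice level is about a 1.25x
# weight step. Approximate, illustrative numbers only.
def cpu_share(my_nice, other_nice):
    """Share of CPU a task gets when competing with one other task."""
    w_me = 1.25 ** (-my_nice)
    w_other = 1.25 ** (-other_nice)
    return w_me / (w_me + w_other)
```

Under this model a nice(2) task competing with a nice(0) hog still gets roughly 39% of the CPU (about 20% below its fair half), while a nice(19) task is squeezed to under 2%.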

Ingo

2007-06-04 22:52:27

by Olof Johansson

Subject: Re: Interesting interaction between lguest and CFS

On Mon, Jun 04, 2007 at 12:37:10PM -0500, Matt Mackall wrote:
> With 2.6.22-rc3-mm1, I've got a long-running video transcoding going
> on. In other windows, I'm compiling, reading email, etc. with no
> noticeable problems.
>
> If I fire up lguest and leave it sitting at a shell prompt for a
> couple moments, when I return to type something at the prompt, it can
> take 2-3 seconds for my input to echo. Once it starts responding,
> typing latency disappears. Suspend the other app and the latency
> disappears.
>
> The system isn't swapping and has basically no I/O load.

Is this on the hvc console? The hvc driver was recently modified to back off
its polling interval when there's no input; the max value is 2 seconds.

See b791072ba3c3b29bab1890963bde52eb944a8451 for the changes in question.
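
The back-off behaviour can be sketched like this (MIN_MS is an assumed lower bound, not the driver's actual constant; only the 2-second cap comes from the change described above):

```python
# Sketch of the hvc poll back-off: with no input, the polling interval
# grows toward a 2-second cap; any input resets it to the minimum.
MIN_MS = 10    # assumed floor for illustration
MAX_MS = 2000  # the 2-second cap mentioned above

def next_poll_interval(cur_ms, got_input):
    if got_input:
        return MIN_MS
    return min(cur_ms * 2, MAX_MS)
```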


-Olof

2007-06-05 00:19:18

by Rusty Russell

Subject: Re: Interesting interaction between lguest and CFS

On Mon, 2007-06-04 at 12:37 -0500, Matt Mackall wrote:
> With 2.6.22-rc3-mm1, I've got a long-running video transcoding going
> on. In other windows, I'm compiling, reading email, etc. with no
> noticeable problems.
>
> If I fire up lguest and leave it sitting at a shell prompt for a
> couple moments, when I return to type something at the prompt, it can
> take 2-3 seconds for my input to echo. Once it starts responding,
> typing latency disappears. Suspend the other app and the latency
> disappears.

This sounds like the waker process (nice 19) not getting a chance to
run. You can hack around it for the moment by changing "nice(19)" in
Documentation/lguest/lguest.c to something less aggressive.

The real solution is to switch to an fd. Fortunately, I have just such
a patch. I am shuffling it forward in my queue now, and will send when
testing is complete...

Thanks!
Rusty.


2007-06-05 01:02:24

by Rusty Russell

Subject: Re: Interesting interaction between lguest and CFS

On Mon, 2007-06-04 at 17:58 -0500, Olof Johansson wrote:
> On Mon, Jun 04, 2007 at 12:37:10PM -0500, Matt Mackall wrote:
> > With 2.6.22-rc3-mm1, I've got a long-running video transcoding going
> > on. In other windows, I'm compiling, reading email, etc. with no
> > noticeable problems.
> >
> > If I fire up lguest and leave it sitting at a shell prompt for a
> > couple moments, when I return to type something at the prompt, it can
> > take 2-3 seconds for my input to echo. Once it starts responding,
> > typing latency disappears. Suspend the other app and the latency
> > disappears.
> >
> > The system isn't swapping and has basically no I/O load.
>
> Is this on the hvc console? The hvc driver was recently modified to back off
> its polling interval when there's no input; the max value is 2 seconds.
>
> See b791072ba3c3b29bab1890963bde52eb944a8451 for the changes in question.

Hmm, I thought that was only supposed to happen for "interruptless"
consoles?

Rusty.

2007-06-05 01:24:42

by Stephen Rothwell

Subject: Re: Interesting interaction between lguest and CFS

On Tue, 05 Jun 2007 11:02:00 +1000 Rusty Russell <[email protected]> wrote:
>
> > See b791072ba3c3b29bab1890963bde52eb944a8451 for the changes in question.
>
> Hmm, I thought that was only supposed to happen for "interruptless"
> consoles?

The comment certainly says it only applies to polled mode.

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/



2007-06-05 02:17:18

by Olof Johansson

Subject: Re: Interesting interaction between lguest and CFS

On Tue, Jun 05, 2007 at 11:02:00AM +1000, Rusty Russell wrote:
> On Mon, 2007-06-04 at 17:58 -0500, Olof Johansson wrote:
> >
> > Is this on the hvc console? The hvc driver was recently modified to back off
> > its polling interval when there's no input; the max value is 2 seconds.
> >
> > See b791072ba3c3b29bab1890963bde52eb944a8451 for the changes in question.
>
> Hmm, I thought that was only supposed to happen for "interruptless"
> consoles?

Ok, my bad. I thought lguest was polled as well.


-Olof

2007-06-05 02:31:38

by Rusty Russell

Subject: Re: Interesting interaction between lguest and CFS

On Tue, 2007-06-05 at 10:19 +1000, Rusty Russell wrote:
> On Mon, 2007-06-04 at 12:37 -0500, Matt Mackall wrote:
> > With 2.6.22-rc3-mm1, I've got a long-running video transcoding going
> > on. In other windows, I'm compiling, reading email, etc. with no
> > noticeable problems.
> >
> > If I fire up lguest and leave it sitting at a shell prompt for a
> > couple moments, when I return to type something at the prompt, it can
> > take 2-3 seconds for my input to echo. Once it starts responding,
> > typing latency disappears. Suspend the other app and the latency
> > disappears.
>
> This sounds like the waker process (nice 19) not getting a chance to
> run. You can hack around it for the moment by changing "nice(19)" in
> Documentation/lguest/lguest.c to something less aggressive.
>
> The real solution is to switch to an fd. Fortunately, I have just such
> a patch. I am shuffling it forward in my queue now, and will send when
> testing is complete...

Does this solve it for you?

lguest: Replace signal with fd operation

We currently use a "waker" process: a child of the launcher which
selects() on the incoming file descriptors. It sends a SIGUSR1 to the
launcher whenever select() returns to kick the launcher out of the
kernel.

This has nasty side-effects: the waker needs to keep sending signals
to avoid the race, so we nice it to try to make sure the launcher runs
soon. Also the launcher blocks SIGUSR1 when it's not running the
guest, so it doesn't have to deal with other interrupted reads...

It's better to explicitly tell the kernel to break out of the guest,
and this is what we do, with a new LHREQ_BREAK command. This makes
the launcher return -EAGAIN from reading /dev/lguest, and blocks the
waker until the launcher calls LHREQ_BREAK, avoiding the race.
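
That handshake can be modelled in miniature (a Python sketch with illustrative names; the real protocol lives in break_guest_out() in drivers/lguest/lguest_user.c):

```python
# Toy model of the LHREQ_BREAK handshake: the Waker raises break_out and
# blocks until the Launcher clears it, mirroring break_guest_out().
import threading

class BreakFlag:
    def __init__(self):
        self.break_out = False
        self.cond = threading.Condition()

    def waker_break(self):
        # LHREQ_BREAK with arg 1: set the flag, then wait for release.
        with self.cond:
            self.break_out = True
            while self.break_out:
                self.cond.wait()

    def launcher_release(self):
        # LHREQ_BREAK with arg 0: clear the flag and wake the Waker.
        with self.cond:
            self.break_out = False
            self.cond.notify_all()
```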

We also take precautions against simultaneous writes or reads on the
/dev/lguest fd. As only root can have these file descriptors, it's
not much of a problem, but we want to relax that restriction
eventually.

The main improvement is in consistency, rather than raw benchmark
results:

Before:
Time for one context switch via pipe: 9265 (4534 - 9495)
Time for one Copy-on-Write fault: 67687 (14898 - 159125)
Time to exec client once: 1102812 (795843 - 1128250)
Time for one fork/exit/wait: 712000 (400625 - 723156)
Time for gettimeofday(): 16681 (16378 - 35835)
Time to send 4 MB from host: 141317343 (140165578 - 141469500)
Time for one int-0x80 syscall: 272 (272 - 575)
Time for one syscall via libc: 275 (274 - 904)
Time for two PTE updates: 16232 (6430 - 16316)
Time to read from disk (256 kB): 16786750 (16597500 - 31493250)
Time for one disk read: 192656 (189312 - 958687)
Time for inter-guest pingpong: 110453 (104492 - 316429)

After:
Time for one context switch via pipe: 4687 (4563 - 4857)
Time for one Copy-on-Write fault: 44523 (11628 - 77855)
Time to exec client once: 814765 (805796 - 829875)
Time for one fork/exit/wait: 405875 (400562 - 434750)
Time for gettimeofday(): 16644 (16203 - 16931)
Time to send 4 MB from host: 136530000 (121522250 - 151629000)
Time for one int-0x80 syscall: 273 (272 - 274)
Time for one syscall via libc: 279 (277 - 279)
Time for two PTE updates: 6439 (6395 - 6528)
Time to read from disk (256 kB): 16787000 (16641250 - 16861250)
Time for one disk read: 192187 (190515 - 193843)
Time for inter-guest pingpong: 111093 (109203 - 136554)

Signed-off-by: Rusty Russell <[email protected]>

---
Documentation/lguest/lguest.c | 105 +++++++++------------------------------
drivers/lguest/core.c | 9 ++-
drivers/lguest/lg.h | 5 +
drivers/lguest/lguest_user.c | 62 +++++++++++++++++++----
include/linux/lguest_launcher.h | 1
5 files changed, 90 insertions(+), 92 deletions(-)

diff -r 7d068abb5a41 Documentation/lguest/lguest.c
--- a/Documentation/lguest/lguest.c Tue Jun 05 09:51:59 2007 +1000
+++ b/Documentation/lguest/lguest.c Tue Jun 05 10:32:23 2007 +1000
@@ -16,7 +16,6 @@
#include <fcntl.h>
#include <stdbool.h>
#include <errno.h>
-#include <signal.h>
#include <ctype.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
@@ -341,15 +340,14 @@ static void set_fd(int fd, struct device
devices->max_infd = fd;
}

-/* We send lguest_add signals while input is pending: avoids races. */
-static void wake_parent(int pipefd, struct device_list *devices)
-{
- nice(19);
-
+/* When input arrives, we tell the kernel to kick lguest out with -EAGAIN. */
+static void wake_parent(int pipefd, int lguest_fd, struct device_list *devices)
+{
set_fd(pipefd, devices);

for (;;) {
fd_set rfds = devices->infds;
+ u32 args[] = { LHREQ_BREAK, 1 };

select(devices->max_infd+1, &rfds, NULL, NULL, NULL);
if (FD_ISSET(pipefd, &rfds)) {
@@ -357,25 +355,14 @@ static void wake_parent(int pipefd, stru
if (read(pipefd, &ignorefd, sizeof(ignorefd)) == 0)
exit(0);
FD_CLR(ignorefd, &devices->infds);
- }
- kill(getppid(), SIGUSR1);
- }
-}
-
-/* We don't want signal to kill us, just jerk us out of kernel. */
-static void wakeup(int signo)
-{
-}
-
-static int setup_waker(struct device_list *device_list)
+ } else
+ write(lguest_fd, args, sizeof(args));
+ }
+}
+
+static int setup_waker(int lguest_fd, struct device_list *device_list)
{
int pipefd[2], child;
- struct sigaction act;
-
- act.sa_handler = wakeup;
- sigemptyset(&act.sa_mask);
- act.sa_flags = 0;
- sigaction(SIGUSR1, &act, NULL);

pipe(pipefd);
child = fork();
@@ -384,7 +371,7 @@ static int setup_waker(struct device_lis

if (child == 0) {
close(pipefd[1]);
- wake_parent(pipefd[0], device_list);
+ wake_parent(pipefd[0], lguest_fd, device_list);
}
close(pipefd[0]);

@@ -620,6 +607,7 @@ static void handle_input(int fd, int chi
for (;;) {
struct device *i;
fd_set fds = devices->infds;
+ unsigned int num = 0;

if (select(devices->max_infd+1, &fds, NULL, NULL, &poll) == 0)
break;
@@ -632,6 +620,7 @@ static void handle_input(int fd, int chi
write(childfd, &i->fd, sizeof(i->fd));
}
}
+ num++;
}
}
}
@@ -900,27 +889,26 @@ static void __attribute__((noreturn))
static void __attribute__((noreturn))
run_guest(int lguest_fd, int waker_fd, struct device_list *device_list)
{
- sigset_t sigset;
-
- sigemptyset(&sigset);
- sigaddset(&sigset, SIGUSR1);
for (;;) {
+ u32 args[] = { LHREQ_BREAK, 0 };
unsigned long arr[2];
int readval;

- sigprocmask(SIG_UNBLOCK, &sigset, NULL);
+ /* We read from the /dev/lguest device to run the Guest. */
readval = read(lguest_fd, arr, sizeof(arr));
- sigprocmask(SIG_BLOCK, &sigset, NULL);
-
- if (readval == sizeof(arr))
+
+ if (readval == sizeof(arr)) {
handle_output(lguest_fd, arr[0], arr[1], device_list);
- else if (errno == ENOENT) {
+ continue;
+ } else if (errno == ENOENT) {
char reason[1024] = { 0 };
read(lguest_fd, reason, sizeof(reason)-1);
errx(1, "%s", reason);
- } else if (errno != EINTR)
+ } else if (errno != EAGAIN)
err(1, "Running guest failed");
handle_input(lguest_fd, waker_fd, device_list);
+ if (write(lguest_fd, args, sizeof(args)) < 0)
+ err(1, "Resetting break");
}
}

@@ -1014,7 +1002,7 @@ int main(int argc, char *argv[])
*(int *)(boot + 0x23c) = 1;

lguest_fd = tell_kernel(pgdir, start, page_offset);
- waker_fd = setup_waker(&device_list);
+ waker_fd = setup_waker(lguest_fd, &device_list);

run_guest(lguest_fd, waker_fd, &device_list);
}
diff -r 7d068abb5a41 drivers/lguest/core.c
--- a/drivers/lguest/core.c Tue Jun 05 09:51:59 2007 +1000
+++ b/drivers/lguest/core.c Tue Jun 05 09:56:32 2007 +1000
@@ -313,7 +313,12 @@ int run_guest(struct lguest *lg, unsigne
}

if (signal_pending(current))
- return -EINTR;
+ return -ERESTARTSYS;
+
+ /* If Waker set break_out, return to Launcher. */
+ if (lg->break_out)
+ return -EAGAIN;
+
maybe_do_interrupt(lg);

try_to_freeze();
diff -r 7d068abb5a41 drivers/lguest/lg.h
--- a/drivers/lguest/lg.h Tue Jun 05 09:51:59 2007 +1000
+++ b/drivers/lguest/lg.h Tue Jun 05 09:53:28 2007 +1000
@@ -16,6 +16,7 @@
#include <linux/futex.h>
#include <linux/lguest.h>
#include <linux/lguest_launcher.h>
+#include <linux/wait.h>
#include <linux/err.h>
#include <asm/semaphore.h>
#include "irq_vectors.h"
@@ -138,6 +139,10 @@ struct lguest
u32 esp1;
u8 ss1;

+ /* Do we need to stop what we're doing and return to userspace? */
+ int break_out;
+ wait_queue_head_t break_wq;
+
/* Bitmap of what has changed: see CHANGED_* above. */
int changed;
struct lguest_pages *last_pages;
diff -r 7d068abb5a41 drivers/lguest/lguest_user.c
--- a/drivers/lguest/lguest_user.c Tue Jun 05 09:51:59 2007 +1000
+++ b/drivers/lguest/lguest_user.c Tue Jun 05 10:36:58 2007 +1000
@@ -30,6 +30,30 @@ static long user_get_dma(struct lguest *
return udma;
}

+/* To force the Guest to stop running and return to the Launcher, the
+ * Waker writes LHREQ_BREAK and the value "1" to /dev/lguest. The
+ * Launcher then writes LHREQ_BREAK and "0" to release the Waker. */
+static int break_guest_out(struct lguest *lg, const u32 __user *input)
+{
+ unsigned long on;
+
+ /* Fetch whether they're turning break on or off.. */
+ if (get_user(on, input) != 0)
+ return -EFAULT;
+
+ if (on) {
+ lg->break_out = 1;
+ /* Pop it out (may be running on different CPU) */
+ wake_up_process(lg->tsk);
+ /* Wait for them to reset it */
+ return wait_event_interruptible(lg->break_wq, !lg->break_out);
+ } else {
+ lg->break_out = 0;
+ wake_up(&lg->break_wq);
+ return 0;
+ }
+}
+
/* + irq */
static int user_send_irq(struct lguest *lg, const u32 __user *input)
{
@@ -49,6 +73,10 @@ static ssize_t read(struct file *file, c

if (!lg)
return -EINVAL;
+
+ /* If you're not the task which owns the guest, go away. */
+ if (current != lg->tsk)
+ return -EPERM;

if (lg->dead) {
size_t len;
@@ -75,13 +103,20 @@ static int initialize(struct file *file,
int err, i;
u32 args[4];

- if (file->private_data)
- return -EBUSY;
-
- if (copy_from_user(args, input, sizeof(args)) != 0)
- return -EFAULT;
-
+ /* We grab the Big Lguest lock, which protects the global array
+ * "lguests" and multiple simultaneous initializations. */
mutex_lock(&lguest_lock);
+
+ if (file->private_data) {
+ err = -EBUSY;
+ goto unlock;
+ }
+
+ if (copy_from_user(args, input, sizeof(args)) != 0) {
+ err = -EFAULT;
+ goto unlock;
+ }
+
i = find_free_guest();
if (i < 0) {
err = -ENOSPC;
@@ -107,10 +142,12 @@ static int initialize(struct file *file,
lg->tsk = current;
get_task_struct(lg->tsk);
lg->mm = get_task_mm(lg->tsk);
+ init_waitqueue_head(&lg->break_wq);
lg->last_pages = NULL;
- mutex_unlock(&lguest_lock);
-
file->private_data = lg;
+
+ mutex_unlock(&lguest_lock);
+
return sizeof(args);

free_regs:
@@ -136,6 +173,10 @@ static ssize_t write(struct file *file,
return -EINVAL;
if (lg && lg->dead)
return -ENOENT;
+
+ /* If you're not the task which owns the Guest, you can only break */
+ if (lg && current != lg->tsk && req != LHREQ_BREAK)
+ return -EPERM;

switch (req) {
case LHREQ_INITIALIZE:
@@ -144,6 +185,8 @@ static ssize_t write(struct file *file,
return user_get_dma(lg, (const u32 __user *)input);
case LHREQ_IRQ:
return user_send_irq(lg, (const u32 __user *)input);
+ case LHREQ_BREAK:
+ return break_guest_out(lg, (const u32 __user *)input);
default:
return -EINVAL;
}
diff -r 7d068abb5a41 include/linux/lguest_launcher.h
--- a/include/linux/lguest_launcher.h Tue Jun 05 09:51:59 2007 +1000
+++ b/include/linux/lguest_launcher.h Tue Jun 05 09:53:28 2007 +1000
@@ -68,5 +68,6 @@ enum lguest_req
LHREQ_INITIALIZE, /* + pfnlimit, pgdir, start, pageoffset */
LHREQ_GETDMA, /* + addr (returns &lguest_dma, irq in ->used_len) */
LHREQ_IRQ, /* + irq */
+ LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */
};
#endif /* _ASM_LGUEST_USER */
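For reference, the request the Waker writes through the new LHREQ_BREAK interface is just two u32s: the request code followed by the on/off flag, as the lguest_user.c write() path above consumes it. A minimal sketch (the helper name is made up; the enum values come from the lguest_launcher.h hunk above, where LHREQ_BREAK is the new fourth entry):

```python
import struct

# Request codes mirroring enum lguest_req in include/linux/lguest_launcher.h
# after this patch (LHREQ_BREAK is the newly added entry).
LHREQ_INITIALIZE, LHREQ_GETDMA, LHREQ_IRQ, LHREQ_BREAK = range(4)

def build_break_req(on):
    """Build the two-u32 buffer the Waker writes to /dev/lguest:
    on=1 breaks the Guest out (the write blocks until released),
    on=0 releases the blocked Waker."""
    return struct.pack("II", LHREQ_BREAK, 1 if on else 0)

req = build_break_req(True)
print(len(req), struct.unpack("II", req))   # 8 (3, 1)
# a real Waker would then do: os.write(lguest_fd, req)
```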


2007-06-05 02:34:38

by Matt Mackall

Subject: Re: Interesting interaction between lguest and CFS

On Tue, Jun 05, 2007 at 10:18:54AM +1000, Rusty Russell wrote:
> On Mon, 2007-06-04 at 12:37 -0500, Matt Mackall wrote:
> > With 2.6.22-rc3-mm1, I've got a long-running video transcoding going
> > on. In other windows, I'm compiling, reading email, etc. with no
> > noticeable problems.
> >
> > If I fire up lguest and leave it sitting at a shell prompt for a
> > couple moments, when I return to type something at the prompt, it can
> > take 2-3 seconds for my input to echo. Once it starts responding,
> > typing latency disappears. Suspend the other app and the latency
> > disappears.
>
> This sounds like the waker process (nice 19) not getting a chance to
> run. You can hack around it for the moment by changing "nice(19)" in
> Documentation/lguest/lguest.c to something less aggressive.

Manual renicing seems to have done the trick.

This is still a little suspicious though. Why does it take two seconds
to wake up and then behave responsively for a long interval after that?

--
Mathematics is the supreme nostalgia of our time.

2007-06-05 04:18:34

by Matt Mackall

Subject: Re: Interesting interaction between lguest and CFS

On Tue, Jun 05, 2007 at 12:31:15PM +1000, Rusty Russell wrote:
> Does this solve it for you?

Nope. Doesn't accept input and hogs the CPU with lots of system time.

--
Mathematics is the supreme nostalgia of our time.

2007-06-05 04:38:01

by Rusty Russell

Subject: Re: Interesting interaction between lguest and CFS

On Mon, 2007-06-04 at 23:18 -0500, Matt Mackall wrote:
> On Tue, Jun 05, 2007 at 12:31:15PM +1000, Rusty Russell wrote:
> > Does this solve it for you?
>
> Nope. Doesn't accept input and hogs the CPU with lots of system time.

Dumb question: did you replace both the module and the launcher?

Rusty.


2007-06-05 07:10:42

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Rusty Russell <[email protected]> wrote:

> This sounds like the waker process (nice 19) not getting a chance to
> run. You can hack around it for the moment by changing "nice(19)" in
> Documentation/lguest/lguest.c to something less aggressive.

even if the waker runs at nice+19, under CFS (which schedules
deterministically) it should not be delayed by a single nice-0 task by
more than ~70 msecs. That is a far cry from the 'seconds' that Matt saw
- and the sched stats he sent look anomalous too. Also, i'm not able to
reproduce it at all - while any nice+19 related problem should be
readily reproducible on my box too. So my current take is that this is
some sort of host-side CFS related problem and that the lguest host
threads somehow were the first ones that managed to trigger it in some
really bad way.

OTOH i do agree with your patch, that nice+19 thing is quite hacky :-)

Ingo

2007-06-05 07:19:27

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Matt Mackall <[email protected]> wrote:

> sleep_max : 57476665627
> block_max : 18014060106626075

hm, this block_max looks a bit suspect, it's 003fffb1359e341b. Does your
box make any use of cpufreq? (what CPU is it?)

> strace -f doesn't kill the latency, here's an strace:
>
> 24936 13:39:44.469329 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0
> 24936 13:39:44.469572 select(4, [0 3], NULL, NULL, {0, 0}) = 0 (Timeout)
> 24936 13:39:44.469677 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
> 24936 13:39:44.469765 read(7, 0xbfe6c010, 8) = -1 EINTR (Interrupted system call)
> 24936 13:39:48.699490 --- SIGUSR1 (User defined signal 1) @ 0 (0) ---

hm, this indeed implicates some wakeup problem. lguest task 24936's
relevant stats are:

block_start : 0
sleep_max : 21492554
block_max : 27044576
exec_max : 4008057
wait_max : 1253670288

the wait_max means it was delayed on the runqueue for 1.2 seconds. Could
you try to get a /proc/sched_debug snapshot done exactly during the
'delay' window - it's 4 seconds so you should in theory be able to
trigger it by doing something like:

sleep 3; cat /proc/sched_debug > sched_debug.txt

Click into the lguest window and trigger the delay.

Ingo

2007-06-05 09:24:16

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Matt Mackall <[email protected]> wrote:

> Ok, my transcoding just finished as I was writing this. So I've
> reproduced the problem with this Python script that I had handy:
>
> memload.py:
> #!/usr/bin/python
> a = "a" * 16 * 1024 * 1024
> while 1:
>     b = a[1:] + "b"
>     a = b[1:] + "c"

i suspect the same latency is triggered by a simple infinite loop too:

while : ; do : ; done &

correct?

Ingo

2007-06-05 10:44:24

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


Matt, could you send me your .config too? Maybe i can reproduce it
with your config.

Ingo

2007-06-05 13:50:08

by Matt Mackall

Subject: Re: Interesting interaction between lguest and CFS

On Tue, Jun 05, 2007 at 02:37:38PM +1000, Rusty Russell wrote:
> On Mon, 2007-06-04 at 23:18 -0500, Matt Mackall wrote:
> > On Tue, Jun 05, 2007 at 12:31:15PM +1000, Rusty Russell wrote:
> > > Does this solve it for you?
> >
> > Nope. Doesn't accept input and hogs the CPU with lots of system time.
>
> Dumb question: did you replace both the module and the launcher?

Oh, probably not. Yep, actually loading the new module appears to fix it.

--
Mathematics is the supreme nostalgia of our time.

2007-06-05 14:04:17

by Matt Mackall

Subject: Re: Interesting interaction between lguest and CFS

On Tue, Jun 05, 2007 at 09:19:04AM +0200, Ingo Molnar wrote:
>
> * Matt Mackall <[email protected]> wrote:
>
> > sleep_max : 57476665627
> > block_max : 18014060106626075
>
> hm, this block_max looks a bit suspect, it's 003fffb1359e341b. Does your
> box make any use of cpufreq? (what CPU is it?)

Yes, I use the conservative governor. As I posted yesterday, setting
the governor to performance doesn't help: the CPU hog drives the CPU
to full speed.

> > 24936 13:39:44.469677 rt_sigprocmask(SIG_UNBLOCK, [USR1], NULL, 8) = 0
> > 24936 13:39:44.469765 read(7, 0xbfe6c010, 8) = -1 EINTR (Interrupted system call)
> > 24936 13:39:48.699490 --- SIGUSR1 (User defined signal 1) @ 0 (0) ---
>
> hm, this indeed implicates some wakeup problem. lguest task 24936's
> relevant stats are:
>
> block_start : 0
> sleep_max : 21492554
> block_max : 27044576
> exec_max : 4008057
> wait_max : 1253670288
>
> the wait_max means it was delayed on the runqueue for 1.2 seconds. Could
> you try to get a /proc/sched_debug snapshot done exactly during the
> 'delay' window - it's 4 seconds so you should in theory be able to
> trigger it by doing something like:
>
> sleep 3; cat /proc/sched_debug > sched_debug.txt
>
> Click into the lguest window and trigger the delay.

I did:

while true; do sleep 1; cat /proc/sched_debug > sched_debug.txt; done

and got this, hopefully inside the window:

Sched Debug Version: v0.02
now at 257428593818894 nsecs

cpu: 0
.nr_running : 3
.raw_weighted_load : 2063
.nr_switches : 242830075
.nr_load_updates : 30172063
.nr_uninterruptible : 0
.jiffies : 64282148
.next_balance : 0
.curr->pid : 27182
.clock : 125650217819008823
.prev_clock_raw : 257428516403535
.clock_warps : 9
.clock_unstable_events : 41133344
.clock_max_delta : 3954619
.fair_clock : 125631632368075562
.prev_fair_clock : 0
.exec_clock : 125649753800946713
.prev_exec_clock : 0
.wait_runtime : -12865352643
.wait_runtime_overruns : 41121996
.wait_runtime_underruns: 4107250
.cpu_load[0] : 1039
.cpu_load[1] : 1039
.cpu_load[2] : 1039
.cpu_load[3] : 1041
.cpu_load[4] : 1053
.wait_runtime_rq_sum : -48091562

runnable tasks:
task PID tree-key delta waiting switches prio sum-exec sum-wait sum-sleep wait-overrun wait-underrun
------------------------------------------------------------------------------------------------------------------------------------------------------------------
R cat 27182 125631632367998747 -76815 -91562 6 120 250189 -91562 75163 0 0
python 26973 125631632404141566 36066004 -24000000 25486 120 93635486661 -1940103390 140903 49 2069
lguest 26986 125631633015454796 647379234 -24000000 1481 139 7638721 37069331905 90071079196 1461 2


--
Mathematics is the supreme nostalgia of our time.

2007-06-05 14:06:59

by Matt Mackall

Subject: Re: Interesting interaction between lguest and CFS

On Tue, Jun 05, 2007 at 11:23:44AM +0200, Ingo Molnar wrote:
>
> * Matt Mackall <[email protected]> wrote:
>
> > Ok, my transcoding just finished as I was writing this. So I've
> > reproduced the problem with this Python script that I had handy:
> >
> > memload.py:
> > #!/usr/bin/python
> > a = "a" * 16 * 1024 * 1024
> > while 1:
> >     b = a[1:] + "b"
> >     a = b[1:] + "c"
>
> i suspect the same latency is triggered by a simple infinite loop too:
>
> while : ; do : ; done &
>
> correct?

Yes.

--
Mathematics is the supreme nostalgia of our time.

2007-06-05 15:14:22

by Dmitry Adamushko

Subject: Re: Interesting interaction between lguest and CFS

On 05/06/07, Matt Mackall <[email protected]> wrote:
> > [...]
> > Click into the lguest window and trigger the delay.
>
> I did:
>
> while true; do sleep 1; cat /proc/sched_debug > sched_debug.txt; done
>
> and got this, hopefully inside the window:
>
> Sched Debug Version: v0.02
> now at 257428593818894 nsecs
>
> cpu: 0
> .nr_running : 3
> .raw_weighted_load : 2063
> .nr_switches : 242830075
> .nr_load_updates : 30172063
> .nr_uninterruptible : 0
> .jiffies : 64282148
> .next_balance : 0
> .curr->pid : 27182
> .clock : 125650217819008823
> .prev_clock_raw : 257428516403535

The delta (clock - prev_clock_raw) looks insane.

The current time (which doesn't depend on rq_clock() --> sched_clock()) is
" now at 257428593818894 nsecs " (right at the beginning of the output).

'prev_clock_raw' is updated any time rq_clock() is called - basically
upon any scheduling operation (e.g. enqueue/dequeue).

now - prev_clock_raw == 257428593818894 - 257428516403535 == ~ 80 ms.

while 'clock' reports something crazy.. that would explain why there
was a huge "block_max" reported earlier.. I guess sched_clock() is
tsc-based in your case?

Any way to get it switched to a jiffies-based one and repeat the test?


--
Best regards,
Dmitry Adamushko

2007-06-05 15:41:18

by Dmitry Adamushko

Subject: Re: Interesting interaction between lguest and CFS

On 05/06/07, Dmitry Adamushko <[email protected]> wrote:
> > now at 257428593818894 nsecs
> >
> > cpu: 0
> > .nr_running : 3
> > .raw_weighted_load : 2063
> > .nr_switches : 242830075
> > .nr_load_updates : 30172063
> > .nr_uninterruptible : 0
> > .jiffies : 64282148
> > .next_balance : 0
> > .curr->pid : 27182
> > .clock : 125650217819008823
> > .prev_clock_raw : 257428516403535
>
> The delta (clock - prev_clock_raw) looks insane.
>

err.. please ignore it. 'clock' and 'prev_clock_raw' follow each other
only as long as the 'clock_warps' == 0.. otherwise 'delta' just
doesn't make any sense. And in your case:

.clock_warps : 9


--
Best regards,
Dmitry Adamushko

2007-06-05 19:08:35

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Ingo Molnar <[email protected]> wrote:

> Matt, could you send me your .config too? Maybe i can reproduce it
> with your config.

ok, got it (off-list) and you seem to have the default HZ=250. Could you
change it to HZ=1000 - does the delay get a lot smaller?

if yes then this could be roughly the expected worst-case behavior for a
nice+19 'signalling' task - i'd suggest to try the dont-use-renice patch
Rusty sent (with the hang it causes fixed), or to change the nice(19) to
nice(10).

(the rq_clock behavior still looks a bit anomalous, that perhaps
contributes to the problem.)

Ingo

2007-06-05 19:39:06

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Matt Mackall <[email protected]> wrote:

> On Tue, Jun 05, 2007 at 02:37:38PM +1000, Rusty Russell wrote:
> > On Mon, 2007-06-04 at 23:18 -0500, Matt Mackall wrote:
> > > On Tue, Jun 05, 2007 at 12:31:15PM +1000, Rusty Russell wrote:
> > > > Does this solve it for you?
> > >
> > > Nope. Doesn't accept input and hogs the CPU with lots of system time.
> >
> > Dumb question: did you replace both the module and the launcher?
>
> Oh, probably not. Yep, actually loading the new module appears to fix it.

oh - so Rusty's fix did the trick?

Ingo

2007-06-05 19:50:56

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Dmitry Adamushko <[email protected]> wrote:

> On 05/06/07, Dmitry Adamushko <[email protected]> wrote:
> >> now at 257428593818894 nsecs
> >>
> >> cpu: 0
> >> .nr_running : 3
> >> .raw_weighted_load : 2063
> >> .nr_switches : 242830075
> >> .nr_load_updates : 30172063
> >> .nr_uninterruptible : 0
> >> .jiffies : 64282148
> >> .next_balance : 0
> >> .curr->pid : 27182
> >> .clock : 125650217819008823
> >> .prev_clock_raw : 257428516403535
> >
> >The delta (clock - prev_clock_raw) looks insane.
> >
>
> err.. please ignore it. 'clock' and 'prev_clock_raw' follow each other
> only as long as the 'clock_warps' == 0.. otherwise 'delta' just
> doesn't make any sense. And in your case:
>
> .clock_warps : 9

yeah. Look at the sched_clock() reliability counts:

.clock_warps : 9
.clock_unstable_events : 41133344
.clock_max_delta : 3954619

i.e. a quite large 'tsc unstable events' counter; with only a handful of
negative warps detected, the actual maximum TSC delta was 3954 usecs
(clock_max_delta), so i think rq_clock() behaved sanely and sched_clock()
has good quality on Matt's box. So the large block_max is still
mysterious to me:

> > sleep_max : 57476665627
> > block_max : 18014060106626075

because the whole rq_clock() logic is about avoiding large jumps. So i
dont see how 'now' could _ever_ have ended up being so very large. Even
if the new sched-clock code in arch/i386/kernel/sched-clock.c frequently
switches between sc->unstable == 0 or 1, (i.e. frequently switches
between jiffies based and TSC based clock), the rq_clock() logic should
cushion off any jumpiness or small warps from that. I'm wondering
whether this line:

r = (jiffies_64 - sc->sync_base) * (1000000000 / HZ);

could produce funnies. But even if it does, it could at most produce a
very large _positive_ jump - which would have to show up in
clock_max_delta. Negative jumps are completely ignored by rq_clock().
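In toy form, the delta handling just described (positive raw deltas are accumulated and tracked in clock_max_delta, negative ones are swallowed as warps) can be modelled like this - a simplified sketch, not the kernel code, with the rq fields approximated as a plain dict:

```python
# Toy model of rq_clock() delta handling (field names mirror the
# /proc/sched_debug output; the function itself is made up).
def rq_clock_step(rq, now):
    delta = now - rq["prev_clock_raw"]
    if delta < 0:
        # negative jump ("warp"): ignore it, nudge the clock forward 1 ns
        rq["clock"] += 1
        rq["clock_warps"] += 1
    else:
        # positive jump: accumulate it, remember the largest one seen
        rq["clock_max_delta"] = max(rq["clock_max_delta"], delta)
        rq["clock"] += delta
    rq["prev_clock_raw"] = now
    return rq["clock"]

rq = {"clock": 0, "prev_clock_raw": 0, "clock_warps": 0, "clock_max_delta": 0}
rq_clock_step(rq, 1000)   # normal forward delta
rq_clock_step(rq, 900)    # a warp: clock only advances by 1
rq_clock_step(rq, 5000)   # large positive jump goes straight into clock
print(rq["clock"], rq["clock_warps"], rq["clock_max_delta"])   # 5101 1 4100
```

Which is why a huge forward jump in the raw clock would be expected to land in clock_max_delta rather than be cushioned off.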

although ... the 'raw' rq_clock() values are very large:

.clock : 125650217819008823

$ printf "%016Lx\n" 125650217819008823
01be6635170d6f37

which could be due to some TSC corruption? So perhaps somewhere we
calculated block_max with 'now' zero, hence resulting in that very large
value? Still, i dont see how that could happen. Weird.

it would be nice to monitor how 'clock' changes in time on that box. It
should be advancing monotonically and at most with the speed of the CPU
clock. (but it can move slower, over C3/unstable/cpufreq periods.) This
could be done the following way:

while sleep 1; do cat /proc/sched_debug >> sched_debug.txt; done

Matt, could you run this for 1-2 minutes and send us the sched_debug.txt
output?

Ingo

2007-06-06 20:20:36

by Matt Mackall

Subject: Re: Interesting interaction between lguest and CFS

On Tue, Jun 05, 2007 at 09:08:06PM +0200, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > Matt, could you send me your .config too? Maybe i can reproduce it
> > with your config.
>
> ok, got it (off-list) and you seem to have the default HZ=250. Could you
> change it to HZ=1000 - does the delay get a lot smaller?

Well I'm too busy/lazy to reboot at the moment, but I've reproduced
this behavior with the following:

watchdelta.py:
import time

last = time.time()
while 1:
    t = time.time()
    delta = t - last
    if delta > .5:
        print delta
    last = t

$ nice -n 20 python watchdelta.py
0.696138858795
0.772134065628
0.852141141891
2.22020101547
0.600075960159
0.528147935867
0.556170940399
0.536160945892
0.572169065475
0.672204971313
0.576174974442
0.640056848526
0.548166036606
0.568286895752
0.544052124023
0.572114944458
0.534852027893
0.596146821976
0.552155971527
0.552098035812
0.564115047455
0.546761989594
0.63211107254
0.876178026199
0.584072828293
0.544088125229
0.760088920593
1.07612800598
0.552155017853
...

At load=1 (the first several lines), this prints out messages
sporadically. At higher loads, every iteration is a >.5s delay. So
it's probably not particular to lguest at all.

--
Mathematics is the supreme nostalgia of our time.

2007-06-06 20:23:44

by Matt Mackall

Subject: Re: Interesting interaction between lguest and CFS

On Tue, Jun 05, 2007 at 09:50:15PM +0200, Ingo Molnar wrote:
> Matt, could you run this for 1-2 minutes and send us the sched_debug.txt
> output?

http://selenic.com/sched_debug.txt.gz

--
Mathematics is the supreme nostalgia of our time.

2007-06-08 09:34:56

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Matt Mackall <[email protected]> wrote:

> On Tue, Jun 05, 2007 at 09:50:15PM +0200, Ingo Molnar wrote:
> > Matt, could you run this for 1-2 minutes and send us the sched_debug.txt
> > output?
>
> http://selenic.com/sched_debug.txt.gz

thanks. It shows the anomaly in action:

now at 365294215491977 nsecs

.jiffies : 91248553
.next_balance : 0
.curr->pid : 18589
.clock : 125652924079659272
.prev_clock_raw : 365201238127457
.clock_warps : 9
.clock_unstable_events : 61896358
.clock_max_delta : 3997813

next one is:

now at 365295219388142 nsecs

.jiffies : 91248804
.next_balance : 0
.curr->pid : 18591
.clock : 125653018059166371
.prev_clock_raw : 365295217642619
.clock_warps : 9
.clock_unstable_events : 61896359
.clock_max_delta : 92976502936

251 jiffies passed, at 250 Hz that's 1 second - this proves that the
sample is indeed an accurate once-per-second sample according to the
timer interrupt. The 'now' timestamp (ktime_get() based) shows
1003896165 nanosecs passed - this too is showing a precise 1 second
sample, according to GTOD.

So all the time references we have show that (no surprise here) 1 second
passed between the two samples. But sched_clock() shows a _large_ jump:

.clock : 125652924079659272
.clock : 125653018059166371

also reflected in .clock_max_delta:

.clock_max_delta : 92976502936

that's a 93 seconds jump (!) in a single 1-second sample. We also had a
single sched-clock-unstable event:

.clock_unstable_events : 61896358
.clock_unstable_events : 61896359
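The arithmetic can be rechecked directly from the two samples quoted above (plain recomputation of the dumped values, nothing assumed):

```python
# Deltas between the two consecutive sched_debug samples quoted above.
HZ = 250

now1, now2 = 365294215491977, 365295219388142            # ktime_get(), ns
jiffies1, jiffies2 = 91248553, 91248804
clock1, clock2 = 125652924079659272, 125653018059166371  # rq clock, ns

print((jiffies2 - jiffies1) / HZ)   # ticks elapsed -> ~1.004 s
print((now2 - now1) / 1e9)          # GTOD time     -> ~1.004 s
print((clock2 - clock1) / 1e9)      # rq clock      -> ~94 s (!)
```

Every reference agrees the samples are one second apart, while the rq clock claims roughly 94 seconds elapsed - consistent with the ~93 second clock_max_delta.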

could you please try a test-boot with 'notsc' - do the scheduling
weirdnesses go away?

There are two reasons why the sched_clock() in -mm could behave like
that - either the sched-clock-share patches in it are buggy and we do
not smoothly switch over from sc->unstable == 1 to sc->unstable == 0, or
the TSC itself is unstable. To test the latter theory, could you run a 1
minute tsc-dump on your box:

./tsc-dump > tsc-dump.txt

You can pick tsc-dump up from:

http://redhat.com/~mingo/time-warp-test/

(please run this on a recent -mm kernel so that we have the same
ACPI-idle characteristics as on the buggy kernel.)

to test the former theory, could you boot with 'notsc' - do the
weirdnesses go away? (please create another sched-debug.txt as well)

Ingo

2007-06-08 10:19:19

by Ingo Molnar

Subject: Re: Interesting interaction between lguest and CFS


* Ingo Molnar <[email protected]> wrote:

> thanks. It shows the anomaly in action:

> So all the time references we have show that (no surprise here) 1
> second passed between the two samples. But sched_clock() shows a
> _large_ jump:
>
> .clock : 125652924079659272
> .clock : 125653018059166371
>
> also reflected in .clock_max_delta:
>
> .clock_max_delta : 92976502936
>
> that's a 93 seconds jump (!) in a single 1-second sample. [...]

we also had a similar jump in exec_clock, which suggests that the
sched_clock() "jump" occurred while there was a task running. (i.e. not
during ACPI-C2/C3) Could you try the patch below? It should catch such
large jumps. The 'clock_overflows' counter in /proc/sched_debug will
show whether this new protection triggered on your box.

Ingo

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -149,12 +149,12 @@ struct rq {

u64 clock, prev_clock_raw;
s64 clock_max_delta;
- u64 fair_clock, prev_fair_clock;
- u64 exec_clock, prev_exec_clock;
+ u64 fair_clock, prev_fair_clock;
+ u64 exec_clock, prev_exec_clock;
s64 wait_runtime;
unsigned long wait_runtime_overruns, wait_runtime_underruns;

- unsigned int clock_warps;
+ unsigned int clock_warps, clock_overflows;
unsigned int clock_unstable_events;

struct sched_class *load_balance_class;
@@ -245,9 +245,17 @@ static inline unsigned long long __rq_cl
clock++;
rq->clock_warps++;
} else {
- if (unlikely(delta > rq->clock_max_delta))
- rq->clock_max_delta = delta;
- clock += delta;
+ /*
+ * Catch too large forward jumps too:
+ */
+ if (delta > 2*TICK_NSEC) {
+ clock++;
+ rq->clock_overflows++;
+ } else {
+ if (unlikely(delta > rq->clock_max_delta))
+ rq->clock_max_delta = delta;
+ clock += delta;
+ }
}

rq->prev_clock_raw = now;
Index: linux/kernel/sched_debug.c
===================================================================
--- linux.orig/kernel/sched_debug.c
+++ linux/kernel/sched_debug.c
@@ -117,13 +119,13 @@ static void print_cpu(struct seq_file *m
P(clock);
P(prev_clock_raw);
P(clock_warps);
+ P(clock_overflows);
P(clock_unstable_events);
P(clock_max_delta);
- rq->clock_max_delta = 0;
P(fair_clock);
- P(prev_fair_clock);
+ P(prev_fair_clock);
P(exec_clock);
- P(prev_exec_clock);
+ P(prev_exec_clock);
P(wait_runtime);
P(wait_runtime_overruns);
P(wait_runtime_underruns);