2004-01-13 21:09:40

by Haakon Riiser

[permalink] [raw]
Subject: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6

When I first upgraded to Linux 2.6.0, I noticed that my mail
program (mutt) would occasionally stall when I send mails, even
when I send to a local account.

I investigated this further using the following strace command
(must be done as root) on qmail-inject:

env - /usr/bin/strace -s1024 -tt -T -fF -o /tmp/st.out \
/var/qmail/bin/qmail-inject </tmp/mail.test

/tmp/mail.test is this file:

-----BEGIN-----
From: hakonrk
To: hakonrk
Subject: hello

world
-----END-----

The entire output from strace is available for download from
<http://home.chello.no/~hakonrk/2.6.1.strace.out>, but I think
only this line is interesting:

5083 16:00:30.714309 write(5, "\0", 1) = 1 <1.637019>

This the only write() to file descriptor 5 issued by qmail-queue,
right before it exists, and this is what causes the delay.
Notice that writing this single NUL-byte takes almost 1.64 seconds
(the number in the angle brackets at the end of the line is the
time spent in the system call, given by the -T flag to strace).
Furthermore, the kernel appears to be busy-waiting in this system
call, because the CPU usage quickly jumps to 100%.

Compare this to the result on Linux 2.4.24:
<http://home.chello.no/~hakonrk/2.4.24.strace.out>
Here's the same line that causes the delay on 2.6:

5172 17:55:23.979799 write(5, "\0", 1) = 1 <0.000441>

Three orders of magnitude faster on 2.4!

I should also mention that the write-delay varies greatly. I got
the above result (1.64 seconds) on a 2.6-system that had been
up for around four hours. After rebooting, the delay dropped to
0.3 seconds, but it increases steadily while the system is up.

Finally, some info on my system:

OS: Slackware 9.1
Kernels: 2.6.0, 2.6.1, 2.4.24
Qmail: 1.03
libc: 2.3.2
gcc: 3.2.3
CPU: Athlon XP2500+

Kernel 2.6.1 was configured with
<http://home.chello.no/~hakonrk/config-2.6.1>
and 2.4.24 with <http://home.chello.no/~hakonrk/config-2.4.24>.
If you need to know more, just ask.

If anyone on this list uses Qmail on Linux 2.6.x, I'd appreciate
it if you could do the strace test and see if you can reproduce
this bug. Thanks in advance!

--
Haakon


2004-01-13 21:50:39

by Andrew Morton

[permalink] [raw]
Subject: Re: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6

Haakon Riiser <[email protected]> wrote:
>
> The entire output from strace is available for download from
> <http://home.chello.no/~hakonrk/2.6.1.strace.out>, but I think
> only this line is interesting:
>
> 5083 16:00:30.714309 write(5, "\0", 1) = 1 <1.637019>
>
> This the only write() to file descriptor 5 issued by qmail-queue,
> right before it exists, and this is what causes the delay.
> Notice that writing this single NUL-byte takes almost 1.64 seconds
> (the number in the angle brackets at the end of the line is the
> time spent in the system call, given by the -T flag to strace).
> Furthermore, the kernel appears to be busy-waiting in this system
> call, because the CPU usage quickly jumps to 100%.

Odd.

5083 16:00:30.714217 open("lock/trigger", O_WRONLY|O_NONBLOCK) = 5 <0.000013>
5083 16:00:30.714260 fcntl64(5, F_GETFL) = 0x801 (flags O_WRONLY|O_NONBLOCK) <0.000004>
5083 16:00:30.714287 fcntl64(5, F_SETFL, O_WRONLY|O_NONBLOCK) = 0 <0.000004>
5083 16:00:30.714309 write(5, "\0", 1) = 1 <1.637019>
5083 16:00:32.351378 close(5) = 0 <0.000005>

Seems innocuous. What filesystem type is lock/trigger on?

Can you generate a kernel profile of this activity?

Boot with `profile=1', copy this:

SM=/boot/System.map
TIMEFILE=/tmp/prof.time
sudo readprofile -r
sudo readprofile -M10
time $@
readprofile -n -v -m $SM | sort -n +2 | tail -40 | tee $TIMEFILE >&2

into /tmp/foo and do

/tmp/foo whatever-command-you-were-using

Thanks.

2004-01-13 23:26:32

by Haakon Riiser

[permalink] [raw]
Subject: Re: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6

[Andrew Morton]

> Seems innocuous. What filesystem type is lock/trigger on?

Reiserfs.

> Can you generate a kernel profile of this activity?

Done. Note that I had to reboot to do the profiling, so the
delay is not as bad as the first example:

344 00:19:07.476825 write(5, "\0", 1) = 1 <0.291500>

Still around 700 times slower than 2.4 though. :-) Anyway,
here are the results:

Output from time:

real 0m0.309s
user 0m0.011s
sys 0m0.004s

Data in prof.time:

c0117440 do_page_fault 2 0.0015
c01196e0 __wake_up 2 0.0208
c0139d80 free_hot_cold_page 2 0.0078
c0139ea0 buffered_rmqueue 2 0.0057
c013d650 kmem_cache_free 2 0.0250
c0141e10 page_address 2 0.0104
c0142a20 zap_pte_range 2 0.0046
c0145670 __insert_vm_struct 2 0.0139
c0146300 find_vma 2 0.0208
c0151860 dentry_open 2 0.0038
c0153760 fget 2 0.0312
c0168e90 locks_remove_posix 2 0.0074
c016dbb0 dnotify_flush 2 0.0104
c0210f40 decrement_counters_in_path 2 0.0250
c022bd90 atomic_dec_and_lock 2 0.0250
c022c340 strncpy_from_user 2 0.0179
c0120b00 current_kernel_time 3 0.0469
c013d6a0 kfree 3 0.0268
c0144040 do_anonymous_page 3 0.0057
c0153640 __fput 3 0.0104
c015f640 pipe_poll 3 0.0234
c0161660 may_open 3 0.0065
c01654f0 __pollwait 3 0.0144
c01655c0 max_select_fd 3 0.0134
c022c650 __copy_from_user_ll 3 0.0170
c0145490 find_vma_prepare 4 0.0357
c0145ab0 do_mmap_pgoff 4 0.0022
c0182190 write_profile 4 0.0625
c0211030 is_leaf 4 0.0086
c0155230 __find_get_block 5 0.0223
c02112a0 is_tree_node 6 0.0536
c022c5d0 __copy_to_user_ll 6 0.0469
c01606a0 link_path_walk 8 0.0035
c01656a0 do_select 9 0.0125
c01659a0 sys_select 9 0.0070
c016aaa0 __d_lookup 10 0.0298
c0109174 system_call 16 0.3636
c0211310 search_by_key 16 0.0044
c022c0b0 fast_clear_page 21 0.2188
00000000 total 232 0.0001

--
Haakon

2004-01-13 23:42:33

by Andrew Morton

[permalink] [raw]
Subject: Re: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6

Haakon Riiser <[email protected]> wrote:
>
> > Seems innocuous. What filesystem type is lock/trigger on?
>
> Reiserfs.
>
> > Can you generate a kernel profile of this activity?
>
> Done. Note that I had to reboot to do the profiling, so the
> delay is not as bad as the first example:
>
> 344 00:19:07.476825 write(5, "\0", 1) = 1 <0.291500>
>
> Still around 700 times slower than 2.4 though. :-) Anyway,
> here are the results:
>
> Output from time:
>
> real 0m0.309s
> user 0m0.011s
> sys 0m0.004s
>
> Data in prof.time:

OK, that's inconclusive. Could you do a few runs, or leave it a day or
two, wait until the problem is really prominent and see if you can gather a
clearer profile? The profiling overhead is negligible when profiling is
enabled but not in use, so there is no need to reboot.


Thanks.

2004-01-13 23:46:18

by Haakon Riiser

[permalink] [raw]
Subject: Re: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6

> Output from time:
>
> real 0m0.309s
> user 0m0.011s
> sys 0m0.004s

Just wanted to comment on my own data, since I just noticed it myself:

The output from time indicates that the system is _not_ using CPU
while delaying, so you might wonder why I said it did. The reason
is that I'm using an AfterStep applet (ascpu) to monitor CPU usage,
and it appeared to work fine in 2.6. Now, I see that there are
differences: For example, another problem I encountered while
upgrading to 2.6 was that disk intensive jobs, such as updating
the slocate database, made ascpu report 100% CPU usage. I just
ran top (procps 2.0.16) beside it, and it reported approximately
10% CPU usage, which is no more than 2.4 used.

I don't know how ascpu measures CPU usage, but it's interesting
that it appears to work fine for the most part, while giving
_completely_ different results from all other programs (e.g.,
time, top, ps) in the write-delay case, and other disk related
activities.

(For the record, I've never seen ascpu's results differ from
top's under Linux 2.4.x.)

--
Haakon

2004-01-14 00:06:56

by Andrew Morton

[permalink] [raw]
Subject: Re: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6

Haakon Riiser <[email protected]> wrote:
>
> > Output from time:
> >
> > real 0m0.309s
> > user 0m0.011s
> > sys 0m0.004s
>
> Just wanted to comment on my own data, since I just noticed it myself:
>
> The output from time indicates that the system is _not_ using CPU
> while delaying, so you might wonder why I said it did. The reason
> is that I'm using an AfterStep applet (ascpu) to monitor CPU usage,
> and it appeared to work fine in 2.6. Now, I see that there are
> differences: For example, another problem I encountered while
> upgrading to 2.6 was that disk intensive jobs, such as updating
> the slocate database, made ascpu report 100% CPU usage. I just
> ran top (procps 2.0.16) beside it, and it reported approximately
> 10% CPU usage, which is no more than 2.4 used.

2.6 has finer-grained cpu utilisation accounting; probably ascpu is
accidentally lumping I/O wait into system time.

Still, some of those delays seem excessive and if you indeed are seeing
longer runtimes than with 2.4, something is up.

2004-01-14 00:09:46

by Haakon Riiser

[permalink] [raw]
Subject: Re: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6

[Andrew Morton]

> OK, that's inconclusive. Could you do a few runs, or leave it a day or
> two, wait until the problem is really prominent and see if you can gather a
> clearer profile? The profiling overhead is negligible when profiling is
> enabled but not in use, so there is no need to reboot.

I tried running the test over and over, and after around 20 times,
it suddenly jumped from 0.3 seconds to 1.6 again. Results follow:

strace:

678 01:04:16.722816 write(5, "\0", 1) = 1 <1.635007>

time:

real 0m1.666s
user 0m0.008s
sys 0m0.005s

prof.time:

c0139ea0 buffered_rmqueue 9 0.0256
c0141e10 page_address 9 0.0469
c0151a70 get_unused_fd 9 0.0216
c0161830 open_namei 9 0.0089
c01f8500 bin_search_in_dir_item 9 0.0511
c0144040 do_anonymous_page 10 0.0189
c0146300 find_vma 10 0.1042
c013d650 kmem_cache_free 11 0.1375
c0145ab0 do_mmap_pgoff 11 0.0061
c0153640 __fput 11 0.0382
c0161180 path_lookup 11 0.0362
c01654f0 __pollwait 11 0.0529
c022bec0 memcpy 11 0.1146
c011aad0 add_wait_queue 12 0.1250
c0151860 dentry_open 12 0.0227
c01539b0 __constant_c_and_count_memset 12 0.0833
c0151d50 filp_close 13 0.0903
c01f85b0 search_by_entry_key 13 0.0262
c01457c0 vma_merge 14 0.0186
c02112a0 is_tree_node 14 0.1250
c0142a20 zap_pte_range 15 0.0347
c022c650 __copy_from_user_ll 15 0.0852
c015f640 pipe_poll 16 0.1250
c01655c0 max_select_fd 16 0.0714
c0117440 do_page_fault 17 0.0131
c0106d30 default_idle 19 0.3958
c013d6a0 kfree 20 0.1786
c0153760 fget 21 0.3281
c01606a0 link_path_walk 21 0.0091
c022c5d0 __copy_to_user_ll 22 0.1719
c0211030 is_leaf 23 0.0496
c016aaa0 __d_lookup 28 0.0833
c0203d70 reiserfs_readdir 28 0.0184
c0155230 __find_get_block 36 0.1607
c01656a0 do_select 47 0.0653
c01659a0 sys_select 50 0.0391
c0211310 search_by_key 90 0.0247
c0109174 system_call 107 2.4318
c022c0b0 fast_clear_page 119 1.2396
00000000 total 1251 0.0005

--
Haakon

2004-01-14 10:28:12

by Giuliano Pochini

[permalink] [raw]
Subject: Re: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6



On Wed, 14 Jan 2004, Haakon Riiser wrote:
> For example, another problem I encountered while
> upgrading to 2.6 was that disk intensive jobs, such as updating
> the slocate database, made ascpu report 100% CPU usage. I just
> ran top (procps 2.0.16) beside it, and it reported approximately
> 10% CPU usage, which is no more than 2.4 used.

It makes sense, since HZ is 10 times higher in 2.6. Did you
recompile ascpu ? Check if ascpu assumes HZ is 100. In that
case it may overstimate the cpu time by a factor 10.


--
Giuliano.

2004-01-14 11:20:28

by Haakon Riiser

[permalink] [raw]
Subject: Re: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6

[Giuliano Pochini]

> On Wed, 14 Jan 2004, Haakon Riiser wrote:
>> For example, another problem I encountered while
>> upgrading to 2.6 was that disk intensive jobs, such as updating
>> the slocate database, made ascpu report 100% CPU usage. I just
>> ran top (procps 2.0.16) beside it, and it reported approximately
>> 10% CPU usage, which is no more than 2.4 used.
>
> It makes sense, since HZ is 10 times higher in 2.6. Did you
> recompile ascpu ? Check if ascpu assumes HZ is 100. In that
> case it may overstimate the cpu time by a factor 10.

No, I never thought of recompiling it, since top seems to work fine
with the same binaries. But thanks for the tip; I'll certainly try
it and see if it works!

--
Haakon

2004-01-14 11:29:23

by Haakon Riiser

[permalink] [raw]
Subject: Re: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6

I have some more info on this bug:

Apparently, lock/trigger is a named pipe that connects to
qmail-send. I immediately tried stracing qmail-send while sending
the test message, but unfortunately, the bug is not reproducible
while qmail-send is straced. The entire process is almost as
fast as it is under 2.4 when I strace qmail-send.

At least now we have a temporary workaround:

strace -p $(pidof qmail-send) -o /dev/null &

:-)

Btw, the output from strace -p $(pidof qmail-send) can be viewed here:
(probably not interesting though, since the bug is not reproduced
under strace)

http://home.chello.no/~hakonrk/qmail-send.st.out.gz

--
Haakon

2004-01-15 00:12:27

by Haakon Riiser

[permalink] [raw]
Subject: Re: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6

[Giuliano Pochini]

> On Wed, 14 Jan 2004, Haakon Riiser wrote:
>> For example, another problem I encountered while
>> upgrading to 2.6 was that disk intensive jobs, such as updating
>> the slocate database, made ascpu report 100% CPU usage. I just
>> ran top (procps 2.0.16) beside it, and it reported approximately
>> 10% CPU usage, which is no more than 2.4 used.
>
> It makes sense, since HZ is 10 times higher in 2.6. Did you
> recompile ascpu ? Check if ascpu assumes HZ is 100. In that
> case it may overstimate the cpu time by a factor 10.

Just recompiled it, and it made no difference -- it still reports
100% CPU usage while doing disk I/O.

--
Haakon

2004-01-20 00:49:40

by Haakon Riiser

[permalink] [raw]
Subject: Re: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6

I think I've been able to create a simple test program that
demonstrates the bug I encountered with Qmail.

What Qmail did was basically to use a named pipe as a trigger,
where one program select()s on the FIFO file descriptor, waiting
for another program to write() the FIFO. Once select() returns,
the listener close()s the FIFO (the data was not important,
it was only used as a signal), does some work, then re-open()s
the FIFO file, and ends up in the same select() waiting for the
whole thing to happen again.

I created a small program to simulate this behavior:

/****************************************************************************/
#include <sys/select.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void run_listener(void)
{
int i;
int fd;
fd_set rfds, wfds;
struct timeval tv = { 1477, 355000 };

mkfifo("test.fifo", 0700);

for (;;) {
if ((fd = open("test.fifo", O_RDONLY | O_NONBLOCK)) < 0) {
perror("open test.fifo");
exit(1);
}
FD_ZERO(&rfds);
FD_SET(fd, &rfds);
FD_ZERO(&wfds);
switch (select(fd + 1, &rfds, &wfds, NULL, &tv)) {
case 0:
fprintf(stderr, "select timed out\n");
exit(0);
case -1:
perror("select()");
exit(1);
}
if (close(fd) < 0) {
perror("close");
}
/* Do some work before returning to select() */
for (i = 0; i < 1000000; i++)
;
}
}

void run_writer(void)
{
struct timeval tv1, tv2;
int fd;

for (;;) {
while ((fd = open("test.fifo", O_WRONLY | O_NONBLOCK)) < 0)
;
gettimeofday(&tv1, NULL);
if (write(fd, &fd, 1) == 1) {
gettimeofday(&tv2, NULL);
fprintf(stderr, "dt = %f ms\n",
(tv2.tv_sec - tv1.tv_sec) * 1000.0 +
(tv2.tv_usec - tv1.tv_usec) / 1000.0);
}
if (close(fd) < 0) {
perror("close");
}
}
}

int main(int argc, char *argv[])
{
signal(SIGPIPE, SIG_IGN);
if (argc > 1) {
run_listener();
}
run_writer();
return 0;
}
/****************************************************************************/

Start this program with arguments to run it in listener mode,
and without to run as writer. The writer will print the time it
spent in write() every time write() succeeds. The results are
quite interesting. You don't even need to print the actual time
spent to see that something is wrong -- any output would suffice.

When run on 2.4.24, output lines from the writer are scrolling
past in a constant high pace, and the numbers are always fractions
of a millisecond.

When run on 2.6.1, everything appears to be fine for a short while,
but suddenly things slow down to the point where write()s take
up to 300 ms! This happens for a second or so, and the delay
drops down to sub-milliseconds again. But wait a little longer,
and it happens again: Slows down to 100-300 msecs per write()
for a short time, and suddenly jumps back to normal pace.

Here's an example run on 2.6. On my computer, there was around
700 sub-msec lines between each series of long delays, so I grep
out the lines that take one msec or more.

$ ./a.out LISTENER &
[1] 449
$ ./a.out 2>&1 | grep -v '= 0'
dt = 302.220000 ms
dt = 100.586000 ms
dt = 101.651000 ms
dt = 100.603000 ms
dt = 101.574000 ms
dt = 100.272000 ms
dt = 101.020000 ms
dt = 101.775000 ms
dt = 99.725000 ms
dt = 100.477000 ms
dt = 100.128000 ms
dt = 101.708000 ms
dt = 100.041000 ms

I don't know if this is exactly the same problem as the one I'm
seeing with Qmail, but it might be worth investigating nonetheless.

--
Haakon

2004-01-20 05:52:20

by Peter Maas

[permalink] [raw]
Subject: Re: Busy-wait delay in qmail 1.03 after upgrading to Linux 2.6

Re: Haakon Riiser
>I think I've been able to create a simple test program that
>demonstrates the bug I encountered with Qmail.

I've compiled and tested the code you posted, i thought that i would add
that renicing the LISTENER processes shows some interesting effects.

at nice 0 the writer spits out about 40 writes under .01 ms, then 3 to 5
writes at 97+ ms, after about 3 minutes its starts to get more random.
at nice -1 the writer spits out about 40 writes under .01 ms, then about 40
writes at 97+ ms, kept this cycle for over 10 minutes.
at nice 1 the writer spits out a few hundred writes under .01 ms, and
occasionally one at 97+ ms.

same effect if elevator=deadline

This thread caught my intrest because I run qmail on this box.

Peter Maas

Pentium II - 400MHz 128MB PC100
kernel-2.6.1-1.126 from http://people.redhat.com/arjanv/2.6/
(not subscribed - reads archives)