2002-10-27 10:27:55

by Adam J. Richter

Subject: Pauses in 2.5.44 (some kind of memory policy change?)

I run /usr/bin/mail to read my mail box file, which has about
24 megabytes (in 2300 messages, mostly spam). After this, about half
of the time, my keyboard and mouse will intermittently stop responding
for a second or two, maybe one or two times, and then everything
seems to be OK. This happens *after* the mail spool has been read.
This did not happen in previous kernels (well, maybe 2.5.43, I can't
quite be sure about that one).

The mail spool is on NFS, but I suspect the culprit might be
some kind of memory balancing change in 2.5.44.

Adam J. Richter     __     ______________   575 Oroville Road
[email protected]      \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."



2002-10-27 10:39:09

by Andrew Morton

Subject: Re: Pauses in 2.5.44 (some kind of memory policy change?)

"Adam J. Richter" wrote:
>
> I run /usr/bin/mail to read my mail box file, which has about
> 24 megabytes (in 2300 messages, mostly spam). After this, about half
> of the time, my keyboard and mouse will intermittently stop responding
> for a second or two, maybe one or two times, and then everything
> seems to be OK. This happens *after* the mail spool has been read.
> This did not happen in previous kernels (well, maybe 2.5.43, I can't
> quite be sure about that one).
>
> The mail spool is on NFS, but I suspect the culprit might be
> some kind of memory balancing change in 2.5.44.
>

Clean pagecache usually doesn't cause much trouble...

Please send a `vmstat 1' trace which covers the episode.

2002-10-27 21:22:07

by Adam J. Richter

Subject: Re: Pauses in 2.5.44 (some kind of memory policy change?)

Andrew Morton wrote:
>"Adam J. Richter" wrote:
>>
>> I run /usr/bin/mail to read my mail box file, which has about
>> 24 megabytes (in 2300 messages, mostly spam). After this, about half
>> of the time, my keyboard and mouse will intermittently stop responding
>> for a second or two, maybe one or two times, and then everything
>> seems to be OK. This happens *after* the mail spool has been read.
>> This did not happen in previous kernels (well, maybe 2.5.43, I can't
>> quite be sure about that one).
>>
>> The mail spool is on NFS, but I suspect the culprit might be
>> some kind of memory balancing change in 2.5.44.
>>

>Clean pagecache usually doesn't cause much trouble...

>Please send a `vmstat 1' trace which covers the episode.

Here are two traces. Also, thanks for your help in pointing out
the new version of procps at surriel.com/procps.


   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0  52160  31420 135796   0   0     1     9   91    40   0   0  99
 0  0  0      0  52156  31420 135796   0   0     0    28 1054   267   1   1  98
 0  0  0      0  52156  31420 135796   0   0     0     0 1043   159   1   1  98
 1  0  2      0  45252  31420 142520   0   0     0     0 3573   349   9  38  53

...Started /usr/bin/mail around here...

 1  0  2      0  30024  31420 157636   0   0     0     0 6540   510  20  80   0
 1  0  2      0  14908  31420 172544   0   0     0     0 6433   499  16  84   0
 3  0  2      0   9196  31420 178172   0   0     0 17152 6939   259   3  97   0
 1  0  1      0   2452  31420 185112   0   0     0  4056 4061   278   9  91   0
 0  0  0      0   2636  31420 184848   0   0     0     0 1294   195   2   6  92

...Pause occurred around here...

 1  0  1      0   2664  31420 184848   0   0     0  4004 1766    55   0  44  56
 0  0  0      0   2676  31420 184848   0   0     0    40 1103   282   2   3  95
 0  0  0      0   2676  31420 184848   0   0     0     0 1064   169   0   0 100
 0  0  0      0   2676  31420 184848   0   0     0     0 1082   225   1   1  98
 0  0  0      0   2676  31420 184848   0   0     0     0 1097   263   1   1  98


Here is another run:

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0   3992  30860 180424   0   0     1     9   92    40   0   0  99
 0  0  0      0   3988  30860 180424   0   0     0     0 1069   267   1   2  97
 0  0  0      0   3988  30860 180424   0   0     0     0 1032   136   1   0  99
 2  0  1      0   2464  29912 182696   0   0     4   380 1121   229  18  19  63
 2  0  3      0   4548  24556 185780   0   0     0  4732 1983   206  26  74   0
 3  0  1      0   4572  24532 185780   0   0     0 10172 2696    46   1  99   0
 0  0  0      0   3000  23460 188292   0   0     0   640 1160   248  12  18  70

...Pause occurred here...

 2  0  1      0   3348  23460 188292   0   0     0  9412 3409   109   0  58  41
 0  0  0      0   3352  23460 188292   0   0     0     4 1036   147   1   2  97
 0  0  0      0   3352  23460 188292   0   0     0     0 1032   153   0   1  99
 0  0  0      0   3352  23460 188292   0   0     0     0 1014   105   1   0  99
 1  0  0      0   3352  23460 188292   0   0     0     0 1089   435   1   1  98
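
Both traces show the same pattern: the "bo" column spikes again after the system has already gone mostly idle. A minimal Python sketch for picking such samples out of a saved `vmstat 1` log follows. The function names and both thresholds are hypothetical choices made for illustration; nothing here comes from procps itself.

```python
# Column names as printed in the vmstat header above.
FIELDS = "r b w swpd free buff cache si so bi bo in cs us sy id".split()

def parse_vmstat(text):
    """Return one dict per numeric data row; headers and annotations are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS) or not all(p.isdigit() for p in parts):
            continue
        rows.append(dict(zip(FIELDS, map(int, parts))))
    return rows

def late_write_bursts(rows, bo_threshold=1000, idle_threshold=50):
    """Indexes of rows with a high 'bo' that directly follow a mostly idle sample.

    Both thresholds are arbitrary illustration values (assumptions).
    """
    return [
        i for i in range(1, len(rows))
        if rows[i]["bo"] >= bo_threshold and rows[i - 1]["id"] >= idle_threshold
    ]

# Excerpt of the first trace, copied from the report above.
TRACE = """
1 0 2 0 30024 31420 157636 0 0 0 0 6540 510 20 80 0
1 0 2 0 14908 31420 172544 0 0 0 0 6433 499 16 84 0
3 0 2 0 9196 31420 178172 0 0 0 17152 6939 259 3 97 0
1 0 1 0 2452 31420 185112 0 0 0 4056 4061 278 9 91 0
0 0 0 0 2636 31420 184848 0 0 0 0 1294 195 2 6 92
...Pause occurred around here...
1 0 1 0 2664 31420 184848 0 0 0 4004 1766 55 0 44 56
0 0 0 0 2676 31420 184848 0 0 0 40 1103 282 2 3 95
"""

if __name__ == "__main__":
    rows = parse_vmstat(TRACE)
    for i in late_write_bursts(rows):
        print("late burst at sample", i, "bo =", rows[i]["bo"])
```

On this excerpt the only sample flagged is the one right after the reported pause (bo = 4004 following a sample that was 92% idle).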


Adam J. Richter     __     ______________   575 Oroville Road
[email protected]      \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

2002-10-27 21:36:16

by Andrew Morton

Subject: Re: Pauses in 2.5.44 (some kind of memory policy change?)

"Adam J. Richter" wrote:
>
> ...
>  1  0  2      0  45252  31420 142520   0   0     0     0 3573   349   9  38  53
>
> ...Started /usr/bin/mail around here...
>
>  1  0  2      0  30024  31420 157636   0   0     0     0 6540   510  20  80   0
>  1  0  2      0  14908  31420 172544   0   0     0     0 6433   499  16  84   0
>  3  0  2      0   9196  31420 178172   0   0     0 17152 6939   259   3  97   0
>  1  0  1      0   2452  31420 185112   0   0     0  4056 4061   278   9  91   0
>  0  0  0      0   2636  31420 184848   0   0     0     0 1294   195   2   6  92
>
> ...Pause occurred around here...
>
>  1  0  1      0   2664  31420 184848   0   0     0  4004 1766    55   0  44  56
>  0  0  0      0   2676  31420 184848   0   0     0    40 1103   282   2   3  95
>  0  0  0      0   2676  31420 184848   0   0     0     0 1064   169   0   0 100

Sorry, don't know.

It's possible that your X server got paged out, but the system
doesn't seem to be under any sort of stress, and there's not
much page reclaim happening and no evidence of executable pagein.

I'm assuming that everything is on local disks apart from that
mail file. Really, you haven't told me much. What's all that
`bo' activity there? What filesystems are in use?

Could it be a networking problem? Are your keyboard and mouse
dependent on ethernet traffic in any way (e.g., executables on
NFS)?

Did the vmstat output exhibit any stalls?

What makes you believe it's a vm/fs thing rather than a keyboard/mouse
thing?

So hm. You'll need to investigate further please.

2002-10-27 21:53:11

by Adam J. Richter

Subject: Re: Pauses in 2.5.44 (some kind of memory policy change?)

>"Adam J. Richter" wrote:
>>
>> ...
>>  1  0  2      0  45252  31420 142520   0   0     0     0 3573   349   9  38  53
>>
>> ...Started /usr/bin/mail around here...
>>
>>  1  0  2      0  30024  31420 157636   0   0     0     0 6540   510  20  80   0
>>  1  0  2      0  14908  31420 172544   0   0     0     0 6433   499  16  84   0
>>  3  0  2      0   9196  31420 178172   0   0     0 17152 6939   259   3  97   0
>>  1  0  1      0   2452  31420 185112   0   0     0  4056 4061   278   9  91   0
>>  0  0  0      0   2636  31420 184848   0   0     0     0 1294   195   2   6  92
>>
>> ...Pause occurred around here...
>>
>>  1  0  1      0   2664  31420 184848   0   0     0  4004 1766    55   0  44  56
>>  0  0  0      0   2676  31420 184848   0   0     0    40 1103   282   2   3  95
>>  0  0  0      0   2676  31420 184848   0   0     0     0 1064   169   0   0 100

>Sorry, don't know.

>It's possible that your X server got paged out, but the system
>doesn't seem to be under any sort of stress, and there's not
>much page reclaim happening and no evidence of executable pagein.

I don't know exactly what the "bo" column represents, but I
find it surprising that *after* /usr/bin/mail has read my mail spool,
often after bo has dropped to near zero, and often after I have typed
a few characters to the mail prompt which have been echoed just fine,
then "bo" spikes back up and then I experience the ~1 second pause.
By the way, the mail program isn't even running at this point. It is
waiting for input from the tty line discipline, and the echoing
resumes without my having to hit the return key.

>I'm assuming that everything is on local disks apart from that
>mail file. Really, you haven't told me much. What's all that
>`bo' activity there? What filesystems are in use?

My home directory is on NFS via autofs. The mail spool is on
NFS. Everything else is on local ext3 partitions.

>Could it be a networking problem? Are your keyboard and mouse
>dependent on ethernet traffic in any way (e.g., executables on
>NFS)?

I have already checked with tcpdump. No significant network
traffic addressed to my machine's ethernet interface occurs during
this time.

>Did the vmstat output exhibit any stalls?

Yes. It stalled along with the keyboard and mouse. I'm pretty
sure everything is stalled.
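
One way to put a number on "the vmstat output stalled" is to timestamp a periodic sampler and look for gaps longer than the sampling interval. The Python sketch below assumes a one-second interval and a half-second slack; `sample` and `find_stalls` are hypothetical helper names. Because the sampler runs on the affected machine, a gap only shows that this process was not scheduled in time, not why.

```python
import time

def find_stalls(timestamps, interval=1.0, slack=0.5):
    """Return (index, gap) for every sample that arrived later than
    interval + slack seconds after the previous one."""
    stalls = []
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        if gap > interval + slack:
            stalls.append((i, gap))
    return stalls

def sample(count, interval=1.0):
    """Collect `count` monotonic timestamps, sleeping `interval` between them."""
    stamps = []
    for _ in range(count):
        stamps.append(time.monotonic())
        time.sleep(interval)
    return stamps

if __name__ == "__main__":
    # Run this while reproducing the pause, then inspect the reported gaps.
    for index, gap in find_stalls(sample(30)):
        print(f"sample {index} arrived {gap:.2f}s after the previous one")
```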

>What makes you believe it's a vm/fs thing rather than a keyboard/mouse
>thing?

Three things: everything seems to stall at the same time; the
"bo" number jumps a second time when the pause occurs; and the pause
occurs *after* the big IO is done and ~24MB of RAM has been allocated.

Adam J. Richter     __     ______________   575 Oroville Road
[email protected]      \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

2002-10-27 22:48:13

by Simon Kirby

Subject: Re: Pauses in 2.5.44 (some kind of memory policy change?)

On Sun, Oct 27, 2002 at 01:42:27PM -0800, Andrew Morton wrote:

> "Adam J. Richter" wrote:
> >...
> >  3  0  2      0   9196  31420 178172   0   0     0 17152 6939   259   3  97   0
> >  1  0  1      0   2452  31420 185112   0   0     0  4056 4061   278   9  91   0
>
> Sorry, don't know.
>
> It's possible that your X server got paged out, but the system
> doesn't seem to be under any sort of stress, and there's not
> much page reclaim happening and no evidence of executable pagein.
>
> I'm assuming that everything is on local disks apart from that
> mail file. Really, you haven't told me much. What's all that
> `bo' activity there? What filesystems are in use?

The "bi" and "bo" are accidentally reversed in the kernel. :)
I can't believe nobody else has noticed this.

(I'm pretty sure I checked that vmstat was not reversing them. The
numbers in /proc/vmstat were backwards...)
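
A rough way to check which direction a counter really tracks is to snapshot /proc/vmstat before and after a workload whose direction is known, such as a large read of an uncached file. The Python sketch below assumes the relevant counters are named `pgpgin` and `pgpgout`; the message above does not name the fields, so treat those names as an assumption. `read_counters` and `delta` are hypothetical helpers.

```python
def read_counters(text, names=("pgpgin", "pgpgout")):
    """Extract the named counters from /proc/vmstat-style 'name value' text."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in names:
            counters[parts[0]] = int(parts[1])
    return counters

def delta(before, after):
    """Per-counter difference between two snapshots."""
    return {name: after[name] - before[name] for name in before if name in after}

def snapshot(path="/proc/vmstat"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return read_counters(f.read())

if __name__ == "__main__":
    before = snapshot()
    input("Run a read-only workload now, then press Enter... ")
    print(delta(before, snapshot()))
```

If a read-only workload mostly moves the counter that vmstat reports under "bo", that would be consistent with the reversal described here.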

Simon-

[ Simon Kirby ][ Network Operations ]
[ [email protected] ][ NetNation Communications ]
[ Opinions expressed are not necessarily those of my employer. ]