2000-11-01 18:39:36

by David Mansfield

Subject: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10

Hi VM/procfs hackers,

System is a UP Athlon 700MHz with 256MB of RAM running vanilla 2.4.0-test10.
gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)

I'd like to report what seems like a performance problem in the latest
kernels. Actually, all recent kernels have exhibited this problem, but
I was waiting for the new VM stuff to stabilize before reporting it.

My test is: run 7 processes that each allocate and randomly access 32MB
of RAM (on a 256MB machine). Even though 7*32MB = 224MB, this still
sends the machine lightly into swap. The machine continues to function
fairly smoothly for the most part. I can do filesystem operations, run
new programs, move desktops in X etc.
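
(Each hog is, in effect, something like the following -- a
reconstruction for illustration, not the actual test program:)

        /* hog.c: allocate 32MB, then touch it at random so the pages
         * stay in play (reconstructed, not the original program) */
        #include <stdlib.h>
        #include <string.h>

        #define SIZE (32 * 1024 * 1024)

        int main(void)
        {
                char *mem = malloc(SIZE);

                if (!mem)
                        return 1;
                memset(mem, 1, SIZE);                   /* fault every page in */
                for (;;)
                        mem[(unsigned)rand() % SIZE]++; /* random access */
        }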

Except: programs which access /proc/<pid>/stat stall for an indeterminate
amount of time. For example, 'ps' and 'vmstat' stall BADLY in these
scenarios. I have had the stalls last over a minute in higher VM
pressure situations.

Unfortunately, it's exactly when the system is thrashing that you need
to be able to run 'ps' to get the PID to kill, and a reliable vmstat to
monitor what's happening.

Here's a segment of an strace of 'ps' showing a 12-second stall (this
isn't the worst I've seen by any means, but a 12-second stall trying to
get process info for one swapping task can easily snowball into a DoS).

0.000119 open("/proc/4746/stat", O_RDONLY) = 7
0.000072 read(7, "4746 (hog) D 4739 4739 827 34817"..., 511) = 181
12.237161 close(7) = 0

The wchan of the stalled 'ps' is in __down_interruptible, which probably
doesn't help much.

This worked absolutely fine in 2.2. Even under extreme swap pressure,
vmstat continued to function, spitting out a report every second as
it should.

David Mansfield


2000-11-01 18:56:30

by Rik van Riel

Subject: Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10

On Wed, 1 Nov 2000, David Mansfield wrote:

> I'd like to report what seems like a performance problem in the latest
> kernels. Actually, all recent kernels have exhibited this problem, but
> I was waiting for the new VM stuff to stabilize before reporting it.
>
> My test is: run 7 processes that each allocate and randomly
> access 32mb of ram (on a 256mb machine). Even though 7*32MB =
> 224MB, this still sends the machine lightly into swap. The
> machine continues to function fairly smoothly for the most part.
> I can do filesystem operations, run new programs, move desktops
> in X etc.
>
> Except: programs which access /proc/<pid>/stat stall for an
> indeterminate amount of time. For example, 'ps' and 'vmstat'
> stall BADLY in these scenarios. I have had the stalls last over
> a minute in higher VM pressure situations.

I have one possible reason for this ....

1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
down(&mm->mmap_sem);

2) but, in order to do that, it has to wait until the process
it is trying to stat has /finished/ its page fault, and is
not into its next one ...

3) combine this with the elevator starvation stuff (ask Jens
Axboe for blk-7 to alleviate this issue) and you have a
scenario where processes using /proc/<pid>/stat have the
possibility to block on multiple processes that are in the
process of handling a page fault (but are being starved)

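In outline, the colliding paths look like this (a simplified sketch
of the 2.4-era sources, not verbatim):

        /* fs/proc/array.c::proc_pid_stat(), simplified */
        down(&mm->mmap_sem);    /* sleeps until the fault below finishes */
        for (vma = mm->mmap; vma; vma = vma->vm_next)
                vsize += vma->vm_end - vma->vm_start;
        up(&mm->mmap_sem);

        /* arch/i386/mm/fault.c::do_page_fault(), simplified */
        down(&mm->mmap_sem);    /* held across the whole fault ... */
        handle_mm_fault(mm, vma, address, write);  /* ... disk I/O included */
        up(&mm->mmap_sem);
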
regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/

2000-11-02 07:19:33

by Mike Galbraith

Subject: Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10

On Wed, 1 Nov 2000, Rik van Riel wrote:

> On Wed, 1 Nov 2000, David Mansfield wrote:
>
> > I'd like to report what seems like a performance problem in the latest
> > kernels. Actually, all recent kernels have exhibited this problem, but
> > I was waiting for the new VM stuff to stabilize before reporting it.
> >
> > My test is: run 7 processes that each allocate and randomly
> > access 32mb of ram (on a 256mb machine). Even though 7*32MB =
> > 224MB, this still sends the machine lightly into swap. The
> > machine continues to function fairly smoothly for the most part.
> > I can do filesystem operations, run new programs, move desktops
> > in X etc.
> >
> > Except: programs which access /proc/<pid>/stat stall for an
> > indeterminate amount of time. For example, 'ps' and 'vmstat'
> > stall BADLY in these scenarios. I have had the stalls last over
> > a minute in higher VM pressure situations.
>
> I have one possible reason for this ....
>
> 1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
> down(&mm->mmap_sem);
>
> 2) but, in order to do that, it has to wait until the process
> it is trying to stat has /finished/ its page fault, and is
> not into its next one ...
>
> 3) combine this with the elevator starvation stuff (ask Jens
> Axboe for blk-7 to alleviate this issue) and you have a
> scenario where processes using /proc/<pid>/stat have the
> possibility to block on multiple processes that are in the
> process of handling a page fault (but are being starved)

I'm experimenting with blk.[67] in test10 right now. The stalls
are not helped at all. It doesn't seem to become request bound
(haven't instrumented that yet to be sure) but the stalls persist.

-Mike

2000-11-02 08:48:17

by Christoph Rohland

Subject: Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10

Hi Rik,

I can probably give some more data points. Here is the console output
of my test machine (there is a 'vmstat 5' running in the background):

[root@ls3016 /root]# killall shmtst
[root@ls3016 /root]#
1 12 2 0 1607668 18932 2110496 0 0 67154 1115842 1050063 2029389 0 2 98
0 10 2 0 1607564 18932 2110496 0 0 0 300 317 426 0 0 100
0 10 2 0 1607408 18932 2110496 0 0 0 301 336 473 0 0 100
0 10 2 0 1607560 18932 2110508 0 0 0 307 318 430 0 0 100
0 10 2 0 1607556 18932 2110512 0 0 0 304 324 433 0 0 100
0 10 2 0 1607528 18932 2110512 0 0 0 272 308 410 0 1 99
0 10 2 0 1607440 18932 2110516 0 0 0 315 323 438 0 1 99
0 10 2 0 1607528 18932 2110516 0 0 0 323 316 424 0 0 100
0 10 2 0 1607556 18932 2110516 0 0 0 304 309 410 0 0 100
0 10 2 0 1607600 18932 2110528 0 0 0 298 314 418 0 0 100
0 10 2 0 1607384 18932 2110528 0 0 0 296 307 406 0 1 99
0 10 2 0 1607284 18932 2110528 0 0 0 304 315 421 0 0 100
0 10 2 0 1607668 18932 2110528 0 0 0 298 304 402 0 0 100
0 10 2 0 1607576 18932 2110528 0 0 0 285 307 405 0 0 100
0 10 2 0 1607656 18932 2110528 0 0 0 292 303 399 0 1 99
0 10 2 0 1607928 18932 2110528 0 0 0 313 310 408 0 0 100
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 10 2 0 1608440 18932 2110528 0 0 0 340 313 417 0 1 99
0 10 2 0 1608260 18932 2110528 0 0 0 298 318 426 0 0 100
0 10 2 0 1608208 18932 2110528 0 0 0 314 334 448 0 1 99
0 10 2 0 1608396 18932 2110528 0 0 0 323 316 421 0 1 99
0 10 2 0 1608204 18932 2110548 0 0 0 334 333 458 0 0 100
0 10 2 0 1607888 18932 2110580 0 0 0 336 329 448 0 1 99
0 10 2 0 1608040 18932 2110584 0 0 0 317 321 435 0 0 100
0 10 2 0 1608032 18932 2110588 0 0 0 241 318 425 0 0 100
0 10 2 0 1608028 18932 2110592 0 0 0 257 325 443 0 1 99
0 10 3 0 1608028 18932 2110592 0 0 0 258 323 435 0 0 99
0 10 2 0 1608032 18932 2110592 0 0 0 241 316 425 0 0 100
0 10 2 0 1608024 18932 2110592 0 0 0 261 337 460 0 0 100
0 10 2 0 1608016 18932 2110592 0 0 0 253 328 444 0 0 100
0 10 2 0 1608024 18932 2110592 0 0 0 252 320 435 0 0 100
0 10 2 0 1608012 18932 2110592 0 0 0 255 326 446 0 0 100
0 10 2 0 1608020 18932 2110592 0 0 0 255 326 444 0 1 99
0 10 2 0 1608012 18932 2110600 0 0 0 261 341 469 0 0 100
0 10 2 0 1607992 18932 2110608 0 0 0 261 344 479 0 0 100
0 10 2 0 1607992 18932 2110612 0 0 0 264 342 471 0 0 100
0 10 2 0 1607984 18932 2110612 0 0 0 266 334 462 0 0 100
0 10 2 0 1607980 18932 2110620 0 0 0 273 340 468 0 0 99
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 10 2 0 1607972 18932 2110624 0 0 0 266 345 474 0 1 99
0 10 2 0 1607940 18932 2110640 0 0 0 256 341 462 0 0 100
0 10 2 0 1607936 18932 2110644 0 0 0 262 339 462 0 1 99
0 10 2 0 1607940 18932 2110644 0 0 0 261 333 450 0 1 99
0 10 2 0 1607944 18932 2110644 0 0 0 253 335 454 0 0 100
0 10 2 0 1607944 18932 2110644 0 0 0 272 352 479 0 1 99

[root@ls3016 /root]# ps l
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
100 0 820 1 9 0 2200 1168 wait4 S ttyS0 0:00 login -- ro
100 0 862 820 14 0 1756 976 wait4 S ttyS0 0:00 -bash
000 0 878 862 9 0 1080 360 down D ttyS0 11:27 ./shmtst 10
000 0 879 862 9 0 1080 360 down D ttyS0 15:21 ./shmtst 15
040 0 880 878 9 0 1092 416 wait_o D ttyS0 8:55 ./shmtst 10
040 0 881 878 9 0 1080 360 down D ttyS0 10:22 ./shmtst 10
444 0 882 878 9 0 0 0 do_exi Z ttyS0 10:00 [shmtst <de
040 0 883 878 9 0 1092 416 wait_o D ttyS0 9:30 ./shmtst 10
040 0 884 878 9 0 1092 416 down D ttyS0 8:44 ./shmtst 10
040 0 885 878 9 0 1092 416 down D ttyS0 9:01 ./shmtst 10
444 0 886 878 9 0 0 0 do_exi Z ttyS0 7:59 [shmtst <de
444 0 887 879 9 0 0 0 do_exi Z ttyS0 17:11 [shmtst <de
040 0 888 878 9 0 1080 360 down D ttyS0 10:21 ./shmtst 10
040 0 889 878 9 0 1092 416 down D ttyS0 9:06 ./shmtst 10
000 0 891 862 9 0 1136 488 nanosl S ttyS0 0:23 vmstat 5
000 0 1226 862 19 0 2756 1084 - R ttyS0 0:00 ps l
[root@ls3016 /root]# 0 10 2 0 1607936 18932 2110652 0 0 0 275 368 488 0 0 99
0 10 2 0 1607912 18932 2110660 0 0 0 266 334 457 0 0 100
0 10 2 0 1607848 18932 2110672 0 0 0 302 354 498 0 0 100
0 10 2 0 1607892 18932 2110688 0 0 0 287 352 496 0 0 100
0 11 2 0 1607868 18932 2110704 0 0 1 282 338 472 0 1 99

So the processes haven't finished exiting after at least 47*5 seconds.
They have shared-mmapped a plain file some 666,000,000 bytes long on an
8GB machine.
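
(The mapping is presumably of this shape -- an illustrative guess,
not the actual shmtst source:)

        #include <fcntl.h>
        #include <sys/mman.h>

        int main(void)
        {
                int fd = open("datafile", O_RDWR);  /* ~666000000 bytes */
                char *p;

                if (fd < 0)
                        return 1;
                p = mmap(NULL, 666000000UL, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
                if (p != MAP_FAILED)
                        p[0] = 1;  /* ... touch pages across the mapping ... */
                return 0;          /* exit tears down the dirty shared
                                      mapping -- where the tasks above
                                      sit in D state */
        }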

The rest of the machine behaves nicely.

Greetings
Christoph

2000-11-02 22:00:21

by Val Henson

Subject: Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10

On Thu, Nov 02, 2000 at 08:19:06AM +0100, Mike Galbraith wrote:
> On Wed, 1 Nov 2000, Rik van Riel wrote:
>
> > I have one possible reason for this ....
> >
> > 1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
> > down(&mm->mmap_sem);
> >
> > 2) but, in order to do that, it has to wait until the process
> > it is trying to stat has /finished/ its page fault, and is
> > not into its next one ...
> >
> > 3) combine this with the elevator starvation stuff (ask Jens
> > Axboe for blk-7 to alleviate this issue) and you have a
> > scenario where processes using /proc/<pid>/stat have the
> > possibility to block on multiple processes that are in the
> > process of handling a page fault (but are being starved)
>
> I'm experimenting with blk.[67] in test10 right now. The stalls
> are not helped at all. It doesn't seem to become request bound
> (haven't instrumented that yet to be sure) but the stalls persist.
>
> -Mike

This is not an elevator starvation problem.

I also experienced these stalls with my IDE-only system. Unless I'm
badly mistaken, the elevator is only used on SCSI disks, so elevator
starvation cannot be blamed for this problem. These stalls
are particularly annoying since I want to find the pid of the process
hogging memory in order to kill it, but the read from /proc stalls for
45 seconds or more.

-VAL

2000-11-03 00:31:25

by Jens Axboe

Subject: Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10

On Thu, Nov 02 2000, Val Henson wrote:
> > > 3) combine this with the elevator starvation stuff (ask Jens
> > > Axboe for blk-7 to alleviate this issue) and you have a
> > > scenario where processes using /proc/<pid>/stat have the
> > > possibility to block on multiple processes that are in the
> > > process of handling a page fault (but are being starved)
> >
> > I'm experimenting with blk.[67] in test10 right now. The stalls
> > are not helped at all. It doesn't seem to become request bound
> > (haven't instrumented that yet to be sure) but the stalls persist.
> >
> > -Mike
>
> This is not an elevator starvation problem.

True, but the blk-xx patches help work around what I believe is
bad flushing behaviour by the VM.

> I also experienced these stalls with my IDE-only system. Unless I'm
> badly mistaken, the elevator is only used on SCSI disks, therefore
> elevator starvation cannot be blamed for this problem. These stalls
> are particularly annoying since I want to find the pid of the process
> hogging memory in order to kill it, but the read from /proc stalls for
> 45 seconds or more.

You are badly mistaken.

--
* Jens Axboe <[email protected]>
* SuSE Labs

2000-11-03 05:57:06

by Mike Galbraith

Subject: Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10

On Thu, 2 Nov 2000, Jens Axboe wrote:

> On Thu, Nov 02 2000, Val Henson wrote:
> > > > 3) combine this with the elevator starvation stuff (ask Jens
> > > > Axboe for blk-7 to alleviate this issue) and you have a
> > > > scenario where processes using /proc/<pid>/stat have the
> > > > possibility to block on multiple processes that are in the
> > > > process of handling a page fault (but are being starved)
> > >
> > > I'm experimenting with blk.[67] in test10 right now. The stalls
> > > are not helped at all. It doesn't seem to become request bound
> > > (haven't instrumented that yet to be sure) but the stalls persist.
> > >
> > > -Mike
> >
> > This is not an elevator starvation problem.
>
> True, but the blk-xx patches help work around what I believe is
> bad flushing behaviour by the VM.

I very much agree. Kflushd is still hungry for free write
bandwidth here.

Of course it's _going_ to have to wait if you're doing max IO
throughput, but when you're flushing, you need to let kflushd
have the bandwidth it needs to do its job. I don't think it's
getting what it needs, and am trying two things.

1. Revoke reads' ability to steal requests while we're in a
heavy flushing situation. Flushing must proceed, and it must
go at full speed. (Actually, reversing the favoritism when
you need flush bandwidth makes sense to me, and does help if
limited... if not limited, it hurts like hell.)

2. Use the information that we are starving (or going full bore)
to tell the VM to keep its fingers off dirty buffers. If we're
flushing at disk speed, page_launder() can't do anything useful
with dirty buffers; it can only do harm, IMHO.
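
(In code terms, the second idea amounts to a gate of this shape in
page_launder()'s scan -- hypothetical, and queue_is_saturated() is an
invented placeholder for whatever "we are starving" test gets used:)

        /* mm/vmscan.c, hypothetical: leave dirty buffer pages alone
         * while the flush queue is already running at disk speed */
        if (page->buffers && queue_is_saturated()) {
                /* kflushd is writing flat out; queueing more from
                 * here only costs seeks */
                continue;
        }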

-Mike

P.S. Before I revert to Luke Codecrawler mode, I have a wild
problem theory I'd appreciate comments on.. preferably the kind
where I become extremely busy thinking about their content ;-)

If one __alloc_pages() is waiting for kswapd, kswapd tries to do
synchronous flushing... if the write queue is nearly (or fully)
exhausted and page_launder() couldn't clean any buffers on its
first pass, it blasts the queue some more and stalls. If kswapd,
kflushd and kupdate are all waiting for a request, and then say a
GFP_BUFFER allocation comes along (we're low on memory), we do a
SCHED_YIELD schedule(). If we're holding IO locks, nobody can do
IO. OK, if there's nobody else running, we come right back and
either finish the allocation or fail. But if you have other
allocations trying to flush buffers (GFP_KERNEL, e.g.), they are
not only in danger of stacking up due to a request shortage, but
they can't get whatever IO locks the GFP_BUFFER allocation is
holding anyway, so they are doomed until we schedule back to the
GFP_BUFFER allocating task.
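
(The yield in question is the low-memory path of __alloc_pages() --
paraphrased here, not quoted from the test10 source:)

        /* mm/page_alloc.c, paraphrased: really tight on memory and
         * not PF_MEMALLOC, so give the CPU away and hope */
        current->policy |= SCHED_YIELD;
        __set_current_state(TASK_RUNNING);
        schedule();     /* ...possibly while holding IO locks */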

Isn't scheduling while holding IO locks the wrong thing to do? It's
protected from neither GFP_BUFFER nor PF_MEMALLOC.

I must be missing something.. but what?

<ears at maximum gain>

2000-11-03 15:46:34

by Mike Galbraith

Subject: Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10

On Fri, 3 Nov 2000, Mike Galbraith wrote:

> On Thu, 2 Nov 2000, Jens Axboe wrote:
>
> > On Thu, Nov 02 2000, Val Henson wrote:
> > > > > 3) combine this with the elevator starvation stuff (ask Jens
> > > > > Axboe for blk-7 to alleviate this issue) and you have a
> > > > > scenario where processes using /proc/<pid>/stat have the
> > > > > possibility to block on multiple processes that are in the
> > > > > process of handling a page fault (but are being starved)
> > > >
> > > > I'm experimenting with blk.[67] in test10 right now. The stalls
> > > > are not helped at all. It doesn't seem to become request bound
> > > > (haven't instrumented that yet to be sure) but the stalls persist.
> > > >
> > > > -Mike
> > >
> > > This is not an elevator starvation problem.
> >
> > True, but the blk-xx patches help work around what I believe is
> > bad flushing behaviour by the VM.
>
> I very much agree. Kflushd is still hungry for free write
> bandwidth here.

In the LKML tradition of code talks and silly opinions walk...

Attached is a diagnostic patch which gets kflushd under control,
and takes make -j30 bzImage build times down from 12 minutes to
9 here. I have no more massive context switching on write, and
copies seem to go a lot quicker to boot. (that may be because
some of my failures were really _really_ horrible)

Comments are very welcome. I haven't had problems with this yet,
but it's early so... This patch isn't supposed to be pretty either
(hw techs don't do pretty;) it's only supposed to say 'Houston...'
so be sure to grab a barfbag before you take a look.

-Mike

P.S. almost forgot. vmstat freezes were shortened too :-)


Attachments:
vm_pudding.diff (12.99 kB)

2000-11-03 18:27:20

by Jens Axboe

Subject: Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10

On Fri, Nov 03 2000, Mike Galbraith wrote:
> > I very much agree. Kflushd is still hungry for free write
> > bandwidth here.
>
> In the LKML tradition of code talks and silly opinions walk...
>
> Attached is a diagnostic patch which gets kflushd under control,
> and takes make -j30 bzImage build times down from 12 minutes to
> 9 here. I have no more massive context switching on write, and
> copies seem to go a lot quicker to boot. (that may be because
> some of my failures were really _really_ horrible)
>
> Comments are very welcome. I haven't had problems with this yet,
> but it's early so... This patch isn't supposed to be pretty either
> > (hw techs don't do pretty;) it's only supposed to say 'Houston...'
> so be sure to grab a barfbag before you take a look.

Super, looks pretty good from here. I'll give it a go when I get back.
In addition, here's a small patch that disables the read stealing
of requests from the write list -- does that improve behaviour
when we are busy flushing?
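
(The real change is in the attached diff; in outline the idea is of
this shape -- an illustrative sketch, not the patch itself:)

        /* drivers/block/ll_rw_blk.c, sketch: a READ that finds its own
         * freelist empty now simply fails, instead of borrowing a
         * request from the WRITE freelist */
        list = &q->request_freelist[rw];
        if (list_empty(list))
                return NULL;    /* no cross-list stealing */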

--
* Jens Axboe <[email protected]>
* SuSE Labs


Attachments:
read_steal.diff (998.00 B)

2000-11-04 05:44:14

by Mike Galbraith

Subject: Re: [BUG] /proc/<pid>/stat access stalls badly for swapping process, 2.4.0-test10

On Fri, 3 Nov 2000, Jens Axboe wrote:

> On Fri, Nov 03 2000, Mike Galbraith wrote:
> > > I very much agree. Kflushd is still hungry for free write
> > > bandwidth here.
> >
> > In the LKML tradition of code talks and silly opinions walk...
> >
> > Attached is a diagnostic patch which gets kflushd under control,
> > and takes make -j30 bzImage build times down from 12 minutes to
> > 9 here. I have no more massive context switching on write, and
> > copies seem to go a lot quicker to boot. (that may be because
> > some of my failures were really _really_ horrible)
> >
> > Comments are very welcome. I haven't had problems with this yet,
> > but it's early so... This patch isn't supposed to be pretty either
> > > (hw techs don't do pretty;) it's only supposed to say 'Houston...'
> > so be sure to grab a barfbag before you take a look.
>
> Super, looks pretty good from here. I'll give it a go when I get back.
> In addition, here's a small patch that disables the read stealing
> of requests from the write list -- does that improve behaviour
> when we are busy flushing?

Yes. I've done this a bit differently here, and have had good
results. I only disable stealing when I need flush throughput.

Now that the box isn't biting off more than it can chew quite
as often, I'll try this again. I'm pretty darn sure that I can
get more throughput, but :> I've learned that getting too much
can do really OOGLY things. (turns box into single user single
tasking streaming IO monster from hell)

-Mike