Hi VM/procfs hackers,
The system is a UP Athlon 700MHz with 256MB RAM running vanilla 2.4.0-test10.
gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
I'd like to report what seems like a performance problem in the latest
kernels. Actually, all recent kernels have exhibited this problem, but
I was waiting for the new VM stuff to stabilize before reporting it.
My test is: run 7 processes that each allocate and randomly access 32MB
of RAM (on a 256MB machine). Even though 7*32MB = 224MB, this still
sends the machine lightly into swap. The machine continues to function
fairly smoothly for the most part. I can do filesystem operations, run
new programs, move desktops in X etc.
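For reference, each hog is roughly along these lines (an illustrative
sketch, not the exact program used in the test):

/* Illustrative hog: allocate 32MB and touch random pages forever.
 * Run 7 instances on a 256MB box to push it lightly into swap. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HOG_SIZE (32UL * 1024 * 1024)
#define PAGE     4096UL

int main(void)
{
	char *buf = malloc(HOG_SIZE);

	if (!buf) {
		perror("malloc");
		return 1;
	}
	memset(buf, 1, HOG_SIZE);	/* fault the whole region in once */

	for (;;) {
		/* touch one random page so the working set stays active */
		unsigned long page = (unsigned long)rand() % (HOG_SIZE / PAGE);
		buf[page * PAGE]++;
	}
}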
Except: programs which access /proc/<pid>/stat stall for an indeterminate
amount of time. For example, 'ps' and 'vmstat' stall BADLY in these
scenarios. I have had the stalls last over a minute in higher VM
pressure situations.
Unfortunately, when the system is thrashing, it's nice to be able to run
'ps' in order to get the PID to kill, and run a reliable vmstat to
monitor it.
Here's a segment of an strace of 'ps' showing a 12-second stall (this
isn't the worst I've seen by any means, but a 12-second stall trying to
get process info for one swapping task can easily snowball into a DoS).
0.000119 open("/proc/4746/stat", O_RDONLY) = 7
0.000072 read(7, "4746 (hog) D 4739 4739 827 34817"..., 511) = 181
12.237161 close(7) = 0
The wchan of the stalled 'ps' is in __down_interruptible, which probably
doesn't help much.
This worked absolutely fine in 2.2. Even under extreme swap pressure,
vmstat continued to function fine, spitting out messages every second as
it should.
David Mansfield
On Wed, 1 Nov 2000, David Mansfield wrote:
> I'd like to report what seems like a performance problem in the latest
> kernels. Actually, all recent kernels have exhibited this problem, but
> I was waiting for the new VM stuff to stabilize before reporting it.
>
> My test is: run 7 processes that each allocate and randomly
> access 32MB of RAM (on a 256MB machine). Even though 7*32MB =
> 224MB, this still sends the machine lightly into swap. The
> machine continues to function fairly smoothly for the most part.
> I can do filesystem operations, run new programs, move desktops
> in X etc.
>
> Except: programs which access /proc/<pid>/stat stall for an
> indeterminate amount of time. For example, 'ps' and 'vmstat'
> stall BADLY in these scenarios. I have had the stalls last over
> a minute in higher VM pressure situations.
I have one possible reason for this ....
1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
down(&mm->mmap_sem);
2) but, in order to do that, it has to wait until the process
it is trying to stat has /finished/ its page fault, and is
not into its next one ...
3) combine this with the elevator starvation stuff (ask Jens
Axboe for blk-7 to alleviate this issue) and you have a
scenario where processes using /proc/<pid>/stat have the
possibility to block on multiple processes that are in the
process of handling a page fault (but are being starved)
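To make the blocking chain concrete, here is a minimal userspace
analogue (illustrative only, not kernel code; the real players are the
page fault path, mm->mmap_sem and fs/proc/array.c::proc_pid_stat):

/* Userspace analogue of the stall: the "faulting" thread holds the
 * lock across slow I/O, and the "ps" thread must take the same lock
 * just to report statistics, so it stalls for the full I/O time. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t mmap_sem = PTHREAD_MUTEX_INITIALIZER;

static void *page_fault(void *arg)
{
	pthread_mutex_lock(&mmap_sem);	/* fault path takes mmap_sem...      */
	sleep(12);			/* ...then waits on starved disk I/O */
	pthread_mutex_unlock(&mmap_sem);
	return NULL;
}

static void *stat_reader(void *arg)
{
	pthread_mutex_lock(&mmap_sem);	/* 'ps' blocks here the whole time */
	puts("stats read");
	pthread_mutex_unlock(&mmap_sem);
	return NULL;
}

int main(void)
{
	pthread_t fault, ps;

	pthread_create(&fault, NULL, page_fault, NULL);
	sleep(1);			/* let the fault grab the lock first */
	pthread_create(&ps, NULL, stat_reader, NULL);
	pthread_join(fault, NULL);
	pthread_join(ps, NULL);
	return 0;
}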
regards,
Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000
http://www.conectiva.com/ http://www.surriel.com/
On Wed, 1 Nov 2000, Rik van Riel wrote:
> On Wed, 1 Nov 2000, David Mansfield wrote:
>
> > I'd like to report what seems like a performance problem in the latest
> > kernels. Actually, all recent kernels have exhibited this problem, but
> > I was waiting for the new VM stuff to stabilize before reporting it.
> >
> > My test is: run 7 processes that each allocate and randomly
> > access 32MB of RAM (on a 256MB machine). Even though 7*32MB =
> > 224MB, this still sends the machine lightly into swap. The
> > machine continues to function fairly smoothly for the most part.
> > I can do filesystem operations, run new programs, move desktops
> > in X etc.
> >
> > Except: programs which access /proc/<pid>/stat stall for an
> > indeterminate amount of time. For example, 'ps' and 'vmstat'
> > stall BADLY in these scenarios. I have had the stalls last over
> > a minute in higher VM pressure situations.
>
> I have one possible reason for this ....
>
> 1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
> down(&mm->mmap_sem);
>
> 2) but, in order to do that, it has to wait until the process
> it is trying to stat has /finished/ its page fault, and is
> not into its next one ...
>
> 3) combine this with the elevator starvation stuff (ask Jens
> Axboe for blk-7 to alleviate this issue) and you have a
> scenario where processes using /proc/<pid>/stat have the
> possibility to block on multiple processes that are in the
> process of handling a page fault (but are being starved)
I'm experimenting with blk.[67] in test10 right now. The stalls
are not helped at all. It doesn't seem to become request bound
(haven't instrumented that yet to be sure) but the stalls persist.
-Mike
Hi Rik,
I can probably give some more datapoints. Here is the console output
of my test machine (there is a 'vmstat 5' running in the background):
[root@ls3016 /root]# killall shmtst
[root@ls3016 /root]#
1 12 2 0 1607668 18932 2110496 0 0 67154 1115842 1050063 2029389 0 2 98
0 10 2 0 1607564 18932 2110496 0 0 0 300 317 426 0 0 100
0 10 2 0 1607408 18932 2110496 0 0 0 301 336 473 0 0 100
0 10 2 0 1607560 18932 2110508 0 0 0 307 318 430 0 0 100
0 10 2 0 1607556 18932 2110512 0 0 0 304 324 433 0 0 100
0 10 2 0 1607528 18932 2110512 0 0 0 272 308 410 0 1 99
0 10 2 0 1607440 18932 2110516 0 0 0 315 323 438 0 1 99
0 10 2 0 1607528 18932 2110516 0 0 0 323 316 424 0 0 100
0 10 2 0 1607556 18932 2110516 0 0 0 304 309 410 0 0 100
0 10 2 0 1607600 18932 2110528 0 0 0 298 314 418 0 0 100
0 10 2 0 1607384 18932 2110528 0 0 0 296 307 406 0 1 99
0 10 2 0 1607284 18932 2110528 0 0 0 304 315 421 0 0 100
0 10 2 0 1607668 18932 2110528 0 0 0 298 304 402 0 0 100
0 10 2 0 1607576 18932 2110528 0 0 0 285 307 405 0 0 100
0 10 2 0 1607656 18932 2110528 0 0 0 292 303 399 0 1 99
0 10 2 0 1607928 18932 2110528 0 0 0 313 310 408 0 0 100
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 10 2 0 1608440 18932 2110528 0 0 0 340 313 417 0 1 99
0 10 2 0 1608260 18932 2110528 0 0 0 298 318 426 0 0 100
0 10 2 0 1608208 18932 2110528 0 0 0 314 334 448 0 1 99
0 10 2 0 1608396 18932 2110528 0 0 0 323 316 421 0 1 99
0 10 2 0 1608204 18932 2110548 0 0 0 334 333 458 0 0 100
0 10 2 0 1607888 18932 2110580 0 0 0 336 329 448 0 1 99
0 10 2 0 1608040 18932 2110584 0 0 0 317 321 435 0 0 100
0 10 2 0 1608032 18932 2110588 0 0 0 241 318 425 0 0 100
0 10 2 0 1608028 18932 2110592 0 0 0 257 325 443 0 1 99
0 10 3 0 1608028 18932 2110592 0 0 0 258 323 435 0 0 99
0 10 2 0 1608032 18932 2110592 0 0 0 241 316 425 0 0 100
0 10 2 0 1608024 18932 2110592 0 0 0 261 337 460 0 0 100
0 10 2 0 1608016 18932 2110592 0 0 0 253 328 444 0 0 100
0 10 2 0 1608024 18932 2110592 0 0 0 252 320 435 0 0 100
0 10 2 0 1608012 18932 2110592 0 0 0 255 326 446 0 0 100
0 10 2 0 1608020 18932 2110592 0 0 0 255 326 444 0 1 99
0 10 2 0 1608012 18932 2110600 0 0 0 261 341 469 0 0 100
0 10 2 0 1607992 18932 2110608 0 0 0 261 344 479 0 0 100
0 10 2 0 1607992 18932 2110612 0 0 0 264 342 471 0 0 100
0 10 2 0 1607984 18932 2110612 0 0 0 266 334 462 0 0 100
0 10 2 0 1607980 18932 2110620 0 0 0 273 340 468 0 0 99
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 10 2 0 1607972 18932 2110624 0 0 0 266 345 474 0 1 99
0 10 2 0 1607940 18932 2110640 0 0 0 256 341 462 0 0 100
0 10 2 0 1607936 18932 2110644 0 0 0 262 339 462 0 1 99
0 10 2 0 1607940 18932 2110644 0 0 0 261 333 450 0 1 99
0 10 2 0 1607944 18932 2110644 0 0 0 253 335 454 0 0 100
0 10 2 0 1607944 18932 2110644 0 0 0 272 352 479 0 1 99
[root@ls3016 /root]# ps l
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
100 0 820 1 9 0 2200 1168 wait4 S ttyS0 0:00 login -- ro
100 0 862 820 14 0 1756 976 wait4 S ttyS0 0:00 -bash
000 0 878 862 9 0 1080 360 down D ttyS0 11:27 ./shmtst 10
000 0 879 862 9 0 1080 360 down D ttyS0 15:21 ./shmtst 15
040 0 880 878 9 0 1092 416 wait_o D ttyS0 8:55 ./shmtst 10
040 0 881 878 9 0 1080 360 down D ttyS0 10:22 ./shmtst 10
444 0 882 878 9 0 0 0 do_exi Z ttyS0 10:00 [shmtst <de
040 0 883 878 9 0 1092 416 wait_o D ttyS0 9:30 ./shmtst 10
040 0 884 878 9 0 1092 416 down D ttyS0 8:44 ./shmtst 10
040 0 885 878 9 0 1092 416 down D ttyS0 9:01 ./shmtst 10
444 0 886 878 9 0 0 0 do_exi Z ttyS0 7:59 [shmtst <de
444 0 887 879 9 0 0 0 do_exi Z ttyS0 17:11 [shmtst <de
040 0 888 878 9 0 1080 360 down D ttyS0 10:21 ./shmtst 10
040 0 889 878 9 0 1092 416 down D ttyS0 9:06 ./shmtst 10
000 0 891 862 9 0 1136 488 nanosl S ttyS0 0:23 vmstat 5
000 0 1226 862 19 0 2756 1084 - R ttyS0 0:00 ps l
[root@ls3016 /root]# 0 10 2 0 1607936 18932 2110652 0 0 0 275 368 488 0 0 99
0 10 2 0 1607912 18932 2110660 0 0 0 266 334 457 0 0 100
0 10 2 0 1607848 18932 2110672 0 0 0 302 354 498 0 0 100
0 10 2 0 1607892 18932 2110688 0 0 0 287 352 496 0 0 100
0 11 2 0 1607868 18932 2110704 0 0 1 282 338 472 0 1 99
So the processes don't finish exiting for at least 47*5 seconds. They have
shared-mmapped a plain file some 666000000 bytes long on an 8GB machine.
The rest of the machine behaves nicely.
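shmtst itself isn't attached; it does roughly the following (hypothetical
sketch only, assuming a pre-existing file ./bigfile of at least that size):

/* Hypothetical approximation of the test: map a large plain file
 * MAP_SHARED, fork some children, and dirty random pages so the
 * kernel has plenty of shared dirty data to write back. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILESIZE 666000000UL		/* ~666MB, as in the report */
#define PAGE     4096UL

int main(int argc, char **argv)
{
	int nkids = (argc > 1) ? atoi(argv[1]) : 4;
	int fd = open("bigfile", O_RDWR);  /* assumed to exist, >= FILESIZE */
	char *map;
	int i;

	if (fd < 0) {
		perror("open bigfile");
		return 1;
	}
	map = mmap(NULL, FILESIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (i = 0; i < nkids; i++)
		if (fork() == 0)
			break;		/* children drop out of the loop */

	for (;;) {			/* parent and children dirty pages */
		unsigned long page = (unsigned long)rand() % (FILESIZE / PAGE);
		map[page * PAGE]++;
	}
}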
Greetings
Christoph
On Thu, Nov 02, 2000 at 08:19:06AM +0100, Mike Galbraith wrote:
> On Wed, 1 Nov 2000, Rik van Riel wrote:
>
> > I have one possible reason for this ....
> >
> > 1) the procfs process does (in fs/proc/array.c::proc_pid_stat)
> > down(&mm->mmap_sem);
> >
> > 2) but, in order to do that, it has to wait until the process
> > it is trying to stat has /finished/ its page fault, and is
> > not into its next one ...
> >
> > 3) combine this with the elevator starvation stuff (ask Jens
> > Axboe for blk-7 to alleviate this issue) and you have a
> > scenario where processes using /proc/<pid>/stat have the
> > possibility to block on multiple processes that are in the
> > process of handling a page fault (but are being starved)
>
> I'm experimenting with blk.[67] in test10 right now. The stalls
> are not helped at all. It doesn't seem to become request bound
> (haven't instrumented that yet to be sure) but the stalls persist.
>
> -Mike
This is not an elevator starvation problem.
I also experienced these stalls with my IDE-only system. Unless I'm
badly mistaken, the elevator is only used on SCSI disks, therefore
elevator starvation cannot be blamed for this problem. These stalls
are particularly annoying since I want to find the pid of the process
hogging memory in order to kill it, but the read from /proc stalls for
45 seconds or more.
-VAL
On Thu, Nov 02 2000, Val Henson wrote:
> > > 3) combine this with the elevator starvation stuff (ask Jens
> > > Axboe for blk-7 to alleviate this issue) and you have a
> > > scenario where processes using /proc/<pid>/stat have the
> > > possibility to block on multiple processes that are in the
> > > process of handling a page fault (but are being starved)
> >
> > I'm experimenting with blk.[67] in test10 right now. The stalls
> > are not helped at all. It doesn't seem to become request bound
> > (haven't instrumented that yet to be sure) but the stalls persist.
> >
> > -Mike
>
> This is not an elevator starvation problem.
True, but the blk-xx patches help work around what I believe is
bad flushing behaviour by the VM.
> I also experienced these stalls with my IDE-only system. Unless I'm
> badly mistaken, the elevator is only used on SCSI disks, therefore
> elevator starvation cannot be blamed for this problem. These stalls
> are particularly annoying since I want to find the pid of the process
> hogging memory in order to kill it, but the read from /proc stalls for
> 45 seconds or more.
You are badly mistaken.
--
* Jens Axboe <[email protected]>
* SuSE Labs
On Thu, 2 Nov 2000, Jens Axboe wrote:
> On Thu, Nov 02 2000, Val Henson wrote:
> > > > 3) combine this with the elevator starvation stuff (ask Jens
> > > > Axboe for blk-7 to alleviate this issue) and you have a
> > > > scenario where processes using /proc/<pid>/stat have the
> > > > possibility to block on multiple processes that are in the
> > > > process of handling a page fault (but are being starved)
> > >
> > > I'm experimenting with blk.[67] in test10 right now. The stalls
> > > are not helped at all. It doesn't seem to become request bound
> > > (haven't instrumented that yet to be sure) but the stalls persist.
> > >
> > > -Mike
> >
> > This is not an elevator starvation problem.
>
> True, but the blk-xx patches help work around what I believe is
> bad flushing behaviour by the VM.
I very much agree. Kflushd is still hungry for free write
bandwidth here.
Of course it's _going_ to have to wait if you're doing max IO
throughput, but when you're flushing, you need to let kflushd
have the bandwidth it needs to do its job. I don't think it's
getting what it needs, and I am trying two things.
1. Revoke reads' ability to steal requests while we're in a
heavy flushing situation. Flushing must proceed, and it must
go at full speed. (Actually, reversing the favoritism when
you need flush bandwidth makes sense to me, and does help if
limited... if not limited, it hurts like hell.) A toy sketch
of this policy follows below.
2. Use the information that we are starving (or going full bore)
to tell the VM to keep its fingers off dirty buffers. If we're
flushing at disk speed, page_launder() can't do anything useful
with dirty buffers, it can only do harm IMHO.
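To make (1) concrete, here is a toy model of the policy (not kernel
code, just the idea in compilable form):

/* Toy model of item 1: reads may normally borrow free requests from
 * the write pool, but while heavy flushing is in progress the write
 * pool is reserved so flushing can proceed at full speed. */
#include <stdbool.h>
#include <stdio.h>

struct request_pool {
	int free_read;		/* free requests reserved for reads     */
	int free_write;		/* free requests reserved for writes    */
	bool heavy_flush;	/* set while flushing must run flat out */
};

static bool grant_request(struct request_pool *p, bool is_write)
{
	if (is_write) {
		if (p->free_write > 0) {
			p->free_write--;
			return true;
		}
		return false;
	}
	if (p->free_read > 0) {
		p->free_read--;
		return true;
	}
	/* reads only steal from the write list when flushing is idle */
	if (!p->heavy_flush && p->free_write > 0) {
		p->free_write--;
		return true;
	}
	return false;
}

int main(void)
{
	struct request_pool pool = { 0, 2, true };

	printf("read while flushing: %s\n",
	       grant_request(&pool, false) ? "granted" : "denied");
	pool.heavy_flush = false;
	printf("read while idle:     %s\n",
	       grant_request(&pool, false) ? "granted" : "denied");
	return 0;
}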
-Mike
P.S. Before I revert to Luke Codecrawler mode, I have a wild
problem theory I'd appreciate comments on.. preferably the kind
where I become extremely busy thinking about their content ;-)
If one __alloc_pages() is waiting for kswapd, kswapd tries to do
synchronous flushing. If the write queue is nearly (or fully) exhausted
and page_launder() couldn't clean any buffers on its first pass,
it blasts the queue some more and stalls. If kswapd, kflushd and
kupdate are all waiting for a request, and then say a GFP_BUFFER
allocation comes along (we're low on memory), we do a SCHED_YIELD
schedule(). If we're holding IO locks, nobody can do IO. OK, if
there's nobody else running, we come right back and either finish
the allocation or fail. But if other allocations are trying to
flush buffers (GFP_KERNEL, for example), they are not only in danger
of stacking up due to the request shortage, they also can't get
whatever IO locks the GFP_BUFFER allocation is holding, so they are
doomed until we schedule back to the GFP_BUFFER allocating task.
Isn't scheduling while holding IO locks the wrong thing to do? It's
protected from neither GFP_BUFFER nor PF_MEMALLOC.
I must be missing something... but what?
<ears at maximum gain>
On Fri, 3 Nov 2000, Mike Galbraith wrote:
> On Thu, 2 Nov 2000, Jens Axboe wrote:
>
> > On Thu, Nov 02 2000, Val Henson wrote:
> > > > > 3) combine this with the elevator starvation stuff (ask Jens
> > > > > Axboe for blk-7 to alleviate this issue) and you have a
> > > > > scenario where processes using /proc/<pid>/stat have the
> > > > > possibility to block on multiple processes that are in the
> > > > > process of handling a page fault (but are being starved)
> > > >
> > > > I'm experimenting with blk.[67] in test10 right now. The stalls
> > > > are not helped at all. It doesn't seem to become request bound
> > > > (haven't instrumented that yet to be sure) but the stalls persist.
> > > >
> > > > -Mike
> > >
> > > This is not an elevator starvation problem.
> >
> > True, but the blk-xx patches help work around what I believe is
> > bad flushing behaviour by the VM.
>
> I very much agree. Kflushd is still hungry for free write
> bandwidth here.
In the LKML tradition of code talks and silly opinions walk...
Attached is a diagnostic patch which gets kflushd under control,
and takes make -j30 bzImage build times down from 12 minutes to
9 here. I have no more massive context switching on write, and
copies seem to go a lot quicker to boot. (that may be because
some of my failures were really _really_ horrible)
Comments are very welcome. I haven't had problems with this yet,
but it's early so... This patch isn't supposed to be pretty either
(hw techs don't do pretty ;) it's only supposed to say 'Houston...'
so be sure to grab a barfbag before you take a look.
-Mike
P.S. almost forgot. vmstat freezes were shortened too :-)
On Fri, Nov 03 2000, Mike Galbraith wrote:
> > I very much agree. Kflushd is still hungry for free write
> > bandwidth here.
>
> In the LKML tradition of code talks and silly opinions walk...
>
> Attached is a diagnostic patch which gets kflushd under control,
> and takes make -j30 bzImage build times down from 12 minutes to
> 9 here. I have no more massive context switching on write, and
> copies seem to go a lot quicker to boot. (that may be because
> some of my failures were really _really_ horrible)
>
> Comments are very welcome. I haven't had problems with this yet,
> but it's early so... This patch isn't supposed to be pretty either
> (hw techs don't do pretty ;) it's only supposed to say 'Houston...'
> so be sure to grab a barfbag before you take a look.
Super, looks pretty good from here. I'll give it a go when I get back.
In addition, here's a small patch that disables the read stealing
of requests from the write list -- does that improve behaviour
when we are busy flushing?
--
* Jens Axboe <[email protected]>
* SuSE Labs
On Fri, 3 Nov 2000, Jens Axboe wrote:
> On Fri, Nov 03 2000, Mike Galbraith wrote:
> > > I very much agree. Kflushd is still hungry for free write
> > > bandwidth here.
> >
> > In the LKML tradition of code talks and silly opinions walk...
> >
> > Attached is a diagnostic patch which gets kflushd under control,
> > and takes make -j30 bzImage build times down from 12 minutes to
> > 9 here. I have no more massive context switching on write, and
> > copies seem to go a lot quicker to boot. (that may be because
> > some of my failures were really _really_ horrible)
> >
> > Comments are very welcome. I haven't had problems with this yet,
> > but it's early so... This patch isn't supposed to be pretty either
> > (hw techs don't do pretty ;) it's only supposed to say 'Houston...'
> > so be sure to grab a barfbag before you take a look.
>
> Super, looks pretty good from here. I'll give it a go when I get back.
> In addition, here's a small patch that disables the read stealing
> of requests from the write list -- does that improve behaviour
> when we are busy flushing?
Yes. I've done this a bit differently here, and have had good
results. I only disable stealing when I need flush throughput.
Now that the box isn't biting off more than it can chew quite
as often, I'll try this again. I'm pretty darn sure that I can
get more throughput, but :> I've learned that getting too much
can do really OOGLY things. (turns box into single user single
tasking streaming IO monster from hell)
-Mike