2001-07-07 13:42:18

by Jeff Garzik

Subject: VM in 2.4.7-pre hurts...

Oh this is a fun one :)

When building gcc-2.96 RPM using gcc-2.96 under kernel 2.4.7 on alpha,
the system goes --deeply-- into swap. Not pretty at all. The system
will be 200MB+ into swap, with 200MB+ in cache! I presume this affects
2.4.7-release also.

System has 256MB of swap, and 384MB of RAM.

Only patches applied are Rik's recent OOM killer friendliness patch, and
Andrea's ksoftirq patch.

I ran "vmstat 3" throughout the build, and that output is attached. I
also manually ran "ps wwwaux >> ps.txt" periodically. This second
output is not overly helpful, because the system was swapping and
unusable for the times when the 'ps' output would be most useful.

Both outputs are attached, as they compressed pretty nicely.

--
Jeff Garzik | A recent study has shown that too much soup
Building 1024 | can cause malaise in laboratory mice.
MandrakeSoft |


Attachments:
vmstat.txt.bz2 (5.84 kB)
ps.txt.bz2 (15.92 kB)

2001-07-07 14:06:21

by Jeff Garzik

Subject: Re: VM in 2.4.7-pre hurts...

Jeff Garzik wrote:
>
> Oh this is a fun one :)
>
> When building gcc-2.96 RPM using gcc-2.96 under kernel 2.4.7 on alpha,
> the system goes --deeply-- into swap. Not pretty at all. The system
> will be 200MB+ into swap, with 200MB+ in cache! I presume this affects
> 2.4.7-release also.
>
> System has 256MB of swap, and 384MB of RAM.
>
> Only patches applied are Rik's recent OOM killer friendliness patch, and
> Andrea's ksoftirq patch.
>
> I ran "vmstat 3" throughout the build, and that output is attached. I
> also manually ran "ps wwwaux >> ps.txt" periodically. This second
> output is not overly helpful, because the system was swapping and
> unuseable for the times when the 'ps' output would be most useful.

Sorry, I forgot to mention that the OOM killer kicked in twice. You can
probably pick out the points where it kicked in, in the vmstat output.

--
Jeff Garzik | A recent study has shown that too much soup
Building 1024 | can cause malaise in laboratory mice.
MandrakeSoft |

2001-07-07 17:29:36

by Linus Torvalds

Subject: Re: VM in 2.4.7-pre hurts...


On Sat, 7 Jul 2001, Jeff Garzik wrote:
>
> When building gcc-2.96 RPM using gcc-2.96 under kernel 2.4.7 on alpha,
> the system goes --deeply-- into swap. Not pretty at all. The system
> will be 200MB+ into swap, with 200MB+ in cache! I presume this affects
> 2.4.7-release also.

Note that "200MB+ into swap, with 200MB+ in cache" is NOT bad in itself.

It only means that we have scanned the VM, and allocated swap-space for
200MB worth of VM space. It does NOT necessarily mean that any actual
swapping has been taking place: you should realize that the "cache" is
likely to be at least partly the _swap_ cache that hasn't been written
out.

This is an accounting problem, nothing more. It looks strange, but it's
normal.

Now, the fact that the system appears unusable does obviously mean that
something is wrong. But you're barking up the wrong tree.

Although it might be the "right tree" in the sense that we might want to
remove the swap cache from the "cached" output in /proc/meminfo. It might
be more useful to separate out "Cached" and "SwapCache": add a new line to
/proc/meminfo that is "swapper_space.nrpages", and make the current code
that does

atomic_read(&page_cache_size)

do

(atomic_read(&page_cache_size) - swapper_space.nrpages)

instead. That way the vmstat output might be more useful, although vmstat
obviously won't know about the new "SwapCache:" field..

Can you try that, and see if something else stands out once the misleading
accounting is taken care of?
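
In other words, roughly something like this - an untested sketch, and the
exact layout of the surrounding meminfo code may differ from what I
remember:

	/* Sketch only: report the swap cache separately instead of folding
	 * it into "Cached:". Uses the 2.4 globals page_cache_size and
	 * swapper_space; the sprintf layout here is paraphrased. */
	unsigned long swapcache = swapper_space.nrpages;
	unsigned long cached = atomic_read(&page_cache_size) - swapcache;

	len += sprintf(page + len,
		"Cached:       %8lu kB\n"
		"SwapCache:    %8lu kB\n",
		cached << (PAGE_SHIFT - 10),
		swapcache << (PAGE_SHIFT - 10));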

Linus

2001-07-07 17:37:28

by Jeff Garzik

Subject: Re: VM in 2.4.7-pre hurts...

Linus Torvalds wrote:
>
> On Sat, 7 Jul 2001, Jeff Garzik wrote:
> >
> > When building gcc-2.96 RPM using gcc-2.96 under kernel 2.4.7 on alpha,
> > the system goes --deeply-- into swap. Not pretty at all. The system
> > will be 200MB+ into swap, with 200MB+ in cache! I presume this affects
> > 2.4.7-release also.
>
> Note that "200MB+ into swap, with 200MB+ in cache" is NOT bad in itself.
>
> It only means that we have scanned the VM, and allocated swap-space for
> 200MB worth of VM space. It does NOT necessarily mean that any actual
> swapping has been taking place: you should realize that the "cache" is
> likely to be not at least partly the _swap_ cache that hasn't been written
> out.
>
> This is an accounting problem, nothing more. It looks strange, but it's
> normal.
>
> Now, the fact that the system appears unusable does obviously mean that
> something is wrong. But you're barking up the wrong tree.

Two additional data points,

1) partially kernel-unrelated. MDK's "make" macro didn't support
alpha's /proc/cpuinfo output, "make -j$numprocs" became "make -j" and
fun ensued.

2) I agree that 200MB into swap and 200MB into cache isn't bad per se,
but when it triggers the OOM killer it is bad.

Alan suggested that I insert the following into the OOM killer code, as
the last test before returning 1.

cnt++;
if ((cnt % 5000) != 0)
return 0;

I did this, and while watching "vmstat 3", the cache was indeed being
trimmed, whereas it was not before.

So, the OOM killer appears to be getting triggered early, but the rest
of the report was my screwup, not the kernel's.
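
For reference, this is roughly where the hack sits - a sketch from memory,
since the exact shape of the 2.4.7 OOM check isn't quoted here, so take the
surrounding tests and the function name as paraphrased:

	/* Paraphrased OOM decision function: return 1 when something
	 * should be killed. Alan's throttle makes it give up on 4999
	 * out of every 5000 calls instead of firing immediately. */
	int out_of_memory(void)
	{
		static unsigned long cnt;

		if (nr_free_pages() > freepages.min)
			return 0;		/* still have free pages */
		if (nr_swap_pages > 0)
			return 0;		/* still have free swap */

		cnt++;
		if ((cnt % 5000) != 0)		/* Alan's suggested hack */
			return 0;
		return 1;			/* really OOM: kill something */
	}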

--
Jeff Garzik | A recent study has shown that too much soup
Building 1024 | can cause malaise in laboratory mice.
MandrakeSoft |

2001-07-07 18:01:01

by Rik van Riel

Subject: Re: VM in 2.4.7-pre hurts...

On Sat, 7 Jul 2001, Jeff Garzik wrote:

> 2) I agree that 200MB into swap and 200MB into cache isn't bad
> per se, but when it triggers the OOM killer it is bad.

Please read my patch for the OOM killer. It subtracts the
swap cache from the cache figure you quote and ONLY goes
into oom_kill() if the page & buffer cache together take
less than 4% of memory (see /proc/sys/vm/{buffermem,pagecache}).
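
In rough code terms the check looks like this - a sketch of the idea only;
the variable names and the literal percentage (the real patch reads the
buffermem/pagecache sysctls mentioned above) are not taken from the patch:

	/* Don't even consider oom_kill() while a useful amount of cache
	 * remains. page_cache_size includes the swap cache, so subtract
	 * it before comparing against the ~4% threshold. */
	int cache_pages = atomic_read(&page_cache_size)
				- swapper_space.nrpages
				+ atomic_read(&buffermem_pages);

	if (cache_pages * 100 > num_physpages * 4)
		return 0;			/* not OOM yet */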

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-07 17:55:01

by Linus Torvalds

Subject: Re: VM in 2.4.7-pre hurts...


On Sat, 7 Jul 2001, Jeff Garzik wrote:
> Linus Torvalds wrote:
> >
> > Now, the fact that the system appears unusable does obviously mean that
> > something is wrong. But you're barking up the wrong tree.
>
> Two more additional data points,
>
> 1) partially kernel-unrelated. MDK's "make" macro didn't support
> alpha's /proc/cpuinfo output, "make -j$numprocs" became "make -j" and
> fun ensued.

Ahh, well..

The kernel source code is set up to scale quite well, so yes a "make -j"
will parallelise a bit too well for most machines, and you'll certainly
run out of memory on just about anything (I routinely get load averages of
30+, and yes, you need at least half a GB of RAM for it to not be
unpleasant - and probably more like a full gigabyte on an alpha).

So I definitely think the kernel likely did the right thing. It's not even
clear that the OOM killer might not have been right - due to the 2.4.x
swap space allocation, 256MB of swap-space is a bit tight on a 384MB
machine that actually wants to use a lot of memory.

> 2) I agree that 200MB into swap and 200MB into cache isn't bad per se,
> but when it triggers the OOM killer it is bad.

Note that it might easily have been 256MB into swap (ie it had eaten _all_
of your swap) at some stage - and you just didn't see it in the vmstat
output because obviously at that point the machine was a bit loaded.

But neutering the OOM killer like Alan suggested may be a rather valid
approach anyway. Delaying the killing sounds valid: if we're truly
livelocked on the VM, we'll be calling down to the OOM killer so much that
it's probably quite valid to say "only return 1 after X iterations".

Linus

2001-07-07 18:09:21

by Jeff Garzik

Subject: Re: VM in 2.4.7-pre hurts...

Linus Torvalds wrote:
>
> On Sat, 7 Jul 2001, Jeff Garzik wrote:
> > Linus Torvalds wrote:
> > >
> > > Now, the fact that the system appears unusable does obviously mean that
> > > something is wrong. But you're barking up the wrong tree.
> >
> > Two more additional data points,
> >
> > 1) partially kernel-unrelated. MDK's "make" macro didn't support
> > alpha's /proc/cpuinfo output, "make -j$numprocs" became "make -j" and
> > fun ensued.
>
> Ahh, well..
>
> The kernel source code is set up to scale quite well, so yes a "make -j"
> will parallellise a bit teoo well for most machines, and you'll certainly
> run out of memory on just about anything (I routinely get load averages of
> 30+, and yes, you need at least half a GB of RAM for it to not be
> unpleasant - and probably more like a full gigabyte on an alpha).

"make -j" is a lot of fun on a dual athlon w/ 512mb :)

> So I definitely think the kernel likely did the right thing. It's not even
> clear that the OOM killer might not have been right - due to the 2.4.x
> swap space allocation, 256MB of swap-space is a bit tight on a 384MB
> machine that actually wants to use a lot of memory.

Sigh. Since I am a VM ignoramus, I doubt my opinion matters much at all
here... but it would be nice if oddball configurations like 384MB with
50MB swap could be supported. I don't ask that it perform optimally at
all, but at least the machine should behave predictably...

This type of swap configuration makes sense for "my working set is
pretty much always in RAM, including i/dcache, but let's have some swap
just-in-case"


> > 2) I agree that 200MB into swap and 200MB into cache isn't bad per se,
> > but when it triggers the OOM killer it is bad.
>
> Note that it might easily have been 256MB into swap (ie it had eaten _all_
> of your swap) at some stage - and you just didn't see it in the vmstat
> output because obviously at that point the machine was a bit loaded.

I'm pretty sure swap was 100% full. I should have sysrq'd and checked
but I forgot.


> But neutering the OOM killer like Alan suggested may be a rather valid
> approach anyway. Delaying the killing sounds valid: if we're truly
> livelocked on the VM, we'll be calling down to the OOM killer so much that
> it's probably quite valid to say "only return 1 after X iterations".

cnt % 5000 may have been a bit extreme but it was fun to see it thrash.
sysrq was pretty much the only way left to talk to the system.

--
Jeff Garzik | A recent study has shown that too much soup
Building 1024 | can cause malaise in laboratory mice.
MandrakeSoft |

2001-07-07 18:12:11

by Rik van Riel

Subject: Re: VM in 2.4.7-pre hurts...

On Sat, 7 Jul 2001, Jeff Garzik wrote:

> Sigh. since I am a VM ignoramus I doubt my opinion matters much
> at all here... but it would be nice if oddball configurations
> like 384MB with 50MB swap could be supported.

It would be fun if we had 48 hours in a day, too ;)

This particular thing has been on the TODO list of the
VM developers for a while, but we just haven't gotten
around to it.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-07 18:56:03

by Linus Torvalds

Subject: Re: VM in 2.4.7-pre hurts...


On Sat, 7 Jul 2001, Rik van Riel wrote:
>
> Not at all. Note that try_to_swap_out() will happily
> create swap cache pages with a very high page->age,
> pages which are in absolutely no danger of being
> evicted from memory...

That seems to be a bug in "add_to_swap_cache()".

In fact, I do not see any part of the whole path that sets the page age at
all, so we're basically using a completely uninitialized field here (it's
been initialized way back when the page was allocated, but because it
hasn't been part of the normal aging scheme it has only been aged up,
never down, so the value is pretty much random by the time we actually add
it to the swap cache pool).

Suggested fix:

--- v2.4.6/linux/mm/swap_state.c Tue Jul 3 17:08:22 2001
+++ linux/mm/swap_state.c Sat Jul 7 11:49:13 2001
@@ -81,6 +81,7 @@
BUG();
flags = page->flags & ~((1 << PG_error) | (1 << PG_arch_1));
page->flags = flags | (1 << PG_uptodate);
+ page->age = PAGE_AGE_START;
add_to_page_cache_locked(page, &swapper_space, entry.val);
}

Does that make a difference for people?

That would certainly help explain why aging doesn't work for some people.

Linus

2001-07-07 20:12:19

by Rik van Riel

Subject: Re: VM in 2.4.7-pre hurts...

On Sat, 7 Jul 2001, Linus Torvalds wrote:

> In fact, I do not see any part of the whole path that sets the
> page age at all, so we're basically using a completely
> uninitialized field here (it's been initialized way back when
> the page was allocated, but because it hasn't been part of the
> normal aging scheme it has only been aged up, never down, so the
> value is pretty much random by the time we actually add it to
> the swap cache pool).

Not quite. The more a page has been used, the higher the
page->age will be. This means the system has a way to
distinguish between anonymous pages which were used once
and anonymous pages which are used lots of times.


> Suggested fix:

[snip disabling of page aging for anonymous memory]

> That would certainly help explain why aging doesn't work for some people.

As would your patch ;)

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-07 21:25:58

by Alan

Subject: Re: VM in 2.4.7-pre hurts...

> instead. That way the vmstat output might be more useful, although vmstat
> obviously won't know about the new "SwapCache:" field..
>
> Can you try that, and see if something else stands out once the misleading
> accounting is taken care of?

It's certainly misleading. I got Jeff to try making oom return 4999 out of 5000
times regardless.

2001-07-07 21:29:58

by Rik van Riel

Subject: Re: VM in 2.4.7-pre hurts...

On Sat, 7 Jul 2001, Alan Cox wrote:

> > instead. That way the vmstat output might be more useful, although vmstat
> > obviously won't know about the new "SwapCache:" field..
> >
> > Can you try that, and see if something else stands out once the misleading
> > accounting is taken care of?
>
> Its certainly misleading. I got Jeff to try making oom return
> 4999 out of 5000 times regardless.

In that case, he _is_ OOM. ;)

1) (almost) no free memory
2) no free swap
3) very little pagecache + buffer cache

regards,

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-07 21:33:28

by Alan

Subject: Re: VM in 2.4.7-pre hurts...

> But neutering the OOM killer like Alan suggested may be a rather valid
> approach anyway. Delaying the killing sounds valid: if we're truly
> livelocked on the VM, we'll be calling down to the OOM killer so much that
> it's probably quite valid to say "only return 1 after X iterations".

It's hiding the real accounting screw-up with a 'goes bang at random less
often' - a nice hack, but IMHO a bad long-term approach. We need to get the maths
right. We had similar 2.2 problems the other way (with nasty deadlocks)
until Andrea fixed that.


2001-07-07 21:34:38

by Jeff Garzik

Subject: Re: VM in 2.4.7-pre hurts...

Rik van Riel wrote:
>
> On Sat, 7 Jul 2001, Alan Cox wrote:
>
> > > instead. That way the vmstat output might be more useful, although vmstat
> > > obviously won't know about the new "SwapCache:" field..
> > >
> > > Can you try that, and see if something else stands out once the misleading
> > > accounting is taken care of?
> >
> > Its certainly misleading. I got Jeff to try making oom return
> > 4999 out of 5000 times regardless.
>
> In that case, he _is_ OOM. ;)
>
> 1) (almost) no free memory
> 2) no free swap
> 3) very little pagecache + buffer cache

It got -considerably- farther after Alan's suggested hack to the OOM
killer; so at least in this instance, the OOM killer appeared to me to be
killing too early...

--
Jeff Garzik | A recent study has shown that too much soup
Building 1024 | can cause malaise in laboratory mice.
MandrakeSoft |

2001-07-07 21:43:28

by Alan

Subject: Re: VM in 2.4.7-pre hurts...

> > Its certainly misleading. I got Jeff to try making oom return
> > 4999 out of 5000 times regardless.
>
> In that case, he _is_ OOM. ;)

Hardly

> 1) (almost) no free memory
> 2) no free swap
> 3) very little pagecache + buffer cache

Large amounts of cache, which went away when the OOM code was neutered

2001-07-07 21:45:38

by Rik van Riel

Subject: Re: VM in 2.4.7-pre hurts...

On Sat, 7 Jul 2001, Alan Cox wrote:

> > > Its certainly misleading. I got Jeff to try making oom return
> > > 4999 out of 5000 times regardless.
> >
> > In that case, he _is_ OOM. ;)
>
> Hardly
>
> > 1) (almost) no free memory
> > 2) no free swap
> > 3) very little pagecache + buffer cache
>
> Large amounts of cache, which went away when the OOM code was neutered

So Jeff backed out my patch before testing yours? ;)

Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"


http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/

2001-07-08 15:43:57

by Rik van Riel

Subject: Re: VM in 2.4.7-pre hurts...

On Sun, 8 Jul 2001, Mike Galbraith wrote:

> is very oom with no disk activity. It _looks_ (xmm and vmstat) like
> it just ran out of cleanable dirty pages. With or without swap,

... Bingo. You hit the infamous __wait_on_buffer / ___wait_on_page
bug. I've seen this for quite a while now on our quad xeon test
machine, with some kernel versions it can be reproduced in minutes,
with others it won't trigger at all.

And after a recompile it's usually gone ...

I hope there is somebody out there who can RELIABLY trigger
this bug, so we have a chance of tracking it down.

> tar
> Trace; c012f2da <__wait_on_buffer+6a/8c>
> Trace; c01303c9 <bread+45/64>
> Trace; c01500ea <ext2_read_inode+fe/3c8>
> Trace; c01411f5 <get_new_inode+d1/15c>
> Trace; c0141416 <iget4+c2/d4>
> Trace; c0150b03 <ext2_lookup+43/68>
> Trace; c0138401 <path_walk+529/748>
> Trace; c0137aed <getname+5d/9c>
> Trace; c01389d8 <__user_walk+3c/58>
> Trace; c0135cc6 <sys_lstat64+16/70>
> Trace; c0106ae3 <system_call+33/38>

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-07-08 17:13:09

by Linus Torvalds

Subject: Re: VM in 2.4.7-pre hurts...


On Sat, 7 Jul 2001, Rik van Riel wrote:
>
> Not quite. The more a page has been used, the higher the
> page->age will be. This means the system has a way to
> distinguish between anonymous pages which were used once
> and anonymous pages which are used lots of times.

Wrong.

We already _have_ that aging: it's called "do not add anonymous pages to
the page cache unless they are old".

Pages that are used lots of time won't ever _get_ to the point where they
get added to the swap cache, because they are always marked young.

So by the time we get to this point, we _know_ what the age should be. I
tried to explain this to you earlier. We should NOT use the old
"page->age", because that one is 100% and totally bogus. It has _nothing_
to do with the page age. It's been randomly incremented, without ever
having been on any of the aging lists, and as such it is a totally bogus
number.

In comparison, just setting page->age to PAGE_AGE_START is _not_ a random
number. It's a reasonable number that depends on the _knowledge_ that

(a) _had_ the page been on any of the aging lists, it would have been
aged down every time we passed it, and
(b) it's obviously been aged up every time we passed it in the VM so far
(because it hadn't been added to the swap cache earlier).

Are you with me?

Now, add to the above two _facts_, the knowledge that the aging of the VM
space is done roughly at the same rate as the aging of the active lists
(we call "swap_out()" every time we age the active list when under memory
pressure, and they go through similar percentages of their respective
address spaces), and you get

- an anonymous page, by the time we add it to the swap cache, would have
been aged down and up roughly the same number of times.

Ergo, its age _should_ be the same as PAGE_AGE_START.

> > That would certainly help explain why aging doesn't work for some people.
>
> As would your patch ;)

No. Do the math. My patch gives the page the _right_ age. Previously, it
had a completely random age that had _nothing_ to do with any other page
age.

Linus

2001-07-08 17:16:59

by Mike Galbraith

Subject: Re: VM in 2.4.7-pre hurts...

On Sun, 8 Jul 2001, Rik van Riel wrote:

> On Sun, 8 Jul 2001, Mike Galbraith wrote:
>
> > is very oom with no disk activity. It _looks_ (xmm and vmstat) like
> > it just ran out of cleanable dirty pages. With or without swap,
>
> ... Bingo. You hit the infamous __wait_on_buffer / ___wait_on_page
> bug. I've seen this for quite a while now on our quad xeon test
> machine, with some kernel versions it can be reproduced in minutes,
> with others it won't trigger at all.
>
> And after a recompile it's usually gone ...
>
> I hope there is somebody out there who can RELIABLY trigger
> this bug, so we have a chance of tracking it down.

Well, my box seems to think I'm a somebody. If it changes its mind,
I'll let you know. I'll throw whatever rocks I can find at it to get
it all angry and confused. You sneak up behind it and do the stake and
mallet number.

tar -rvf /dev/null /usr/local (10 gig of.. mess) with X/KDE running
seems 100% repeatable here.

'scuse me while I go recompile again and hope it just goes away ;-)

-Mike

2001-07-08 17:59:45

by Linus Torvalds

Subject: Re: VM in 2.4.7-pre hurts...


On Sun, 8 Jul 2001, Rik van Riel wrote:
>
> ... Bingo. You hit the infamous __wait_on_buffer / ___wait_on_page
> bug. I've seen this for quite a while now on our quad xeon test
> machine, with some kernel versions it can be reproduced in minutes,
> with others it won't trigger at all.

Hmm.. That would explain why the "tar" gets stuck, but why does the whole
machine grind to a halt with all other processes being marked runnable?

> I hope there is somebody out there who can RELIABLY trigger
> this bug, so we have a chance of tracking it down.
>
> > tar
> > Trace; c012f2da <__wait_on_buffer+6a/8c>
> > Trace; c01303c9 <bread+45/64>

I wonder if "getblk()" returned a locked not-up-to-date buffer.. That
would explain how the buffer stays locked forever - the "ll_rw_block()"
will not actually submit any IO on a locked buffer, so there won't be any
IO to release it.

And it's interesting to see that this happens for an _inode_ block, not a
data block - which could easily have been dirty and scheduled for a
write-out. So I wonder if there is some race between "write buffer and try
to free it" and "getblk()".

The locking in "try_to_free_buffers()" is rather anal, so I don't see how
this could happen, but..

That still doesn't explain why everybody is busy running. I'd have
expected all the processes to end up waiting for the page or buffer, not
stuck in a live-lock.

Linus

2001-07-08 18:30:19

by Rik van Riel

Subject: Re: VM in 2.4.7-pre hurts...

On Sun, 8 Jul 2001, Linus Torvalds wrote:


> (a) _had_ the page been on any of the aging lists, it would have been
> aged down every time we passed it, and
> (b) it's obviously been aged up every time we passed it in the VM so far
> (because it hadn't been added to the swap cache earlier).

> - an anonymous page, by the time we add it to the swap cache, would have
> been aged down and up roughly the same number of times.

Hmmm, indeed. I guess this also means page aging in its
current form cannot even work well with exponential down
aging since the down aging on the pageout list always
cancels out the up aging in swap_out() ...

I guess it's time we found some volunteers to experiment
with linear down aging (page->age--;) since that one will
be able to withstand pages being referenced only in the
page tables.

(now, off to a project 4000 km from home for the next 2
weeks ... bbl)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-07-08 18:24:30

by Rik van Riel

Subject: Re: VM in 2.4.7-pre hurts...

On Sun, 8 Jul 2001, Linus Torvalds wrote:
> On Sun, 8 Jul 2001, Rik van Riel wrote:
> >
> > ... Bingo. You hit the infamous __wait_on_buffer / ___wait_on_page
> > bug. I've seen this for quite a while now on our quad xeon test
> > machine, with some kernel versions it can be reproduced in minutes,
> > with others it won't trigger at all.
>
> Hmm.. That would explain why the "tar" gets stuck, but why does the whole
> machine grind to a halt with all other processes being marked runnable?

If __wait_on_buffer and ___wait_on_page get stuck, this could
mean a page doesn't get unlocked. When this is happening, we
may well be running into dozens of pages which aren't getting
properly unlocked on IO completion.

This in turn would get the rest of the system stuck in the
pageout code path, eating CPU like crazy.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-07-08 19:31:12

by Linus Torvalds

Subject: Re: VM in 2.4.7-pre hurts...


On Sun, 8 Jul 2001, Rik van Riel wrote:
>
> If __wait_on_buffer and ___wait_on_page get stuck, this could
> mean a page doesn't get unlocked. When this is happening, we
> may well be running into a dozens of pages which aren't getting
> properly unlocked on IO completion.

Absolutely. But that, in turn, should just cause others to get stuck, not
to keep running, no?

Anyway, having looked at the buffer case, I think I found a potentially
nasty bug: "unlock_buffer()" with a buffer count of zero.

Why is this nasty? unlock_buffer() does:

extern inline void unlock_buffer(struct buffer_head *bh)
{
clear_bit(BH_Lock, &bh->b_state);
smp_mb__after_clear_bit();
if (waitqueue_active(&bh->b_wait))
wake_up(&bh->b_wait);
}

but by doing the "clear_bit()", it also potentially frees the buffer, so
an interrupt coming in (or another CPU) can end up doing a kfree() on the
bh.

At which point the "waitqueue_active()" and the wakeup call are operating
on random memory.

This does not explain __wait_on_buffer(), but it's a bug nonetheless.

Can anybody find anything else fishy with buffer handling?

Linus

2001-07-09 03:00:34

by Linus Torvalds

Subject: Re: VM in 2.4.7-pre hurts...


On Sun, 8 Jul 2001, Linus Torvalds wrote:
>
> Anyway, having looked at the buffer case, I htink I found a potentially
> nasty bug: "unlock_buffer()" with a buffer count of zero.

[ Basic problem: write-completion race with the code that does
"try_to_free_buffers" ]

Suggested fix:
- every b_end_io() function should decrement the buffer count as the
_last_ thing it does to the buffer.
- every bh IO submitter would have to increment the bh count before
submitting it for IO.

This way, I don't think try_to_free_buffers() can be confused: the only
thing that can suddenly start using a buffer is:
- finding it on a buffer list (eg the BUF_DIRTY list) in order to wait on
it or submit it for IO.
- finding it on the hash list (ie "getblk()") during a bh lookup.

In both cases, try_to_free_buffers() owns the spinlocks that are required
for the list/hash lookup, so these things are nicely synchronized with
trying to free the buffer.

The only thing I can see that _wasn't_ synchronized with trying to free
the buffer was the IO completion, and if we make it the rule that the IO
is always handled with the buffer count elevated, and that the last thing
the IO completion does is equivalent to

mb(); /* make sure all other bh changes are visible to other CPU's */
atomic_dec(&bh->b_count); /* potentially free the bh */

then we're ok (and "unlock_buffer()" ends up acting as the "mb()", so that
part is covered). This way, if "try_to_free_buffers()" has seen
bh->b_count being zero (which is the only case where it will consider
freeing the buffer), we know that we're not racing with IO completion.

Do people agree with this, or see any holes in it?
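
Spelled out as code, the rule is something like the following - a condensed
sketch, not the literal pre4 diff, and the submit-side helper name here is
made up:

	/* Completion side: dropping the reference is the _last_ thing
	 * that touches the bh; unlock_buffer() doubles as the mb(). */
	static void end_buffer_io_sync(struct buffer_head *bh, int uptodate)
	{
		mark_buffer_uptodate(bh, uptodate);
		unlock_buffer(bh);
		atomic_dec(&bh->b_count);	/* may free the bh now */
	}

	/* Submit side: IO in flight always shows up as an elevated count. */
	static void start_buffer_io(struct buffer_head *bh, int rw)
	{
		atomic_inc(&bh->b_count);
		bh->b_end_io = end_buffer_io_sync;
		submit_bh(rw, bh);
	}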

Linus

2001-07-09 07:58:31

by Mike Galbraith

Subject: Re: VM in 2.4.7-pre hurts...

On Sun, 8 Jul 2001, Rik van Riel wrote:

> On Sun, 8 Jul 2001, Linus Torvalds wrote:
> > On Sun, 8 Jul 2001, Rik van Riel wrote:
> > >
> > > ... Bingo. You hit the infamous __wait_on_buffer / ___wait_on_page
> > > bug. I've seen this for quite a while now on our quad xeon test
> > > machine, with some kernel versions it can be reproduced in minutes,
> > > with others it won't trigger at all.
> >
> > Hmm.. That would explain why the "tar" gets stuck, but why does the whole
> > machine grind to a halt with all other processes being marked runnable?
>
> If __wait_on_buffer and ___wait_on_page get stuck, this could
> mean a page doesn't get unlocked. When this is happening, we
> may well be running into a dozens of pages which aren't getting
> properly unlocked on IO completion.
>
> This in turn would get the rest of the system stuck in the
> pageout code path, eating CPU like crazy.

I don't know exactly what is happening, but I do know _who_ is causing
the problem I'm seeing.. it's tmpfs. When mounted on /tmp and running
X/KDE, the tar [1] will oom my box every time because page_launder tries
and always fails to get anything scrubbed after the tar has run for a
while. Unmount tmpfs/restart X and do the same tar, and all is well.

(it's not locked pages apparently. I modified page_launder to move those
to the active list, and refill_inactive_scan to rotate them to the end
of the active list. inactive_dirty list still grows ever larger, filling
with 'stuff' that page_launder can't clean until you're totally oom)

-Mike

2001-07-09 08:27:50

by Christoph Rohland

Subject: Re: VM in 2.4.7-pre hurts...

Hi Mike,

On Mon, 9 Jul 2001, Mike Galbraith wrote:
> I don't know exactly what is happening, but I do know _who_ is
> causing the problem I'm seeing.. it's tmpfs. When mounted on /tmp
> and running X/KDE, the tar [1] will oom my box every time because
> page_launder trys and always failing to get anything scrubbed after
> the tar has run for a while. Unmount tmpfs/restart X and do the
> same tar, and all is well.
>
> (it's not locked pages aparantly. I modified page_launder to move
> those to the active list, and refill_inactive_scan to rotate them to
> the end of the active list. inactive_dirty list still grows ever
> larger, filling with 'stuff' that page_launder can't clean until
> you're totally oom)

Have you set the size parameter for tmpfs? Otherwise it will grow until
oom.

Another point I see with tmpfs is the following: tmpfs writepage
simply moves the page to the swap cache. But it does not schedule a
writeout of the page. So we have to scan the swap cache to really free
memory.
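
To make that concrete, shmem_writepage() in 2.4 boils down to roughly the
following - heavily paraphrased (compare the fragment quoted later in this
thread), with the locking, refcounting and error handling left out:

	static int shmem_writepage(struct page *page)
	{
		swp_entry_t swap = get_swap_page();	/* reserve a swap slot */

		if (!swap.val)
			return 1;	/* no free swap: the page cannot move */

		/* detach the data page from the shmem mapping ... */
		remove_inode_page(page);
		/* ... and attach it to the swap cache, still dirty */
		add_to_swap_cache(page, swap);

		set_page_dirty(page);	/* no I/O submitted here: writeout only
					 * happens when the swap cache is scanned */
		UnlockPage(page);
		return 0;
	}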

Greetings
Christoph


2001-07-09 09:20:15

by Mike Galbraith

Subject: Re: VM in 2.4.7-pre hurts...

On 9 Jul 2001, Christoph Rohland wrote:

> Hi Mike,
>
> On Mon, 9 Jul 2001, Mike Galbraith wrote:
> > I don't know exactly what is happening, but I do know _who_ is
> > causing the problem I'm seeing.. it's tmpfs. When mounted on /tmp
> > and running X/KDE, the tar [1] will oom my box every time because
> > page_launder trys and always failing to get anything scrubbed after
> > the tar has run for a while. Unmount tmpfs/restart X and do the
> > same tar, and all is well.
> >
> > (it's not locked pages aparantly. I modified page_launder to move
> > those to the active list, and refill_inactive_scan to rotate them to
> > the end of the active list. inactive_dirty list still grows ever
> > larger, filling with 'stuff' that page_launder can't clean until
> > you're totally oom)
>
> Do you have set the size parameter for tmpfs? Else it will grow until
> oom.

No, but that doesn't appear to be the problem. The patchlet below
fixes^Wmakes it work ok without ooming, whether I have swap enabled
or not.

--- mm/shmem.c.org Mon Jul 9 09:03:27 2001
+++ mm/shmem.c Mon Jul 9 09:03:46 2001
@@ -264,8 +264,8 @@
info->swapped++;

spin_unlock(&info->lock);
-out:
set_page_dirty(page);
+out:
UnlockPage(page);
return error;
}

So, did I fix it or just bust it in a convenient manner ;-)

-Mike

Rik: kswapd should check oom even if there is no inactive shortage, else
when you can't scrub you get livelock instead of kaboom. Maybe a count
of loops before killing things, but IMHO a check is needed.

2001-07-09 09:32:18

by Christoph Rohland

Subject: Re: VM in 2.4.7-pre hurts...

Hi Mike,

On Mon, 9 Jul 2001, Mike Galbraith wrote:
> --- mm/shmem.c.org Mon Jul 9 09:03:27 2001
> +++ mm/shmem.c Mon Jul 9 09:03:46 2001
> @@ -264,8 +264,8 @@
> info->swapped++;
>
> spin_unlock(&info->lock);
> -out:
> set_page_dirty(page);
> +out:
> UnlockPage(page);
> return error;
> }
>
> So, did I fix it or just bust it in a convenient manner ;-)

... now you drop random pages. This of course helps reducing memory
pressure ;-)

But still this may be a hint. You are not running out of swap, are
you?

Greetings
Christoph


2001-07-09 09:39:48

by Mike Galbraith

Subject: Re: VM in 2.4.7-pre hurts...

On 9 Jul 2001, Christoph Rohland wrote:

> Hi Mike,
>
> On Mon, 9 Jul 2001, Mike Galbraith wrote:
> > --- mm/shmem.c.org Mon Jul 9 09:03:27 2001
> > +++ mm/shmem.c Mon Jul 9 09:03:46 2001
> > @@ -264,8 +264,8 @@
> > info->swapped++;
> >
> > spin_unlock(&info->lock);
> > -out:
> > set_page_dirty(page);
> > +out:
> > UnlockPage(page);
> > return error;
> > }
> >
> > So, did I fix it or just bust it in a convenient manner ;-)
>
> ... now you drop random pages. This of course helps reducing memory
> pressure ;-)

(shoot. I figured that was too easy to be right)

> But still this may be a hint. You are not running out of swap, aren't
> you?

I'm running oom whether I have swap enabled or not. The inactive
dirty list starts growing forever, until it's full of (apparently)
dirty pages and I'm utterly oom.

With swap enabled, I keep allocating until there's nothing left.
Actual space usage is roughly 30mb (of 256mb), but when you can't
allocate anymore you're toast too, with the same dirt buildup.

-Mike

2001-07-09 11:18:44

by Mike Galbraith

Subject: Re: VM in 2.4.7-pre hurts...

On Mon, 9 Jul 2001, Mike Galbraith wrote:

> On 9 Jul 2001, Christoph Rohland wrote:
>
> > Hi Mike,
> >
> > On Mon, 9 Jul 2001, Mike Galbraith wrote:
> > > --- mm/shmem.c.org Mon Jul 9 09:03:27 2001
> > > +++ mm/shmem.c Mon Jul 9 09:03:46 2001
> > > @@ -264,8 +264,8 @@
> > > info->swapped++;
> > >
> > > spin_unlock(&info->lock);
> > > -out:
> > > set_page_dirty(page);
> > > +out:
> > > UnlockPage(page);
> > > return error;
> > > }
> > >
> > > So, did I fix it or just bust it in a convenient manner ;-)
> >
> > ... now you drop random pages. This of course helps reducing memory
> > pressure ;-)
>
> (shoot. I figured that was too easy to be right)

Urk! Yeah, removing the only thing in the world keeping a ramfs/tmpfs
page pegged was kinda.. dumb. I'd ask for a brown paper baggie, but
my pointy head might damage it ;-)

> > But still this may be a hint.

_Anyway_, tmpfs is growing and growing from stdout. If I send
output to /dev/null, no growth. Nothing in tmpfs is growing, so I
presume the memory is disappearing down one of X or KDE's sockets.

No such leakage without tmpfs, and I can do all kinds of normal
file type use of tmpfs with no leakage.

-Mike


2001-07-09 11:27:55

by Christoph Rohland

Subject: Re: VM in 2.4.7-pre hurts...

Hi Mike,

On Mon, 9 Jul 2001, Mike Galbraith wrote:
>> But still this may be a hint. You are not running out of swap,
>> aren't you?
>
> I'm running oom whether I have swap enabled or not. The inactive
> dirty list starts growing forever, until it's full of (aparantly)
> dirty pages and I'm utterly oom.
>
> With swap enabled, I keep allocating until there's nothing left.
> Actual space usage is roughly 30mb (of 256mb), but when you can't
> allocate anymore you're toast too, with the same dirt buildup.

AFAIU you are running oom without the oom killer kicking in.

That's reasonable with tmpfs: the tmpfs pages are accounted to the
page cache, but are not freeable if there is no free swap space. So
the vm tries to free memory without success. (The same should be true
for ramfs but exaggerated by the fact that ramfs can never free the
page)

In the -ac series I introduced separate accounting for shmem pages and
do reduce the page cache size by this count for meminfo and
vm_enough_memory. Perhaps the oom coding needs also some knowledge
about this?

Greetings
Christoph


2001-07-09 11:32:05

by Christoph Rohland

Subject: Re: VM in 2.4.7-pre hurts...

Hi Mike,

On Mon, 9 Jul 2001, Mike Galbraith wrote:
>> > But still this may be a hint.
>
> _Anyway_, tmpfs is growing and growing from stdout. If I send
> output to /dev/null, no growth. Nothing in tmpfs is growing, so I
> presume the memory is disappearing down one of X or KDE's sockets.

So tmpfs is not growing, but you still have a mem leak only with
tmpfs? Is there some deleted file allocating blocks? Or did
redirecting stdout fix the problem? I am not sure that I understand
the situation.

> No such leakage without tmpfs, and I can do all kinds of normal
> file type use of tmpfs with no leakage.

BTW I am running /tmp on tmpfs all the time with KDE and never
experienced anything like that. But of course I ran oom without size
limits.

Greetings
Christoph


2001-07-09 12:27:36

by Mike Galbraith

Subject: Re: VM in 2.4.7-pre hurts...

On 9 Jul 2001, Christoph Rohland wrote:

> Hi Mike,
>
> On Mon, 9 Jul 2001, Mike Galbraith wrote:
> >> > But still this may be a hint.
> >
> > _Anyway_, tmpfs is growing and growing from stdout. If I send
> > output to /dev/null, no growth. Nothing in tmpfs is growing, so I
> > presume the memory is disappearing down one of X or KDE's sockets.
>
> So tmpfs is not growing, but you still have a mem leak only with
> tmpfs? Is there some deleted file allocating blocks? Or did
> redirecting stdout fix the problem. I am not sure that I understand
> the situation.

tmpfs is growing. Redirecting output to /dev/null fixes it. (whee)

> > No such leakage without tmpfs, and I can do all kinds of normal
> > file type use of tmpfs with no leakage.
>
> BTW I am running /tmp on tmpfs all the time with KDE and never
> experienced something like that. But of course I ran oom without size
> limits.

I only started using tmpfs for /tmp after I found out how much faster
gcc runs without the -pipe switch. Writing temp files to disk is
much faster than using -pipe.. with tmpfs, it's even faster :)

-Mike

2001-07-09 12:21:46

by Mike Galbraith

Subject: Re: VM in 2.4.7-pre hurts...

On 9 Jul 2001, Christoph Rohland wrote:

> Hi Mike,
>
> On Mon, 9 Jul 2001, Mike Galbraith wrote:
> >> But still this may be a hint. You are not running out of swap,
> >> aren't you?
> >
> > I'm running oom whether I have swap enabled or not. The inactive
> > dirty list starts growing forever, until it's full of (aparantly)
> > dirty pages and I'm utterly oom.
> >
> > With swap enabled, I keep allocating until there's nothing left.
> > Actual space usage is roughly 30mb (of 256mb), but when you can't
> > allocate anymore you're toast too, with the same dirt buildup.
>
> AFAIU you are running oom without the oom killer kicking in.

Yes.

> That's reasonable with tmpfs: the tmpfs pages are accounted to the
> page cache, but are not freeable if there is no free swap space. So
> the vm tries to free memory without success. (The same should be true
> for ramfs but exaggerated by the fact that ramfs can never free the
> page)

Why is it growing from stdout output? (last message.. mail lag)

> In the -ac series I introduced separate accounting for shmem pages and
> do reduce the page cache size by this count for meminfo and
> vm_enough_memory. Perhaps the oom coding needs also some knowledge
> about this?

It doesn't _look_ like that would make a difference.

This is how the oom looks from here:

Start tar -rvf /dev/null /usr/local; box has 128mb ram/256mb swap

We have many dirty and many clean pages at first. As the dcache/icache
etc grow beneath, we start consuming clean/dirty pages, but we don't
have a free shortage at first, so the caches bloat up. We finally start
shrinking caches, but it's too late, and we allocate some swap. (This
cannot be prevented even by VERY aggressive aging/shrinking, even with
a hard limit to start shrinking caches whether we have a free shortage
or not) It is impossible (or I can't get there from here anyway) to
keep the dirty list long enough to keep inactive_shortage happy.. we're
at full memory pressure, so we have to have a full 25% of ram inactive.

We keep allocating much swap and writing a little of it until slowly
but surely, we run out of allocatable swap. In two runs tarring up
an fs with 341057 inodes, swapspace is gone (but hardly used).

If I stop the tar _well_ before oom, and try to do swapoff, oom.

There is very little in /tmp, and tar isn't touching any of it. X must
be pouring pages down a socket to my client, coz that's the only thing
in /tmp. It seems to be leaking away whether we have swap active or not..
just a question of what we run out of first.. swapspace or pages to put
there.

-Mike

2001-07-09 16:22:14

by Linus Torvalds

Subject: Re: VM in 2.4.7-pre hurts...


On Mon, 9 Jul 2001, Mike Galbraith wrote:
>
> I'm running oom whether I have swap enabled or not. The inactive
> dirty list starts growing forever, until it's full of (aparantly)
> dirty pages and I'm utterly oom.

Does it help if you just remove the

if (atomic_read(&page->count) > 2)
goto out;

from shmem_writepage()?

It _shouldn't_ matter (because writepage() should only be called with
inactive pages anyway), but your problem certainly sounds like your
inactive dirty list is not able to write out shmfs pages.

Note that if you don't have swap, you're screwed anyway. You just don't
have anywhere to put the pages.

Linus

2001-07-09 19:47:16

by Christoph Rohland

Subject: Re: VM in 2.4.7-pre hurts...

Hi Linus,

On Mon, 9 Jul 2001, Linus Torvalds wrote:
> Does it help if you just remove the
>
> if (atomic_read(&page->count) > 2)
> goto out;
>
> from shmem_writepage()?
>
> It _shouldn't_ matter (because writepage() should only be called
> with inactive pages anyway), but your problem certainly sounds like
> your inactive dirty list is not able to write out shmfs pages.

No, it does matter. It prevents races against getpage.

Greetings
Christoph


2001-07-09 20:48:14

by Linus Torvalds

Subject: Re: VM in 2.4.7-pre hurts...


On 9 Jul 2001, Christoph Rohland wrote:
>
> No, it does matter. It prevents races against getpage.

No it doesn't.

We have the page locked.

And if somebody does "getpage()" and doesn't check for the page lock, then
that test _still_ doesn't prevent races, because the getpage might happen
just _after_ the "atomic_read()".

As it stands now, that atomic_read() does _nothing_. If you think
something depends on it, then that something is already buggy.

Linus

2001-07-10 02:56:49

by Andrea Arcangeli

Subject: Re: VM in 2.4.7-pre hurts...

On Sun, Jul 08, 2001 at 07:59:01PM -0700, Linus Torvalds wrote:
>
> On Sun, 8 Jul 2001, Linus Torvalds wrote:
> >
> > Anyway, having looked at the buffer case, I htink I found a potentially
> > nasty bug: "unlock_buffer()" with a buffer count of zero.
>
> [ Basic problem: write-completion race with the code that does
> "try_to_free_buffers" ]
>
> Suggested fix:
> - every b_end_io() function should decrement the buffer count as the
> _last_ thing it does to the buffer.
> - every bh IO submitter would have to increment the bh count before
> submitting it for IO.

it seems to me there's a bit of overkill in the pre4 fix. First of all,
such a race obviously couldn't happen with the async_io and kiobuf
handlers: the former because the page stays locked, so the bh cannot go
away with the locked page; the latter because the bhs are private, not
visible at all to the vm. Playing with the b_count for those two cases
is just wasted cycles.

Secondly, even for the sync_io handler we shouldn't need to get_bh before
submitting the I/O. Once we have the bh locked we're just fine. As far as I
can see, all we need is to pin the bh around unlock_buffer in the I/O
completion handler, which is simpler and equally efficient (possibly more
efficient in SMP, where we won't risk doing the bh_put on a different cpu
than the bh_get, but probably the cacheline ping-pong wouldn't change much
since we must do the lock/unlock on the b_state anyway).

Also, somebody should explain to me why end_buffer_write exists in the
first place; it is just wasted memory and icache.

This is the way I would have fixed the smp race against pre3. Can you
see anything that isn't fixed by the below patch and that is fixed by
pre4? Unless somebody can see something I will drop some of the
unnecessary get_bh/put_bh from pre5.

--- bh-fix/drivers/block/ll_rw_blk.c.~1~ Tue Jul 10 04:33:57 2001
+++ bh-fix/drivers/block/ll_rw_blk.c Tue Jul 10 04:49:11 2001
@@ -963,10 +963,12 @@
/*
* Default IO end handler, used by "ll_rw_block()".
*/
-static void end_buffer_io_sync(struct buffer_head *bh, int uptodate)
+void end_buffer_io_sync(struct buffer_head *bh, int uptodate)
{
mark_buffer_uptodate(bh, uptodate);
+ atomic_inc(&bh->b_count);
unlock_buffer(bh);
+ atomic_dec(&bh->b_count);
}

/**
--- bh-fix/fs/buffer.c.~1~ Wed Jul 4 04:03:46 2001
+++ bh-fix/fs/buffer.c Tue Jul 10 04:50:15 2001
@@ -161,13 +161,6 @@
atomic_dec(&bh->b_count);
}

-/* End-of-write handler.. Just mark it up-to-date and unlock the buffer. */
-static void end_buffer_write(struct buffer_head *bh, int uptodate)
-{
- mark_buffer_uptodate(bh, uptodate);
- unlock_buffer(bh);
-}
-
/*
* The buffers have been marked clean and locked. Just submit the dang
* things..
@@ -182,7 +175,7 @@
atomic_inc(&wait->b_count);
do {
struct buffer_head * bh = *array++;
- bh->b_end_io = end_buffer_write;
+ bh->b_end_io = end_buffer_io_sync;
submit_bh(WRITE, bh);
} while (--count);
wait_on_buffer(wait);
--- bh-fix/include/linux/fs.h.~1~ Mon Jul 9 20:25:17 2001
+++ bh-fix/include/linux/fs.h Tue Jul 10 04:50:05 2001
@@ -1061,6 +1061,7 @@

/* reiserfs_writepage needs this */
extern void set_buffer_async_io(struct buffer_head *bh) ;
+extern void end_buffer_io_sync(struct buffer_head *, int);

#define BUF_CLEAN 0
#define BUF_LOCKED 1 /* Buffers scheduled for write */

Andrea

2001-07-10 04:05:18

by Linus Torvalds

Subject: Re: VM in 2.4.7-pre hurts...


On Tue, 10 Jul 2001, Andrea Arcangeli wrote:
>
> it seems to me there's a bit of overkill in the pre4 fix.

It may look that way, but look closer..

> First of all
> such race obviously couldn't happen with the async_io and kiobuf
> handlers, the former because the page stays locked so the bh cannot go
> away with the locked page, the latter because the bh are private not
> visible at all to the vm. Playing with the b_count for those two cases
> are just wasted cycles.

No. Playing with the bh count for those two makes the rules be the same
for everybody, because the sync_io handler needs it, and then we might as
well just make it a general rule: IO in-flight shows up as an elevated
count. It also makes sense from a "reference count" standpoint - the
buffer head count really means "how many references do we have to it", and
the reference from a IO request is very much a reference.

> Also somebody should explain me why end_buffer_write exists in first
> place, it is just wasted memory and icache.

Now that I agree with, we could just get rid of one of them.

> This is the way I would have fixed the smp race against pre3. Can you
> see anything that isn't fixed by the below patch and that is fixed by
> pre4?

I can. I "fixed" it your way at first, and it doesn't actually help a
thing.

Look:

CPU #1                                  CPU #2

try_to_free_buffers()

if (atomic_read(&bh->b_count)

                                        end_buffer_io_sync()

                                        atomic_inc(&bh->b_count);
                                        bit_clear(BH_Locked, &bh->b_flags);

    || bh->b_flags & BUSY_BITS)
        free bh

                                        if (waitqueue_active(&bh->b_wait))
                                            wakeup(&bh->b_wait);
                                        atomic_dec(&bh->b_count);


See? Your patch with the b_count stuff inside end_buffer_io_sync() fixes
absolutely _nothing_ - we still have exactly the same issue.

You can fix it with some interesting memory barriers inside the
try_to_free_buffers() logic, but by that time my fix is (a) much more
obvious and (b) faster too.

Notice how my fix actually has the same number of atomic operations as
your fix, except my fix actually _fixes_ the race?

(Yeah, I'm not counting the async IO ones - those I admit are not
necessary, but at the same time I really prefer to have the different IO
paths look as similar as possible. And see the reference count issue).

Linus

2001-07-10 04:21:52

by Linus Torvalds

Subject: Re: VM in 2.4.7-pre hurts...


On Mon, 9 Jul 2001, Linus Torvalds wrote:
>
> Look:
>
> CPU #1 CPU #2
>
> try_to_free_buffers()
>
> if (atomic_read(&bh->b_count)
>
> end_buffer_io_sync()
>
> atomic_inc(&bh->b_count);
> bit_clear(BH_Locked, &bh->b_flags);
>
> || bh->b_flags & BUSY_BITS)
> free bh

I forgot to note that it doesn't help to re-order the tests here - but we
_could_ do

if (bh->b_flags & BUSY_BITS)
goto buffer_busy;
rmb();
if (atomic_read(&bh->b_count))
goto buffer_busy;

together with having the proper write memory barriers in
"end_buffer_io_sync()" to make sure that the BH_Locked thing shows up in
the right order with bh->b_count updates.

In contrast, the version in pre4 doesn't depend on any memory ordering
between BH_Locked at all - it really only depends on a memory barrier
before the final atomic_dec() that releases the buffer, as it ends up
being sufficient for try_to_free_buffers() to just worry about the buffer
count when it comes to IO completion. The b_flags BUSY bits don't matter
wrt the IO completion at all - they end up being used only for "idle"
buffers (which in turn are totally synchronized by the LRU and hash
spinlocks, so that is the "obviously correct" case)

I personally think it's a hard thing to depend on memory ordering,
especially if there are two independent fields. Which is why I really
don't think that the pre4 fix is "overkill".

Oh, it does really need a

smp_mb_before_atomic_dec();

as part of the "put_bh()". On x86, this obviously is a no-op. And we
actually need that one in general - not just for IO completion - as long
as we consider the "atomic_dec(&bh->b_count)" to "release" the buffer.

Andrea?

Linus

2001-07-10 05:11:59

by Andrea Arcangeli

Subject: Re: VM in 2.4.7-pre hurts...

On Mon, Jul 09, 2001 at 09:03:33PM -0700, Linus Torvalds wrote:
> No. Playing with the bh count for those two makes the rules be the same
> for everybody, because the sync_io handler needs it, and then we might as
> well just make it a general rule: IO in-flight shows up as an elevated
> count. It also makes sense from a "reference count" standpoint - the

At least for the kiobuf case the notion of IO in flight is held in
iobuf->io_count; as for the reference count, there isn't a reference count
at all on those bh, they're just an array of ram in the iobuf.

In fact the b_count is also never read, exactly because the notion of
reference-count/in-flight-I/O doesn't belong there.

One valid argument is to do the same thing before every submit_bh, which
doesn't seem to have a big value to me at the moment; maybe I'm wrong
in the long term, but certainly at this moment that sounds more like
wasted cpu than cleaner code.

so I guess I'd prefer something like this:

--- 2.4.7pre5/fs/buffer.c.~1~ Tue Jul 10 03:55:21 2001
+++ 2.4.7pre5/fs/buffer.c Tue Jul 10 06:52:38 2001
@@ -2006,7 +2006,6 @@

kiobuf = bh->b_private;
__unlock_buffer(bh);
- put_bh(bh);
end_kio_request(kiobuf, uptodate);
}

@@ -2131,7 +2130,6 @@
offset += size;

atomic_inc(&iobuf->io_count);
- get_bh(tmp);
submit_bh(rw, tmp);
/*
* Wait for IO if we have got too much


(personally I guess I'd still prefer to do the same for the async-IO
handler, but for it at least the b_count is checked eventually by
somebody [try_to_free_pages], so it somehow makes more sense even if in
practice it isn't needed there either)

Andrea

2001-07-10 05:43:44

by Andrea Arcangeli

Subject: Re: VM in 2.4.7-pre hurts...

On Mon, Jul 09, 2001 at 09:20:23PM -0700, Linus Torvalds wrote:
> In contrast, the version in pre4 doesn't depend on any memory ordering
> between BH_Locked at all - it really only depends on a memory barrier
> before the final atomic_dec() that releases the buffer, as it ends up
> being sufficient for try_to_free_buffers() to just worry about the buffer
> count when it comes to IO completion. The b_flags BUSY bits don't matter
> wrt the IO completion at all - they end up being used only for "idle"
> buffers (which in turn are totally synchronized by the LRU and hash
> spinlocks, so that is the "obviously correct" case)
>
> I personally think it's a hard thing to depend on memory ordering,

Sometimes memory ordering pays off by avoiding locks, but this isn't one
of those cases ;).

> especially if there are two independent fields. Which is why I really
> don't think that the pre4 fix is "overkill".

It certainly isn't overkill with respect to doing get_bh at the implicitly
synchronized points where we submit the I/O (that was my second argument,
and that was plain wrong).

My first arguments about "overkill" were for async I/O and kiobufs, where
the race cannot trigger. Mainly for the kiobufs I/O I'm still not very
convinced.

> Oh, it does really need a
>
> smp_mb_before_atomic_dec();
>
> as part of the "put_bh()". On x86, this obviously is a no-op. And we
> actually need that one in general - not just for IO completion - as long
> as we consider the "atomic_dec(&bh->b_flags)" to "release" the buffer.
>
> Andrea?

yes, agreed.

Andrea

2001-07-10 14:57:53

by Andrea Arcangeli

Subject: Re: VM in 2.4.7-pre hurts...

On Tue, Jul 10, 2001 at 07:43:15AM +0200, Andrea Arcangeli wrote:
> My first arguments about "overkill" were for async I/O and kiobufs, where
> the race cannot trigger. Mainly for the kiobufs I/O I'm still not very
> convinced.

another issue is that I don't see any value in defining
unlock_buffer() with the get_bh/put_bh in it. The get_bh in there is
just useless, because we don't have the memory barriers in the get_bh and
try_to_free_buffers paths. That's what partly fooled me last night into
thinking we didn't need the get_bh at the implicit synchronization point,
and that using unlock_buffer (with the get_bh in it) was enough; otherwise
unlock_buffer would have been pointless (and in fact it now turns out it is,
as far as I can tell ;).

So I'd kill unlock_buffer and replace it with __unlock_buffer.

Andrea

2001-07-10 18:48:06

by Linus Torvalds

Subject: Re: VM in 2.4.7-pre hurts...


On Tue, 10 Jul 2001, Andrea Arcangeli wrote:
>
> another issue is that I don't see any value in defining the
> unlock_buffer() with the get_bh/put_bh in it.

Ahh, I forgot about that.

That's just a remnant of my original fix (which looked very much like
yours, and had the same bug), and I just hadn't cleaned up after I did the
real fix. Thanks for reminding me - done.

I also did the end_buffer_io_sync() cleanup.

Linus

2001-07-11 02:38:00

by Marcelo Tosatti

Subject: Re: VM in 2.4.7-pre hurts...



On Mon, 9 Jul 2001, Mike Galbraith wrote:

> On 9 Jul 2001, Christoph Rohland wrote:
>
> > Hi Mike,
> >
> > On Mon, 9 Jul 2001, Mike Galbraith wrote:
> > > --- mm/shmem.c.org Mon Jul 9 09:03:27 2001
> > > +++ mm/shmem.c Mon Jul 9 09:03:46 2001
> > > @@ -264,8 +264,8 @@
> > > info->swapped++;
> > >
> > > spin_unlock(&info->lock);
> > > -out:
> > > set_page_dirty(page);
> > > +out:
> > > UnlockPage(page);
> > > return error;
> > > }
> > >
> > > So, did I fix it or just bust it in a convenient manner ;-)
> >
> > ... now you drop random pages. This of course helps reducing memory
> > pressure ;-)
>
> (shoot. I figured that was too easy to be right)
>
> > But still this may be a hint. You are not running out of swap, aren't
> > you?
>
> I'm running oom whether I have swap enabled or not. The inactive
> dirty list starts growing forever, until it's full of (aparantly)
> dirty pages and I'm utterly oom.

We can verify whether this (the inactive list being full of dirty
pages) is really the case with the tracing code.

Is the shmem fix in 2.4.7-pre5 the solution to your problem?

If not, I'll port the tracing code to pre5 and hopefully we can
actually figure out what is going on here.

2001-07-11 04:04:47

by Mike Galbraith

[permalink] [raw]
Subject: Re: VM in 2.4.7-pre hurts...

On Tue, 10 Jul 2001, Marcelo Tosatti wrote:

> On Mon, 9 Jul 2001, Mike Galbraith wrote:
>
> > On 9 Jul 2001, Christoph Rohland wrote:
> >

[snip]

> > > > So, did I fix it or just bust it in a convenient manner ;-)
> > >
> > > ... now you drop random pages. This of course helps reducing memory
> > > pressure ;-)
> >
> > (shoot. I figured that was too easy to be right)
> >
> > > But still this may be a hint. You are not running out of swap, aren't
> > > you?
> >
> > I'm running oom whether I have swap enabled or not. The inactive
> > dirty list starts growing forever, until it's full of (aparantly)
> > dirty pages and I'm utterly oom.
>
> We can make sure if this (inactive full of dirty pages) is really the case
> with the tracing code.

The problem turned out to be KDE. It opens/unlinks two files in /tmp
per terminal, and lseeks/writes terminal output to them continually.
With tmpfs/ramfs mounted on /tmp....
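
For anyone wanting to reproduce it without KDE, the pattern is basically
this (a hypothetical stand-alone sketch with an invented file name, not
KDE's actual code):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096] = { 0 };
		int fd, i;

		fd = open("/tmp/termbuf", O_RDWR | O_CREAT | O_EXCL, 0600);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* Gone from /tmp, but the tmpfs pages stay while fd is open. */
		unlink("/tmp/termbuf");

		/* Stands in for the never-ending stream of terminal output. */
		for (i = 0; i < 100000; i++) {
			lseek(fd, 0, SEEK_END);
			if (write(fd, buf, sizeof(buf)) != sizeof(buf))
				break;
		}
		/* ~400MB now pinned in tmpfs until this process exits. */
		pause();
		return 0;
	}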

> The shmem fix in 2.4.7-pre5 is the solution for your problem ?

No. Now the pages go to the active list instead. The solution is to
not use KDE-2.1's terminals if I fire up X ;-) It doesn't livelock at
oom anymore though.. it thrashes uncontrollably until sysrq-e fixes its
attitude problem.

(I never saw this problem before because I rarely run X, and never do
anything which generates enough terminal output to oom the box if I do
run it. I only ran the tar thing by chance, trying to figure out why
people were reporting updatedb causing massive swapping, when it didn't
do that here.. shrug)

> If not, I'll port the tracing code to the pre5 and hopefully we can
> actually figure out what is going on here.

No need. What looked like a really weird kernel bug turned out to be
a userland bug triggering an oom livelock bug.

-Mike


2001-07-11 19:42:40

by Christoph Rohland

[permalink] [raw]
Subject: Re: VM in 2.4.7-pre hurts...

Hi Linus,

On Mon, 9 Jul 2001, Linus Torvalds wrote:
>
> On 9 Jul 2001, Christoph Rohland wrote:
>>
>> No, it does matter. It prevents races against getpage.
>
> No it doesn't.
>
> We have the page locked.
>
> And if somebody does "getpage()" and doesn't check for the page
> lock, then that test _still_ doesn't prevent races, because the
> getpage might happen just _after_ the "atomic_read()".
>
> As it stands now, that atomic_read() does _nothing_. If you think
> something depends on it, then that something is already buggy.

Yep, you are right. This check hides another error: we cannot use
find_get_page for shmem, since it gets the page without the lock, as
you described. I removed this optimization. Also, __find_lock_page has
to check that the mapping and index are still the ones we looked for.
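
The relocking path then reads roughly like this (a simplified excerpt
around the hunk in the patch below, with my comments added):

	lock_page(page);

	/* While we slept on the page lock, the page may have been removed
	 * from this mapping, or freed and reused at another index, so
	 * "still hashed somewhere" is not enough: recheck both fields. */
	if (page->mapping == mapping && page->index == offset)
		return page;

	/* We raced: unlock, drop our reference and repeat the lookup. */
	UnlockPage(page);
	page_cache_release(page);
	goto repeat;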

I append a patch to fix these errors (and the other obvious buglets in
shmem.c that I have sent you several times).

Stephen, could you crosscheck? You had the test case which triggered
the count > 2 bug.

Greetings
Christoph


diff -uNr 7-pre6/mm/filemap.c 7-pre6-fix/mm/filemap.c
--- 7-pre6/mm/filemap.c Wed Jul 11 09:59:01 2001
+++ 7-pre6-fix/mm/filemap.c Wed Jul 11 20:49:14 2001
@@ -760,7 +760,7 @@
lock_page(page);

/* Is the page still hashed? Ok, good.. */
- if (page->mapping)
+ if (page->mapping == mapping && page->index == offset)
return page;

/* Nope: we raced. Release and try again.. */
diff -uNr 7-pre6/mm/shmem.c 7-pre6-fix/mm/shmem.c
--- 7-pre6/mm/shmem.c Wed Jul 11 09:59:01 2001
+++ 7-pre6-fix/mm/shmem.c Wed Jul 11 20:44:35 2001
@@ -3,7 +3,8 @@
*
* Copyright (C) 2000 Linus Torvalds.
* 2000 Transmeta Corp.
- * 2000 Christoph Rohland
+ * 2000-2001 Christoph Rohland
+ * 2000-2001 SAP AG
*
* This file is released under the GPL.
*/
@@ -33,7 +34,7 @@
#define TMPFS_MAGIC 0x01021994

#define ENTRIES_PER_PAGE (PAGE_SIZE/sizeof(unsigned long))
-#define NR_SINGLE (ENTRIES_PER_PAGE + SHMEM_NR_DIRECT)
+#define SHMEM_MAX_BLOCKS (SHMEM_NR_DIRECT + ENTRIES_PER_PAGE*ENTRIES_PER_PAGE)

static struct super_operations shmem_ops;
static struct address_space_operations shmem_aops;
@@ -193,7 +194,14 @@
}

out:
- info->max_index = index;
+ /*
+ * We have no chance to give an error, so we limit it to max
+ * size here and the application will fail later
+ */
+ if (index > SHMEM_MAX_BLOCKS)
+ info->max_index = SHMEM_MAX_BLOCKS;
+ else
+ info->max_index = index;
info->swapped -= freed;
shmem_recalc_inode(inode);
spin_unlock (&info->lock);
@@ -311,6 +319,7 @@
return page;
}

+ shmem_recalc_inode(inode);
if (entry->val) {
unsigned long flags;

@@ -390,22 +399,9 @@

static int shmem_getpage(struct inode * inode, unsigned long idx, struct page **ptr)
{
- struct address_space * mapping = inode->i_mapping;
int error;

- *ptr = NOPAGE_SIGBUS;
- if (inode->i_size <= (loff_t) idx * PAGE_CACHE_SIZE)
- return -EFAULT;
-
- *ptr = __find_get_page(mapping, idx, page_hash(mapping, idx));
- if (*ptr) {
- if (Page_Uptodate(*ptr))
- return 0;
- page_cache_release(*ptr);
- }
-
down (&inode->i_sem);
- /* retest we may have slept */
if (inode->i_size < (loff_t) idx * PAGE_CACHE_SIZE)
goto sigbus;
*ptr = shmem_getpage_locked(inode, idx);
@@ -1024,6 +1020,8 @@
unsigned long max_inodes, inodes;
struct shmem_sb_info *info = &sb->u.shmem_sb;

+ max_blocks = info->max_blocks;
+ max_inodes = info->max_inodes;
if (shmem_parse_options (data, NULL, &max_blocks, &max_inodes))
return -EINVAL;

@@ -1071,7 +1069,7 @@
sb->u.shmem_sb.free_blocks = blocks;
sb->u.shmem_sb.max_inodes = inodes;
sb->u.shmem_sb.free_inodes = inodes;
- sb->s_maxbytes = (unsigned long long)(SHMEM_NR_DIRECT + (ENTRIES_PER_PAGE*ENTRIES_PER_PAGE)) << PAGE_CACHE_SHIFT;
+ sb->s_maxbytes = (unsigned long long)SHMEM_MAX_BLOCKS << PAGE_CACHE_SHIFT;
sb->s_blocksize = PAGE_CACHE_SIZE;
sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
sb->s_magic = TMPFS_MAGIC;
@@ -1279,9 +1277,11 @@
struct qstr this;
int vm_enough_memory(long pages);

- error = -ENOMEM;
+ if (size > (unsigned long long) SHMEM_MAX_BLOCKS << PAGE_CACHE_SHIFT)
+ return ERR_PTR(-EINVAL);
+
if (!vm_enough_memory((size) >> PAGE_SHIFT))
- goto out;
+ return ERR_PTR(-ENOMEM);

this.name = name;
this.len = strlen(name);
@@ -1289,7 +1289,7 @@
root = tmpfs_fs_type.kern_mnt->mnt_root;
dentry = d_alloc(root, &this);
if (!dentry)
- goto out;
+ return ERR_PTR(-ENOMEM);

error = -ENFILE;
file = get_empty_filp();
@@ -1315,7 +1315,6 @@
put_filp(file);
put_dentry:
dput (dentry);
-out:
return ERR_PTR(error);
}
/*
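
For scale, with the common i386 configuration (PAGE_CACHE_SIZE = 4096,
sizeof(unsigned long) = 4) the new limit works out to roughly the
following; SHMEM_NR_DIRECT only adds a small constant on top (16, if I
remember right):

	ENTRIES_PER_PAGE = 4096 / 4 = 1024
	SHMEM_MAX_BLOCKS = SHMEM_NR_DIRECT + 1024 * 1024 ~= 1048592 blocks
	s_maxbytes       = SHMEM_MAX_BLOCKS << 12        ~= 4GB per file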

2001-07-12 06:16:13

by David Lang

[permalink] [raw]
Subject: Re: VM in 2.4.7-pre hurts...

This sounds similar to a problem I am attempting to duplicate in my lab
(or will be, when I get back into the office on Monday anyway :-)

A box that over time starts to have more and more CPU eaten up by the
system while processing the same load, with lots of memory available
(>100MB free) and no swap allocated (at least as reported by top).

In my case the machine would be writing lots of log entries (via
syslog) as well as forking frequently (dozens of times/sec).

The load average would start at <1 for a freshly booted box but would
climb to ~12-14 around the time it collapsed; rebooting the box (under
the same load) would clear things up.

No swapping was involved on my machine, but is it possible that similar
effects could happen with reads/writes to Unix sockets, or with syslog
writing to the disk? (~200MB of logs per minute, logs rotated every
hour)

David Lang



On Sun, 8 Jul 2001, Rik van Riel wrote:

> Date: Sun, 8 Jul 2001 15:23:53 -0300 (BRST)
> From: Rik van Riel <[email protected]>
> To: Linus Torvalds <[email protected]>
> Cc: Mike Galbraith <[email protected]>,
> Jeff Garzik <[email protected]>,
> Daniel Phillips <[email protected]>, [email protected]
> Subject: Re: VM in 2.4.7-pre hurts...
>
> On Sun, 8 Jul 2001, Linus Torvalds wrote:
> > On Sun, 8 Jul 2001, Rik van Riel wrote:
> > >
> > > ... Bingo. You hit the infamous __wait_on_buffer / ___wait_on_page
> > > bug. I've seen this for quite a while now on our quad xeon test
> > > machine, with some kernel versions it can be reproduced in minutes,
> > > with others it won't trigger at all.
> >
> > Hmm.. That would explain why the "tar" gets stuck, but why does the whole
> > machine grind to a halt with all other processes being marked runnable?
>
> If __wait_on_buffer and ___wait_on_page get stuck, this could
> mean a page doesn't get unlocked. When this is happening, we
> may well be running into a dozens of pages which aren't getting
> properly unlocked on IO completion.
>
> This in turn would get the rest of the system stuck in the
> pageout code path, eating CPU like crazy.
>
> regards,
>
> Rik
> --
> Virtual memory is like a game you can't win;
> However, without VM there's truly nothing to lose...
>
> http://www.surriel.com/ http://distro.conectiva.com/
>
> Send all your spam to [email protected] (spam digging piggy)
>