2002-09-05 04:57:17

by Daniel Phillips

Subject: Race in shrink_cache

Hi Marcelo,

This looks really suspicious, vmscan.c#435:

	spin_unlock(&pagemap_lru_lock);
			/* interleaved: another CPU drops the last ref */
			if (put_page_testzero(page))
				__free_pages_ok(page, 0);
	/* avoid to free a locked page */
	page_cache_get(page);

	/* whoops, double free coming */

I suggest you bump the page count before releasing the lru lock. The race
shown above may not in fact be possible, but the current code is fragile.
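
That is, something like this reordering (untested, just to show what
I mean):

	/* avoid to free a locked page; take the ref while the lru lock
	 * still protects us from a racing final free
	 */
	page_cache_get(page);
	spin_unlock(&pagemap_lru_lock);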

--
Daniel


2002-09-05 06:19:09

by Andrew Morton

Subject: Re: Race in shrink_cache

Daniel Phillips wrote:
>
> Hi Marcelo,
>
> This looks really suspicious, vmscan.c#435:
>
> spin_unlock(&pagemap_lru_lock);
> if (put_page_testzero(page))
> __free_pages_ok(page, 0);
> /* avoid to free a locked page */
> page_cache_get(page);
>
> /* whoops, double free coming */
>
> I suggest you bump the page count before releasing the lru lock. The race
> shown above may not in fact be possible, but the current code is fragile.
>

That's OK. The page has a ref because of nonzero ->buffers. And it
is locked, which pins page->buffers.

2002-09-05 06:29:40

by Daniel Phillips

Subject: Re: Race in shrink_cache

On Thursday 05 September 2002 08:36, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > Hi Marcelo,
> >
> > This looks really suspicious, vmscan.c#435:
> >
> > spin_unlock(&pagemap_lru_lock);
> > if (put_page_testzero(page))
> > __free_pages_ok(page, 0);
> > /* avoid to free a locked page */
> > page_cache_get(page);
> >
> > /* whoops, double free coming */
> >
> > I suggest you bump the page count before releasing the lru lock. The race
> > shown above may not in fact be possible, but the current code is fragile.
> >
>
> That's OK. The page has a ref because of nonzero ->buffers And it
> is locked, which pins page->buffers.

Yes, true. Calm down ladies and gentlemen, and move away from the exits,
there is no fire. While we're in here, do you have any idea what this is
about:

	/*
	 * We must not allow an anon page
	 * with no buffers to be visible on
	 * the LRU, so we unlock the page after
	 * taking the lru lock
	 */

That is, what's scary about an anon page without buffers?

--
Daniel

2002-09-05 06:50:31

by Andrew Morton

Subject: Re: Race in shrink_cache

Daniel Phillips wrote:
>
> ...
> /*
> * We must not allow an anon page
> * with no buffers to be visible on
> * the LRU, so we unlock the page after
> * taking the lru lock
> */
>
> That is, what's scary about an anon page without buffers?

ooop. That's an akpm comment. umm, err..

It solves this BUG:

http://www.cs.helsinki.fi/linux/linux-kernel/2001-37/0594.html

Around the 2.4.10 timeframe, Andrea started putting anon pages
on the LRU. Then he backed that out, then put it in again. I
think this comment dates from the time when anon pages were
not on the LRU. So there's a little window there where the
page is unlocked, we've just dropped its swapdev buffers, the page is
on the LRU and pagemap_lru_lock is not held.

So another CPU came in, found the page on the LRU, saw that it had
no ->mapping and no ->buffers and went BUG.

The fix was to take pagemap_lru_lock before unlocking the page.

The comment is stale.
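
The ordering that matters there is roughly (a sketch, not the literal
2.4 source):

	/* the anon page's buffers have just been dropped; take the lru
	 * lock first ...
	 */
	spin_lock(&pagemap_lru_lock);
	/* ... then drop the page lock, so the page is never visible on
	 * the LRU unlocked and bufferless
	 */
	UnlockPage(page);
	__lru_cache_del(page);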

2002-09-05 07:22:09

by Daniel Phillips

Subject: Re: Race in shrink_cache

On Thursday 05 September 2002 09:07, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > ...
> > /*
> > * We must not allow an anon page
> > * with no buffers to be visible on
> > * the LRU, so we unlock the page after
> > * taking the lru lock
> > */
> >
> > That is, what's scary about an anon page without buffers?
>
> ooop. That's an akpm comment. umm, err..
>
> It solves this BUG:
>
> http://www.cs.helsinki.fi/linux/linux-kernel/2001-37/0594.html
>
> Around the 2.4.10 timeframe, Andrea started putting anon pages
> on the LRU. Then he backed that out, then put it in again. I
> think this comment dates from the time when anon pages were
> not on the LRU. So there's a little window there where the
> page is unlocked, we've just dropped its swapdev buffers, the page is
> on the LRU and pagemap_lru_lock is not held.
>
> So another CPU came in, found the page on the LRU, saw that it had
> no ->mapping and no ->buffers and went BUG.
>
> The fix was to take pagemap_lru_lock before unlocking the page.
>
> The comment is stale.

With the atomic_dec_and_lock strategy, the page would be freed immediately
when its buffers are released, and with the lru=1 strategy it doesn't
matter in terms of correctness whether the page ends up on the lru or not,
so I was inclined not to worry about this anyway. Still, when you see a
dire-looking comment like that...
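
(By the atomic_dec_and_lock strategy I mean roughly the following; a
sketch only, not the actual patch:

	/* drop the final reference with the lru lock held, so a page can
	 * never be seen on the LRU with a zero count
	 */
	static inline void page_cache_release(struct page *page)
	{
		if (atomic_dec_and_lock(&page->count, &pagemap_lru_lock)) {
			if (PageLRU(page))
				__lru_cache_del(page);
			spin_unlock(&pagemap_lru_lock);
			__free_pages_ok(page, 0);
		}
	}

and lru=1 means the lru list itself holds a reference, so a page on the
lru can never hit a zero count in the first place.)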

You said something about your lru locking strategy in 2.5.33-mm2. I have not
reverse engineered it yet; would you care to wax poetic?

--
Daniel

2002-09-05 07:36:14

by Andrew Morton

Subject: Re: Race in shrink_cache

Daniel Phillips wrote:
>
> ...
> You said something about your lru locking strategy in 2.5.33-mm2. I have not
> reverse engineered it yet, would you care to wax poetic?

I'm not sure what you're after here? Strategy is to make the locks per-zone,
per-node, to not take them too long, to not take them too frequently, to do
as much *needed* work as possible for a single acquisition of the lock and
to not do unnecessary work while holding it - and that includes not servicing
ethernet interrupts.

Not having to bump page counts when moving pages from the LRU into a private
list would be nice.

The strategy for fixing the double-free race is to wait until you
buy an IDE disk ;)

The elaborate changelogs are at

http://linux.bkbits.net:8080/linux-2.5/user=akpm/ChangeSet@-4w?nav=!-|index.html|stats|!+|index.html

2002-09-05 13:28:45

by Rik van Riel

Subject: Re: Race in shrink_cache

On Thu, 5 Sep 2002, Daniel Phillips wrote:

> /*
> * We must not allow an anon page
> * with no buffers to be visible on
> * the LRU, so we unlock the page after
> * taking the lru lock
> */
>
> That is, what's scary about an anon page without buffers?

Nothing, those are placed on the LRU all the time.

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-09-05 18:34:57

by Daniel Phillips

Subject: Re: Race in shrink_cache

On Thursday 05 September 2002 09:53, Andrew Morton wrote:
> Not having to bump page counts when moving pages from the LRU into a private
> list would be nice.

I'm not sure what your intended application is here. It's easy enough to
change the lru state bit to a scalar, the transitions of which are protected
naturally by the lru lock. This gives you N partitions of the lru list
(times M zones) and page_cache_release does the right thing for all of them.

On the other hand, if what you want is a private list that page_cache_release
doesn't act on automatically, all you have to do is set the lru state to zero,
leave the page count incremented and move to the private list. You then take
explicit responsibility for freeing the page or moving it back onto a
mainstream lru list.
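
In code the second pattern is roughly this (a sketch; lru_state and
my_private_list are made-up names for illustration):

	spin_lock(&pagemap_lru_lock);
	if (page->lru_state) {
		page->lru_state = 0;	/* page_cache_release keeps hands off now */
		list_del(&page->lru);
		list_add(&page->lru, &my_private_list);
	}
	spin_unlock(&pagemap_lru_lock);
	/* the count we already hold pins the page; putting it back on a
	 * mainstream lru list (or freeing it) is now our job alone
	 */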

An example of an application of the latter technique is a short delay list to
(finally) implement the use-once concept properly. A newly instantiated page
goes onto the hot end of this list instead of the inactive list as it does
now, and after a short delay dependent on the allocation activity in the
system, is moved either to the (per zone) active or inactive list, depending
on whether it was referenced. Thus the use-once list is not per-zone and
removal from it is always explicit. So it's not like the other lru lists,
even though it uses the same link field.

While I'm meandering here, I'll mention that the above approach finally makes
use-once work properly for swap pages, which always had the problem that we
couldn't detect the second, activating reference (and this was fudged by
always marking a swapped-in page as referenced, i.e., kludging away the
mechanism). It also solves the problem of detecting clustered references,
such as reading through a page a byte at a time, which should only count as a
single reference. Right now we do a stupid hack that works in a lot of
cases, but fails in enough cases to be annoying, all in the name of trying to
get by without implementing a dedicated list.

Readahead on the other hand needs to be handled with dedicated per-zone lru
lists, so that we can conveniently and accurately claw back readahead that
happens to have gone too far ahead. So this is an example of the first kind
of usage. Once read, a readahead page moves to the used-once queue, and from
there either to the inactive or active queue as above.

--
Daniel

2002-09-05 18:48:46

by Andrew Morton

Subject: Re: Race in shrink_cache

Daniel Phillips wrote:
>
> ..
> On the other hand, if what you want is a private list that page_cache_release
> doesn't act on automatically, all you have to do is set the lru state to zero,
> leave the page count incremented and move to the private list. You then take
> explicit responsibility for freeing the page or moving it back onto a
> mainstream lru list.

That's the one. Page reclaim speculatively removes a chunk (typically 32) of
pages from the LRU, works on them, and puts back any unfreeable ones later
on. And the rest of the VM was taught to play correctly with pages that
can be off the LRU. This was to avoid hanging onto the LRU lock while
running page reclaim.

And when those 32 pages are speculatively removed, their refcounts are
incremented. Maybe that isn't necessary - I'd need to think about
that. If it isn't, then the double-free thing is fixed. If it is
necessary, then the lru-adds-a-ref approach is nice, because shrink_cache
doesn't need to page_cache_get each page while holding the LRU lock,
as you say.
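
Schematically it's something like this (a much-simplified sketch with
approximate names, not the actual 2.5.33-mm2 code):

	LIST_HEAD(page_list);
	int nr = 0;

	/* pull a chunk off the tail of the inactive list, pinning each page */
	spin_lock_irq(&zone->lru_lock);
	while (nr < SWAP_CLUSTER_MAX && !list_empty(&zone->inactive_list)) {
		struct page *page = list_entry(zone->inactive_list.prev,
						struct page, lru);
		list_del(&page->lru);
		page_cache_get(page);		/* the ref that may not be needed */
		list_add(&page->lru, &page_list);
		nr++;
	}
	spin_unlock_irq(&zone->lru_lock);

	/* ... try to reclaim everything on page_list, no LRU lock held ... */

	/* put the survivors back, dropping the temporary refs outside the lock */
	spin_lock_irq(&zone->lru_lock);
	while (!list_empty(&page_list)) {
		struct page *page = list_entry(page_list.next, struct page, lru);

		list_del(&page->lru);
		list_add(&page->lru, &zone->inactive_list);
		spin_unlock_irq(&zone->lru_lock);
		/* if this turns out to be the final ref of a page we just
		 * put back on the LRU, we are in exactly the delicate case
		 * discussed above
		 */
		page_cache_release(page);
		spin_lock_irq(&zone->lru_lock);
	}
	spin_unlock_irq(&zone->lru_lock);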

2002-09-05 19:02:11

by Daniel Phillips

Subject: Re: Race in shrink_cache

On Thursday 05 September 2002 20:51, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > ..
> > On the other hand, if what you want is a private list that page_cache_release
> > doesn't act on automatically, all you have to do is set the lru state to zero,
> > leave the page count incremented and move to the private list. You then take
> > explicit responsibility for freeing the page or moving it back onto a
> > mainstream lru list.
>
> That's the one. Page reclaim speculatively removes a chunk (typically 32) of
> pages from the LRU, works on them, and puts back any unfreeable ones later
> on.

Convergent evolution. That's exactly the same number I'm handling as a
chunk in my rmap sharing project (backgrounded for the moment). In this
case, the 32 comes from the number of bits you can store in a long, and
it conveniently happens to fall in the sweet spot of performance as well.

In this case I only have to manipulate one list element though, because
I'm taking the lru list itself out onto the shared rmap node, saving 8
bytes in struct page as a side effect. So other than the size of the
chunk, the resemblance is slight. ETA for this is 2.7, unless you
suddenly start threatening to back out rmap again.

> And the rest of the VM was taught to play correctly with pages that
> can be off the LRU. This was to avoid hanging onto the LRU lock while
> running page reclaim.
>
> And when those 32 pages are speculatively removed, their refcounts are
> incremented. Maybe that isn't necessary - I'd need to think about
> that. If it isn't, then the double-free thing is fixed. If it is
> necessary then then lru-adds-a-ref approach is nice, because shrink_cache
> doesn't need to page_cache_get each page while holding the LRU lock,
> as you say.

I think the extra refcount strategy is inherently stronger, and this
is an example of why. The other would require you to take/drop an
extra count explicitly for your private list.

--
Daniel

2002-09-05 19:21:50

by Andrew Morton

Subject: Re: Race in shrink_cache

Daniel Phillips wrote:
>
> On Thursday 05 September 2002 20:51, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > >
> > > ..
> > > On the other hand, if what you want is a private list that page_cache_release
> > > doesn't act on automatically, all you have to do is set the lru state to zero,
> > > leave the page count incremented and move to the private list. You then take
> > > explicit responsibility for freeing the page or moving it back onto a
> > > mainstream lru list.
> >
> > That's the one. Page reclaim speculatively removes a chunk (typically 32) of
> > pages from the LRU, works on them, and puts back any unfreeable ones later
> > on.
>
> Convergent evolution. That's exactly the same number I'm handling as a
> chunk in my rmap sharing project (backgrounded for the moment). In this
> case, the 32 comes from the number of bits you can store in a long, and
> it conveniently happens to fall in the sweet spot of performance as well.

That's SWAP_CLUSTER_MAX. I've never really seen a reason to change
its value, at least on 4k pagesize.

The pagevecs use 16 pages. The thinking here is that we want it to
be large enough to be efficient, but small enough so that all the
pageframes are still in L1 when we come around and hit on them again.
Plus pagevecs are placed on the stack.
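
For reference, a pagevec is just a small fixed-size array carried on the
caller's stack, roughly this shape (simplified, not the exact definition):

	#define PAGEVEC_SIZE	16

	struct pagevec {
		unsigned nr;
		struct page *pages[PAGEVEC_SIZE];
	};

	/* add a page; returns the space left, 0 meaning "full, drain me" */
	static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
	{
		pvec->pages[pvec->nr++] = page;
		return PAGEVEC_SIZE - pvec->nr;
	}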

> ...
> I think the extra refcount strategy is inherently stronger, and this
> is an example of why. The other would require you to take/drop an
> extra count explicitly for your private list.

OK. I assume you taught invalidate_inode_pages[2] about the extra ref?

2002-09-05 19:53:34

by Daniel Phillips

Subject: Re: Race in shrink_cache

On Thursday 05 September 2002 21:22, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > On Thursday 05 September 2002 20:51, Andrew Morton wrote:
> > > Daniel Phillips wrote:
> > > >
> > > > ..
> > > > On the other hand, if what you want is a private list that page_cache_release
> > > > doesn't act on automatically, all you have to do is set the lru state to zero,
> > > > leave the page count incremented and move to the private list. You then take
> > > > explicit responsibility for freeing the page or moving it back onto a
> > > > mainstream lru list.
> > >
> > > That's the one. Page reclaim speculatively removes a chunk (typically 32) of
> > > pages from the LRU, works on them, and puts back any unfreeable ones later
> > > on.
> >
> > Convergent evolution. That's exactly the same number I'm handling as a
> > chunk in my rmap sharing project (backgrounded for the moment). In this
> > case, the 32 comes from the number of bits you can store in a long, and
> > it conveniently happens to fall in the sweet spot of performance as well.
>
> That's SWAP_CLUSTER_MAX. I've never really seen a reason to change
> its value. On 4k pagesize.
>
> The pagevecs use 16 pages. The thinking here is that we want it to
> be large enough to be efficient, but small enough so that all the
> pageframes are still in L1 when we come around and hit on them again.
> Plus pagevecs are placed on the stack.
>
> > ...
> > I think the extra refcount strategy is inherently stronger, and this
> > is an example of why. The other would require you to take/drop an
> > extra count explicitly for your private list.
>
> OK. I assume you taught invalidate_inode_pages[2] about the extra ref?

Yes, all the magic tests have been replaced with versions that depend on
the LRU_PLUS_CACHE define (1 or 2). As a side effect, this lets you find
all the places that care about this with a simple grep, so I'd be inclined
to leave the symbol in, even after we decide which setting we prefer.
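
For example, a test like vmscan's is_page_cache_freeable() goes from
comparing against a magic 1 to something along these lines (illustrative):

	/* freeable when nobody but the page cache (and, if configured,
	 * the lru list itself) holds a reference; LRU_PLUS_CACHE is 1 or 2
	 */
	static inline int is_page_cache_freeable(struct page *page)
	{
		return page_count(page) - !!page->buffers == LRU_PLUS_CACHE;
	}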

--
Daniel