2002-08-21 15:41:30

by Christian Ehrhardt

[permalink] [raw]
Subject: Race in pagevec code


Hi,

the following could explain Ooops reports which
BUG_ON(page->pte.chain != NULL) and possibly others. The kernel I'm
talking about is 2.5.31+akpm-everything.gz but others might be
affected as well:

Summary: Don't try to work with page_count == 0.
Don't hold references to a page that don't count towards
the page count.


Processor 1 Processor 2
refill_inactive grabs a page with
page_count == 1 from the lru.
__pagevec_release tries to release
the same page with page_count == 1
and does put_page_test_zero
bringing the page_count to 0
(No spinlocks involved!)

Interrupt or preempt or something
else that gives the other processor
some time.

refill_inactive calls page_cache_get
bringing the page_count back to 1
(this should ring some alarm bells)

refill_inactive continues all the way
down until it calls __pagevec_release
itself.

__pagevec_release calls put_page_test_zero
bringing the count back to 0

Both processors succeeded in bringing the page_count to zero,
i.e. both processors will add the page to their own
pages_to_free_list.

actually frees the page and reallocates
it to a new process which immediatly
sets page->pte.chain to something useful.

actully frees the page and sees
BUGs on page->pte.chain beeing
non zero.

Depending on what the new process does with the page this may also
explain other Ooopses on one of the BUGs in free_pages_ok as well.
Also it probably isn't necessarily a pagevec only BUG. I think
__page_cache_release has very much the same Problem:

Processor 1 Processor 2
page_cache_release frees a page
still on a lru list brining the
page count to to 0.
=> Interrupt to make the timeing
possible

Grabs page from lru list
and temporarily increments
page_count back to 1.
BUG_ON(page_count(page) != 0)
in __page_cache_release.

I don't have a fix but I think the only real solution is to
increment the page count if a page is on a lru list. After all
this is a reference to the page.

regards Christian Ehrhardt

--
THAT'S ALL FOLKS!


2002-08-21 17:26:48

by Andrew Morton

[permalink] [raw]
Subject: Re: Race in pagevec code

Christian Ehrhardt wrote:
>
> ...
> Both processors succeeded in bringing the page_count to zero,
> i.e. both processors will add the page to their own
> pages_to_free_list.

This is why __pagevec_release() has the refcount check inside the lock.
If someone else grabbed a ref to the page (also inside the lock) via
the LRU, __pagevec_release doesn't free it.

So the rule could be stated as: the page gets freed when there are
no references to it, presence on the LRU counts as a reference,
serialisation is via pagemap_lru_lock.

> ..
>
> I don't have a fix but I think the only real solution is to
> increment the page count if a page is on a lru list. After all
> this is a reference to the page.

One would think so, but that doesn't really change anything.

I agree the locking and reffing in there is really nasty. It
doesn't help that I put four, repeat four bugs in the 20-line
__page_cache_release(). __pagevec_release() is, I think, OK.

It would be much simpler to grab the lock each time
page_cache_release() is executed, but our performance targets
for 2.5 preclude that.

The page->pte.chain != NULL problems predate the locking changes.
We haven't found that one yet.

2002-08-21 20:24:29

by Rik van Riel

[permalink] [raw]
Subject: Re: Race in pagevec code

On Wed, 21 Aug 2002, Andrew Morton wrote:

> The page->pte.chain != NULL problems predate the locking changes.
> We haven't found that one yet.

Yes, but we have 2 fixes for the page->pte.chain != NULL
problem that happened with mlock() and with discontigmem.

I don't remember seeing the other manifestation of the
bug without the locking changes...

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-08-21 22:19:27

by Christian Ehrhardt

[permalink] [raw]
Subject: Re: Race in pagevec code

On Wed, Aug 21, 2002 at 10:41:48AM -0700, Andrew Morton wrote:
> Christian Ehrhardt wrote:
> >
> > ...
> > Both processors succeeded in bringing the page_count to zero,
> > i.e. both processors will add the page to their own
> > pages_to_free_list.
>
> This is why __pagevec_release() has the refcount check inside the lock.
> If someone else grabbed a ref to the page (also inside the lock) via
> the LRU, __pagevec_release doesn't free it.

I saw this check but this doesn't help. There is no guarantee that this
other reference that someone grabbed is still beeing held at the time
where we do the check:
The problem is if this newly grabbed reference is again dropped BEFORE
the check for page_count == 0 but AFTER put_page_test_zero. In this
case there can be TWO execution paths the BOTH think that they dropped
the last reference, i.e. both call __free_pages_ok for the same page.
See?

> So the rule could be stated as: the page gets freed when there are
> no references to it, presence on the LRU counts as a reference,
> serialisation is via pagemap_lru_lock.

I assume you mean zone->lru_lock here, pagemap_lru_lock is gone
in 2.5.31+everything.gz. Besides this would possibly help if all accesses
to the page_count were protected by the lru_lock (I know we don't want
this). Relying on the lru_lock to protect certain "critical" accesses to
page_count against each other sounds dangerous to me.

> > I don't have a fix but I think the only real solution is to
> > increment the page count if a page is on a lru list. After all
> > this is a reference to the page.
>
> One would think so, but that doesn't really change anything.

We'd need yet another kernel thread that removes pages with no other
references from the lru_list. I agree that this is probably not an
option. What we really want is to make the lru_cache reference a weak
reference to the page.

regards Christian

--
THAT'S ALL FOLKS!

2002-08-21 22:50:42

by Andrew Morton

[permalink] [raw]
Subject: Re: Race in pagevec code

Christian Ehrhardt wrote:
>
> On Wed, Aug 21, 2002 at 10:41:48AM -0700, Andrew Morton wrote:
> > Christian Ehrhardt wrote:
> > >
> > > ...
> > > Both processors succeeded in bringing the page_count to zero,
> > > i.e. both processors will add the page to their own
> > > pages_to_free_list.
> >
> > This is why __pagevec_release() has the refcount check inside the lock.
> > If someone else grabbed a ref to the page (also inside the lock) via
> > the LRU, __pagevec_release doesn't free it.
>
> I saw this check but this doesn't help. There is no guarantee that this
> other reference that someone grabbed is still beeing held at the time
> where we do the check:
> The problem is if this newly grabbed reference is again dropped BEFORE
> the check for page_count == 0 but AFTER put_page_test_zero. In this
> case there can be TWO execution paths the BOTH think that they dropped
> the last reference, i.e. both call __free_pages_ok for the same page.
> See?

shrink_cache() detects that, inside the lock, and puts the page back
if it has a zero refcount.

refill_inactive doesn't need to do that, because it calls page_cache_release(),
which should look like this:

void __page_cache_release(struct page *page)
{
unsigned long flags;

spin_lock_irqsave(&_pagemap_lru_lock, flags);
if (TestClearPageLRU(page)) {
if (PageActive(page))
del_page_from_active_list(page);
else
del_page_from_inactive_list(page);
}
if (page_count(page) != 0)
page = NULL;
spin_unlock_irqrestore(&_pagemap_lru_lock, flags);
if (page)
__free_pages_ok(page, 0);
}

If the page count and non-LRUness are both seen inside the lock,
the page is freeable.

We do a similar thing with inodes, via atomic_dec_and_lock.

Despite the transformations, it's based on the 2.4 approach. But you've
successfully worried me, and I'm not really sure it's right, and I'm
dead sure it's too hairy. Something simpler-but-not-sucky is needed.