Hi,
the following could explain Ooops reports which
BUG_ON(page->pte.chain != NULL) and possibly others. The kernel I'm
talking about is 2.5.31+akpm-everything.gz but others might be
affected as well:
Summary: Don't try to work with page_count == 0.
Don't hold references to a page that don't count towards
the page count.
Processor 1 Processor 2
refill_inactive grabs a page with
page_count == 1 from the lru.
__pagevec_release tries to release
the same page with page_count == 1
and does put_page_test_zero
bringing the page_count to 0
(No spinlocks involved!)
Interrupt or preempt or something
else that gives the other processor
some time.
refill_inactive calls page_cache_get
bringing the page_count back to 1
(this should ring some alarm bells)
refill_inactive continues all the way
down until it calls __pagevec_release
itself.
__pagevec_release calls put_page_test_zero
bringing the count back to 0
Both processors succeeded in bringing the page_count to zero,
i.e. both processors will add the page to their own
pages_to_free_list.
actually frees the page and reallocates
it to a new process which immediatly
sets page->pte.chain to something useful.
actully frees the page and sees
BUGs on page->pte.chain beeing
non zero.
Depending on what the new process does with the page this may also
explain other Ooopses on one of the BUGs in free_pages_ok as well.
Also it probably isn't necessarily a pagevec only BUG. I think
__page_cache_release has very much the same Problem:
Processor 1 Processor 2
page_cache_release frees a page
still on a lru list brining the
page count to to 0.
=> Interrupt to make the timeing
possible
Grabs page from lru list
and temporarily increments
page_count back to 1.
BUG_ON(page_count(page) != 0)
in __page_cache_release.
I don't have a fix but I think the only real solution is to
increment the page count if a page is on a lru list. After all
this is a reference to the page.
regards Christian Ehrhardt
--
THAT'S ALL FOLKS!
Christian Ehrhardt wrote:
>
> ...
> Both processors succeeded in bringing the page_count to zero,
> i.e. both processors will add the page to their own
> pages_to_free_list.
This is why __pagevec_release() has the refcount check inside the lock.
If someone else grabbed a ref to the page (also inside the lock) via
the LRU, __pagevec_release doesn't free it.
So the rule could be stated as: the page gets freed when there are
no references to it, presence on the LRU counts as a reference,
serialisation is via pagemap_lru_lock.
> ..
>
> I don't have a fix but I think the only real solution is to
> increment the page count if a page is on a lru list. After all
> this is a reference to the page.
One would think so, but that doesn't really change anything.
I agree the locking and reffing in there is really nasty. It
doesn't help that I put four, repeat four bugs in the 20-line
__page_cache_release(). __pagevec_release() is, I think, OK.
It would be much simpler to grab the lock each time
page_cache_release() is executed, but our performance targets
for 2.5 preclude that.
The page->pte.chain != NULL problems predate the locking changes.
We haven't found that one yet.
On Wed, 21 Aug 2002, Andrew Morton wrote:
> The page->pte.chain != NULL problems predate the locking changes.
> We haven't found that one yet.
Yes, but we have 2 fixes for the page->pte.chain != NULL
problem that happened with mlock() and with discontigmem.
I don't remember seeing the other manifestation of the
bug without the locking changes...
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
On Wed, Aug 21, 2002 at 10:41:48AM -0700, Andrew Morton wrote:
> Christian Ehrhardt wrote:
> >
> > ...
> > Both processors succeeded in bringing the page_count to zero,
> > i.e. both processors will add the page to their own
> > pages_to_free_list.
>
> This is why __pagevec_release() has the refcount check inside the lock.
> If someone else grabbed a ref to the page (also inside the lock) via
> the LRU, __pagevec_release doesn't free it.
I saw this check but this doesn't help. There is no guarantee that this
other reference that someone grabbed is still beeing held at the time
where we do the check:
The problem is if this newly grabbed reference is again dropped BEFORE
the check for page_count == 0 but AFTER put_page_test_zero. In this
case there can be TWO execution paths the BOTH think that they dropped
the last reference, i.e. both call __free_pages_ok for the same page.
See?
> So the rule could be stated as: the page gets freed when there are
> no references to it, presence on the LRU counts as a reference,
> serialisation is via pagemap_lru_lock.
I assume you mean zone->lru_lock here, pagemap_lru_lock is gone
in 2.5.31+everything.gz. Besides this would possibly help if all accesses
to the page_count were protected by the lru_lock (I know we don't want
this). Relying on the lru_lock to protect certain "critical" accesses to
page_count against each other sounds dangerous to me.
> > I don't have a fix but I think the only real solution is to
> > increment the page count if a page is on a lru list. After all
> > this is a reference to the page.
>
> One would think so, but that doesn't really change anything.
We'd need yet another kernel thread that removes pages with no other
references from the lru_list. I agree that this is probably not an
option. What we really want is to make the lru_cache reference a weak
reference to the page.
regards Christian
--
THAT'S ALL FOLKS!
Christian Ehrhardt wrote:
>
> On Wed, Aug 21, 2002 at 10:41:48AM -0700, Andrew Morton wrote:
> > Christian Ehrhardt wrote:
> > >
> > > ...
> > > Both processors succeeded in bringing the page_count to zero,
> > > i.e. both processors will add the page to their own
> > > pages_to_free_list.
> >
> > This is why __pagevec_release() has the refcount check inside the lock.
> > If someone else grabbed a ref to the page (also inside the lock) via
> > the LRU, __pagevec_release doesn't free it.
>
> I saw this check but this doesn't help. There is no guarantee that this
> other reference that someone grabbed is still beeing held at the time
> where we do the check:
> The problem is if this newly grabbed reference is again dropped BEFORE
> the check for page_count == 0 but AFTER put_page_test_zero. In this
> case there can be TWO execution paths the BOTH think that they dropped
> the last reference, i.e. both call __free_pages_ok for the same page.
> See?
shrink_cache() detects that, inside the lock, and puts the page back
if it has a zero refcount.
refill_inactive doesn't need to do that, because it calls page_cache_release(),
which should look like this:
void __page_cache_release(struct page *page)
{
unsigned long flags;
spin_lock_irqsave(&_pagemap_lru_lock, flags);
if (TestClearPageLRU(page)) {
if (PageActive(page))
del_page_from_active_list(page);
else
del_page_from_inactive_list(page);
}
if (page_count(page) != 0)
page = NULL;
spin_unlock_irqrestore(&_pagemap_lru_lock, flags);
if (page)
__free_pages_ok(page, 0);
}
If the page count and non-LRUness are both seen inside the lock,
the page is freeable.
We do a similar thing with inodes, via atomic_dec_and_lock.
Despite the transformations, it's based on the 2.4 approach. But you've
successfully worried me, and I'm not really sure it's right, and I'm
dead sure it's too hairy. Something simpler-but-not-sucky is needed.