2005-11-02 14:41:31

by Dipankar Sarma

Subject: bad page state under possibly oom situation

We have discussed this in private previously and I had mentioned
that I see this introduced between 2.6.9-rc1 and 2.6.9-rc2. After
spending some time doing other things, I went back to take a
look at this again and thought I would share this with a wider
audience. The basic problem is that while running the LTP rename14
test with a tmpfs /tmp, I see this -

Bad page state at prep_new_page (in process 'rename14', page ffff810008002aa8)
flags:0x4000000000000090 mapping:0000000000000000 mapcount:0 count:0
Backtrace:

Call Trace:<ffffffff80150388>{bad_page+115} <ffffffff80150bb1>{buffered_rmqueue+438}
<ffffffff80150da8>{__alloc_pages+251} <ffffffff8022fb10>{_atomic_dec_and_lock+24}
<ffffffff801535b3>{cache_alloc_refill+581} <ffffffff8015384f>{kmem_cache_alloc+44}
<ffffffff8017cd55>{d_alloc+33} <ffffffff8017507e>{__lookup_hash+206}
<ffffffff80176f4e>{sys_rename+245} <ffffffff8010e636>{system_call+126}

Trying to fix it up, but a reboot is needed

Recently, I tested this with 2.6.14 and it worked. I then tried
setting rcupdate.maxbatch=10 as it was before 2.6.14 and the bad
page state problem happened again. Looks like it happens only under
memory pressure and likely has something to do with slab.
I am wondering if that rings a bell with anyone. Manfred ?

Thanks
Dipankar


2005-11-02 16:34:27

by Hugh Dickins

Subject: Re: bad page state under possibly oom situation

On Wed, 2 Nov 2005, Dipankar Sarma wrote:
> We have discussed this in private previously and I had mentioned
> that I see this introduced between 2.6.9-rc1 and 2.6.9-rc2. After
> spending some time doing other things, I went back to take a
> look at this again and thought I would share this with a wider
> audience. The basic problem is that while running the LTP rename14
> test with a tmpfs /tmp, I see this -
>
> Bad page state at prep_new_page (in process 'rename14', page ffff810008002aa8)
> flags:0x4000000000000090 mapping:0000000000000000 mapcount:0 count:0
> Backtrace:
>
> Call Trace:<ffffffff80150388>{bad_page+115} <ffffffff80150bb1>{buffered_rmqueue+438}
> <ffffffff80150da8>{__alloc_pages+251} <ffffffff8022fb10>{_atomic_dec_and_lock+24}
> <ffffffff801535b3>{cache_alloc_refill+581} <ffffffff8015384f>{kmem_cache_alloc+44}
> <ffffffff8017cd55>{d_alloc+33} <ffffffff8017507e>{__lookup_hash+206}
> <ffffffff80176f4e>{sys_rename+245} <ffffffff8010e636>{system_call+126}
>
> Trying to fix it up, but a reboot is needed

(I don't know that it makes any difference, but was this particular report
from 2.6.9-rc2 or from 2.6.14 or from something else? In both 2.6.9 and
2.6.14, flags 0x90 mean PG_slab|PG_dirty.)
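
For reference, a minimal userspace sketch of how 0x90 decodes. The bit
numbers (PG_dirty = 4, PG_slab = 7) are an assumption based on the 2.6.14
include/linux/page-flags.h layout; the helper names are made up for this
illustration and are not kernel functions.

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed 2.6.14 bit numbers; only these two matter for this report. */
enum { PG_dirty = 4, PG_slab = 7 };

/* Return 1 if the given bit is set in a page->flags value, else 0. */
static int test_flag(uint64_t flags, int bit)
{
    return (int)((flags >> bit) & 1);
}

/* List every set bit, naming the two flags this sketch knows about. */
static void print_flag_bits(uint64_t flags)
{
    for (int bit = 0; bit < 64; bit++) {
        if (!test_flag(flags, bit))
            continue;
        if (bit == PG_dirty)
            printf("bit %d (PG_dirty)\n", bit);
        else if (bit == PG_slab)
            printf("bit %d (PG_slab)\n", bit);
        else
            printf("bit %d\n", bit);
    }
}
```

Calling print_flag_bits(0x4000000000000090ULL) lists bits 4, 7 and 62;
the low byte 0x90 is exactly (1 << 4) | (1 << 7).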

> Recently, I tested this with 2.6.14 and it worked. I then tried
> setting rcupdate.maxbatch=10 as it was before 2.6.14 and the bad
> page state problem happened again. Looks like it happens only under
> memory pressure and likely has something to do with slab.
> I am wondering if that rings a bell with anyone. Manfred ?

Slab and RCU and 2.6.9-rc2 do rather point an accusing finger at my
SLAB_DESTROY_BY_RCU, which went in for the anon_vma cache in that -rc.

I did look back at that when we discussed your Bad page state before,
and didn't find anything wrong. But now you're reproducing it again,
it would be good to rule it in or out.

Could you run the rename14 test on whatever kernel suits you, but
built with mm/rmap.c omitting the SLAB_DESTROY_BY_RCU flag to
kmem_cache_create? There's then a tiny chance that rmap will try
to access a freed anon_vma struct, I'm not sure how that would
behave offhand; but that chance should be a lot tinier than what
you're finding quite easy to reproduce.

If you don't get the Bad page state with that kernel, then it'll
be worth scrutinizing the SLAB_DESTROY_BY_RCU path in mm/slab.c.

Hugh

2005-11-02 19:54:13

by Dipankar Sarma

Subject: Re: bad page state under possibly oom situation

On Wed, Nov 02, 2005 at 04:33:21PM +0000, Hugh Dickins wrote:
> On Wed, 2 Nov 2005, Dipankar Sarma wrote:
> >
> > Bad page state at prep_new_page (in process 'rename14', page ffff810008002aa8)
> > flags:0x4000000000000090 mapping:0000000000000000 mapcount:0 count:0
> > Backtrace:
> >
>
> (I don't know that it makes any difference, but was this particular report
> from 2.6.9-rc2 or from 2.6.14 or from something else? In both 2.6.9 and
> 2.6.14, flags 0x90 mean PG_slab|PG_dirty.)
>
> > Recently, I tested this with 2.6.14 and it worked. I then tried
> > setting rcupdate.maxbatch=10 as it was before 2.6.14 and the bad
> > page state problem happened again. Looks like it happens only under
> > memory pressure and likely has something to do with slab.
> > I am wondering if that rings a bell with anyone. Manfred ?
>
>
> I did look back at that when we discussed your Bad page state before,
> and didn't find anything wrong. But now you're reproducing it again,
> it would be good to rule it in or out.
>
> Could you run the rename14 test on whatever kernel suits you, but
> built with mm/rmap.c omitting the SLAB_DESTROY_BY_RCU flag to
> kmem_cache_create? There's then a tiny chance that rmap will try
> to access a freed anon_vma struct, I'm not sure how that would
> behave offhand; but that chance should be a lot tinier than what
> you're finding quite easy to reproduce.

I am really not comfortable with the SLAB_DESTROY_BY_RCU thing.
I am not familiar with rmap code, so I could be wrong but
it isn't clear to me why you are protecting only the slab
and not the anon_vma slab objects. How do you ensure that
the anon_vma objects don't get re-used ? If they do,
then how do you prevent freeing an in-use anon_vma ?
It seems that the critical sections are not clearly
identified here.

>
> If you don't get the Bad page state with that kernel, then it'll
> be worth scrutinizing the SLAB_DESTROY_BY_RCU path in mm/slab.c.

I tried commenting out SLAB_DESTROY_BY_RCU for anon_vma cache,
but I still hit the problem. So, that may not be it. I guess I can
look at the bad page and see if I can extract some information
from there.

Thanks
Dipankar

2005-11-02 20:34:35

by Hugh Dickins

Subject: Re: bad page state under possibly oom situation

On Thu, 3 Nov 2005, Dipankar Sarma wrote:
> On Wed, Nov 02, 2005 at 04:33:21PM +0000, Hugh Dickins wrote:
>
> I am really not comfortable with the SLAB_DESTROY_BY_RCU thing.
> I am not familiar with rmap code, so I could be wrong but
> it isn't clear to me why you are protecting only the slab
> and not the anon_vma slab objects. How do you ensure that
> the anon_vma objects don't get re-used ? If they do,
> then how do you prevent freeing an in-use anon_vma ?
> It seems that the critical sections are not clearly
> identified here.

The whole idea is that they may indeed get reused, but so long as
they're reused as anon_vma slab objects, with the same layout as before,
it's safe for page_lock_anon_vma to spin_lock(&anon_vma->lock): that
will still be a valid anon_vma->lock it's taking, and the worst that
can happen is that the caller will then search an irrelevant list for
the page it's looking for, and not find it (usually it'll just be an
empty list, when the anon_vma has not yet been put to use again).

An in-use anon_vma is only freed back to slab cache when its list
of vmas is empty, determined while holding anon_vma->lock.

The danger that RCU is used to guard against there, is that the slab
might be destroyed in between reading page->mapping and acquiring
anon_vma->lock, and its memory reused for something very different
e.g. anon_vma->lock no longer a spin_lock, but something which will
freeze that attempt to get the lock.

I think it's a technique which deserves to be used more widely.
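
The reuse argument above can be sketched in userspace. This is only an
illustration under loud assumptions: it is single-threaded, it does not
model RCU at all, and struct fake_anon_vma, the pool and every fake_*
helper are hypothetical names, not the code in mm/rmap.c. What it shows
is that a stale pointer into a pool whose slots are only ever reused as
the same type still finds a valid lock, and at worst searches an
irrelevant list.

```c
#include <stddef.h>

#define POOL_SIZE 2
#define MAX_PAGES 4

/* Hypothetical stand-in for struct anon_vma. */
struct fake_anon_vma {
    int lock;              /* stand-in for spinlock_t: 0 = free, 1 = held */
    int in_use;            /* slot currently allocated? */
    int nr_pages;          /* length of the list that gets searched */
    int pages[MAX_PAGES];  /* stand-in for the list hanging off the object */
};

/* Slots here are never handed out as any other type. */
static struct fake_anon_vma pool[POOL_SIZE];

static void fake_lock(struct fake_anon_vma *a)   { a->lock = 1; }
static void fake_unlock(struct fake_anon_vma *a) { a->lock = 0; }

/* Hand out the first free slot with an empty list; the lock is left alone. */
static struct fake_anon_vma *fake_alloc(void)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            pool[i].nr_pages = 0;
            return &pool[i];
        }
    }
    return NULL;
}

/* Append a page id under the lock; returns 0 if the list is full. */
static int fake_add_page(struct fake_anon_vma *a, int page)
{
    int ok = 0;
    fake_lock(a);
    if (a->nr_pages < MAX_PAGES) {
        a->pages[a->nr_pages++] = page;
        ok = 1;
    }
    fake_unlock(a);
    return ok;
}

/* Empty the list under the lock. */
static void fake_clear(struct fake_anon_vma *a)
{
    fake_lock(a);
    a->nr_pages = 0;
    fake_unlock(a);
}

/* Free only when the list is empty, as decided while holding the lock. */
static int fake_free(struct fake_anon_vma *a)
{
    int freed = 0;
    fake_lock(a);
    if (a->nr_pages == 0) {
        a->in_use = 0;
        freed = 1;
    }
    fake_unlock(a);
    return freed;
}

/* Safe even through a stale pointer: the lock is still a lock, and a
 * reused slot simply does not contain the page being looked for. */
static int fake_find(struct fake_anon_vma *a, int page)
{
    int found = 0;
    fake_lock(a);
    for (int i = 0; i < a->nr_pages; i++) {
        if (a->pages[i] == page) {
            found = 1;
            break;
        }
    }
    fake_unlock(a);
    return found;
}
```

If a slot is freed, reallocated and given a different page, a lookup
through the old pointer takes and drops the lock normally and reports
"not found". The part this sketch cannot show is the one RCU covers in
the kernel: keeping the underlying slab memory from being returned and
reused as a different type while such a lookup is in flight.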

> > If you don't get the Bad page state with that kernel, then it'll
> > be worth scrutinizing the SLAB_DESTROY_BY_RCU path in mm/slab.c.
>
> I tried commenting out SLAB_DESTROY_BY_RCU for anon_vma cache,
> but I still hit the problem. So, that may not be it. I guess I can
> look at the bad page and see if I can extract some information
> from there.

Phew! It seems I'm off the hook (but having said that, I'll probably
turn out to be guilty in some other way). Sorry, I don't have any
ideas (and have never reproduced this here).

Hugh

2005-11-02 23:31:13

by Nick Piggin

Subject: Re: bad page state under possibly oom situation

Hugh Dickins wrote:

>
> Phew! It seems I'm off the hook (but having said that, I'll probably
> turn out to be guilty in some other way). Sorry, I don't have any
> ideas (and have never reproduced this here).
>

PG_dirty should be cleared when the page is freed. In which case,
perhaps you could stick an extra field in struct page which stores
the address of the last guy who did a (Test)SetPageDirty on the
page. Print it in your bad_page handler.

That might get you started.
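
A rough userspace sketch of that idea, for illustration only. struct
fake_page, the last_dirtier field and the *_traced helpers are all
hypothetical; PG_dirty = 4 is assumed from 2.6.14; and
__builtin_return_address() is a GCC/Clang builtin. A real kernel patch
would have to hook the actual SetPageDirty/TestSetPageDirty paths
instead.

```c
#include <stdint.h>
#include <stdio.h>

enum { PG_dirty = 4 };  /* assumed 2.6.14 bit number */

/* Hypothetical cut-down page with the extra debug field. */
struct fake_page {
    uint64_t flags;
    void *last_dirtier;  /* return address of the last caller to dirty it */
};

/* noinline so that __builtin_return_address(0) points into the caller. */
__attribute__((noinline))
static void set_page_dirty_traced(struct fake_page *page)
{
    page->flags |= 1ULL << PG_dirty;
    page->last_dirtier = __builtin_return_address(0);
}

/* Same, but returns the previous state of the bit. */
__attribute__((noinline))
static int test_set_page_dirty_traced(struct fake_page *page)
{
    int old = (int)((page->flags >> PG_dirty) & 1);
    page->flags |= 1ULL << PG_dirty;
    page->last_dirtier = __builtin_return_address(0);
    return old;
}

/* What a bad_page-style report could print in addition to the flags. */
static void report_bad_page(const struct fake_page *page)
{
    printf("flags:0x%016llx last dirtier:%p\n",
           (unsigned long long)page->flags, page->last_dirtier);
}
```

The recorded address can then be resolved against the symbol table to
name the function that last set the bit.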

--
SUSE Labs, Novell Inc.


2005-11-03 06:40:29

by Manfred Spraul

Subject: Re: bad page state under possibly oom situation

Hugh Dickins wrote:

>(I don't know that it makes any difference, but was this particular report
>from 2.6.9-rc2 or from 2.6.14 or from something else? In both 2.6.9 and
>2.6.14, flags 0x90 mean PG_slab|PG_dirty.)
>
>
>
A very odd combination:
- free_pages_check() ensures that neither PG_slab nor PG_dirty are set
- prep_new_page() complains that both PG_slab and PG_dirty are set
- AFAICS slab doesn't set PG_dirty, and no one except slab sets PG_slab.

I don't understand how two wrong bits can end up in page->flags.
Dipankar, could you modify bad_page() and hexdump +-128 bytes? Perhaps
someone overwrites random memory. Or change the value of PG_slab to 20
and check if page->flags remains 0x90.
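
A minimal userspace sketch of such a dump, assuming 64-bit words printed
four per row. dump_words() is a made-up helper, not a kernel function; an
in-kernel version would use printk and would need the whole range to be
mapped.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Print `before` bytes ahead of addr and `after` bytes from addr onwards
 * as 64-bit hex words, four per row. The caller must guarantee that the
 * whole range is readable. Returns the number of words printed. */
static size_t dump_words(const void *addr, size_t before, size_t after,
                         FILE *out)
{
    const unsigned char *p = (const unsigned char *)addr - before;
    size_t nwords = (before + after) / sizeof(uint64_t);

    for (size_t i = 0; i < nwords; i++) {
        uint64_t w;

        /* memcpy avoids any alignment assumption about addr */
        memcpy(&w, p + i * sizeof(w), sizeof(w));
        fprintf(out, "%llx%c", (unsigned long long)w,
                (i % 4 == 3) ? '\n' : ' ');
    }
    if (nwords % 4)
        fputc('\n', out);
    return nwords;
}
```

With before = after = 128 this prints 32 words, which on a 64-bit machine
is enough to cover the suspect struct page and its neighbours on both
sides.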

--
Manfred

2005-11-03 21:18:12

by Dipankar Sarma

Subject: Re: bad page state under possibly oom situation

On Thu, Nov 03, 2005 at 07:38:09AM +0100, Manfred Spraul wrote:
> Hugh Dickins wrote:
>
> >(I don't know that it makes any difference, but was this particular report
> >from 2.6.9-rc2 or from 2.6.14 or from something else? In both 2.6.9 and
> >2.6.14, flags 0x90 mean PG_slab|PG_dirty.)
> >
> A very odd combination:
> - free_pages_check() ensures that neither PG_slab nor PG_dirty are set
> - prep_new_page() complains that both PG_slab and PG_dirty are set
> - AFAICS slab doesn't set PG_dirty, and no one except slab sets PG_slab.
>
> I don't understand how two wrong bits can end up in page->flags.
> Dipankar, could you modify bad_page() and hexdump +-128 bytes? Perhaps
> someone overwrites random memory. Or change the value of PG_slab to 20
> and check if page->flags remains 0x90.

Here is a dump of the page struct when this happens (two different
instances) -

/* Dump of struct page */
page = ffff810008005550
4000005500009090 ffffffffffffffff 0 0
0 100100 200200 4000000000000000
ffffffffffffffff 0 0 0
ffff8100080055e8 ffffffff805ba370 4000000000000000 ffffffffffffffff

Bad page state at prep_new_page (in process 'rename14', page ffff810008005550)
flags:0x4000005500009090 mapping:0000000000000000 mapcount:0 count:0
Backtrace:

Call Trace:<ffffffff80150388>{bad_page+115} <ffffffff80150bf0>{buffered_rmqueue+501}
<ffffffff8017e2da>{alloc_inode+18} <ffffffff80150de7>{__alloc_pages+251}
<ffffffff801535f3>{cache_alloc_refill+581} <ffffffff8015388f>{kmem_cache_alloc+44}
<ffffffff80169440>{get_empty_filp+71} <ffffffff801670af>{filp_open+49}
<ffffffff80166e28>{get_unused_fd+98} <ffffffff8016713d>{do_sys_open+59}
<ffffffff8010e636>{system_call+126}
Trying to fix it up, but a reboot is needed
page = ffff81000800aaa0
4000000000000000 ffffffff00900055 0 0
0 100100 200200 4000000000000000
ffffffffffffffff 0 0 0
ffff81000800ab38 ffffffff805ba370 4000000000000000 ffffffffffffffff
Bad page state at prep_new_page (in process 'rename14', page ffff81000800aaa0)
flags:0x4000000000000000 mapping:0000000000000000 mapcount:0 count:9437270
Backtrace:

Call Trace:<ffffffff80150388>{bad_page+115} <ffffffff80150bf0>{buffered_rmqueue+501}
<ffffffff80150de7>{__alloc_pages+251} <ffffffff801535f3>{cache_alloc_refill+581}
<ffffffff8015388f>{kmem_cache_alloc+44} <ffffffff8017cd95>{d_alloc+33}
<ffffffff801750be>{__lookup_hash+206} <ffffffff8017677b>{open_namei+276}
<ffffffff801670ce>{filp_open+80} <ffffffff80166e28>{get_unused_fd+98}
<ffffffff8016713d>{do_sys_open+59} <ffffffff8010e636>{system_call+1

I had set PG_slab to 20, so it is not necessarily a corrupted slab
page.

Thanks
Dipankar

2005-11-03 22:00:12

by Hugh Dickins

Subject: Re: bad page state under possibly oom situation

On Fri, 4 Nov 2005, Dipankar Sarma wrote:
> On Thu, Nov 03, 2005 at 07:38:09AM +0100, Manfred Spraul wrote:
> > Hugh Dickins wrote:
> > > flags 0x90 mean PG_slab|PG_dirty.)
> > A very odd combination:
> > Or change the value of PG_slab to 20
> > and check if page->flags remains 0x90.

Good observation and suggestion from Manfred.

> Here is a dump of the page struct when this happens (two different
> instances) -

Yuk!

page ffff810008005550 flags 4000005500009090, but the rest is okay.
page ffff81000800aaa0 count 00900055, but the rest is okay (including
mapcount, the int above count).
Even the page addresses seem related.

You seem to be infected with 55s and 90s,
nothing to do with PG_slab|PG_dirty; but I don't know what it means.
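
The arithmetic behind those two observations can be checked with a short
sketch. Assumptions, inferred from the dumps rather than taken from the
source tree: the second 64-bit word of struct page holds the count in its
low 32 bits and the mapcount in its high 32 bits (little-endian x86_64),
both are stored biased by -1 (an all-ones word is reported as count:0
mapcount:0 in the first dump), and 0x4000000000000000 is the flags value
of an untouched page, as seen on the neighbouring pages. The function
names are made up for this illustration.

```c
#include <stdint.h>

/* Flags value carried by the healthy neighbouring pages in both dumps. */
#define EXPECTED_FLAGS 0x4000000000000000ULL

/* Bits that differ from what an untouched page shows. */
static uint64_t stray_flag_bits(uint64_t flags)
{
    return flags ^ EXPECTED_FLAGS;
}

/* Low half of the second word, plus the assumed bias of 1. */
static int32_t reported_count(uint64_t word)
{
    return (int32_t)((uint32_t)word + 1u);
}

/* High half of the second word, plus the assumed bias of 1. */
static int32_t reported_mapcount(uint64_t word)
{
    return (int32_t)((uint32_t)(word >> 32) + 1u);
}
```

For the first page, stray_flag_bits(0x4000005500009090) leaves
0x5500009090. For the second, the word ffffffff00900055 gives a count of
0x00900055 + 1 = 9437270 and a mapcount of 0, matching the
"count:9437270" and "mapcount:0" in the report.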

Hugh