2006-08-02 02:16:20

by Dave Jones

[permalink] [raw]
Subject: frequent slab corruption (since a long time)

Every so often, I see a slab corruption bug reported against
the Fedora kernels (going back as far as 2.6.11), and it's
still plagueing us.

It seems to have turned up in a number of different scenarios,
which makes it all the more complicated, but the footprint is
always the same. We write ffffffff00000000 to freed memory.
All the example cases seen so far have been on 32-bit x86.

Anyone have any clues where that value could be coming from?

There's a collection of corruption reports at
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=160878

Dave

--
http://www.codemonkey.org.uk


2006-08-02 02:34:33

by Roland Dreier

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

> There's a collection of corruption reports at
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=160878

I notice that the first few reports (kernels <= 2.6.14) have len=4096,
while the later reports (kernels >= 2.6.16) have len=2048. So
assuming it's the same bug, the use after free has moved from a
4096-byte slab to a 2048-byte slab. Were there ny data structures we
shrank between 2.6.14 and 2.6.16?

2006-08-02 03:37:30

by Andi Kleen

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

Dave Jones <[email protected]> writes:

> Every so often, I see a slab corruption bug reported against
> the Fedora kernels (going back as far as 2.6.11), and it's
> still plagueing us.
>
> It seems to have turned up in a number of different scenarios,
> which makes it all the more complicated, but the footprint is
> always the same. We write ffffffff00000000 to freed memory.

DEBUG_PAGEALLOC + a small slab patch to force the 2k slab to be
only a single object per page (so that a kfree() immediately
triggers an unmap) would catch it I guess.

-Andi

2006-08-02 04:22:09

by Dave Jones

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

On Wed, Aug 02, 2006 at 05:37:28AM +0200, Andi Kleen wrote:
> Dave Jones <[email protected]> writes:
>
> > Every so often, I see a slab corruption bug reported against
> > the Fedora kernels (going back as far as 2.6.11), and it's
> > still plagueing us.
> >
> > It seems to have turned up in a number of different scenarios,
> > which makes it all the more complicated, but the footprint is
> > always the same. We write ffffffff00000000 to freed memory.
>
> DEBUG_PAGEALLOC + a small slab patch to force the 2k slab to be
> only a single object per page (so that a kfree() immediately
> triggers an unmap) would catch it I guess.

Problem with that approach is that DEBUG_PAGEALLOC makes things
so damned slow that it's pretty much unusable, and this bug
doesn't seem to want to repeat itself to order, so I doubt
many people would put up with the slowdown long enough to chase it down.

Dave

--
http://www.codemonkey.org.uk

2006-08-02 04:37:10

by Andi Kleen

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

On Wednesday 02 August 2006 06:22, Dave Jones wrote:

> Problem with that approach is that DEBUG_PAGEALLOC makes things
> so damned slow that it's pretty much unusable, and this bug
> doesn't seem to want to repeat itself to order, so I doubt
> many people would put up with the slowdown long enough to chase it down.

Really? It shouldn't be that much slower in theory. Do you
have numbers?

If it's a big problem it could probably be made faster by batching
the TLB flushes more.

-Andi

2006-08-02 04:46:49

by Dave Jones

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

On Wed, Aug 02, 2006 at 06:35:00AM +0200, Andi Kleen wrote:
> On Wednesday 02 August 2006 06:22, Dave Jones wrote:
>
> > Problem with that approach is that DEBUG_PAGEALLOC makes things
> > so damned slow that it's pretty much unusable, and this bug
> > doesn't seem to want to repeat itself to order, so I doubt
> > many people would put up with the slowdown long enough to chase it down.
>
> Really? It shouldn't be that much slower in theory. Do you
> have numbers?

You need slower boxes :-)

Every time I enable it to try and diagnose a bug in the Fedora kernel
I get a flood of "hey what gives, everything got slow" emails.
That speaks louder than any numbers to me.

It could be less of an issue on modern CPUs than it used to be, but
it has been painful enough in the past that I've really only enabled
it when I've been desperately trying to chase something down.

> If it's a big problem it could probably be made faster by batching
> the TLB flushes more.

Maybe, though that gives me the creeps a little for some reason.

Dave

--
http://www.codemonkey.org.uk

2006-08-02 05:05:33

by David Miller

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

From: Dave Jones <[email protected]>
Date: Tue, 1 Aug 2006 22:16:17 -0400

> Anyone have any clues where that value could be coming from?

Some of the dumps in there looks like ethernet headers. For example,
in comment #7:

000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 00 11 85 6a 0f ef 00 e0 52 cf 6c 00 08 00
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That looks like an ethernet header for an IPv4 packet to ethernet
destination MAC 00:11:85:6a:0f:ef from ethernet source MAC
00:e0:52:cf:6c:00

But this chunk is OK since we are looking at a neighbouring
INUSE object.

The reocurring corruption:

0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ff ff ff ff
0c0: 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

is less identifyable. Although tg3_alloc_rx_skb() shows up
consistently in the neighbouring objects, the actually
corrupted piece is marked by release_mem() which is the
TTY layer.

The corruption is always a 32-bit 0xffffffff followed by
a 32-bit 0x00000000, 12 bytes into the object.

If it's sitting next to RX ethernet packets, it's probably
something in the vacinity of 1500+ bytes in size as that's
how large the RX skb data areas will be that the tg3 driver
allocates unless it is in Jumbo MTU mode.

By default, do_tty_write() will use a chunk size of 2048 for the write
buffer, tty->write_buf, and this is freed up as part of release_mem()
processing.

Another possibility, is the struct tty_struct itself since that is a
sizable structure too. And this theory is supported by
alloc_tty_struct() being in some of the triggering backtraces.

Perhaps a TTY refcounting problem or race condition of some sort.

What is 12-bytes into the tty_struct on x86? The struct tty_ldisc,
"ldisc" is. Oddly enough, this doesn't match up, since we'd expect
TTY_LDISC_MAGIC (0x5403) and instead we see 0xffffffff there.
Also, after the magic, we'd expect the address of the "n_tty"
string in tty_ldisc_N_TTY to show up in the next word, instead
we find NULL (0x00000000) there.

Something is clobbering the ldisc member of this free'd tty_struct is
seems. Maybe there is some problem with ldisc refcounting.

2006-08-02 05:31:04

by David Miller

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

From: David Miller <[email protected]>
Date: Tue, 01 Aug 2006 22:05:38 -0700 (PDT)

> The corruption is always a 32-bit 0xffffffff followed by
> a 32-bit 0x00000000, 12 bytes into the object.

This analysis is wrong, it's "0xb0 + 12" bytes into the object
which is 188 bytes. For x86, this lands us at the "count"
member of the tty_struct, and it shows that the tty count
has decremented to -1 (0xffffffff) which is a serious bug.

Note also that the ws_row and ws_col fields of the tty->winsize are
next and has been zero'd out (0x00000000) by the corruption.

There aren't too many places that write to tty->winsize,
the most notable is tiocswinsz(), called from tty_ioctl.

For a PTY with sub-type PTY_TYPE_MASTE, which is very likely
the case we have here, it assigns the user provided winsize
to both tty->winsize and tty->link->winsz (via the real_tty)
argument to tiocswinsz().

Actually, there is one other notable spot that writes to
tty->winsize, drivers/char/vt.c:vc_resize(), which copies
it to vc->vc_tty->winsize.

Perhaps this is a clue... but wait there is a better clue.

con_open() sets values there, and in particular it only
assigns the ws_row and ws_col members, which matches up
with the fact that we see only the first two members of
tty->winsize spammed to zero.

If the guilty code path involved vc_resize() or tiocswinsz()
we'd see 8 bytes, not 4 bytes, of the tty->winsize set to
something other than the SLAB free poison chars.

So it seems, from this perspective, that con_open()'s assignments
which are causing the corruption. This is backed up by the trace
found in bugzilla #200928 being early during bootup.

But even this doesn't add up because con_open() has to be seeing
zero's there at the time it runs, not the poison values, since it's
assignments are guarded like this:

if (!tty->winsize.ws_row && !tty->winsize.ws_col) {
tty->winsize.ws_row = vc_cons[currcons].d->vc_rows;
tty->winsize.ws_col = vc_cons[currcons].d->vc_cols;
}

Hmmm...

2006-08-02 22:23:31

by Dave Jones

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

On Tue, Aug 01, 2006 at 10:31:10PM -0700, David Miller wrote:
> From: David Miller <[email protected]>
> Date: Tue, 01 Aug 2006 22:05:38 -0700 (PDT)
>
> > The corruption is always a 32-bit 0xffffffff followed by
> > a 32-bit 0x00000000, 12 bytes into the object.
>
> This analysis is wrong, it's "0xb0 + 12" bytes into the object
> which is 188 bytes. For x86, this lands us at the "count"
> member of the tty_struct, and it shows that the tty count
> has decremented to -1 (0xffffffff) which is a serious bug.

<stabbing in the dark here>

None of the code manipulating tty->count seems to be under
the tty_mutex. Should it be ?
Or is this protected through some other means?

Jason, ISTR you've done some digging around this area wrt races,
any insight ?

Dave

--
http://www.codemonkey.org.uk

2006-08-02 22:49:58

by David Miller

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

From: Dave Jones <[email protected]>
Date: Wed, 2 Aug 2006 18:23:21 -0400

> None of the code manipulating tty->count seems to be under
> the tty_mutex. Should it be ?
> Or is this protected through some other means?

It is in the primary code paths at least, all callers of init_dev()
(which increments tty->count) grab the mutex and also release_dev()
grabs the mutex around tty->count manipulations.

I'm surprised that when this triggers we don't get one of these
two messages:

if (pty_master) {
if (--o_tty->count < 0) {
printk(KERN_WARNING "release_dev: bad pty slave count "
"(%d) for %s\n",
o_tty->count, tty_name(o_tty, buf));
o_tty->count = 0;
}
}
if (--tty->count < 0) {
printk(KERN_WARNING "release_dev: bad tty->count (%d) for %s\n",
tty->count, tty_name(tty, buf));
tty->count = 0;
}

However, there seems to be some kind of dependency of TTY opennings
holding the BKL, as least as far as this comment on con_close() is
concerned:

/*
* tty_mutex is released, but we still hold BKL, so there is
* still exclusion against init_dev()
*/

But it is not clear to me that tty_open() and ptmx_open() always run
with the BKL held. chrdev_open() wraps the ->open call with the BKL
held, but then it plugs in the device's fops which should allow a
direct filp->fops->open() call from the VFS layer without the BKL
grabbing right?

chrdev_open() should catch /dev/foo char device opens, but what about
the sysfs instances? They might bypass this path too somehow, thus
another case where the BKL won't be held on open().

Hmmm...

2006-08-02 22:55:11

by Alan

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

Ar Mer, 2006-08-02 am 18:23 -0400, ysgrifennodd Dave Jones:
> None of the code manipulating tty->count seems to be under
> the tty_mutex. Should it be ?
> Or is this protected through some other means?

Its old lock_kernel() code so its a prime suspect for most offences.

Given the age of the reports it appears that the tty buffering changes
are too new. The ldisc locking changes are a candidate but shouldn't be
doing anything that breaks the kernel lock stuff. Will look tomorrow see
if anything strikes me about that lot.

The tty_mutex primarily protected current->signal.tty (except if you are
using SELinux which has some extremely dubious looking code for tty
handling). Someone ought to review flush_unauthorized_files() although
it doesn't fit this problem.

Alan

2006-08-03 17:21:50

by Alan

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

Ar Mer, 2006-08-02 am 15:49 -0700, ysgrifennodd David Miller:
> > None of the code manipulating tty->count seems to be under
> > the tty_mutex. Should it be ?
> > Or is this protected through some other means?
>
> It is in the primary code paths at least, all callers of init_dev()
> (which increments tty->count) grab the mutex and also release_dev()
> grabs the mutex around tty->count manipulations.

I've been auditing tty code and its joyously bad but only in harmless
places so far except for one.

init_dev (and caller) relies on tty_mutex to ensure that the
driver->ttys list is protected from things going away.

release_mem() removes stuff from the said list and frees memory. It is
not always called under tty_mutex and that appears very dubious to me at
the moment although tty->closing and the BKL *might* be sufficient.

Alan

2006-08-03 17:56:40

by Dave Jones

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

On Thu, Aug 03, 2006 at 06:40:42PM +0100, Alan Cox wrote:
> Ar Mer, 2006-08-02 am 15:49 -0700, ysgrifennodd David Miller:
> > > None of the code manipulating tty->count seems to be under
> > > the tty_mutex. Should it be ?
> > > Or is this protected through some other means?
> >
> > It is in the primary code paths at least, all callers of init_dev()
> > (which increments tty->count) grab the mutex and also release_dev()
> > grabs the mutex around tty->count manipulations.
>
> I've been auditing tty code and its joyously bad but only in harmless
> places so far except for one.
>
> init_dev (and caller) relies on tty_mutex to ensure that the
> driver->ttys list is protected from things going away.
>
> release_mem() removes stuff from the said list and frees memory. It is
> not always called under tty_mutex and that appears very dubious to me at
> the moment although tty->closing and the BKL *might* be sufficient.

Against my better judgment I was poring over that code until the wee
hours last night, and one thing crossed my mind re: the assumptions made
about the BKL in that subsystem. Now that the BKL is preemtible, do
any of those assumptions break ?

Dave


--
http://www.codemonkey.org.uk

2006-08-03 20:29:42

by Alan

[permalink] [raw]
Subject: Re: frequent slab corruption (since a long time)

Ar Iau, 2006-08-03 am 13:56 -0400, ysgrifennodd Dave Jones:
> Against my better judgment I was poring over that code until the wee
> hours last night, and one thing crossed my mind re: the assumptions made
> about the BKL in that subsystem. Now that the BKL is preemtible, do
> any of those assumptions break ?

>From the walking of the code so far I think quite a few of them were
already broken. I'm still annotating and I hope by the end of this
evening I'll have a patch to post that documents the current state of
affairs better, including some gems 8)