2009-06-25 20:38:06

by Theodore Ts'o

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, Jun 25, 2009 at 01:18:59PM -0700, David Rientjes wrote:
> Isn't there also a problem in jbd2_journal_write_metadata_buffer(),
> though?
>
> tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
...
> memcpy(tmp, mapped_data + new_offset, jh2bh(jh_in)->b_size);
>
> jbd2_alloc() is just a wrapper to __get_free_pages() and if it fails, it
> appears as though the memcpy() would cause a NULL pointer.

Nicely spotted. Yeah, that's a bug; we need to do something about
that one, too. And what we're doing is a bit silly; it may make sense
to use __get_free_pages if filesystem blocksize == PAGE_SIZE, but
otherwise we should be using a sub-page allocator. Right now, we're
chewing up a 16k PPC page for every 4k filesystem metadata page
allocated in journal_write_metadata_buffer(), and on x86, for the
(admittedly uncommon) 1k block filesystem, we'd be chewing up a 4k
page for a 1k block buffer.


Both of these problems exist for both ext3 and ext4.

- Ted


2009-06-25 21:07:57

by Joel Becker

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, Jun 25, 2009 at 04:37:43PM -0400, Theodore Tso wrote:
> Nicely spotted. Yeah, that's a bug; we need to do something about
> that one, too. And what we're doing is a bit silly; it may make sense
> to use __get_free_pages if filesystem blocksize == PAGE_SIZE, but
> otherwise we should be using a sub-page allocator. Right now, we're
> chewing up a 16k PPC page for every 4k filesystem metadata page
> allocated in journal_write_metadata_buffer(), and on x86, for the
> (admittedly uncommon) 1k block filesystem, we'd be chewing up a 4k
> page for a 1k block buffer.
>
>
> Both of these problems exist for both ext3 and ext4.

And ocfs2.

Joel

--

Life's Little Instruction Book #337

"Reread your favorite book."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-06-25 21:26:48

by Andreas Dilger

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Jun 25, 2009 16:37 -0400, Theodore Ts'o wrote:
> On Thu, Jun 25, 2009 at 01:18:59PM -0700, David Rientjes wrote:
> > Isn't there also a problem in jbd2_journal_write_metadata_buffer(),
> > though?
> >
> > tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
> ...
> > memcpy(tmp, mapped_data + new_offset, jh2bh(jh_in)->b_size);
> >
> > jbd2_alloc() is just a wrapper to __get_free_pages() and if it fails, it
> > appears as though the memcpy() would cause a NULL pointer.
>
> Nicely spotted. Yeah, that's a bug; we need to do something about
> that one, too.

IIRC, in the past, jbd_alloc() had a retry mechanism that would loop
indefinitely for some allocations, because they couldn't be aborted
easily. This was removed for some reason, I'm not sure why.

> And what we're doing is a bit silly; it may make sense
> to use __get_free_pages if filesystem blocksize == PAGE_SIZE, but
> otherwise we should be using a sub-page allocator. Right now, we're
> chewing up a 16k PPC page for every 4k filesystem metadata page
> allocated in journal_write_metadata_buffer(), and on x86, for the
> (admittedly uncommon) 1k block filesystem, we'd be chewing up a 4k
> page for a 1k block buffer.

IIRC there was also a good reason for this in the past, related to
the buffers being submitted to the block device layer, and if they
were allocated from the slab cache with CONFIG_DEBUG_SLAB or something
similar enabled the buffer would be misaligned and cause grief.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-06-25 22:05:27

by Theodore Ts'o

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, Jun 25, 2009 at 11:26:28PM +0200, Andreas Dilger wrote:
> IIRC there was also a good reason for this in the past, related to
> the buffers being submitted to the block device layer, and if they
> were allocated from the slab cache with CONFIG_DEBUG_SLAB or something
> similar enabled the buffer would be misaligned and cause grief.

So what does SLAB/SLUB/SLOB do if we create a slab cache which is a
power of two? Can one of the allocators still return misaligned
blocks of memory in some circumstances?

- Ted

2009-06-25 22:12:19

by Eric Sandeen

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

Theodore Tso wrote:
> On Thu, Jun 25, 2009 at 11:26:28PM +0200, Andreas Dilger wrote:
>> IIRC there was also a good reason for this in the past, related to
>> the buffers being submitted to the block device layer, and if they
>> were allocated from the slab cache with CONFIG_DEBUG_SLAB or something
>> similar enabled the buffer would be misaligned and cause grief.
>
> So what does SLAB/SLUB/SLOB do if we create a slab cache which is a
> power of two? Can one of the allocators still return misaligned
> blocks of memory in some circumstances?
>
> - Ted

ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
with SLUB + slub debug, that gave back non-aligned memory, causing
eventual corruption ...

-Eric


2009-06-26 01:12:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, Jun 25, 2009 at 05:11:25PM -0500, Eric Sandeen wrote:
>
> ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
> with SLUB + slub debug, that gave back non-aligned memory, causing
> eventual corruption ...

Grumble. Any chance we could add an kmem_cache option which requires
the memory to be aligned? Otherwise we could rewrite our own sub-page
allocator in ext4 that only handled aligned filesystem block sizes
(i.e., 1k, 2k, 4k, etc.) but that would be really silly and be extra
code that really should be done once at core functionality.

- Ted

2009-06-26 05:16:17

by Pekka Enberg

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

Hi Ted,

On Thu, Jun 25, 2009 at 05:11:25PM -0500, Eric Sandeen wrote:
> > ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
> > with SLUB + slub debug, that gave back non-aligned memory, causing
> > eventual corruption ...

On Thu, 25 Jun 2009, Theodore Tso wrote:
> Grumble. Any chance we could add an kmem_cache option which requires
> the memory to be aligned? Otherwise we could rewrite our own sub-page
> allocator in ext4 that only handled aligned filesystem block sizes
> (i.e., 1k, 2k, 4k, etc.) but that would be really silly and be extra
> code that really should be done once at core functionality.

We alredy have SLAB_HW_ALIGN but I wonder if this is a plain old bug in
SLUB. Christoph, Nick, don't we need something like this in the allocator?
Eric, does this fix your case?

Pekka

diff --git a/mm/slub.c b/mm/slub.c
index 819f056..7cd1e69 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2400,7 +2400,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
* user specified and the dynamic determination of cache line size
* on bootup.
*/
- align = calculate_alignment(flags, align, s->objsize);
+ align = calculate_alignment(flags, align, size);

/*
* SLUB stores one object immediately after another beginning from

2009-06-26 08:56:27

by Nick Piggin

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Fri, Jun 26, 2009 at 08:16:18AM +0300, Pekka Enberg wrote:
> Hi Ted,
>
> On Thu, Jun 25, 2009 at 05:11:25PM -0500, Eric Sandeen wrote:
> > > ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
> > > with SLUB + slub debug, that gave back non-aligned memory, causing
> > > eventual corruption ...
>
> On Thu, 25 Jun 2009, Theodore Tso wrote:
> > Grumble. Any chance we could add an kmem_cache option which requires
> > the memory to be aligned? Otherwise we could rewrite our own sub-page
> > allocator in ext4 that only handled aligned filesystem block sizes
> > (i.e., 1k, 2k, 4k, etc.) but that would be really silly and be extra
> > code that really should be done once at core functionality.
>
> We alredy have SLAB_HW_ALIGN but I wonder if this is a plain old bug in
> SLUB. Christoph, Nick, don't we need something like this in the allocator?
> Eric, does this fix your case?

Well I don't understand Ted's complaint. kmem_cache_create takes an
alignment argument.


2009-06-26 08:57:59

by Pekka Enberg

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Fri, 2009-06-26 at 10:56 +0200, Nick Piggin wrote:
> On Fri, Jun 26, 2009 at 08:16:18AM +0300, Pekka Enberg wrote:
> > Hi Ted,
> >
> > On Thu, Jun 25, 2009 at 05:11:25PM -0500, Eric Sandeen wrote:
> > > > ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
> > > > with SLUB + slub debug, that gave back non-aligned memory, causing
> > > > eventual corruption ...
> >
> > On Thu, 25 Jun 2009, Theodore Tso wrote:
> > > Grumble. Any chance we could add an kmem_cache option which requires
> > > the memory to be aligned? Otherwise we could rewrite our own sub-page
> > > allocator in ext4 that only handled aligned filesystem block sizes
> > > (i.e., 1k, 2k, 4k, etc.) but that would be really silly and be extra
> > > code that really should be done once at core functionality.
> >
> > We alredy have SLAB_HW_ALIGN but I wonder if this is a plain old bug in
> > SLUB. Christoph, Nick, don't we need something like this in the allocator?
> > Eric, does this fix your case?
>
> Well I don't understand Ted's complaint. kmem_cache_create takes an
> alignment argument.

Yes, but AFAICT, SLUB_DEBUG doesn't respect the given alignment without
my patch.

Pekka


2009-06-26 09:07:39

by Nick Piggin

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Fri, Jun 26, 2009 at 11:58:01AM +0300, Pekka Enberg wrote:
> On Fri, 2009-06-26 at 10:56 +0200, Nick Piggin wrote:
> > On Fri, Jun 26, 2009 at 08:16:18AM +0300, Pekka Enberg wrote:
> > > Hi Ted,
> > >
> > > On Thu, Jun 25, 2009 at 05:11:25PM -0500, Eric Sandeen wrote:
> > > > > ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
> > > > > with SLUB + slub debug, that gave back non-aligned memory, causing
> > > > > eventual corruption ...
> > >
> > > On Thu, 25 Jun 2009, Theodore Tso wrote:
> > > > Grumble. Any chance we could add an kmem_cache option which requires
> > > > the memory to be aligned? Otherwise we could rewrite our own sub-page
> > > > allocator in ext4 that only handled aligned filesystem block sizes
> > > > (i.e., 1k, 2k, 4k, etc.) but that would be really silly and be extra
> > > > code that really should be done once at core functionality.
> > >
> > > We alredy have SLAB_HW_ALIGN but I wonder if this is a plain old bug in
> > > SLUB. Christoph, Nick, don't we need something like this in the allocator?
> > > Eric, does this fix your case?
> >
> > Well I don't understand Ted's complaint. kmem_cache_create takes an
> > alignment argument.
>
> Yes, but AFAICT, SLUB_DEBUG doesn't respect the given alignment without
> my patch.

Well your patch doesn't hurt (doesn't seem to make much sense to
use objsize rather than size when calculating alignment). But I
think alignment should still be calculated OK: calculate_alignemnt
will always return >= align, and so the next statement will
round up size to the correct alignment AFAICT.

2009-06-26 14:43:56

by Eric Sandeen

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

Pekka J Enberg wrote:
> Hi Ted,
>
> On Thu, Jun 25, 2009 at 05:11:25PM -0500, Eric Sandeen wrote:
>>> ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
>>> with SLUB + slub debug, that gave back non-aligned memory, causing
>>> eventual corruption ...
>
> On Thu, 25 Jun 2009, Theodore Tso wrote:
>> Grumble. Any chance we could add an kmem_cache option which requires
>> the memory to be aligned? Otherwise we could rewrite our own sub-page
>> allocator in ext4 that only handled aligned filesystem block sizes
>> (i.e., 1k, 2k, 4k, etc.) but that would be really silly and be extra
>> code that really should be done once at core functionality.
>
> We alredy have SLAB_HW_ALIGN but I wonder if this is a plain old bug in
> SLUB. Christoph, Nick, don't we need something like this in the allocator?
> Eric, does this fix your case?

I'll test it; it'd be great if it did, I'm um, a bit ashamed at how I
fixed it ;)

Thanks!
-Eric

> Pekka
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 819f056..7cd1e69 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2400,7 +2400,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
> * user specified and the dynamic determination of cache line size
> * on bootup.
> */
> - align = calculate_alignment(flags, align, s->objsize);
> + align = calculate_alignment(flags, align, size);
>
> /*
> * SLUB stores one object immediately after another beginning from


2009-06-29 21:06:37

by Christoph Lameter

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Fri, 26 Jun 2009, Pekka Enberg wrote:

> Yes, but AFAICT, SLUB_DEBUG doesn't respect the given alignment without
> my patch.

The size parameter to calculate_alignment has no relevance unless the size
is smaller than half of the alignment. Then for some reason we do not
properly align the stuff. I never liked that....

2009-06-29 21:15:36

by Christoph Lameter

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Fri, 26 Jun 2009, Eric Sandeen wrote:

> I'll test it; it'd be great if it did, I'm um, a bit ashamed at how I
> fixed it ;)

Isnt the problem that the slab allocator did not align something that
you did not request to be aligned?

2009-06-29 21:21:07

by Eric Sandeen

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

Christoph Lameter wrote:
> On Fri, 26 Jun 2009, Eric Sandeen wrote:
>
>> I'll test it; it'd be great if it did, I'm um, a bit ashamed at how I
>> fixed it ;)
>
> Isnt the problem that the slab allocator did not align something that
> you did not request to be aligned?

Well, maybe that's all I needed. At the risk of exposing myself to
ridicule, here's the commit:

7fcba054373d5dfc43d26e243a5c9b92069972ee

Was there a better way?

-Eric

2009-06-29 22:35:49

by Christoph Lameter

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Mon, 29 Jun 2009, Eric Sandeen wrote:

> > Isnt the problem that the slab allocator did not align something that
> > you did not request to be aligned?
>
> Well, maybe that's all I needed. At the risk of exposing myself to
> ridicule, here's the commit:
>
> 7fcba054373d5dfc43d26e243a5c9b92069972ee
>
> Was there a better way?

No. Good patch. Indeed the right way. Page allocations come from the page
allocator not from the slab allocator (yes yes there are still issues with
the performance of the page allocator that sometimes dictate the use of
the slabs but in general I think this is what is needed).

2009-06-30 07:59:10

by Nick Piggin

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Mon, Jun 29, 2009 at 05:06:37PM -0400, Christoph Lameter wrote:
> On Fri, 26 Jun 2009, Pekka Enberg wrote:
>
> > Yes, but AFAICT, SLUB_DEBUG doesn't respect the given alignment without
> > my patch.
>
> The size parameter to calculate_alignment has no relevance unless the size
> is smaller than half of the alignment. Then for some reason we do not
> properly align the stuff. I never liked that....

It is properly aligned, if it was not, then it would be a bug. It
should always align to *at least* the requested alignment.

SLAB_HWCACHE_ALIGN is defined to align the object such that it occupies
the minimal number of cache lines. This is not subjective thing to
like or dislike, but just fact because that is how it was implemented
from the start and that is what callers expect.

If you want to minimise false sharing, then normal way would be to
request internode cacheline alignment, right? This is very widely
used.