2006-05-07 23:20:32

by Daniel Hokka Zakrisson

Subject: [PATCH] fs: fcntl_setlease defies lease_init assumptions

fcntl_setlease uses a struct file_lock on the stack to find the
file_lock it's actually looking for. The problem with this approach is
that lease_init will attempt to free the file_lock if the arg argument
is invalid, causing kmem_cache_free to be called with the address of the
on-stack file_lock.
After running the following test-case, it doesn't take long for an i686
machine running 2.6.16.13 to stop responding completely.
2.6.17-rc3-git12 shows similar results, although it takes a bit more
activity before crashing.

Björn Steinbrink also identified a slab leak in the same piece of code,
caused by the fasync_helper call, which will allocate a new slab object
every time because it uses the wrong structure (the one on the stack)
when a lease on that file already exists.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void die(const char *msg)
{
	perror(msg);
	exit(1);
}

int main(int argc, char *argv[])
{
	int fd;

	if (argc == 1) {
		fprintf(stderr, "Usage: %s file\n", argv[0]);
		exit(1);
	}
	fd = open(argv[1], O_RDONLY);
	if (fd == -1)
		die("open()");
	if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)
		die("setlease(RDLCK)");
	if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)
		die("setlease(RDLCK2)");
	if (fcntl(fd, F_SETLEASE, 5) == -1)
		die("setlease(invalid)");
	close(fd);
	return 0;
}

Signed-off-by: Daniel Hokka Zakrisson <[email protected]>

---
--- a/fs/locks.c 2006-05-07 00:29:26.000000000 +0200
+++ b/fs/locks.c 2006-05-07 23:29:12.000000000 +0200
@@ -1363,6 +1363,7 @@ static int __setlease(struct file *filp,
goto out;

if (my_before != NULL) {
+ *flp = *my_before;
error = lease->fl_lmops->fl_change(my_before, arg);
goto out;
}
@@ -1433,7 +1434,7 @@ EXPORT_SYMBOL(setlease);
*/
int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
{
- struct file_lock fl, *flp = &fl;
+ struct file_lock *fl, *flp;
struct dentry *dentry = filp->f_dentry;
struct inode *inode = dentry->d_inode;
int error;
@@ -1446,14 +1447,15 @@ int fcntl_setlease(unsigned int fd, stru
if (error)
return error;

- locks_init_lock(&fl);
- error = lease_init(filp, arg, &fl);
+ error = lease_alloc(filp, arg, &fl);
if (error)
return error;
+ flp = fl;

lock_kernel();

error = __setlease(filp, arg, &flp);
+ locks_free_lock(fl);
if (error || arg == F_UNLCK)
goto out_unlock;



2006-05-08 03:34:41

by Linus Torvalds

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions



On Sun, 7 May 2006, Linus Torvalds wrote:
>
> Trond wrote an alternate patch for the actual problem which I sent
> separately, but it would probably be good to have more safety in the slab
> layer by default regardless.

And here's Trond's suggested locks.c fix.

Linus

---
From: Trond Myklebust <[email protected]>
Subject: fs/locks.c: Fix lease_init
Date: Sun, 07 May 2006 23:02:42 -0400

It is insane to be giving lease_init() the task of freeing the lock it is
supposed to initialise, given that the lock is not guaranteed to be
allocated on the stack. This causes lockups in fcntl_setlease().
Problem diagnosed by Daniel Hokka Zakrisson <[email protected]>.

Also fix a slab leak in __setlease() due to an uninitialised return value.
Problem diagnosed by Björn Steinbrink.

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/locks.c | 21 ++++++++++++---------
1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index abd9448..64b96b1 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -446,15 +446,14 @@ static struct lock_manager_operations le
*/
static int lease_init(struct file *filp, int type, struct file_lock *fl)
{
+ if (assign_type(fl, type) != 0)
+ return -EINVAL;
+
fl->fl_owner = current->files;
fl->fl_pid = current->tgid;

fl->fl_file = filp;
fl->fl_flags = FL_LEASE;
- if (assign_type(fl, type) != 0) {
- locks_free_lock(fl);
- return -EINVAL;
- }
fl->fl_start = 0;
fl->fl_end = OFFSET_MAX;
fl->fl_ops = NULL;
@@ -466,16 +465,19 @@ static int lease_init(struct file *filp,
static int lease_alloc(struct file *filp, int type, struct file_lock **flp)
{
struct file_lock *fl = locks_alloc_lock();
- int error;
+ int error = -ENOMEM;

if (fl == NULL)
- return -ENOMEM;
+ goto out;

error = lease_init(filp, type, fl);
- if (error)
- return error;
+ if (error) {
+ locks_free_lock(fl);
+ fl = NULL;
+ }
+out:
*flp = fl;
- return 0;
+ return error;
}

/* Check if two locks overlap each other.
@@ -1391,6 +1393,7 @@ static int __setlease(struct file *filp,
goto out;

if (my_before != NULL) {
+ *flp = *my_before;
error = lease->fl_lmops->fl_change(my_before, arg);
goto out;
}

2006-05-08 03:34:06

by Linus Torvalds

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions



On Mon, 8 May 2006, Daniel Hokka Zakrisson wrote:
>
> fcntl_setlease uses a struct file_lock on the stack to find the
> file_lock it's actually looking for. The problem with this approach is
> that lease_init will attempt to free the file_lock if the arg argument
> is invalid, causing kmem_cache_free to be called with the address of the
> on-stack file_lock.

Ok, I was actually really surprised that we'd ever allow a non-slab page
to be free'd as a slab or kmalloc allocation, without screaming very
loudly indeed. That implies a lack of some pretty fundamental sanity
checking by default in the slab layer (I suspect slab debugging turns it
on, but even without it, that's just nasty).

Can you see if this trivial patch at least causes a honking huge
"kernel BUG" message to be triggered quickly?

Trond wrote an alternate patch for the actual problem which I sent
separately, but it would probably be good to have more safety in the slab
layer by default regardless.

Linus

---
diff --git a/mm/slab.c b/mm/slab.c
index c32af7e..279ca59 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -597,6 +597,7 @@ static inline struct kmem_cache *page_ge
{
if (unlikely(PageCompound(page)))
page = (struct page *)page_private(page);
+ BUG_ON(!PageSlab(page));
return (struct kmem_cache *)page->lru.next;
}

@@ -609,6 +610,7 @@ static inline struct slab *page_get_slab
{
if (unlikely(PageCompound(page)))
page = (struct page *)page_private(page);
+ BUG_ON(!PageSlab(page));
return (struct slab *)page->lru.prev;
}

2006-05-08 07:57:12

by Daniel Hokka Zakrisson

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

Linus Torvalds wrote:
> Ok, I was actually really surprised that we'd ever allow a non-slab page
> to be free'd as a slab or kmalloc allocation, without screaming very
> loudly indeed. That implies a lack of some pretty fundamental sanity
> checking by default in the slab layer (I suspect slab debugging turns it
> on, but even without it, that's just nasty).
>
> Can you see if this trivial patch at least causes a honking huge
> "kernel BUG" message to be triggered quickly?

It doesn't, at least not before the machine becomes unresponsive.

--
Daniel Hokka Zakrisson

2006-05-08 08:01:39

by Daniel Hokka Zakrisson

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

Linus Torvalds wrote:
>
> On Sun, 7 May 2006, Linus Torvalds wrote:
>
>>Trond wrote an alternate patch for the actual problem which I sent
>>separately, but it would probably be good to have more safety in the slab
>>layer by default regardless.
>
>
> And here's Trond's suggested locks.c fix.

I can confirm that it does fix the two issues.

--
Daniel Hokka Zakrisson

2006-05-08 08:31:32

by Pekka Enberg

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

Hi Linus,

On 5/8/06, Linus Torvalds <[email protected]> wrote:
> Ok, I was actually really surprised that we'd ever allow a non-slab page
> to be free'd as a slab or kmalloc allocation, without screaming very
> loudly indeed. That implies a lack of some pretty fundamental sanity
> checking by default in the slab layer (I suspect slab debugging turns it
> on, but even without it, that's just nasty).
>
> Can you see if this trivial patch at least causes a honking huge
> "kernel BUG" message to be triggered quickly?

page_get_cache and page_get_slab are too late. You would need to do
the check in __cache_free; otherwise the stack pointer goes to per-CPU
caches and can be given back by kmalloc(). Adding PageSlab debugging
to __cache_free is probably too much of a performance hit, though.

Pekka

2006-05-08 08:34:31

by Pekka Enberg

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

On 5/8/06, Linus Torvalds <[email protected]> wrote:
> > Ok, I was actually really surprised that we'd ever allow a non-slab page
> > to be free'd as a slab or kmalloc allocation, without screaming very
> > loudly indeed. That implies a lack of some pretty fundamental sanity
> > checking by default in the slab layer (I suspect slab debugging turns it
> > on, but even without it, that's just nasty).
> >
> > Can you see if this trivial patch at least causes a honking huge
> > "kernel BUG" message to be triggered quickly?

On 5/8/06, Pekka Enberg <[email protected]> wrote:
> page_get_cache and page_get_slab are too late. You would need to do
> the check in __cache_free; otherwise the stack pointer goes to per-CPU
> caches and can be given back by kmalloc(). Adding PageSlab debugging
> to __cache_free is probably too much of a performance hit, though.

Btw, CONFIG_DEBUG_SLAB should catch this case, see kfree_debugcheck()
for details.

Pekka

2006-05-08 15:12:45

by Linus Torvalds

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions



On Mon, 8 May 2006, Pekka Enberg wrote:
>
> On 5/8/06, Pekka Enberg <[email protected]> wrote:
> > page_get_cache and page_get_slab are too late. You would need to do
> > the check in __cache_free; otherwise the stack pointer goes to per-CPU
> > caches and can be given back by kmalloc(). Adding PageSlab debugging
> > to __cache_free is probably too much of a performance hit, though.
>
> Btw, CONFIG_DEBUG_SLAB should catch this case, see kfree_debugcheck()
> for details.

Yeah, but CONFIG_DEBUG_SLAB is _really_ expensive.

We do have a lot of very basic debug checks (unconditionally) in the
kernel to verify various "must be true" kinds of things. It might slow
things down a bit, but in general, I think anything that helps catch
problems early tends to pay itself back very quickly. So I'm more than
happy with a simple BUG_ON() in even a hot path, if it just ends up being
compiled into a "test and branch to unlikely" and doesn't need any costly
locking etc around it.

Fedora had DEBUG_SLAB enabled in their development kernel, and that
actually helped a lot. But I suspect they may _not_ have it in their
non-development ones, and those have a much bigger test-base, so it might
well be worth it to have a good base-line that catches serious problems,
and have DEBUG_SLAB enable the expensive tests.

It's not like trying to free a non-kmalloc'ed pointer is a really strange
event. malloc/free bugs are some of the most common serious problems in
user space, and I suspect they are _less_ common in the kernel, but
still..

Linus

2006-05-08 16:06:57

by Pekka Enberg

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

On Mon, 2006-05-08 at 08:12 -0700, Linus Torvalds wrote:
> Yeah, but CONFIG_DEBUG_SLAB is _really_ expensive.
>
> We do have a lot of very basic debug checks (unconditionally) in the
> kernel to verify various "must be true" kinds of things. It might slow
> things down a bit, but in general, I think anything that helps catch
> problems early tends to pay itself back very quickly. So I'm more than
> happy with a simple BUG_ON() in even a hot path, if it just ends up being
> compiled into a "test and branch to unlikely" and doesn't need any costly
> locking etc around it.

I was under the impression that virt_to_page() is expensive, even more
so on NUMA. Do we really want this check included unconditionally in
slab free hot path?

Pekka

   text    data     bss     dec     hex filename
   9279     664      80   10023    2727 mm/slab.o (vanilla uma)
   9327     664      80   10071    2757 mm/slab.o (debug uma)
  13464    2596      24   16084    3ed4 mm/slab.o (vanilla numa)
  13492    2596      24   16112    3ef0 mm/slab.o (debug numa)

diff --git a/mm/slab.c b/mm/slab.c
index c32af7e..8ace45b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3077,6 +3077,8 @@ static inline void __cache_free(struct k
check_irq_off();
objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));

+ BUG_ON(!PageSlab(virt_to_page(objp)));
+
/* Make sure we are not freeing a object from another
* node to the array cache on this cpu.
*/


2006-05-08 16:28:47

by Linus Torvalds

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions



On Mon, 8 May 2006, Pekka Enberg wrote:
>
> I was under the impression that virt_to_page() is expensive, even more
> so on NUMA.

virt_to_page() should be pretty cheap, but you're right, that's probably
a much higher expense than the test. And reading the "struct page" can
get a cache-miss. Although - especially for NUMA - we're actually going
to do that virt_to_page() _anyway_ just a few lines below (as part of
"virt_to_slab()").

Similarly, for "kfree()", we will actually have done that same thing
already (kfree() does "virt_to_cache(objp);").

So we're actually only left with a single case that doesn't currently do
it (and didn't trigger my trivial two additions): kmem_cache_free() just
trusts the cachep pointer implicitly. And that's obviously the case that
the whole fcntl_setlease thing used.

Everybody else would have triggered from my patch which added it at a
point where it was basically free (because we had looked up the page
anyway, and we were going to read from the "struct page" info regardless).

So from a performance standpoint, maybe my previous trivial patch is the
right thing to do, along with an even _stronger_ test for
kmem_cache_free(), where we could do

BUG_ON(virt_to_cache(objp) != cachep);

which you can then remove from the slab debug case.

So for a lot of the normal paths, you'd basically have no extra cost (two
instructions, no data cache pressure), but for kmem_cache_free() we'd have
a slight overhead (but a lot lower than SLAB_DEBUG, and at least for NUMA
it's reading a cacheline that we'd be using regardless).

I think it sounds like it's worth it, but I'm not going to really push it.

Linus

2006-05-08 16:37:14

by Dave Jones

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

On Mon, May 08, 2006 at 08:12:18AM -0700, Linus Torvalds wrote:

> Fedora had DEBUG_SLAB enabled in their development kernel, and that
> actually helped a lot. But I suspect they may _not_ have it in their
> non-development ones, and those have a much bigger test-base, so it might
> well be worth it to have a good base-line that catches serious problems,
> and have DEBUG_SLAB enable the expensive tests.

That's correct. Though at times I'll build a one-off test kernel if I'm
suspicious about something, and push that out as a testing update before
the real update goes live.

Those typically don't get anywhere near the level of exposure as our
development kernels though, so they tend not to show up problems as often.

We also carry the 'check the redzones of non-free slabs every few minutes'
patch, which turned up some things once or twice, but nothing else in
quite a while.

Dave

--
http://www.codemonkey.org.uk

2006-05-08 19:36:34

by Pekka Enberg

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

On Mon, 2006-05-08 at 09:28 -0700, Linus Torvalds wrote:
> So from a performance standpoint, maybe my previous trivial patch is the
> right thing to do, along with an even _stronger_ test for
> kmem_cache_free(), where we could do
>
> BUG_ON(virt_to_cache(objp) != cachep);
>
> which you can then remove from the slab debug case.
>
> So for a lot of the normal paths, you'd basically have no extra cost (two
> instructions, no data cache pressure), but for kmem_cache_free() we'd have
> a slight overhead (but a lot lower than SLAB_DEBUG, and at least for NUMA
> it's reading a cacheline that we'd be using regardless).
>
> I think it sounds like it's worth it, but I'm not going to really push it.

Sounds good to me. Andrew?

Pekka

[PATCH] slab: verify pointers before free

From: Pekka Enberg <[email protected]>

Passing an invalid pointer to kfree() and kmem_cache_free() is likely
to cause bad memory corruption or even take down the whole system
because the bad pointer is likely reused immediately due to the
per-CPU caches. Until now, we don't do any verification for this if
CONFIG_DEBUG_SLAB is disabled.

As suggested by Linus, add a PageSlab check to page_get_cache() and
page_get_slab() to verify pointers passed to kfree(). Also, move the
stronger check from cache_free_debugcheck() to kmem_cache_free() to
ensure the passed pointer actually belongs to the cache we're about to
free the object from.

For page_get_cache() and page_get_slab(), the assertions should have
virtually no extra cost (two instructions, no data cache pressure) and
for kmem_cache_free() the overhead should be minimal.

Signed-off-by: Pekka Enberg <[email protected]>

---

mm/slab.c | 13 ++++---------
1 files changed, 4 insertions(+), 9 deletions(-)

8e4b800f3fb45bbffcc7b365115a63b2a4c911cb
diff --git a/mm/slab.c b/mm/slab.c
index c32af7e..bc9805a 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -597,6 +597,7 @@ static inline struct kmem_cache *page_ge
{
if (unlikely(PageCompound(page)))
page = (struct page *)page_private(page);
+ BUG_ON(!PageSlab(page));
return (struct kmem_cache *)page->lru.next;
}

@@ -609,6 +610,7 @@ static inline struct slab *page_get_slab
{
if (unlikely(PageCompound(page)))
page = (struct page *)page_private(page);
+ BUG_ON(!PageSlab(page));
return (struct slab *)page->lru.prev;
}

@@ -2597,15 +2599,6 @@ static void *cache_free_debugcheck(struc
kfree_debugcheck(objp);
page = virt_to_page(objp);

- if (page_get_cache(page) != cachep) {
- printk(KERN_ERR "mismatch in kmem_cache_free: expected "
- "cache %p, got %p\n",
- page_get_cache(page), cachep);
- printk(KERN_ERR "%p is %s.\n", cachep, cachep->name);
- printk(KERN_ERR "%p is %s.\n", page_get_cache(page),
- page_get_cache(page)->name);
- WARN_ON(1);
- }
slabp = page_get_slab(page);

if (cachep->flags & SLAB_RED_ZONE) {
@@ -3361,6 +3354,8 @@ void kmem_cache_free(struct kmem_cache *
{
unsigned long flags;

+ BUG_ON(virt_to_cache(objp) != cachep);
+
local_irq_save(flags);
__cache_free(cachep, objp);
local_irq_restore(flags);
--
1.3.0




2006-05-09 03:39:41

by Christoph Lameter

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

On Mon, 8 May 2006, Pekka Enberg wrote:

> > I think it sounds like it's worth it, but I'm not going to really push it.
>
> Sounds good to me. Andrew?

virt_to_page is not cheap on NUMA.

On IA64 virt_to_page is:

#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)

# define pfn_to_page(pfn) (vmem_map + (pfn))

vmem_map is not a linear map but a virtual mapping that may require
several faults to get the information.

2006-05-09 03:49:24

by Martin Bligh

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

Christoph Lameter wrote:
> On Mon, 8 May 2006, Pekka Enberg wrote:
>
>
>>>I think it sounds like it's worth it, but I'm not going to really push it.
>>
>>Sounds good to me. Andrew?
>
>
> virt_to_page is not cheap on NUMA.
>
> On IA64 virt_to_page is:
>
> #define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
>
> # define pfn_to_page(pfn) (vmem_map + (pfn))
>
> vmem_map is not a linear map but a virtual mapping that may require
> several faults to get the information.

Can't you use sparsemem instead? It solves the same problem without the
magic faulting, doesn't it?

M.

2006-05-09 05:32:17

by Christoph Lameter

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

On Mon, 8 May 2006, Martin J. Bligh wrote:

> Can't you use sparsemem instead? It solves the same problem without the
> magic faulting, doesn't it?

But sparsemem has more complex table lookups. Ultimately IA64 will move
to sparsemem (I think) but we are not there yet and we would like to be
sure that there are no performance regressions with that move.

2006-05-09 06:16:28

by Martin Bligh

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

Christoph Lameter wrote:
> On Mon, 8 May 2006, Martin J. Bligh wrote:
>
>
>>Can't you use sparsemem instead? It solves the same problem without the
>>magic faulting, doesn't it?
>
>
> But sparsemem has more complex table lookups. Ultimately IA64 will move
> to sparsemem (I think) but we are not there yet and we would like to be
> sure that there are no performance regressions with that move.

Please explain your concerns in more detail re complexity. I was under
the impression the design avoided that nicely by folding the
calculations together into a single layer.

It's been around for a long time now ... has nobody tested the
performance on ia64 yet?

M.

2006-05-09 06:24:47

by Manfred Spraul

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

Christoph Lameter wrote:

>On Mon, 8 May 2006, Pekka Enberg wrote:
>
>
>
>>>I think it sounds like it's worth it, but I'm not going to really push it.
>>>
>>>
>>Sounds good to me. Andrew?
>>
>>
>
>virt_to_page is not cheap on NUMA.
>
>
>
I know. But it was a design assumption when I wrote the slab allocator.
Actually, it's not cheap on non-NUMA either. And the PageCompound()
check adds an additional branch.

Probably the allocator should be rewritten, without relying on
virt_to_page().
Any proposals how kfree and kmem_cache_free could locate the cachep
pointer? That's the performance critical part.

Right now it's
<<<
page = virt_to_page(objp)
if (unlikely(PageCompound(page)))
page = (struct page *)page_private(page);
cachep = (struct kmem_cache *)page->lru.next;
>>>

What about:
- switch from power of two kmalloc caches to slightly smaller caches.
- change the kmalloc(PAGE_SIZE) users to get_free_page().
get_free_page() is now fast, too.
- use cachep= *((struct kmem_cache **)(objp & 0xfff))

The result would be a few small restrictions: all objects must start in
the first page of a slab (there are no exceptions on my 2.6.16 system),
and PAGE_SIZE'd caches are very expensive. Replacing the names_cache
with get_free_page is trivial. That leaves the pgd cache.

--
Manfred


2006-05-09 06:36:38

by Keith Owens

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

Manfred Spraul (on Tue, 09 May 2006 08:22:59 +0200) wrote:
>What about:
>- switch from power of two kmalloc caches to slightly smaller caches.
>- change the kmalloc(PAGE_SIZE) users to get_free_page().
>get_free_page() is now fast, too.
>- use cachep= *((struct kmem_cache **)(objp & 0xfff))

0xfff? (PAGE_SIZE - 1) surely. Not all the world is a 386.

2006-05-09 09:03:31

by Nick Piggin

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

Manfred Spraul wrote:

> Christoph Lameter wrote:
>
>> virt_to_page is not cheap on NUMA.
>>
>>
>>
> I know. But it was a design assumption when I wrote the slab allocator.
> Acutally, it's not cheap on non-NUMA either. And the PageCompound()
> check adds an additional branch.


Just FYI, the PageCompound check can go as soon as nommu stops trying to
do its broken page refcounting on slab pages (whenever that happens :P).

2006-05-09 10:26:16

by Pekka Enberg

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

On Tue, 9 May 2006, Manfred Spraul wrote:
> Probably the allocator should be rewritten, without relying on virt_to_page().
> Any proposals how kfree and kmem_cache_free could locate the cachep pointer?
> That's the performance critical part.
>
> Right now it's
> <<<
> page = virt_to_page(objp)
> if (unlikely(PageCompound(page)))
> page = (struct page *)page_private(page);
> cachep = (struct kmem_cache *)page->lru.next;
> >>>
>
> What about:
> - switch from power of two kmalloc caches to slightly smaller caches.
> - change the kmalloc(PAGE_SIZE) users to get_free_page(). get_free_page() is
> now fast, too.
> - use cachep= *((struct kmem_cache **)(objp & 0xfff))

I think you mean

static inline struct kmem_cache *slab_get_cache(const void *obj)
{
struct kmem_cache **p = (void *)((unsigned long) obj & ~(PAGE_SIZE-1));
return *p;
}

On Tue, 9 May 2006, Manfred Spraul wrote:
> The result would be a few small restrictions: all objects must start in the
> first page of a slab (there are no exceptions on my 2.6.16 system), and
> PAGE_SIZE'd caches are very expensive. Replacing the names_cache with
> get_free_page is trivial. That leaves the pgd cache.

Your plan makes sense for slabs that have slab management structures
embedded within. We already have enough free space there for one pointer
due to

colour_off += cachep->slab_size;

in the alloc_slabmgmt() function, I think. Are you planning to kill
external slab management allocation completely by switching to
get_free_pages() for those cases? I'd much rather make the switch to page
allocator under the hood so kmalloc(PAGE_SIZE*n) would still work because
it's much nicer API.

Pekka

2006-05-09 14:41:00

by Linus Torvalds

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions



On Mon, 8 May 2006, Christoph Lameter wrote:
>
> On Mon, 8 May 2006, Pekka Enberg wrote:
>
> > > I think it sounds like it's worth it, but I'm not going to really push it.
> >
> > Sounds good to me. Andrew?
>
> virt_to_page is not cheap on NUMA.

Right now the __cache_free() chain does "virt_to_page()" on NUMA
regardless, through the

#ifdef CONFIG_NUMA
	{
		struct slab *slabp;
		slabp = virt_to_slab(objp);
		...

thing. The suggested patch obviously makes it do it _twice_: once to get
the cachep, once to get the slabp. But some simple re-organization would
make it do it just once, if we passed in the "struct page *" instead of
the "struct cachep" - since in the end, every single path into the real
core of the allocator does end up needing it.

Linus

2006-05-09 18:26:53

by Manfred Spraul

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

Pekka J Enberg wrote:

>I think you mean
>
>static inline struct kmem_cache *slab_get_cache(const void *obj)
>{
> struct kmem_cache **p = (void *)((unsigned long) obj & ~(PAGE_SIZE-1));
> return *p;
>}
>
>
>
Of course.

>On Tue, 9 May 2006, Manfred Spraul wrote:
>
>
>>The result would be a few small restrictions: all objects must start in the
>>first page of a slab (there are no exceptions on my 2.6.16 system), and
>>PAGE_SIZE'd caches are very expensive. Replacing the names_cache with
>>get_free_page is trivial. That leaves the pgd cache.
>>
>>
>
>Your plan makes sense for slabs that have slab management structures
>embedded within.
>
No - it would only make sense if it could be used for all slabs.
Otherwise: How should kfree figure out if it's called for a slab with
embedded pointers or not?

> We already have enough free space there for one pointer
>due to
>
> colour_off += cachep->slab_size;
>
>in the alloc_slabmgmt() function, I think. Are you planning to kill
>external slab management allocation completely by switching to
>get_free_pages() for those cases? I'd much rather make the switch to page
>allocator under the hood so kmalloc(PAGE_SIZE*n) would still work because
>it's much nicer API.
>
>
How many kmalloc(PAGE_SIZE*n) users are there?

--
Manfred

2006-05-09 18:56:35

by Linus Torvalds

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions



On Tue, 9 May 2006, Manfred Spraul wrote:
>
> How many kmalloc(PAGE_SIZE*n) users are there?

A single PAGE_SIZE allocation is quite common. Lots of kernel structures
end up (often for historical reasons) being that size. PATH_MAX, for one.
Sometimes it's also simply because it's the one "known" size that doesn't
cause fragmentation and is easily available, so..

In other words, it's often the "canonical size" for some random buffer:
if only because it's known to be the largest possible buffer that is
always available.

Linus

2006-05-09 19:05:29

by Pekka Enberg

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

On Tue, 2006-05-09 at 20:25 +0200, Manfred Spraul wrote:
> No - it would only make sense if it could be used for all slabs.
> Otherwise: How should kfree figure out if it's called for a slab with
> embedded pointers or not?

Yeah, you're absolutely right. Which is exactly why I asked what you
want to do with the OFF_SLAB case. We certainly can support both by
virtualizing kfree() and kmem_cache_free() with struct slab_operations
type of thing, but that's probably not very performant. Well, it might
be for NUMA, as virt_to_page() is so expensive.

Anyway, embedding struct kmem_cache pointer in the beginning of slab
page is doable (I have a proof-of-concept patch for that btw) but we
need to solve OFF_SLAB first.

Pekka

2006-05-09 19:15:15

by Pekka Enberg

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

On Tue, 2006-05-09 at 20:25 +0200, Manfred Spraul wrote:
> > No - it would only make sense if it could be used for all slabs.
> > Otherwise: How should kfree figure out if it's called for a slab with
> > embedded pointers or not?

On Tue, 2006-05-09 at 22:05 +0300, Pekka Enberg wrote:
> We certainly can support both by virtualizing kfree() and
> kmem_cache_free() with struct slab_operations type of thing

Uhm, not for kfree(), obviously. I was thinking of kmem_cache_free().

Pekka

2006-05-10 00:00:17

by Christoph Lameter

Subject: Re: [PATCH] fs: fcntl_setlease defies lease_init assumptions

On Tue, 9 May 2006, Linus Torvalds wrote:

> Right now the __cache_free() chain does "virt_to_page()" on NUMA
> regardless, through the
>
> #ifdef CONFIG_NUMA
> 	{
> 		struct slab *slabp;
> 		slabp = virt_to_slab(objp);
> 		...
>
> thing. The suggested patch obviously makes it do it _twice_: once to get
> the cachep, once to get the slabp. But some simple re-organization would
> make it do it just once, if we passed in the "struct page *" instead of
> the "struct cachep" - since in the end, every single path into the real
> core of the allocator does end up needing it.

Sounds like the best approach to address this rather than another slab
redesign.