LinuxLists.cc - Re: SLUB defrag pull request?

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Mon, 13 Oct 2008, Nick Piggin wrote:
> In many cases, yes it seems to. And some of the approaches even if
> they work now seem like they *might* cause problematic constraints
> in the design... Have Al and Christoph reviewed the dentry and inode
> patches?

This d_invalidate() looks suspicious to me:

+/*
+ * Slab has dropped all the locks. Get rid of the refcount obtained
+ * earlier and also free the object.
+ */
+static void kick_dentries(struct kmem_cache *s,
+ int nr, void **v, void *private)
+{
+ struct dentry *dentry;
+ int i;
+
+ /*
+ * First invalidate the dentries without holding the dcache lock
+ */
+ for (i = 0; i < nr; i++) {
+ dentry = v[i];
+
+ if (dentry)
+ d_invalidate(dentry);
+ }

I think it's wrong to unhash dentries while they are possibly still
being used. You can do the shrink_dcache_parent() here, but should
leave the unhashing to be done by prune_one_dentry(), after it's been
checked that there are no other users of the dentry.

Miklos

2008-10-13 14:25:42

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

Miklos Szeredi wrote:
> I think it's wrong to unhash dentries while they are possibly still
> being used. You can do the shrink_dcache_parent() here, but should
> leave the unhashing to be done by prune_one_dentry(), after it's been
> checked that there are no other users of the dentry.
>
>
d_invalidate() calls shrink_dcache_parent() as needed and will fail if
there are other users of the dentry.

2008-10-13 14:27:46

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Mon, 13 Oct 2008, Miklos Szeredi wrote:
> On Mon, 13 Oct 2008, Nick Piggin wrote:
> > In many cases, yes it seems to. And some of the approaches even if
> > they work now seem like they *might* cause problematic constraints
> > in the design... Have Al and Christoph reviewed the dentry and inode
> > patches?
>
> This d_invalidate() looks suspicious to me:

And the things kick_inodes() does without any sort of locking look
even more dangerous.

It should be the other way round: first make sure nothing is
referencing the inode, and _then_ start cleaning it up with
appropriate locks held. See prune_icache().

Miklos

2008-10-13 14:29:04

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Mon, 13 Oct 2008, Christoph Lameter wrote:
> Miklos Szeredi wrote:
> > I think it's wrong to unhash dentries while they are possibly still
> > being used. You can do the shrink_dcache_parent() here, but should
> > leave the unhashing to be done by prune_one_dentry(), after it's been
> > checked that there are no other users of the dentry.
> >
> >
> d_invalidate() calls shrink_dcache_parent() as needed and will fail if
> there are other users of the dentry.

Only if it's a directory. Now unhashing an in-use non-directory is
not fatal, but you'll get things like "filename (deleted)" in /proc,
and suchlike. Don't do it.

Miklos

2008-10-13 14:36:12

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

Miklos Szeredi wrote:
> And the things kick_inodes() does without any sort of locking look
> even more dangerous.
>
> It should be the other way round: first make sure nothing is
> referencing the inode, and _then_ start cleaning it up with
> appropriate locks held. See prune_icache().
>
>
kick_inodes() only works on inodes that first have undergone
get_inodes() where we establish a refcount under inode_lock(). The final
cleanup in kick_inodes() is done under iprune_mutex. You are looking at
the loop that does writeback and invalidates attached dentries. This can
fail for various reasons.

2008-10-13 14:49:41

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Mon, 13 Oct 2008, Christoph Lameter wrote:
> Miklos Szeredi wrote:
> > And the things kick_inodes() does without any sort of locking look
> > even more dangerous.
> >
> > It should be the other way round: first make sure nothing is
> > referencing the inode, and _then_ start cleaning it up with
> > appropriate locks held. See prune_icache().
> >
> >
> kick_inodes() only works on inodes that first have undergone
> get_inodes() where we establish a refcount under inode_lock(). The final
> cleanup in kick_inodes() is done under iprune_mutex. You are looking at
> the loop that does writeback and invalidates attached dentries. This can
> fail for various reasons.

Yes, but I'm not at all sure that calling remove_inode_buffers() or
invalidate_mapping_pages() is OK on a live inode. They should be done
after checking the refcount, just like prune_icache() does.

Also, while d_invalidate() is not actually wrong here, because you
check S_ISDIR(), but it's still the wrong function to use. You really
just want to shrink the children. Invalidation means: the filesystem
found out that the cached inode is invalid, so we want to throw it
away. In the future it might actually be able to do it for
directories as well, but currently it cannot because of possible
mounts on the dentry.

Miklos

2008-10-13 15:23:30

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

And BTW the whole thing seems to be broken WRT umount. Getting a
reference to a dentry or an inode without also getting reference to a
vfsmount is going to result in "VFS: Busy inodes after unmount of
%s. Self-destruct in 5 seconds. Have a nice day...\n". And getting a
reference to the vfsmount will result in EBUSY when trying to umount,
which is also not what we want.

So it seemst that this two pass method will not work with dentries or
inodes at all :(

Miklos

2008-10-20 15:00:18

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

Miklos Szeredi wrote:
>
>>> It should be the other way round: first make sure nothing is
>>> referencing the inode, and _then_ start cleaning it up with
>>> appropriate locks held. See prune_icache().

The code was initially taken from prune_icache.

>> kick_inodes() only works on inodes that first have undergone
>> get_inodes() where we establish a refcount under inode_lock(). The final
>> cleanup in kick_inodes() is done under iprune_mutex. You are looking at
>> the loop that does writeback and invalidates attached dentries. This can
>> fail for various reasons.
>
> Yes, but I'm not at all sure that calling remove_inode_buffers() or
> invalidate_mapping_pages() is OK on a live inode. They should be done
> after checking the refcount, just like prune_icache() does.

Dont we do the same on a truncate?

> Also, while d_invalidate() is not actually wrong here, because you
> check S_ISDIR(), but it's still the wrong function to use. You really
> just want to shrink the children. Invalidation means: the filesystem
> found out that the cached inode is invalid, so we want to throw it
> away. In the future it might actually be able to do it for
> directories as well, but currently it cannot because of possible
> mounts on the dentry.

Thats the same issue as with the dentries. The new function could deal with
both situations?

2008-10-20 18:02:36

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Mon, 20 Oct 2008, Christoph Lameter wrote:
> >> kick_inodes() only works on inodes that first have undergone
> >> get_inodes() where we establish a refcount under inode_lock(). The final
> >> cleanup in kick_inodes() is done under iprune_mutex. You are looking at
> >> the loop that does writeback and invalidates attached dentries. This can
> >> fail for various reasons.
> >
> > Yes, but I'm not at all sure that calling remove_inode_buffers() or
> > invalidate_mapping_pages() is OK on a live inode. They should be done
> > after checking the refcount, just like prune_icache() does.
>
> Dont we do the same on a truncate?

Yes, with i_mutex and i_alloc_sem held.

>
> > Also, while d_invalidate() is not actually wrong here, because you
> > check S_ISDIR(), but it's still the wrong function to use. You really
> > just want to shrink the children. Invalidation means: the filesystem
> > found out that the cached inode is invalid, so we want to throw it
> > away. In the future it might actually be able to do it for
> > directories as well, but currently it cannot because of possible
> > mounts on the dentry.
>
> Thats the same issue as with the dentries. The new function could deal with
> both situations?

Sure.

The big issue is dealing with umount. You could do something like
grab_super() on sb before getting a ref on the inode/dentry. But I'm
not sure this is a good idea. There must be a simpler way to achieve
this...

Miklos

2008-10-20 18:24:04

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

Miklos Szeredi wrote:
> On Mon, 20 Oct 2008, Christoph Lameter wrote:
>>>> kick_inodes() only works on inodes that first have undergone
>>>> get_inodes() where we establish a refcount under inode_lock(). The final
>>>> cleanup in kick_inodes() is done under iprune_mutex. You are looking at
>>>> the loop that does writeback and invalidates attached dentries. This can
>>>> fail for various reasons.
>>> Yes, but I'm not at all sure that calling remove_inode_buffers() or
>>> invalidate_mapping_pages() is OK on a live inode. They should be done
>>> after checking the refcount, just like prune_icache() does.
>> Dont we do the same on a truncate?
>
> Yes, with i_mutex and i_alloc_sem held.

There is another call to invalidate_mapping_pages() in prune_icache (that is
where this code originates). No i_mutex and i_alloc. Only iprune_mutex held
and that seems to be for the protection of the list. So just checking
inode->i_count would do the trick?

>>> Also, while d_invalidate() is not actually wrong here, because you
>>> check S_ISDIR(), but it's still the wrong function to use. You really
>>> just want to shrink the children. Invalidation means: the filesystem
>>> found out that the cached inode is invalid, so we want to throw it
>>> away. In the future it might actually be able to do it for
>>> directories as well, but currently it cannot because of possible
>>> mounts on the dentry.
>> Thats the same issue as with the dentries. The new function could deal with
>> both situations?
>
> Sure.
>
> The big issue is dealing with umount. You could do something like
> grab_super() on sb before getting a ref on the inode/dentry. But I'm
> not sure this is a good idea. There must be a simpler way to achieve
> this...

Taking a lock on vfsmount_lock? But that would make dentry reclaim a pain.

We are only interested in the reclaim a dentry if its currently unused. If so
then why does unmount matter? Both unmount and reclaim will attempt to remove
the dentry.

Have a look at get_dentries(). It takes the dcache_lock and checks the dentry
state. Either the entry is ignored or dget_locked() removes it from the lru.
If its off the LRU then it can no longer be reclaimed by umount.

2008-10-20 18:41:20

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Mon, 20 Oct 2008, Christoph Lameter wrote:
> Miklos Szeredi wrote:
> >>> Yes, but I'm not at all sure that calling remove_inode_buffers() or
> >>> invalidate_mapping_pages() is OK on a live inode. They should be done
> >>> after checking the refcount, just like prune_icache() does.
> >> Dont we do the same on a truncate?
> >
> > Yes, with i_mutex and i_alloc_sem held.
>
> There is another call to invalidate_mapping_pages() in prune_icache (that is
> where this code originates). No i_mutex and i_alloc. Only iprune_mutex held
> and that seems to be for the protection of the list. So just checking
> inode->i_count would do the trick?

Yes, that's what I was saying.

> > The big issue is dealing with umount. You could do something like
> > grab_super() on sb before getting a ref on the inode/dentry. But I'm
> > not sure this is a good idea. There must be a simpler way to achieve
> > this..
>
> Taking a lock on vfsmount_lock? But that would make dentry reclaim a pain.

No, I mean simpler than having to do this two stage stuff.

> We are only interested in the reclaim a dentry if its currently unused. If so
> then why does unmount matter? Both unmount and reclaim will attempt to remove
> the dentry.
>
> Have a look at get_dentries(). It takes the dcache_lock and checks the dentry
> state. Either the entry is ignored or dget_locked() removes it from the lru.
> If its off the LRU then it can no longer be reclaimed by umount.

How is that better? You will still get busy inodes on umount.

And anyway the dentry could be put back onto the LRU by somebody else
between get_dentries() and kick_dentries(). So I don't even see how
taking the dentry off the LRU helps _anything_.

Miklos

2008-10-20 19:12:24

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

Miklos Szeredi wrote:

>> There is another call to invalidate_mapping_pages() in prune_icache (that is
>> where this code originates). No i_mutex and i_alloc. Only iprune_mutex held
>> and that seems to be for the protection of the list. So just checking
>> inode->i_count would do the trick?
>
> Yes, that's what I was saying.

Ok. Thats easy to do.

>
>>> The big issue is dealing with umount. You could do something like
>>> grab_super() on sb before getting a ref on the inode/dentry. But I'm
>>> not sure this is a good idea. There must be a simpler way to achieve
>>> this..
>> Taking a lock on vfsmount_lock? But that would make dentry reclaim a pain.
>
> No, I mean simpler than having to do this two stage stuff.

How could it be simpler? First you need to establish a secure reference to the
object so that it cannot vanish from under us. Then all the references can be
checked and possibly removed. If we do not need a secure reference then the
get_dentries() etc method can be NULL.

>> We are only interested in the reclaim a dentry if its currently unused. If so
>> then why does unmount matter? Both unmount and reclaim will attempt to remove
>> the dentry.
>>
>> Have a look at get_dentries(). It takes the dcache_lock and checks the dentry
>> state. Either the entry is ignored or dget_locked() removes it from the lru.
>> If its off the LRU then it can no longer be reclaimed by umount.
>
> How is that better? You will still get busy inodes on umount.

Those inodes are going to be freed by the reclaim code. Why would they be busy
(unless the case below occurs of course).

> And anyway the dentry could be put back onto the LRU by somebody else
> between get_dentries() and kick_dentries(). So I don't even see how
> taking the dentry off the LRU helps _anything_.

get_dentries() gets a reference. dput will not put the dentry back onto the LRU.

2008-10-20 19:28:49

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Mon, 20 Oct 2008, Christoph Lameter wrote:
> >>> The big issue is dealing with umount. You could do something like
> >>> grab_super() on sb before getting a ref on the inode/dentry. But I'm
> >>> not sure this is a good idea. There must be a simpler way to achieve
> >>> this..
> >> Taking a lock on vfsmount_lock? But that would make dentry reclaim a pain.
> >
> > No, I mean simpler than having to do this two stage stuff.
>
> How could it be simpler? First you need to establish a secure
> reference to the object so that it cannot vanish from under us. Then
> all the references can be checked and possibly removed. If we do not
> need a secure reference then the get_dentries() etc method can be
> NULL.

So, isn't it possible to do without get_dentries()? What's the
fundamental difference between this and regular cache shrinking?

> Those inodes are going to be freed by the reclaim code. Why would
> they be busy (unless the case below occurs of course).

Case below was brainfart, please ignore. But that doesn't really
help: the VFS assumes that you cannot umount while there are busy
dentries/inodes. Usually it works this way: VFS first gets vfsmount
ref, then gets dentry ref, and releases them in the opposite order.
And umount is not allowed if vfsmount has a non-zero refcount (it's a
bit more complicated, but the essense is the same).

The current SLUB defrag violates this: it gets dentry or inode ref
without getting a ref on the vfsmount or the super block as well.
This means that the umount will succeed (that's OK), but also the
super block will be going away and that's bad. See
generic_shutdown_super().

> > And anyway the dentry could be put back onto the LRU by somebody else
> > between get_dentries() and kick_dentries(). So I don't even see how
> > taking the dentry off the LRU helps _anything_.
>
> get_dentries() gets a reference. dput will not put the dentry back
> onto the LRU.

Right.

Miklos

2008-10-20 19:54:26

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

Miklos Szeredi wrote:
> So, isn't it possible to do without get_dentries()? What's the
> fundamental difference between this and regular cache shrinking?

The fundamental difference is that slab defrag operates on sparsely populated
dentries. It comes into effect when the density of dentries per page is low
and lots of memory is wasted. It defragments by kicking out dentries in low
density pages. These can then be reclaimed.

> Case below was brainfart, please ignore. But that doesn't really
> help: the VFS assumes that you cannot umount while there are busy
> dentries/inodes. Usually it works this way: VFS first gets vfsmount
> ref, then gets dentry ref, and releases them in the opposite order.
> And umount is not allowed if vfsmount has a non-zero refcount (it's a
> bit more complicated, but the essense is the same).

The dentries that we get a ref on are candidates for removal. Their lifetime
is limited. Unmounting while we are trying to remove dentries/inodes results
in two mechanisms removing dentries/inodes.

If we have obtained a reference then invalidate_list() will return the number
of busy inodes which would trigger the printk in generic_shutdown_super(). But
these are inodes currently being reclaimed by slab defrag. Just waiting a bit
would remedy the situation.

We would need some way to make generic_shutdown_super() wait until slab defrag
is finished.

2008-10-20 20:50:59

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Mon, 20 Oct 2008, Christoph Lameter wrote:
> Miklos Szeredi wrote:
> > So, isn't it possible to do without get_dentries()? What's the
> > fundamental difference between this and regular cache shrinking?
>
> The fundamental difference is that slab defrag operates on sparsely
> populated dentries. It comes into effect when the density of
> dentries per page is low and lots of memory is wasted. It
> defragments by kicking out dentries in low density pages. These can
> then be reclaimed.

OK, but why can't this be done in just one stage?

AFAICS the problem is exactly the same as generic shrinking, except it
wants to evict dentries selectively: only ones which are in very
sparse slabs.

So is the problem selecting these dentries? Would it be too expensive
to do it the same as normal cache shrinking and walk the lru, but only
evict the ones which are tagged as being in a sparse page?

> The dentries that we get a ref on are candidates for removal. Their
> lifetime is limited. Unmounting while we are trying to remove
> dentries/inodes results in two mechanisms removing dentries/inodes.
>
> If we have obtained a reference then invalidate_list() will return
> the number of busy inodes which would trigger the printk in
> generic_shutdown_super(). But these are inodes currently being
> reclaimed by slab defrag. Just waiting a bit would remedy the
> situation.

I guess so, but that's just a hack to work around the problem, and
creates more interdependencies between VFS and the allocator with
unforseeable consequences.

Miklos

2008-10-20 23:04:42

by Dave Chinner

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Mon, Oct 20, 2008 at 02:53:40PM -0500, Christoph Lameter wrote:
> Miklos Szeredi wrote:
> > Case below was brainfart, please ignore. But that doesn't really
> > help: the VFS assumes that you cannot umount while there are busy
> > dentries/inodes. Usually it works this way: VFS first gets vfsmount
> > ref, then gets dentry ref, and releases them in the opposite order.
> > And umount is not allowed if vfsmount has a non-zero refcount (it's a
> > bit more complicated, but the essense is the same).
>
> The dentries that we get a ref on are candidates for removal. Their lifetime
> is limited. Unmounting while we are trying to remove dentries/inodes results
> in two mechanisms removing dentries/inodes.
>
> If we have obtained a reference then invalidate_list() will return the number
> of busy inodes which would trigger the printk in generic_shutdown_super(). But
> these are inodes currently being reclaimed by slab defrag. Just waiting a bit
> would remedy the situation.
>
> We would need some way to make generic_shutdown_super() wait until slab defrag
> is finished.

Seems to me that prune_dcache() handles this case by holding the sb->s_umount
semaphore while pruning. The same logic applies here, right?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-10-21 23:18:19

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

Miklos Szeredi wrote:
> On Mon, 20 Oct 2008, Christoph Lameter wrote:
>> Miklos Szeredi wrote:
>>> So, isn't it possible to do without get_dentries()? What's the
>>> fundamental difference between this and regular cache shrinking?
>> The fundamental difference is that slab defrag operates on sparsely
>> populated dentries. It comes into effect when the density of
>> dentries per page is low and lots of memory is wasted. It
>> defragments by kicking out dentries in low density pages. These can
>> then be reclaimed.
>
> OK, but why can't this be done in just one stage?

The only way that a secure reference can be established is if the slab page is
locked. That requires a spinlock. The slab allocator calls the get() functions
while the slab lock guarantees object existence. Then locks are dropped and
reclaim actions can start with the guarantee that the slab object will not
suddenly vanish.

2008-10-22 07:11:10

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Tue, 21 Oct 2008, Christoph Lameter wrote:
> The only way that a secure reference can be established is if the
> slab page is locked. That requires a spinlock. The slab allocator
> calls the get() functions while the slab lock guarantees object
> existence. Then locks are dropped and reclaim actions can start with
> the guarantee that the slab object will not suddenly vanish.

Yes, you've made up your mind, that you want to do it this way. But
it's the _wrong_ way, this "want to get a secure reference for use
later" leads to madness when applied to dentries or inodes. Try for a
minute to think outside this template.

For example dcache_lock will protect against dentries moving to/from
d_lru. So you can do this:

take dcache_lock
check if d_lru is non-empty
take sb->s_umount
free dentry
release sb->s_umount
release dcache_lock

Yeah, locking will be more complicated in reality. Still, much less
complicated than trying to do the same across two separate phases.

Why can't something like that work?

Miklos

2008-10-22 15:53:12

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Wed, 22 Oct 2008, Miklos Szeredi wrote:

> On Tue, 21 Oct 2008, Christoph Lameter wrote:
>> The only way that a secure reference can be established is if the
>> slab page is locked. That requires a spinlock. The slab allocator
>> calls the get() functions while the slab lock guarantees object
>> existence. Then locks are dropped and reclaim actions can start with
>> the guarantee that the slab object will not suddenly vanish.
>
> Yes, you've made up your mind, that you want to do it this way. But
> it's the _wrong_ way, this "want to get a secure reference for use
> later" leads to madness when applied to dentries or inodes. Try for a
> minute to think outside this template.
>
> For example dcache_lock will protect against dentries moving to/from
> d_lru. So you can do this:
>
> take dcache_lock
> check if d_lru is non-empty

The dentry could have been freed even before we take the dcache_lock. We
cannot access d_lru without a stable reference to the dentry.

> take sb->s_umount
> free dentry
> release sb->s_umount
> release dcache_lock
>
> Yeah, locking will be more complicated in reality. Still, much less
> complicated than trying to do the same across two separate phases.
>
> Why can't something like that work?

Because the slab starts out with a series of objects left in a slab. It
needs to do build a list of objects etc in a way that is independent as
possible from the user of the slab page. It does that by locking the slab
page so that free operations stall until the reference has been
established. If it would not be shutting off frees then the objects could
vanish under us.

We could also avoid frees by calling some cache specific method that locks
out frees before and after. But then frees would stall everywhere and
every slab cache would have to check a global lock before freeing objects
(there would be numerous complications with RCU free etc etc).

Slab defrag only stops frees on a particular slab page.

The slab defrag approach also allows the slab cache (dentry or inodes
here) to do something else than free the object. It would be possible f.e.
to move the object by allocating a new entry and moving the information to
the new dentry. That would actually be better since it would preserve the
objects and just move them into the same slab page.

2008-10-22 19:46:38

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Wed, 22 Oct 2008, Christoph Lameter wrote:
> On Wed, 22 Oct 2008, Miklos Szeredi wrote:
>
> > On Tue, 21 Oct 2008, Christoph Lameter wrote:
> >> The only way that a secure reference can be established is if the
> >> slab page is locked. That requires a spinlock. The slab allocator
> >> calls the get() functions while the slab lock guarantees object
> >> existence. Then locks are dropped and reclaim actions can start with
> >> the guarantee that the slab object will not suddenly vanish.
> >
> > Yes, you've made up your mind, that you want to do it this way. But
> > it's the _wrong_ way, this "want to get a secure reference for use
> > later" leads to madness when applied to dentries or inodes. Try for a
> > minute to think outside this template.
> >
> > For example dcache_lock will protect against dentries moving to/from
> > d_lru. So you can do this:
> >
> > take dcache_lock
> > check if d_lru is non-empty
>
> The dentry could have been freed even before we take the dcache_lock. We
> cannot access d_lru without a stable reference to the dentry.

Why? The kmem_cache_free() doesn't touch the contents of the object,
does it?

> > take sb->s_umount
> > free dentry
> > release sb->s_umount
> > release dcache_lock
> >
> > Yeah, locking will be more complicated in reality. Still, much less
> > complicated than trying to do the same across two separate phases.
> >
> > Why can't something like that work?
>
> Because the slab starts out with a series of objects left in a slab. It
> needs to do build a list of objects etc in a way that is independent as
> possible from the user of the slab page. It does that by locking the slab
> page so that free operations stall until the reference has been
> established. If it would not be shutting off frees then the objects could
> vanish under us.

It doesn't matter. All we care about is that the dentry is on the
lru: it's cached but unused. Every other state (being created,
active, being freed, freed) is uninteresting.

> We could also avoid frees by calling some cache specific method that locks
> out frees before and after. But then frees would stall everywhere and
> every slab cache would have to check a global lock before freeing objects
> (there would be numerous complications with RCU free etc etc).
>
> Slab defrag only stops frees on a particular slab page.
>
> The slab defrag approach also allows the slab cache (dentry or inodes
> here) to do something else than free the object. It would be possible f.e.
> to move the object by allocating a new entry and moving the information to
> the new dentry. That would actually be better since it would preserve the
> objects and just move them into the same slab page.

Sure, and all that is possible without doing this messy 2 phase thing.
Unless I'm still missing something obvious...

Miklos

2008-10-22 19:55:34

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Wed, 22 Oct 2008, Miklos Szeredi wrote:

> Why? The kmem_cache_free() doesn't touch the contents of the object,
> does it?

Because filesystem code may be running on other processors which may be
freeing the dentry.

>> Because the slab starts out with a series of objects left in a slab. It
>> needs to do build a list of objects etc in a way that is independent as
>> possible from the user of the slab page. It does that by locking the slab
>> page so that free operations stall until the reference has been
>> established. If it would not be shutting off frees then the objects could
>> vanish under us.
>
> It doesn't matter. All we care about is that the dentry is on the
> lru: it's cached but unused. Every other state (being created,
> active, being freed, freed) is uninteresting.

We cannot figure out that it is on the lru if we do not have a stable
reference to the object.

> Sure, and all that is possible without doing this messy 2 phase thing.
> Unless I'm still missing something obvious...

Obviously one cannot free or handle an object that may be concurrently
freed on another processor.

2008-10-22 20:13:25

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Wed, 22 Oct 2008, Christoph Lameter wrote:
> On Wed, 22 Oct 2008, Miklos Szeredi wrote:
>
> > Why? The kmem_cache_free() doesn't touch the contents of the object,
> > does it?
>
> Because filesystem code may be running on other processors which may be
> freeing the dentry.

You are not actually listening to what I'm saying. Please read the
question carefully again.

Thanks,
Miklos

2008-10-22 20:21:22

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Wed, 22 Oct 2008, Miklos Szeredi wrote:

> You are not actually listening to what I'm saying. Please read the
> question carefully again.

That is the impression that I got from you too. I have listed the options
to get a reliable reference to an object and you seem to just skip over
it.

2008-10-22 20:26:26

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Wed, 22 Oct 2008, Christoph Lameter wrote:
> On Wed, 22 Oct 2008, Miklos Szeredi wrote:
>
> > You are not actually listening to what I'm saying. Please read the
> > question carefully again.
>
> That is the impression that I got from you too. I have listed the options
> to get a reliable reference to an object and you seem to just skip over
> it.

Because you don't _need_ a reliable reference to access the contents
of the dentry. The dentry is still there after being freed, as long
as the underlying slab is there and isn't being reused for some other
purpose. But you can easily ensure that from the slab code.

Hmm?

Miklos

2008-10-22 20:49:15

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Wed, Oct 22, 2008 at 11:26 PM, Miklos Szeredi <[email protected]> wrote:
> Because you don't _need_ a reliable reference to access the contents
> of the dentry. The dentry is still there after being freed, as long
> as the underlying slab is there and isn't being reused for some other
> purpose. But you can easily ensure that from the slab code.
>
> Hmm?

Actually, when debugging is enabled, it's customary to poison the
object, for example (see free_debug_processing() in mm/slub.c). So we
really can't "easily ensure" that in the allocator unless we by-pass
all the current debugging code.

2008-10-22 21:01:18

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Wed, 22 Oct 2008, Miklos Szeredi wrote:

>> That is the impression that I got from you too. I have listed the options
>> to get a reliable reference to an object and you seem to just skip over
>> it.
>
> Because you don't _need_ a reliable reference to access the contents
> of the dentry. The dentry is still there after being freed, as long
> as the underlying slab is there and isn't being reused for some other
> purpose. But you can easily ensure that from the slab code.

With the two callbacks that I described that would take the global
lock? That was already discussed before. Please read! It does not scale
and the lock would have to be acquired before objects in a slab page are
scanned and handled in any way.

Without that locking any other processor can go into reclaim and start
evicting the dentries that we are operating upon.

Freeing in the slab sense means that a kfree ran to get rid of the
object.

2008-10-22 21:03:48

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Wed, 22 Oct 2008, Pekka Enberg wrote:

> Actually, when debugging is enabled, it's customary to poison the
> object, for example (see free_debug_processing() in mm/slub.c). So we
> really can't "easily ensure" that in the allocator unless we by-pass
> all the current debugging code.

We may be talking of different frees here. Maybe what he means by freeing
is that the object was put on the lru? And we understand a kfree().

2008-10-22 21:05:51

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Wed, 22 Oct 2008, Pekka Enberg wrote:
> On Wed, Oct 22, 2008 at 11:26 PM, Miklos Szeredi <[email protected]> wrote:
> > Because you don't _need_ a reliable reference to access the contents
> > of the dentry. The dentry is still there after being freed, as long
> > as the underlying slab is there and isn't being reused for some other
> > purpose. But you can easily ensure that from the slab code.
> >
> > Hmm?
>
> Actually, when debugging is enabled, it's customary to poison the
> object, for example (see free_debug_processing() in mm/slub.c). So we
> really can't "easily ensure" that in the allocator unless we by-pass
> all the current debugging code.

Thank you, that does actually answer my question. I would still think
it's a good sacrifice to no let the dentries be poisoned for the sake
of a simpler dentry defragmenter.

Miklos

2008-10-22 21:12:21

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

Hi Miklos,

On Thu, Oct 23, 2008 at 12:04 AM, Miklos Szeredi <[email protected]> wrote:
>> Actually, when debugging is enabled, it's customary to poison the
>> object, for example (see free_debug_processing() in mm/slub.c). So we
>> really can't "easily ensure" that in the allocator unless we by-pass
>> all the current debugging code.
>
> Thank you, that does actually answer my question. I would still think
> it's a good sacrifice to no let the dentries be poisoned for the sake
> of a simpler dentry defragmenter.

To be honest, I haven't paid enough attention to the discussion to see
how much simpler it would be. But I don't like the idea of forcibly
disabling debugging for slab caches because of a new core feature in
the allocator. Keep in mind that it's not just dentries we're talking
about here, we're defragmenting inodes as well.

Pekka

2008-10-22 21:30:40

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Wed, 22 Oct 2008, Miklos Szeredi wrote:

>> Actually, when debugging is enabled, it's customary to poison the
>> object, for example (see free_debug_processing() in mm/slub.c). So we
>> really can't "easily ensure" that in the allocator unless we by-pass
>> all the current debugging code.

Plus the allocator may be reusing parts of the freed object for a freelist
etc even if the object is not poisoned.

> Thank you, that does actually answer my question. I would still think
> it's a good sacrifice to no let the dentries be poisoned for the sake
> of a simpler dentry defragmenter.

You can simplify defrag by not doing anything in the get() method. That
means some of the objects passed to the kick() method may be already have
been freed in the interim.

The kick method then must be able to determine if the object has already
been freed (or is undergoing freeing) by inspecting the object contents
(allocations are held off until kick() is complete). It then needs to free
only the objects that are still allocated.

That way you could get to a one stage system.... If the dentry code can
give us that then the approach would become much simpler.

2008-10-22 22:11:06

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Wed, 22 Oct 2008, Christoph Lameter wrote:
> On Wed, 22 Oct 2008, Miklos Szeredi wrote:
>
> >> Actually, when debugging is enabled, it's customary to poison the
> >> object, for example (see free_debug_processing() in mm/slub.c). So we
> >> really can't "easily ensure" that in the allocator unless we by-pass
> >> all the current debugging code.
>
> Plus the allocator may be reusing parts of the freed object for a freelist
> etc even if the object is not poisoned.

Actually, no: looking at the slub code it already makes sure that
objects are neither poisoned, nor touched in any way _if_ there is a
constructor for the object. And for good reason too, otherwise a
reused object would contain rubbish after a second allocation.

Come on guys, you should be the experts in this thing!

So again, just checking d_lru should do work fine. There's absolutely
no need to mess with extra references in a separate phase, which leads
to lots of complications.

Miklos

2008-10-22 23:21:25

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Thu, 23 Oct 2008, Miklos Szeredi wrote:

> So again, just checking d_lru should do work fine. There's absolutely
> no need to mess with extra references in a separate phase, which leads
> to lots of complications.

Then try it the way I outlined it by skipping the get() stage. You just
need to add checks for the poison in case debugging is on and then you
should be fine.

2008-10-23 07:10:25

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

Hi Miklos,

On Thu, 2008-10-23 at 00:10 +0200, Miklos Szeredi wrote:
> Actually, no: looking at the slub code it already makes sure that
> objects are neither poisoned, nor touched in any way _if_ there is a
> constructor for the object. And for good reason too, otherwise a
> reused object would contain rubbish after a second allocation.

There's no inherent reason why we cannot poison slab caches with a
constructor. As a matter of fact SLAB does it which is probably why I
got confused here. The only thing that needs to disable slab poisoning
by design is SLAB_DESTROY_BY_RCU.

But for SLUB, you're obviously right.

On Thu, 2008-10-23 at 00:10 +0200, Miklos Szeredi wrote:
> Come on guys, you should be the experts in this thing!

Yeah, I know. Yet you're stuck with us. That's sad.

Pekka

2008-10-23 08:40:00

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Thu, 23 Oct 2008, Pekka Enberg wrote:
> On Thu, 2008-10-23 at 00:10 +0200, Miklos Szeredi wrote:
> > Actually, no: looking at the slub code it already makes sure that
> > objects are neither poisoned, nor touched in any way _if_ there is a
> > constructor for the object. And for good reason too, otherwise a
> > reused object would contain rubbish after a second allocation.
>
> There's no inherent reason why we cannot poison slab caches with a
> constructor.

Right, it just needs to call the constructor for every allocation.

> > Come on guys, you should be the experts in this thing!
>
> Yeah, I know. Yet you're stuck with us. That's sad.

No, I was a bit rude, sorry.

I think the _real_ problem is that instead of fancy features like this
defragmenter, SLUB should first concentrate on getting the code solid
enough to replace the other allocators.

Miklos

2008-10-23 13:42:58

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Thu, 23 Oct 2008, Miklos Szeredi wrote:

> I think the _real_ problem is that instead of fancy features like this
> defragmenter, SLUB should first concentrate on getting the code solid
> enough to replace the other allocators.

Solid? What is not solid? The SLUB design was made in part because of the
defrag problems that were not easy to solve with SLAB. The ability to lock
down a slab allows stabilizing objects. We discussed solutions to the
fragmentation problem for years and did not get anywhere with SLAB.

2008-10-23 13:59:23

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Thu, Oct 23, 2008 at 4:40 PM, Christoph Lameter
<[email protected]> wrote:
> Solid? What is not solid? The SLUB design was made in part because of the
> defrag problems that were not easy to solve with SLAB. The ability to lock
> down a slab allows stabilizing objects. We discussed solutions to the
> fragmentation problem for years and did not get anywhere with SLAB.

I'd assume he's talking about the Intel-reported regression that's yet
to be resolved.

2008-10-23 14:10:22

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Thu, 23 Oct 2008, Pekka Enberg wrote:

> On Thu, Oct 23, 2008 at 4:40 PM, Christoph Lameter
> <[email protected]> wrote:
>> Solid? What is not solid? The SLUB design was made in part because of the
>> defrag problems that were not easy to solve with SLAB. The ability to lock
>> down a slab allows stabilizing objects. We discussed solutions to the
>> fragmentation problem for years and did not get anywhere with SLAB.
>
> I'd assume he's talking about the Intel-reported regression that's yet
> to be resolved.

On that subject:

Got a draft of a patch here that does freelist handling differently.
Instead of building linked lists it uses free objects to build arrays of
pointers to free objects. That improves cache cold free behavior since the
object contents itself does not have to be touched on free.

The problem looks like its freeing objects on a different processor that
where it was used last. With the pointer array it is only necessary to
touch the objects that contain the arrays.

2008-10-23 14:15:09

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Thu, Oct 23, 2008 at 5:09 PM, Christoph Lameter
<[email protected]> wrote:
> Got a draft of a patch here that does freelist handling differently. Instead
> of building linked lists it uses free objects to build arrays of pointers to
> free objects. That improves cache cold free behavior since the object
> contents itself does not have to be touched on free.
>
> The problem looks like its freeing objects on a different processor that
> where it was used last. With the pointer array it is only necessary to touch
> the objects that contain the arrays.

Interesting. SLAB gets away with this because of per-cpu caches or
because it uses the bufctls instead of a freelist?

2008-10-23 14:26:28

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Thu, 23 Oct 2008, Pekka Enberg wrote:

>> The problem looks like its freeing objects on a different processor that
>> where it was used last. With the pointer array it is only necessary to touch
>> the objects that contain the arrays.
>
> Interesting. SLAB gets away with this because of per-cpu caches or
> because it uses the bufctls instead of a freelist?

Exactly. Slab adds a special management structure to each slab page that
contains the freelist and other stuff. Freeing first occurs to a per cpu
queue that contains an array of pointers. Then later the objects are moved
from the pointer array into the management structure for the slab.

What we could do for SLUB is to generate a linked list of pointer arrays
in the free objects of a slab page. If all objects are allocated then no
pointer array is needed. The first object freed would become the first
pointer array. If that is found to be exhausted then the object currently
being freed is becoming the next pointer array and we put a link to the
old one into the object as well.

2008-10-23 15:17:36

by Eric Dumazet

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

Christoph Lameter a ?crit :
> On Thu, 23 Oct 2008, Pekka Enberg wrote:
>
>>> The problem looks like its freeing objects on a different processor that
>>> where it was used last. With the pointer array it is only necessary
>>> to touch
>>> the objects that contain the arrays.
>>
>> Interesting. SLAB gets away with this because of per-cpu caches or
>> because it uses the bufctls instead of a freelist?
>
> Exactly. Slab adds a special management structure to each slab page that
> contains the freelist and other stuff. Freeing first occurs to a per cpu
> queue that contains an array of pointers. Then later the objects are
> moved from the pointer array into the management structure for the slab.
>
> What we could do for SLUB is to generate a linked list of pointer arrays
> in the free objects of a slab page. If all objects are allocated then no
> pointer array is needed. The first object freed would become the first
> pointer array. If that is found to be exhausted then the object
> currently being freed is becoming the next pointer array and we put a
> link to the old one into the object as well.
>

This idea is very nice, especially considering that many objects are freed
by RCU, and their rcu_head (which is hot at kfree() time), might be far
away the linked list anchor actually used in SLUB.

At alloc time, I remember I added a prefetchw() call in SLAB in __cache_alloc(),
this could explain some differences between SLUB and SLAB too, since SLAB
gives a hint to processor to warm its cache.

2008-10-23 15:41:00

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Thu, 23 Oct 2008, Eric Dumazet wrote:

> At alloc time, I remember I added a prefetchw() call in SLAB in
> __cache_alloc(),
> this could explain some differences between SLUB and SLAB too, since SLAB
> gives a hint to processor to warm its cache.

SLUB touches objects by default when allocating. And it does it
immediately in slab_alloc() in order to retrieve the pointer to the next
object. So there is no point of hinting there right now.

If we go to the pointer arrays then the situation is similar to SLAB where
the object is not touched by the allocator. Then the hint would be useful
again.

2008-10-23 16:39:31

by Eric Dumazet

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

Christoph Lameter a ?crit :
> On Thu, 23 Oct 2008, Eric Dumazet wrote:
>
>> At alloc time, I remember I added a prefetchw() call in SLAB in
>> __cache_alloc(),
>> this could explain some differences between SLUB and SLAB too, since SLAB
>> gives a hint to processor to warm its cache.
>
> SLUB touches objects by default when allocating. And it does it
> immediately in slab_alloc() in order to retrieve the pointer to the next
> object. So there is no point of hinting there right now.
>

Please note SLUB touches by reading object.

prefetchw() gives a hint to cpu saying this cache line is going to be *modified*, even
if first access is a read. Some architectures can save some bus transactions, acquiring
the cache line in an exclusive way instead of shared one.

> If we go to the pointer arrays then the situation is similar to SLAB
> where the object is not touched by the allocator. Then the hint would be
> useful again.

It is usefull right now for ((SLAB_DESTROY_BY_RCU | SLAB_POISON) or ctor caches.

Probably not that important because many objects are very large anyway, and a prefetchw()
of the begining of object is partial.

2008-10-23 16:48:42

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Thu, 23 Oct 2008, Eric Dumazet wrote:

>> SLUB touches objects by default when allocating. And it does it immediately
>> in slab_alloc() in order to retrieve the pointer to the next object. So
>> there is no point of hinting there right now.
>>
>
> Please note SLUB touches by reading object.
>
> prefetchw() gives a hint to cpu saying this cache line is going to be
> *modified*, even
> if first access is a read. Some architectures can save some bus transactions,
> acquiring
> the cache line in an exclusive way instead of shared one.

Most architectures actually can do that. Its probably worth to run some
tests with that. Conversion of a cacheline from shared to exclusive can
cost something.

>> If we go to the pointer arrays then the situation is similar to SLAB where
>> the object is not touched by the allocator. Then the hint would be useful
>> again.
>
> It is usefull right now for ((SLAB_DESTROY_BY_RCU | SLAB_POISON) or ctor
> caches.

Correct.

> Probably not that important because many objects are very large anyway, and a
> prefetchw()
> of the begining of object is partial.

Right.

2008-10-23 17:18:28

by Eric Dumazet

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

diff --git a/mm/slub.c b/mm/slub.c
index 0c83e6a..c2017a3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1592,13 +1592,14 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,

local_irq_save(flags);
c = get_cpu_slab(s, smp_processor_id());
+ object = c->freelist;
+ prefetchw(object);
objsize = c->objsize;
- if (unlikely(!c->freelist || !node_match(c, node)))
+ if (unlikely(!object || !node_match(c, node)))

object = __slab_alloc(s, gfpflags, node, addr, c);

else {
- object = c->freelist;
c->freelist = object[c->offset];
stat(c, ALLOC_FASTPATH);
}

Attachments:

slub.patch (585.00 B)

2008-10-28 11:06:36

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Thu, 2008-10-23 at 19:14 +0200, Eric Dumazet wrote:
> [PATCH] slub: slab_alloc() can use prefetchw()
>
> Most kmalloced() areas are initialized/written right after allocation.
>
> prefetchw() gives a hint to cpu saying this cache line is going to be
> *modified*, even if first access is a read.
>
> Some architectures can save some bus transactions, acquiring
> the cache line in an exclusive way instead of shared one.
>
> Same optimization was done in 2005 on SLAB in commit
> 34342e863c3143640c031760140d640a06c6a5f8
> ([PATCH] mm/slab.c: prefetchw the start of new allocated objects)
>
> Signed-off-by: Eric Dumazet <[email protected]>

Christoph, I was sort of expecting a NAK/ACK from you before merging
this. I would be nice to have numbers on this but then again I don't see
how this can hurt either.

Pekka

2008-10-28 11:20:26

by Nick Piggin

[permalink] [raw]

Subject: Re: SLUB defrag pull request?

On Tuesday 28 October 2008 22:06, Pekka Enberg wrote:
> On Thu, 2008-10-23 at 19:14 +0200, Eric Dumazet wrote:
> > [PATCH] slub: slab_alloc() can use prefetchw()
> >
> > Most kmalloced() areas are initialized/written right after allocation.
> >
> > prefetchw() gives a hint to cpu saying this cache line is going to be
> > *modified*, even if first access is a read.
> >
> > Some architectures can save some bus transactions, acquiring
> > the cache line in an exclusive way instead of shared one.
> >
> > Same optimization was done in 2005 on SLAB in commit
> > 34342e863c3143640c031760140d640a06c6a5f8
> > ([PATCH] mm/slab.c: prefetchw the start of new allocated objects)
> >
> > Signed-off-by: Eric Dumazet <[email protected]>
>
> Christoph, I was sort of expecting a NAK/ACK from you before merging
> this. I would be nice to have numbers on this but then again I don't see
> how this can hurt either.

I've seen explicit prefetches hurt quite surprising amount if they're
not placed in appropriate places (which includes putting them in
places where the object is already in cache, or the processor is in a
good position to have speculatively initiated the operation anyway).

I'm not saying it's going to be the case here, but it can be really
hard to actually tell if it is worthwhile, IMO. For example, some
nice CPU local workloads that are often fitting within cache, might
have the object already in cache 99.x% of the time here. prefetch may
easily slow things down.

2008-10-30 15:47:23