2014-01-02 23:55:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/3] Add shrink_pagecache_parent

On Mon, 30 Dec 2013 21:45:17 +0800 Li Wang <[email protected]> wrote:

> Analogous to shrink_dcache_parent except that it collects inodes.
> It is not very appropriate to be put in dcache.c, but d_walk can only
> be invoked from here.

Please cc Dave Chinner on future revisions. He be da man.

The overall intent of the patchset seems reasonable and I agree that it
can't be efficiently done from userspace with the current kernel API.
We *could* do it from userspace by providing facilities for userspace to
query the VFS caches: "is this pathname in the dentry cache" and "is
this inode in the inode cache".

> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1318,6 +1318,42 @@ void shrink_dcache_parent(struct dentry *parent)
> }
> EXPORT_SYMBOL(shrink_dcache_parent);
>
> +static enum d_walk_ret gather_inode(void *data, struct dentry *dentry)
> +{
> + struct list_head *list = data;
> + struct inode *inode = dentry->d_inode;
> +
> + if ((inode == NULL) || ((!inode_owner_or_capable(inode)) &&
> + (!capable(CAP_SYS_ADMIN))))
> + goto out;
> + spin_lock(&inode->i_lock);
> + if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||

It's unclear what rationale lies behind this particular group of tests.

> + (inode->i_mapping->nrpages == 0) ||
> + (!list_empty(&inode->i_lru))) {

arg, the "Inode locking rules" at the top of fs/inode.c needs a
refresh, I suspect. It is too vague.

Formally, inode->i_lru is protected by
i_sb->s_inode_lru->node[nid].lock, not by ->i_lock. I guess you can
just do a list_lru_add() and that will atomically add the inode to your
local list_lru if ->i_lru wasn't being used for anything else.

I *think* that your use of i_lock works OK, because code which fiddles
with i_lru and s_inode_lru also takes i_lock. However we need to
decide which is the preferred and official lock. ie: what is the
design here??

However... most inodes will be on an LRU list, won't they? Doesn't
this reuse of i_lru mean that many inodes will fail to be processed?
If so, we might need to add a new list_head to the inode, which will be
problematic.


Aside: inode_lru_isolate() fiddles directly with inode->i_lru without
taking i_sb->s_inode_lru->node[nid].lock. Why doesn't this make a
concurrent s_inode_lru walker go oops?? Should we be using
list_lru_del() in there? (which should have been called
list_lru_del_init(), sigh).


2014-01-08 02:06:54

by Li Wang

[permalink] [raw]
Subject: Re: [PATCH 2/3] Add shrink_pagecache_parent

Hi,

On 01/03/2014 07:55 AM, Andrew Morton wrote:
> On Mon, 30 Dec 2013 21:45:17 +0800 Li Wang <[email protected]> wrote:
>
>> Analogous to shrink_dcache_parent except that it collects inodes.
>> It is not very appropriate to be put in dcache.c, but d_walk can only
>> be invoked from here.
>
> Please cc Dave Chinner on future revisions. He be da man.
>
> The overall intent of the patchset seems reasonable and I agree that it
> can't be efficiently done from userspace with the current kernel API.
> We *could* do it from userspace by providing facilities for userspace to
> query the VFS caches: "is this pathname in the dentry cache" and "is
> this inode in the inode cache".
>
Even we have these available, i am afraid it will still introduce
non-negligible overhead due to frequent system calls for a directory
walking operation, especially under massive small file situations.

>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -1318,6 +1318,42 @@ void shrink_dcache_parent(struct dentry *parent)
>> }
>> EXPORT_SYMBOL(shrink_dcache_parent);
>>
>> +static enum d_walk_ret gather_inode(void *data, struct dentry *dentry)
>> +{
>> + struct list_head *list = data;
>> + struct inode *inode = dentry->d_inode;
>> +
>> + if ((inode == NULL) || ((!inode_owner_or_capable(inode)) &&
>> + (!capable(CAP_SYS_ADMIN))))
>> + goto out;
>> + spin_lock(&inode->i_lock);
>> + if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
>
> It's unclear what rationale lies behind this particular group of tests.
>
>> + (inode->i_mapping->nrpages == 0) ||
>> + (!list_empty(&inode->i_lru))) {
>
> arg, the "Inode locking rules" at the top of fs/inode.c needs a
> refresh, I suspect. It is too vague.
>
> Formally, inode->i_lru is protected by
> i_sb->s_inode_lru->node[nid].lock, not by ->i_lock. I guess you can
> just do a list_lru_add() and that will atomically add the inode to your
> local list_lru if ->i_lru wasn't being used for anything else.
>
> I *think* that your use of i_lock works OK, because code which fiddles
> with i_lru and s_inode_lru also takes i_lock. However we need to
> decide which is the preferred and official lock. ie: what is the
> design here??
>
> However... most inodes will be on an LRU list, won't they? Doesn't
> this reuse of i_lru mean that many inodes will fail to be processed?
> If so, we might need to add a new list_head to the inode, which will be
> problematic.
>
As far as I know, fix me if i am wrong, only when inode has zero
reference count, it will be put into superblock lru list. For most
situations, there is at least a dentry refers to it, so it will not
be on any lru list.

>
> Aside: inode_lru_isolate() fiddles directly with inode->i_lru without
> taking i_sb->s_inode_lru->node[nid].lock. Why doesn't this make a
> concurrent s_inode_lru walker go oops?? Should we be using
> list_lru_del() in there? (which should have been called
> list_lru_del_init(), sigh).
>
It seems inode_lru_isolate() only called by prune_icache_sb() as
a callback function. Before calling it, the caller has hold
the lock.

2014-01-08 08:01:35

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/3] Add shrink_pagecache_parent

On Thu, Jan 02, 2014 at 03:55:34PM -0800, Andrew Morton wrote:
> On Mon, 30 Dec 2013 21:45:17 +0800 Li Wang <[email protected]> wrote:
>
> > Analogous to shrink_dcache_parent except that it collects inodes.
> > It is not very appropriate to be put in dcache.c, but d_walk can only
> > be invoked from here.
>
> Please cc Dave Chinner on future revisions. He be da man.
>
> The overall intent of the patchset seems reasonable and I agree that it
> can't be efficiently done from userspace with the current kernel API.
> We *could* do it from userspace by providing facilities for userspace to
> query the VFS caches: "is this pathname in the dentry cache" and "is
> this inode in the inode cache".
>
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -1318,6 +1318,42 @@ void shrink_dcache_parent(struct dentry *parent)
> > }
> > EXPORT_SYMBOL(shrink_dcache_parent);
> >
> > +static enum d_walk_ret gather_inode(void *data, struct dentry *dentry)
> > +{
> > + struct list_head *list = data;
> > + struct inode *inode = dentry->d_inode;
> > +
> > + if ((inode == NULL) || ((!inode_owner_or_capable(inode)) &&
> > + (!capable(CAP_SYS_ADMIN))))
> > + goto out;
> > + spin_lock(&inode->i_lock);
> > + if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
>
> It's unclear what rationale lies behind this particular group of tests.
>
> > + (inode->i_mapping->nrpages == 0) ||
> > + (!list_empty(&inode->i_lru))) {
>
> arg, the "Inode locking rules" at the top of fs/inode.c needs a
> refresh, I suspect. It is too vague.

Yes, it probably does need work.

> Formally, inode->i_lru is protected by
> i_sb->s_inode_lru->node[nid].lock, not by ->i_lock. I guess you can
> just do a list_lru_add() and that will atomically add the inode to your
> local list_lru if ->i_lru wasn't being used for anything else.

There is no such thing as a "local" list_lru. If you need to put an
object on a local list, then just use a local struct list_head.
That's how we do dispose lists for the objects being removed from
the LRU...

However, the only way you can check if the i_lru is not in use is to
hold the relevant LRU lock, and that's something that should not be
directly accessed - the internal locking of the LRU is private,
subject to change and as such is only accessible in th places that
it is explicitly exposed. i.e. the ->isolate callback.

> I *think* that your use of i_lock works OK, because code which fiddles
> with i_lru and s_inode_lru also takes i_lock. However we need to
> decide which is the preferred and official lock. ie: what is the
> design here??

THe LRU lock nests inside the i_lock. The i_lock does not provide
exclusive access to i_lru if the inode is on the LRU; LRU list
manipulations can modify i_lru (e.g. removing an adjacent inode in
the LRU list) without holding i_lock....

> However... most inodes will be on an LRU list, won't they? Doesn't
> this reuse of i_lru mean that many inodes will fail to be processed?
> If so, we might need to add a new list_head to the inode, which will be
> problematic.

Yes, yes, and yes, adding a new list head to the struct inode for
such an uncommon corner case is a non-starter.

> Aside: inode_lru_isolate() fiddles directly with inode->i_lru without
> taking i_sb->s_inode_lru->node[nid].lock. Why doesn't this make a
> concurrent s_inode_lru walker go oops?? Should we be using
> list_lru_del() in there?

No, inode_lru_isoalte() is called with the lru lock held. The
specific list lock is passed as the lru_lock parameter, so it can be
droppped if a blocking operation needs to be done to prepare the
object for isolation. So, calling list_lru_del() will deadlock on
the LRU lock.

> (which should have been called list_lru_del_init(), sigh).

That implies that removing the object from the LRU without
initialising the object being removed is a valid thing to do. It's
not - the lru_list code requires that an object not on an LRU is in
an intialised state so that list_empty() checks work correctly. i.e
list_lru_del(object); list_lru_add(object); needs to work, and that
is non-negotiable. So, no need for suffixes to define different
behaviours - there can be only one...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-01-15 00:22:22

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/3] Add shrink_pagecache_parent

On Wed, Jan 08, 2014 at 10:06:31AM +0800, Li Wang wrote:
> Hi,
>
> On 01/03/2014 07:55 AM, Andrew Morton wrote:
> >On Mon, 30 Dec 2013 21:45:17 +0800 Li Wang <[email protected]> wrote:
> >
> >>Analogous to shrink_dcache_parent except that it collects inodes.
> >>It is not very appropriate to be put in dcache.c, but d_walk can only
> >>be invoked from here.
> >However... most inodes will be on an LRU list, won't they? Doesn't
> >this reuse of i_lru mean that many inodes will fail to be processed?
> >If so, we might need to add a new list_head to the inode, which will be
> >problematic.
> >
> As far as I know, fix me if i am wrong, only when inode has zero
> reference count, it will be put into superblock lru list. For most
> situations, there is at least a dentry refers to it, so it will not
> be on any lru list.

Yes, that's when they get added to the LRU, but they don't get
removed if they are referenced again by a dentry. Hence dentries can
be reclaimed, which puts the inode on it's LRU, but then the
directory is read again and a new dentry allocated to point to it.
We do not remove the inode from the LRU at this point in time.
Hence you can have referenced inodes that are on the LRU, and in
many workloads most of the referenced inodes in the system are on
the LRU....

Cheers,

Dave.
--
Dave Chinner
[email protected]