by Fox Chen

[permalink] [raw]

Subject: Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

On Sun, Dec 20, 2020 at 7:52 AM Ian Kent <[email protected]> wrote:
>
> On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote:
> > Hello,
> >
> > On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> > > And looking further I see there's a race that kernfs can't do
> > > anything
> > > about between kernfs_refresh_inode() and fs/inode.c:update_times().
> >
> > Do kernfs files end up calling into that path tho? Doesn't look like
> > it to
> > me but if so yeah we'd need to override the update_time for kernfs.
>
> Sorry, the below was very hastily done and not what I would actually
> propose.
>
> The main point of it was the question
>
> + /* Which kernfs node attributes should be updated from
> + * time?
> + */
>
> but looking at it again this morning I think the node iattr fields
> that might need to be updated would be atime, ctime and mtime only,
> maybe not ctime ... not sure.
>
> What do you think?
>
> Also, if kn->attr == NULL it should fall back to what the VFS
> currently does.
>
> The update_times() function is one of the few places where the
> VFS updates the inode times.
>
> The idea is that the reason kernfs needs to overwrite the inode
> attributes is to reset what the VFS might have done but if kernfs
> has this inode operation they won't need to be overwritten since
> they won't have changed.
>
> There may be other places where the attributes (or an attribute)
> are set by the VFS, I haven't finished checking that yet so my
> suggestion might not be entirely valid.
>
> What I need to do is work out what kernfs node attributes, if any,
> should be updated by .update_times(). If I go by what
> kernfs_refresh_inode() does now then that would be none but shouldn't
> atime at least be updated in the node iattr.
>
> > > +static int kernfs_iop_update_time(struct inode *inode, struct
> > > timespec64 *time, int flags)
> > > {
> > > - struct inode *inode = d_inode(path->dentry);
> > > struct kernfs_node *kn = inode->i_private;
> > > + struct kernfs_iattrs *attrs;
> > >
> > > mutex_lock(&kernfs_mutex);
> > > + attrs = kernfs_iattrs(kn);
> > > + if (!attrs) {
> > > + mutex_unlock(&kernfs_mutex);
> > > + return -ENOMEM;
> > > + }
> > > +
> > > + /* Which kernfs node attributes should be updated from
> > > + * time?
> > > + */
> > > +
> > > kernfs_refresh_inode(kn, inode);
> > > mutex_unlock(&kernfs_mutex);
> >
> > I don't see how this would reflect the changes from kernfs_setattr()
> > into
> > the attached inode. This would actually make the attr updates
> > obviously racy
> > - the userland visible attrs would be stale until the inode gets
> > reclaimed
> > and then when it gets reinstantiated it'd show the latest
> > information.
>
> Right, I will have to think about that, but as I say above this
> isn't really what I would propose.
>
> If .update_times() sticks strictly to what kernfs_refresh_inode()
> does now then it would set the inode attributes from the node iattr
> only.
>
> >
> > That said, if you wanna take the direction where attr updates are
> > reflected
> > to the associated inode when the change occurs, which makes sense,
> > the right
> > thing to do would be making kernfs_setattr() update the associated
> > inode if
> > existent.
>
> Mmm, that's a good point but it looks like the inode isn't available
> there.
>
Is it possible to embed super block somewhere, then we can call
kernfs_get_inode to get inode in kernfs_setattr???

thanks,
fox

2020-12-22 02:20:00

by Ian Kent

[permalink] [raw]

Subject: Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

On Sat, 2020-12-19 at 15:47 +0800, Fox Chen wrote:
> On Sat, Dec 19, 2020 at 8:53 AM Ian Kent <[email protected]> wrote:
> > On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote:
> > > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <[email protected]>
> > > wrote:
> > > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <[email protected]>
> > > > > wrote:
> > > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > > > > What could be done is to make the kernfs node
> > > > > > > > > attr_mutex
> > > > > > > > > a pointer and dynamically allocate it but even that
> > > > > > > > > is
> > > > > > > > > too
> > > > > > > > > costly a size addition to the kernfs node structure
> > > > > > > > > as
> > > > > > > > > Tejun has said.
> > > > > > > >
> > > > > > > > I guess the question to ask is, is there really a need
> > > > > > > > to
> > > > > > > > call kernfs_refresh_inode() from functions that are
> > > > > > > > usually
> > > > > > > > reading/checking functions.
> > > > > > > >
> > > > > > > > Would it be sufficient to refresh the inode in the
> > > > > > > > write/set
> > > > > > > > operations in (if there's any) places where things like
> > > > > > > > setattr_copy() is not already called?
> > > > > > > >
> > > > > > > > Perhaps GKH or Tejun could comment on this?
> > > > > > >
> > > > > > > My memory is a bit hazy but invalidations on reads is how
> > > > > > > sysfs
> > > > > > > namespace is
> > > > > > > implemented, so I don't think there's an easy around
> > > > > > > that.
> > > > > > > The
> > > > > > > only
> > > > > > > thing I
> > > > > > > can think of is embedding the lock into attrs and doing
> > > > > > > xchg
> > > > > > > dance
> > > > > > > when
> > > > > > > attaching it.
> > > > > >
> > > > > > Sounds like your saying it would be ok to add a lock to the
> > > > > > attrs structure, am I correct?
> > > > > >
> > > > > > Assuming it is then, to keep things simple, use two locks.
> > > > > >
> > > > > > One global lock for the allocation and an attrs lock for
> > > > > > all
> > > > > > the
> > > > > > attrs field updates including the kernfs_refresh_inode()
> > > > > > update.
> > > > > >
> > > > > > The critical section for the global lock could be reduced
> > > > > > and
> > > > > > it
> > > > > > changed to a spin lock.
> > > > > >
> > > > > > In __kernfs_iattrs() we would have something like:
> > > > > >
> > > > > > take the allocation lock
> > > > > > do the allocated checks
> > > > > > assign if existing attrs
> > > > > > release the allocation lock
> > > > > > return existing if found
> > > > > > othewise
> > > > > > release the allocation lock
> > > > > >
> > > > > > allocate and initialize attrs
> > > > > >
> > > > > > take the allocation lock
> > > > > > check if someone beat us to it
> > > > > > free and grab exiting attrs
> > > > > > otherwise
> > > > > > assign the new attrs
> > > > > > release the allocation lock
> > > > > > return attrs
> > > > > >
> > > > > > Add a spinlock to the attrs struct and use it everywhere
> > > > > > for
> > > > > > field updates.
> > > > > >
> > > > > > Am I on the right track or can you see problems with this?
> > > > > >
> > > > > > Ian
> > > > > >
> > > > >
> > > > > umm, we update the inode in kernfs_refresh_inode, right?? So
> > > > > I
> > > > > guess
> > > > > the problem is how can we protect the inode when
> > > > > kernfs_refresh_inode
> > > > > is called, not the attrs??
> > > >
> > > > But the attrs (which is what's copied from) were protected by
> > > > the
> > > > mutex lock (IIUC) so dealing with the inode attributes implies
> > > > dealing with the kernfs node attrs too.
> > > >
> > > > For example in kernfs_iop_setattr() the call to setattr_copy()
> > > > copies
> > > > the node attrs to the inode under the same mutex lock. So, if a
> > > > read
> > > > lock is used the copy in kernfs_refresh_inode() is no longer
> > > > protected,
> > > > it needs to be protected in a different way.
> > > >
> > >
> > > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem
> > > for
> > > .setattr but
> > > no lock for .getattr (misdocumented?? sometimes they have as
> > > you've
> > > found out)?
> > > What does it protect against?? Because .permission does a similar
> > > thing
> > > here -- updating inode attributes, the goal is to provide the
> > > same
> > > protection level
> > > for .permission as for .setattr, am I right???
> >
> > As far as the documentation goes that's probably my
> > misunderstanding
> > of it.
> >
> > It does happen that the VFS makes assumptions about how call backs
> > are meant to be used.
> >
> > Read like call backs, like .getattr() and .permission() are meant
> > to
> > be used, well, like read like functions so the VFS should be ok to
> > take locks or not based on the operation context at hand.
> >
> > So it's not about the locking for these call backs per se, it's
> > about
> > the context in which they are called.
> >
> > For example, in link_path_walk(), at the beginning of the component
> > lookup loop (essentially for the containing directory at that
> > point),
> > may_lookup() is called which leads to a call to .permission()
> > without
> > any inode lock held at that point.
> >
> > But file opens (possibly following a path walk to resolve a path)
> > are different.
> >
> > For example, do_filp_open() calls path_openat() which leads to a
> > call to open_last_lookups(), which leads to a call to .permission()
> > along the way. And in this case there are two contexts, an open()
> > create or one without create, the former needing the exclusive
> > inode
> > lock and the later able to use the shared lock.
> >
> > So it's about the locking needed for the encompassing operation
> > that
> > is being done not about those functions specifically.
> >
> > TBH the VFS is very complex and Al has a much, much better
> > understanding of it than I do so he would need to be the one to
> > answer
> > whether it's the file systems responsibility to use these calls in
> > the
> > way the VFS expects.
> >
> > My belief is that if a file system needs to use a call back in a
> > way
> > that's in conflict with what the VFS expects it's the file systems'
> > responsibility to deal with the side effects.
> >
>
> Thanks for clarifying. Ian.
>
> Yeah, it's complex and confusing and it's very hard to spot lock
> context by reading VFS code.
>
> I put code like this:
> if (lockdep_is_held_type(&inode->i_rwsem, -1)) {
> if (lockdep_is_held_type(&inode->i_rwsem, 0)) {
> pr_warn("kernfs iop_permission inode WRITE
> lock is held");
> } else if (lockdep_is_held_type(&inode->i_rwsem, 1))
> {
> pr_warn("kernfs iop_permission inode READ
> lock
> is held");
> }
> } else {
> pr_warn("kernfs iop_permission inode lock is NOT
> held");
> }
>
> in both .permission & .getattr. Then I do some open/read/write/close
> to /sys, interestingly, all log outputs suggest they are in WRITE
> lock
> context.

The thing is in open_last_lookups() called from path_openat() we
have:
if (open_flag & O_CREAT)
inode_lock(dir->d_inode);
else
inode_lock_shared(dir->d_inode);

and from there it's - lookup_open() and possibly may_o_create() which
calls inode_permission() and onto .permission().

So, strictly speaking, it should be taken exclusive because you would
expect may_o_create() to be called only on O_CREATE.

But all the possible cases would need to be checked and if it is taken
shared anywhere we are in trouble.

Another example is in link_path_walk() we have a call to may_lookup()
which leads to a call to .permission() without the lock being held.

So there are a bunch of cases to check and knowing you are the owner
of the lock if it is held is at least risky when concurrency is high
and there's a possibility the lock can be taken shared.

Ian