I was running some regressions and noticed a (race-y) kernel warning that
happens when nlink becomes less than zero. Looking through the code
it looks like we aren't good about protecting the inode lock when
manipulating nlink and some code that was added several years ago to
protect against bugs in underlying file systems nlink handling didn't
look quite right either. I took a look at what NFS was doing and tried to
follow similar approaches in the 9p code.
Signed-off-by: Eric Van Hensbergen <[email protected]>
---
fs/9p/vfs_inode.c | 32 ++++++++++++++++++++++++--------
fs/9p/vfs_inode_dotl.c | 2 ++
2 files changed, 26 insertions(+), 8 deletions(-)
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index b845ee18a80b..9723c3cbae38 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -508,9 +508,12 @@ static int v9fs_at_to_dotl_flags(int flags)
/**
* v9fs_dec_count - helper functon to drop i_nlink.
*
- * If a directory had nlink <= 2 (including . and ..), then we should not drop
- * the link count, which indicates the underlying exported fs doesn't maintain
- * nlink accurately. e.g.
+ * Put a guards around this so we are only dropping nlink
+ * if it would be valid. This prevents bugs from an underlying
+ * filesystem implementations from triggering kernel WARNs. We'll
+ * still print 9p debug messages if the underlying filesystem is wrong.
+ *
+ * known underlying filesystems which might exhibit this issue:
* - overlayfs sets nlink to 1 for merged dir
* - ext4 (with dir_nlink feature enabled) sets nlink to 1 if a dir has more
* than EXT4_LINK_MAX (65000) links.
@@ -519,8 +522,13 @@ static int v9fs_at_to_dotl_flags(int flags)
*/
static void v9fs_dec_count(struct inode *inode)
{
- if (!S_ISDIR(inode->i_mode) || inode->i_nlink > 2)
+ spin_lock(&inode->i_lock);
+ if (inode->i_nlink > 0)
drop_nlink(inode);
+ else
+ p9_debug(P9_DEBUG_ERROR, "WARNING: nlink is already 0 inode %p\n",
+ inode);
+ spin_unlock(&inode->i_lock);
}
/**
@@ -566,8 +574,9 @@ static int v9fs_remove(struct inode *dir, struct dentry *dentry, int flags)
* link count
*/
if (flags & AT_REMOVEDIR) {
+ spin_lock(&inode->i_lock);
clear_nlink(inode);
- v9fs_dec_count(dir);
+ spin_unlock(&inode->i_lock);
} else
v9fs_dec_count(inode);
@@ -713,7 +722,9 @@ static int v9fs_vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
err = PTR_ERR(fid);
fid = NULL;
} else {
+ spin_lock(&dir->i_lock);
inc_nlink(dir);
+ spin_unlock(&dir->i_lock);
v9fs_invalidate_inode_attr(dir);
}
@@ -962,14 +973,19 @@ v9fs_vfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
error_locked:
if (!retval) {
if (new_inode) {
- if (S_ISDIR(new_inode->i_mode))
+ if (S_ISDIR(new_inode->i_mode)) {
+ spin_lock(&new_inode->i_lock);
clear_nlink(new_inode);
- else
+ spin_unlock(&new_inode->i_lock);
+ } else
v9fs_dec_count(new_inode);
}
if (S_ISDIR(old_inode->i_mode)) {
- if (!new_inode)
+ if (!new_inode) {
+ spin_lock(&new_dir->i_lock);
inc_nlink(new_dir);
+ spin_unlock(&new_dir->i_lock);
+ }
v9fs_dec_count(old_dir);
}
v9fs_invalidate_inode_attr(old_inode);
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index c7319af2f471..6cc037f726e7 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -427,7 +427,9 @@ static int v9fs_vfs_mkdir_dotl(struct mnt_idmap *idmap,
v9fs_set_create_acl(inode, fid, dacl, pacl);
d_instantiate(dentry, inode);
}
+ spin_lock(&dir->i_lock);
inc_nlink(dir);
+ spin_unlock(&dir->i_lock);
v9fs_invalidate_inode_attr(dir);
error:
p9_fid_put(fid);
---
base-commit: 5254c0cbc92d2a08e75443bdb914f1c4839cdf5a
change-id: 20240107-fix-nlink-handling-3c0646f5d927
Best regards,
--
Eric Van Hensbergen <[email protected]>
Eric Van Hensbergen wrote on Sun, Jan 07, 2024 at 07:07:52PM +0000:
> I was running some regressions and noticed a (race-y) kernel warning that
> happens when nlink becomes less than zero. Looking through the code
> it looks like we aren't good about protecting the inode lock when
> manipulating nlink and some code that was added several years ago to
> protect against bugs in underlying file systems nlink handling didn't
> look quite right either. I took a look at what NFS was doing and tried to
> follow similar approaches in the 9p code.
I was about to say the set/inc/etc_nlink helpers could probably just be
using atomic (there's an atomic_dec_if_postive that we could have used
for the v9fs_dec_count warning), but this isn't our code so not much to
do about that -- I agree it needs a lock.
I didn't take the time to check if you missed any, but it won't be worse
than what we have right now:
Acked-by: Dominique Martinet <[email protected]>
--
Dominique Martinet | Asmadeus
On Monday, January 8, 2024 12:19:34 PM CET [email protected] wrote:
> Eric Van Hensbergen wrote on Sun, Jan 07, 2024 at 07:07:52PM +0000:
> > I was running some regressions and noticed a (race-y) kernel warning that
> > happens when nlink becomes less than zero. Looking through the code
> > it looks like we aren't good about protecting the inode lock when
> > manipulating nlink and some code that was added several years ago to
> > protect against bugs in underlying file systems nlink handling didn't
> > look quite right either. I took a look at what NFS was doing and tried to
> > follow similar approaches in the 9p code.
>
> I was about to say the set/inc/etc_nlink helpers could probably just be
> using atomic (there's an atomic_dec_if_postive that we could have used
> for the v9fs_dec_count warning), but this isn't our code so not much to
> do about that -- I agree it needs a lock.
>
> I didn't take the time to check if you missed any, but it won't be worse
> than what we have right now:
> Acked-by: Dominique Martinet <[email protected]>
That's actually a good point. For these tasks atomic inc/sub/etc are usually
used instead of locks.
I would at least add local wrapper functions that would do these spinlocks for
us.
However would it be too bold to change those inode functions to use atomic
operations directly on their end?
/Christian
On Mon, Jan 8, 2024 at 6:08 AM Christian Schoenebeck
<[email protected]> wrote:
>
> On Monday, January 8, 2024 12:19:34 PM CET [email protected] wrote:
> > Eric Van Hensbergen wrote on Sun, Jan 07, 2024 at 07:07:52PM +0000:
> > > I was running some regressions and noticed a (race-y) kernel warning that
> > > happens when nlink becomes less than zero. Looking through the code
> > > it looks like we aren't good about protecting the inode lock when
> > > manipulating nlink and some code that was added several years ago to
> > > protect against bugs in underlying file systems nlink handling didn't
> > > look quite right either. I took a look at what NFS was doing and tried to
> > > follow similar approaches in the 9p code.
> >
> > I was about to say the set/inc/etc_nlink helpers could probably just be
> > using atomic (there's an atomic_dec_if_postive that we could have used
> > for the v9fs_dec_count warning), but this isn't our code so not much to
> > do about that -- I agree it needs a lock.
> >
> > I didn't take the time to check if you missed any, but it won't be worse
> > than what we have right now:
> > Acked-by: Dominique Martinet <[email protected]>
>
> That's actually a good point. For these tasks atomic inc/sub/etc are usually
> used instead of locks.
>
> I would at least add local wrapper functions that would do these spinlocks for
> us.
>
I'm good with adding local wrapper functions, I imagine these aren't
used in the kernel because for regular file-systems maybe you want the
warning that your inode link accounting is wrong.
I suppose we could be naughty and not use the kernel functions (which
themselves are basically wrappers).
-eric
On Monday, January 8, 2024 3:12:24 PM CET Eric Van Hensbergen wrote:
> On Mon, Jan 8, 2024 at 6:08 AM Christian Schoenebeck
> <[email protected]> wrote:
> >
> > On Monday, January 8, 2024 12:19:34 PM CET [email protected] wrote:
> > > Eric Van Hensbergen wrote on Sun, Jan 07, 2024 at 07:07:52PM +0000:
> > > > I was running some regressions and noticed a (race-y) kernel warning that
> > > > happens when nlink becomes less than zero. Looking through the code
> > > > it looks like we aren't good about protecting the inode lock when
> > > > manipulating nlink and some code that was added several years ago to
> > > > protect against bugs in underlying file systems nlink handling didn't
> > > > look quite right either. I took a look at what NFS was doing and tried to
> > > > follow similar approaches in the 9p code.
> > >
> > > I was about to say the set/inc/etc_nlink helpers could probably just be
> > > using atomic (there's an atomic_dec_if_postive that we could have used
> > > for the v9fs_dec_count warning), but this isn't our code so not much to
> > > do about that -- I agree it needs a lock.
> > >
> > > I didn't take the time to check if you missed any, but it won't be worse
> > > than what we have right now:
> > > Acked-by: Dominique Martinet <[email protected]>
> >
> > That's actually a good point. For these tasks atomic inc/sub/etc are usually
> > used instead of locks.
> >
> > I would at least add local wrapper functions that would do these spinlocks for
> > us.
> >
>
> I'm good with adding local wrapper functions, I imagine these aren't
> used in the kernel because for regular file-systems maybe you want the
> warning that your inode link accounting is wrong.
> I suppose we could be naughty and not use the kernel functions (which
> themselves are basically wrappers).
Well, one half of that code is actually using atomic operations to increment/
decrement the private counter. Which means to me those kernel functions were
intended to be called from a concurrent context. So I don't get why the other
variable is not atomic there. They should be I think.
I would probably try and send a patch for changing those kernel functions and
see if people are fine with that. But up to you.
Christian Schoenebeck wrote on Mon, Jan 08, 2024 at 03:55:53PM +0100:
> > I'm good with adding local wrapper functions,
(Agreed having a local wrapper that locks + use these is better than this
current patch -- v2 looks much better, thanks!)
> > I imagine these aren't
> > used in the kernel because for regular file-systems maybe you want the
> > warning that your inode link accounting is wrong.
> > I suppose we could be naughty and not use the kernel functions (which
> > themselves are basically wrappers).
>
> Well, one half of that code is actually using atomic operations to increment/
> decrement the private counter. Which means to me those kernel functions were
> intended to be called from a concurrent context. So I don't get why the other
> variable is not atomic there. They should be I think.
I think the key difference is inode level vs superblock level -- the
inode is local and holding a lock can be faster if manipulations are
grouped together (x atomic operations are usually slower than a spinlock
and x normal operations), while the sb potentially has contentions and
would be more likely to use atomic...
> I would probably try and send a patch for changing those kernel functions and
> see if people are fine with that. But up to you.
With that said I just checked ext4 and it looks just as racy as we do in
particular the rmdir/unlink case doesn't seem to take any lock, so it's
definitely worth raising the subject on fsdevel!
I'll see how work is busy and ask later today if time allows
--
Dominique Martinet | Asmadeus