Somebody noticed an "ls -l" over nfs failing on entries with inode
numbers greater than 2^32 on a 32-bit NFS server. The cause is some
code that tries to compare i_ino to the full 64-bit inode number.
I think the following will fix it, but I'm curious: why is i_ino
"unsigned long", anyway? Is there something know that depends on that,
or is it just that the sheer number of users makes it too scary to
change?
--b.
commit 0cc784eb430285535ae7a79dd5133ab66e9ce839
Author: J. Bruce Fields <[email protected]>
Date: Tue Sep 10 11:41:12 2013 -0400
exportfs: fix 32-bit nfsd handling of 64-bit inode numbers
Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a
32-bit NFS server exporting a very large XFS filesystem, when the
server's cache is cold (so the inodes in question are not in cache).
Signed-off-by: J. Bruce Fields <[email protected]>
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 293bc2e..6a79bb8 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -215,7 +215,7 @@ struct getdents_callback {
struct dir_context ctx;
char *name; /* name that was found. It already points to a
buffer NAME_MAX+1 is size */
- unsigned long ino; /* the inum we are looking for */
+ u64 ino; /* the inum we are looking for */
int found; /* inode matched? */
int sequence; /* sequence counter */
};
@@ -255,10 +255,10 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
struct inode *dir = path->dentry->d_inode;
int error;
struct file *file;
+ struct kstat stat;
struct getdents_callback buffer = {
.ctx.actor = filldir_one,
.name = name,
- .ino = child->d_inode->i_ino
};
error = -ENOTDIR;
@@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
if (!dir->i_fop)
goto out;
/*
+ * inode->i_ino is unsigned long, kstat->ino is u64, so the
+ * former would be insufficient on 32-bit hosts when the
+ * filesystem supports 64-bit inode numbers. So we need to
+ * actually call ->getattr, not just read i_ino:
+ */
+ error = vfs_getattr(path, &stat);
+ if (error)
+ return error;
+ buffer.ino = stat.ino;
+ /*
* Open the directory ...
*/
file = dentry_open(path, O_RDONLY, cred);
On Thu, Sep 12, 2013 at 12:03:24PM -0400, J. Bruce Fields wrote:
> Somebody noticed an "ls -l" over nfs failing on entries with inode
> numbers greater than 2^32 on a 32-bit NFS server. The cause is some
> code that tries to compare i_ino to the full 64-bit inode number.
>
> I think the following will fix it, but I'm curious: why is i_ino
> "unsigned long", anyway? Is there something know that depends on that,
> or is it just that the sheer number of users makes it too scary to
> change?
i_ino use is entirely up to filesystem; it may be used by library helpers,
provided that the choice of using or not using those is, again, up to
filesystem in question. NFSD has no damn business looking at it; library
helpers in fs/exportfs might, but that makes them not suitable for use
by filesystems without inode numbers or with 64bit ones.
The reason why it's there at all is that it serves as convenient icache
search key for many filesystems. IOW, it's used by iget_locked() and
avoiding the overhead of 64bit comparisons on 32bit hosts is the main
reason to avoid making it u64.
Again, no fs-independent code has any business looking at it, 64bit or
not. From the VFS point of view there is no such thing as inode number.
And get_name() is just a library helper. For many fs types it works
as suitable ->s_export_op->get_name() instance, but decision to use it
or not belongs to filesystem in question and frankly, it's probably better
to provide an instance of your own anyway.
On Thu, Sep 12, 2013 at 08:33:28PM +0100, Al Viro wrote:
> i_ino use is entirely up to filesystem; it may be used by library helpers,
> provided that the choice of using or not using those is, again, up to
> filesystem in question.
With more and more filesystems using large inode numbers I'm starting to
wonder if this still makes sense. But given that it's been this way for
a long time we should have more documentation of it for sure.
> NFSD has no damn business looking at it; library
> helpers in fs/exportfs might, but that makes them not suitable for use
> by filesystems without inode numbers or with 64bit ones.
>
> The reason why it's there at all is that it serves as convenient icache
> search key for many filesystems. IOW, it's used by iget_locked() and
> avoiding the overhead of 64bit comparisons on 32bit hosts is the main
> reason to avoid making it u64.
>
> Again, no fs-independent code has any business looking at it, 64bit or
> not. From the VFS point of view there is no such thing as inode number.
> And get_name() is just a library helper. For many fs types it works
> as suitable ->s_export_op->get_name() instance, but decision to use it
> or not belongs to filesystem in question and frankly, it's probably better
> to provide an instance of your own anyway.
Given that these days most exportable filesystems use 64-bit inode
numbers I think we should put the patch from Bruce in. Nevermind that
it's in a slow path, so the overhead of vfs_getattr really doesn't hurt.
On Wed, Oct 02, 2013 at 10:25:27AM -0400, J. Bruce Fields wrote:
> If so then it's no huge code duplication to it by hand:
>
> if (inode->i_op->getattr)
> inode->i_op->getattr(path->mnt, path->dentry, &stat);
> else
> generic_fillattr(inode, &stat);
Maybe make that a vfs_getattr_nosec and let vfs_getattr call it?
Including a proper kerneldoc comment explaining when to use it, please.
On Wed, 2 Oct 2013, J. Bruce Fields wrote:
> On Wed, Oct 02, 2013 at 11:47:22AM -0700, Sage Weil wrote:
> > On Wed, 2 Oct 2013, J. Bruce Fields wrote:
> > > On Wed, Oct 02, 2013 at 09:05:27AM -0700, Christoph Hellwig wrote:
> > > > On Wed, Oct 02, 2013 at 10:25:27AM -0400, J. Bruce Fields wrote:
> > > > > If so then it's no huge code duplication to it by hand:
> > > > >
> > > > > if (inode->i_op->getattr)
> > > > > inode->i_op->getattr(path->mnt, path->dentry, &stat);
> > > > > else
> > > > > generic_fillattr(inode, &stat);
> > > >
> > > > Maybe make that a vfs_getattr_nosec and let vfs_getattr call it?
> > > >
> > > > Including a proper kerneldoc comment explaining when to use it, please.
> > >
> > > Something like this?
> >
> > I'm late to this thread, but: getattr() is a comparatively expensive
> > operation for just getting an (immutable) ino value on a distributed or
> > clustered fs. It would be nice if there were a separate call that didn't
> > try to populate all the other kstat fields with valid data. Something
> > like this was part of the xstat series from forever ago (a bit mask passed
> > to getattr indicating which fields were needed), so the approach below
> > might be okay if we think we'll get there sometime soon, but my preference
> > would be for another fix...
>
> Understood, and perhaps this code should eventually take advantage of
> xstat, but:
>
> - this code isn't handling the common case, so we're not too
> worried about the performance, and
> - you also have the option of defining your own
> export_operations->get_name. Especially consider that if you
> have some better way to answer the question "what is the name
> of inode X in directory Y" that's better than reading Y
> looking for a matching inode number.
Ah, I see. Sounds good then!
Thanks-
sage
>
> --b.
>
> > (Ceph also uses 64-bit inos.)
> >
> > Thanks!
> > sage
> >
> >
> > >
> > > --b.
> > >
> > > commit 8418a41b7192cf2f372ae091207adb29a088f9a0
> > > Author: J. Bruce Fields <[email protected]>
> > > Date: Tue Sep 10 11:41:12 2013 -0400
> > >
> > > exportfs: fix 32-bit nfsd handling of 64-bit inode numbers
> > >
> > > Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a
> > > 32-bit NFS server exporting a very large XFS filesystem, when the
> > > server's cache is cold (so the inodes in question are not in cache).
> > >
> > > Reported-by: Trevor Cordes <[email protected]>
> > > Signed-off-by: J. Bruce Fields <[email protected]>
> > >
> > > diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
> > > index 293bc2e..811831a 100644
> > > --- a/fs/exportfs/expfs.c
> > > +++ b/fs/exportfs/expfs.c
> > > @@ -215,7 +215,7 @@ struct getdents_callback {
> > > struct dir_context ctx;
> > > char *name; /* name that was found. It already points to a
> > > buffer NAME_MAX+1 is size */
> > > - unsigned long ino; /* the inum we are looking for */
> > > + u64 ino; /* the inum we are looking for */
> > > int found; /* inode matched? */
> > > int sequence; /* sequence counter */
> > > };
> > > @@ -255,10 +255,10 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> > > struct inode *dir = path->dentry->d_inode;
> > > int error;
> > > struct file *file;
> > > + struct kstat stat;
> > > struct getdents_callback buffer = {
> > > .ctx.actor = filldir_one,
> > > .name = name,
> > > - .ino = child->d_inode->i_ino
> > > };
> > >
> > > error = -ENOTDIR;
> > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> > > if (!dir->i_fop)
> > > goto out;
> > > /*
> > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the
> > > + * former would be insufficient on 32-bit hosts when the
> > > + * filesystem supports 64-bit inode numbers. So we need to
> > > + * actually call ->getattr, not just read i_ino:
> > > + */
> > > + error = vfs_getattr_nosec(path, &stat);
> > > + if (error)
> > > + return error;
> > > + buffer.ino = stat.ino;
> > > + /*
> > > * Open the directory ...
> > > */
> > > file = dentry_open(path, O_RDONLY, cred);
> > > diff --git a/fs/stat.c b/fs/stat.c
> > > index 04ce1ac..71a39e8 100644
> > > --- a/fs/stat.c
> > > +++ b/fs/stat.c
> > > @@ -37,14 +37,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
> > >
> > > EXPORT_SYMBOL(generic_fillattr);
> > >
> > > -int vfs_getattr(struct path *path, struct kstat *stat)
> > > +/**
> > > + * vfs_getattr_nosec - getattr without security checks
> > > + * @path: file to get attributes from
> > > + * @stat: structure to return attributes in
> > > + *
> > > + * Get attributes without calling security_inode_getattr.
> > > + *
> > > + * Currently the only caller other than vfs_getattr is internal to the
> > > + * filehandle lookup code, which uses only the inode number and returns
> > > + * no attributes to any user. Any other code probably wants
> > > + * vfs_getattr.
> > > + */
> > > +int vfs_getattr_nosec(struct path *path, struct kstat *stat)
> > > {
> > > struct inode *inode = path->dentry->d_inode;
> > > - int retval;
> > > -
> > > - retval = security_inode_getattr(path->mnt, path->dentry);
> > > - if (retval)
> > > - return retval;
> > >
> > > if (inode->i_op->getattr)
> > > return inode->i_op->getattr(path->mnt, path->dentry, stat);
> > > @@ -53,6 +60,18 @@ int vfs_getattr(struct path *path, struct kstat *stat)
> > > return 0;
> > > }
> > >
> > > +EXPORT_SYMBOL_GPL(vfs_getattr_nosec);
> > > +
> > > +int vfs_getattr(struct path *path, struct kstat *stat)
> > > +{
> > > + int retval;
> > > +
> > > + retval = security_inode_getattr(path->mnt, path->dentry);
> > > + if (retval)
> > > + return retval;
> > > + return vfs_getattr_nosec(path, stat);
> > > +}
> > > +
> > > EXPORT_SYMBOL(vfs_getattr);
> > >
> > > int vfs_fstat(unsigned int fd, struct kstat *stat)
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 9818747..5a51faa 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -2500,6 +2500,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len);
> > > extern const struct inode_operations page_symlink_inode_operations;
> > > extern int generic_readlink(struct dentry *, char __user *, int);
> > > extern void generic_fillattr(struct inode *, struct kstat *);
> > > +int vfs_getattr_nosec(struct path *path, struct kstat *stat);
> > > extern int vfs_getattr(struct path *, struct kstat *);
> > > void __inode_add_bytes(struct inode *inode, loff_t bytes);
> > > void inode_add_bytes(struct inode *inode, loff_t bytes);
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > the body of a message to [email protected]
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >
> > >
>
>
On Sun, Sep 29, 2013 at 04:54:54AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 12, 2013 at 08:33:28PM +0100, Al Viro wrote:
> > i_ino use is entirely up to filesystem; it may be used by library helpers,
> > provided that the choice of using or not using those is, again, up to
> > filesystem in question.
>
> With more and more filesystems using large inode numbers I'm starting to
> wonder if this still makes sense. But given that it's been this way for
> a long time we should have more documentation of it for sure.
>
> > NFSD has no damn business looking at it; library
> > helpers in fs/exportfs might, but that makes them not suitable for use
> > by filesystems without inode numbers or with 64bit ones.
> >
> > The reason why it's there at all is that it serves as convenient icache
> > search key for many filesystems. IOW, it's used by iget_locked() and
> > avoiding the overhead of 64bit comparisons on 32bit hosts is the main
> > reason to avoid making it u64.
> >
> > Again, no fs-independent code has any business looking at it, 64bit or
> > not. From the VFS point of view there is no such thing as inode number.
> > And get_name() is just a library helper. For many fs types it works
> > as suitable ->s_export_op->get_name() instance, but decision to use it
> > or not belongs to filesystem in question and frankly, it's probably better
> > to provide an instance of your own anyway.
>
> Given that these days most exportable filesystems use 64-bit inode
> numbers I think we should put the patch from Bruce in. Nevermind that
> it's in a slow path, so the overhead of vfs_getattr really doesn't hurt.
Calling vfs_getattr adds a security_inode_getattr() call that wasn't
there before. Any chance of that being a problem?
If so then it's no huge code duplication to it by hand:
if (inode->i_op->getattr)
inode->i_op->getattr(path->mnt, path->dentry, &stat);
else
generic_fillattr(inode, &stat);
--b.
From: "J. Bruce Fields" <[email protected]>
The filehandle lookup code wants this version of getattr.
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/stat.c | 31 +++++++++++++++++++++++++------
include/linux/fs.h | 1 +
2 files changed, 26 insertions(+), 6 deletions(-)
diff --git a/fs/stat.c b/fs/stat.c
index 04ce1ac..71a39e8 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -37,14 +37,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
EXPORT_SYMBOL(generic_fillattr);
-int vfs_getattr(struct path *path, struct kstat *stat)
+/**
+ * vfs_getattr_nosec - getattr without security checks
+ * @path: file to get attributes from
+ * @stat: structure to return attributes in
+ *
+ * Get attributes without calling security_inode_getattr.
+ *
+ * Currently the only caller other than vfs_getattr is internal to the
+ * filehandle lookup code, which uses only the inode number and returns
+ * no attributes to any user. Any other code probably wants
+ * vfs_getattr.
+ */
+int vfs_getattr_nosec(struct path *path, struct kstat *stat)
{
struct inode *inode = path->dentry->d_inode;
- int retval;
-
- retval = security_inode_getattr(path->mnt, path->dentry);
- if (retval)
- return retval;
if (inode->i_op->getattr)
return inode->i_op->getattr(path->mnt, path->dentry, stat);
@@ -53,6 +60,18 @@ int vfs_getattr(struct path *path, struct kstat *stat)
return 0;
}
+EXPORT_SYMBOL_GPL(vfs_getattr_nosec);
+
+int vfs_getattr(struct path *path, struct kstat *stat)
+{
+ int retval;
+
+ retval = security_inode_getattr(path->mnt, path->dentry);
+ if (retval)
+ return retval;
+ return vfs_getattr_nosec(path, stat);
+}
+
EXPORT_SYMBOL(vfs_getattr);
int vfs_fstat(unsigned int fd, struct kstat *stat)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9818747..5a51faa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2500,6 +2500,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len);
extern const struct inode_operations page_symlink_inode_operations;
extern int generic_readlink(struct dentry *, char __user *, int);
extern void generic_fillattr(struct inode *, struct kstat *);
+int vfs_getattr_nosec(struct path *path, struct kstat *stat);
extern int vfs_getattr(struct path *, struct kstat *);
void __inode_add_bytes(struct inode *inode, loff_t bytes);
void inode_add_bytes(struct inode *inode, loff_t bytes);
--
1.7.9.5
On Fri, Oct 11, 2013 at 05:53:51PM -0400, J. Bruce Fields wrote:
> On Fri, Oct 11, 2013 at 09:28:07AM +1100, Dave Chinner wrote:
> > On Wed, Oct 09, 2013 at 10:53:20AM -0400, J. Bruce Fields wrote:
> > > On Wed, Oct 09, 2013 at 11:16:31AM +1100, Dave Chinner wrote:
> > > > On Tue, Oct 08, 2013 at 05:56:56PM -0400, J. Bruce Fields wrote:
> > > > > On Fri, Oct 04, 2013 at 06:15:22PM -0400, J. Bruce Fields wrote:
> > > > > > On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote:
> > > > > > > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote:
> > > > > > > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> > > > > > > > if (!dir->i_fop)
> > > > > > > > goto out;
> > > > > > > > /*
> > > > > > > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the
> > > > > > > > + * former would be insufficient on 32-bit hosts when the
> > > > > > > > + * filesystem supports 64-bit inode numbers. So we need to
> > > > > > > > + * actually call ->getattr, not just read i_ino:
> > > > > > > > + */
> > > > > > > > + error = vfs_getattr_nosec(path, &stat);
> > > > > > >
> > > > > > > Doh, "path" here is for the parent.... The following works better!
> > > > > >
> > > > > > By the way, I'm testing this with:
> > > > > >
> > > > > > - create a bunch of nested subdirectories, use
> > > > > > name_to_fhandle_at to get a handle for the bottom directory.
> > > > > > - echo 2 >/proc/sys/vm/drop_caches
> > > > > > - open_by_fhandle_at on the filehandle
> > > > > >
> > > > > > But this only actually exercises the reconnect path on the first run
> > > > > > after boot. Is there something obvious I'm missing here?
> > > > >
> > > > > Looking at the code.... OK, most of the work of drop_caches is done by
> > > > > shrink_slab_node, which doesn't actually try to free every single thing
> > > > > that it could free--in particular, it won't try to free anything if it
> > > > > thinks there are less than shrinker->batch_size (1024 in the
> > > > > super_block->s_shrink case) objects to free.
> > >
> > > (Oops, sorry, that should have been "less than half of
> > > shrinker->batch_size", see below.)
> > >
> > > > That's not quite right. Yes, the shrinker won't be called if the
> > > > calculated scan count is less than the batch size, but the left over
> > > > is added back the shrinker scan count to carry over to the next call
> > > > to the shrinker. Hence if you repeated call the shrinker on a small
> > > > cache with a large batch size, it will eventually aggregate the scan
> > > > counts to over the batch size and trim the cache....
> > >
> > > No, in shrink_slab_count, we do this:
> > >
> > > if (total_scan > max_pass * 2)
> > > total_scan = max_pass * 2;
> > >
> > > while (total_scan >= batch_size) {
> > > ...
> > > }
> > >
> > > where max_pass is the value returned from count_objects. So as long as
> > > count_objects returns less than half batch_size, nothing ever happens.
> >
> > Ah, right - I hadn't considered what that does to small caches - the
> > intended purpose of that is to limit the scan size when caches are
> > extremely large and lots of deferral has occurred. Perhaps we need
> > to consider the batch size in this? e.g.:
> >
> > total_scan = min(total_scan, max(max_pass * 2, batch_size));
> >
> > Hence for small caches (max_pass <<< batch_size), it evaluates as:
> >
> > total_scan = min(total_scan, batch_size);
> >
> > and hence once aggregation of repeated calls pushes us over the
> > batch size we run the shrinker.
> >
> > For large caches (max_pass >>> batch_size), it evaluates as:
> >
> > total_scan = min(total_scan, max_pass * 2);
> >
> > which gives us the same behaviour as the current code.
> >
> > I'll write up a patch to do this...
>
> It all feels a bit ad-hoc, but OK.
>
> drop_caches could still end up leaving some small caches alone, right?
Yes, but it's iterative nature means that as long as it is making
progress it will continue to call the shrinkers and hence in most
cases caches will get more than just one call to be shrunk.
> I hadn't expected that, but then again maybe I don't really understand
> what drop_caches is for.
drop_caches is a "best attempt" to free memory, not a guaranteed
method of freeing pages or slab objects. It's a big hammer that can
free a lot of memory and it will continue to free memory as long as
it makes progress. But if it can't make progress, then it simply
stops, and that can happen at any time during slab cache
shrinking...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Oct 02, 2013 at 09:05:27AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 02, 2013 at 10:25:27AM -0400, J. Bruce Fields wrote:
> > If so then it's no huge code duplication to it by hand:
> >
> > if (inode->i_op->getattr)
> > inode->i_op->getattr(path->mnt, path->dentry, &stat);
> > else
> > generic_fillattr(inode, &stat);
>
> Maybe make that a vfs_getattr_nosec and let vfs_getattr call it?
>
> Including a proper kerneldoc comment explaining when to use it, please.
Something like this?
--b.
commit 8418a41b7192cf2f372ae091207adb29a088f9a0
Author: J. Bruce Fields <[email protected]>
Date: Tue Sep 10 11:41:12 2013 -0400
exportfs: fix 32-bit nfsd handling of 64-bit inode numbers
Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a
32-bit NFS server exporting a very large XFS filesystem, when the
server's cache is cold (so the inodes in question are not in cache).
Reported-by: Trevor Cordes <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 293bc2e..811831a 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -215,7 +215,7 @@ struct getdents_callback {
struct dir_context ctx;
char *name; /* name that was found. It already points to a
buffer NAME_MAX+1 is size */
- unsigned long ino; /* the inum we are looking for */
+ u64 ino; /* the inum we are looking for */
int found; /* inode matched? */
int sequence; /* sequence counter */
};
@@ -255,10 +255,10 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
struct inode *dir = path->dentry->d_inode;
int error;
struct file *file;
+ struct kstat stat;
struct getdents_callback buffer = {
.ctx.actor = filldir_one,
.name = name,
- .ino = child->d_inode->i_ino
};
error = -ENOTDIR;
@@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
if (!dir->i_fop)
goto out;
/*
+ * inode->i_ino is unsigned long, kstat->ino is u64, so the
+ * former would be insufficient on 32-bit hosts when the
+ * filesystem supports 64-bit inode numbers. So we need to
+ * actually call ->getattr, not just read i_ino:
+ */
+ error = vfs_getattr_nosec(path, &stat);
+ if (error)
+ return error;
+ buffer.ino = stat.ino;
+ /*
* Open the directory ...
*/
file = dentry_open(path, O_RDONLY, cred);
diff --git a/fs/stat.c b/fs/stat.c
index 04ce1ac..71a39e8 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -37,14 +37,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
EXPORT_SYMBOL(generic_fillattr);
-int vfs_getattr(struct path *path, struct kstat *stat)
+/**
+ * vfs_getattr_nosec - getattr without security checks
+ * @path: file to get attributes from
+ * @stat: structure to return attributes in
+ *
+ * Get attributes without calling security_inode_getattr.
+ *
+ * Currently the only caller other than vfs_getattr is internal to the
+ * filehandle lookup code, which uses only the inode number and returns
+ * no attributes to any user. Any other code probably wants
+ * vfs_getattr.
+ */
+int vfs_getattr_nosec(struct path *path, struct kstat *stat)
{
struct inode *inode = path->dentry->d_inode;
- int retval;
-
- retval = security_inode_getattr(path->mnt, path->dentry);
- if (retval)
- return retval;
if (inode->i_op->getattr)
return inode->i_op->getattr(path->mnt, path->dentry, stat);
@@ -53,6 +60,18 @@ int vfs_getattr(struct path *path, struct kstat *stat)
return 0;
}
+EXPORT_SYMBOL_GPL(vfs_getattr_nosec);
+
+int vfs_getattr(struct path *path, struct kstat *stat)
+{
+ int retval;
+
+ retval = security_inode_getattr(path->mnt, path->dentry);
+ if (retval)
+ return retval;
+ return vfs_getattr_nosec(path, stat);
+}
+
EXPORT_SYMBOL(vfs_getattr);
int vfs_fstat(unsigned int fd, struct kstat *stat)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9818747..5a51faa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2500,6 +2500,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len);
extern const struct inode_operations page_symlink_inode_operations;
extern int generic_readlink(struct dentry *, char __user *, int);
extern void generic_fillattr(struct inode *, struct kstat *);
+int vfs_getattr_nosec(struct path *path, struct kstat *stat);
extern int vfs_getattr(struct path *, struct kstat *);
void __inode_add_bytes(struct inode *inode, loff_t bytes);
void inode_add_bytes(struct inode *inode, loff_t bytes);
On Wed, Oct 02, 2013 at 10:57:27AM -0700, Christoph Hellwig wrote:
> > Something like this?
>
> Looks good, although I'd split it into two separate patches for the VFS
> and exportfs bits.
Hm, this might be to the point of leaving the first patch a little
unmotivated, but OK, maybe so.
Thanks for the review!
--b.
On Wed, Oct 02, 2013 at 11:43:20AM -0400, J. Bruce Fields wrote:
>
> Just for fun, I took a stab at an xfs-specific method.
>
> Pretty much untested and probably wrong as I know nothing about xfs, but
> it does seem like it allows some minor simplifications compared to the
> common helper.
>
> But as Christoph says other filesystems have the same problem so we
> probably want to fix the common helper anyway.
You'll need this in at least ext4, btrfs, gfs2, ocfs2 and probably more.
So fixing this for real sounds like the best deal.
On Wed, Oct 09, 2013 at 10:53:20AM -0400, J. Bruce Fields wrote:
> On Wed, Oct 09, 2013 at 11:16:31AM +1100, Dave Chinner wrote:
> > On Tue, Oct 08, 2013 at 05:56:56PM -0400, J. Bruce Fields wrote:
> > > On Fri, Oct 04, 2013 at 06:15:22PM -0400, J. Bruce Fields wrote:
> > > > On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote:
> > > > > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote:
> > > > > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> > > > > > if (!dir->i_fop)
> > > > > > goto out;
> > > > > > /*
> > > > > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the
> > > > > > + * former would be insufficient on 32-bit hosts when the
> > > > > > + * filesystem supports 64-bit inode numbers. So we need to
> > > > > > + * actually call ->getattr, not just read i_ino:
> > > > > > + */
> > > > > > + error = vfs_getattr_nosec(path, &stat);
> > > > >
> > > > > Doh, "path" here is for the parent.... The following works better!
> > > >
> > > > By the way, I'm testing this with:
> > > >
> > > > - create a bunch of nested subdirectories, use
> > > > name_to_fhandle_at to get a handle for the bottom directory.
> > > > - echo 2 >/proc/sys/vm/drop_caches
> > > > - open_by_fhandle_at on the filehandle
> > > >
> > > > But this only actually exercises the reconnect path on the first run
> > > > after boot. Is there something obvious I'm missing here?
> > >
> > > Looking at the code.... OK, most of the work of drop_caches is done by
> > > shrink_slab_node, which doesn't actually try to free every single thing
> > > that it could free--in particular, it won't try to free anything if it
> > > thinks there are less than shrinker->batch_size (1024 in the
> > > super_block->s_shrink case) objects to free.
>
> (Oops, sorry, that should have been "less than half of
> shrinker->batch_size", see below.)
>
> > That's not quite right. Yes, the shrinker won't be called if the
> > calculated scan count is less than the batch size, but the left over
> > is added back the shrinker scan count to carry over to the next call
> > to the shrinker. Hence if you repeated call the shrinker on a small
> > cache with a large batch size, it will eventually aggregate the scan
> > counts to over the batch size and trim the cache....
>
> No, in shrink_slab_count, we do this:
>
> if (total_scan > max_pass * 2)
> total_scan = max_pass * 2;
>
> while (total_scan >= batch_size) {
> ...
> }
>
> where max_pass is the value returned from count_objects. So as long as
> count_objects returns less than half batch_size, nothing ever happens.
Ah, right - I hadn't considered what that does to small caches - the
intended purpose of that is to limit the scan size when caches are
extremely large and lots of deferral has occurred. Perhaps we need
to consider the batch size in this? e.g.:
total_scan = min(total_scan, max(max_pass * 2, batch_size));
Hence for small caches (max_pass <<< batch_size), it evaluates as:
total_scan = min(total_scan, batch_size);
and hence once aggregation of repeated calls pushes us over the
batch size we run the shrinker.
For large caches (max_pass >>> batch_size), it evaluates as:
total_scan = min(total_scan, max_pass * 2);
which gives us the same behaviour as the current code.
I'll write up a patch to do this...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Oct 08, 2013 at 05:56:56PM -0400, J. Bruce Fields wrote:
> On Fri, Oct 04, 2013 at 06:15:22PM -0400, J. Bruce Fields wrote:
> > On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote:
> > > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote:
> > > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> > > > if (!dir->i_fop)
> > > > goto out;
> > > > /*
> > > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the
> > > > + * former would be insufficient on 32-bit hosts when the
> > > > + * filesystem supports 64-bit inode numbers. So we need to
> > > > + * actually call ->getattr, not just read i_ino:
> > > > + */
> > > > + error = vfs_getattr_nosec(path, &stat);
> > >
> > > Doh, "path" here is for the parent.... The following works better!
> >
> > By the way, I'm testing this with:
> >
> > - create a bunch of nested subdirectories, use
> > name_to_fhandle_at to get a handle for the bottom directory.
> > - echo 2 >/proc/sys/vm/drop_caches
> > - open_by_fhandle_at on the filehandle
> >
> > But this only actually exercises the reconnect path on the first run
> > after boot. Is there something obvious I'm missing here?
>
> Looking at the code.... OK, most of the work of drop_caches is done by
> shrink_slab_node, which doesn't actually try to free every single thing
> that it could free--in particular, it won't try to free anything if it
> thinks there are less than shrinker->batch_size (1024 in the
> super_block->s_shrink case) objects to free.
That's not quite right. Yes, the shrinker won't be called if the
calculated scan count is less than the batch size, but the left over
is added back the shrinker scan count to carry over to the next call
to the shrinker. Hence if you repeated call the shrinker on a small
cache with a large batch size, it will eventually aggregate the scan
counts to over the batch size and trim the cache....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Oct 02, 2013 at 09:04:48AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 02, 2013 at 11:43:20AM -0400, J. Bruce Fields wrote:
> >
> > Just for fun, I took a stab at an xfs-specific method.
> >
> > Pretty much untested and probably wrong as I know nothing about xfs, but
> > it does seem like it allows some minor simplifications compared to the
> > common helper.
> >
> > But as Christoph says other filesystems have the same problem so we
> > probably want to fix the common helper anyway.
>
> You'll need this in at least ext4, btrfs, gfs2, ocfs2 and probably more.
> So fixing this for real sounds like the best deal.
Based on "git grep '\bget_name\b' fs/".... Actually gfs2 and btrfs
already have their own get_name methods. They look like they probably
handle 64-bit inodes fine.
That leaves xfs, ext4, and ocfs2.
Anyway I'm still in favor of fixing the helper.
--b.
On Wed, 2 Oct 2013, J. Bruce Fields wrote:
> On Wed, Oct 02, 2013 at 09:05:27AM -0700, Christoph Hellwig wrote:
> > On Wed, Oct 02, 2013 at 10:25:27AM -0400, J. Bruce Fields wrote:
> > > If so then it's no huge code duplication to it by hand:
> > >
> > > if (inode->i_op->getattr)
> > > inode->i_op->getattr(path->mnt, path->dentry, &stat);
> > > else
> > > generic_fillattr(inode, &stat);
> >
> > Maybe make that a vfs_getattr_nosec and let vfs_getattr call it?
> >
> > Including a proper kerneldoc comment explaining when to use it, please.
>
> Something like this?
I'm late to this thread, but: getattr() is a comparatively expensive
operation for just getting an (immutable) ino value on a distributed or
clustered fs. It would be nice if there were a separate call that didn't
try to populate all the other kstat fields with valid data. Something
like this was part of the xstat series from forever ago (a bit mask passed
to getattr indicating which fields were needed), so the approach below
might be okay if we think we'll get there sometime soon, but my preference
would be for another fix...
(Ceph also uses 64-bit inos.)
Thanks!
sage
>
> --b.
>
> commit 8418a41b7192cf2f372ae091207adb29a088f9a0
> Author: J. Bruce Fields <[email protected]>
> Date: Tue Sep 10 11:41:12 2013 -0400
>
> exportfs: fix 32-bit nfsd handling of 64-bit inode numbers
>
> Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a
> 32-bit NFS server exporting a very large XFS filesystem, when the
> server's cache is cold (so the inodes in question are not in cache).
>
> Reported-by: Trevor Cordes <[email protected]>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
> index 293bc2e..811831a 100644
> --- a/fs/exportfs/expfs.c
> +++ b/fs/exportfs/expfs.c
> @@ -215,7 +215,7 @@ struct getdents_callback {
> struct dir_context ctx;
> char *name; /* name that was found. It already points to a
> buffer NAME_MAX+1 is size */
> - unsigned long ino; /* the inum we are looking for */
> + u64 ino; /* the inum we are looking for */
> int found; /* inode matched? */
> int sequence; /* sequence counter */
> };
> @@ -255,10 +255,10 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> struct inode *dir = path->dentry->d_inode;
> int error;
> struct file *file;
> + struct kstat stat;
> struct getdents_callback buffer = {
> .ctx.actor = filldir_one,
> .name = name,
> - .ino = child->d_inode->i_ino
> };
>
> error = -ENOTDIR;
> @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> if (!dir->i_fop)
> goto out;
> /*
> + * inode->i_ino is unsigned long, kstat->ino is u64, so the
> + * former would be insufficient on 32-bit hosts when the
> + * filesystem supports 64-bit inode numbers. So we need to
> + * actually call ->getattr, not just read i_ino:
> + */
> + error = vfs_getattr_nosec(path, &stat);
> + if (error)
> + return error;
> + buffer.ino = stat.ino;
> + /*
> * Open the directory ...
> */
> file = dentry_open(path, O_RDONLY, cred);
> diff --git a/fs/stat.c b/fs/stat.c
> index 04ce1ac..71a39e8 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -37,14 +37,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
>
> EXPORT_SYMBOL(generic_fillattr);
>
> -int vfs_getattr(struct path *path, struct kstat *stat)
> +/**
> + * vfs_getattr_nosec - getattr without security checks
> + * @path: file to get attributes from
> + * @stat: structure to return attributes in
> + *
> + * Get attributes without calling security_inode_getattr.
> + *
> + * Currently the only caller other than vfs_getattr is internal to the
> + * filehandle lookup code, which uses only the inode number and returns
> + * no attributes to any user. Any other code probably wants
> + * vfs_getattr.
> + */
> +int vfs_getattr_nosec(struct path *path, struct kstat *stat)
> {
> struct inode *inode = path->dentry->d_inode;
> - int retval;
> -
> - retval = security_inode_getattr(path->mnt, path->dentry);
> - if (retval)
> - return retval;
>
> if (inode->i_op->getattr)
> return inode->i_op->getattr(path->mnt, path->dentry, stat);
> @@ -53,6 +60,18 @@ int vfs_getattr(struct path *path, struct kstat *stat)
> return 0;
> }
>
> +EXPORT_SYMBOL_GPL(vfs_getattr_nosec);
> +
> +int vfs_getattr(struct path *path, struct kstat *stat)
> +{
> + int retval;
> +
> + retval = security_inode_getattr(path->mnt, path->dentry);
> + if (retval)
> + return retval;
> + return vfs_getattr_nosec(path, stat);
> +}
> +
> EXPORT_SYMBOL(vfs_getattr);
>
> int vfs_fstat(unsigned int fd, struct kstat *stat)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9818747..5a51faa 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2500,6 +2500,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len);
> extern const struct inode_operations page_symlink_inode_operations;
> extern int generic_readlink(struct dentry *, char __user *, int);
> extern void generic_fillattr(struct inode *, struct kstat *);
> +int vfs_getattr_nosec(struct path *path, struct kstat *stat);
> extern int vfs_getattr(struct path *, struct kstat *);
> void __inode_add_bytes(struct inode *inode, loff_t bytes);
> void inode_add_bytes(struct inode *inode, loff_t bytes);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
On Wed, Oct 02, 2013 at 11:47:22AM -0700, Sage Weil wrote:
> On Wed, 2 Oct 2013, J. Bruce Fields wrote:
> > On Wed, Oct 02, 2013 at 09:05:27AM -0700, Christoph Hellwig wrote:
> > > On Wed, Oct 02, 2013 at 10:25:27AM -0400, J. Bruce Fields wrote:
> > > > If so then it's no huge code duplication to it by hand:
> > > >
> > > > if (inode->i_op->getattr)
> > > > inode->i_op->getattr(path->mnt, path->dentry, &stat);
> > > > else
> > > > generic_fillattr(inode, &stat);
> > >
> > > Maybe make that a vfs_getattr_nosec and let vfs_getattr call it?
> > >
> > > Including a proper kerneldoc comment explaining when to use it, please.
> >
> > Something like this?
>
> I'm late to this thread, but: getattr() is a comparatively expensive
> operation for just getting an (immutable) ino value on a distributed or
> clustered fs. It would be nice if there were a separate call that didn't
> try to populate all the other kstat fields with valid data. Something
> like this was part of the xstat series from forever ago (a bit mask passed
> to getattr indicating which fields were needed), so the approach below
> might be okay if we think we'll get there sometime soon, but my preference
> would be for another fix...
Understood, and perhaps this code should eventually take advantage of
xstat, but:
- this code isn't handling the common case, so we're not too
worried about the performance, and
- you also have the option of defining your own
export_operations->get_name. Especially consider that if you
have some better way to answer the question "what is the name
of inode X in directory Y" that's better than reading Y
looking for a matching inode number.
--b.
> (Ceph also uses 64-bit inos.)
>
> Thanks!
> sage
>
>
> >
> > --b.
> >
> > commit 8418a41b7192cf2f372ae091207adb29a088f9a0
> > Author: J. Bruce Fields <[email protected]>
> > Date: Tue Sep 10 11:41:12 2013 -0400
> >
> > exportfs: fix 32-bit nfsd handling of 64-bit inode numbers
> >
> > Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a
> > 32-bit NFS server exporting a very large XFS filesystem, when the
> > server's cache is cold (so the inodes in question are not in cache).
> >
> > Reported-by: Trevor Cordes <[email protected]>
> > Signed-off-by: J. Bruce Fields <[email protected]>
> >
> > diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
> > index 293bc2e..811831a 100644
> > --- a/fs/exportfs/expfs.c
> > +++ b/fs/exportfs/expfs.c
> > @@ -215,7 +215,7 @@ struct getdents_callback {
> > struct dir_context ctx;
> > char *name; /* name that was found. It already points to a
> > buffer NAME_MAX+1 is size */
> > - unsigned long ino; /* the inum we are looking for */
> > + u64 ino; /* the inum we are looking for */
> > int found; /* inode matched? */
> > int sequence; /* sequence counter */
> > };
> > @@ -255,10 +255,10 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> > struct inode *dir = path->dentry->d_inode;
> > int error;
> > struct file *file;
> > + struct kstat stat;
> > struct getdents_callback buffer = {
> > .ctx.actor = filldir_one,
> > .name = name,
> > - .ino = child->d_inode->i_ino
> > };
> >
> > error = -ENOTDIR;
> > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> > if (!dir->i_fop)
> > goto out;
> > /*
> > + * inode->i_ino is unsigned long, kstat->ino is u64, so the
> > + * former would be insufficient on 32-bit hosts when the
> > + * filesystem supports 64-bit inode numbers. So we need to
> > + * actually call ->getattr, not just read i_ino:
> > + */
> > + error = vfs_getattr_nosec(path, &stat);
> > + if (error)
> > + return error;
> > + buffer.ino = stat.ino;
> > + /*
> > * Open the directory ...
> > */
> > file = dentry_open(path, O_RDONLY, cred);
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 04ce1ac..71a39e8 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -37,14 +37,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
> >
> > EXPORT_SYMBOL(generic_fillattr);
> >
> > -int vfs_getattr(struct path *path, struct kstat *stat)
> > +/**
> > + * vfs_getattr_nosec - getattr without security checks
> > + * @path: file to get attributes from
> > + * @stat: structure to return attributes in
> > + *
> > + * Get attributes without calling security_inode_getattr.
> > + *
> > + * Currently the only caller other than vfs_getattr is internal to the
> > + * filehandle lookup code, which uses only the inode number and returns
> > + * no attributes to any user. Any other code probably wants
> > + * vfs_getattr.
> > + */
> > +int vfs_getattr_nosec(struct path *path, struct kstat *stat)
> > {
> > struct inode *inode = path->dentry->d_inode;
> > - int retval;
> > -
> > - retval = security_inode_getattr(path->mnt, path->dentry);
> > - if (retval)
> > - return retval;
> >
> > if (inode->i_op->getattr)
> > return inode->i_op->getattr(path->mnt, path->dentry, stat);
> > @@ -53,6 +60,18 @@ int vfs_getattr(struct path *path, struct kstat *stat)
> > return 0;
> > }
> >
> > +EXPORT_SYMBOL_GPL(vfs_getattr_nosec);
> > +
> > +int vfs_getattr(struct path *path, struct kstat *stat)
> > +{
> > + int retval;
> > +
> > + retval = security_inode_getattr(path->mnt, path->dentry);
> > + if (retval)
> > + return retval;
> > + return vfs_getattr_nosec(path, stat);
> > +}
> > +
> > EXPORT_SYMBOL(vfs_getattr);
> >
> > int vfs_fstat(unsigned int fd, struct kstat *stat)
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 9818747..5a51faa 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2500,6 +2500,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len);
> > extern const struct inode_operations page_symlink_inode_operations;
> > extern int generic_readlink(struct dentry *, char __user *, int);
> > extern void generic_fillattr(struct inode *, struct kstat *);
> > +int vfs_getattr_nosec(struct path *path, struct kstat *stat);
> > extern int vfs_getattr(struct path *, struct kstat *);
> > void __inode_add_bytes(struct inode *inode, loff_t bytes);
> > void inode_add_bytes(struct inode *inode, loff_t bytes);
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> >
On Fri, Oct 04, 2013 at 06:15:22PM -0400, J. Bruce Fields wrote:
> On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote:
> > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote:
> > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> > > if (!dir->i_fop)
> > > goto out;
> > > /*
> > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the
> > > + * former would be insufficient on 32-bit hosts when the
> > > + * filesystem supports 64-bit inode numbers. So we need to
> > > + * actually call ->getattr, not just read i_ino:
> > > + */
> > > + error = vfs_getattr_nosec(path, &stat);
> >
> > Doh, "path" here is for the parent.... The following works better!
>
> By the way, I'm testing this with:
>
> - create a bunch of nested subdirectories, use
> name_to_fhandle_at to get a handle for the bottom directory.
> - echo 2 >/proc/sys/vm/drop_caches
> - open_by_fhandle_at on the filehandle
>
> But this only actually exercises the reconnect path on the first run
> after boot. Is there something obvious I'm missing here?
Looking at the code.... OK, most of the work of drop_caches is done by
shrink_slab_node, which doesn't actually try to free every single thing
that it could free--in particular, it won't try to free anything if it
thinks there are less than shrinker->batch_size (1024 in the
super_block->s_shrink case) objects to free.
So for now I'm just nesting deeper (2048 subdirectories) to get a
reliable way to exercise reconnect_path.
--b.
> Something like this?
Looks good, although I'd split it into two separate patches for the VFS
and exportfs bits.
Reviewed-by: Christoph Hellwig <[email protected]>
On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote:
> On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote:
> > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> > if (!dir->i_fop)
> > goto out;
> > /*
> > + * inode->i_ino is unsigned long, kstat->ino is u64, so the
> > + * former would be insufficient on 32-bit hosts when the
> > + * filesystem supports 64-bit inode numbers. So we need to
> > + * actually call ->getattr, not just read i_ino:
> > + */
> > + error = vfs_getattr_nosec(path, &stat);
>
> Doh, "path" here is for the parent.... The following works better!
By the way, I'm testing this with:
- create a bunch of nested subdirectories, use
name_to_fhandle_at to get a handle for the bottom directory.
- echo 2 >/proc/sys/vm/drop_caches
- open_by_fhandle_at on the filehandle
But this only actually exercises the reconnect path on the first run
after boot. Is there something obvious I'm missing here?
--b.
From: "J. Bruce Fields" <[email protected]>
Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a
32-bit NFS server exporting a very large XFS filesystem, when the
server's cache is cold (so the inodes in question are not in cache).
Reviewed-by: Christoph Hellwig <[email protected]>
Reported-by: Trevor Cordes <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/exportfs/expfs.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 293bc2e..811831a 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -215,7 +215,7 @@ struct getdents_callback {
struct dir_context ctx;
char *name; /* name that was found. It already points to a
buffer NAME_MAX+1 is size */
- unsigned long ino; /* the inum we are looking for */
+ u64 ino; /* the inum we are looking for */
int found; /* inode matched? */
int sequence; /* sequence counter */
};
@@ -255,10 +255,10 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
struct inode *dir = path->dentry->d_inode;
int error;
struct file *file;
+ struct kstat stat;
struct getdents_callback buffer = {
.ctx.actor = filldir_one,
.name = name,
- .ino = child->d_inode->i_ino
};
error = -ENOTDIR;
@@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
if (!dir->i_fop)
goto out;
/*
+ * inode->i_ino is unsigned long, kstat->ino is u64, so the
+ * former would be insufficient on 32-bit hosts when the
+ * filesystem supports 64-bit inode numbers. So we need to
+ * actually call ->getattr, not just read i_ino:
+ */
+ error = vfs_getattr_nosec(path, &stat);
+ if (error)
+ return error;
+ buffer.ino = stat.ino;
+ /*
* Open the directory ...
*/
file = dentry_open(path, O_RDONLY, cred);
--
1.7.9.5
On Fri, Oct 11, 2013 at 09:28:07AM +1100, Dave Chinner wrote:
> On Wed, Oct 09, 2013 at 10:53:20AM -0400, J. Bruce Fields wrote:
> > On Wed, Oct 09, 2013 at 11:16:31AM +1100, Dave Chinner wrote:
> > > On Tue, Oct 08, 2013 at 05:56:56PM -0400, J. Bruce Fields wrote:
> > > > On Fri, Oct 04, 2013 at 06:15:22PM -0400, J. Bruce Fields wrote:
> > > > > On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote:
> > > > > > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote:
> > > > > > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> > > > > > > if (!dir->i_fop)
> > > > > > > goto out;
> > > > > > > /*
> > > > > > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the
> > > > > > > + * former would be insufficient on 32-bit hosts when the
> > > > > > > + * filesystem supports 64-bit inode numbers. So we need to
> > > > > > > + * actually call ->getattr, not just read i_ino:
> > > > > > > + */
> > > > > > > + error = vfs_getattr_nosec(path, &stat);
> > > > > >
> > > > > > Doh, "path" here is for the parent.... The following works better!
> > > > >
> > > > > By the way, I'm testing this with:
> > > > >
> > > > > - create a bunch of nested subdirectories, use
> > > > > name_to_fhandle_at to get a handle for the bottom directory.
> > > > > - echo 2 >/proc/sys/vm/drop_caches
> > > > > - open_by_fhandle_at on the filehandle
> > > > >
> > > > > But this only actually exercises the reconnect path on the first run
> > > > > after boot. Is there something obvious I'm missing here?
> > > >
> > > > Looking at the code.... OK, most of the work of drop_caches is done by
> > > > shrink_slab_node, which doesn't actually try to free every single thing
> > > > that it could free--in particular, it won't try to free anything if it
> > > > thinks there are less than shrinker->batch_size (1024 in the
> > > > super_block->s_shrink case) objects to free.
> >
> > (Oops, sorry, that should have been "less than half of
> > shrinker->batch_size", see below.)
> >
> > > That's not quite right. Yes, the shrinker won't be called if the
> > > calculated scan count is less than the batch size, but the left over
> > > is added back the shrinker scan count to carry over to the next call
> > > to the shrinker. Hence if you repeated call the shrinker on a small
> > > cache with a large batch size, it will eventually aggregate the scan
> > > counts to over the batch size and trim the cache....
> >
> > No, in shrink_slab_count, we do this:
> >
> > if (total_scan > max_pass * 2)
> > total_scan = max_pass * 2;
> >
> > while (total_scan >= batch_size) {
> > ...
> > }
> >
> > where max_pass is the value returned from count_objects. So as long as
> > count_objects returns less than half batch_size, nothing ever happens.
>
> Ah, right - I hadn't considered what that does to small caches - the
> intended purpose of that is to limit the scan size when caches are
> extremely large and lots of deferral has occurred. Perhaps we need
> to consider the batch size in this? e.g.:
>
> total_scan = min(total_scan, max(max_pass * 2, batch_size));
>
> Hence for small caches (max_pass <<< batch_size), it evaluates as:
>
> total_scan = min(total_scan, batch_size);
>
> and hence once aggregation of repeated calls pushes us over the
> batch size we run the shrinker.
>
> For large caches (max_pass >>> batch_size), it evaluates as:
>
> total_scan = min(total_scan, max_pass * 2);
>
> which gives us the same behaviour as the current code.
>
> I'll write up a patch to do this...
It all feels a bit ad-hoc, but OK.
drop_caches could still end up leaving some small caches alone, right?
I hadn't expected that, but then again maybe I don't really understand
what drop_caches is for.
--b.
On Wed, Oct 02, 2013 at 10:25:27AM -0400, J. Bruce Fields wrote:
> On Sun, Sep 29, 2013 at 04:54:54AM -0700, Christoph Hellwig wrote:
> > On Thu, Sep 12, 2013 at 08:33:28PM +0100, Al Viro wrote:
> > > i_ino use is entirely up to filesystem; it may be used by library helpers,
> > > provided that the choice of using or not using those is, again, up to
> > > filesystem in question.
> >
> > With more and more filesystems using large inode numbers I'm starting to
> > wonder if this still makes sense. But given that it's been this way for
> > a long time we should have more documentation of it for sure.
> >
> > > NFSD has no damn business looking at it; library
> > > helpers in fs/exportfs might, but that makes them not suitable for use
> > > by filesystems without inode numbers or with 64bit ones.
> > >
> > > The reason why it's there at all is that it serves as convenient icache
> > > search key for many filesystems. IOW, it's used by iget_locked() and
> > > avoiding the overhead of 64bit comparisons on 32bit hosts is the main
> > > reason to avoid making it u64.
> > >
> > > Again, no fs-independent code has any business looking at it, 64bit or
> > > not. From the VFS point of view there is no such thing as inode number.
> > > And get_name() is just a library helper. For many fs types it works
> > > as suitable ->s_export_op->get_name() instance, but decision to use it
> > > or not belongs to filesystem in question and frankly, it's probably better
> > > to provide an instance of your own anyway.
> >
> > Given that these days most exportable filesystems use 64-bit inode
> > numbers I think we should put the patch from Bruce in. Nevermind that
> > it's in a slow path, so the overhead of vfs_getattr really doesn't hurt.
>
> Calling vfs_getattr adds a security_inode_getattr() call that wasn't
> there before. Any chance of that being a problem?
>
> If so then it's no huge code duplication to it by hand:
>
> if (inode->i_op->getattr)
> inode->i_op->getattr(path->mnt, path->dentry, &stat);
> else
> generic_fillattr(inode, &stat);
Just for fun, I took a stab at an xfs-specific method.
Pretty much untested and probably wrong as I know nothing about xfs, but
it does seem like it allows some minor simplifications compared to the
common helper.
But as Christoph says other filesystems have the same problem so we
probably want to fix the common helper anyway.
--b.
commit 1d2c2fa899d94aef5283aa66f4bd453305a62030
Author: J. Bruce Fields <[email protected]>
Date: Fri Sep 20 16:14:53 2013 -0400
exportfs: fix 32-bit nfsd handling of 64-bit inode numbers
Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a
32-bit NFS server exporting a very large XFS filesystem, when the
server's cache is cold (so the inodes in question are not in cache).
Reported-by: Trevor Cordes <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
index 066df42..e76a288 100644
--- a/fs/xfs/xfs_export.c
+++ b/fs/xfs/xfs_export.c
@@ -31,6 +31,7 @@
#include "xfs_inode_item.h"
#include "xfs_trace.h"
#include "xfs_icache.h"
+#include "xfs_dir2_priv.h"
/*
* Note that we only accept fileids which are long enough rather than allow
@@ -240,10 +241,90 @@ xfs_fs_nfs_commit_metadata(
return _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, NULL);
}
+struct getdents_callback {
+ struct dir_context ctx;
+ char *name; /* name that was found. It already points to a
+ buffer NAME_MAX+1 is size */
+ u64 ino; /* the inum we are looking for */
+ int found; /* inode matched? */
+ int sequence; /* sequence counter */
+};
+
+/*
+ * A rather strange filldir function to capture
+ * the name matching the specified inode number.
+ */
+static int filldir_one(void * __buf, const char * name, int len,
+ loff_t pos, u64 ino, unsigned int d_type)
+{
+ struct getdents_callback *buf = __buf;
+ int result = 0;
+
+ buf->sequence++;
+ if (buf->ino == ino && len <= NAME_MAX) {
+ memcpy(buf->name, name, len);
+ buf->name[len] = '\0';
+ buf->found = 1;
+ result = -1;
+ }
+ return result;
+}
+
+STATIC int
+xfs_fs_get_name(
+ struct dentry *parent,
+ char *name,
+ struct dentry *child)
+{
+ struct inode *dir = parent->d_inode;
+ int error;
+ struct xfs_inode *ip = XFS_I(child->d_inode);
+ struct getdents_callback buffer = {
+ .ctx.actor = filldir_one,
+ .name = name,
+ .ino = ip->i_ino
+ };
+ size_t bufsize;
+
+ error = -ENOTDIR;
+ if (!dir || !S_ISDIR(dir->i_mode))
+ goto out;
+
+ /* See comment in fs/xfs/xfs_file.c:xfs_file_readdir */
+ bufsize = (size_t)min_t(loff_t, 32768, ip->i_d.di_size);
+
+ buffer.sequence = 0;
+ while (1) {
+ int old_seq = buffer.sequence;
+
+ error = mutex_lock_killable(&dir->i_mutex);
+ if (error)
+ goto out;
+ if (!IS_DEADDIR(dir))
+ error = -xfs_readdir(ip, &buffer.ctx, bufsize);
+ mutex_unlock(&dir->i_mutex);
+ if (buffer.found) {
+ error = 0;
+ break;
+ }
+
+ if (error)
+ break;
+
+ error = -ENOENT;
+ if (old_seq == buffer.sequence)
+ break;
+ }
+
+out:
+ return error;
+}
+
const struct export_operations xfs_export_operations = {
.encode_fh = xfs_fs_encode_fh,
.fh_to_dentry = xfs_fs_fh_to_dentry,
.fh_to_parent = xfs_fs_fh_to_parent,
.get_parent = xfs_fs_get_parent,
.commit_metadata = xfs_fs_nfs_commit_metadata,
+ .get_name = xfs_fs_get_name
};
On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote:
> @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> if (!dir->i_fop)
> goto out;
> /*
> + * inode->i_ino is unsigned long, kstat->ino is u64, so the
> + * former would be insufficient on 32-bit hosts when the
> + * filesystem supports 64-bit inode numbers. So we need to
> + * actually call ->getattr, not just read i_ino:
> + */
> + error = vfs_getattr_nosec(path, &stat);
Doh, "path" here is for the parent.... The following works better!
--b.
commit 231d38b6f775c4677a61b4d7dc15e0a0d6bbb709
Author: J. Bruce Fields <[email protected]>
Date: Tue Sep 10 11:41:12 2013 -0400
exportfs: fix 32-bit nfsd handling of 64-bit inode numbers
Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a
32-bit NFS server exporting a very large XFS filesystem, when the
server's cache is cold (so the inodes in question are not in cache).
Reviewed-by: Christoph Hellwig <[email protected]>
Reported-by: Trevor Cordes <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index a235f00..c43fe9b 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -215,7 +215,7 @@ struct getdents_callback {
struct dir_context ctx;
char *name; /* name that was found. It already points to a
buffer NAME_MAX+1 is size */
- unsigned long ino; /* the inum we are looking for */
+ u64 ino; /* the inum we are looking for */
int found; /* inode matched? */
int sequence; /* sequence counter */
};
@@ -255,10 +255,14 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
struct inode *dir = path->dentry->d_inode;
int error;
struct file *file;
+ struct kstat stat;
+ struct path child_path = {
+ .mnt = path->mnt,
+ .dentry = child,
+ };
struct getdents_callback buffer = {
.ctx.actor = filldir_one,
.name = name,
- .ino = child->d_inode->i_ino
};
error = -ENOTDIR;
@@ -268,6 +272,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
if (!dir->i_fop)
goto out;
/*
+ * inode->i_ino is unsigned long, kstat->ino is u64, so the
+ * former would be insufficient on 32-bit hosts when the
+ * filesystem supports 64-bit inode numbers. So we need to
+ * actually call ->getattr, not just read i_ino:
+ */
+ error = vfs_getattr_nosec(&child_path, &stat);
+ if (error)
+ return error;
+ buffer.ino = stat.ino;
+ /*
* Open the directory ...
*/
file = dentry_open(path, O_RDONLY, cred);
On Wed, Oct 09, 2013 at 11:16:31AM +1100, Dave Chinner wrote:
> On Tue, Oct 08, 2013 at 05:56:56PM -0400, J. Bruce Fields wrote:
> > On Fri, Oct 04, 2013 at 06:15:22PM -0400, J. Bruce Fields wrote:
> > > On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote:
> > > > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote:
> > > > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
> > > > > if (!dir->i_fop)
> > > > > goto out;
> > > > > /*
> > > > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the
> > > > > + * former would be insufficient on 32-bit hosts when the
> > > > > + * filesystem supports 64-bit inode numbers. So we need to
> > > > > + * actually call ->getattr, not just read i_ino:
> > > > > + */
> > > > > + error = vfs_getattr_nosec(path, &stat);
> > > >
> > > > Doh, "path" here is for the parent.... The following works better!
> > >
> > > By the way, I'm testing this with:
> > >
> > > - create a bunch of nested subdirectories, use
> > > name_to_fhandle_at to get a handle for the bottom directory.
> > > - echo 2 >/proc/sys/vm/drop_caches
> > > - open_by_fhandle_at on the filehandle
> > >
> > > But this only actually exercises the reconnect path on the first run
> > > after boot. Is there something obvious I'm missing here?
> >
> > Looking at the code.... OK, most of the work of drop_caches is done by
> > shrink_slab_node, which doesn't actually try to free every single thing
> > that it could free--in particular, it won't try to free anything if it
> > thinks there are less than shrinker->batch_size (1024 in the
> > super_block->s_shrink case) objects to free.
(Oops, sorry, that should have been "less than half of
shrinker->batch_size", see below.)
> That's not quite right. Yes, the shrinker won't be called if the
> calculated scan count is less than the batch size, but the left over
> is added back the shrinker scan count to carry over to the next call
> to the shrinker. Hence if you repeated call the shrinker on a small
> cache with a large batch size, it will eventually aggregate the scan
> counts to over the batch size and trim the cache....
No, in shrink_slab_count, we do this:
if (total_scan > max_pass * 2)
total_scan = max_pass * 2;
while (total_scan >= batch_size) {
...
}
where max_pass is the value returned from count_objects. So as long as
count_objects returns less than half batch_size, nothing ever happens.
(I wonder if that check's correct? The "forever" in the comment above
it seems wrong at least.)
--b.