2018-07-31 22:53:12

by Mark Fasheh

[permalink] [raw]
Subject: [RFC PATCH 0/4] vfs: map unique ino/dev pairs for user space

Hi,

These patches are follow-up to this conversation on fsdevel which provides a
good break-down of the core problem:

https://www.spinics.net/lists/linux-fsdevel/msg128003.html

That e-mail was in turn a follow-up to the following patch series:

https://lwn.net/Articles/753917/

which tried to solve the dev portion of this problem with a bit of structure
re-routing but was never accepted.


We have an inconsistency in how the kernel is exporting inode number /
device pairs for user space. I'm not talking about stat(2) or statx(2) here
but everywhere else where an ino/dev is exported into a buffer for user
space.

We typically do this by simply dumping the potentially 32 bit inode->i_ino
and single device from super->s_dev. Sometimes these values differ from
what is returned via stat(2) or statx(2). Some filesystems might even show
duplicate (but internally different!) pairs when the raw i_ino/s_dev is
used.

Aside from breaking software which expects these pairs to be unique, this
can put the user in a situation where they might not be able to find an
inode referenced from some kernel log (be it /proc/, printk buffer, etc).
What's even worse - depending on how ino is exported, they might even find
an existing but /wrong/ file.

The following patches fix these inconsistencies by introducing a VFS helper
function which calls the underlying filesystem ->getattr to get our real
inode number / device pair. The returned values can then be used at those
places in the kernel where we are incorrectly reporting our ino/dev pair.
We then update fs/proc/ and fs/locks.c to use this helper when writing to
/proc/PID/maps and /proc/locks respectively.


For anyone who is watching the evolution of these patches, this would be one
implementation of the 'callback' method which I discussed in my last e-mail.
This approach is very straight forward and easy to understand but has some
drawbacks:

- Not all of the call sites can tolerate errors. Instead, we fallback to
inode->i_ino and super->s_dev in case of getattr error. That behavior is
no worse than what we have today but what we have today is pretty bad so I
don't particularly feel good about that. It's better than nothing though.

- There are places in the kernel where we could never call into ->getattr.
There are tons of tracepoints for example that are printing the the wrong
ino/dev pair.

- The helper function has to fake a path struct because getting a vfsmount
from inside these call sites is impossible. This isn't actually a big
deal as nobody except NFS cares and the NFS patch is very trivial. It's
ugly though, so if we had consensus on this approach I would happily
rework our ->getattr function signature, perhaps pulling the vfsmount and
dentry arguments back out.

- The dentry argument to ->getattr is potentially problematic as we have
some places which _only_ have an inode. Even if we had a dentry I'm not
sure what would keep it from going negative while in the middle of our
->getattr call.


Comments and review would be greatly appreciated. Thanks,
--Mark


2018-07-31 22:53:14

by Mark Fasheh

[permalink] [raw]
Subject: [PATCH 1/4] vfs: introduce function to map unique ino/dev pairs

There are places in the VFS where we export an ino/dev pair to userspace.
/proc/PID/maps is a good example - we directly expose inode->i_ino and
inode->i_sb->s_dev to userspace there. Many filesystems don't put a unique
value in inode->i_ino and instead rely on ->getattr to provide the real
inode number to userspace. super->s_dev is similar - some filesystems
expose a difference device from what's put in super->s_dev when queried via
stat/statx.

Ultimately this makes it impossible for a user (or software) to match one of
those reported pairs to the right inode.

We can fix this by adding a helper function, vfs_map_unique_ino_dev(), which
will query the owning filesystem (via ->getattr) to get the correct ino/dev
pair. Later patches will update those places which simply dump inode->i_ino
and super->s_dev to use the helper.

Signed-off-by: Mark Fasheh <[email protected]>
---
fs/stat.c | 23 +++++++++++++++++++++++
include/linux/fs.h | 2 ++
2 files changed, 25 insertions(+)

diff --git a/fs/stat.c b/fs/stat.c
index f8e6fb2c3657..80ea42505219 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -84,6 +84,29 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
}
EXPORT_SYMBOL(vfs_getattr_nosec);

+void vfs_map_unique_ino_dev(struct dentry *dentry, u64 *ino, dev_t *dev)
+{
+ int ret;
+ struct path path;
+ struct kstat stat;
+
+ path.mnt = NULL;
+ path.dentry = dentry;
+ /* ->dev is always returned, so we only need to specify ino here */
+ ret = vfs_getattr_nosec(&path, &stat, STATX_INO, 0);
+ if (ret) {
+ struct inode *inode = d_inode(dentry);
+ /* Fallback to old behavior in case of getattr error */
+ *ino = inode->i_ino;
+ *dev = inode->i_sb->s_dev;
+ return ret;
+ }
+ *ino = stat.ino;
+ *dev = stat.dev;
+ return 0;
+}
+EXPORT_SYMBOL(vfs_map_unique_ino_dev);
+
/*
* vfs_getattr - Get the enhanced basic attributes of a file
* @path: The file of interest
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d78d146a98da..b80789472438 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3077,6 +3077,8 @@ extern void kfree_link(void *);
extern void generic_fillattr(struct inode *, struct kstat *);
extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int);
+extern void vfs_map_unique_ino_dev(struct dentry *dentry, u64 *ino, dev_t *dev);
+
void __inode_add_bytes(struct inode *inode, loff_t bytes);
void inode_add_bytes(struct inode *inode, loff_t bytes);
void __inode_sub_bytes(struct inode *inode, loff_t bytes);
--
2.15.1


2018-07-31 22:53:16

by Mark Fasheh

[permalink] [raw]
Subject: [PATCH 2/4] nfs: check for NULL vfsmount in nfs_getattr

->getattr from inside the kernel won't always have a vfsmount, check for
this.

Signed-off-by: Mark Fasheh <[email protected]>
---
fs/nfs/inode.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b65aee481d13..7a3d90a7cc92 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -795,7 +795,9 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
}

/*
- * We may force a getattr if the user cares about atime.
+ * We may force a getattr if the user cares about atime. A
+ * NULL vfsmount means we are called from inside the kernel,
+ * where no atime is required.
*
* Note that we only have to check the vfsmount flags here:
* - NFS always sets S_NOATIME by so checking it would give a
@@ -803,7 +805,7 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
* - NFS never sets SB_NOATIME or SB_NODIRATIME so there is
* no point in checking those.
*/
- if ((path->mnt->mnt_flags & MNT_NOATIME) ||
+ if (!path->mnt || (path->mnt->mnt_flags & MNT_NOATIME) ||
((path->mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)))
request_mask &= ~STATX_ATIME;

--
2.15.1


2018-07-31 22:53:21

by Mark Fasheh

[permalink] [raw]
Subject: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace

/proc/locks does not always print the correct inode number/device pair.
Update lock_get_status() to use vfs_map_unique_ino_dev() to get the real,
unique values for userspace.

Signed-off-by: Mark Fasheh <[email protected]>
---
fs/locks.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index db7b6917d9c5..3a012df87fd8 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2621,6 +2621,7 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
loff_t id, char *pfx)
{
struct inode *inode = NULL;
+ struct dentry *dentry;
unsigned int fl_pid;
struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;

@@ -2633,8 +2634,10 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
if (fl_pid == 0)
return;

- if (fl->fl_file != NULL)
+ if (fl->fl_file != NULL) {
inode = locks_inode(fl->fl_file);
+ dentry = file_dentry(fl->fl_file);
+ }

seq_printf(f, "%lld:%s ", id, pfx);
if (IS_POSIX(fl)) {
@@ -2681,10 +2684,13 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
: (fl->fl_type == F_WRLCK) ? "WRITE" : "READ ");
}
if (inode) {
+ __u64 ino;
+ dev_t dev;
+
+ vfs_map_unique_ino_dev(dentry, &ino, &dev);
/* userspace relies on this representation of dev_t */
seq_printf(f, "%d %02x:%02x:%ld ", fl_pid,
- MAJOR(inode->i_sb->s_dev),
- MINOR(inode->i_sb->s_dev), inode->i_ino);
+ MAJOR(dev), MINOR(dev), inode->i_ino);
} else {
seq_printf(f, "%d <none>:0 ", fl_pid);
}
--
2.15.1


2018-07-31 22:53:20

by Mark Fasheh

[permalink] [raw]
Subject: [PATCH 3/4] proc: use vfs helper to get ino/dev pairs for maps file

Proc writes ino/dev into /proc/PID/maps which it gets from i_ino and s_dev.
The problem with this is that i_ino and s_dev are not guaranteed to be
unique - we can have duplicates in the maps file for otherwise distinct
inodes.

This breaks any software expecting them to be unique, including lsof(8). We
can fix this by using vfs_map_unique_ino_dev() to map the unique ino/dev
pair for us.

Signed-off-by: Mark Fasheh <[email protected]>
---
fs/proc/nommu.c | 11 ++++-------
fs/proc/task_mmu.c | 8 +++-----
fs/proc/task_nommu.c | 8 +++-----
3 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/fs/proc/nommu.c b/fs/proc/nommu.c
index 3b63be64e436..56c12d02c5ad 100644
--- a/fs/proc/nommu.c
+++ b/fs/proc/nommu.c
@@ -36,7 +36,7 @@
*/
static int nommu_region_show(struct seq_file *m, struct vm_region *region)
{
- unsigned long ino = 0;
+ u64 ino = 0;
struct file *file;
dev_t dev = 0;
int flags;
@@ -44,15 +44,12 @@ static int nommu_region_show(struct seq_file *m, struct vm_region *region)
flags = region->vm_flags;
file = region->vm_file;

- if (file) {
- struct inode *inode = file_inode(region->vm_file);
- dev = inode->i_sb->s_dev;
- ino = inode->i_ino;
- }
+ if (file)
+ vfs_map_unique_ino_dev(file_dentry(file), &ino, &dev); {

seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
seq_printf(m,
- "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
+ "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %llu ",
region->vm_start,
region->vm_end,
flags & VM_READ ? 'r' : '-',
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dfd73a4616ce..e9cd33425191 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -276,7 +276,7 @@ static int is_stack(struct vm_area_struct *vma)
static void show_vma_header_prefix(struct seq_file *m,
unsigned long start, unsigned long end,
vm_flags_t flags, unsigned long long pgoff,
- dev_t dev, unsigned long ino)
+ dev_t dev, u64 ino)
{
seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
seq_put_hex_ll(m, NULL, start, 8);
@@ -299,16 +299,14 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
struct mm_struct *mm = vma->vm_mm;
struct file *file = vma->vm_file;
vm_flags_t flags = vma->vm_flags;
- unsigned long ino = 0;
+ u64 ino = 0;
unsigned long long pgoff = 0;
unsigned long start, end;
dev_t dev = 0;
const char *name = NULL;

if (file) {
- struct inode *inode = file_inode(vma->vm_file);
- dev = inode->i_sb->s_dev;
- ino = inode->i_ino;
+ vfs_map_unique_ino_dev(file_dentry(file), &ino, &dev);
pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
}

diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 5b62f57bd9bc..1198e979ed3f 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -146,7 +146,7 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma,
int is_pid)
{
struct mm_struct *mm = vma->vm_mm;
- unsigned long ino = 0;
+ u64 ino = 0;
struct file *file;
dev_t dev = 0;
int flags;
@@ -156,15 +156,13 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma,
file = vma->vm_file;

if (file) {
- struct inode *inode = file_inode(vma->vm_file);
- dev = inode->i_sb->s_dev;
- ino = inode->i_ino;
+ vfs_map_unique_inode_dev(file_dentry(file), &ino, &dev));
pgoff = (loff_t)vma->vm_pgoff << PAGE_SHIFT;
}

seq_setwidth(m, 25 + sizeof(void *) * 6 - 1);
seq_printf(m,
- "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
+ "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %llu ",
vma->vm_start,
vma->vm_end,
flags & VM_READ ? 'r' : '-',
--
2.15.1


2018-07-31 23:58:37

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/4] nfs: check for NULL vfsmount in nfs_getattr

On Tue, Jul 31, 2018 at 02:10:43PM -0700, Mark Fasheh wrote:
> ->getattr from inside the kernel won't always have a vfsmount, check for
> this.

Really? Where would that happen?

2018-08-01 00:34:32

by Mark Fasheh

[permalink] [raw]
Subject: Re: [PATCH 2/4] nfs: check for NULL vfsmount in nfs_getattr

Hi Al,

On Tue, Jul 31, 2018 at 11:16:05PM +0100, Al Viro wrote:
> On Tue, Jul 31, 2018 at 02:10:43PM -0700, Mark Fasheh wrote:
> > ->getattr from inside the kernel won't always have a vfsmount, check for
> > this.
>
> Really? Where would that happen?

It happens in my first patch. FWIW, I'm not tied to that particular bit of
code, or even this (latest) approach. I would actually very much appreciate
your input into how we might fix the problem we have where we are sometimes
not exporting a correct ino/dev pair to user space.

I have a good break-down of the problem and possible solutions here:

https://www.spinics.net/lists/linux-fsdevel/msg128003.html

Thanks,
--Mark

2018-08-01 01:04:08

by Mark Fasheh

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: introduce function to map unique ino/dev pairs


On Tue, Jul 31, 2018 at 02:10:42PM -0700, Mark Fasheh wrote:
> diff --git a/fs/stat.c b/fs/stat.c
> index f8e6fb2c3657..80ea42505219 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -84,6 +84,29 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
> }
> EXPORT_SYMBOL(vfs_getattr_nosec);
>
> +void vfs_map_unique_ino_dev(struct dentry *dentry, u64 *ino, dev_t *dev)
> +{
> + int ret;
> + struct path path;
> + struct kstat stat;
> +
> + path.mnt = NULL;
> + path.dentry = dentry;
> + /* ->dev is always returned, so we only need to specify ino here */
> + ret = vfs_getattr_nosec(&path, &stat, STATX_INO, 0);
> + if (ret) {
> + struct inode *inode = d_inode(dentry);
> + /* Fallback to old behavior in case of getattr error */
> + *ino = inode->i_ino;
> + *dev = inode->i_sb->s_dev;
> + return ret;

Ooof, attached is a version of this patch which doesn't try to return a value
from a void function. Apologies for the extra e-mail.

2018-08-01 07:21:22

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace

On Wed, Aug 1, 2018 at 12:10 AM, Mark Fasheh <[email protected]> wrote:
> /proc/locks does not always print the correct inode number/device pair.
> Update lock_get_status() to use vfs_map_unique_ino_dev() to get the real,
> unique values for userspace.
>
> Signed-off-by: Mark Fasheh <[email protected]>
> ---
> fs/locks.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index db7b6917d9c5..3a012df87fd8 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2621,6 +2621,7 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
> loff_t id, char *pfx)
> {
> struct inode *inode = NULL;
> + struct dentry *dentry;
> unsigned int fl_pid;
> struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>
> @@ -2633,8 +2634,10 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
> if (fl_pid == 0)
> return;
>
> - if (fl->fl_file != NULL)
> + if (fl->fl_file != NULL) {
> inode = locks_inode(fl->fl_file);
> + dentry = file_dentry(fl->fl_file);
> + }
>
> seq_printf(f, "%lld:%s ", id, pfx);
> if (IS_POSIX(fl)) {
> @@ -2681,10 +2684,13 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
> : (fl->fl_type == F_WRLCK) ? "WRITE" : "READ ");
> }
> if (inode) {
> + __u64 ino;
> + dev_t dev;
> +
> + vfs_map_unique_ino_dev(dentry, &ino, &dev);
> /* userspace relies on this representation of dev_t */
> seq_printf(f, "%d %02x:%02x:%ld ", fl_pid,
> - MAJOR(inode->i_sb->s_dev),
> - MINOR(inode->i_sb->s_dev), inode->i_ino);
> + MAJOR(dev), MINOR(dev), inode->i_ino);

Don't you mean ,ino); ?

Thanks,
Amir.

2018-08-01 07:25:05

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: introduce function to map unique ino/dev pairs

On Wed, Aug 1, 2018 at 2:21 AM, Mark Fasheh <[email protected]> wrote:
>
> On Tue, Jul 31, 2018 at 02:10:42PM -0700, Mark Fasheh wrote:
>> diff --git a/fs/stat.c b/fs/stat.c
>> index f8e6fb2c3657..80ea42505219 100644
>> --- a/fs/stat.c
>> +++ b/fs/stat.c
>> @@ -84,6 +84,29 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>> }
>> EXPORT_SYMBOL(vfs_getattr_nosec);
>>
>> +void vfs_map_unique_ino_dev(struct dentry *dentry, u64 *ino, dev_t *dev)
>> +{
>> + int ret;
>> + struct path path;
>> + struct kstat stat;
>> +
>> + path.mnt = NULL;
>> + path.dentry = dentry;
>> + /* ->dev is always returned, so we only need to specify ino here */
>> + ret = vfs_getattr_nosec(&path, &stat, STATX_INO, 0);
>> + if (ret) {
>> + struct inode *inode = d_inode(dentry);
>> + /* Fallback to old behavior in case of getattr error */
>> + *ino = inode->i_ino;
>> + *dev = inode->i_sb->s_dev;
>> + return ret;
>
> Ooof, attached is a version of this patch which doesn't try to return a value
> from a void function. Apologies for the extra e-mail.
>
> From Mark Fasheh <[email protected]>
>
> [PATCH 1/4] vfs: introduce function to map unique ino/dev pairs
>
> There are places in the VFS where we export an ino/dev pair to userspace.
> /proc/PID/maps is a good example - we directly expose inode->i_ino and
> inode->i_sb->s_dev to userspace there. Many filesystems don't put a unique
> value in inode->i_ino and instead rely on ->getattr to provide the real
> inode number to userspace. super->s_dev is similar - some filesystems
> expose a difference device from what's put in super->s_dev when queried via
> stat/statx.
>
> Ultimately this makes it impossible for a user (or software) to match one of
> those reported pairs to the right inode.
>
> We can fix this by adding a helper function, vfs_map_unique_ino_dev(), which
> will query the owning filesystem (via ->getattr) to get the correct ino/dev
> pair. Later patches will update those places which simply dump inode->i_ino
> and super->s_dev to use the helper.
>
> Signed-off-by: Mark Fasheh <[email protected]>
> ---
> fs/stat.c | 22 ++++++++++++++++++++++
> include/linux/fs.h | 2 ++
> 2 files changed, 24 insertions(+)
>
> diff --git a/fs/stat.c b/fs/stat.c
> index f8e6fb2c3657..20c72d618ed5 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -84,6 +84,28 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
> }
> EXPORT_SYMBOL(vfs_getattr_nosec);
>
> +void vfs_map_unique_ino_dev(struct dentry *dentry, u64 *ino, dev_t *dev)

I find this function name a bit more than function can guaranty.
It's just a fancy wrapper around ->getattr()
How about vfs_get_ino_dev() ?

Thanks,
Amir.

2018-08-01 12:49:11

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace

On Tue, 2018-07-31 at 14:10 -0700, Mark Fasheh wrote:
> /proc/locks does not always print the correct inode number/device pair.
> Update lock_get_status() to use vfs_map_unique_ino_dev() to get the real,
> unique values for userspace.
>
> Signed-off-by: Mark Fasheh <[email protected]>
> ---
> fs/locks.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index db7b6917d9c5..3a012df87fd8 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2621,6 +2621,7 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
> loff_t id, char *pfx)
> {
> struct inode *inode = NULL;
> + struct dentry *dentry;
> unsigned int fl_pid;
> struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>
> @@ -2633,8 +2634,10 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
> if (fl_pid == 0)
> return;
>
> - if (fl->fl_file != NULL)
> + if (fl->fl_file != NULL) {
> inode = locks_inode(fl->fl_file);
> + dentry = file_dentry(fl->fl_file);
> + }
>
> seq_printf(f, "%lld:%s ", id, pfx);
> if (IS_POSIX(fl)) {
> @@ -2681,10 +2684,13 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
> : (fl->fl_type == F_WRLCK) ? "WRITE" : "READ ");
> }
> if (inode) {
> + __u64 ino;
> + dev_t dev;
> +
> + vfs_map_unique_ino_dev(dentry, &ino, &dev);

This code is under a spinlock (blocked_locks_lock or ctx->flc_lock). I
don't think it'll be ok to call ->getattr while holding a spinlock.

> /* userspace relies on this representation of dev_t */
> seq_printf(f, "%d %02x:%02x:%ld ", fl_pid,
> - MAJOR(inode->i_sb->s_dev),
> - MINOR(inode->i_sb->s_dev), inode->i_ino);
> + MAJOR(dev), MINOR(dev), inode->i_ino);
> } else {
> seq_printf(f, "%d <none>:0 ", fl_pid);
> }

--
Jeff Layton <[email protected]>


2018-08-01 17:45:56

by Mark Fasheh

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: introduce function to map unique ino/dev pairs

Hi Amir,

On Wed, Aug 01, 2018 at 08:41:14AM +0300, Amir Goldstein wrote:
> > +void vfs_map_unique_ino_dev(struct dentry *dentry, u64 *ino, dev_t *dev)
>
> I find this function name a bit more than function can guaranty.
> It's just a fancy wrapper around ->getattr()
> How about vfs_get_ino_dev() ?

Yeah I agree with that. An early version actually had the name
vfs_get_ino_dev(), I don't mind going back to it.

Thanks for the review!
--Mark

2018-08-01 19:32:47

by Mark Fasheh

[permalink] [raw]
Subject: Re: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace

On Wed, Aug 01, 2018 at 08:37:31AM +0300, Amir Goldstein wrote:
> > if (inode) {
> > + __u64 ino;
> > + dev_t dev;
> > +
> > + vfs_map_unique_ino_dev(dentry, &ino, &dev);
> > /* userspace relies on this representation of dev_t */
> > seq_printf(f, "%d %02x:%02x:%ld ", fl_pid,
> > - MAJOR(inode->i_sb->s_dev),
> > - MINOR(inode->i_sb->s_dev), inode->i_ino);
> > + MAJOR(dev), MINOR(dev), inode->i_ino);
>
> Don't you mean ,ino); ?

Indeed I do, thanks for catching that one Amir.
--Mark

2018-08-01 22:26:11

by Mark Fasheh

[permalink] [raw]
Subject: Re: [PATCH 4/4] locks: map correct ino/dev pairs when exporting to userspace

Hi Jeff,

On Wed, Aug 01, 2018 at 07:03:54AM -0400, Jeff Layton wrote:
> On Tue, 2018-07-31 at 14:10 -0700, Mark Fasheh wrote:
> > /proc/locks does not always print the correct inode number/device pair.
> > Update lock_get_status() to use vfs_map_unique_ino_dev() to get the real,
> > unique values for userspace.
> >
> > Signed-off-by: Mark Fasheh <[email protected]>
> > ---
> > fs/locks.c | 12 +++++++++---
> > 1 file changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/locks.c b/fs/locks.c
> > index db7b6917d9c5..3a012df87fd8 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -2621,6 +2621,7 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
> > loff_t id, char *pfx)
> > {
> > struct inode *inode = NULL;
> > + struct dentry *dentry;
> > unsigned int fl_pid;
> > struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
> >
> > @@ -2633,8 +2634,10 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
> > if (fl_pid == 0)
> > return;
> >
> > - if (fl->fl_file != NULL)
> > + if (fl->fl_file != NULL) {
> > inode = locks_inode(fl->fl_file);
> > + dentry = file_dentry(fl->fl_file);
> > + }
> >
> > seq_printf(f, "%lld:%s ", id, pfx);
> > if (IS_POSIX(fl)) {
> > @@ -2681,10 +2684,13 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
> > : (fl->fl_type == F_WRLCK) ? "WRITE" : "READ ");
> > }
> > if (inode) {
> > + __u64 ino;
> > + dev_t dev;
> > +
> > + vfs_map_unique_ino_dev(dentry, &ino, &dev);
>
> This code is under a spinlock (blocked_locks_lock or ctx->flc_lock). I
> don't think it'll be ok to call ->getattr while holding a spinlock.

Hmm, indeed it does call under spinlock. I'll hve to rethink how to approach
this in fs/locks.c

Thanks,
--Mark

2018-08-02 02:22:01

by J. R. Okajima

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] vfs: map unique ino/dev pairs for user space

Mark Fasheh:
> The following patches fix these inconsistencies by introducing a VFS helper
> function which calls the underlying filesystem ->getattr to get our real
> inode number / device pair. The returned values can then be used at those
> places in the kernel where we are incorrectly reporting our ino/dev pair.
> We then update fs/proc/ and fs/locks.c to use this helper when writing to
> /proc/PID/maps and /proc/locks respectively.

I definitly agree that ino/dev pair should be a unique identity on the
system. But I don't know why you are tryng to solve the problem in
generic VFS layer instead of the problematic FS. Isn't it an
unnecessary overhead for many FS?
How about creating a new f_op member ->get_ino_dev(), ->show_identity()
or something, and implement the new f_op in the problematic FS only?
I hope it will be a lighter way to get the pair than generic getattr
way.


J. R. Okajima

2018-08-02 02:32:20

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/4] nfs: check for NULL vfsmount in nfs_getattr

On Tue, Jul 31, 2018 at 10:51:57PM +0000, Mark Fasheh wrote:
> Hi Al,
>
> On Tue, Jul 31, 2018 at 11:16:05PM +0100, Al Viro wrote:
> > On Tue, Jul 31, 2018 at 02:10:43PM -0700, Mark Fasheh wrote:
> > > ->getattr from inside the kernel won't always have a vfsmount, check for
> > > this.
> >
> > Really? Where would that happen?
>
> It happens in my first patch. FWIW, I'm not tied to that particular bit of
> code, or even this (latest) approach. I would actually very much appreciate
> your input into how we might fix the problem we have where we are sometimes
> not exporting a correct ino/dev pair to user space.

Which callers would those be? I don't see anything in your series that
wouldn't have vfsmount at hand...

> I have a good break-down of the problem and possible solutions here:
>
> https://www.spinics.net/lists/linux-fsdevel/msg128003.html

btrfs pulling odd crap with device numbers is a problem, but this is far from
the most obvious unpleasantness caused by that - e.g. userland code expecting
that unequal st_dev for directory and subdirectory means that we have
a mountpoint there. Or assuming it can compare that to st_dev of mountpoints
(or, indeed, the values in /proc/self/mountinfo) to find which fs it's on...

/proc/*/maps is unfortunate, but it does contain the pathnames, doesn't it?
What _real_ problems are there - the ones where we don't have a saner solution?
Note that /proc/locks is (a) FPOS interface and (b) unsuitable for your
approach anyway.

2018-08-03 03:58:44

by Mark Fasheh

[permalink] [raw]
Subject: Re: [PATCH 2/4] nfs: check for NULL vfsmount in nfs_getattr

Hi Al,

On Thu, Aug 02, 2018 at 01:43:48AM +0100, Al Viro wrote:
> On Tue, Jul 31, 2018 at 10:51:57PM +0000, Mark Fasheh wrote:
> > Hi Al,
> >
> > On Tue, Jul 31, 2018 at 11:16:05PM +0100, Al Viro wrote:
> > > On Tue, Jul 31, 2018 at 02:10:43PM -0700, Mark Fasheh wrote:
> > > > ->getattr from inside the kernel won't always have a vfsmount, check for
> > > > this.
> > >
> > > Really? Where would that happen?
> >
> > It happens in my first patch. FWIW, I'm not tied to that particular bit of
> > code, or even this (latest) approach. I would actually very much appreciate
> > your input into how we might fix the problem we have where we are sometimes
> > not exporting a correct ino/dev pair to user space.
>
> Which callers would those be? I don't see anything in your series that
> wouldn't have vfsmount at hand...

You are correct - there's no callers in my patch series that don't have a
vfsmount available. I can remove patch #2 and pass the vfsmount through.

At some point I was trying to plug this callback into audit_copy_inode()
which AFAICT doesn't have a vfsmount to pass in when we're coming from
vfs_rename (vfs_rename() -> fsnotify_move()). I will eventually have to go
back and handle this for audit though so it seems worth mentioning here.


> > I have a good break-down of the problem and possible solutions here:
> >
> > https://www.spinics.net/lists/linux-fsdevel/msg128003.html
>
> btrfs pulling odd crap with device numbers is a problem, but this is far from
> the most obvious unpleasantness caused by that - e.g. userland code expecting
> that unequal st_dev for directory and subdirectory means that we have
> a mountpoint there. Or assuming it can compare that to st_dev of mountpoints
> (or, indeed, the values in /proc/self/mountinfo) to find which fs it's on...

Indeed, I agree that all of these are pretty gross and things I'd like to
see fixed - I have to deal with subvolume boundaries in some of my own
software. I know Jeff is working on some code to give subvolumes their own
vfsmount but that wouldn't help with our device reporting unless we did
something like move s_dev from the super_block to the vfsmount (also an idea
that's been thrown around).


> /proc/*/maps is unfortunate, but it does contain the pathnames, doesn't it?
> What _real_ problems are there - the ones where we don't have a saner solution?
> Note that /proc/locks is (a) FPOS interface and (b) unsuitable for your
> approach anyway.

Yeah I see now that /proc/locks isn't going to work with something calling
into ->getattr().

Primarily my concerns are /proc/*/maps and audit (which is recording ino/dev
pairs). Even though /proc/*/maps prints a path it is still a problem as
there is no way for userspace to verify that the inode it stats using that
path is the correct one.

I don't mean to overstate the case but the /proc/*/maps issue is real to us
(SUSE) in that it's produced bug reports. For example, lsof produces
confusing output when we run with a btrfs subvolume as root.

As far as problems that can't be solved by this approach - there are
tracepoints in events/lock.h and events/writeback.h which ought to be
reporting the correct ino/dev pairs. There's of course /proc/locks too. I
acknowledge that whether those are severe enough to warrant a different
approach might be up for debate.

Thanks,
--Mark