2021-03-15 12:10:37

by David Howells

Subject: [RFC][PATCH 0/3] vfs: Use an xarray instead of inserted bookmarks to scan mount list


Hi Al, Miklós,

Can we consider replacing the "insert cursor" approach we're currently
using for proc files to scan the current namespace's mount list[1] with
something that uses an xarray of mounts indexed by mnt_id?

This has some advantages:

(1) It's simpler. We don't need to insert dummy mount objects as
bookmarks into the mount list and code that's walking the list doesn't
have to carefully step over them.

(2) We can use the file position to represent the mnt_id and can jump to
it directly - ie. using seek() to jump to a mount object by its ID.

(3) It might make it easier to use RCU in future to dump mount entries
rather than having to take namespace_sem. xarray provides for the
possibility of tagging entries to say that they're viewable to avoid
dumping incomplete mount objects.
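
To illustrate (2), usage from userspace would be something like the following
sketch (illustrative only; it assumes the proposed "file position == mnt_id"
semantics, and the path and ID used are just examples):

/* Sketch only: seek to a mnt_id and read the records from there on.
 * Assumes the proposed position == mnt_id semantics.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char buf[65536];
        ssize_t n;
        int fd = open("/proc/self/mountinfo", O_RDONLY);

        if (fd < 0)
                return 1;
        if (lseek(fd, 42, SEEK_SET) < 0)  /* start at mnt_id 42, or the next higher one */
                return 1;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
                fwrite(buf, 1, n, stdout);
        close(fd);
        return 0;
}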

But there are a number of disadvantages:

(1) We have to allocate memory to maintain the xarray, which becomes more
of a problem as mnt_id values get scattered.

(2) We need to preallocate the xarray memory with xa_reserve() because we're
manipulating the mapping under the mount_lock spinlock, where we can't sleep
to allocate.

(3) The effect of (2) could be magnified if someone mounts an entire
subtree and propagation has to copy all of it, since every propagated mount
then needs a slot reserving.

David

Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f6c61f96f2d97cbb5f7fa85607bc398f843ff0f [1]

---
David Howells (3):
vfs: Use an xarray in the mount namespace to handle /proc/mounts list
vfs: Use the mounts_by_id xarray to do /proc/mounts and co.
vfs: Remove mount list trawling cursor stuff


fs/mount.h | 2 +-
fs/namespace.c | 66 ++++++++++---------------------------------
fs/proc_namespace.c | 3 --
include/linux/mount.h | 4 +--
4 files changed, 17 insertions(+), 58 deletions(-)



2021-03-15 12:11:03

by David Howells

Subject: [PATCH 3/3] vfs: Remove mount list trawling cursor stuff

Remove the stuff for trawling a mount namespace's mount list using inserted
cursors as bookmarks, now that this has been replaced with an xarray-based
approach.

Signed-off-by: David Howells <[email protected]>
cc: Alexander Viro <[email protected]>
cc: Miklos Szeredi <[email protected]>
cc: Matthew Wilcox <[email protected]>
---

fs/namespace.c | 30 ------------------------------
include/linux/mount.h | 4 +---
2 files changed, 1 insertion(+), 33 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index d19fde0654f7..105a6d882cb4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -673,11 +673,6 @@ static inline void unlock_ns_list(struct mnt_namespace *ns)
spin_unlock(&ns->ns_lock);
}

-static inline bool mnt_is_cursor(struct mount *mnt)
-{
- return mnt->mnt.mnt_flags & MNT_CURSOR;
-}
-
/*
* __is_local_mountpoint - Test to see if dentry is a mountpoint in the
* current mount namespace.
@@ -702,8 +697,6 @@ bool __is_local_mountpoint(struct dentry *dentry)
down_read(&namespace_sem);
lock_ns_list(ns);
list_for_each_entry(mnt, &ns->list, mnt_list) {
- if (mnt_is_cursor(mnt))
- continue;
is_covered = (mnt->mnt_mountpoint == dentry);
if (is_covered)
break;
@@ -1334,26 +1327,6 @@ struct vfsmount *mnt_clone_internal(const struct path *path)
}

#ifdef CONFIG_PROC_FS
-#if 0
-static struct mount *mnt_list_next(struct mnt_namespace *ns,
- struct list_head *p)
-{
- struct mount *mnt, *ret = NULL;
-
- lock_ns_list(ns);
- list_for_each_continue(p, &ns->list) {
- mnt = list_entry(p, typeof(*mnt), mnt_list);
- if (!mnt_is_cursor(mnt)) {
- ret = mnt;
- break;
- }
- }
- unlock_ns_list(ns);
-
- return ret;
-}
-#endif
-
/* iterator; we want it to have access to namespace_sem, thus here... */
static void *m_start(struct seq_file *m, loff_t *pos)
{
@@ -4390,9 +4363,6 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
struct mount *child;
int mnt_flags;

- if (mnt_is_cursor(mnt))
- continue;
-
if (mnt->mnt.mnt_sb->s_type != sb->s_type)
continue;

diff --git a/include/linux/mount.h b/include/linux/mount.h
index 5d92a7e1a742..88027d38833c 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -51,8 +51,7 @@ struct fs_context;
#define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )

#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
- MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED | \
- MNT_CURSOR)
+ MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED)

#define MNT_INTERNAL 0x4000

@@ -66,7 +65,6 @@ struct fs_context;
#define MNT_SYNC_UMOUNT 0x2000000
#define MNT_MARKED 0x4000000
#define MNT_UMOUNT 0x8000000
-#define MNT_CURSOR 0x10000000

struct vfsmount {
struct dentry *mnt_root; /* root of the mounted tree */


2021-03-15 12:11:39

by David Howells

Subject: [PATCH 1/3] vfs: Use an xarray in the mount namespace to handle /proc/mounts list

Add an xarray to the mount namespace and use it to provide a mnt_id-to-mount
object mapping for the namespace. Make use of xa_reserve() to preallocate the
memory before taking the mount_lock.
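
The xarray contract being relied on is roughly the following (illustrative
sketch only, not patch code):

#include <linux/spinlock.h>
#include <linux/xarray.h>

/* Illustrative sketch: reserve the slot (allocating any nodes needed) with
 * GFP_KERNEL while sleeping is still allowed, then fill the reserved slot
 * under the spinlock with GFP_ATOMIC, which finds the preallocated node
 * rather than having to allocate.
 */
static int example_insert(struct xarray *xa, spinlock_t *lock,
                          unsigned long id, void *object)
{
        int err = xa_reserve(xa, id, GFP_KERNEL);

        if (err < 0)
                return err;

        spin_lock(lock);
        err = xa_err(xa_store(xa, id, object, GFP_ATOMIC));
        spin_unlock(lock);
        return err;
}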

This will allow the set of mount objects in a namespace to be iterated
using xarray iteration and without the need to insert and remove fake
mounts as bookmarks - which cause issues for other trawlers of the list.

As a bonus, if we want to allow it, lseek() can be used to start at a
particular mount - though there's no easy way to limit the output to a single
entry, or to enforce a failure if that mount doesn't exist but a later one
does.

Signed-off-by: David Howells <[email protected]>
cc: Alexander Viro <[email protected]>
cc: Miklos Szeredi <[email protected]>
cc: Matthew Wilcox <[email protected]>
---

fs/mount.h | 2 +
fs/namespace.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 74 insertions(+), 9 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 0b6e08cf8afb..455f4d293a65 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -4,6 +4,7 @@
#include <linux/poll.h>
#include <linux/ns_common.h>
#include <linux/fs_pin.h>
+#include <linux/xarray.h>

struct mnt_namespace {
struct ns_common ns;
@@ -14,6 +15,7 @@ struct mnt_namespace {
* - taking namespace_sem for read AND taking .ns_lock.
*/
struct list_head list;
+ struct xarray mounts_by_id; /* List of mounts by mnt_id */
spinlock_t ns_lock;
struct user_namespace *user_ns;
struct ucounts *ucounts;
diff --git a/fs/namespace.c b/fs/namespace.c
index 56bb5a5fdc0d..5c9bcaeac4de 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -901,6 +901,57 @@ void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp, struct m
mnt_add_count(old_parent, -1);
}

+/*
+ * Reserve slots in the mnt_id-to-mount mapping in a namespace. This gets the
+ * memory allocation done upfront.
+ */
+static int reserve_mnt_id_one(struct mount *mnt, struct mnt_namespace *ns)
+{
+ struct mount *m;
+ int ret;
+
+ ret = xa_reserve(&ns->mounts_by_id, mnt->mnt_id, GFP_KERNEL);
+ if (ret < 0)
+ return ret;
+
+ list_for_each_entry(m, &mnt->mnt_list, mnt_list) {
+ ret = xa_reserve(&ns->mounts_by_id, m->mnt_id, GFP_KERNEL);
+ if (ret < 0)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int reserve_mnt_id_list(struct hlist_head *tree_list)
+{
+ struct mount *child;
+ int ret;
+
+ hlist_for_each_entry(child, tree_list, mnt_hash) {
+ ret = reserve_mnt_id_one(child, child->mnt_parent->mnt_ns);
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
+static void add_mnt_to_ns(struct mount *m, struct mnt_namespace *ns)
+{
+ void *x;
+
+ m->mnt_ns = ns;
+ x = xa_store(&ns->mounts_by_id, m->mnt_id, m, GFP_ATOMIC);
+ WARN(xa_err(x), "Couldn't store mnt_id %x\n", m->mnt_id);
+}
+
+static void remove_mnt_from_ns(struct mount *mnt)
+{
+ if (mnt->mnt_ns && mnt->mnt_ns != MNT_NS_INTERNAL)
+ xa_erase(&mnt->mnt_ns->mounts_by_id, mnt->mnt_id);
+ mnt->mnt_ns = NULL;
+}
+
/*
* vfsmount lock must be held for write
*/
@@ -914,8 +965,9 @@ static void commit_tree(struct mount *mnt)
BUG_ON(parent == mnt);

list_add_tail(&head, &mnt->mnt_list);
- list_for_each_entry(m, &head, mnt_list)
- m->mnt_ns = n;
+ list_for_each_entry(m, &head, mnt_list) {
+ add_mnt_to_ns(m, n);
+ }

list_splice(&head, n->list.prev);

@@ -1529,7 +1581,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
ns->mounts--;
__touch_mnt_namespace(ns);
}
- p->mnt_ns = NULL;
+ remove_mnt_from_ns(p);
if (how & UMOUNT_SYNC)
p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;

@@ -2144,6 +2196,13 @@ static int attach_recursive_mnt(struct mount *source_mnt,
err = count_mounts(ns, source_mnt);
if (err)
goto out;
+
+ /* Reserve id-to-mount mapping slots in the namespace we're
+ * going to use.
+ */
+ err = reserve_mnt_id_one(source_mnt, dest_mnt->mnt_ns);
+ if (err)
+ goto out;
}

if (IS_MNT_SHARED(dest_mnt)) {
@@ -2151,6 +2210,8 @@ static int attach_recursive_mnt(struct mount *source_mnt,
if (err)
goto out;
err = propagate_mnt(dest_mnt, dest_mp, source_mnt, &tree_list);
+ if (!err && !moving)
+ err = reserve_mnt_id_list(&tree_list);
lock_mount_hash();
if (err)
goto out_cleanup_ids;
@@ -3260,6 +3321,7 @@ static void dec_mnt_namespaces(struct ucounts *ucounts)

static void free_mnt_ns(struct mnt_namespace *ns)
{
+ WARN_ON(!xa_empty(&ns->mounts_by_id));
if (!is_anon_ns(ns))
ns_free_inum(&ns->ns);
dec_mnt_namespaces(ns->ucounts);
@@ -3306,6 +3368,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
INIT_LIST_HEAD(&new_ns->list);
init_waitqueue_head(&new_ns->poll);
spin_lock_init(&new_ns->ns_lock);
+ xa_init(&new_ns->mounts_by_id);
new_ns->user_ns = get_user_ns(user_ns);
new_ns->ucounts = ucounts;
return new_ns;
@@ -3362,7 +3425,7 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
p = old;
q = new;
while (p) {
- q->mnt_ns = new_ns;
+ add_mnt_to_ns(q, new_ns);
new_ns->mounts++;
if (new_fs) {
if (&p->mnt == new_fs->root.mnt) {
@@ -3404,7 +3467,7 @@ struct dentry *mount_subtree(struct vfsmount *m, const char *name)
mntput(m);
return ERR_CAST(ns);
}
- mnt->mnt_ns = ns;
+ add_mnt_to_ns(mnt, ns);
ns->root = mnt;
ns->mounts++;
list_add(&mnt->mnt_list, &ns->list);
@@ -3583,7 +3646,7 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
goto err_path;
}
mnt = real_mount(newmount.mnt);
- mnt->mnt_ns = ns;
+ add_mnt_to_ns(mnt, ns);
ns->root = mnt;
ns->mounts = 1;
list_add(&mnt->mnt_list, &ns->list);
@@ -4193,7 +4256,7 @@ static void __init init_mount_tree(void)
if (IS_ERR(ns))
panic("Can't allocate initial namespace");
m = real_mount(mnt);
- m->mnt_ns = ns;
+ add_mnt_to_ns(m, ns);
ns->root = m;
ns->mounts = 1;
list_add(&m->mnt_list, &ns->list);
@@ -4270,7 +4333,7 @@ void kern_unmount(struct vfsmount *mnt)
{
/* release long term mount so mount point can be released */
if (!IS_ERR_OR_NULL(mnt)) {
- real_mount(mnt)->mnt_ns = NULL;
+ remove_mnt_from_ns(real_mount(mnt));
synchronize_rcu(); /* yecchhh... */
mntput(mnt);
}
@@ -4283,7 +4346,7 @@ void kern_unmount_array(struct vfsmount *mnt[], unsigned int num)

for (i = 0; i < num; i++)
if (mnt[i])
- real_mount(mnt[i])->mnt_ns = NULL;
+ remove_mnt_from_ns(real_mount(mnt[i]));
synchronize_rcu_expedited();
for (i = 0; i < num; i++)
mntput(mnt[i]);


2021-03-15 12:50:11

by Matthew Wilcox

Subject: Re: [RFC][PATCH 0/3] vfs: Use an xarray instead of inserted bookmarks to scan mount list

On Mon, Mar 15, 2021 at 12:07:39PM +0000, David Howells wrote:
>
> Hi Al, Miklós,
>
> Can we consider replacing the "insert cursor" approach we're currently
> using for proc files to scan the current namespace's mount list[1] with
> something that uses an xarray of mounts indexed by mnt_id?
>
> This has some advantages:
>
> (1) It's simpler. We don't need to insert dummy mount objects as
> bookmarks into the mount list and code that's walking the list doesn't
> have to carefully step over them.
>
> (2) We can use the file position to represent the mnt_id and can jump to
> it directly - ie. using seek() to jump to a mount object by its ID.
>
> (3) It might make it easier to use RCU in future to dump mount entries
> rather than having to take namespace_sem. xarray provides for the
> possibility of tagging entries to say that they're viewable to avoid
> dumping incomplete mount objects.

Usually one fully constructs the object, then inserts it into the XArray.

> But there are a number of disadvantages:
>
> (1) We have to allocate memory to maintain the xarray, which becomes more
> of a problem as mnt_id values get scattered.

mnt_id values don't seem to get particularly scattered. They're allocated
using an IDA, so they stay small (unlike someone using idr_alloc_cyclic
;-).
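
For reference, the allocation is essentially just this (simplified from
fs/namespace.c):

/* Roughly how mnt_id is handed out today: a plain IDA returns the lowest
 * free ID, so values get reused and stay densely packed near zero.
 */
static DEFINE_IDA(mnt_id_ida);

static int mnt_alloc_id(struct mount *mnt)
{
        int res = ida_alloc(&mnt_id_ida, GFP_KERNEL);

        if (res < 0)
                return res;
        mnt->mnt_id = res;
        return 0;
}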

2021-03-15 13:18:19

by Miklos Szeredi

Subject: Re: [RFC][PATCH 0/3] vfs: Use an xarray instead of inserted bookmarks to scan mount list

On Mon, Mar 15, 2021 at 1:07 PM David Howells <[email protected]> wrote:
>
>
> Hi Al, Miklós,
>
> Can we consider replacing the "insert cursor" approach we're currently
> using for proc files to scan the current namespace's mount list[1] with
> something that uses an xarray of mounts indexed by mnt_id?
>
> This has some advantages:
>
> (1) It's simpler. We don't need to insert dummy mount objects as
> bookmarks into the mount list and code that's walking the list doesn't
> have to carefully step over them.
>
> (2) We can use the file position to represent the mnt_id and can jump to
> it directly - ie. using seek() to jump to a mount object by its ID.

What happens if the mount at the current position is removed?

Thanks,
Miklos

2021-03-15 13:19:29

by Matthew Wilcox

Subject: Re: [RFC][PATCH 0/3] vfs: Use an xarray instead of inserted bookmarks to scan mount list

On Mon, Mar 15, 2021 at 02:14:35PM +0100, Miklos Szeredi wrote:
> On Mon, Mar 15, 2021 at 1:07 PM David Howells <[email protected]> wrote:
> >
> >
> > Hi Al, Miklós,
> >
> > Can we consider replacing the "insert cursor" approach we're currently
> > using for proc files to scan the current namespace's mount list[1] with
> > something that uses an xarray of mounts indexed by mnt_id?
> >
> > This has some advantages:
> >
> > (1) It's simpler. We don't need to insert dummy mount objects as
> > bookmarks into the mount list and code that's walking the list doesn't
> > have to carefully step over them.
> >
> > (2) We can use the file position to represent the mnt_id and can jump to
> > it directly - ie. using seek() to jump to a mount object by its ID.
>
> What happens if the mount at the current position is removed?

xa_find() will move to the next one.

2021-03-15 13:45:34

by David Howells

Subject: Re: [RFC][PATCH 0/3] vfs: Use an xarray instead of inserted bookmarks to scan mount list

Miklos Szeredi <[email protected]> wrote:

> > (2) We can use the file position to represent the mnt_id and can jump to
> > it directly - ie. using seek() to jump to a mount object by its ID.
>
> What happens if the mount at the current position is removed?

umount_tree() requires the namespace_sem to be write-locked, so that should be
fine as the patches currently read-lock it whilst doing /proc/*/mount*.

I'm assuming that kern_unmount() won't be a problem as it is there to deal
with mounts made by kern_mount(), which don't get added to the mount list
(mnt_ns is MNT_NS_INTERNAL). kern_unmount_array() seems to be the same
because overlayfs gives it mounts generated by clone_private_mount(). It
might be worth putting a WARN_ON() in kern_unmount() to require this.
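
Something like this, say (untested sketch, on top of remove_mnt_from_ns()
from patch 1):

/* Warn if kern_unmount() is handed a mount that actually belongs to a
 * namespace rather than one set up by kern_mount() (mnt_ns is
 * MNT_NS_INTERNAL) or one that's already detached.
 */
void kern_unmount(struct vfsmount *mnt)
{
	/* release long term mount so mount point can be released */
	if (!IS_ERR_OR_NULL(mnt)) {
		struct mount *m = real_mount(mnt);

		WARN_ON(m->mnt_ns && m->mnt_ns != MNT_NS_INTERNAL);
		remove_mnt_from_ns(m);
		synchronize_rcu();	/* yecchhh... */
		mntput(mnt);
	}
}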

When reading through proc, m_start() calls xas_find() which returns the entry
at the starting index or, if not present, the next higher entry.
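
Roughly like this (simplified, shown with the xa_find() convenience form
rather than the xas_* advanced API):

/* Simplified illustration: the file position is taken as a mnt_id and
 * xa_find() returns the mount at that index or the next present one
 * above it.
 */
static void *m_start(struct seq_file *m, loff_t *pos)
{
        struct proc_mounts *p = m->private;
        unsigned long index = *pos;
        struct mount *mnt;

        down_read(&namespace_sem);
        mnt = xa_find(&p->ns->mounts_by_id, &index, ULONG_MAX, XA_PRESENT);
        if (mnt)
                *pos = index;
        return mnt;
}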

David

2021-03-15 14:52:31

by Miklos Szeredi

Subject: Re: [RFC][PATCH 0/3] vfs: Use an xarray instead of inserted bookmarks to scan mount list

On Mon, Mar 15, 2021 at 2:41 PM David Howells <[email protected]> wrote:
>
> Miklos Szeredi <[email protected]> wrote:
>
> > > (2) We can use the file position to represent the mnt_id and can jump to
> > > it directly - ie. using seek() to jump to a mount object by its ID.
> >
> > What happens if the mount at the current position is removed?
>
> umount_tree() requires the namespace_sem to be writelocked, so that should be
> fine as the patches currently read-lock that whilst doing /proc/*/mount*
>
> I'm assuming that kern_unmount() won't be a problem as it is there to deal
> with mounts made by kern_mount() which don't get added to the mount list
> (mnt_ns is MNT_NS_INTERNAL). kern_unmount_array() seems to be the same
> because overlayfs gives it mounts generated by clone_private_mount(). It
> might be worth putting a WARN_ON() in kern_unmount() to require this.
>
> When reading through proc, m_start() calls xas_find() which returns the entry
> at the starting index or, if not present, the next higher entry.

This will break the property of new mounts always being added to the
end of the list. That's likely a regression for neural-based parsers
(i.e. people), probably less so for machine parsers.

Thanks,
Miklos

>
> David
>