2015-05-24 15:02:09

by Kinglong Mee

[permalink] [raw]
Subject: [PATCH 0/4 v2] NFSD: Pin to vfsmount for some nfsd exports cache

If there are some mount points(not exported for nfs) under pseudo root,
after client's operation of those entry under the root, anyone *can't*
unmount those mount points until export cache expired.

# cat /etc/exports
/nfs/xfs *(rw,insecure,no_subtree_check,no_root_squash)
/nfs/pnfs *(rw,insecure,no_subtree_check,no_root_squash)
# ll /nfs/
total 0
drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
drwxr-xr-x. 2 root root 6 Apr 20 22:01 xfs
# mount /dev/sde /nfs/test
# df
Filesystem 1K-blocks Used Available Use% Mounted on
......
/dev/sdd 1038336 32944 1005392 4% /nfs/pnfs
/dev/sdc 10475520 32928 10442592 1% /nfs/xfs
/dev/sde 999320 1284 929224 1% /nfs/test
# mount -t nfs 127.0.0.1:/nfs/ /mnt
# ll /mnt/*/
/mnt/pnfs/:
total 0
-rw-r--r--. 1 root root 0 Apr 21 22:23 attr
drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp

/mnt/xfs/:
total 0
# umount /nfs/test/
umount: /nfs/test/: target is busy
(In some cases useful info about processes that
use the device is found by lsof(8) or fuser(1).)

I don't think that's user expect, they want umount /nfs/test/.

It's caused by exports cache of nfsd holds the reference of
the path (here is /nfs/test/), so, it can't be umounted.

v1 --> v2,
1. Adds an option named "allow_umount" for exports allowing user
un-mounting the filesystem where nfsd exports base on.
2. New helpers path_get_pin/path_put_unpin for path pin.
3. Update exports according to the "allow_umount" option.

Kinglong Mee (5):
fs_pin: Fix uninitialized value in fs_pin
fs_pin: Export functions for specific filesystem
path: New helpers path_get_pin/path_put_unpin for path pin
sunrpc: New helper cache_force_expire for cache cleanup
nfsd: Allows user un-mounting filesystem where nfsd exports base on

fs/fs_pin.c | 3 +++
fs/namei.c | 26 ++++++++++++++++++++
fs/nfsd/export.c | 52 ++++++++++++++++++++++++++++++++--------
fs/nfsd/export.h | 11 ++++++++-
include/linux/fs_pin.h | 6 +++++
include/linux/path.h | 4 ++++
include/linux/sunrpc/cache.h | 11 +++++++++
include/uapi/linux/nfsd/export.h | 3 ++-
8 files changed, 104 insertions(+), 12 deletions(-)

--
2.4.1



2015-05-24 15:10:34

by Kinglong Mee

[permalink] [raw]
Subject: [PATCH 1/5 v2] fs_pin: Fix uninitialized value in fs_pin

Without initialized, done in fs_pin at stack space may
contains strange value.

v2, same as v1.

Signed-off-by: Kinglong Mee <[email protected]>
---
include/linux/fs_pin.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/include/linux/fs_pin.h b/include/linux/fs_pin.h
index 3886b3b..0dde7b7 100644
--- a/include/linux/fs_pin.h
+++ b/include/linux/fs_pin.h
@@ -1,3 +1,6 @@
+#ifndef _LINUX_FS_PIN_H
+#define _LINUX_FS_PIN_H
+
#include <linux/wait.h>

struct fs_pin {
@@ -16,9 +19,12 @@ static inline void init_fs_pin(struct fs_pin *p, void (*kill)(struct fs_pin *))
INIT_HLIST_NODE(&p->s_list);
INIT_HLIST_NODE(&p->m_list);
p->kill = kill;
+ p->done = 0;
}

void pin_remove(struct fs_pin *);
void pin_insert_group(struct fs_pin *, struct vfsmount *, struct hlist_head *);
void pin_insert(struct fs_pin *, struct vfsmount *);
void pin_kill(struct fs_pin *);
+
+#endif
--
2.4.1


2015-05-24 15:10:44

by Kinglong Mee

[permalink] [raw]
Subject: [PATCH 3/5 v2] path: New helpers path_get_pin/path_put_unpin for path pin

Two helpers for filesystem pining to vfsmnt, not mntget.

v2, new patch.

Signed-off-by: Kinglong Mee <[email protected]>
---
fs/namei.c | 26 ++++++++++++++++++++++++++
include/linux/path.h | 4 ++++
2 files changed, 30 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index fe30d3b..8f94e7c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -492,6 +492,32 @@ void path_put(const struct path *path)
}
EXPORT_SYMBOL(path_put);

+/**
+ * path_get_pin - get a reference to a path's dentry
+ * and pin to path's vfsmnt
+ * @path: path to get the reference to
+ * @p: the fs_pin pin to vfsmnt
+ */
+void path_get_pin(struct path *path, struct fs_pin *p)
+{
+ dget(path->dentry);
+ pin_insert_group(p, path->mnt, NULL);
+}
+EXPORT_SYMBOL(path_get_pin);
+
+/**
+ * path_put_unpin - put a reference to a path's dentry
+ * and remove pin to path's vfsmnt
+ * @path: path to put the reference to
+ * @p: the fs_pin removed from vfsmnt
+ */
+void path_put_unpin(struct path *path, struct fs_pin *p)
+{
+ dput(path->dentry);
+ pin_remove(p);
+}
+EXPORT_SYMBOL(path_put_unpin);
+
struct nameidata {
struct path path;
struct qstr last;
diff --git a/include/linux/path.h b/include/linux/path.h
index d137218..34599fb 100644
--- a/include/linux/path.h
+++ b/include/linux/path.h
@@ -3,6 +3,7 @@

struct dentry;
struct vfsmount;
+struct fs_pin;

struct path {
struct vfsmount *mnt;
@@ -12,6 +13,9 @@ struct path {
extern void path_get(const struct path *);
extern void path_put(const struct path *);

+extern void path_get_pin(struct path *, struct fs_pin *);
+extern void path_put_unpin(struct path *, struct fs_pin *);
+
static inline int path_equal(const struct path *path1, const struct path *path2)
{
return path1->mnt == path2->mnt && path1->dentry == path2->dentry;
--
2.4.1


2015-05-24 15:10:38

by Kinglong Mee

[permalink] [raw]
Subject: [PATCH 2/5 v2] fs_pin: Export functions for specific filesystem

Exports functions for others who want pin to vfsmount,
eg, nfsd's export cache.

v2, same as v1.

Signed-off-by: Kinglong Mee <[email protected]>
---
fs/fs_pin.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/fs_pin.c b/fs/fs_pin.c
index 611b540..553e8b1 100644
--- a/fs/fs_pin.c
+++ b/fs/fs_pin.c
@@ -17,6 +17,7 @@ void pin_remove(struct fs_pin *pin)
wake_up_locked(&pin->wait);
spin_unlock_irq(&pin->wait.lock);
}
+EXPORT_SYMBOL(pin_remove);

void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head *p)
{
@@ -26,11 +27,13 @@ void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head
hlist_add_head(&pin->m_list, &real_mount(m)->mnt_pins);
spin_unlock(&pin_lock);
}
+EXPORT_SYMBOL(pin_insert_group);

void pin_insert(struct fs_pin *pin, struct vfsmount *m)
{
pin_insert_group(pin, m, &m->mnt_sb->s_pins);
}
+EXPORT_SYMBOL(pin_insert);

void pin_kill(struct fs_pin *p)
{
--
2.4.1


2015-05-24 15:10:50

by Kinglong Mee

[permalink] [raw]
Subject: [PATCH 4/5 v2] sunrpc: New helper cache_force_expire for cache cleanup

Using expiry_time force cleanup a cache.

v2, same as v1.

Signed-off-by: Kinglong Mee <[email protected]>
---
include/linux/sunrpc/cache.h | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index 437ddb6..ce75e9c 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -210,6 +210,17 @@ extern int cache_check(struct cache_detail *detail,
struct cache_head *h, struct cache_req *rqstp);
extern void cache_flush(void);
extern void cache_purge(struct cache_detail *detail);
+
+static inline void cache_force_expire(struct cache_detail *detail, struct cache_head *h)
+{
+ write_lock(&detail->hash_lock);
+ h->expiry_time = seconds_since_boot() - 1;
+ detail->nextcheck = seconds_since_boot();
+ write_unlock(&detail->hash_lock);
+
+ cache_flush();
+}
+
#define NEVER (0x7FFFFFFF)
extern void __init cache_initialize(void);
extern int cache_register_net(struct cache_detail *cd, struct net *net);
--
2.4.1


2015-05-24 15:10:57

by Kinglong Mee

[permalink] [raw]
Subject: [PATCH 5/5] nfsd: allows user un-mounting filesystem where nfsd exports base on

If there are some mount points(not exported for nfs) under pseudo root,
after client's operation of those entry under the root, anyone *can't*
unmount those mount points until export cache expired.

/nfs/xfs *(rw,insecure,no_subtree_check,no_root_squash)
/nfs/pnfs *(rw,insecure,no_subtree_check,no_root_squash)
total 0
drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
drwxr-xr-x. 2 root root 6 Apr 20 22:01 xfs
Filesystem 1K-blocks Used Available Use% Mounted on
......
/dev/sdd 1038336 32944 1005392 4% /nfs/pnfs
/dev/sdc 10475520 32928 10442592 1% /nfs/xfs
/dev/sde 999320 1284 929224 1% /nfs/test
/mnt/pnfs/:
total 0
-rw-r--r--. 1 root root 0 Apr 21 22:23 attr
drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp

/mnt/xfs/:
total 0
umount: /nfs/test/: target is busy
(In some cases useful info about processes that
use the device is found by lsof(8) or fuser(1).)

I don't think that's user expect, they want umount /nfs/test/.

It's caused by exports cache of nfsd holds the reference of
the path (here is /nfs/test/), so, it can't be umounted.

v2,
1. Update exports according to the "allow_umount" option.
Pin to vfsmnt default, change when updating.
2. Using kzalloc for all memory allocating without kmalloc.

Signed-off-by: Kinglong Mee <[email protected]>
---
fs/nfsd/export.c | 52 ++++++++++++++++++++++++++++++++--------
fs/nfsd/export.h | 11 ++++++++-
include/uapi/linux/nfsd/export.h | 3 ++-
3 files changed, 54 insertions(+), 12 deletions(-)

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index f79521a..cc34b0b 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -43,9 +43,9 @@ static void expkey_put(struct kref *ref)

if (test_bit(CACHE_VALID, &key->h.flags) &&
!test_bit(CACHE_NEGATIVE, &key->h.flags))
- path_put(&key->ek_path);
+ path_put_unpin(&key->ek_path, &key->ek_pin);
auth_domain_put(key->ek_client);
- kfree(key);
+ kfree_rcu(key, rcu_head);
}

static void expkey_request(struct cache_detail *cd,
@@ -83,7 +83,7 @@ static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen)
return -EINVAL;
mesg[mlen-1] = 0;

- buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
err = -ENOMEM;
if (!buf)
goto out;
@@ -120,6 +120,7 @@ static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen)
goto out;

key.ek_client = dom;
+ key.cd = cd;
key.ek_fsidtype = fsidtype;
memcpy(key.ek_fsid, buf, len);

@@ -210,6 +211,13 @@ static inline void expkey_init(struct cache_head *cnew,
new->ek_fsidtype = item->ek_fsidtype;

memcpy(new->ek_fsid, item->ek_fsid, sizeof(new->ek_fsid));
+ new->cd = item->cd;
+}
+
+static void expkey_pin_kill(struct fs_pin *pin)
+{
+ struct svc_expkey *key = container_of(pin, struct svc_expkey, ek_pin);
+ cache_force_expire(key->cd, &key->h);
}

static inline void expkey_update(struct cache_head *cnew,
@@ -218,13 +226,14 @@ static inline void expkey_update(struct cache_head *cnew,
struct svc_expkey *new = container_of(cnew, struct svc_expkey, h);
struct svc_expkey *item = container_of(citem, struct svc_expkey, h);

+ init_fs_pin(&new->ek_pin, expkey_pin_kill);
new->ek_path = item->ek_path;
- path_get(&item->ek_path);
+ path_get_pin(&new->ek_path, &new->ek_pin);
}

static struct cache_head *expkey_alloc(void)
{
- struct svc_expkey *i = kmalloc(sizeof(*i), GFP_KERNEL);
+ struct svc_expkey *i = kzalloc(sizeof(*i), GFP_KERNEL);
if (i)
return &i->h;
else
@@ -309,11 +318,16 @@ static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc)
static void svc_export_put(struct kref *ref)
{
struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
- path_put(&exp->ex_path);
+
+ if (EX_ALLOW_UMOUNT(exp))
+ path_put_unpin(&exp->ex_path, &exp->ex_pin);
+ else
+ path_put(&exp->ex_path);
+
auth_domain_put(exp->ex_client);
nfsd4_fslocs_free(&exp->ex_fslocs);
kfree(exp->ex_uuid);
- kfree(exp);
+ kfree_rcu(exp, rcu_head);
}

static void svc_export_request(struct cache_detail *cd,
@@ -520,7 +534,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
return -EINVAL;
mesg[mlen-1] = 0;

- buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!buf)
return -ENOMEM;

@@ -694,15 +708,23 @@ static int svc_export_match(struct cache_head *a, struct cache_head *b)
path_equal(&orig->ex_path, &new->ex_path);
}

+static void export_pin_kill(struct fs_pin *pin)
+{
+ struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);
+ cache_force_expire(exp->cd, &exp->h);
+}
+
static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
{
struct svc_export *new = container_of(cnew, struct svc_export, h);
struct svc_export *item = container_of(citem, struct svc_export, h);

+ init_fs_pin(&new->ex_pin, export_pin_kill);
kref_get(&item->ex_client->ref);
+ new->ex_flags = NFSEXP_ALLOW_UMOUNT;
new->ex_client = item->ex_client;
new->ex_path = item->ex_path;
- path_get(&item->ex_path);
+ path_get_pin(&new->ex_path, &new->ex_pin);
new->ex_fslocs.locations = NULL;
new->ex_fslocs.locations_count = 0;
new->ex_fslocs.migrated = 0;
@@ -717,6 +739,14 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
struct svc_export *item = container_of(citem, struct svc_export, h);
int i;

+ if (!EX_ALLOW_UMOUNT(item)) {
+ path_get(&new->ex_path);
+ if (EX_ALLOW_UMOUNT(new))
+ path_put_unpin(&new->ex_path, &new->ex_pin);
+ else
+ path_put(&new->ex_path);
+ }
+
new->ex_flags = item->ex_flags;
new->ex_anon_uid = item->ex_anon_uid;
new->ex_anon_gid = item->ex_anon_gid;
@@ -740,7 +770,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)

static struct cache_head *svc_export_alloc(void)
{
- struct svc_export *i = kmalloc(sizeof(*i), GFP_KERNEL);
+ struct svc_export *i = kzalloc(sizeof(*i), GFP_KERNEL);
if (i)
return &i->h;
else
@@ -811,6 +841,7 @@ exp_find_key(struct cache_detail *cd, struct auth_domain *clp, int fsid_type,

key.ek_client = clp;
key.ek_fsidtype = fsid_type;
+ key.cd = cd;
memcpy(key.ek_fsid, fsidv, key_len(fsid_type));

ek = svc_expkey_lookup(cd, &key);
@@ -1159,6 +1190,7 @@ static struct flags {
{ NFSEXP_NOAUTHNLM, {"insecure_locks", ""}},
{ NFSEXP_V4ROOT, {"v4root", ""}},
{ NFSEXP_PNFS, {"pnfs", ""}},
+ { NFSEXP_ALLOW_UMOUNT, {"allow_umount", ""}},
{ 0, {"", ""}}
};

diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index 1f52bfc..1134875 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -4,6 +4,7 @@
#ifndef NFSD_EXPORT_H
#define NFSD_EXPORT_H

+#include <linux/fs_pin.h>
#include <linux/sunrpc/cache.h>
#include <uapi/linux/nfsd/export.h>

@@ -46,6 +47,8 @@ struct exp_flavor_info {

struct svc_export {
struct cache_head h;
+ struct cache_detail *cd;
+
struct auth_domain * ex_client;
int ex_flags;
struct path ex_path;
@@ -58,7 +61,9 @@ struct svc_export {
struct exp_flavor_info ex_flavors[MAX_SECINFO_LIST];
enum pnfs_layouttype ex_layout_type;
struct nfsd4_deviceid_map *ex_devid_map;
- struct cache_detail *cd;
+
+ struct fs_pin ex_pin;
+ struct rcu_head rcu_head;
};

/* an "export key" (expkey) maps a filehandlefragement to an
@@ -67,17 +72,21 @@ struct svc_export {
*/
struct svc_expkey {
struct cache_head h;
+ struct cache_detail *cd;

struct auth_domain * ek_client;
int ek_fsidtype;
u32 ek_fsid[6];

struct path ek_path;
+ struct fs_pin ek_pin;
+ struct rcu_head rcu_head;
};

#define EX_ISSYNC(exp) (!((exp)->ex_flags & NFSEXP_ASYNC))
#define EX_NOHIDE(exp) ((exp)->ex_flags & NFSEXP_NOHIDE)
#define EX_WGATHER(exp) ((exp)->ex_flags & NFSEXP_GATHERED_WRITES)
+#define EX_ALLOW_UMOUNT(exp) ((exp)->ex_flags & NFSEXP_ALLOW_UMOUNT)

int nfsexp_flags(struct svc_rqst *rqstp, struct svc_export *exp);
__be32 check_nfsd_access(struct svc_export *exp, struct svc_rqst *rqstp);
diff --git a/include/uapi/linux/nfsd/export.h b/include/uapi/linux/nfsd/export.h
index 0df7bd5..61aa8bb 100644
--- a/include/uapi/linux/nfsd/export.h
+++ b/include/uapi/linux/nfsd/export.h
@@ -51,9 +51,10 @@
*/
#define NFSEXP_V4ROOT 0x10000
#define NFSEXP_PNFS 0x20000
+#define NFSEXP_ALLOW_UMOUNT 0x40000

/* All flags that we claim to support. (Note we don't support NOACL.) */
-#define NFSEXP_ALLFLAGS 0x3FE7F
+#define NFSEXP_ALLFLAGS 0x7FE7F

/* The flags that may vary depending on security flavor: */
#define NFSEXP_SECINFO_FLAGS (NFSEXP_READONLY | NFSEXP_ROOTSQUASH \
--
2.4.1


2015-06-01 18:21:56

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 0/4 v2] NFSD: Pin to vfsmount for some nfsd exports cache

On Sun, May 24, 2015 at 11:01:56PM +0800, Kinglong Mee wrote:
> If there are some mount points(not exported for nfs) under pseudo root,
> after client's operation of those entry under the root, anyone *can't*
> unmount those mount points until export cache expired.

Thanks for the update, apologies for the delayed response.

>
> # cat /etc/exports
> /nfs/xfs *(rw,insecure,no_subtree_check,no_root_squash)
> /nfs/pnfs *(rw,insecure,no_subtree_check,no_root_squash)
> # ll /nfs/
> total 0
> drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
> drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
> drwxr-xr-x. 2 root root 6 Apr 20 22:01 xfs
> # mount /dev/sde /nfs/test
> # df
> Filesystem 1K-blocks Used Available Use% Mounted on
> ......
> /dev/sdd 1038336 32944 1005392 4% /nfs/pnfs
> /dev/sdc 10475520 32928 10442592 1% /nfs/xfs
> /dev/sde 999320 1284 929224 1% /nfs/test
> # mount -t nfs 127.0.0.1:/nfs/ /mnt
> # ll /mnt/*/
> /mnt/pnfs/:
> total 0
> -rw-r--r--. 1 root root 0 Apr 21 22:23 attr
> drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp
>
> /mnt/xfs/:
> total 0
> # umount /nfs/test/
> umount: /nfs/test/: target is busy
> (In some cases useful info about processes that
> use the device is found by lsof(8) or fuser(1).)
>
> I don't think that's user expect, they want umount /nfs/test/.
>
> It's caused by exports cache of nfsd holds the reference of
> the path (here is /nfs/test/), so, it can't be umounted.
>
> v1 --> v2,
> 1. Adds an option named "allow_umount" for exports allowing user
> un-mounting the filesystem where nfsd exports base on.

I don't think allow_umount is a useful option. I'd rather just make the
code behave like allow_umount was on all the time.

The fact is nobody could ever *depend* on umount to fail on an exported
filesystem anyway.

I do think we might want a stronger "allow_umount" option that actually
revokes locks and such as necessary. I just don't see the need for this
in between case.

--b.

> 2. New helpers path_get_pin/path_put_unpin for path pin.
> 3. Update exports according to the "allow_umount" option.
>
> Kinglong Mee (5):
> fs_pin: Fix uninitialized value in fs_pin
> fs_pin: Export functions for specific filesystem
> path: New helpers path_get_pin/path_put_unpin for path pin
> sunrpc: New helper cache_force_expire for cache cleanup
> nfsd: Allows user un-mounting filesystem where nfsd exports base on
>
> fs/fs_pin.c | 3 +++
> fs/namei.c | 26 ++++++++++++++++++++
> fs/nfsd/export.c | 52 ++++++++++++++++++++++++++++++++--------
> fs/nfsd/export.h | 11 ++++++++-
> include/linux/fs_pin.h | 6 +++++
> include/linux/path.h | 4 ++++
> include/linux/sunrpc/cache.h | 11 +++++++++
> include/uapi/linux/nfsd/export.h | 3 ++-
> 8 files changed, 104 insertions(+), 12 deletions(-)
>
> --
> 2.4.1

2015-06-02 01:41:10

by Kinglong Mee

[permalink] [raw]
Subject: Re: [PATCH 0/4 v2] NFSD: Pin to vfsmount for some nfsd exports cache

On 6/2/2015 2:21 AM, J. Bruce Fields wrote:
> On Sun, May 24, 2015 at 11:01:56PM +0800, Kinglong Mee wrote:
>> If there are some mount points(not exported for nfs) under pseudo root,
>> after client's operation of those entry under the root, anyone *can't*
>> unmount those mount points until export cache expired.
>
> Thanks for the update, apologies for the delayed response.
>
>>
>> # cat /etc/exports
>> /nfs/xfs *(rw,insecure,no_subtree_check,no_root_squash)
>> /nfs/pnfs *(rw,insecure,no_subtree_check,no_root_squash)
>> # ll /nfs/
>> total 0
>> drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
>> drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
>> drwxr-xr-x. 2 root root 6 Apr 20 22:01 xfs
>> # mount /dev/sde /nfs/test
>> # df
>> Filesystem 1K-blocks Used Available Use% Mounted on
>> ......
>> /dev/sdd 1038336 32944 1005392 4% /nfs/pnfs
>> /dev/sdc 10475520 32928 10442592 1% /nfs/xfs
>> /dev/sde 999320 1284 929224 1% /nfs/test
>> # mount -t nfs 127.0.0.1:/nfs/ /mnt
>> # ll /mnt/*/
>> /mnt/pnfs/:
>> total 0
>> -rw-r--r--. 1 root root 0 Apr 21 22:23 attr
>> drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp
>>
>> /mnt/xfs/:
>> total 0
>> # umount /nfs/test/
>> umount: /nfs/test/: target is busy
>> (In some cases useful info about processes that
>> use the device is found by lsof(8) or fuser(1).)
>>
>> I don't think that's user expect, they want umount /nfs/test/.
>>
>> It's caused by exports cache of nfsd holds the reference of
>> the path (here is /nfs/test/), so, it can't be umounted.
>>
>> v1 --> v2,
>> 1. Adds an option named "allow_umount" for exports allowing user
>> un-mounting the filesystem where nfsd exports base on.
>
> I don't think allow_umount is a useful option. I'd rather just make the
> code behave like allow_umount was on all the time.

Got it.

A new patch site is posted as updated from v1.

>
> The fact is nobody could ever *depend* on umount to fail on an exported
> filesystem anyway.
>
> I do think we might want a stronger "allow_umount" option that actually
> revokes locks and such as necessary. I just don't see the need for this
> in between case.

Yes, it isn't.

But user always care it when umount fail, and it is late.
I think a proc file is useful than mount options like the stronger
"allow_umount". Maybe "exportfs -f" can do the job of revokes locks.

We can discuss it deeply when really need it.

thanks,
Kinglong Mee

>
> --b.
>
>> 2. New helpers path_get_pin/path_put_unpin for path pin.
>> 3. Update exports according to the "allow_umount" option.
>>
>> Kinglong Mee (5):
>> fs_pin: Fix uninitialized value in fs_pin
>> fs_pin: Export functions for specific filesystem
>> path: New helpers path_get_pin/path_put_unpin for path pin
>> sunrpc: New helper cache_force_expire for cache cleanup
>> nfsd: Allows user un-mounting filesystem where nfsd exports base on
>>
>> fs/fs_pin.c | 3 +++
>> fs/namei.c | 26 ++++++++++++++++++++
>> fs/nfsd/export.c | 52 ++++++++++++++++++++++++++++++++--------
>> fs/nfsd/export.h | 11 ++++++++-
>> include/linux/fs_pin.h | 6 +++++
>> include/linux/path.h | 4 ++++
>> include/linux/sunrpc/cache.h | 11 +++++++++
>> include/uapi/linux/nfsd/export.h | 3 ++-
>> 8 files changed, 104 insertions(+), 12 deletions(-)
>>
>> --
>> 2.4.1
>

2015-06-05 15:02:17

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 5/5] nfsd: allows user un-mounting filesystem where nfsd exports base on

On Sun, May 24, 2015 at 11:10:50PM +0800, Kinglong Mee wrote:
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -43,9 +43,9 @@ static void expkey_put(struct kref *ref)
>
> if (test_bit(CACHE_VALID, &key->h.flags) &&
> !test_bit(CACHE_NEGATIVE, &key->h.flags))
> - path_put(&key->ek_path);
> + path_put_unpin(&key->ek_path, &key->ek_pin);
> auth_domain_put(key->ek_client);
> - kfree(key);
> + kfree_rcu(key, rcu_head);
> }

That looks wrong. OK, so you want umount() to proceed; fine, no problem
with that. However, what happens if the final mntput() hits while you
are just approaching that path_put_unpin()? ->kill() will be triggered,
and it would bloody better
a) make sure that expkey_put() is called for that key if it hadn't
already been done and
b) do not return until such expkey_put() completes. Including the
ones that might have been already entered by the time we'd got to ->kill().

Am I missing something subtle here?

2015-06-06 02:22:04

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 5/5] nfsd: allows user un-mounting filesystem where nfsd exports base on

On Fri, Jun 05, 2015 at 04:02:13PM +0100, Al Viro wrote:
> On Sun, May 24, 2015 at 11:10:50PM +0800, Kinglong Mee wrote:
> > --- a/fs/nfsd/export.c
> > +++ b/fs/nfsd/export.c
> > @@ -43,9 +43,9 @@ static void expkey_put(struct kref *ref)
> >
> > if (test_bit(CACHE_VALID, &key->h.flags) &&
> > !test_bit(CACHE_NEGATIVE, &key->h.flags))
> > - path_put(&key->ek_path);
> > + path_put_unpin(&key->ek_path, &key->ek_pin);
> > auth_domain_put(key->ek_client);
> > - kfree(key);
> > + kfree_rcu(key, rcu_head);
> > }
>
> That looks wrong. OK, so you want umount() to proceed; fine, no problem
> with that. However, what happens if the final mntput() hits while you
> are just approaching that path_put_unpin()? ->kill() will be triggered,
> and it would bloody better
> a) make sure that expkey_put() is called for that key if it hadn't
> already been done and
> b) do not return until such expkey_put() completes. Including the
> ones that might have been already entered by the time we'd got to ->kill().
>
> Am I missing something subtle here?

Having looked through that code... It *is* wrong. Note that the normal
approach is to have pin_remove() called via pin_kill(), directly or triggered
from group_pin_kill() and/or cleanup_mnt() on the mount it's attached to.
pin_remove() should never be called outside of ->kill() callbacks. It should
be called at the point where you are OK with fs being shut down.

The fundamental reason why it's broken is different, though - you *can't*
grab a reference if all you've got is a pin. By the time the callback is
called, the mount in question is already irretrievably committed to being
killed. There's one hell of a wide window between the point of no return
and the point where you are notified of anything, and that's by design -
you might very well have had several mounts doomed by a syscall and they
all get through cleanup_mnt() just before return to userland. One by one.
So between the point where this puppy is doomed and the call of your callback
there might have been several filesystems going through shutdown, with tons
of IO, waiting for remote servers, etc.

We could add a primitive that would _try_ to grab a reference - that can
be done (lock_mount_hash(), check if it has MNT_DOOMED or MNT_SYNC_UMOUNT,
fail if it does, otherwise mnt_add_count(mnt, 1) and succeed, doing
unlock_mount_hash() on both exit paths). HOWEVER, you'll need to think
very carefully where to use that primitive - unlike mntget() it _can_
fail and lock_mount_hash() can inflict quite a bit of cacheline pingpong
if used heavily.

Could you give details on lifecycle of those objects, including the stages
at which we might try to grab references? Combination of such primitive with
a pin (doing just "NULL the references to vfsmount/dentry, do dput() on
what that dentry used to be and call pin_remove()") might work, if the
lifecycle is good enough.

2015-06-06 13:50:24

by Kinglong Mee

[permalink] [raw]
Subject: Re: [PATCH 5/5] nfsd: allows user un-mounting filesystem where nfsd exports base on

On 6/6/2015 10:21 AM, Al Viro wrote:
> On Fri, Jun 05, 2015 at 04:02:13PM +0100, Al Viro wrote:
>> On Sun, May 24, 2015 at 11:10:50PM +0800, Kinglong Mee wrote:
>>> --- a/fs/nfsd/export.c
>>> +++ b/fs/nfsd/export.c
>>> @@ -43,9 +43,9 @@ static void expkey_put(struct kref *ref)
>>>
>>> if (test_bit(CACHE_VALID, &key->h.flags) &&
>>> !test_bit(CACHE_NEGATIVE, &key->h.flags))
>>> - path_put(&key->ek_path);
>>> + path_put_unpin(&key->ek_path, &key->ek_pin);
>>> auth_domain_put(key->ek_client);
>>> - kfree(key);
>>> + kfree_rcu(key, rcu_head);
>>> }
>>
>> That looks wrong. OK, so you want umount() to proceed; fine, no problem
>> with that. However, what happens if the final mntput() hits while you
>> are just approaching that path_put_unpin()? ->kill() will be triggered,
>> and it would bloody better
>> a) make sure that expkey_put() is called for that key if it hadn't
>> already been done and
>> b) do not return until such expkey_put() completes. Including the
>> ones that might have been already entered by the time we'd got to ->kill().

You are right.
Sorry for my fault, the above patch misses caring the race.

>>
>> Am I missing something subtle here?
>
> Having looked through that code... It *is* wrong. Note that the normal
> approach is to have pin_remove() called via pin_kill(), directly or triggered
> from group_pin_kill() and/or cleanup_mnt() on the mount it's attached to.
> pin_remove() should never be called outside of ->kill() callbacks. It should
> be called at the point where you are OK with fs being shut down.

Thank you very much for your comments.
I will try to using fs_pin as the restrict.

>
> The fundamental reason why it's broken is different, though - you *can't*
> grab a reference if all you've got is a pin. By the time the callback is
> called, the mount in question is already irretrievably committed to being
> killed. There's one hell of a wide window between the point of no return
> and the point where you are notified of anything, and that's by design -
> you might very well have had several mounts doomed by a syscall and they
> all get through cleanup_mnt() just before return to userland. One by one.
> So between the point where this puppy is doomed and the call of your callback
> there might have been several filesystems going through shutdown, with tons
> of IO, waiting for remote servers, etc.
>
> We could add a primitive that would _try_ to grab a reference - that can
> be done (lock_mount_hash(), check if it has MNT_DOOMED or MNT_SYNC_UMOUNT,
> fail if it does, otherwise mnt_add_count(mnt, 1) and succeed, doing
> unlock_mount_hash() on both exit paths). HOWEVER, you'll need to think
> very carefully where to use that primitive - unlike mntget() it _can_
> fail and lock_mount_hash() can inflict quite a bit of cacheline pingpong
> if used heavily.

Do you mean adding a new feature?

>
> Could you give details on lifecycle of those objects, including the stages
> at which we might try to grab references? Combination of such primitive with
> a pin (doing just "NULL the references to vfsmount/dentry, do dput() on
> what that dentry used to be and call pin_remove()") might work, if the
> lifecycle is good enough.

NFSD has two caches named expkey and export which are managed by sunrpc cache
fundamental. I will only explain export following for expkey is similar as export.

struct cache_head {
struct kref ref;
... ...
};
struct svc_export {
struct cache_head h;
struct path ex_path;
... ...
};

1. svc_export has a reference, will be freed when the reference is decreased to zero.
2. ex_path must be put when freed (Want change mntget to fs_pin for ex_path's vfsmnt).
3. With fs_pin, there are two logic (one is the normal logic, the other is pin_kill)
which can cause free svc_export.
4. The reference of the normal logic is zero, but the pin_kill logic is not zero.
the second logic will decrease the reference indirectly, if decrease to zero,
umount will go though the normal logic's code, at last frees the svc_export;
if not zero, umount must don't free the svc_export.

I try to solve the window as,
struct svc_export {
struct cache_head h;
struct path ex_path;
... ...
struct fs_pin ex_pin;
struct rcu_head rcu_head;

/* For cache_put and fs umounting window */
struct completion ex_done;
struct work_struct ex_work;
};

1. ex_done is for umount waiting the reference is decreased to zero.
2. ex_work is for umount decrease the reference indirectly.
3. The normal logic don't free the svc_export, calls complete() and
go though pin_kill() logic as,
(svc_export_put will be called when reference is decreased to zero)

static void svc_export_put(struct kref *ref)
{
struct svc_export *exp = container_of(ref, struct svc_export, h.ref);

rcu_read_lock();
complete(&exp->ex_done);
pin_kill(&exp->ex_pin);
}

4. pin_kill() logic will schedules to decrease the reference though ex_work,
and at last path_put_unpin and destroy the svc_export.

static void export_pin_kill(struct fs_pin *pin)
{
struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);

if (!completion_done(&exp->ex_done)) {
schedule_work(&exp->ex_work);
wait_for_completion(&exp->ex_done);
}
path_put_unpin(&exp->ex_path, &exp->ex_pin);
svc_export_destroy(exp);
}

The full patches will be sent later. Thanks again.

thanks,
Kinglong Mee