2015-06-25 14:17:48

by Kinglong Mee

Subject: [PATCH 00/10 v6] NFSD: Pin to vfsmount for nfsd exports cache

If there are mount points (not exported for NFS) under the pseudo root,
then after a client operates on entries under the root, nobody can
unmount those mount points until the export cache expires.

# cat /etc/exports
/nfs/xfs *(rw,insecure,no_subtree_check,no_root_squash)
/nfs/pnfs *(rw,insecure,no_subtree_check,no_root_squash)
# ll /nfs/
total 0
drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
drwxr-xr-x. 2 root root 6 Apr 20 22:01 xfs
# mount /dev/sde /nfs/test
# df
Filesystem 1K-blocks Used Available Use% Mounted on
......
/dev/sdd 1038336 32944 1005392 4% /nfs/pnfs
/dev/sdc 10475520 32928 10442592 1% /nfs/xfs
/dev/sde 999320 1284 929224 1% /nfs/test
# mount -t nfs 127.0.0.1:/nfs/ /mnt
# ll /mnt/*/
/mnt/pnfs/:
total 0
-rw-r--r--. 1 root root 0 Apr 21 22:23 attr
drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp

/mnt/xfs/:
total 0
# umount /nfs/test/
umount: /nfs/test/: target is busy
(In some cases useful info about processes that
use the device is found by lsof(8) or fuser(1).)

This is caused by nfsd's export cache holding a reference to
the path (here /nfs/test/), so it cannot be unmounted.

I don't think that's what users expect; they want to umount /nfs/test/.
Bruce thinks users should also be able to umount /nfs/pnfs/ and /nfs/xfs.

This patch set lets nfsd exports pin to the vfsmount instead of
using mntget, so users can now umount any exported mount point.
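
For illustration only, a minimal sketch of the before/after pattern using
the helpers introduced below (patches 03 and 10); 'exp' is the svc_export
cache entry from patch 10:

	/* before: the export cache took a full mount reference, so umount
	 * of that filesystem returned -EBUSY until the entry expired */
	path_get(&exp->ex_path);                     /* dget() + mntget() */
	/* ... */
	path_put(&exp->ex_path);                     /* dput() + mntput() */

	/* after: the cache only pins to the vfsmount; on umount the pin's
	 * ->kill() callback tears the cache entry down and drops the pin */
	path_get_pin(&exp->ex_path, &exp->ex_pin);   /* dget() + pin_insert_group() */
	/* ... */
	path_put_unpin(&exp->ex_path, &exp->ex_pin); /* dput() + pin_remove() */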

v3,
1. New helpers path_get_pin/path_put_unpin for path pin.
2. Use kzalloc for allocating memory.

v4, Thanks to Al Viro for his comments on the fs_pin logic.
1. add a completion so pin_kill can wait until the reference drops to zero.
2. add a work_struct so pin_kill can decrease the reference indirectly.
3. free svc_export/svc_expkey in pin_kill, not in svc_export_put/svc_expkey_put.
4. svc_export_put/svc_expkey_put go through the pin_kill logic.

v5,
kill the fs_pin while holding a reference to the vfsmnt.

v6,
1. revert the change from v5
2. new helper legitimize_mntget() so the nfsd exports/expkey caches can
get the vfsmount from the fs_pin
3. clean up some of sunrpc's cache code
4. switch cache_head in cache_detail to a list_head instead of a
single-linked list
5. new validate/invalidate hooks for processing reference-count
increase/decrease changes (the nfsd exports/expkey caches use them to
grab the reference of the mnt)
6. delete the cache_head directly from the cache_detail in pin_kill

Right now,

when the reference count of a cache_head increases above 1 (> 1), grab a
reference of the mnt once; when it decreases back to 1 (== 1), drop the
reference of the mnt.

So after that,
when ref > 1, the user cannot umount the filesystem (-EBUSY);
when ref == 1, the cache entry is referenced only by the nfsd cache,
with no other references, so the user can try to umount:
1. before MNT_UMOUNT is set (protected by mount_lock), if the nfsd cache
gets referenced (ref > 1, legitimize_mntget), umount fails with -EBUSY.
2. after MNT_UMOUNT is set, if the nfsd cache gets referenced (ref == 2),
legitimize_mntget fails, the cache entry is set to CACHE_NEGATIVE,
and the reference is dropped, going back to 1,
so pin_kill can delete the cache entry and the umount succeeds.
3. while umounting, if there is no reference to the nfsd cache entry,
pin_kill can delete the cache entry and the umount succeeds.

Kinglong Mee (10):
fs_pin: Initialize value for fs_pin explicitly
fs_pin: Export functions for specific filesystem
path: New helpers path_get_pin/path_put_unpin for path pin
fs: New helper legitimize_mntget() for getting a legitimized mnt
sunrpc: Store cache_detail in seq_file's private directly
sunrpc/nfsd: Remove redundant code by exporting seq_operations functions
sunrpc: Switch to using list_head instead of a single list
sunrpc: New helper cache_delete_entry for deleting cache_head directly
sunrpc: Support validate/invalidate for reference change in cache_detail
nfsd: Allow users to unmount filesystems that nfsd exports are based on

fs/fs_pin.c | 4 +
fs/namei.c | 26 ++++
fs/namespace.c | 19 +++
fs/nfsd/export.c | 242 ++++++++++++++++++++++++--------------
fs/nfsd/export.h | 26 +++-
include/linux/fs_pin.h | 6 +
include/linux/mount.h | 1 +
include/linux/path.h | 4 +
include/linux/sunrpc/cache.h | 21 +++-
net/sunrpc/auth_gss/svcauth_gss.c | 2 +-
net/sunrpc/cache.c | 159 +++++++++++++++----------
net/sunrpc/svcauth_unix.c | 2 +-
12 files changed, 357 insertions(+), 155 deletions(-)

--
2.4.3



2015-06-25 14:19:11

by Kinglong Mee

Subject: [PATCH 01/10 v6] fs_pin: Initialize value for fs_pin explicitly

Without initialization, the 'done' field of an fs_pin on the stack may
contain a garbage value.

v3, v4, v5, v6
Add an include-guard macro to the header file

Signed-off-by: Kinglong Mee <[email protected]>
---
include/linux/fs_pin.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/include/linux/fs_pin.h b/include/linux/fs_pin.h
index 3886b3b..0dde7b7 100644
--- a/include/linux/fs_pin.h
+++ b/include/linux/fs_pin.h
@@ -1,3 +1,6 @@
+#ifndef _LINUX_FS_PIN_H
+#define _LINUX_FS_PIN_H
+
#include <linux/wait.h>

struct fs_pin {
@@ -16,9 +19,12 @@ static inline void init_fs_pin(struct fs_pin *p, void (*kill)(struct fs_pin *))
INIT_HLIST_NODE(&p->s_list);
INIT_HLIST_NODE(&p->m_list);
p->kill = kill;
+ p->done = 0;
}

void pin_remove(struct fs_pin *);
void pin_insert_group(struct fs_pin *, struct vfsmount *, struct hlist_head *);
void pin_insert(struct fs_pin *, struct vfsmount *);
void pin_kill(struct fs_pin *);
+
+#endif
--
2.4.3


2015-06-25 14:19:32

by Kinglong Mee

Subject: [PATCH 02/10 v6] fs_pin: Export functions for specific filesystem

Export the functions for other code that wants to pin to a vfsmount,
e.g. nfsd's export cache.

v4, v5, v6
also export pin_kill.

Signed-off-by: Kinglong Mee <[email protected]>
---
fs/fs_pin.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/fs_pin.c b/fs/fs_pin.c
index 611b540..a1a4eb2 100644
--- a/fs/fs_pin.c
+++ b/fs/fs_pin.c
@@ -17,6 +17,7 @@ void pin_remove(struct fs_pin *pin)
wake_up_locked(&pin->wait);
spin_unlock_irq(&pin->wait.lock);
}
+EXPORT_SYMBOL(pin_remove);

void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head *p)
{
@@ -26,11 +27,13 @@ void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head
hlist_add_head(&pin->m_list, &real_mount(m)->mnt_pins);
spin_unlock(&pin_lock);
}
+EXPORT_SYMBOL(pin_insert_group);

void pin_insert(struct fs_pin *pin, struct vfsmount *m)
{
pin_insert_group(pin, m, &m->mnt_sb->s_pins);
}
+EXPORT_SYMBOL(pin_insert);

void pin_kill(struct fs_pin *p)
{
@@ -72,6 +75,7 @@ void pin_kill(struct fs_pin *p)
}
rcu_read_unlock();
}
+EXPORT_SYMBOL(pin_kill);

void mnt_pin_kill(struct mount *m)
{
--
2.4.3


2015-06-25 14:20:09

by Kinglong Mee

Subject: [PATCH 03/10 v6] path: New helpers path_get_pin/path_put_unpin for path pin

Two helpers for filesystems that pin to a vfsmnt instead of using mntget.

v4, v5, v6 same as v2.

Signed-off-by: Kinglong Mee <[email protected]>
---
fs/namei.c | 26 ++++++++++++++++++++++++++
include/linux/path.h | 4 ++++
2 files changed, 30 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 4a8d998b..ac71c65 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -492,6 +492,32 @@ void path_put(const struct path *path)
}
EXPORT_SYMBOL(path_put);

+/**
+ * path_get_pin - get a reference to a path's dentry
+ * and pin to path's vfsmnt
+ * @path: path to get the reference to
+ * @p: the fs_pin pin to vfsmnt
+ */
+void path_get_pin(struct path *path, struct fs_pin *p)
+{
+ dget(path->dentry);
+ pin_insert_group(p, path->mnt, NULL);
+}
+EXPORT_SYMBOL(path_get_pin);
+
+/**
+ * path_put_unpin - put a reference to a path's dentry
+ * and remove pin to path's vfsmnt
+ * @path: path to put the reference to
+ * @p: the fs_pin removed from vfsmnt
+ */
+void path_put_unpin(struct path *path, struct fs_pin *p)
+{
+ dput(path->dentry);
+ pin_remove(p);
+}
+EXPORT_SYMBOL(path_put_unpin);
+
struct nameidata {
struct path path;
struct qstr last;
diff --git a/include/linux/path.h b/include/linux/path.h
index d137218..34599fb 100644
--- a/include/linux/path.h
+++ b/include/linux/path.h
@@ -3,6 +3,7 @@

struct dentry;
struct vfsmount;
+struct fs_pin;

struct path {
struct vfsmount *mnt;
@@ -12,6 +13,9 @@ struct path {
extern void path_get(const struct path *);
extern void path_put(const struct path *);

+extern void path_get_pin(struct path *, struct fs_pin *);
+extern void path_put_unpin(struct path *, struct fs_pin *);
+
static inline int path_equal(const struct path *path1, const struct path *path2)
{
return path1->mnt == path2->mnt && path1->dentry == path2->dentry;
--
2.4.3


2015-06-25 14:21:48

by Kinglong Mee

Subject: [PATCH 04/10 v6] fs: New helper legitimize_mntget() for getting a legitimized mnt

New helper legitimize_mntget() that takes a reference to a mnt only when
none of MNT_SYNC_UMOUNT | MNT_UMOUNT | MNT_DOOMED is set; otherwise it
returns NULL.
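
A minimal usage sketch, assuming a caller that treats a mount which is
already going away as a stale cache entry ('exp' is the svc_export entry
from patch 10; the real callers there keep the reference until the entry's
refcount drops back to 1):

	struct vfsmount *mnt = legitimize_mntget(exp->ex_path.mnt);

	if (mnt == NULL) {
		/* MNT_SYNC_UMOUNT/MNT_UMOUNT/MNT_DOOMED is set: an umount is
		 * in progress, so mark the entry negative instead of blocking it */
		set_bit(CACHE_NEGATIVE, &exp->h.flags);
	} else {
		/* ... safe to use exp->ex_path.mnt here ... */
		mntput(mnt);
	}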

v6, New one

Signed-off-by: Kinglong Mee <[email protected]>
---
fs/namespace.c | 19 +++++++++++++++++++
include/linux/mount.h | 1 +
2 files changed, 20 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1f4f9da..f31d165 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1142,6 +1142,25 @@ struct vfsmount *mntget(struct vfsmount *mnt)
}
EXPORT_SYMBOL(mntget);

+struct vfsmount *legitimize_mntget(struct vfsmount *vfsmnt)
+{
+ struct mount *mnt;
+
+ if (vfsmnt == NULL)
+ return NULL;
+
+ read_seqlock_excl(&mount_lock);
+ mnt = real_mount(vfsmnt);
+ if (vfsmnt->mnt_flags & (MNT_SYNC_UMOUNT | MNT_UMOUNT | MNT_DOOMED))
+ vfsmnt = NULL;
+ else
+ mnt_add_count(mnt, 1);
+ read_sequnlock_excl(&mount_lock);
+
+ return vfsmnt;
+}
+EXPORT_SYMBOL(legitimize_mntget);
+
struct vfsmount *mnt_clone_internal(struct path *path)
{
struct mount *p;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c..8ae9dc0 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -79,6 +79,7 @@ extern void mnt_drop_write(struct vfsmount *mnt);
extern void mnt_drop_write_file(struct file *file);
extern void mntput(struct vfsmount *mnt);
extern struct vfsmount *mntget(struct vfsmount *mnt);
+extern struct vfsmount *legitimize_mntget(struct vfsmount *vfsmnt);
extern struct vfsmount *mnt_clone_internal(struct path *path);
extern int __mnt_is_readonly(struct vfsmount *mnt);

--
2.4.3


2015-06-25 14:25:35

by Kinglong Mee

Subject: [PATCH 05/10 v6] sunrpc: Store cache_detail in seq_file's private directly

Cleanup.

Just store the cache_detail in the seq_file's private field directly;
the separately allocated handle is redundant.

Signed-off-by: Kinglong Mee <[email protected]>
---
net/sunrpc/cache.c | 28 +++++++++++++---------------
1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 2928aff..edec603 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -1270,18 +1270,13 @@ EXPORT_SYMBOL_GPL(qword_get);
* get a header, then pass each real item in the cache
*/

-struct handle {
- struct cache_detail *cd;
-};
-
static void *c_start(struct seq_file *m, loff_t *pos)
__acquires(cd->hash_lock)
{
loff_t n = *pos;
unsigned int hash, entry;
struct cache_head *ch;
- struct cache_detail *cd = ((struct handle*)m->private)->cd;
-
+ struct cache_detail *cd = m->private;

read_lock(&cd->hash_lock);
if (!n--)
@@ -1308,7 +1303,7 @@ static void *c_next(struct seq_file *m, void *p, loff_t *pos)
{
struct cache_head *ch = p;
int hash = (*pos >> 32);
- struct cache_detail *cd = ((struct handle*)m->private)->cd;
+ struct cache_detail *cd = m->private;

if (p == SEQ_START_TOKEN)
hash = 0;
@@ -1334,14 +1329,14 @@ static void *c_next(struct seq_file *m, void *p, loff_t *pos)
static void c_stop(struct seq_file *m, void *p)
__releases(cd->hash_lock)
{
- struct cache_detail *cd = ((struct handle*)m->private)->cd;
+ struct cache_detail *cd = m->private;
read_unlock(&cd->hash_lock);
}

static int c_show(struct seq_file *m, void *p)
{
struct cache_head *cp = p;
- struct cache_detail *cd = ((struct handle*)m->private)->cd;
+ struct cache_detail *cd = m->private;

if (p == SEQ_START_TOKEN)
return cd->cache_show(m, cd, NULL);
@@ -1373,24 +1368,27 @@ static const struct seq_operations cache_content_op = {
static int content_open(struct inode *inode, struct file *file,
struct cache_detail *cd)
{
- struct handle *han;
+ struct seq_file *seq;
+ int err;

if (!cd || !try_module_get(cd->owner))
return -EACCES;
- han = __seq_open_private(file, &cache_content_op, sizeof(*han));
- if (han == NULL) {
+
+ err = seq_open(file, &cache_content_op);
+ if (err) {
module_put(cd->owner);
- return -ENOMEM;
+ return err;
}

- han->cd = cd;
+ seq = file->private_data;
+ seq->private = cd;
return 0;
}

static int content_release(struct inode *inode, struct file *file,
struct cache_detail *cd)
{
- int ret = seq_release_private(inode, file);
+ int ret = seq_release(inode, file);
module_put(cd->owner);
return ret;
}
--
2.4.3


2015-06-25 14:28:04

by Kinglong Mee

Subject: [PATCH 06/10 v6] sunrpc/nfsd: Remove redundant code by exporting seq_operations functions

Nfsd implements a set of seq_operations functions that duplicate sunrpc's
cache code. Just export sunrpc's functions and remove nfsd's redundant code.

Signed-off-by: Kinglong Mee <[email protected]>
---
fs/nfsd/export.c | 73 ++------------------------------------------
include/linux/sunrpc/cache.h | 5 +++
net/sunrpc/cache.c | 15 +++++----
3 files changed, 17 insertions(+), 76 deletions(-)

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index 002d3a9..34a384c 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -1075,73 +1075,6 @@ exp_pseudoroot(struct svc_rqst *rqstp, struct svc_fh *fhp)
return rv;
}

-/* Iterator */
-
-static void *e_start(struct seq_file *m, loff_t *pos)
- __acquires(((struct cache_detail *)m->private)->hash_lock)
-{
- loff_t n = *pos;
- unsigned hash, export;
- struct cache_head *ch;
- struct cache_detail *cd = m->private;
- struct cache_head **export_table = cd->hash_table;
-
- read_lock(&cd->hash_lock);
- if (!n--)
- return SEQ_START_TOKEN;
- hash = n >> 32;
- export = n & ((1LL<<32) - 1);
-
-
- for (ch=export_table[hash]; ch; ch=ch->next)
- if (!export--)
- return ch;
- n &= ~((1LL<<32) - 1);
- do {
- hash++;
- n += 1LL<<32;
- } while(hash < EXPORT_HASHMAX && export_table[hash]==NULL);
- if (hash >= EXPORT_HASHMAX)
- return NULL;
- *pos = n+1;
- return export_table[hash];
-}
-
-static void *e_next(struct seq_file *m, void *p, loff_t *pos)
-{
- struct cache_head *ch = p;
- int hash = (*pos >> 32);
- struct cache_detail *cd = m->private;
- struct cache_head **export_table = cd->hash_table;
-
- if (p == SEQ_START_TOKEN)
- hash = 0;
- else if (ch->next == NULL) {
- hash++;
- *pos += 1LL<<32;
- } else {
- ++*pos;
- return ch->next;
- }
- *pos &= ~((1LL<<32) - 1);
- while (hash < EXPORT_HASHMAX && export_table[hash] == NULL) {
- hash++;
- *pos += 1LL<<32;
- }
- if (hash >= EXPORT_HASHMAX)
- return NULL;
- ++*pos;
- return export_table[hash];
-}
-
-static void e_stop(struct seq_file *m, void *p)
- __releases(((struct cache_detail *)m->private)->hash_lock)
-{
- struct cache_detail *cd = m->private;
-
- read_unlock(&cd->hash_lock);
-}
-
static struct flags {
int flag;
char *name[2];
@@ -1270,9 +1203,9 @@ static int e_show(struct seq_file *m, void *p)
}

const struct seq_operations nfs_exports_op = {
- .start = e_start,
- .next = e_next,
- .stop = e_stop,
+ .start = cache_seq_start,
+ .next = cache_seq_next,
+ .stop = cache_seq_stop,
.show = e_show,
};

diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index 437ddb6..04ee5a2 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -224,6 +224,11 @@ extern int sunrpc_cache_register_pipefs(struct dentry *parent, const char *,
umode_t, struct cache_detail *);
extern void sunrpc_cache_unregister_pipefs(struct cache_detail *);

+/* Must store cache_detail in seq_file->private if using next three functions */
+extern void *cache_seq_start(struct seq_file *file, loff_t *pos);
+extern void *cache_seq_next(struct seq_file *file, void *p, loff_t *pos);
+extern void cache_seq_stop(struct seq_file *file, void *p);
+
extern void qword_add(char **bpp, int *lp, char *str);
extern void qword_addhex(char **bpp, int *lp, char *buf, int blen);
extern int qword_get(char **bpp, char *dest, int bufsize);
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index edec603..673c2fa 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -1270,7 +1270,7 @@ EXPORT_SYMBOL_GPL(qword_get);
* get a header, then pass each real item in the cache
*/

-static void *c_start(struct seq_file *m, loff_t *pos)
+void *cache_seq_start(struct seq_file *m, loff_t *pos)
__acquires(cd->hash_lock)
{
loff_t n = *pos;
@@ -1298,8 +1298,9 @@ static void *c_start(struct seq_file *m, loff_t *pos)
*pos = n+1;
return cd->hash_table[hash];
}
+EXPORT_SYMBOL_GPL(cache_seq_start);

-static void *c_next(struct seq_file *m, void *p, loff_t *pos)
+void *cache_seq_next(struct seq_file *m, void *p, loff_t *pos)
{
struct cache_head *ch = p;
int hash = (*pos >> 32);
@@ -1325,13 +1326,15 @@ static void *c_next(struct seq_file *m, void *p, loff_t *pos)
++*pos;
return cd->hash_table[hash];
}
+EXPORT_SYMBOL_GPL(cache_seq_next);

-static void c_stop(struct seq_file *m, void *p)
+void cache_seq_stop(struct seq_file *m, void *p)
__releases(cd->hash_lock)
{
struct cache_detail *cd = m->private;
read_unlock(&cd->hash_lock);
}
+EXPORT_SYMBOL_GPL(cache_seq_stop);

static int c_show(struct seq_file *m, void *p)
{
@@ -1359,9 +1362,9 @@ static int c_show(struct seq_file *m, void *p)
}

static const struct seq_operations cache_content_op = {
- .start = c_start,
- .next = c_next,
- .stop = c_stop,
+ .start = cache_seq_start,
+ .next = cache_seq_next,
+ .stop = cache_seq_stop,
.show = c_show,
};

--
2.4.3


2015-06-25 14:30:01

by Kinglong Mee

Subject: [PATCH 07/10 v6] sunrpc: Switch to using list_head instead of a single list

Switch cache_head in cache_detail to using a list_head; this makes it
possible to remove a cache_head entry directly from the cache_detail.

Signed-off-by: Kinglong Mee <[email protected]>
---
include/linux/sunrpc/cache.h | 4 +--
net/sunrpc/cache.c | 74 ++++++++++++++++++++++++--------------------
2 files changed, 43 insertions(+), 35 deletions(-)

diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index 04ee5a2..ecc0ff6 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -46,7 +46,7 @@
*
*/
struct cache_head {
- struct cache_head * next;
+ struct list_head cache_list;
time_t expiry_time; /* After time time, don't use the data */
time_t last_refresh; /* If CACHE_PENDING, this is when upcall
* was sent, else this is when update was received
@@ -73,7 +73,7 @@ struct cache_detail_pipefs {
struct cache_detail {
struct module * owner;
int hash_size;
- struct cache_head ** hash_table;
+ struct list_head * hash_table;
rwlock_t hash_lock;

atomic_t inuse; /* active user-space update or lookup */
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 673c2fa..ad2155c 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -44,7 +44,7 @@ static void cache_revisit_request(struct cache_head *item);
static void cache_init(struct cache_head *h)
{
time_t now = seconds_since_boot();
- h->next = NULL;
+ INIT_LIST_HEAD(&h->cache_list);
h->flags = 0;
kref_init(&h->ref);
h->expiry_time = now + CACHE_NEW_EXPIRY;
@@ -54,15 +54,16 @@ static void cache_init(struct cache_head *h)
struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
struct cache_head *key, int hash)
{
- struct cache_head **head, **hp;
struct cache_head *new = NULL, *freeme = NULL;
+ struct cache_head *tmp;
+ struct list_head *head, *pos, *tpos;

head = &detail->hash_table[hash];

read_lock(&detail->hash_lock);

- for (hp=head; *hp != NULL ; hp = &(*hp)->next) {
- struct cache_head *tmp = *hp;
+ list_for_each_safe(pos, tpos, head) {
+ tmp = list_entry(pos, struct cache_head, cache_list);
if (detail->match(tmp, key)) {
if (cache_is_expired(detail, tmp))
/* This entry is expired, we will discard it. */
@@ -88,12 +89,11 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
write_lock(&detail->hash_lock);

/* check if entry appeared while we slept */
- for (hp=head; *hp != NULL ; hp = &(*hp)->next) {
- struct cache_head *tmp = *hp;
+ list_for_each_safe(pos, tpos, head) {
+ tmp = list_entry(pos, struct cache_head, cache_list);
if (detail->match(tmp, key)) {
if (cache_is_expired(detail, tmp)) {
- *hp = tmp->next;
- tmp->next = NULL;
+ list_del_init(&tmp->cache_list);
detail->entries --;
freeme = tmp;
break;
@@ -104,8 +104,8 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
return tmp;
}
}
- new->next = *head;
- *head = new;
+
+ list_add(&new->cache_list, head);
detail->entries++;
cache_get(new);
write_unlock(&detail->hash_lock);
@@ -143,7 +143,6 @@ struct cache_head *sunrpc_cache_update(struct cache_detail *detail,
* If 'old' is not VALID, we update it directly,
* otherwise we need to replace it
*/
- struct cache_head **head;
struct cache_head *tmp;

if (!test_bit(CACHE_VALID, &old->flags)) {
@@ -168,15 +167,13 @@ struct cache_head *sunrpc_cache_update(struct cache_detail *detail,
}
cache_init(tmp);
detail->init(tmp, old);
- head = &detail->hash_table[hash];

write_lock(&detail->hash_lock);
if (test_bit(CACHE_NEGATIVE, &new->flags))
set_bit(CACHE_NEGATIVE, &tmp->flags);
else
detail->update(tmp, new);
- tmp->next = *head;
- *head = tmp;
+ list_add(&tmp->cache_list, &detail->hash_table[hash]);
detail->entries++;
cache_get(tmp);
cache_fresh_locked(tmp, new->expiry_time);
@@ -416,42 +413,44 @@ static int cache_clean(void)
/* find a non-empty bucket in the table */
while (current_detail &&
current_index < current_detail->hash_size &&
- current_detail->hash_table[current_index] == NULL)
+ list_empty(&current_detail->hash_table[current_index]))
current_index++;

/* find a cleanable entry in the bucket and clean it, or set to next bucket */

if (current_detail && current_index < current_detail->hash_size) {
- struct cache_head *ch, **cp;
+ struct cache_head *ch = NULL, *putme = NULL;
+ struct list_head *head, *pos, *tpos;
struct cache_detail *d;

write_lock(&current_detail->hash_lock);

/* Ok, now to clean this strand */

- cp = & current_detail->hash_table[current_index];
- for (ch = *cp ; ch ; cp = & ch->next, ch = *cp) {
+ head = &current_detail->hash_table[current_index];
+ list_for_each_safe(pos, tpos, head) {
+ ch = list_entry(pos, struct cache_head, cache_list);
if (current_detail->nextcheck > ch->expiry_time)
current_detail->nextcheck = ch->expiry_time+1;
if (!cache_is_expired(current_detail, ch))
continue;

- *cp = ch->next;
- ch->next = NULL;
+ list_del_init(pos);
current_detail->entries--;
+ putme = ch;
rv = 1;
break;
}

write_unlock(&current_detail->hash_lock);
d = current_detail;
- if (!ch)
+ if (!putme)
current_index ++;
spin_unlock(&cache_list_lock);
- if (ch) {
- set_bit(CACHE_CLEANED, &ch->flags);
- cache_fresh_unlocked(ch, d);
- cache_put(ch, d);
+ if (putme) {
+ set_bit(CACHE_CLEANED, &putme->flags);
+ cache_fresh_unlocked(putme, d);
+ cache_put(putme, d);
}
} else
spin_unlock(&cache_list_lock);
@@ -1277,6 +1276,7 @@ void *cache_seq_start(struct seq_file *m, loff_t *pos)
unsigned int hash, entry;
struct cache_head *ch;
struct cache_detail *cd = m->private;
+ struct list_head *ptr, *tptr;

read_lock(&cd->hash_lock);
if (!n--)
@@ -1284,19 +1284,22 @@ void *cache_seq_start(struct seq_file *m, loff_t *pos)
hash = n >> 32;
entry = n & ((1LL<<32) - 1);

- for (ch=cd->hash_table[hash]; ch; ch=ch->next)
+ list_for_each_safe(ptr, tptr, &cd->hash_table[hash]) {
+ ch = list_entry(ptr, struct cache_head, cache_list);
if (!entry--)
return ch;
+ }
n &= ~((1LL<<32) - 1);
do {
hash++;
n += 1LL<<32;
} while(hash < cd->hash_size &&
- cd->hash_table[hash]==NULL);
+ list_empty(&cd->hash_table[hash]));
if (hash >= cd->hash_size)
return NULL;
*pos = n+1;
- return cd->hash_table[hash];
+ return list_first_entry_or_null(&cd->hash_table[hash],
+ struct cache_head, cache_list);
}
EXPORT_SYMBOL_GPL(cache_seq_start);

@@ -1308,23 +1311,24 @@ void *cache_seq_next(struct seq_file *m, void *p, loff_t *pos)

if (p == SEQ_START_TOKEN)
hash = 0;
- else if (ch->next == NULL) {
+ else if (list_is_last(&ch->cache_list, &cd->hash_table[hash])) {
hash++;
*pos += 1LL<<32;
} else {
++*pos;
- return ch->next;
+ return list_next_entry(ch, cache_list);
}
*pos &= ~((1LL<<32) - 1);
while (hash < cd->hash_size &&
- cd->hash_table[hash] == NULL) {
+ list_empty(&cd->hash_table[hash])) {
hash++;
*pos += 1LL<<32;
}
if (hash >= cd->hash_size)
return NULL;
++*pos;
- return cd->hash_table[hash];
+ return list_first_entry_or_null(&cd->hash_table[hash],
+ struct cache_head, cache_list);
}
EXPORT_SYMBOL_GPL(cache_seq_next);

@@ -1666,17 +1670,21 @@ EXPORT_SYMBOL_GPL(cache_unregister_net);
struct cache_detail *cache_create_net(struct cache_detail *tmpl, struct net *net)
{
struct cache_detail *cd;
+ int i;

cd = kmemdup(tmpl, sizeof(struct cache_detail), GFP_KERNEL);
if (cd == NULL)
return ERR_PTR(-ENOMEM);

- cd->hash_table = kzalloc(cd->hash_size * sizeof(struct cache_head *),
+ cd->hash_table = kzalloc(cd->hash_size * sizeof(struct list_head),
GFP_KERNEL);
if (cd->hash_table == NULL) {
kfree(cd);
return ERR_PTR(-ENOMEM);
}
+
+ for (i = 0; i < cd->hash_size; i++)
+ INIT_LIST_HEAD(&cd->hash_table[i]);
cd->net = net;
return cd;
}
--
2.4.3


2015-06-25 14:34:29

by Kinglong Mee

Subject: [PATCH 08/10] sunrpc: New helper cache_delete_entry for deleting cache_head directly

A new helper, cache_delete_entry(), for deleting a cache_head from the
cache_detail directly.

It will be used by pin_kill, so it must make sure the cache_detail is
still valid before deleting.

Because pin_kill does not happen often, the performance impact is
acceptable.

Signed-off-by: Kinglong Mee <[email protected]>
---
include/linux/sunrpc/cache.h | 1 +
net/sunrpc/cache.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)

diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index ecc0ff6..5a4b921 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -210,6 +210,7 @@ extern int cache_check(struct cache_detail *detail,
struct cache_head *h, struct cache_req *rqstp);
extern void cache_flush(void);
extern void cache_purge(struct cache_detail *detail);
+extern void cache_delete_entry(struct cache_detail *cd, struct cache_head *h);
#define NEVER (0x7FFFFFFF)
extern void __init cache_initialize(void);
extern int cache_register_net(struct cache_detail *cd, struct net *net);
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index ad2155c..8a27483 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -458,6 +458,36 @@ static int cache_clean(void)
return rv;
}

+void cache_delete_entry(struct cache_detail *detail, struct cache_head *h)
+{
+ struct cache_detail *tmp;
+
+ if (!detail || !h)
+ return;
+
+ spin_lock(&cache_list_lock);
+ list_for_each_entry(tmp, &cache_list, others) {
+ if (tmp == detail)
+ goto found;
+ }
+ spin_unlock(&cache_list_lock);
+ printk(KERN_WARNING "%s: Deleted cache detail %p\n", __func__, detail);
+ return ;
+
+found:
+ write_lock(&detail->hash_lock);
+
+ list_del_init(&h->cache_list);
+ detail->entries--;
+ set_bit(CACHE_CLEANED, &h->flags);
+
+ write_unlock(&detail->hash_lock);
+ spin_unlock(&cache_list_lock);
+
+ cache_put(h, detail);
+}
+EXPORT_SYMBOL_GPL(cache_delete_entry);
+
/*
* We want to regularly clean the cache, so we need to schedule some work ...
*/
--
2.4.3


2015-06-25 14:36:11

by Kinglong Mee

Subject: [PATCH 09/10 v6] sunrpc: Support validate/invalidate for reference change in cache_detail

Add validate/invalidate functions to cache_detail for processing
reference-count changes (increase/decrease; both are called before the
change!)

Signed-off-by: Kinglong Mee <[email protected]>
---
fs/nfsd/export.h | 2 +-
include/linux/sunrpc/cache.h | 11 ++++++++++-
net/sunrpc/auth_gss/svcauth_gss.c | 2 +-
net/sunrpc/cache.c | 12 ++++++------
net/sunrpc/svcauth_unix.c | 2 +-
5 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index 1f52bfc..b559acf 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -105,7 +105,7 @@ static inline void exp_put(struct svc_export *exp)

static inline struct svc_export *exp_get(struct svc_export *exp)
{
- cache_get(&exp->h);
+ cache_get(&exp->h, exp->cd);
return exp;
}
struct svc_export * rqst_exp_find(struct svc_rqst *, int, u32 *);
diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index 5a4b921..f77b2cd 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -101,6 +101,8 @@ struct cache_detail {
int (*match)(struct cache_head *orig, struct cache_head *new);
void (*init)(struct cache_head *orig, struct cache_head *new);
void (*update)(struct cache_head *orig, struct cache_head *new);
+ void (*validate)(struct cache_head *h);
+ void (*invalidate)(struct cache_head *h);

/* fields below this comment are for internal use
* and should not be touched by cache owners
@@ -185,8 +187,11 @@ sunrpc_cache_pipe_upcall(struct cache_detail *detail, struct cache_head *h);

extern void cache_clean_deferred(void *owner);

-static inline struct cache_head *cache_get(struct cache_head *h)
+static inline struct cache_head *cache_get(struct cache_head *h, struct cache_detail *cd)
{
+ if (cd && cd->validate)
+ cd->validate(h);
+
kref_get(&h->ref);
return h;
}
@@ -197,6 +202,10 @@ static inline void cache_put(struct cache_head *h, struct cache_detail *cd)
if (atomic_read(&h->ref.refcount) <= 2 &&
h->expiry_time < cd->nextcheck)
cd->nextcheck = h->expiry_time;
+
+ if (cd->invalidate)
+ cd->invalidate(h);
+
kref_put(&h->ref, cd->cache_put);
}

diff --git a/net/sunrpc/auth_gss/svcauth_gss.c b/net/sunrpc/auth_gss/svcauth_gss.c
index 1095be9..ee1faa2 100644
--- a/net/sunrpc/auth_gss/svcauth_gss.c
+++ b/net/sunrpc/auth_gss/svcauth_gss.c
@@ -1520,7 +1520,7 @@ svcauth_gss_accept(struct svc_rqst *rqstp, __be32 *authp)
goto auth_err;
}
svcdata->rsci = rsci;
- cache_get(&rsci->h);
+ cache_get(&rsci->h, NULL);
rqstp->rq_cred.cr_flavor = gss_svc_to_pseudoflavor(
rsci->mechctx->mech_type,
GSS_C_QOP_DEFAULT,
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 8a27483..cb7f3c0 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -68,7 +68,7 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
if (cache_is_expired(detail, tmp))
/* This entry is expired, we will discard it. */
break;
- cache_get(tmp);
+ cache_get(tmp, detail);
read_unlock(&detail->hash_lock);
return tmp;
}
@@ -98,7 +98,7 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,
freeme = tmp;
break;
}
- cache_get(tmp);
+ cache_get(tmp, detail);
write_unlock(&detail->hash_lock);
cache_put(new, detail);
return tmp;
@@ -107,7 +107,7 @@ struct cache_head *sunrpc_cache_lookup(struct cache_detail *detail,

list_add(&new->cache_list, head);
detail->entries++;
- cache_get(new);
+ cache_get(new, detail);
write_unlock(&detail->hash_lock);

if (freeme)
@@ -175,7 +175,7 @@ struct cache_head *sunrpc_cache_update(struct cache_detail *detail,
detail->update(tmp, new);
list_add(&tmp->cache_list, &detail->hash_table[hash]);
detail->entries++;
- cache_get(tmp);
+ cache_get(tmp, detail);
cache_fresh_locked(tmp, new->expiry_time);
cache_fresh_locked(old, 0);
write_unlock(&detail->hash_lock);
@@ -1204,7 +1204,7 @@ int sunrpc_cache_pipe_upcall(struct cache_detail *detail, struct cache_head *h)
}

crq->q.reader = 0;
- crq->item = cache_get(h);
+ crq->item = cache_get(h, detail);
crq->buf = buf;
crq->len = 0;
crq->readers = 0;
@@ -1382,7 +1382,7 @@ static int c_show(struct seq_file *m, void *p)
seq_printf(m, "# expiry=%ld refcnt=%d flags=%lx\n",
convert_to_wallclock(cp->expiry_time),
atomic_read(&cp->ref.refcount), cp->flags);
- cache_get(cp);
+ cache_get(cp, cd);
if (cache_check(cd, cp, NULL))
/* cache_check does a cache_put on failure */
seq_printf(m, "# ");
diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index 621ca7b..ebba6b7 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -359,7 +359,7 @@ ip_map_cached_get(struct svc_xprt *xprt)
cache_put(&ipm->h, sn->ip_map_cache);
return NULL;
}
- cache_get(&ipm->h);
+ cache_get(&ipm->h, NULL);
}
spin_unlock(&xprt->xpt_lock);
}
--
2.4.3


2015-06-25 14:37:27

by Kinglong Mee

Subject: [PATCH 10/10 v6] nfsd: Allow users to unmount filesystems that nfsd exports are based on

If there are mount points (not exported for NFS) under the pseudo root,
then after a client operates on entries under the root, nobody can
unmount those mount points until the export cache expires.

# cat /etc/exports
/nfs/xfs *(rw,insecure,no_subtree_check,no_root_squash)
/nfs/pnfs *(rw,insecure,no_subtree_check,no_root_squash)
# ll /nfs/
total 0
drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs
drwxr-xr-x. 3 root root 84 Apr 21 22:27 test
drwxr-xr-x. 2 root root 6 Apr 20 22:01 xfs
# mount /dev/sde /nfs/test
# df
Filesystem 1K-blocks Used Available Use% Mounted on
......
/dev/sdd 1038336 32944 1005392 4% /nfs/pnfs
/dev/sdc 10475520 32928 10442592 1% /nfs/xfs
/dev/sde 999320 1284 929224 1% /nfs/test
# mount -t nfs 127.0.0.1:/nfs/ /mnt
# ll /mnt/*/
/mnt/pnfs/:
total 0
-rw-r--r--. 1 root root 0 Apr 21 22:23 attr
drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp

/mnt/xfs/:
total 0
# umount /nfs/test/
umount: /nfs/test/: target is busy
(In some cases useful info about processes that
use the device is found by lsof(8) or fuser(1).)

This is caused by nfsd's export cache holding a reference to
the path (here /nfs/test/), so it cannot be unmounted.

I don't think that's what users expect; they want to umount /nfs/test/.
Bruce thinks users should also be able to umount /nfs/pnfs/ and /nfs/xfs.

Also, use kzalloc instead of kmalloc for all memory allocations.
Thanks to Al Viro for his comments on the fs_pin logic.

v3,
1. use path_get_pin/path_put_unpin for the path pin
2. use kzalloc for memory allocation

v4,
1. add a completion so pin_kill can wait until the reference drops to zero.
2. add a work_struct so pin_kill can decrease the reference indirectly.
3. free svc_export/svc_expkey in pin_kill, not in svc_export_put/svc_expkey_put.
4. svc_export_put/svc_expkey_put go through the pin_kill logic.

v5, same as v4.

v6,
1. Pin to the vfsmnt of the mount point first; when the reference count
increases (== 2), grab a reference to the vfsmnt with mntget. When it
decreases (== 1), drop the reference to the vfsmnt, leaving only the pin.
2. Delete the cache_head directly from the cache_detail.

Right now,
when the reference count of a cache_head increases above 1 (> 1), grab a
reference of the mnt once; when it decreases back to 1 (== 1), drop the
reference of the mnt.

So after that,
when ref > 1, the user cannot umount the filesystem (-EBUSY);
when ref == 1, the cache entry is referenced only by the nfsd cache,
with no other references, so the user can try to umount:
1. before MNT_UMOUNT is set (protected by mount_lock), if the nfsd cache
gets referenced (ref > 1, legitimize_mntget), umount fails with -EBUSY.
2. after MNT_UMOUNT is set, if the nfsd cache gets referenced (ref == 2),
legitimize_mntget fails, the cache entry is set to CACHE_NEGATIVE,
and the reference is dropped, going back to 1,
so pin_kill can delete the cache entry and the umount succeeds.
3. while umounting, if there is no reference to the nfsd cache entry,
pin_kill can delete the cache entry and the umount succeeds.

Signed-off-by: Kinglong Mee <[email protected]>
---
fs/nfsd/export.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++------
fs/nfsd/export.h | 24 +++++++-
2 files changed, 174 insertions(+), 19 deletions(-)

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index 34a384c..f7b1aa8 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -37,15 +37,23 @@
#define EXPKEY_HASHMAX (1 << EXPKEY_HASHBITS)
#define EXPKEY_HASHMASK (EXPKEY_HASHMAX -1)

+static void expkey_destroy(struct svc_expkey *key)
+{
+ auth_domain_put(key->ek_client);
+ kfree_rcu(key, rcu_head);
+}
+
static void expkey_put(struct kref *ref)
{
struct svc_expkey *key = container_of(ref, struct svc_expkey, h.ref);

if (test_bit(CACHE_VALID, &key->h.flags) &&
- !test_bit(CACHE_NEGATIVE, &key->h.flags))
- path_put(&key->ek_path);
- auth_domain_put(key->ek_client);
- kfree(key);
+ !test_bit(CACHE_NEGATIVE, &key->h.flags)) {
+ rcu_read_lock();
+ complete(&key->ek_done);
+ pin_kill(&key->ek_pin);
+ } else
+ expkey_destroy(key);
}

static void expkey_request(struct cache_detail *cd,
@@ -83,7 +91,7 @@ static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen)
return -EINVAL;
mesg[mlen-1] = 0;

- buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
err = -ENOMEM;
if (!buf)
goto out;
@@ -119,6 +127,7 @@ static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen)
if (key.h.expiry_time == 0)
goto out;

+ key.cd = cd;
key.ek_client = dom;
key.ek_fsidtype = fsidtype;
memcpy(key.ek_fsid, buf, len);
@@ -210,6 +219,59 @@ static inline void expkey_init(struct cache_head *cnew,
new->ek_fsidtype = item->ek_fsidtype;

memcpy(new->ek_fsid, item->ek_fsid, sizeof(new->ek_fsid));
+ new->cd = item->cd;
+}
+
+static void expkey_validate(struct cache_head *h)
+{
+ struct svc_expkey *key = container_of(h, struct svc_expkey, h);
+
+ if (!test_bit(CACHE_VALID, &key->h.flags) ||
+ test_bit(CACHE_NEGATIVE, &key->h.flags))
+ return;
+
+ if (atomic_read(&h->ref.refcount) == 1) {
+ mutex_lock(&key->ek_mutex);
+ if (legitimize_mntget(key->ek_path.mnt) == NULL) {
+ printk(KERN_WARNING "%s: Get mnt for %pd2 failed!\n",
+ __func__, key->ek_path.dentry);
+ set_bit(CACHE_NEGATIVE, &h->flags);
+ } else
+ key->ek_mnt_ref = true;
+ mutex_unlock(&key->ek_mutex);
+ }
+}
+
+static void expkey_invalidate(struct cache_head *h)
+{
+ struct svc_expkey *key = container_of(h, struct svc_expkey, h);
+
+ if (atomic_read(&h->ref.refcount) == 2) {
+ mutex_lock(&key->ek_mutex);
+ if (key->ek_mnt_ref) {
+ mntput(key->ek_path.mnt);
+ key->ek_mnt_ref = false;
+ }
+ mutex_unlock(&key->ek_mutex);
+ }
+}
+
+static void expkey_pin_kill(struct fs_pin *pin)
+{
+ struct svc_expkey *key = container_of(pin, struct svc_expkey, ek_pin);
+
+ if (!completion_done(&key->ek_done)) {
+ schedule_work(&key->ek_work);
+ wait_for_completion(&key->ek_done);
+ }
+ path_put_unpin(&key->ek_path, &key->ek_pin);
+ expkey_destroy(key);
+}
+
+static void expkey_close_work(struct work_struct *work)
+{
+ struct svc_expkey *key = container_of(work, struct svc_expkey, ek_work);
+ cache_delete_entry(key->cd, &key->h);
}

static inline void expkey_update(struct cache_head *cnew,
@@ -218,16 +280,20 @@ static inline void expkey_update(struct cache_head *cnew,
struct svc_expkey *new = container_of(cnew, struct svc_expkey, h);
struct svc_expkey *item = container_of(citem, struct svc_expkey, h);

+ init_fs_pin(&new->ek_pin, expkey_pin_kill);
new->ek_path = item->ek_path;
- path_get(&item->ek_path);
+ path_get_pin(&new->ek_path, &new->ek_pin);
}

static struct cache_head *expkey_alloc(void)
{
- struct svc_expkey *i = kmalloc(sizeof(*i), GFP_KERNEL);
- if (i)
+ struct svc_expkey *i = kzalloc(sizeof(*i), GFP_KERNEL);
+ if (i) {
+ INIT_WORK(&i->ek_work, expkey_close_work);
+ init_completion(&i->ek_done);
+ mutex_init(&i->ek_mutex);
return &i->h;
- else
+ } else
return NULL;
}

@@ -243,6 +309,8 @@ static struct cache_detail svc_expkey_cache_template = {
.init = expkey_init,
.update = expkey_update,
.alloc = expkey_alloc,
+ .validate = expkey_validate,
+ .invalidate = expkey_invalidate,
};

static int
@@ -306,14 +374,21 @@ static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc)
fsloc->locations = NULL;
}

-static void svc_export_put(struct kref *ref)
+static void svc_export_destroy(struct svc_export *exp)
{
- struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
- path_put(&exp->ex_path);
auth_domain_put(exp->ex_client);
nfsd4_fslocs_free(&exp->ex_fslocs);
kfree(exp->ex_uuid);
- kfree(exp);
+ kfree_rcu(exp, rcu_head);
+}
+
+static void svc_export_put(struct kref *ref)
+{
+ struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
+
+ rcu_read_lock();
+ complete(&exp->ex_done);
+ pin_kill(&exp->ex_pin);
}

static void svc_export_request(struct cache_detail *cd,
@@ -520,7 +595,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
return -EINVAL;
mesg[mlen-1] = 0;

- buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!buf)
return -ENOMEM;

@@ -694,15 +769,67 @@ static int svc_export_match(struct cache_head *a, struct cache_head *b)
path_equal(&orig->ex_path, &new->ex_path);
}

+static void export_validate(struct cache_head *h)
+{
+ struct svc_export *exp = container_of(h, struct svc_export, h);
+
+ if (test_bit(CACHE_NEGATIVE, &h->flags))
+ return;
+
+ if (atomic_read(&h->ref.refcount) == 1) {
+ mutex_lock(&exp->ex_mutex);
+ if (legitimize_mntget(exp->ex_path.mnt) == NULL) {
+ printk(KERN_WARNING "%s: Get mnt for %pd2 failed!\n",
+ __func__, exp->ex_path.dentry);
+ set_bit(CACHE_NEGATIVE, &h->flags);
+ } else
+ exp->ex_mnt_ref = true;
+ mutex_unlock(&exp->ex_mutex);
+ }
+}
+
+static void export_invalidate(struct cache_head *h)
+{
+ struct svc_export *exp = container_of(h, struct svc_export, h);
+
+ if (atomic_read(&h->ref.refcount) == 2) {
+ mutex_lock(&exp->ex_mutex);
+ if (exp->ex_mnt_ref) {
+ mntput(exp->ex_path.mnt);
+ exp->ex_mnt_ref = false;
+ }
+ mutex_unlock(&exp->ex_mutex);
+ }
+}
+
+static void export_pin_kill(struct fs_pin *pin)
+{
+ struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);
+
+ if (!completion_done(&exp->ex_done)) {
+ schedule_work(&exp->ex_work);
+ wait_for_completion(&exp->ex_done);
+ }
+ path_put_unpin(&exp->ex_path, &exp->ex_pin);
+ svc_export_destroy(exp);
+}
+
+static void export_close_work(struct work_struct *work)
+{
+ struct svc_export *exp = container_of(work, struct svc_export, ex_work);
+ cache_delete_entry(exp->cd, &exp->h);
+}
+
static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
{
struct svc_export *new = container_of(cnew, struct svc_export, h);
struct svc_export *item = container_of(citem, struct svc_export, h);

+ init_fs_pin(&new->ex_pin, export_pin_kill);
kref_get(&item->ex_client->ref);
new->ex_client = item->ex_client;
new->ex_path = item->ex_path;
- path_get(&item->ex_path);
+ path_get_pin(&new->ex_path, &new->ex_pin);
new->ex_fslocs.locations = NULL;
new->ex_fslocs.locations_count = 0;
new->ex_fslocs.migrated = 0;
@@ -740,10 +867,13 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)

static struct cache_head *svc_export_alloc(void)
{
- struct svc_export *i = kmalloc(sizeof(*i), GFP_KERNEL);
- if (i)
+ struct svc_export *i = kzalloc(sizeof(*i), GFP_KERNEL);
+ if (i) {
+ INIT_WORK(&i->ex_work, export_close_work);
+ init_completion(&i->ex_done);
+ mutex_init(&i->ex_mutex);
return &i->h;
- else
+ } else
return NULL;
}

@@ -759,6 +889,8 @@ static struct cache_detail svc_export_cache_template = {
.init = svc_export_init,
.update = export_update,
.alloc = svc_export_alloc,
+ .validate = export_validate,
+ .invalidate = export_invalidate,
};

static int
@@ -809,6 +941,7 @@ exp_find_key(struct cache_detail *cd, struct auth_domain *clp, int fsid_type,
if (!clp)
return ERR_PTR(-ENOENT);

+ key.cd = cd;
key.ek_client = clp;
key.ek_fsidtype = fsid_type;
memcpy(key.ek_fsid, fsidv, key_len(fsid_type));
diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
index b559acf..1b5c5f8 100644
--- a/fs/nfsd/export.h
+++ b/fs/nfsd/export.h
@@ -4,6 +4,7 @@
#ifndef NFSD_EXPORT_H
#define NFSD_EXPORT_H

+#include <linux/fs_pin.h>
#include <linux/sunrpc/cache.h>
#include <uapi/linux/nfsd/export.h>

@@ -46,6 +47,8 @@ struct exp_flavor_info {

struct svc_export {
struct cache_head h;
+ struct cache_detail *cd;
+
struct auth_domain * ex_client;
int ex_flags;
struct path ex_path;
@@ -58,7 +61,16 @@ struct svc_export {
struct exp_flavor_info ex_flavors[MAX_SECINFO_LIST];
enum pnfs_layouttype ex_layout_type;
struct nfsd4_deviceid_map *ex_devid_map;
- struct cache_detail *cd;
+
+ struct fs_pin ex_pin;
+ struct rcu_head rcu_head;
+
+ bool ex_mnt_ref;
+ struct mutex ex_mutex;
+
+ /* For cache_put and fs umounting window */
+ struct completion ex_done;
+ struct work_struct ex_work;
};

/* an "export key" (expkey) maps a filehandlefragement to an
@@ -67,12 +79,22 @@ struct svc_export {
*/
struct svc_expkey {
struct cache_head h;
+ struct cache_detail *cd;

struct auth_domain * ek_client;
int ek_fsidtype;
u32 ek_fsid[6];

struct path ek_path;
+ struct fs_pin ek_pin;
+ struct rcu_head rcu_head;
+
+ bool ek_mnt_ref;
+ struct mutex ek_mutex;
+
+ /* For cache_put and fs umounting window */
+ struct completion ek_done;
+ struct work_struct ek_work;
};

#define EX_ISSYNC(exp) (!((exp)->ex_flags & NFSEXP_ASYNC))
--
2.4.3


2015-07-01 05:47:54

by Al Viro

Subject: Re: [PATCH 10/10 v6] nfsd: Allow users to unmount filesystems that nfsd exports are based on

On Thu, Jun 25, 2015 at 10:37:14PM +0800, Kinglong Mee wrote:
> +static void expkey_validate(struct cache_head *h)
> +{
> + struct svc_expkey *key = container_of(h, struct svc_expkey, h);
> +
> + if (!test_bit(CACHE_VALID, &key->h.flags) ||
> + test_bit(CACHE_NEGATIVE, &key->h.flags))
> + return;
> +
> + if (atomic_read(&h->ref.refcount) == 1) {
> + mutex_lock(&key->ek_mutex);

... followed by kref_get(&h->ref) in caller

> + if (atomic_read(&h->ref.refcount) == 2) {
> + mutex_lock(&key->ek_mutex);

... followed by kref_put() in caller.

Suppose two threads call cache_get() at the same time. Refcount is 1.
Depending on the timing you get either one or both grabbing vfsmount
references. Whichever variant matches the one you want, there is no way
to tell one from another afterwards and they *do* differ in the resulting
vfsmount refcount changes.

Similar to that, suppose the refcount is 3 and two threads call cache_put()
at the same time. If one of them gets through the entire thing (including
kref_put()) before the other gets to atomic_read(), you get the second
see refcount 2 and do that mntput(). If not, _nobody_ will ever see refcount
2 and mntput() is not done.

How can that code possibly be correct? This kind of splitting atomic_read
from increment/decrement (and slapping a sleeping operation in between,
no less) is basically never right. Not unless you have everything serialized
on the outside and do not need the atomic in the first place, which doesn't
seem to be the case here.
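
Spelled out as a timeline (refcount starts at 3, both threads run the v6
cache_put(), which calls ->invalidate() before kref_put()):

	/*
	 *  Thread A                            Thread B
	 *  --------                            --------
	 *  atomic_read() sees 3
	 *  -> invalidate(): ref != 2, skip
	 *  kref_put()   (3 -> 2)
	 *                                      atomic_read() sees 2
	 *                                      -> invalidate(): mntput()
	 *                                      kref_put()   (2 -> 1)
	 *
	 * If B instead reads the refcount before A's kref_put(), both see 3,
	 * nobody ever sees 2, mntput() is never called and the mount
	 * reference is leaked.  The read and the kref operation are not
	 * atomic as a pair.
	 */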

2015-07-02 15:18:09

by Kinglong Mee

Subject: Re: [PATCH 10/10 v6] nfsd: Allow users to unmount filesystems that nfsd exports are based on

On 7/1/2015 13:47, Al Viro wrote:
> On Thu, Jun 25, 2015 at 10:37:14PM +0800, Kinglong Mee wrote:
>> +static void expkey_validate(struct cache_head *h)
>> +{
>> + struct svc_expkey *key = container_of(h, struct svc_expkey, h);
>> +
>> + if (!test_bit(CACHE_VALID, &key->h.flags) ||
>> + test_bit(CACHE_NEGATIVE, &key->h.flags))
>> + return;
>> +
>> + if (atomic_read(&h->ref.refcount) == 1) {
>> + mutex_lock(&key->ek_mutex);
>
> ... followed by kref_get(&h->ref) in caller

Got it.

>
>> + if (atomic_read(&h->ref.refcount) == 2) {
>> + mutex_lock(&key->ek_mutex);
>
> ... followed by kref_put() in caller.

No, it must happen before kref_put.
If kref_put() drops the count to zero, it will free the structure.

>
> Suppose two threads call cache_get() at the same time. Refcount is 1.
> Depending on the timing you get either one or both grabbing vfsmount
> references. Whichever variant matches the one you want, there is no way
> to tell one from another afterwards and they *do* differ in the resulting
> vfsmount refcount changes.
>
> Similar to that, suppose the refcount is 3 and two threads call cache_put()
> at the same time. If one of them gets through the entire thing (including
> kref_put()) before the other gets to atomic_read(), you get the second
> see refcount 2 and do that mntput(). If not, _nobody_ will ever see refcount
> 2 and mntput() is not done.
>
> How can that code possibly be correct? This kind of splitting atomic_read
> from increment/decrement (and slapping a sleeping operation in between,
> no less) is basically never right. Not unless you have everything serialized
> on the outside and do not need the atomic in the first place, which doesn't
> seem to be the case here.

To protect the reference, maybe I will implement a pair of get_ref/put_ref
helpers, analogous to kref_get/kref_put.

+static void expkey_get_ref(struct cache_head *h)
+{
+ struct svc_expkey *key = container_of(h, struct svc_expkey, h);
+
+ mutex_lock(&key->ref_mutex);
+ kref_get(&h->ref);
+
+ if (!test_bit(CACHE_VALID, &key->h.flags) ||
+ test_bit(CACHE_NEGATIVE, &key->h.flags))
+ goto out;
+
+ if (atomic_read(&h->ref.refcount) == 2) {
+ if (legitimize_mntget(key->ek_path.mnt) == NULL) {
+ printk(KERN_WARNING "%s: Get mnt for %pd2 failed!\n",
+ __func__, key->ek_path.dentry);
+ set_bit(CACHE_NEGATIVE, &h->flags);
+ } else
+ key->ek_mnt_ref = true;
+ }
+out:
+ mutex_unlock(&key->ref_mutex);
+}
+
+static void expkey_put_ref(struct cache_head *h)
+{
+ struct svc_expkey *key = container_of(h, struct svc_expkey, h);
+
+ mutex_lock(&key->ref_mutex);
+ if (key->ek_mnt_ref && (atomic_read(&h->ref.refcount) == 2)) {
+ mntput(key->ek_path.mnt);
+ key->ek_mnt_ref = false;
+ }
+
+ if (unlikely(!atomic_dec_and_test(&h->ref.refcount))) {
+ mutex_unlock(&key->ref_mutex);
+ return ;
+ }
+
+ expkey_put(&h->ref);
+}
+

The code for the nfsd exports cache is similar to the expkey code.

thanks,
Kinglong Mee