2018-09-05 23:56:33

by Trond Myklebust

Subject: [PATCH 0/7] Misc NFS + pNFS performance enhancements

Fallout from a bunch of flame graphs...

Trond Myklebust (7):
pNFS: Don't zero out the array in nfs4_alloc_pages()
pNFS: Don't allocate more pages than we need to fit a layoutget
response
NFS: Convert lookups of the lock context to RCU
NFS: Simplify internal check for whether file is open for write
NFS: Convert lookups of the open context to RCU
NFSv4: Convert open state lookup to use RCU
NFSv4: Convert struct nfs4_state to use refcount_t

fs/nfs/delegation.c | 11 ++--
fs/nfs/filelayout/filelayout.c | 1 +
fs/nfs/flexfilelayout/flexfilelayout.c | 1 +
fs/nfs/inode.c | 70 +++++++++++---------------
fs/nfs/nfs4_fs.h | 3 +-
fs/nfs/nfs4proc.c | 38 ++++++++++----
fs/nfs/nfs4state.c | 32 ++++++------
fs/nfs/pnfs.c | 16 ++++--
fs/nfs/pnfs.h | 1 +
include/linux/nfs_fs.h | 2 +
10 files changed, 98 insertions(+), 77 deletions(-)

--
2.17.1


2018-09-05 23:56:46

by Trond Myklebust

Subject: [PATCH 1/7] pNFS: Don't zero out the array in nfs4_alloc_pages()

We don't need a zeroed-out array, since every entry is immediately being
filled. While here, on a mid-loop allocation failure, free only the pages
that were actually allocated rather than the full array length.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/pnfs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 7d9a51e6b847..6b5b2d36f502 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -965,7 +965,7 @@ static struct page **nfs4_alloc_pages(size_t size, gfp_t gfp_flags)
struct page **pages;
int i;

- pages = kcalloc(size, sizeof(struct page *), gfp_flags);
+ pages = kmalloc_array(size, sizeof(struct page *), gfp_flags);
if (!pages) {
dprintk("%s: can't alloc array of %zu pages\n", __func__, size);
return NULL;
@@ -975,7 +975,7 @@ static struct page **nfs4_alloc_pages(size_t size, gfp_t gfp_flags)
pages[i] = alloc_page(gfp_flags);
if (!pages[i]) {
dprintk("%s: failed to allocate page\n", __func__);
- nfs4_free_pages(pages, size);
+ nfs4_free_pages(pages, i);
return NULL;
}
}
--
2.17.1

2018-09-05 23:56:47

by Trond Myklebust

Subject: [PATCH 2/7] pNFS: Don't allocate more pages than we need to fit a layoutget response

For the 'files' and 'flexfiles' layout types, we do not expect the reply
to be any larger than 4k. The block and scsi layout types are a little more
greedy, so we keep allocating the maximum response size for now.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/filelayout/filelayout.c | 1 +
fs/nfs/flexfilelayout/flexfilelayout.c | 1 +
fs/nfs/pnfs.c | 7 +++++++
fs/nfs/pnfs.h | 1 +
4 files changed, 10 insertions(+)

diff --git a/fs/nfs/filelayout/filelayout.c b/fs/nfs/filelayout/filelayout.c
index d175724ff566..61f46facb39c 100644
--- a/fs/nfs/filelayout/filelayout.c
+++ b/fs/nfs/filelayout/filelayout.c
@@ -1164,6 +1164,7 @@ static struct pnfs_layoutdriver_type filelayout_type = {
.id = LAYOUT_NFSV4_1_FILES,
.name = "LAYOUT_NFSV4_1_FILES",
.owner = THIS_MODULE,
+ .max_layoutget_response = 4096, /* 1 page or so... */
.alloc_layout_hdr = filelayout_alloc_layout_hdr,
.free_layout_hdr = filelayout_free_layout_hdr,
.alloc_lseg = filelayout_alloc_lseg,
diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c
index cae43333ef16..86bcba40ca61 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.c
+++ b/fs/nfs/flexfilelayout/flexfilelayout.c
@@ -2356,6 +2356,7 @@ static struct pnfs_layoutdriver_type flexfilelayout_type = {
.name = "LAYOUT_FLEX_FILES",
.owner = THIS_MODULE,
.flags = PNFS_LAYOUTGET_ON_OPEN,
+ .max_layoutget_response = 4096, /* 1 page or so... */
.set_layoutdriver = ff_layout_set_layoutdriver,
.alloc_layout_hdr = ff_layout_alloc_layout_hdr,
.free_layout_hdr = ff_layout_free_layout_hdr,
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 6b5b2d36f502..c5672c02afd6 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -991,6 +991,7 @@ pnfs_alloc_init_layoutget_args(struct inode *ino,
gfp_t gfp_flags)
{
struct nfs_server *server = pnfs_find_server(ino, ctx);
+ size_t max_reply_sz = server->pnfs_curr_ld->max_layoutget_response;
size_t max_pages = max_response_pages(server);
struct nfs4_layoutget *lgp;

@@ -1000,6 +1001,12 @@ pnfs_alloc_init_layoutget_args(struct inode *ino,
if (lgp == NULL)
return NULL;

+ if (max_reply_sz) {
+ size_t npages = (max_reply_sz + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (npages < max_pages)
+ max_pages = npages;
+ }
+
lgp->args.layout.pages = nfs4_alloc_pages(max_pages, gfp_flags);
if (!lgp->args.layout.pages) {
kfree(lgp);
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index ece367ebde69..e2e9fcd5341d 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -125,6 +125,7 @@ struct pnfs_layoutdriver_type {
struct module *owner;
unsigned flags;
unsigned max_deviceinfo_size;
+ unsigned max_layoutget_response;

int (*set_layoutdriver) (struct nfs_server *, const struct nfs_fh *);
int (*clear_layoutdriver) (struct nfs_server *);
--
2.17.1

2018-09-05 23:56:47

by Trond Myklebust

Subject: [PATCH 3/7] NFS: Convert lookups of the lock context to RCU

Speed up lookups of an existing lock context by avoiding the inode->i_lock,
and using RCU instead.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/inode.c | 25 ++++++++++++-------------
include/linux/nfs_fs.h | 1 +
2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b65aee481d13..09b3b7146ff4 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -857,15 +857,14 @@ static void nfs_init_lock_context(struct nfs_lock_context *l_ctx)

static struct nfs_lock_context *__nfs_find_lock_context(struct nfs_open_context *ctx)
{
- struct nfs_lock_context *head = &ctx->lock_context;
- struct nfs_lock_context *pos = head;
+ struct nfs_lock_context *pos;

- do {
+ list_for_each_entry_rcu(pos, &ctx->lock_context.list, list) {
if (pos->lockowner != current->files)
continue;
- refcount_inc(&pos->count);
- return pos;
- } while ((pos = list_entry(pos->list.next, typeof(*pos), list)) != head);
+ if (refcount_inc_not_zero(&pos->count))
+ return pos;
+ }
return NULL;
}

@@ -874,10 +873,10 @@ struct nfs_lock_context *nfs_get_lock_context(struct nfs_open_context *ctx)
struct nfs_lock_context *res, *new = NULL;
struct inode *inode = d_inode(ctx->dentry);

- spin_lock(&inode->i_lock);
+ rcu_read_lock();
res = __nfs_find_lock_context(ctx);
+ rcu_read_unlock();
if (res == NULL) {
- spin_unlock(&inode->i_lock);
new = kmalloc(sizeof(*new), GFP_KERNEL);
if (new == NULL)
return ERR_PTR(-ENOMEM);
@@ -885,14 +884,14 @@ struct nfs_lock_context *nfs_get_lock_context(struct nfs_open_context *ctx)
spin_lock(&inode->i_lock);
res = __nfs_find_lock_context(ctx);
if (res == NULL) {
- list_add_tail(&new->list, &ctx->lock_context.list);
+ list_add_tail_rcu(&new->list, &ctx->lock_context.list);
new->open_context = ctx;
res = new;
new = NULL;
}
+ spin_unlock(&inode->i_lock);
+ kfree(new);
}
- spin_unlock(&inode->i_lock);
- kfree(new);
return res;
}
EXPORT_SYMBOL_GPL(nfs_get_lock_context);
@@ -904,9 +903,9 @@ void nfs_put_lock_context(struct nfs_lock_context *l_ctx)

if (!refcount_dec_and_lock(&l_ctx->count, &inode->i_lock))
return;
- list_del(&l_ctx->list);
+ list_del_rcu(&l_ctx->list);
spin_unlock(&inode->i_lock);
- kfree(l_ctx);
+ kfree_rcu(l_ctx, rcu_head);
}
EXPORT_SYMBOL_GPL(nfs_put_lock_context);

diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a0831e9d19c9..d2f4f88a0e66 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -62,6 +62,7 @@ struct nfs_lock_context {
struct nfs_open_context *open_context;
fl_owner_t lockowner;
atomic_t io_count;
+ struct rcu_head rcu_head;
};

struct nfs4_state;
--
2.17.1

2018-09-05 23:56:48

by Trond Myklebust

Subject: [PATCH 4/7] NFS: Simplify internal check for whether file is open for write

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/inode.c | 10 +---------
1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 09b3b7146ff4..052db41a7f80 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1328,19 +1328,11 @@ static bool nfs_file_has_writers(struct nfs_inode *nfsi)
{
struct inode *inode = &nfsi->vfs_inode;

- assert_spin_locked(&inode->i_lock);
-
if (!S_ISREG(inode->i_mode))
return false;
if (list_empty(&nfsi->open_files))
return false;
- /* Note: This relies on nfsi->open_files being ordered with writers
- * being placed at the head of the list.
- * See nfs_inode_attach_open_context()
- */
- return (list_first_entry(&nfsi->open_files,
- struct nfs_open_context,
- list)->mode & FMODE_WRITE) == FMODE_WRITE;
+ return inode_is_open_for_write(inode);
}

static bool nfs_file_has_buffered_writers(struct nfs_inode *nfsi)
--
2.17.1

2018-09-05 23:56:50

by Trond Myklebust

Subject: [PATCH 6/7] NFSv4: Convert open state lookup to use RCU

Further reduce contention on the inode->i_lock.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4state.c | 12 ++++++------
2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 3a6904173214..1b4737f4cac4 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -191,6 +191,7 @@ struct nfs4_state {
atomic_t count;

wait_queue_head_t waitq;
+ struct rcu_head rcu_head;
};


diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index be92ce4259e9..7feac365038c 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -684,7 +684,7 @@ __nfs4_find_state_byowner(struct inode *inode, struct nfs4_state_owner *owner)
struct nfs_inode *nfsi = NFS_I(inode);
struct nfs4_state *state;

- list_for_each_entry(state, &nfsi->open_states, inode_states) {
+ list_for_each_entry_rcu(state, &nfsi->open_states, inode_states) {
if (state->owner != owner)
continue;
if (!nfs4_valid_open_stateid(state))
@@ -698,7 +698,7 @@ __nfs4_find_state_byowner(struct inode *inode, struct nfs4_state_owner *owner)
static void
nfs4_free_open_state(struct nfs4_state *state)
{
- kfree(state);
+ kfree_rcu(state, rcu_head);
}

struct nfs4_state *
@@ -707,9 +707,9 @@ nfs4_get_open_state(struct inode *inode, struct nfs4_state_owner *owner)
struct nfs4_state *state, *new;
struct nfs_inode *nfsi = NFS_I(inode);

- spin_lock(&inode->i_lock);
+ rcu_read_lock();
state = __nfs4_find_state_byowner(inode, owner);
- spin_unlock(&inode->i_lock);
+ rcu_read_unlock();
if (state)
goto out;
new = nfs4_alloc_open_state();
@@ -720,7 +720,7 @@ nfs4_get_open_state(struct inode *inode, struct nfs4_state_owner *owner)
state = new;
state->owner = owner;
atomic_inc(&owner->so_count);
- list_add(&state->inode_states, &nfsi->open_states);
+ list_add_rcu(&state->inode_states, &nfsi->open_states);
ihold(inode);
state->inode = inode;
spin_unlock(&inode->i_lock);
@@ -746,7 +746,7 @@ void nfs4_put_open_state(struct nfs4_state *state)
if (!atomic_dec_and_lock(&state->count, &owner->so_lock))
return;
spin_lock(&inode->i_lock);
- list_del(&state->inode_states);
+ list_del_rcu(&state->inode_states);
list_del(&state->open_states);
spin_unlock(&inode->i_lock);
spin_unlock(&owner->so_lock);
--
2.17.1

2018-09-05 23:56:51

by Trond Myklebust

Subject: [PATCH 7/7] NFSv4: Convert struct nfs4_state to use refcount_t

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/nfs4_fs.h | 2 +-
fs/nfs/nfs4proc.c | 8 ++++----
fs/nfs/nfs4state.c | 8 ++++----
3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 1b4737f4cac4..8d59c9655ec4 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -188,7 +188,7 @@ struct nfs4_state {
unsigned int n_wronly; /* Number of write-only references */
unsigned int n_rdwr; /* Number of read/write references */
fmode_t state; /* State on the server (R,W, or RW) */
- atomic_t count;
+ refcount_t count;

wait_queue_head_t waitq;
struct rcu_head rcu_head;
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 10c20a5b075d..4da59bd53f98 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1777,7 +1777,7 @@ static struct nfs4_state *nfs4_try_open_cached(struct nfs4_opendata *opendata)
out:
return ERR_PTR(ret);
out_return_state:
- atomic_inc(&state->count);
+ refcount_inc(&state->count);
return state;
}

@@ -1849,7 +1849,7 @@ _nfs4_opendata_reclaim_to_nfs4_state(struct nfs4_opendata *data)
update:
update_open_stateid(state, &data->o_res.stateid, NULL,
data->o_arg.fmode);
- atomic_inc(&state->count);
+ refcount_inc(&state->count);

return state;
}
@@ -1887,7 +1887,7 @@ nfs4_opendata_find_nfs4_state(struct nfs4_opendata *data)
return ERR_CAST(inode);
if (data->state != NULL && data->state->inode == inode) {
state = data->state;
- atomic_inc(&state->count);
+ refcount_inc(&state->count);
} else
state = nfs4_get_open_state(inode, data->owner);
iput(inode);
@@ -1978,7 +1978,7 @@ static struct nfs4_opendata *nfs4_open_recoverdata_alloc(struct nfs_open_context
if (opendata == NULL)
return ERR_PTR(-ENOMEM);
opendata->state = state;
- atomic_inc(&state->count);
+ refcount_inc(&state->count);
return opendata;
}

diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 7feac365038c..6ca3bd076073 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -655,7 +655,7 @@ nfs4_alloc_open_state(void)
state = kzalloc(sizeof(*state), GFP_NOFS);
if (!state)
return NULL;
- atomic_set(&state->count, 1);
+ refcount_set(&state->count, 1);
INIT_LIST_HEAD(&state->lock_states);
spin_lock_init(&state->state_lock);
seqlock_init(&state->seqlock);
@@ -689,7 +689,7 @@ __nfs4_find_state_byowner(struct inode *inode, struct nfs4_state_owner *owner)
continue;
if (!nfs4_valid_open_stateid(state))
continue;
- if (atomic_inc_not_zero(&state->count))
+ if (refcount_inc_not_zero(&state->count))
return state;
}
return NULL;
@@ -743,7 +743,7 @@ void nfs4_put_open_state(struct nfs4_state *state)
struct inode *inode = state->inode;
struct nfs4_state_owner *owner = state->owner;

- if (!atomic_dec_and_lock(&state->count, &owner->so_lock))
+ if (!refcount_dec_and_lock(&state->count, &owner->so_lock))
return;
spin_lock(&inode->i_lock);
list_del_rcu(&state->inode_states);
@@ -1573,7 +1573,7 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
continue;
if (state->state == 0)
continue;
- atomic_inc(&state->count);
+ refcount_inc(&state->count);
spin_unlock(&sp->so_lock);
status = ops->recover_open(sp, state);
if (status >= 0) {
--
2.17.1

2018-09-05 23:56:50

by Trond Myklebust

Subject: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

Reduce contention on the inode->i_lock by ensuring that we use RCU
when looking up the NFS open context.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/delegation.c | 11 ++++++-----
fs/nfs/inode.c | 35 +++++++++++++++--------------------
fs/nfs/nfs4proc.c | 30 ++++++++++++++++++++++++------
fs/nfs/nfs4state.c | 12 ++++++------
fs/nfs/pnfs.c | 5 ++++-
include/linux/nfs_fs.h | 1 +
6 files changed, 56 insertions(+), 38 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index f033f3a69a3b..76d205d1c7bc 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -136,8 +136,8 @@ static int nfs_delegation_claim_opens(struct inode *inode,
int err;

again:
- spin_lock(&inode->i_lock);
- list_for_each_entry(ctx, &nfsi->open_files, list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
state = ctx->state;
if (state == NULL)
continue;
@@ -147,8 +147,9 @@ static int nfs_delegation_claim_opens(struct inode *inode,
continue;
if (!nfs4_stateid_match(&state->stateid, stateid))
continue;
- get_nfs_open_context(ctx);
- spin_unlock(&inode->i_lock);
+ if (!get_nfs_open_context(ctx))
+ continue;
+ rcu_read_unlock();
sp = state->owner;
/* Block nfs4_proc_unlck */
mutex_lock(&sp->so_delegreturn_mutex);
@@ -164,7 +165,7 @@ static int nfs_delegation_claim_opens(struct inode *inode,
return err;
goto again;
}
- spin_unlock(&inode->i_lock);
+ rcu_read_unlock();
return 0;
}

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 052db41a7f80..5b1eee4952b7 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -977,9 +977,9 @@ EXPORT_SYMBOL_GPL(alloc_nfs_open_context);

struct nfs_open_context *get_nfs_open_context(struct nfs_open_context *ctx)
{
- if (ctx != NULL)
- refcount_inc(&ctx->lock_context.count);
- return ctx;
+ if (ctx != NULL && refcount_inc_not_zero(&ctx->lock_context.count))
+ return ctx;
+ return NULL;
}
EXPORT_SYMBOL_GPL(get_nfs_open_context);

@@ -988,13 +988,13 @@ static void __put_nfs_open_context(struct nfs_open_context *ctx, int is_sync)
struct inode *inode = d_inode(ctx->dentry);
struct super_block *sb = ctx->dentry->d_sb;

+ if (!refcount_dec_and_test(&ctx->lock_context.count))
+ return;
if (!list_empty(&ctx->list)) {
- if (!refcount_dec_and_lock(&ctx->lock_context.count, &inode->i_lock))
- return;
- list_del(&ctx->list);
+ spin_lock(&inode->i_lock);
+ list_del_rcu(&ctx->list);
spin_unlock(&inode->i_lock);
- } else if (!refcount_dec_and_test(&ctx->lock_context.count))
- return;
+ }
if (inode != NULL)
NFS_PROTO(inode)->close_context(ctx, is_sync);
if (ctx->cred != NULL)
@@ -1002,7 +1002,7 @@ static void __put_nfs_open_context(struct nfs_open_context *ctx, int is_sync)
dput(ctx->dentry);
nfs_sb_deactive(sb);
kfree(ctx->mdsthreshold);
- kfree(ctx);
+ kfree_rcu(ctx, rcu_head);
}

void put_nfs_open_context(struct nfs_open_context *ctx)
@@ -1026,10 +1026,7 @@ void nfs_inode_attach_open_context(struct nfs_open_context *ctx)
struct nfs_inode *nfsi = NFS_I(inode);

spin_lock(&inode->i_lock);
- if (ctx->mode & FMODE_WRITE)
- list_add(&ctx->list, &nfsi->open_files);
- else
- list_add_tail(&ctx->list, &nfsi->open_files);
+ list_add_tail_rcu(&ctx->list, &nfsi->open_files);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL_GPL(nfs_inode_attach_open_context);
@@ -1050,16 +1047,17 @@ struct nfs_open_context *nfs_find_open_context(struct inode *inode, struct rpc_c
struct nfs_inode *nfsi = NFS_I(inode);
struct nfs_open_context *pos, *ctx = NULL;

- spin_lock(&inode->i_lock);
- list_for_each_entry(pos, &nfsi->open_files, list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(pos, &nfsi->open_files, list) {
if (cred != NULL && pos->cred != cred)
continue;
if ((pos->mode & (FMODE_READ|FMODE_WRITE)) != mode)
continue;
ctx = get_nfs_open_context(pos);
- break;
+ if (ctx)
+ break;
}
- spin_unlock(&inode->i_lock);
+ rcu_read_unlock();
return ctx;
}

@@ -1077,9 +1075,6 @@ void nfs_file_clear_open_context(struct file *filp)
if (ctx->error < 0)
invalidate_inode_pages2(inode->i_mapping);
filp->private_data = NULL;
- spin_lock(&inode->i_lock);
- list_move_tail(&ctx->list, &NFS_I(inode)->open_files);
- spin_unlock(&inode->i_lock);
put_nfs_open_context_sync(ctx);
}
}
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 8220a168282e..10c20a5b075d 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1933,23 +1933,41 @@ nfs4_opendata_to_nfs4_state(struct nfs4_opendata *data)
return ret;
}

-static struct nfs_open_context *nfs4_state_find_open_context(struct nfs4_state *state)
+static struct nfs_open_context *
+nfs4_state_find_open_context_mode(struct nfs4_state *state, fmode_t mode)
{
struct nfs_inode *nfsi = NFS_I(state->inode);
struct nfs_open_context *ctx;

- spin_lock(&state->inode->i_lock);
- list_for_each_entry(ctx, &nfsi->open_files, list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
if (ctx->state != state)
continue;
- get_nfs_open_context(ctx);
- spin_unlock(&state->inode->i_lock);
+ if ((ctx->mode & mode) != mode)
+ continue;
+ if (!get_nfs_open_context(ctx))
+ continue;
+ rcu_read_unlock();
return ctx;
}
- spin_unlock(&state->inode->i_lock);
+ rcu_read_unlock();
return ERR_PTR(-ENOENT);
}

+static struct nfs_open_context *
+nfs4_state_find_open_context(struct nfs4_state *state)
+{
+ struct nfs_open_context *ctx;
+
+ ctx = nfs4_state_find_open_context_mode(state, FMODE_READ|FMODE_WRITE);
+ if (!IS_ERR(ctx))
+ return ctx;
+ ctx = nfs4_state_find_open_context_mode(state, FMODE_WRITE);
+ if (!IS_ERR(ctx))
+ return ctx;
+ return nfs4_state_find_open_context_mode(state, FMODE_READ);
+}
+
static struct nfs4_opendata *nfs4_open_recoverdata_alloc(struct nfs_open_context *ctx,
struct nfs4_state *state, enum open_claim_type4 claim)
{
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 40a08cd483f0..be92ce4259e9 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1437,8 +1437,8 @@ void nfs_inode_find_state_and_recover(struct inode *inode,
struct nfs4_state *state;
bool found = false;

- spin_lock(&inode->i_lock);
- list_for_each_entry(ctx, &nfsi->open_files, list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
state = ctx->state;
if (state == NULL)
continue;
@@ -1456,7 +1456,7 @@ void nfs_inode_find_state_and_recover(struct inode *inode,
nfs4_state_mark_reclaim_nograce(clp, state))
found = true;
}
- spin_unlock(&inode->i_lock);
+ rcu_read_unlock();

nfs_inode_find_delegation_state_and_recover(inode, stateid);
if (found)
@@ -1469,13 +1469,13 @@ static void nfs4_state_mark_open_context_bad(struct nfs4_state *state)
struct nfs_inode *nfsi = NFS_I(inode);
struct nfs_open_context *ctx;

- spin_lock(&inode->i_lock);
- list_for_each_entry(ctx, &nfsi->open_files, list) {
+ rcu_read_lock();
+ list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
if (ctx->state != state)
continue;
set_bit(NFS_CONTEXT_BAD, &ctx->flags);
}
- spin_unlock(&inode->i_lock);
+ rcu_read_unlock();
}

static void nfs4_state_mark_recovery_failed(struct nfs4_state *state, int error)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index c5672c02afd6..06cb90e9bc6e 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1339,6 +1339,7 @@ bool pnfs_roc(struct inode *ino,
if (!nfs_have_layout(ino))
return false;
retry:
+ rcu_read_lock();
spin_lock(&ino->i_lock);
lo = nfsi->layout;
if (!lo || !pnfs_layout_is_valid(lo) ||
@@ -1349,6 +1350,7 @@ bool pnfs_roc(struct inode *ino,
pnfs_get_layout_hdr(lo);
if (test_bit(NFS_LAYOUT_RETURN_LOCK, &lo->plh_flags)) {
spin_unlock(&ino->i_lock);
+ rcu_read_unlock();
wait_on_bit(&lo->plh_flags, NFS_LAYOUT_RETURN,
TASK_UNINTERRUPTIBLE);
pnfs_put_layout_hdr(lo);
@@ -1362,7 +1364,7 @@ bool pnfs_roc(struct inode *ino,
skip_read = true;
}

- list_for_each_entry(ctx, &nfsi->open_files, list) {
+ list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
state = ctx->state;
if (state == NULL)
continue;
@@ -1410,6 +1412,7 @@ bool pnfs_roc(struct inode *ino,

out_noroc:
spin_unlock(&ino->i_lock);
+ rcu_read_unlock();
pnfs_layoutcommit_inode(ino, true);
if (roc) {
struct pnfs_layoutdriver_type *ld = NFS_SERVER(ino)->pnfs_curr_ld;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index d2f4f88a0e66..6e0417c02279 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -83,6 +83,7 @@ struct nfs_open_context {

struct list_head list;
struct nfs4_threshold *mdsthreshold;
+ struct rcu_head rcu_head;
};

struct nfs_open_dir_context {
--
2.17.1

2018-09-06 00:04:58

by Chuck Lever III

Subject: Re: [PATCH 0/7] Misc NFS + pNFS performance enhancements



> On Sep 5, 2018, at 3:23 PM, Trond Myklebust <[email protected]> wrote:
>
> Fallout from a bunch of flame graphs...

Hey, are these in a public git repo I can pull from?


> [...]

--
Chuck Lever

2018-09-06 01:08:45

by Trond Myklebust

Subject: Re: [PATCH 0/7] Misc NFS + pNFS performance enhancements

On Wed, 2018-09-05 at 15:33 -0400, Chuck Lever wrote:
> > On Sep 5, 2018, at 3:23 PM, Trond Myklebust <[email protected]>
> > wrote:
> >
> > Fallout from a bunch of flame graphs...
>
> Hey, are these in a public git repo I can pull from?
>

I've pushed out my 'testing' branch to git.linux-nfs.org. The usual
caveats apply: please do not treat that branch as being stable, assume
it won't be rebased or assume that I won't change the contents. It is
just there for testing purposes.

Cheers
 Trond
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2018-09-07 20:26:14

by Chuck Lever III

Subject: Re: [PATCH 0/7] Misc NFS + pNFS performance enhancements


> On Sep 5, 2018, at 4:36 PM, Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2018-09-05 at 15:33 -0400, Chuck Lever wrote:
>>> On Sep 5, 2018, at 3:23 PM, Trond Myklebust <[email protected]>
>>> wrote:
>>>
>>> Fallout from a bunch of flame graphs...
>>
>> Hey, are these in a public git repo I can pull from?
>>
>
> I've pushed out my 'testing' branch to git.linux-nfs.org. The usual
> caveats apply: please do not treat that branch as being stable, assume
> it won't be rebased or assume that I won't change the contents. It is
> just there for testing purposes.
>
> Cheers
> Trond
Some performance testing results for the full "testing" series.
The fio tests are designed to push the IOPS rate, and the third
is a QD=1 test to measure the latency of 512KB NFS WRITE
operations. All three tests use direct I/O.

The "without fair queuing" kernels have this commit reverted:

commit ae03d238e8a11ddc76668c64ad405cd8412446a6
Author: Trond Myklebust <[email protected]>
AuthorDate: Tue Sep 4 11:47:51 2018 -0400
Commit: Trond Myklebust <[email protected]>
CommitDate: Wed Sep 5 14:37:07 2018 -0400

SUNRPC: Queue fairness for all.


Client: 12-core, two-socket, 56Gb InfiniBand
Server: 4-core, one-socket, 56Gb InfiniBand, tmpfs export

Test: /usr/bin/fio --size=1G --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=8k --rwmixread=70 --iodepth=16 --numjobs=16 --runtime=240 --group_reporting

NFSv3 on RDMA:
Stock v4.19-rc2:
	• read: IOPS=109k, BW=849MiB/s (890MB/s)(11.2GiB/13506msec)
	• write: IOPS=46.6k, BW=364MiB/s (382MB/s)(4915MiB/13506msec)
Trond's kernel (with fair queuing):
	• read: IOPS=83.0k, BW=649MiB/s (680MB/s)(11.2GiB/17676msec)
	• write: IOPS=35.6k, BW=278MiB/s (292MB/s)(4921MiB/17676msec)
Trond's kernel (without fair queuing):
	• read: IOPS=90.5k, BW=707MiB/s (742MB/s)(11.2GiB/16216msec)
	• write: IOPS=38.8k, BW=303MiB/s (318MB/s)(4917MiB/16216msec)

NFSv3 on TCP (IPoIB):
Stock v4.19-rc2:
	• read: IOPS=23.8k, BW=186MiB/s (195MB/s)(11.2GiB/61635msec)
	• write: IOPS=10.2k, BW=79.9MiB/s (83.8MB/s)(4923MiB/61635msec)
Trond's kernel (with fair queuing):
	• read: IOPS=25.9k, BW=202MiB/s (212MB/s)(11.2GiB/56710msec)
	• write: IOPS=11.1k, BW=86.7MiB/s (90.9MB/s)(4916MiB/56710msec)
Trond's kernel (without fair queuing):
	• read: IOPS=25.0k, BW=203MiB/s (213MB/s)(11.2GiB/56492msec)
	• write: IOPS=11.1k, BW=86.0MiB/s (91.2MB/s)(4915MiB/56492msec)


Test: /usr/bin/fio --size=1G --direct=1 --rw=randread --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=1024 --numjobs=16 --runtime=240 --group_reporting

NFSv3 on RDMA:
Stock v4.19-rc2:
	• read: IOPS=149k, BW=580MiB/s (608MB/s)(16.0GiB/28241msec)
Trond's kernel (with fair queuing):
	• read: IOPS=81.5k, BW=318MiB/s (334MB/s)(16.0GiB/51450msec)
Trond's kernel (without fair queuing):
	• read: IOPS=82.4k, BW=322MiB/s (337MB/s)(16.0GiB/50918msec)

NFSv3 on TCP (IPoIB):
Stock v4.19-rc2:
	• read: IOPS=37.2k, BW=145MiB/s (153MB/s)(16.0GiB/112630msec)
Trond's kernel (with fair queuing):
	• read: IOPS=2715, BW=10.6MiB/s (11.1MB/s)(2573MiB/242594msec)
Trond's kernel (without fair queuing):
	• read: IOPS=2869, BW=11.2MiB/s (11.8MB/s)(2724MiB/242979msec)


Test: /home/cel/bin/iozone -M -i0 -s8g -r512k -az -I -N

My kernel: 4.19.0-rc2-00026-g50d68a4
system call latencies in microseconds, N=5:
	• write: mean=602, std=13.0
	• rewrite: mean=541, std=17.3
server round trip latency in microseconds, N=5:
	• RTT: mean=354, std=3.0

Trond's kernel (with fair queuing):
system call latencies in microseconds, N=5:
	• write: mean=572, std=10.6
	• rewrite: mean=533, std=7.9
server round trip latency in microseconds, N=5:
	• RTT: mean=352, std=2.7


--
Chuck Lever

2018-09-10 07:59:36

by Trond Myklebust

Subject: Re: [PATCH 0/7] Misc NFS + pNFS performance enhancements

On Fri, 2018-09-07 at 11:44 -0400, Chuck Lever wrote:
> 
> Client: 12-core, two-socket, 56Gb InfiniBand
> Server: 4-core, one-socket, 56Gb InfiniBand, tmpfs export
> 
> Test: /usr/bin/fio --size=1G --direct=1 --rw=randrw --refill_buffers
> --norandommap --randrepeat=0 --ioengine=libaio --bs=8k --rwmixread=70
> --iodepth=16 --numjobs=16 --runtime=240 --group_reporting
> 
> NFSv3 on RDMA:
> Stock v4.19-rc2:
> 	• read: IOPS=109k, BW=849MiB/s (890MB/s)(11.2GiB/13506msec)
> 	• write: IOPS=46.6k, BW=364MiB/s (382MB/s)(4915MiB/13506msec)
> Trond's kernel (with fair queuing):
> 	• read: IOPS=83.0k, BW=649MiB/s (680MB/s)(11.2GiB/17676msec)
> 	• write: IOPS=35.6k, BW=278MiB/s (292MB/s)(4921MiB/17676msec)
> Trond's kernel (without fair queuing):
> 	• read: IOPS=90.5k, BW=707MiB/s (742MB/s)(11.2GiB/16216msec)
> 	• write: IOPS=38.8k, BW=303MiB/s (318MB/s)(4917MiB/16216msec)
> 
> NFSv3 on TCP (IPoIB):
> Stock v4.19-rc2:
> 	• read: IOPS=23.8k, BW=186MiB/s (195MB/s)(11.2GiB/61635msec)
> 	• write: IOPS=10.2k, BW=79.9MiB/s (83.8MB/s)(4923MiB/61635msec)
> Trond's kernel (with fair queuing):
> 	• read: IOPS=25.9k, BW=202MiB/s (212MB/s)(11.2GiB/56710msec)
> 	• write: IOPS=11.1k, BW=86.7MiB/s (90.9MB/s)(4916MiB/56710msec)
> Trond's kernel (without fair queuing):
> 	• read: IOPS=25.0k, BW=203MiB/s (213MB/s)(11.2GiB/56492msec)
> 	• write: IOPS=11.1k, BW=86.0MiB/s (91.2MB/s)(4915MiB/56492msec)
> 
> 
> Test: /usr/bin/fio --size=1G --direct=1 --rw=randread --refill_buffers
> --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100
> --iodepth=1024 --numjobs=16 --runtime=240 --group_reporting
> 
> NFSv3 on RDMA:
> Stock v4.19-rc2:
> 	• read: IOPS=149k, BW=580MiB/s (608MB/s)(16.0GiB/28241msec)
> Trond's kernel (with fair queuing):
> 	• read: IOPS=81.5k, BW=318MiB/s (334MB/s)(16.0GiB/51450msec)
> Trond's kernel (without fair queuing):
> 	• read: IOPS=82.4k, BW=322MiB/s (337MB/s)(16.0GiB/50918msec)
> 
> NFSv3 on TCP (IPoIB):
> Stock v4.19-rc2:
> 	• read: IOPS=37.2k, BW=145MiB/s (153MB/s)(16.0GiB/112630msec)
> Trond's kernel (with fair queuing):
> 	• read: IOPS=2715, BW=10.6MiB/s (11.1MB/s)(2573MiB/242594msec)
> Trond's kernel (without fair queuing):
> 	• read: IOPS=2869, BW=11.2MiB/s (11.8MB/s)(2724MiB/242979msec)
> 
> 
> Test: /home/cel/bin/iozone -M -i0 -s8g -r512k -az -I -N
> 
> My kernel: 4.19.0-rc2-00026-g50d68a4
> system call latencies in microseconds, N=5:
> 	• write: mean=602, std=13.0
> 	• rewrite: mean=541, std=17.3
> server round trip latency in microseconds, N=5:
> 	• RTT: mean=354, std=3.0
> 
> Trond's kernel (with fair queuing):
> system call latencies in microseconds, N=5:
> 	• write: mean=572, std=10.6
> 	• rewrite: mean=533, std=7.9
> server round trip latency in microseconds, N=5:
> 	• RTT: mean=352, std=2.7

Thanks for testing! I've been spending the last 3 days trying to figure
out why we're seeing regressions with RDMA. I think I have a few
candidates:

- The congestion control was failing to wake up the write lock when we
queue a request that has already been allocated a congestion control
credit.
- The livelock avoidance code in xprt_transmit() was causing the
queueing to break.
- Incorrect return value returned by xprt_transmit() when the queue is
empty causes the request to retry waiting for the lock.
- A race in xprt_prepare_transmit() could cause the request to wait for
the write lock despite having been transmitted by another request.
- The change to convert the write lock into a non-priority queue also
changed the wake up code, causing the request that is granted the lock
to be queued on rpciod, instead of on the low-latency xprtiod
workqueue.

I've fixed all the above. In addition, I've tightened up a few cases
where we were grabbing spinlocks unnecessarily, and I've converted the
reply lookup to use an rbtree in order to reduce the amount of time we
need to hold the xprt->queue_lock.

The new code has been rebased onto 4.19.0-rc3, and is now available on
the 'testing' branch. Would you be able to give it another quick spin?

Thanks!
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com

2018-09-10 21:09:28

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH 0/7] Misc NFS + pNFS performance enhancements



> On Sep 9, 2018, at 9:35 PM, Trond Myklebust <[email protected]> wrote:
> 
> On Fri, 2018-09-07 at 11:44 -0400, Chuck Lever wrote:
>> 
>> Client: 12-core, two-socket, 56Gb InfiniBand
>> Server: 4-core, one-socket, 56Gb InfiniBand, tmpfs export
>> 
>> Test: /usr/bin/fio --size=1G --direct=1 --rw=randrw --refill_buffers
>> --norandommap --randrepeat=0 --ioengine=libaio --bs=8k --rwmixread=70
>> --iodepth=16 --numjobs=16 --runtime=240 --group_reporting
>> 
>> NFSv3 on RDMA:
>> Stock v4.19-rc2:
>> 	• read: IOPS=109k, BW=849MiB/s (890MB/s)(11.2GiB/13506msec)
>> 	• write: IOPS=46.6k, BW=364MiB/s (382MB/s)(4915MiB/13506msec)
>> Trond's kernel (with fair queuing):
>> 	• read: IOPS=83.0k, BW=649MiB/s (680MB/s)(11.2GiB/17676msec)
>> 	• write: IOPS=35.6k, BW=278MiB/s (292MB/s)(4921MiB/17676msec)
>> Trond's kernel (without fair queuing):
>> 	• read: IOPS=90.5k, BW=707MiB/s (742MB/s)(11.2GiB/16216msec)
>> 	• write: IOPS=38.8k, BW=303MiB/s (318MB/s)(4917MiB/16216msec)
>> 
>> NFSv3 on TCP (IPoIB):
>> Stock v4.19-rc2:
>> 	• read: IOPS=23.8k, BW=186MiB/s (195MB/s)(11.2GiB/61635msec)
>> 	• write: IOPS=10.2k, BW=79.9MiB/s (83.8MB/s)(4923MiB/61635msec)
>> Trond's kernel (with fair queuing):
>> 	• read: IOPS=25.9k, BW=202MiB/s (212MB/s)(11.2GiB/56710msec)
>> 	• write: IOPS=11.1k, BW=86.7MiB/s (90.9MB/s)(4916MiB/56710msec)
>> Trond's kernel (without fair queuing):
>> 	• read: IOPS=25.0k, BW=203MiB/s (213MB/s)(11.2GiB/56492msec)
>> 	• write: IOPS=11.1k, BW=86.0MiB/s (91.2MB/s)(4915MiB/56492msec)
>> 
>> 
>> Test: /usr/bin/fio --size=1G --direct=1 --rw=randread --refill_buffers
>> --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100
>> --iodepth=1024 --numjobs=16 --runtime=240 --group_reporting
>> 
>> NFSv3 on RDMA:
>> Stock v4.19-rc2:
>> 	• read: IOPS=149k, BW=580MiB/s (608MB/s)(16.0GiB/28241msec)
>> Trond's kernel (with fair queuing):
>> 	• read: IOPS=81.5k, BW=318MiB/s (334MB/s)(16.0GiB/51450msec)
>> Trond's kernel (without fair queuing):
>> 	• read: IOPS=82.4k, BW=322MiB/s (337MB/s)(16.0GiB/50918msec)
>> 
>> NFSv3 on TCP (IPoIB):
>> Stock v4.19-rc2:
>> 	• read: IOPS=37.2k, BW=145MiB/s (153MB/s)(16.0GiB/112630msec)
>> Trond's kernel (with fair queuing):
>> 	• read: IOPS=2715, BW=10.6MiB/s (11.1MB/s)(2573MiB/242594msec)
>> Trond's kernel (without fair queuing):
>> 	• read: IOPS=2869, BW=11.2MiB/s (11.8MB/s)(2724MiB/242979msec)
>> 
>> 
>> Test: /home/cel/bin/iozone -M -i0 -s8g -r512k -az -I -N
>> 
>> My kernel: 4.19.0-rc2-00026-g50d68a4
>> system call latencies in microseconds, N=5:
>> 	• write: mean=602, std=13.0
>> 	• rewrite: mean=541, std=17.3
>> server round trip latency in microseconds, N=5:
>> 	• RTT: mean=354, std=3.0
>> 
>> Trond's kernel (with fair queuing):
>> system call latencies in microseconds, N=5:
>> 	• write: mean=572, std=10.6
>> 	• rewrite: mean=533, std=7.9
>> server round trip latency in microseconds, N=5:
>> 	• RTT: mean=352, std=2.7
> 
> Thanks for testing! I've been spending the last 3 days trying to figure
> out why we're seeing regressions with RDMA. I think I have a few
> candidates:
> 
> - The congestion control was failing to wake up the write lock when we
> queue a request that has already been allocated a congestion control
> credit.
> - The livelock avoidance code in xprt_transmit() was causing the
> queueing to break.
> - Incorrect return value returned by xprt_transmit() when the queue is
> empty causes the request to retry waiting for the lock.
> - A race in xprt_prepare_transmit() could cause the request to wait for
> the write lock despite having been transmitted by another request.
> - The change to convert the write lock into a non-priority queue also
> changed the wake up code, causing the request that is granted the lock
> to be queued on rpciod, instead of on the low-latency xprtiod
> workqueue.
>=20
> I've fixed all the above. In addition, I've tightened up a few cases
> where we were grabbing spinlocks unnecessarily, and I've converted the
> reply lookup to use an rbtree in order to reduce the amount of time we
> need to hold the xprt->queue_lock.
>=20
> The new code has been rebased onto 4.19.0-rc3, and is now available on
> the 'testing' branch. Would you be able to give it another quick spin?

We're in much better shape now. Compare the stock v4.19-rc2
numbers above with these from your latest testing branch.
The new results show a consistent 10% throughput improvement.

test 1 from above:

NFSv3 on RDMA:
4.19.0-rc3-13903-g11dddfd:
	• read: IOPS=118k, BW=921MiB/s (966MB/s)(11.2GiB/12469msec)
	• write: IOPS=50.3k, BW=393MiB/s (412MB/s)(4899MiB/12469msec)

NFSv3 on TCP (IPoIB):
4.19.0-rc3-13903-g11dddfd:
	• read: IOPS=27.4k, BW=214MiB/s (224MB/s)(11.2GiB/53650msec)
	• write: IOPS=11.7k, BW=91.6MiB/s (96.0MB/s)(4913MiB/53650msec)


test 2 from above:

NFSv3 on RDMA:
4.19.0-rc3-13903-g11dddfd:
	• read: IOPS=163k, BW=636MiB/s (667MB/s)(16.0GiB/25743msec)

NFSv3 on TCP (IPoIB):
4.19.0-rc3-13903-g11dddfd:
	• read: IOPS=44.2k, BW=173MiB/s (181MB/s)(16.0GiB/94898msec)


--
Chuck Lever

2018-09-28 22:59:47

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

Hi Trond,

This patch ends up fixing a problem related to recall of delegations.
This problem has existed in the kernel for quite a while (early RHEL
3.10-based releases show the problem; haven't tried 2.6-based).

If you agree with what I present, then can the commit message for this
patch be changed to reflect that it solves this problem?

Problem can be seen by running "nfstest_delegation --runtests
recall26". Jorge can probably describe the test better but I believe
the main client machine opens a file for read and gets a read
delegation to it and locks it, then from a different process in the
same machine there is an appending open (open for write is sent and
gets a write delegation). From a different machine another client
opens the same file which triggers a CB_RECALL. The main client,
un-patched, ends up simply returning the delegation and never recovers
the lock. This is pretty bad, as it's a possible data corruption case.

In the nfs_delegation_claim_locks() this condition -- if
(nfs_file_open_context(fl->fl_file) != ctx) -- is true when it should
have been false.

There seems to be a disagreement between what's in
nfs_file_open_context(fl->fl_file) and what
get_nfs_open_context(inode's ctx) returns, but I don't see how this
patch fixes it. But it does.
On Wed, Sep 5, 2018 at 3:27 PM Trond Myklebust <[email protected]> wrote:
>
> Reduce contention on the inode->i_lock by ensuring that we use RCU
> when looking up the NFS open context.
>
> Signed-off-by: Trond Myklebust <[email protected]>
> ---
> fs/nfs/delegation.c | 11 ++++++-----
> fs/nfs/inode.c | 35 +++++++++++++++--------------------
> fs/nfs/nfs4proc.c | 30 ++++++++++++++++++++++++------
> fs/nfs/nfs4state.c | 12 ++++++------
> fs/nfs/pnfs.c | 5 ++++-
> include/linux/nfs_fs.h | 1 +
> 6 files changed, 56 insertions(+), 38 deletions(-)
>
> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
> index f033f3a69a3b..76d205d1c7bc 100644
> --- a/fs/nfs/delegation.c
> +++ b/fs/nfs/delegation.c
> @@ -136,8 +136,8 @@ static int nfs_delegation_claim_opens(struct inode *inode,
> int err;
>
> again:
> - spin_lock(&inode->i_lock);
> - list_for_each_entry(ctx, &nfsi->open_files, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
> state = ctx->state;
> if (state == NULL)
> continue;
> @@ -147,8 +147,9 @@ static int nfs_delegation_claim_opens(struct inode *inode,
> continue;
> if (!nfs4_stateid_match(&state->stateid, stateid))
> continue;
> - get_nfs_open_context(ctx);
> - spin_unlock(&inode->i_lock);
> + if (!get_nfs_open_context(ctx))
> + continue;
> + rcu_read_unlock();
> sp = state->owner;
> /* Block nfs4_proc_unlck */
> mutex_lock(&sp->so_delegreturn_mutex);
> @@ -164,7 +165,7 @@ static int nfs_delegation_claim_opens(struct inode *inode,
> return err;
> goto again;
> }
> - spin_unlock(&inode->i_lock);
> + rcu_read_unlock();
> return 0;
> }
>
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index 052db41a7f80..5b1eee4952b7 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -977,9 +977,9 @@ EXPORT_SYMBOL_GPL(alloc_nfs_open_context);
>
> struct nfs_open_context *get_nfs_open_context(struct nfs_open_context *ctx)
> {
> - if (ctx != NULL)
> - refcount_inc(&ctx->lock_context.count);
> - return ctx;
> + if (ctx != NULL && refcount_inc_not_zero(&ctx->lock_context.count))
> + return ctx;
> + return NULL;
> }
> EXPORT_SYMBOL_GPL(get_nfs_open_context);
>
> @@ -988,13 +988,13 @@ static void __put_nfs_open_context(struct nfs_open_context *ctx, int is_sync)
> struct inode *inode = d_inode(ctx->dentry);
> struct super_block *sb = ctx->dentry->d_sb;
>
> + if (!refcount_dec_and_test(&ctx->lock_context.count))
> + return;
> if (!list_empty(&ctx->list)) {
> - if (!refcount_dec_and_lock(&ctx->lock_context.count, &inode->i_lock))
> - return;
> - list_del(&ctx->list);
> + spin_lock(&inode->i_lock);
> + list_del_rcu(&ctx->list);
> spin_unlock(&inode->i_lock);
> - } else if (!refcount_dec_and_test(&ctx->lock_context.count))
> - return;
> + }
> if (inode != NULL)
> NFS_PROTO(inode)->close_context(ctx, is_sync);
> if (ctx->cred != NULL)
> @@ -1002,7 +1002,7 @@ static void __put_nfs_open_context(struct nfs_open_context *ctx, int is_sync)
> dput(ctx->dentry);
> nfs_sb_deactive(sb);
> kfree(ctx->mdsthreshold);
> - kfree(ctx);
> + kfree_rcu(ctx, rcu_head);
> }
>
> void put_nfs_open_context(struct nfs_open_context *ctx)
> @@ -1026,10 +1026,7 @@ void nfs_inode_attach_open_context(struct nfs_open_context *ctx)
> struct nfs_inode *nfsi = NFS_I(inode);
>
> spin_lock(&inode->i_lock);
> - if (ctx->mode & FMODE_WRITE)
> - list_add(&ctx->list, &nfsi->open_files);
> - else
> - list_add_tail(&ctx->list, &nfsi->open_files);
> + list_add_tail_rcu(&ctx->list, &nfsi->open_files);
> spin_unlock(&inode->i_lock);
> }
> EXPORT_SYMBOL_GPL(nfs_inode_attach_open_context);
> @@ -1050,16 +1047,17 @@ struct nfs_open_context *nfs_find_open_context(struct inode *inode, struct rpc_c
> struct nfs_inode *nfsi = NFS_I(inode);
> struct nfs_open_context *pos, *ctx = NULL;
>
> - spin_lock(&inode->i_lock);
> - list_for_each_entry(pos, &nfsi->open_files, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(pos, &nfsi->open_files, list) {
> if (cred != NULL && pos->cred != cred)
> continue;
> if ((pos->mode & (FMODE_READ|FMODE_WRITE)) != mode)
> continue;
> ctx = get_nfs_open_context(pos);
> - break;
> + if (ctx)
> + break;
> }
> - spin_unlock(&inode->i_lock);
> + rcu_read_unlock();
> return ctx;
> }
>
> @@ -1077,9 +1075,6 @@ void nfs_file_clear_open_context(struct file *filp)
> if (ctx->error < 0)
> invalidate_inode_pages2(inode->i_mapping);
> filp->private_data = NULL;
> - spin_lock(&inode->i_lock);
> - list_move_tail(&ctx->list, &NFS_I(inode)->open_files);
> - spin_unlock(&inode->i_lock);
> put_nfs_open_context_sync(ctx);
> }
> }
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 8220a168282e..10c20a5b075d 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -1933,23 +1933,41 @@ nfs4_opendata_to_nfs4_state(struct nfs4_opendata *data)
> return ret;
> }
>
> -static struct nfs_open_context *nfs4_state_find_open_context(struct nfs4_state *state)
> +static struct nfs_open_context *
> +nfs4_state_find_open_context_mode(struct nfs4_state *state, fmode_t mode)
> {
> struct nfs_inode *nfsi = NFS_I(state->inode);
> struct nfs_open_context *ctx;
>
> - spin_lock(&state->inode->i_lock);
> - list_for_each_entry(ctx, &nfsi->open_files, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
> if (ctx->state != state)
> continue;
> - get_nfs_open_context(ctx);
> - spin_unlock(&state->inode->i_lock);
> + if ((ctx->mode & mode) != mode)
> + continue;
> + if (!get_nfs_open_context(ctx))
> + continue;
> + rcu_read_unlock();
> return ctx;
> }
> - spin_unlock(&state->inode->i_lock);
> + rcu_read_unlock();
> return ERR_PTR(-ENOENT);
> }
>
> +static struct nfs_open_context *
> +nfs4_state_find_open_context(struct nfs4_state *state)
> +{
> + struct nfs_open_context *ctx;
> +
> + ctx = nfs4_state_find_open_context_mode(state, FMODE_READ|FMODE_WRITE);
> + if (!IS_ERR(ctx))
> + return ctx;
> + ctx = nfs4_state_find_open_context_mode(state, FMODE_WRITE);
> + if (!IS_ERR(ctx))
> + return ctx;
> + return nfs4_state_find_open_context_mode(state, FMODE_READ);
> +}
> +
> static struct nfs4_opendata *nfs4_open_recoverdata_alloc(struct nfs_open_context *ctx,
> struct nfs4_state *state, enum open_claim_type4 claim)
> {
> diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
> index 40a08cd483f0..be92ce4259e9 100644
> --- a/fs/nfs/nfs4state.c
> +++ b/fs/nfs/nfs4state.c
> @@ -1437,8 +1437,8 @@ void nfs_inode_find_state_and_recover(struct inode *inode,
> struct nfs4_state *state;
> bool found = false;
>
> - spin_lock(&inode->i_lock);
> - list_for_each_entry(ctx, &nfsi->open_files, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
> state = ctx->state;
> if (state == NULL)
> continue;
> @@ -1456,7 +1456,7 @@ void nfs_inode_find_state_and_recover(struct inode *inode,
> nfs4_state_mark_reclaim_nograce(clp, state))
> found = true;
> }
> - spin_unlock(&inode->i_lock);
> + rcu_read_unlock();
>
> nfs_inode_find_delegation_state_and_recover(inode, stateid);
> if (found)
> @@ -1469,13 +1469,13 @@ static void nfs4_state_mark_open_context_bad(struct nfs4_state *state)
> struct nfs_inode *nfsi = NFS_I(inode);
> struct nfs_open_context *ctx;
>
> - spin_lock(&inode->i_lock);
> - list_for_each_entry(ctx, &nfsi->open_files, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
> if (ctx->state != state)
> continue;
> set_bit(NFS_CONTEXT_BAD, &ctx->flags);
> }
> - spin_unlock(&inode->i_lock);
> + rcu_read_unlock();
> }
>
> static void nfs4_state_mark_recovery_failed(struct nfs4_state *state, int error)
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index c5672c02afd6..06cb90e9bc6e 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1339,6 +1339,7 @@ bool pnfs_roc(struct inode *ino,
> if (!nfs_have_layout(ino))
> return false;
> retry:
> + rcu_read_lock();
> spin_lock(&ino->i_lock);
> lo = nfsi->layout;
> if (!lo || !pnfs_layout_is_valid(lo) ||
> @@ -1349,6 +1350,7 @@ bool pnfs_roc(struct inode *ino,
> pnfs_get_layout_hdr(lo);
> if (test_bit(NFS_LAYOUT_RETURN_LOCK, &lo->plh_flags)) {
> spin_unlock(&ino->i_lock);
> + rcu_read_unlock();
> wait_on_bit(&lo->plh_flags, NFS_LAYOUT_RETURN,
> TASK_UNINTERRUPTIBLE);
> pnfs_put_layout_hdr(lo);
> @@ -1362,7 +1364,7 @@ bool pnfs_roc(struct inode *ino,
> skip_read = true;
> }
>
> - list_for_each_entry(ctx, &nfsi->open_files, list) {
> + list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
> state = ctx->state;
> if (state == NULL)
> continue;
> @@ -1410,6 +1412,7 @@ bool pnfs_roc(struct inode *ino,
>
> out_noroc:
> spin_unlock(&ino->i_lock);
> + rcu_read_unlock();
> pnfs_layoutcommit_inode(ino, true);
> if (roc) {
> struct pnfs_layoutdriver_type *ld = NFS_SERVER(ino)->pnfs_curr_ld;
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index d2f4f88a0e66..6e0417c02279 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -83,6 +83,7 @@ struct nfs_open_context {
>
> struct list_head list;
> struct nfs4_threshold *mdsthreshold;
> + struct rcu_head rcu_head;
> };
>
> struct nfs_open_dir_context {
> --
> 2.17.1
>

2018-09-28 23:19:22

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

Also shouldn't this be a bug fix for 4.19 and also go to stable?
On Fri, Sep 28, 2018 at 12:34 PM Olga Kornievskaia <[email protected]> wrote:
>
> Hi Trond,
>
> This patch ends up fixing a problem related to recall of delegations.
> This problem has existed in the kernel for quite awhile (early RHEL
> 3.10-based release show the problem, haven't tried 2.6-based).
>
> If you agree with what I present, then can the commit message for this
> patch be changed to reflect that it solves this problem?
>
> Problem can be seen by running "nfstest_delegation --runtests
> recall26". Jorge can probably describe the test better but I believe
> the main client machine opens a file for read and gets a read
> delegation to it and locks it, then from a different process in the
> same machine there is an appending open (open for write is sent and
> gets a write delegation). From a different machine another client
> opens the same file which triggers a CB_RECALL. The main client,
> un-patched, ends up simply returning the delegation and never recovers
> the lock. This is pretty bad as it's possible data corruption case.
>
> In the nfs_delegation_claim_locks() this condition -- if
> (nfs_file_open_context(fl->fl_file) != ctx) -- is true when it should
> have been false.
>
> There seems to be a disagreement between what's in
> nfs_file_open_context(fl->fl_file) and what
> get_nfs_open_context(inode's ctx) returns but I don't think how this
> patch fixes it. But it does.
> On Wed, Sep 5, 2018 at 3:27 PM Trond Myklebust <[email protected]> wrote:
[quoted patch trimmed; see the full patch quoted in the previous message]

2018-09-29 00:14:55

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Fri, 2018-09-28 at 12:54 -0400, Olga Kornievskaia wrote:
> Also shouldn't this be a bug fix for 4.19 and also go to stable?

It wasn't really intended as a bugfix because I wasn't aware that there
was a bug to fix in the first place. I assume the modification to
nfs4_state_find_open_context() to cause it to look for an open context
with a READ/WRITE open mode first is what fixes the bug. Is that the
case?

> On Fri, Sep 28, 2018 at 12:34 PM Olga Kornievskaia <[email protected]>
> wrote:
> >
> > Hi Trond,
> >
> > This patch ends up fixing a problem related to recall of
> > delegations.
> > This problem has existed in the kernel for quite a while (early RHEL
> > 3.10-based releases show the problem; haven't tried 2.6-based).
> >
> > If you agree with what I present, then can the commit message for
> > this
> > patch be changed to reflect that it solves this problem?
> >
> > Problem can be seen by running "nfstest_delegation --runtests
> > recall26". Jorge can probably describe the test better but I
> > believe
> > the main client machine opens a file for read and gets a read
> > delegation to it and locks it, then from a different process in the
> > same machine there is an appending open (open for write is sent and
> > gets a write delegation). From a different machine another client
> > opens the same file which triggers a CB_RECALL. The main client,
> > un-patched, ends up simply returning the delegation and never
> > recovers
> > the lock. This is pretty bad, as it's a possible data corruption case.
> >
> > In the nfs_delegation_claim_locks() this condition -- if
> > (nfs_file_open_context(fl->fl_file) != ctx) -- is true when it
> > should
> > have been false.
> >
> > There seems to be a disagreement between what's in
> > nfs_file_open_context(fl->fl_file) and what
> > get_nfs_open_context(inode's ctx) returns, but I don't see how
> > this
> > patch fixes it. But it does.
> > On Wed, Sep 5, 2018 at 3:27 PM Trond Myklebust <[email protected]>
> > wrote:
> > >
> > > Reduce contention on the inode->i_lock by ensuring that we use
> > > RCU
> > > when looking up the NFS open context.
> > >
> > > Signed-off-by: Trond Myklebust <[email protected]>
> > > ---
> > > fs/nfs/delegation.c | 11 ++++++-----
> > > fs/nfs/inode.c | 35 +++++++++++++++--------------------
> > > fs/nfs/nfs4proc.c | 30 ++++++++++++++++++++++++------
> > > fs/nfs/nfs4state.c | 12 ++++++------
> > > fs/nfs/pnfs.c | 5 ++++-
> > > include/linux/nfs_fs.h | 1 +
> > > 6 files changed, 56 insertions(+), 38 deletions(-)
> > >
> > > diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
> > > index f033f3a69a3b..76d205d1c7bc 100644
> > > --- a/fs/nfs/delegation.c
> > > +++ b/fs/nfs/delegation.c
> > > @@ -136,8 +136,8 @@ static int nfs_delegation_claim_opens(struct
> > > inode *inode,
> > > int err;
> > >
> > > again:
> > > - spin_lock(&inode->i_lock);
> > > - list_for_each_entry(ctx, &nfsi->open_files, list) {
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
> > > state = ctx->state;
> > > if (state == NULL)
> > > continue;
> > > @@ -147,8 +147,9 @@ static int nfs_delegation_claim_opens(struct
> > > inode *inode,
> > > continue;
> > > if (!nfs4_stateid_match(&state->stateid,
> > > stateid))
> > > continue;
> > > - get_nfs_open_context(ctx);
> > > - spin_unlock(&inode->i_lock);
> > > + if (!get_nfs_open_context(ctx))
> > > + continue;
> > > + rcu_read_unlock();
> > > sp = state->owner;
> > > /* Block nfs4_proc_unlck */
> > > mutex_lock(&sp->so_delegreturn_mutex);
> > > @@ -164,7 +165,7 @@ static int nfs_delegation_claim_opens(struct
> > > inode *inode,
> > > return err;
> > > goto again;
> > > }
> > > - spin_unlock(&inode->i_lock);
> > > + rcu_read_unlock();
> > > return 0;
> > > }
> > >
> > > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> > > index 052db41a7f80..5b1eee4952b7 100644
> > > --- a/fs/nfs/inode.c
> > > +++ b/fs/nfs/inode.c
> > > @@ -977,9 +977,9 @@ EXPORT_SYMBOL_GPL(alloc_nfs_open_context);
> > >
> > > struct nfs_open_context *get_nfs_open_context(struct
> > > nfs_open_context *ctx)
> > > {
> > > - if (ctx != NULL)
> > > - refcount_inc(&ctx->lock_context.count);
> > > - return ctx;
> > > + if (ctx != NULL && refcount_inc_not_zero(&ctx-
> > > >lock_context.count))
> > > + return ctx;
> > > + return NULL;
> > > }
> > > EXPORT_SYMBOL_GPL(get_nfs_open_context);
> > >
> > > @@ -988,13 +988,13 @@ static void __put_nfs_open_context(struct
> > > nfs_open_context *ctx, int is_sync)
> > > struct inode *inode = d_inode(ctx->dentry);
> > > struct super_block *sb = ctx->dentry->d_sb;
> > >
> > > + if (!refcount_dec_and_test(&ctx->lock_context.count))
> > > + return;
> > > if (!list_empty(&ctx->list)) {
> > > - if (!refcount_dec_and_lock(&ctx-
> > > >lock_context.count, &inode->i_lock))
> > > - return;
> > > - list_del(&ctx->list);
> > > + spin_lock(&inode->i_lock);
> > > + list_del_rcu(&ctx->list);
> > > spin_unlock(&inode->i_lock);
> > > - } else if (!refcount_dec_and_test(&ctx-
> > > >lock_context.count))
> > > - return;
> > > + }
> > > if (inode != NULL)
> > > NFS_PROTO(inode)->close_context(ctx, is_sync);
> > > if (ctx->cred != NULL)
> > > @@ -1002,7 +1002,7 @@ static void __put_nfs_open_context(struct
> > > nfs_open_context *ctx, int is_sync)
> > > dput(ctx->dentry);
> > > nfs_sb_deactive(sb);
> > > kfree(ctx->mdsthreshold);
> > > - kfree(ctx);
> > > + kfree_rcu(ctx, rcu_head);
> > > }
> > >
> > > void put_nfs_open_context(struct nfs_open_context *ctx)
> > > @@ -1026,10 +1026,7 @@ void nfs_inode_attach_open_context(struct
> > > nfs_open_context *ctx)
> > > struct nfs_inode *nfsi = NFS_I(inode);
> > >
> > > spin_lock(&inode->i_lock);
> > > - if (ctx->mode & FMODE_WRITE)
> > > - list_add(&ctx->list, &nfsi->open_files);
> > > - else
> > > - list_add_tail(&ctx->list, &nfsi->open_files);
> > > + list_add_tail_rcu(&ctx->list, &nfsi->open_files);
> > > spin_unlock(&inode->i_lock);
> > > }
> > > EXPORT_SYMBOL_GPL(nfs_inode_attach_open_context);
> > > @@ -1050,16 +1047,17 @@ struct nfs_open_context
> > > *nfs_find_open_context(struct inode *inode, struct rpc_c
> > > struct nfs_inode *nfsi = NFS_I(inode);
> > > struct nfs_open_context *pos, *ctx = NULL;
> > >
> > > - spin_lock(&inode->i_lock);
> > > - list_for_each_entry(pos, &nfsi->open_files, list) {
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(pos, &nfsi->open_files, list) {
> > > if (cred != NULL && pos->cred != cred)
> > > continue;
> > > if ((pos->mode & (FMODE_READ|FMODE_WRITE)) !=
> > > mode)
> > > continue;
> > > ctx = get_nfs_open_context(pos);
> > > - break;
> > > + if (ctx)
> > > + break;
> > > }
> > > - spin_unlock(&inode->i_lock);
> > > + rcu_read_unlock();
> > > return ctx;
> > > }
> > >
> > > @@ -1077,9 +1075,6 @@ void nfs_file_clear_open_context(struct
> > > file *filp)
> > > if (ctx->error < 0)
> > > invalidate_inode_pages2(inode-
> > > >i_mapping);
> > > filp->private_data = NULL;
> > > - spin_lock(&inode->i_lock);
> > > - list_move_tail(&ctx->list, &NFS_I(inode)-
> > > >open_files);
> > > - spin_unlock(&inode->i_lock);
> > > put_nfs_open_context_sync(ctx);
> > > }
> > > }
> > > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> > > index 8220a168282e..10c20a5b075d 100644
> > > --- a/fs/nfs/nfs4proc.c
> > > +++ b/fs/nfs/nfs4proc.c
> > > @@ -1933,23 +1933,41 @@ nfs4_opendata_to_nfs4_state(struct
> > > nfs4_opendata *data)
> > > return ret;
> > > }
> > >
> > > -static struct nfs_open_context
> > > *nfs4_state_find_open_context(struct nfs4_state *state)
> > > +static struct nfs_open_context *
> > > +nfs4_state_find_open_context_mode(struct nfs4_state *state,
> > > fmode_t mode)
> > > {
> > > struct nfs_inode *nfsi = NFS_I(state->inode);
> > > struct nfs_open_context *ctx;
> > >
> > > - spin_lock(&state->inode->i_lock);
> > > - list_for_each_entry(ctx, &nfsi->open_files, list) {
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
> > > if (ctx->state != state)
> > > continue;
> > > - get_nfs_open_context(ctx);
> > > - spin_unlock(&state->inode->i_lock);
> > > + if ((ctx->mode & mode) != mode)
> > > + continue;
> > > + if (!get_nfs_open_context(ctx))
> > > + continue;
> > > + rcu_read_unlock();
> > > return ctx;
> > > }
> > > - spin_unlock(&state->inode->i_lock);
> > > + rcu_read_unlock();
> > > return ERR_PTR(-ENOENT);
> > > }
> > >
> > > +static struct nfs_open_context *
> > > +nfs4_state_find_open_context(struct nfs4_state *state)
> > > +{
> > > + struct nfs_open_context *ctx;
> > > +
> > > + ctx = nfs4_state_find_open_context_mode(state,
> > > FMODE_READ|FMODE_WRITE);
> > > + if (!IS_ERR(ctx))
> > > + return ctx;
> > > + ctx = nfs4_state_find_open_context_mode(state,
> > > FMODE_WRITE);
> > > + if (!IS_ERR(ctx))
> > > + return ctx;
> > > + return nfs4_state_find_open_context_mode(state,
> > > FMODE_READ);
> > > +}
> > > +
> > > static struct nfs4_opendata *nfs4_open_recoverdata_alloc(struct
> > > nfs_open_context *ctx,
> > > struct nfs4_state *state, enum open_claim_type4
> > > claim)
> > > {
> > > diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
> > > index 40a08cd483f0..be92ce4259e9 100644
> > > --- a/fs/nfs/nfs4state.c
> > > +++ b/fs/nfs/nfs4state.c
> > > @@ -1437,8 +1437,8 @@ void
> > > nfs_inode_find_state_and_recover(struct inode *inode,
> > > struct nfs4_state *state;
> > > bool found = false;
> > >
> > > - spin_lock(&inode->i_lock);
> > > - list_for_each_entry(ctx, &nfsi->open_files, list) {
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
> > > state = ctx->state;
> > > if (state == NULL)
> > > continue;
> > > @@ -1456,7 +1456,7 @@ void
> > > nfs_inode_find_state_and_recover(struct inode *inode,
> > > nfs4_state_mark_reclaim_nograce(clp, state))
> > > found = true;
> > > }
> > > - spin_unlock(&inode->i_lock);
> > > + rcu_read_unlock();
> > >
> > > nfs_inode_find_delegation_state_and_recover(inode,
> > > stateid);
> > > if (found)
> > > @@ -1469,13 +1469,13 @@ static void
> > > nfs4_state_mark_open_context_bad(struct nfs4_state *state)
> > > struct nfs_inode *nfsi = NFS_I(inode);
> > > struct nfs_open_context *ctx;
> > >
> > > - spin_lock(&inode->i_lock);
> > > - list_for_each_entry(ctx, &nfsi->open_files, list) {
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
> > > if (ctx->state != state)
> > > continue;
> > > set_bit(NFS_CONTEXT_BAD, &ctx->flags);
> > > }
> > > - spin_unlock(&inode->i_lock);
> > > + rcu_read_unlock();
> > > }
> > >
> > > static void nfs4_state_mark_recovery_failed(struct nfs4_state
> > > *state, int error)
> > > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> > > index c5672c02afd6..06cb90e9bc6e 100644
> > > --- a/fs/nfs/pnfs.c
> > > +++ b/fs/nfs/pnfs.c
> > > @@ -1339,6 +1339,7 @@ bool pnfs_roc(struct inode *ino,
> > > if (!nfs_have_layout(ino))
> > > return false;
> > > retry:
> > > + rcu_read_lock();
> > > spin_lock(&ino->i_lock);
> > > lo = nfsi->layout;
> > > if (!lo || !pnfs_layout_is_valid(lo) ||
> > > @@ -1349,6 +1350,7 @@ bool pnfs_roc(struct inode *ino,
> > > pnfs_get_layout_hdr(lo);
> > > if (test_bit(NFS_LAYOUT_RETURN_LOCK, &lo->plh_flags)) {
> > > spin_unlock(&ino->i_lock);
> > > + rcu_read_unlock();
> > > wait_on_bit(&lo->plh_flags, NFS_LAYOUT_RETURN,
> > > TASK_UNINTERRUPTIBLE);
> > > pnfs_put_layout_hdr(lo);
> > > @@ -1362,7 +1364,7 @@ bool pnfs_roc(struct inode *ino,
> > > skip_read = true;
> > > }
> > >
> > > - list_for_each_entry(ctx, &nfsi->open_files, list) {
> > > + list_for_each_entry_rcu(ctx, &nfsi->open_files, list) {
> > > state = ctx->state;
> > > if (state == NULL)
> > > continue;
> > > @@ -1410,6 +1412,7 @@ bool pnfs_roc(struct inode *ino,
> > >
> > > out_noroc:
> > > spin_unlock(&ino->i_lock);
> > > + rcu_read_unlock();
> > > pnfs_layoutcommit_inode(ino, true);
> > > if (roc) {
> > > struct pnfs_layoutdriver_type *ld =
> > > NFS_SERVER(ino)->pnfs_curr_ld;
> > > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> > > index d2f4f88a0e66..6e0417c02279 100644
> > > --- a/include/linux/nfs_fs.h
> > > +++ b/include/linux/nfs_fs.h
> > > @@ -83,6 +83,7 @@ struct nfs_open_context {
> > >
> > > struct list_head list;
> > > struct nfs4_threshold *mdsthreshold;
> > > + struct rcu_head rcu_head;
> > > };
> > >
> > > struct nfs_open_dir_context {
> > > --
> > > 2.17.1
> > >
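[Editor's note] The core pattern in the patch above is taking a reference under RCU: since readers no longer hold inode->i_lock, a lookup can race with the final put, so get_nfs_open_context() must refuse to resurrect a context whose refcount has already hit zero. A minimal userspace sketch of that idiom using C11 atomics (the names `get_ctx`/`put_ctx` are illustrative stand-ins, not the kernel API, which lives in <linux/refcount.h>):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the refcount embedded in struct nfs_open_context. */
struct ctx {
	atomic_int count;
};

/* Mirrors refcount_inc_not_zero(): take a reference only if the object
 * is not already on its way to being freed (count == 0).  This is what
 * makes a lockless RCU lookup safe against a concurrent final put. */
static bool get_ctx(struct ctx *c)
{
	int old = atomic_load(&c->count);

	while (old != 0) {
		/* On failure, 'old' is reloaded and the check repeats. */
		if (atomic_compare_exchange_weak(&c->count, &old, old + 1))
			return true;	/* reference taken */
	}
	return false;	/* object dying; the lookup must skip it */
}

/* Mirrors refcount_dec_and_test(): returns true when the last reference
 * is dropped and the object may be freed (via kfree_rcu() in the patch,
 * so readers still inside rcu_read_lock() remain safe). */
static bool put_ctx(struct ctx *c)
{
	return atomic_fetch_sub(&c->count, 1) == 1;
}
```

In the patch this is exactly why `get_nfs_open_context()` now returns NULL on failure and the list walkers `continue` past dying entries instead of unconditionally bumping the count.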

2018-09-29 00:56:22

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Fri, Sep 28, 2018 at 1:50 PM Trond Myklebust <[email protected]> wrote:
>
> On Fri, 2018-09-28 at 12:54 -0400, Olga Kornievskaia wrote:
> > Also shouldn't this be a bug fix for 4.19 and also go to stable?
>
> It wasn't really intended as a bugfix because I wasn't aware that there
> was a bug to fix in the first place. I assume the modification to
> nfs4_state_find_open_context() to cause it to look for an open context
> with a READ/WRITE open mode first is what fixes the bug. Is that the
> case?

I don't think so. nfs4_state_find_open_context() never gets called
during the run of this test case. This function is called during the
reclaim on an open. In delegation recall the opens are not reclaimed,
it is just reclaiming the locks.

Actually, when I undo this piece of the patch, I can make it fail again.

@@ -1027,10 +1027,7 @@ void nfs_inode_attach_open_context(struct
nfs_open_context *ctx)
struct nfs_inode *nfsi = NFS_I(inode);

spin_lock(&inode->i_lock);
- if (ctx->mode & FMODE_WRITE)
- list_add(&ctx->list, &nfsi->open_files);
- else
- list_add_tail(&ctx->list, &nfsi->open_files);
+ list_add_tail_rcu(&ctx->list, &nfsi->open_files);
spin_unlock(&inode->i_lock);
}
EXPORT_SYMBOL_GPL(nfs_inode_attach_open_context);

It looks like the placement in the list matters. Any ideas why?
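[Editor's note] The hunk quoted above drops the head-vs-tail distinction, and that is what changes behavior. A small sketch (hypothetical names, a plain C list rather than the kernel's <linux/list.h>) of how head vs. tail insertion decides which entry a front-to-back scan finds first:

```c
#include <stddef.h>

#define FMODE_READ  0x1
#define FMODE_WRITE 0x2

/* Toy open context: hypothetical stand-in for struct nfs_open_context
 * linked on nfsi->open_files. */
struct ctx {
	int mode;
	struct ctx *next;
};

struct list {
	struct ctx *head, *tail;
};

/* Like list_add(): insert at the head -- what the pre-patch code did
 * for FMODE_WRITE contexts, so scans found writers first. */
static void add_head(struct list *l, struct ctx *c)
{
	c->next = l->head;
	l->head = c;
	if (!l->tail)
		l->tail = c;
}

/* Like list_add_tail(): insert at the tail -- what the patch now does
 * for every context, so the list keeps plain open order instead. */
static void add_tail(struct list *l, struct ctx *c)
{
	c->next = NULL;
	if (l->tail)
		l->tail->next = c;
	else
		l->head = c;
	l->tail = c;
}

/* A front-to-back scan that takes the first entry it sees, as the
 * pre-patch single-pass lookups effectively did. */
static struct ctx *first(struct list *l)
{
	return l->head;
}
```

With a reader opened before a writer, the old policy makes `first()` return the write context, while tail-only insertion returns the read context, which is the ordering dependence discussed below in the thread.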

>
> > On Fri, Sep 28, 2018 at 12:34 PM Olga Kornievskaia <[email protected]>
> > wrote:
> > >
> > > Hi Trond,
> > >
> > > This patch ends up fixing a problem related to recall of
> > > delegations.
> > > This problem has existed in the kernel for quite a while (early RHEL
> > > 3.10-based releases show the problem; haven't tried 2.6-based).
> > >
> > > If you agree with what I present, then can the commit message for
> > > this
> > > patch be changed to reflect that it solves this problem?
> > >
> > > Problem can be seen by running "nfstest_delegation --runtests
> > > recall26". Jorge can probably describe the test better but I
> > > believe
> > > the main client machine opens a file for read and gets a read
> > > delegation to it and locks it, then from a different process in the
> > > same machine there is an appending open (open for write is sent and
> > > gets a write delegation). From a different machine another client
> > > opens the same file which triggers a CB_RECALL. The main client,
> > > un-patched, ends up simply returning the delegation and never
> > > recovers
> > > the lock. This is pretty bad, as it's a possible data corruption case.
> > >
> > > In the nfs_delegation_claim_locks() this condition -- if
> > > (nfs_file_open_context(fl->fl_file) != ctx) -- is true when it
> > > should
> > > have been false.
> > >
> > > There seems to be a disagreement between what's in
> > > nfs_file_open_context(fl->fl_file) and what
> > > get_nfs_open_context(inode's ctx) returns, but I don't see how
> > > this
> > > patch fixes it. But it does.
> > > On Wed, Sep 5, 2018 at 3:27 PM Trond Myklebust <[email protected]>
> > > wrote:
> > > >
> > > > Reduce contention on the inode->i_lock by ensuring that we use
> > > > RCU
> > > > when looking up the NFS open context.
> > > >
> > > > Signed-off-by: Trond Myklebust <[email protected]>
> > > > ---
> > > > [patch body snipped; quoted in full earlier in the thread]
>

2018-09-29 01:19:06

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Fri, 2018-09-28 at 14:31 -0400, Olga Kornievskaia wrote:
> On Fri, Sep 28, 2018 at 1:50 PM Trond Myklebust <[email protected]>
> wrote:
> >
> > On Fri, 2018-09-28 at 12:54 -0400, Olga Kornievskaia wrote:
> > > Also shouldn't this be a bug fix for 4.19 and also go to stable?
> >
> > It wasn't really intended as a bugfix because I wasn't aware that
> > there
> > was a bug to fix in the first place. I assume the modification to
> > nfs4_state_find_open_context() to cause it to look for an open
> > context
> > with a READ/WRITE open mode first is what fixes the bug. Is that
> > the
> > case?
>
> I don't think so. nfs4_state_find_open_context() never gets called
> during the run of this test case. This function is called during the
> reclaim on an open. In delegation recall the opens are not reclaimed,
> it is just reclaiming the locks.
>
> Actually, when I undo this piece of the patch, I can make it fail
> again.
>
> @@ -1027,10 +1027,7 @@ void nfs_inode_attach_open_context(struct
> nfs_open_context *ctx)
>         struct nfs_inode *nfsi = NFS_I(inode);
>
>         spin_lock(&inode->i_lock);
> -       if (ctx->mode & FMODE_WRITE)
> -               list_add(&ctx->list, &nfsi->open_files);
> -       else
> -               list_add_tail(&ctx->list, &nfsi->open_files);
> +       list_add_tail_rcu(&ctx->list, &nfsi->open_files);
>         spin_unlock(&inode->i_lock);
>  }
>  EXPORT_SYMBOL_GPL(nfs_inode_attach_open_context);
>
> It looks like the placement in the list matters. Any ideas why?

Currently, the list is ordered so that writeable open contexts are
always found first. The reason is that when traversing the list during
server reboot recovery, we want to ensure that we reclaim any write
delegations on the file first so that we can cache all subsequent opens
and locks of that file.

A delegation return requires us to recover all cached state, so it must
reclaim both OPEN and LOCK state. It is not, however, expected to
depend on the open context list ordering, since there can be no further
caching.
IOW: no, I don't see why your bug would depend on the list order unless
some part of the recovery is actually failing.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
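[Editor's note] The patch compensates for losing the writers-first list order by making nfs4_state_find_open_context() try open modes in priority order instead of relying on list position. A hedged userspace sketch of that three-pass fallback (array instead of a linked list; names are illustrative):

```c
#include <stddef.h>

#define FMODE_READ  0x1
#define FMODE_WRITE 0x2

/* Toy open context carrying only the open mode. */
struct ctx {
	int mode;
};

/* One pass: return the first context whose mode covers all requested
 * bits, mirroring nfs4_state_find_open_context_mode(). */
static struct ctx *find_mode(struct ctx *ctxs, size_t n, int mode)
{
	for (size_t i = 0; i < n; i++) {
		if ((ctxs[i].mode & mode) == mode)
			return &ctxs[i];
	}
	return NULL;
}

/* Three passes in priority order: READ|WRITE, then WRITE, then READ.
 * The preference is now explicit in the lookup, so contexts can all be
 * appended at the tail of open_files without changing which one wins. */
static struct ctx *find_ctx(struct ctx *ctxs, size_t n)
{
	struct ctx *c;

	c = find_mode(ctxs, n, FMODE_READ | FMODE_WRITE);
	if (c)
		return c;
	c = find_mode(ctxs, n, FMODE_WRITE);
	if (c)
		return c;
	return find_mode(ctxs, n, FMODE_READ);
}
```

Given contexts opened in the order {READ, WRITE}, this still selects the write context first, independent of insertion order.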

2018-09-29 01:36:06

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Fri, Sep 28, 2018 at 2:54 PM Trond Myklebust <[email protected]> wrote:
>
> On Fri, 2018-09-28 at 14:31 -0400, Olga Kornievskaia wrote:
> > On Fri, Sep 28, 2018 at 1:50 PM Trond Myklebust <[email protected]>
> > wrote:
> > >
> > > On Fri, 2018-09-28 at 12:54 -0400, Olga Kornievskaia wrote:
> > > > Also shouldn't this be a bug fix for 4.19 and also go to stable?
> > >
> > > It wasn't really intended as a bugfix because I wasn't aware that
> > > there
> > > was a bug to fix in the first place. I assume the modification to
> > > nfs4_state_find_open_context() to cause it to look for an open
> > > context
> > > with a READ/WRITE open mode first is what fixes the bug. Is that
> > > the
> > > case?
> >
> > I don't think so. nfs4_state_find_open_context() never gets called
> > during the run of this test case. This function is called during the
> > reclaim on an open. In delegation recall the opens are not reclaimed,
> > it is just reclaiming the locks.
> >
> > Actually, when I undo this piece of the patch, I can make it fail
> > again.
> >
> > @@ -1027,10 +1027,7 @@ void nfs_inode_attach_open_context(struct
> > nfs_open_context *ctx)
> > struct nfs_inode *nfsi = NFS_I(inode);
> >
> > spin_lock(&inode->i_lock);
> > - if (ctx->mode & FMODE_WRITE)
> > - list_add(&ctx->list, &nfsi->open_files);
> > - else
> > - list_add_tail(&ctx->list, &nfsi->open_files);
> > + list_add_tail_rcu(&ctx->list, &nfsi->open_files);
> > spin_unlock(&inode->i_lock);
> > }
> > EXPORT_SYMBOL_GPL(nfs_inode_attach_open_context);
> >
> > It looks like the placement in the list matters? Any ideas why?
>
> Currently, the list is ordered so that writeable open contexts are
> always found first. The reason is that when traversing the list during
> server reboot recovery, we want to ensure that we reclaim any write
> delegations on the file first so that we can cache all subsequent opens
> and locks of that file.
>
> A delegation return requires us to recover all cached state, so it must
> reclaim both OPEN and LOCK state. It is not, however, expected to
> depend on the open context list ordering, since there can be no further
> caching.
> IOW: no, I don't see why your bug would depend on the list order unless
> some part of the recovery is actually failing.

Ok thanks. Need to fix the lack of OPEN reclaim.

> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>
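The ordering question in this exchange comes down to list_add() versus list_add_tail(): the old code put writeable contexts at the head so walks found them first, while the RCU conversion appends every context at the tail. The two insertion policies can be sketched in userspace as follows; the structures and helper names here are illustrative stand-ins, not the kernel's:

```c
#include <stddef.h>

/* Illustrative model only: "struct ctx" stands in for struct
 * nfs_open_context, and the two attach helpers mirror the old
 * (list_add for writers, list_add_tail for readers) and new
 * (list_add_tail_rcu for everyone) insertion policies. */

#define FMODE_READ  1
#define FMODE_WRITE 2

struct ctx {
    int mode;
    struct ctx *next;
};

/* Old behaviour: writers go to the head, readers to the tail. */
static void attach_old(struct ctx **head, struct ctx *c)
{
    if (c->mode & FMODE_WRITE) {        /* like list_add() */
        c->next = *head;
        *head = c;
    } else {                            /* like list_add_tail() */
        struct ctx **p = head;
        while (*p)
            p = &(*p)->next;
        c->next = NULL;
        *p = c;
    }
}

/* New behaviour: everything goes to the tail. */
static void attach_new(struct ctx **head, struct ctx *c)
{
    struct ctx **p = head;
    while (*p)
        p = &(*p)->next;
    c->next = NULL;
    *p = c;
}

/* Attach a reader, then a writer, with the chosen policy, and report
 * whether the writer ended up at the head of the list. */
static int writer_at_head(int new_policy)
{
    struct ctx r = { FMODE_READ, NULL };
    struct ctx w = { FMODE_WRITE, NULL };
    struct ctx *head = NULL;

    if (new_policy) {
        attach_new(&head, &r);
        attach_new(&head, &w);
    } else {
        attach_old(&head, &r);
        attach_old(&head, &w);
    }
    return head == &w;
}
```

Under the old policy a later writer jumps ahead of earlier readers; under the new policy list position simply reflects open order, which is the behavioural difference the rest of the thread turns on.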

2018-09-29 02:21:17

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Fri, Sep 28, 2018 at 3:10 PM Olga Kornievskaia <[email protected]> wrote:
>
> On Fri, Sep 28, 2018 at 2:54 PM Trond Myklebust <[email protected]> wrote:
> >
> > On Fri, 2018-09-28 at 14:31 -0400, Olga Kornievskaia wrote:
> > > On Fri, Sep 28, 2018 at 1:50 PM Trond Myklebust <[email protected]>
> > > wrote:
> > > >
> > > > On Fri, 2018-09-28 at 12:54 -0400, Olga Kornievskaia wrote:
> > > > > Also shouldn't this be a bug fix for 4.19 and also go to stable?
> > > >
> > > > It wasn't really intended as a bugfix because I wasn't aware that
> > > > there
> > > > was a bug to fix in the first place. I assume the modification to
> > > > nfs4_state_find_open_context() to cause it to look for an open
> > > > context
> > > > with a READ/WRITE open mode first is what fixes the bug. Is that
> > > > the
> > > > case?
> > >
> > > I don't think so. nfs4_state_find_open_context() never gets calls
> > > during the run of this test case. This function is called during the
> > > reclaim on an open. In delegation recall the opens are not reclaimed,
> > > it is just reclaiming the locks.
> > >
> > > Actually, when I undo this piece of the patch, I can make it fail
> > > again.
> > >
> > > @@ -1027,10 +1027,7 @@ void nfs_inode_attach_open_context(struct
> > > nfs_open_context *ctx)
> > > struct nfs_inode *nfsi = NFS_I(inode);
> > >
> > > spin_lock(&inode->i_lock);
> > > - if (ctx->mode & FMODE_WRITE)
> > > - list_add(&ctx->list, &nfsi->open_files);
> > > - else
> > > - list_add_tail(&ctx->list, &nfsi->open_files);
> > > + list_add_tail_rcu(&ctx->list, &nfsi->open_files);
> > > spin_unlock(&inode->i_lock);
> > > }
> > > EXPORT_SYMBOL_GPL(nfs_inode_attach_open_context);
> > >
> > > It looks like the placement in the list matters? Any ideas why?
> >
> > Currently, the list is ordered so that writeable open contexts are
> > always found first. The reason is that when traversing the list during
> > server reboot recovery, we want to ensure that we reclaim any write
> > delegations on the file first so that we can cache all subsequent opens
> > and locks of that file.
> >
> > A delegation return requires us to recover all cached state, so it must
> > reclaim both OPEN and LOCK state. It is not, however, expected to
> > depend on the open context list ordering, since there can be no further
> > caching.
> > IOW: no, I don't see why your bug would depend on the list order unless
> > some part of the recovery is actually failing.
>
> Ok thanks. Need to fix the lack of OPEN reclaim.

Wait, why are we supposed to reclaim the open state when we have a
valid open stateid? We don't have any cached opens that the server
doesn't know about. RFC7530 says "if the file has other open
reference", and I think the emphasis is on "other". I don't believe we
need to be sending anything besides the locks to the server. Then I'm
back to square one.

>
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > [email protected]
> >
> >

2018-09-29 02:32:52

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Fri, 2018-09-28 at 15:55 -0400, Olga Kornievskaia wrote:
> On Fri, Sep 28, 2018 at 3:10 PM Olga Kornievskaia <[email protected]>
> wrote:
> >
> > On Fri, Sep 28, 2018 at 2:54 PM Trond Myklebust <
> > [email protected]> wrote:
> > >
> > > On Fri, 2018-09-28 at 14:31 -0400, Olga Kornievskaia wrote:
> > > > On Fri, Sep 28, 2018 at 1:50 PM Trond Myklebust <
> > > > [email protected]>
> > > > wrote:
> > > > >
> > > > > On Fri, 2018-09-28 at 12:54 -0400, Olga Kornievskaia wrote:
> > > > > > Also shouldn't this be a bug fix for 4.19 and also go to
> > > > > > stable?
> > > > >
> > > > > It wasn't really intended as a bugfix because I wasn't aware
> > > > > that
> > > > > there
> > > > > was a bug to fix in the first place. I assume the
> > > > > modification to
> > > > > nfs4_state_find_open_context() to cause it to look for an
> > > > > open
> > > > > context
> > > > > with a READ/WRITE open mode first is what fixes the bug. Is
> > > > > that
> > > > > the
> > > > > case?
> > > >
> > > > I don't think so. nfs4_state_find_open_context() never gets
> > > > calls
> > > > during the run of this test case. This function is called
> > > > during the
> > > > reclaim on an open. In delegation recall the opens are not
> > > > reclaimed,
> > > > it is just reclaiming the locks.
> > > >
> > > > Actually, when I undo this piece of the patch, I can make it
> > > > fail
> > > > again.
> > > >
> > > > @@ -1027,10 +1027,7 @@ void
> > > > nfs_inode_attach_open_context(struct
> > > > nfs_open_context *ctx)
> > > >         struct nfs_inode *nfsi = NFS_I(inode);
> > > >
> > > >         spin_lock(&inode->i_lock);
> > > > -       if (ctx->mode & FMODE_WRITE)
> > > > -               list_add(&ctx->list, &nfsi->open_files);
> > > > -       else
> > > > -               list_add_tail(&ctx->list, &nfsi->open_files);
> > > > +       list_add_tail_rcu(&ctx->list, &nfsi->open_files);
> > > >         spin_unlock(&inode->i_lock);
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(nfs_inode_attach_open_context);
> > > >
> > > > It looks like the placement in the list matters? Any ideas why?
> > >
> > > Currently, the list is ordered so that writeable open contexts
> > > are
> > > always found first. The reason is that when traversing the list
> > > during
> > > server reboot recovery, we want to ensure that we reclaim any
> > > write
> > > delegations on the file first so that we can cache all subsequent
> > > opens
> > > and locks of that file.
> > >
> > > A delegation return requires us to recover all cached state, so
> > > it must
> > > reclaim both OPEN and LOCK state. It is not, however, expected to
> > > depend on the open context list ordering, since there can be no
> > > further
> > > caching.
> > > IOW: no, I don't see why your bug would depend on the list order
> > > unless
> > > some part of the recovery is actually failing.
> >
> > Ok thanks. Need to fix the lack of OPEN reclaim.
>
> Wait, why are we suppose to reclaim the open state when we have a
> valid open stateid? We don't have any cached opens that server
> doesn't
> know about. RFC7530 says "if the file has other open reference", I
> think the emphasis is on "other". I don't believe we need to be
> sending anything besides the locks to the server. Then I'm back to
> square one.

Holding a delegation does not imply that we hold an open stateid. Under
Linux, the open stateid gets closed as soon as the application closes
the file.

The delegation, on the other hand, is retained until either it is
recalled, or we see that the file has not been used for 2 lease
periods.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2018-09-29 02:44:49

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Fri, Sep 28, 2018 at 4:07 PM Trond Myklebust <[email protected]> wrote:
>
> On Fri, 2018-09-28 at 15:55 -0400, Olga Kornievskaia wrote:
> > On Fri, Sep 28, 2018 at 3:10 PM Olga Kornievskaia <[email protected]>
> > wrote:
> > >
> > > On Fri, Sep 28, 2018 at 2:54 PM Trond Myklebust <
> > > [email protected]> wrote:
> > > >
> > > > On Fri, 2018-09-28 at 14:31 -0400, Olga Kornievskaia wrote:
> > > > > On Fri, Sep 28, 2018 at 1:50 PM Trond Myklebust <
> > > > > [email protected]>
> > > > > wrote:
> > > > > >
> > > > > > On Fri, 2018-09-28 at 12:54 -0400, Olga Kornievskaia wrote:
> > > > > > > Also shouldn't this be a bug fix for 4.19 and also go to
> > > > > > > stable?
> > > > > >
> > > > > > It wasn't really intended as a bugfix because I wasn't aware
> > > > > > that
> > > > > > there
> > > > > > was a bug to fix in the first place. I assume the
> > > > > > modification to
> > > > > > nfs4_state_find_open_context() to cause it to look for an
> > > > > > open
> > > > > > context
> > > > > > with a READ/WRITE open mode first is what fixes the bug. Is
> > > > > > that
> > > > > > the
> > > > > > case?
> > > > >
> > > > > I don't think so. nfs4_state_find_open_context() never gets
> > > > > calls
> > > > > during the run of this test case. This function is called
> > > > > during the
> > > > > reclaim on an open. In delegation recall the opens are not
> > > > > reclaimed,
> > > > > it is just reclaiming the locks.
> > > > >
> > > > > Actually, when I undo this piece of the patch, I can make it
> > > > > fail
> > > > > again.
> > > > >
> > > > > @@ -1027,10 +1027,7 @@ void
> > > > > nfs_inode_attach_open_context(struct
> > > > > nfs_open_context *ctx)
> > > > > struct nfs_inode *nfsi = NFS_I(inode);
> > > > >
> > > > > spin_lock(&inode->i_lock);
> > > > > - if (ctx->mode & FMODE_WRITE)
> > > > > - list_add(&ctx->list, &nfsi->open_files);
> > > > > - else
> > > > > - list_add_tail(&ctx->list, &nfsi->open_files);
> > > > > + list_add_tail_rcu(&ctx->list, &nfsi->open_files);
> > > > > spin_unlock(&inode->i_lock);
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(nfs_inode_attach_open_context);
> > > > >
> > > > > It looks like the placement in the list matters? Any ideas why?
> > > >
> > > > Currently, the list is ordered so that writeable open contexts
> > > > are
> > > > always found first. The reason is that when traversing the list
> > > > during
> > > > server reboot recovery, we want to ensure that we reclaim any
> > > > write
> > > > delegations on the file first so that we can cache all subsequent
> > > > opens
> > > > and locks of that file.
> > > >
> > > > A delegation return requires us to recover all cached state, so
> > > > it must
> > > > reclaim both OPEN and LOCK state. It is not, however, expected to
> > > > depend on the open context list ordering, since there can be no
> > > > further
> > > > caching.
> > > > IOW: no, I don't see why your bug would depend on the list order
> > > > unless
> > > > some part of the recovery is actually failing.
> > >
> > > Ok thanks. Need to fix the lack of OPEN reclaim.
> >
> > Wait, why are we suppose to reclaim the open state when we have a
> > valid open stateid? We don't have any cached opens that server
> > doesn't
> > know about. RFC7530 says "if the file has other open reference", I
> > think the emphasis is on "other". I don't believe we need to be
> > sending anything besides the locks to the server. Then I'm back to
> > square one.
>
> Holding a delegation does not imply that we hold an open stateid. Under
> Linux, the open stateid gets closed as soon as the application closes
> the file.
>
> The delegation, on the other hand, is retained until either it is
> recalled, or we see that the file has not been used for 2 lease
> periods.

Ok, I agree with all of it, but I'm saying it doesn't need to be
reclaimed unconditionally. Or are you saying that's what the Linux
client does? In this test case, the file hasn't been closed or
expired. I'm stating that the client has a valid open stateid and
should only be required to reclaim the locks (which, with this patch,
it does).

>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2018-09-29 03:04:20

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Fri, 2018-09-28 at 16:19 -0400, Olga Kornievskaia wrote:
> On Fri, Sep 28, 2018 at 4:07 PM Trond Myklebust <
> [email protected]> wrote:
> >
> > On Fri, 2018-09-28 at 15:55 -0400, Olga Kornievskaia wrote:
> > > On Fri, Sep 28, 2018 at 3:10 PM Olga Kornievskaia <[email protected]
> > > >
> > > wrote:
> > > >
> > > Wait, why are we suppose to reclaim the open state when we have a
> > > valid open stateid? We don't have any cached opens that server
> > > doesn't
> > > know about. RFC7530 says "if the file has other open reference",
> > > I
> > > think the emphasis is on "other". I don't believe we need to be
> > > sending anything besides the locks to the server. Then I'm back
> > > to
> > > square one.
> >
> > Holding a delegation does not imply that we hold an open stateid.
> > Under
> > Linux, the open stateid gets closed as soon as the application
> > closes
> > the file.
> >
> > The delegation, on the other hand, is retained until either it is
> > recalled, or we see that the file has not been used for 2 lease
> > periods.
>
> Ok I agree with all of it but I'm saying it doesn't need to be
> reclaimed unconditionally or are you saying that's what the linux
> client does? In this test case, the file hasn't been closed or
> expired. I'm stating that the client has a valid open stateid and
> should only be required to reclaim the locks (which with this patch
> it
> does).

As I said earlier, the client is required to recover all _cached_ open
and lock state. If it already holds an open stateid, then it should not
need to reclaim the open modes that are covered by that stateid,
however it may still need to reclaim those open modes that were not
already subject to an explicit OPEN call.

IOW: If the file was first opened with an open(O_RDRW) call by the
application, but a second application then opened it using
open(O_WRONLY), then we may already hold a stateid with a
"SHARE_ACCESS_BOTH" open mode, however we will still need to send a
reclaim for the cached SHARE_ACCESS_WRITE mode, so that a later
OPEN_DOWNGRADE(SHARE_ACCESS_WRITE) can succeed.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
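The share-mode reclaim rule discussed in this exchange (an explicit open(O_RDWR) yields a SHARE_ACCESS_BOTH stateid, yet a cached WRITE-mode open must still be reclaimed so a later OPEN_DOWNGRADE can succeed) can be sketched as a small helper. The constants mirror the NFSv4 share-access values; the helper itself is purely illustrative, not kernel code:

```c
#include <stdbool.h>

/* Illustrative helper only.  The share-access constants mirror the
 * NFSv4 values; needs_reclaim() encodes the rule that a cached open
 * mode must be reclaimed unless that exact mode was already sent in
 * an explicit OPEN (SHARE_ACCESS_BOTH does not substitute for a
 * separate WRITE-mode open when a later OPEN_DOWNGRADE must work). */

#define SHARE_ACCESS_READ  1
#define SHARE_ACCESS_WRITE 2
#define SHARE_ACCESS_BOTH  3

static bool needs_reclaim(int cached_mode, const int *opened, int nopened)
{
    for (int i = 0; i < nopened; i++)
        if (opened[i] == cached_mode)
            return false;
    return true;
}

/* The example from the mail: one explicit open(O_RDWR) established a
 * SHARE_ACCESS_BOTH stateid; ask whether a given cached mode still
 * needs an OPEN reclaim. */
static bool example_needs_reclaim(int cached_mode)
{
    int opened[1] = { SHARE_ACCESS_BOTH };

    return needs_reclaim(cached_mode, opened, 1);
}
```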

2018-10-04 01:28:45

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

Hi Trond,

Here's why the ordering of the "open_files" list matters and
changes/fixes the existing problem.

When we first open the file for writing and get a delegation, it's the
first one on the list. When we open the file again with the same mode
type, then (before the patch) the new entry is inserted before what's
already on the list. Both of these opens share the same nfs4_state,
which is marked delegated.

Once we receive a delegation recall, in delegation_claim_opens() we
walk the list. The first one will be the 2nd open. It's marked
delegated, but after calling nfs4_open_delegation_recall() the
delegation flag is cleared. The 2nd open doesn't have the lock
associated with it, so no lock is reclaimed. We now go to the 2nd
entry in the open_files list, which is the 1st open, but now the
delegation flag is cleared, so we never recover the lock.

Any of the opens on the open_files list can be the lock holder, and we
can't clear the delegation flag on the first treatment of the
delegated open, because it might not be the owner of the lock.

I'm trying to figure out how I would fix it but I thought I'd send
this for your comments.
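The failure mode described above can be modeled in a few lines. In this sketch (illustrative structures, not the kernel's), two open contexts share one delegated nfs4_state, only the first open owns the lock, and the walk clears the delegation flag at the first delegated context it visits while reclaiming only that context's locks:

```c
#include <stdbool.h>

/* Illustrative model only: two open contexts share one delegated
 * nfs4_state; only one of them is the lock owner.  recall_walk()
 * mimics the walk described above: the first context still marked
 * delegated clears the flag and reclaims only its own locks. */

struct state { bool delegated; };

struct octx {
    struct state *st;
    bool owns_lock;
};

/* Visit the contexts in list order; return true if the lock owner's
 * lock was reclaimed. */
static bool recall_walk(struct octx *order[], int n)
{
    bool lock_reclaimed = false;

    for (int i = 0; i < n; i++) {
        struct octx *c = order[i];

        if (!c->st->delegated)
            continue;                /* flag already cleared: skipped */
        c->st->delegated = false;    /* nfs4_open_delegation_recall() */
        if (c->owns_lock)
            lock_reclaimed = true;   /* locks matching *this* context */
    }
    return lock_reclaimed;
}

/* Build the two-open scenario and run the walk with the lock owner
 * either first (tail insertion, after the patch) or second (head
 * insertion, before the patch). */
static bool lock_recovered(bool lock_owner_first)
{
    struct state st = { true };
    struct octx c1 = { &st, true };   /* 1st open: holds the lock */
    struct octx c2 = { &st, false };  /* 2nd open: no lock */
    struct octx *order[2];

    order[0] = lock_owner_first ? &c1 : &c2;
    order[1] = lock_owner_first ? &c2 : &c1;
    return recall_walk(order, 2);
}
```

With the old head insertion the 2nd open is visited first, the shared flag is cleared, and the lock owner's pass is skipped; with tail insertion the lock owner is visited while the state is still marked delegated, which is why the list order masks or exposes the bug.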

On Fri, Sep 28, 2018 at 4:38 PM Trond Myklebust <[email protected]> wrote:
>
> On Fri, 2018-09-28 at 16:19 -0400, Olga Kornievskaia wrote:
> > On Fri, Sep 28, 2018 at 4:07 PM Trond Myklebust <
> > [email protected]> wrote:
> > >
> > > On Fri, 2018-09-28 at 15:55 -0400, Olga Kornievskaia wrote:
> > > > On Fri, Sep 28, 2018 at 3:10 PM Olga Kornievskaia <[email protected]
> > > > >
> > > > wrote:
> > > > >
> > > > Wait, why are we suppose to reclaim the open state when we have a
> > > > valid open stateid? We don't have any cached opens that server
> > > > doesn't
> > > > know about. RFC7530 says "if the file has other open reference",
> > > > I
> > > > think the emphasis is on "other". I don't believe we need to be
> > > > sending anything besides the locks to the server. Then I'm back
> > > > to
> > > > square one.
> > >
> > > Holding a delegation does not imply that we hold an open stateid.
> > > Under
> > > Linux, the open stateid gets closed as soon as the application
> > > closes
> > > the file.
> > >
> > > The delegation, on the other hand, is retained until either it is
> > > recalled, or we see that the file has not been used for 2 lease
> > > periods.
> >
> > Ok I agree with all of it but I'm saying it doesn't need to be
> > reclaimed unconditionally or are you saying that's what the linux
> > client does? In this test case, the file hasn't been closed or
> > expired. I'm stating that the client has a valid open stateid and
> > should only be required to reclaim the locks (which with this patch
> > it
> > does).
>
> As I said earlier, the client is required to recover all _cached_ open
> and lock state. If it already holds an open stateid, then it should not
> need to reclaim the open modes that are covered by that stateid,
> however it may still need to reclaim those open modes that were not
> already subject to an explicit OPEN call.
>
> IOW: If the file was first opened with an open(O_RDRW) call by the
> application, but a second application then opened it using
> open(O_WRONLY), then we may already hold a stateid with a
> "SHARE_ACCESS_BOTH" open mode, however we will still need to send a
> reclaim for the cached SHARE_ACCESS_WRITE mode, so that a later
> OPEN_DOWNGRADE(SHARE_ACCESS_WRITE) can succeed.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2018-10-04 04:55:37

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 2/7] pNFS: Don't allocate more pages than we need to fit a layoutget response

On Wed, Sep 05 2018, Trond Myklebust wrote:

> For the 'files' and 'flexfiles' layout types, we do not expect the reply
> to be any larger than 4k. The block and scsi layout types are a little more
> greedy, so we keep allocating the maximum response size for now.
>
> Signed-off-by: Trond Myklebust <[email protected]>
> ---
> fs/nfs/filelayout/filelayout.c | 1 +
> fs/nfs/flexfilelayout/flexfilelayout.c | 1 +
> fs/nfs/pnfs.c | 7 +++++++
> fs/nfs/pnfs.h | 1 +
> 4 files changed, 10 insertions(+)
>
> diff --git a/fs/nfs/filelayout/filelayout.c b/fs/nfs/filelayout/filelayout.c
> index d175724ff566..61f46facb39c 100644
> --- a/fs/nfs/filelayout/filelayout.c
> +++ b/fs/nfs/filelayout/filelayout.c
> @@ -1164,6 +1164,7 @@ static struct pnfs_layoutdriver_type filelayout_type = {
> .id = LAYOUT_NFSV4_1_FILES,
> .name = "LAYOUT_NFSV4_1_FILES",
> .owner = THIS_MODULE,
> + .max_layoutget_response = 4096, /* 1 page or so... */

If it is really "1 page", then surely it should be PAGE_SIZE.
But some builds have PAGE_SIZE > 4096. So if it is really "probably not
larger than 4K", then surely the comment should say that.

Confused.

Thanks,
NeilBrown


> .alloc_layout_hdr = filelayout_alloc_layout_hdr,
> .free_layout_hdr = filelayout_free_layout_hdr,
> .alloc_lseg = filelayout_alloc_lseg,
> diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c
> index cae43333ef16..86bcba40ca61 100644
> --- a/fs/nfs/flexfilelayout/flexfilelayout.c
> +++ b/fs/nfs/flexfilelayout/flexfilelayout.c
> @@ -2356,6 +2356,7 @@ static struct pnfs_layoutdriver_type flexfilelayout_type = {
> .name = "LAYOUT_FLEX_FILES",
> .owner = THIS_MODULE,
> .flags = PNFS_LAYOUTGET_ON_OPEN,
> + .max_layoutget_response = 4096, /* 1 page or so... */
> .set_layoutdriver = ff_layout_set_layoutdriver,
> .alloc_layout_hdr = ff_layout_alloc_layout_hdr,
> .free_layout_hdr = ff_layout_free_layout_hdr,
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 6b5b2d36f502..c5672c02afd6 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -991,6 +991,7 @@ pnfs_alloc_init_layoutget_args(struct inode *ino,
> gfp_t gfp_flags)
> {
> struct nfs_server *server = pnfs_find_server(ino, ctx);
> + size_t max_reply_sz = server->pnfs_curr_ld->max_layoutget_response;
> size_t max_pages = max_response_pages(server);
> struct nfs4_layoutget *lgp;
>
> @@ -1000,6 +1001,12 @@ pnfs_alloc_init_layoutget_args(struct inode *ino,
> if (lgp == NULL)
> return NULL;
>
> + if (max_reply_sz) {
> + size_t npages = (max_reply_sz + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (npages < max_pages)
> + max_pages = npages;
> + }
> +
> lgp->args.layout.pages = nfs4_alloc_pages(max_pages, gfp_flags);
> if (!lgp->args.layout.pages) {
> kfree(lgp);
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index ece367ebde69..e2e9fcd5341d 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -125,6 +125,7 @@ struct pnfs_layoutdriver_type {
> struct module *owner;
> unsigned flags;
> unsigned max_deviceinfo_size;
> + unsigned max_layoutget_response;
>
> int (*set_layoutdriver) (struct nfs_server *, const struct nfs_fh *);
> int (*clear_layoutdriver) (struct nfs_server *);
> --
> 2.17.1


Attachments:
signature.asc (832.00 B)
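Neil's question concerns the rounding in the pnfs_alloc_init_layoutget_args() hunk quoted above. The clamp can be lifted into a standalone sketch, with the page size passed in as a parameter so the PAGE_SIZE > 4096 case can be checked directly (the kernel code uses PAGE_SIZE and PAGE_SHIFT, of course):

```c
#include <stddef.h>

/* Sketch of the clamp from pnfs_alloc_init_layoutget_args(), with the
 * page size passed in as a parameter so the PAGE_SIZE > 4096 case can
 * be exercised; the kernel code uses PAGE_SIZE/PAGE_SHIFT directly. */
static size_t layoutget_npages(size_t max_reply_sz, size_t max_pages,
                               size_t page_size)
{
    if (max_reply_sz) {
        size_t npages = (max_reply_sz + page_size - 1) / page_size;

        if (npages < max_pages)
            max_pages = npages;
    }
    return max_pages;
}
```

The cap rounds up and only ever shrinks the allocation: a 4096-byte max_layoutget_response yields one page whether pages are 4k or 64k, and a zero value (block/scsi layouts) leaves the original max_pages untouched.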

2018-10-04 22:16:21

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

Hi Olga,

On Wed, 2018-10-03 at 14:38 -0400, Olga Kornievskaia wrote:
> Hi Trond,
>
> Here's why the ordering of the "open_files" list matters and
> changes/fixes the existing problem.
>
> When we first open the file for writing and get a delegation, it's
> the
> first one on the list. When we opened the file again for the same
> mode
> type, then before the patch, the new entry is inserted before what's
> already on the list. Both of these files share the same nfs4_state
> that's marked delegated.
>
> Once we receive a delegation recall, in delegation_claim_opens() we
> walk the list. First one will be the 2nd open. It's marked delegated
> but after calling nfs4_open_delegation_recall() the delegation flag
> is
> cleared. The 2nd open doesn't have the lock associated with it. So no
> lock is reclaimed. We now go to the 2nd entry in the open_file list
> which is the 1st open but now the delegation flag is cleared so we
> never recover the lock.
>
> Any of the opens on the open_list can be the lock holder and we can't
> clear the delegation flag on the first treatment of the delegated
> open
> because it might not be the owner of the lock.
>
> I'm trying to figure out how I would fix it but I thought I'd send
> this for your comments.

The expectation here is that once we find an open context with a
stateid that needs to be reclaimed or recovered, we recover _all_ the
state associated with that stateid.
IOW: the expectation is that we first look at the open state, and
(depending on whether this is a write delegation or a read delegation)
run through a set of calls to nfs4_open_recover_helper() that will
recover all outstanding open state for this file.
We then iterate through all the lock stateids for the file and recover
those.

So why are we holding the open context, and not just pinning the
stateid while we perform the recovery? The main reason is to ensure
that we also pin the path. The stateid holds a reference to the inode,
but recovery can require us to perform an OPEN based on path (e.g. when
using NFS4_OPEN_CLAIM_DELEGATE_CUR). Hence the utility of the open
context, which carries a reference to a struct dentry.

>
> On Fri, Sep 28, 2018 at 4:38 PM Trond Myklebust <
> [email protected]> wrote:
> >
> > On Fri, 2018-09-28 at 16:19 -0400, Olga Kornievskaia wrote:
> > > On Fri, Sep 28, 2018 at 4:07 PM Trond Myklebust <
> > > [email protected]> wrote:
> > > >
> > > > On Fri, 2018-09-28 at 15:55 -0400, Olga Kornievskaia wrote:
> > > > > On Fri, Sep 28, 2018 at 3:10 PM Olga Kornievskaia <
> > > > > [email protected]
> > > > > >
> > > > >
> > > > > wrote:
> > > > > >
> > > > >
> > > > > Wait, why are we suppose to reclaim the open state when we
> > > > > have a
> > > > > valid open stateid? We don't have any cached opens that
> > > > > server
> > > > > doesn't
> > > > > know about. RFC7530 says "if the file has other open
> > > > > reference",
> > > > > I
> > > > > think the emphasis is on "other". I don't believe we need to
> > > > > be
> > > > > sending anything besides the locks to the server. Then I'm
> > > > > back
> > > > > to
> > > > > square one.
> > > >
> > > > Holding a delegation does not imply that we hold an open
> > > > stateid.
> > > > Under
> > > > Linux, the open stateid gets closed as soon as the application
> > > > closes
> > > > the file.
> > > >
> > > > The delegation, on the other hand, is retained until either it
> > > > is
> > > > recalled, or we see that the file has not been used for 2 lease
> > > > periods.
> > >
> > > Ok I agree with all of it but I'm saying it doesn't need to be
> > > reclaimed unconditionally or are you saying that's what the linux
> > > client does? In this test case, the file hasn't been closed or
> > > expired. I'm stating that the client has a valid open stateid and
> > > should only be required to reclaim the locks (which with this
> > > patch
> > > it
> > > does).
> >
> > As I said earlier, the client is required to recover all _cached_
> > open
> > and lock state. If it already holds an open stateid, then it should
> > not
> > need to reclaim the open modes that are covered by that stateid,
> > however it may still need to reclaim those open modes that were not
> > already subject to an explicit OPEN call.
> >
> > IOW: If the file was first opened with an open(O_RDRW) call by the
> > application, but a second application then opened it using
> > open(O_WRONLY), then we may already hold a stateid with a
> > "SHARE_ACCESS_BOTH" open mode, however we will still need to send a
> > reclaim for the cached SHARE_ACCESS_WRITE mode, so that a later
> > OPEN_DOWNGRADE(SHARE_ACCESS_WRITE) can succeed.
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > [email protected]
> >
> >
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2018-10-04 22:43:16

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Thu, Oct 4, 2018 at 11:22 AM Trond Myklebust <[email protected]> wrote:
>
> Hi Olga,
>
> On Wed, 2018-10-03 at 14:38 -0400, Olga Kornievskaia wrote:
> > Hi Trond,
> >
> > Here's why the ordering of the "open_files" list matters and
> > changes/fixes the existing problem.
> >
> > When we first open the file for writing and get a delegation, it's
> > the
> > first one on the list. When we opened the file again for the same
> > mode
> > type, then before the patch, the new entry is inserted before what's
> > already on the list. Both of these files share the same nfs4_state
> > that's marked delegated.
> >
> > Once we receive a delegation recall, in delegation_claim_opens() we
> > walk the list. First one will be the 2nd open. It's marked delegated
> > but after calling nfs4_open_delegation_recall() the delegation flag
> > is
> > cleared. The 2nd open doesn't have the lock associated with it. So no
> > lock is reclaimed. We now go to the 2nd entry in the open_file list
> > which is the 1st open but now the delegation flag is cleared so we
> > never recover the lock.
> >
> > Any of the opens on the open_list can be the lock holder and we can't
> > clear the delegation flag on the first treatment of the delegated
> > open
> > because it might not be the owner of the lock.
> >
> > I'm trying to figure out how I would fix it but I thought I'd send
> > this for your comments.
>
> The expectation here is that once we find an open context with a
> stateid that needs to be reclaimed or recovered, we recover _all_ the
> state associated with that stateid.
> IOW: the expectation is that we first look at the open state, and
> (depending on whether this is a write delegation or a read delegation)
> run through a set of calls to nfs4_open_recover_helper() that will
> recover all outstanding open state for this file.

That's true. I see that it will recover all the outstanding opens for this file.

> We then iterate through all the lock stateids for the file and recover
> those.

However, this is not true, because we pass a specific
nfs_open_context into nfs_delegation_claim_locks() and, while
looping through the list of locks for the file, we compare whether the
open context associated with the file lock is the same as the passed-in
context. The passed-in context is that of the first nfs_open_context
that was marked delegated, and not necessarily the context that holds
the locks. That's the problem.

While we are looping through the locks for the file, we need to be
checking against any and all of the nfs_open_contexts that are associated
with the file and recovering those locks. I'm still not sure how to do
it.

>
> So why are we holding the open context, and not just pinning the
> stateid while we perform the recovery? The main reason is to ensure
> that we also pin the path. The stateid holds a reference to the inode,
> but recovery can require us to perform an OPEN based on path (e.g. when
> using NFS4_OPEN_CLAIM_DELEGATE_CUR). Hence the utility of the open
> context, which carries a reference to a struct dentry.
>
> >
> > On Fri, Sep 28, 2018 at 4:38 PM Trond Myklebust <
> > [email protected]> wrote:
> > >
> > > On Fri, 2018-09-28 at 16:19 -0400, Olga Kornievskaia wrote:
> > > > On Fri, Sep 28, 2018 at 4:07 PM Trond Myklebust <
> > > > [email protected]> wrote:
> > > > >
> > > > > On Fri, 2018-09-28 at 15:55 -0400, Olga Kornievskaia wrote:
> > > > > > On Fri, Sep 28, 2018 at 3:10 PM Olga Kornievskaia <
> > > > > > [email protected]
> > > > > > >
> > > > > >
> > > > > > wrote:
> > > > > > >
> > > > > >
> > > > > > Wait, why are we suppose to reclaim the open state when we
> > > > > > have a
> > > > > > valid open stateid? We don't have any cached opens that
> > > > > > server
> > > > > > doesn't
> > > > > > know about. RFC7530 says "if the file has other open
> > > > > > reference",
> > > > > > I
> > > > > > think the emphasis is on "other". I don't believe we need to
> > > > > > be
> > > > > > sending anything besides the locks to the server. Then I'm
> > > > > > back
> > > > > > to
> > > > > > square one.
> > > > >
> > > > > Holding a delegation does not imply that we hold an open
> > > > > stateid.
> > > > > Under
> > > > > Linux, the open stateid gets closed as soon as the application
> > > > > closes
> > > > > the file.
> > > > >
> > > > > The delegation, on the other hand, is retained until either it
> > > > > is
> > > > > recalled, or we see that the file has not been used for 2 lease
> > > > > periods.
> > > >
> > > > Ok I agree with all of it but I'm saying it doesn't need to be
> > > > reclaimed unconditionally or are you saying that's what the linux
> > > > client does? In this test case, the file hasn't been closed or
> > > > expired. I'm stating that the client has a valid open stateid and
> > > > should only be required to reclaim the locks (which with this
> > > > patch
> > > > it
> > > > does).
> > >
> > > As I said earlier, the client is required to recover all _cached_
> > > open
> > > and lock state. If it already holds an open stateid, then it should
> > > not
> > > need to reclaim the open modes that are covered by that stateid,
> > > however it may still need to reclaim those open modes that were not
> > > already subject to an explicit OPEN call.
> > >
> > > IOW: If the file was first opened with an open(O_RDRW) call by the
> > > application, but a second application then opened it using
> > > open(O_WRONLY), then we may already hold a stateid with a
> > > "SHARE_ACCESS_BOTH" open mode, however we will still need to send a
> > > reclaim for the cached SHARE_ACCESS_WRITE mode, so that a later
> > > OPEN_DOWNGRADE(SHARE_ACCESS_WRITE) can succeed.
> > >
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer, Hammerspace
> > > [email protected]
> > >
> > >
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2018-10-04 23:07:51

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Thu, 2018-10-04 at 11:49 -0400, Olga Kornievskaia wrote:
> On Thu, Oct 4, 2018 at 11:22 AM Trond Myklebust <
> [email protected]> wrote:
> >
> > Hi Olga,
> >
> > On Wed, 2018-10-03 at 14:38 -0400, Olga Kornievskaia wrote:
> > > Hi Trond,
> > >
> > > Here's why the ordering of the "open_files" list matters and
> > > changes/fixes the existing problem.
> > >
> > > When we first open the file for writing and get a delegation,
> > > it's
> > > the
> > > first one on the list. When we opened the file again for the same
> > > mode
> > > type, then before the patch, the new entry is inserted before
> > > what's
> > > already on the list. Both of these files share the same
> > > nfs4_state
> > > that's marked delegated.
> > >
> > > Once we receive a delegation recall, in delegation_claim_opens()
> > > we
> > > walk the list. First one will be the 2nd open. It's marked
> > > delegated
> > > but after calling nfs4_open_delegation_recall() the delegation
> > > flag
> > > is
> > > cleared. The 2nd open doesn't have the lock associated with it.
> > > So no
> > > lock is reclaimed. We now go to the 2nd entry in the open_file
> > > list
> > > which is the 1st open but now the delegation flag is cleared so
> > > we
> > > never recover the lock.
> > >
> > > Any of the opens on the open_list can be the lock holder and we
> > > can't
> > > clear the delegation flag on the first treatment of the delegated
> > > open
> > > because it might not be the owner of the lock.
> > >
> > > I'm trying to figure out how I would fix it but I thought I'd
> > > send
> > > this for your comments.
> >
> > The expectation here is that once we find an open context with a
> > stateid that needs to be reclaimed or recovered, we recover _all_
> > the
> > state associated with that stateid.
> > IOW: the expectation is that we first look at the open state, and
> > (depending on whether this is a write delegation or a read
> > delegation)
> > run through a set of calls to nfs4_open_recover_helper() that will
> > recover all outstanding open state for this file.
>
> That's true. I see that it will recover all the outstanding opens for
> this file.
>
> > We then iterate through all the lock stateids for the file and
> > recover
> > those.
>
> However this is not true. Because we pass in a specific
> nfs_open_context into the nfs_delegation_claim_locks() and while
> looping thru the list of locks for the file we compare if the open
> context associated with the file lock is same as the passed in
> context. The passed in context is that of the first nfs_open_context
> that was marked delegated and not necessarily the context that hold
> the locks. That's the problem.
>
> While we are looping thru the locks for the file, we need to be
> checking against any and all the nfs_open_context that are associated
> with the file and recovering those locks. I'm still not sure how to
> do
> it.
>

Interesting. Why are we checking that open context? I can't see any
reason why we would want to do that.

> >
> > So why are we holding the open context, and not just pinning the
> > stateid while we perform the recovery? The main reason is to ensure
> > that we also pin the path. The stateid holds a reference to the
> > inode,
> > but recovery can require us to perform an OPEN based on path (e.g.
> > when
> > using NFS4_OPEN_CLAIM_DELEGATE_CUR). Hence the utility of the open
> > context, which carries a reference to a struct dentry.
> >
> > >
> > > On Fri, Sep 28, 2018 at 4:38 PM Trond Myklebust <
> > > [email protected]> wrote:
> > > >
> > > > On Fri, 2018-09-28 at 16:19 -0400, Olga Kornievskaia wrote:
> > > > > On Fri, Sep 28, 2018 at 4:07 PM Trond Myklebust <
> > > > > [email protected]> wrote:
> > > > > >
> > > > > > On Fri, 2018-09-28 at 15:55 -0400, Olga Kornievskaia wrote:
> > > > > > > On Fri, Sep 28, 2018 at 3:10 PM Olga Kornievskaia <
> > > > > > > [email protected]
> > > > > > > >
> > > > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > >
> > > > > > > Wait, why are we suppose to reclaim the open state when
> > > > > > > we
> > > > > > > have a
> > > > > > > valid open stateid? We don't have any cached opens that
> > > > > > > server
> > > > > > > doesn't
> > > > > > > know about. RFC7530 says "if the file has other open
> > > > > > > reference",
> > > > > > > I
> > > > > > > think the emphasis is on "other". I don't believe we need
> > > > > > > to
> > > > > > > be
> > > > > > > sending anything besides the locks to the server. Then
> > > > > > > I'm
> > > > > > > back
> > > > > > > to
> > > > > > > square one.
> > > > > >
> > > > > > Holding a delegation does not imply that we hold an open
> > > > > > stateid.
> > > > > > Under
> > > > > > Linux, the open stateid gets closed as soon as the
> > > > > > application
> > > > > > closes
> > > > > > the file.
> > > > > >
> > > > > > The delegation, on the other hand, is retained until either
> > > > > > it
> > > > > > is
> > > > > > recalled, or we see that the file has not been used for 2
> > > > > > lease
> > > > > > periods.
> > > > >
> > > > > Ok I agree with all of it but I'm saying it doesn't need to
> > > > > be
> > > > > reclaimed unconditionally or are you saying that's what the
> > > > > linux
> > > > > client does? In this test case, the file hasn't been closed
> > > > > or
> > > > > expired. I'm stating that the client has a valid open stateid
> > > > > and
> > > > > should only be required to reclaim the locks (which with this
> > > > > patch
> > > > > it
> > > > > does).
> > > >
> > > > As I said earlier, the client is required to recover all
> > > > _cached_
> > > > open
> > > > and lock state. If it already holds an open stateid, then it
> > > > should
> > > > not
> > > > need to reclaim the open modes that are covered by that
> > > > stateid,
> > > > however it may still need to reclaim those open modes that were
> > > > not
> > > > already subject to an explicit OPEN call.
> > > >
> > > > IOW: If the file was first opened with an open(O_RDRW) call by
> > > > the
> > > > application, but a second application then opened it using
> > > > open(O_WRONLY), then we may already hold a stateid with a
> > > > "SHARE_ACCESS_BOTH" open mode, however we will still need to
> > > > send a
> > > > reclaim for the cached SHARE_ACCESS_WRITE mode, so that a later
> > > > OPEN_DOWNGRADE(SHARE_ACCESS_WRITE) can succeed.
> > > >
> > > > --
> > > > Trond Myklebust
> > > > Linux NFS client maintainer, Hammerspace
> > > > [email protected]
> > > >
> > > >
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > [email protected]
> >
> >
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2018-10-04 23:25:42

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Thu, Oct 4, 2018 at 12:13 PM Trond Myklebust <[email protected]> wrote:
>
> On Thu, 2018-10-04 at 11:49 -0400, Olga Kornievskaia wrote:
> > On Thu, Oct 4, 2018 at 11:22 AM Trond Myklebust <
> > [email protected]> wrote:
> > >
> > > Hi Olga,
> > >
> > > On Wed, 2018-10-03 at 14:38 -0400, Olga Kornievskaia wrote:
> > > > Hi Trond,
> > > >
> > > > Here's why the ordering of the "open_files" list matters and
> > > > changes/fixes the existing problem.
> > > >
> > > > When we first open the file for writing and get a delegation,
> > > > it's
> > > > the
> > > > first one on the list. When we opened the file again for the same
> > > > mode
> > > > type, then before the patch, the new entry is inserted before
> > > > what's
> > > > already on the list. Both of these files share the same
> > > > nfs4_state
> > > > that's marked delegated.
> > > >
> > > > Once we receive a delegation recall, in delegation_claim_opens()
> > > > we
> > > > walk the list. First one will be the 2nd open. It's marked
> > > > delegated
> > > > but after calling nfs4_open_delegation_recall() the delegation
> > > > flag
> > > > is
> > > > cleared. The 2nd open doesn't have the lock associated with it.
> > > > So no
> > > > lock is reclaimed. We now go to the 2nd entry in the open_file
> > > > list
> > > > which is the 1st open but now the delegation flag is cleared so
> > > > we
> > > > never recover the lock.
> > > >
> > > > Any of the opens on the open_list can be the lock holder and we
> > > > can't
> > > > clear the delegation flag on the first treatment of the delegated
> > > > open
> > > > because it might not be the owner of the lock.
> > > >
> > > > I'm trying to figure out how I would fix it but I thought I'd
> > > > send
> > > > this for your comments.
> > >
> > > The expectation here is that once we find an open context with a
> > > stateid that needs to be reclaimed or recovered, we recover _all_
> > > the
> > > state associated with that stateid.
> > > IOW: the expectation is that we first look at the open state, and
> > > (depending on whether this is a write delegation or a read
> > > delegation)
> > > run through a set of calls to nfs4_open_recover_helper() that will
> > > recover all outstanding open state for this file.
> >
> > That's true. I see that it will recover all the outstanding opens for
> > this file.
> >
> > > We then iterate through all the lock stateids for the file and
> > > recover
> > > those.
> >
> > However this is not true. Because we pass in a specific
> > nfs_open_context into the nfs_delegation_claim_locks() and while
> > looping thru the list of locks for the file we compare if the open
> > context associated with the file lock is same as the passed in
> > context. The passed in context is that of the first nfs_open_context
> > that was marked delegated and not necessarily the context that hold
> > the locks. That's the problem.
> >
> > While we are looping thru the locks for the file, we need to be
> > checking against any and all the nfs_open_context that are associated
> > with the file and recovering those locks. I'm still not sure how to
> > do
> > it.
> >
>
> Interesting. Why are we checking that open context? I can't see any
> reason why we would want to do that.

I'm thinking it's probably because we might have an open context
(non-delegated) on this file that holds a lock (say we opened with
O_DIRECT and got a lock; that lock was sent to the server, so it doesn't
need to be recovered). Then we opened the file again, got a delegation,
and got a different lock. When we are looping through the locks for the
file, we only need to recover the 2nd one.

I'm thinking: can't we check that the passed-in state is the same as
what we get from nfs_file_open_context(fl->fl_file)->state? I'm testing
it now...


>
> > >
> > > So why are we holding the open context, and not just pinning the
> > > stateid while we perform the recovery? The main reason is to ensure
> > > that we also pin the path. The stateid holds a reference to the
> > > inode,
> > > but recovery can require us to perform an OPEN based on path (e.g.
> > > when
> > > using NFS4_OPEN_CLAIM_DELEGATE_CUR). Hence the utility of the open
> > > context, which carries a reference to a struct dentry.
> > >
> > > >
> > > > On Fri, Sep 28, 2018 at 4:38 PM Trond Myklebust <
> > > > [email protected]> wrote:
> > > > >
> > > > > On Fri, 2018-09-28 at 16:19 -0400, Olga Kornievskaia wrote:
> > > > > > On Fri, Sep 28, 2018 at 4:07 PM Trond Myklebust <
> > > > > > [email protected]> wrote:
> > > > > > >
> > > > > > > On Fri, 2018-09-28 at 15:55 -0400, Olga Kornievskaia wrote:
> > > > > > > > On Fri, Sep 28, 2018 at 3:10 PM Olga Kornievskaia <
> > > > > > > > [email protected]
> > > > > > > > >
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > >
> > > > > > > > Wait, why are we suppose to reclaim the open state when
> > > > > > > > we
> > > > > > > > have a
> > > > > > > > valid open stateid? We don't have any cached opens that
> > > > > > > > server
> > > > > > > > doesn't
> > > > > > > > know about. RFC7530 says "if the file has other open
> > > > > > > > reference",
> > > > > > > > I
> > > > > > > > think the emphasis is on "other". I don't believe we need
> > > > > > > > to
> > > > > > > > be
> > > > > > > > sending anything besides the locks to the server. Then
> > > > > > > > I'm
> > > > > > > > back
> > > > > > > > to
> > > > > > > > square one.
> > > > > > >
> > > > > > > Holding a delegation does not imply that we hold an open
> > > > > > > stateid.
> > > > > > > Under
> > > > > > > Linux, the open stateid gets closed as soon as the
> > > > > > > application
> > > > > > > closes
> > > > > > > the file.
> > > > > > >
> > > > > > > The delegation, on the other hand, is retained until either
> > > > > > > it
> > > > > > > is
> > > > > > > recalled, or we see that the file has not been used for 2
> > > > > > > lease
> > > > > > > periods.
> > > > > >
> > > > > > Ok I agree with all of it but I'm saying it doesn't need to
> > > > > > be
> > > > > > reclaimed unconditionally or are you saying that's what the
> > > > > > linux
> > > > > > client does? In this test case, the file hasn't been closed
> > > > > > or
> > > > > > expired. I'm stating that the client has a valid open stateid
> > > > > > and
> > > > > > should only be required to reclaim the locks (which with this
> > > > > > patch
> > > > > > it
> > > > > > does).
> > > > >
> > > > > As I said earlier, the client is required to recover all
> > > > > _cached_
> > > > > open
> > > > > and lock state. If it already holds an open stateid, then it
> > > > > should
> > > > > not
> > > > > need to reclaim the open modes that are covered by that
> > > > > stateid,
> > > > > however it may still need to reclaim those open modes that were
> > > > > not
> > > > > already subject to an explicit OPEN call.
> > > > >
> > > > > IOW: If the file was first opened with an open(O_RDRW) call by
> > > > > the
> > > > > application, but a second application then opened it using
> > > > > open(O_WRONLY), then we may already hold a stateid with a
> > > > > "SHARE_ACCESS_BOTH" open mode, however we will still need to
> > > > > send a
> > > > > reclaim for the cached SHARE_ACCESS_WRITE mode, so that a later
> > > > > OPEN_DOWNGRADE(SHARE_ACCESS_WRITE) can succeed.
> > > > >
> > > > > --
> > > > > Trond Myklebust
> > > > > Linux NFS client maintainer, Hammerspace
> > > > > [email protected]
> > > > >
> > > > >
> > >
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer, Hammerspace
> > > [email protected]
> > >
> > >
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> http://www.hammer.space
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2018-10-04 23:38:49

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Thu, 2018-10-04 at 12:31 -0400, Olga Kornievskaia wrote:
> On Thu, Oct 4, 2018 at 12:13 PM Trond Myklebust <
> [email protected]> wrote:
> >
> > On Thu, 2018-10-04 at 11:49 -0400, Olga Kornievskaia wrote:
> > > On Thu, Oct 4, 2018 at 11:22 AM Trond Myklebust <
> > > [email protected]> wrote:
> > > >
> > > > Hi Olga,
> > > >
> > > > On Wed, 2018-10-03 at 14:38 -0400, Olga Kornievskaia wrote:
> > > > > Hi Trond,
> > > > >
> > > > > Here's why the ordering of the "open_files" list matters and
> > > > > changes/fixes the existing problem.
> > > > >
> > > > > When we first open the file for writing and get a delegation,
> > > > > it's
> > > > > the
> > > > > first one on the list. When we opened the file again for the
> > > > > same
> > > > > mode
> > > > > type, then before the patch, the new entry is inserted before
> > > > > what's
> > > > > already on the list. Both of these files share the same
> > > > > nfs4_state
> > > > > that's marked delegated.
> > > > >
> > > > > Once we receive a delegation recall, in
> > > > > delegation_claim_opens()
> > > > > we
> > > > > walk the list. First one will be the 2nd open. It's marked
> > > > > delegated
> > > > > but after calling nfs4_open_delegation_recall() the
> > > > > delegation
> > > > > flag
> > > > > is
> > > > > cleared. The 2nd open doesn't have the lock associated with
> > > > > it.
> > > > > So no
> > > > > lock is reclaimed. We now go to the 2nd entry in the
> > > > > open_file
> > > > > list
> > > > > which is the 1st open but now the delegation flag is cleared
> > > > > so
> > > > > we
> > > > > never recover the lock.
> > > > >
> > > > > Any of the opens on the open_list can be the lock holder and
> > > > > we
> > > > > can't
> > > > > clear the delegation flag on the first treatment of the
> > > > > delegated
> > > > > open
> > > > > because it might not be the owner of the lock.
> > > > >
> > > > > I'm trying to figure out how I would fix it but I thought I'd
> > > > > send
> > > > > this for your comments.
> > > >
> > > > The expectation here is that once we find an open context with
> > > > a
> > > > stateid that needs to be reclaimed or recovered, we recover
> > > > _all_
> > > > the
> > > > state associated with that stateid.
> > > > IOW: the expectation is that we first look at the open state,
> > > > and
> > > > (depending on whether this is a write delegation or a read
> > > > delegation)
> > > > run through a set of calls to nfs4_open_recover_helper() that
> > > > will
> > > > recover all outstanding open state for this file.
> > >
> > > That's true. I see that it will recover all the outstanding opens
> > > for
> > > this file.
> > >
> > > > We then iterate through all the lock stateids for the file and
> > > > recover
> > > > those.
> > >
> > > However this is not true. Because we pass in a specific
> > > nfs_open_context into the nfs_delegation_claim_locks() and while
> > > looping thru the list of locks for the file we compare if the
> > > open
> > > context associated with the file lock is same as the passed in
> > > context. The passed in context is that of the first
> > > nfs_open_context
> > > that was marked delegated and not necessarily the context that
> > > hold
> > > the locks. That's the problem.
> > >
> > > While we are looping thru the locks for the file, we need to be
> > > checking against any and all the nfs_open_context that are
> > > associated
> > > with the file and recovering those locks. I'm still not sure how
> > > to
> > > do
> > > it.
> > >
> >
> > Interesting. Why are we checking that open context? I can't see any
> > reason why we would want to do that.
>
> I'm thinking it's probably because we might have open context
> (non-delegated) on this file that holds a lock (say we opened with
> O_DIRECT and got a lock; that lock was sent to the server so doesn't
> need to recovered). Then we opened a file again and got a delegation
> and got a different lock. When we are looping thru the locks for the
> file, we only need to recover the 2nd one.
>
> I'm thinking can't we check that the passed in state is the same that
> is what we get nfs_file_open_context(fl->fl_file)->state ? I'm
> testing
> it now...

There is nothing in the protocol that stops you from recovering a lock
twice (unless the server does Microsoft stacked locks, which our client
doesn't support). I'd prefer to just go for simplicity here, and
recover everything.

...and yes, I agree that we should select the locks to recover based on
the state. As I said, that was the expectation in the first place.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2018-10-05 01:46:02

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH 5/7] NFS: Convert lookups of the open context to RCU

On Thu, Oct 4, 2018 at 12:42 PM Trond Myklebust <[email protected]> wrote:
>
> On Thu, 2018-10-04 at 12:31 -0400, Olga Kornievskaia wrote:
> > On Thu, Oct 4, 2018 at 12:13 PM Trond Myklebust <
> > [email protected]> wrote:
> > >
> > > On Thu, 2018-10-04 at 11:49 -0400, Olga Kornievskaia wrote:
> > > > On Thu, Oct 4, 2018 at 11:22 AM Trond Myklebust <
> > > > [email protected]> wrote:
> > > > >
> > > > > Hi Olga,
> > > > >
> > > > > On Wed, 2018-10-03 at 14:38 -0400, Olga Kornievskaia wrote:
> > > > > > Hi Trond,
> > > > > >
> > > > > > Here's why the ordering of the "open_files" list matters and
> > > > > > changes/fixes the existing problem.
> > > > > >
> > > > > > When we first open the file for writing and get a delegation,
> > > > > > it's
> > > > > > the
> > > > > > first one on the list. When we opened the file again for the
> > > > > > same
> > > > > > mode
> > > > > > type, then before the patch, the new entry is inserted before
> > > > > > what's
> > > > > > already on the list. Both of these files share the same
> > > > > > nfs4_state
> > > > > > that's marked delegated.
> > > > > >
> > > > > > Once we receive a delegation recall, in
> > > > > > delegation_claim_opens()
> > > > > > we
> > > > > > walk the list. First one will be the 2nd open. It's marked
> > > > > > delegated
> > > > > > but after calling nfs4_open_delegation_recall() the
> > > > > > delegation
> > > > > > flag
> > > > > > is
> > > > > > cleared. The 2nd open doesn't have the lock associated with
> > > > > > it.
> > > > > > So no
> > > > > > lock is reclaimed. We now go to the 2nd entry in the
> > > > > > open_file
> > > > > > list
> > > > > > which is the 1st open but now the delegation flag is cleared
> > > > > > so
> > > > > > we
> > > > > > never recover the lock.
> > > > > >
> > > > > > Any of the opens on the open_list can be the lock holder and
> > > > > > we
> > > > > > can't
> > > > > > clear the delegation flag on the first treatment of the
> > > > > > delegated
> > > > > > open
> > > > > > because it might not be the owner of the lock.
> > > > > >
> > > > > > I'm trying to figure out how I would fix it but I thought I'd
> > > > > > send
> > > > > > this for your comments.
> > > > >
> > > > > The expectation here is that once we find an open context with
> > > > > a
> > > > > stateid that needs to be reclaimed or recovered, we recover
> > > > > _all_
> > > > > the
> > > > > state associated with that stateid.
> > > > > IOW: the expectation is that we first look at the open state,
> > > > > and
> > > > > (depending on whether this is a write delegation or a read
> > > > > delegation)
> > > > > run through a set of calls to nfs4_open_recover_helper() that
> > > > > will
> > > > > recover all outstanding open state for this file.
> > > >
> > > > That's true. I see that it will recover all the outstanding opens
> > > > for
> > > > this file.
> > > >
> > > > > We then iterate through all the lock stateids for the file and
> > > > > recover
> > > > > those.
> > > >
> > > > However this is not true. Because we pass in a specific
> > > > nfs_open_context into the nfs_delegation_claim_locks() and while
> > > > looping thru the list of locks for the file we compare if the
> > > > open
> > > > context associated with the file lock is same as the passed in
> > > > context. The passed in context is that of the first
> > > > nfs_open_context
> > > > that was marked delegated and not necessarily the context that
> > > > hold
> > > > the locks. That's the problem.
> > > >
> > > > While we are looping thru the locks for the file, we need to be
> > > > checking against any and all the nfs_open_context that are
> > > > associated
> > > > with the file and recovering those locks. I'm still not sure how
> > > > to
> > > > do
> > > > it.
> > > >
> > >
> > > Interesting. Why are we checking that open context? I can't see any
> > > reason why we would want to do that.
> >
> > I'm thinking it's probably because we might have open context
> > (non-delegated) on this file that holds a lock (say we opened with
> > O_DIRECT and got a lock; that lock was sent to the server so doesn't
> > need to recovered). Then we opened a file again and got a delegation
> > and got a different lock. When we are looping thru the locks for the
> > file, we only need to recover the 2nd one.
> >
> > I'm thinking can't we check that the passed in state is the same that
> > is what we get nfs_file_open_context(fl->fl_file)->state ? I'm
> > testing
> > it now...
>
> There is nothing in the protocol that stops you from recovering a lock
> twice (unless the server does Microsoft stacked locks, which our client
> doesn't support). I'd prefer to just go for simplicity here, and
> recover everything.
>
> ...and yes, I agree that we should select the locks to recover based on
> the state. As I said, that was the expectation in the first place.

Ok. I sent out a patch, but I'm not sure if you want something that just
removes the check altogether.

>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>