Subject: [RFC PATCH] NFSv4: replace seqcount_t with a seqlock_t

The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
because it maps to preempt_disable() in -RT which I can't have at this
point. So I took a look at the code.
The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
lockdep complained. The whole seqcount thing was introduced in commit
c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
being recovered").
I don't understand how it is possible that we don't end up with two
writers for the same resource because the `sp->so_lock' lock is dropped
so soon in the list_for_each_entry() loop. It might be the
test_and_clear_bit() check in nfs4_do_reclaim(), but it might clear one
bit on each iteration, so I *think* we could have two invocations of the
same struct nfs4_state_owner in nfs4_reclaim_open_state().
So there is that.

But back to the list_for_each_entry() macro.
It seems that this `so_lock' lock protects the ->so_states list among
other atomic_t & flags members. So at the beginning of the loop we inc
->count, ensuring that this field is not removed while we use it. So we
drop the ->so_lock lock during the loop. And after the nfs4_reclaim_locks()
invocation we nfs4_put_open_state() and then grab the ->so_lock again. So
if we were the last user of this struct and we remove it, then the
following list_next_entry() invocation is a use-after-free. Even if we use
list_for_each_entry_safe() there is no guarantee that the following member
is still valid because it might have been removed by something that
invoked nfs4_put_open_state(), right?
So there is this.
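
To make the pattern concrete, here is a minimal userspace sketch of the loop
shape described above (a pthread mutex stands in for ->so_lock, a plain
integer for ->count; all names are invented, this is not the NFS code):

#include <pthread.h>
#include <stdlib.h>

struct demo_state {
	struct demo_state *next;	/* stands in for the ->so_states list linkage */
	int count;			/* stands in for state->count */
	unsigned long flags;		/* bit 0 stands in for ops->state_flag_bit */
};

static struct demo_state *head;
static pthread_mutex_t so_lock = PTHREAD_MUTEX_INITIALIZER;

/* stands in for nfs4_put_open_state(): dropping the last reference frees the entry */
static void demo_put_state(struct demo_state *state)
{
	struct demo_state **p;

	pthread_mutex_lock(&so_lock);
	if (--state->count == 0) {
		for (p = &head; *p; p = &(*p)->next)
			if (*p == state) {
				*p = state->next;	/* unlink */
				break;
			}
		free(state);
	}
	pthread_mutex_unlock(&so_lock);
}

static void demo_reclaim(void)
{
	struct demo_state *state;

	pthread_mutex_lock(&so_lock);
restart:
	for (state = head; state; state = state->next) {
		if (!(state->flags & 1))
			continue;
		state->flags &= ~1UL;		/* cf. test_and_clear_bit() */
		state->count++;			/* pin the entry ... */
		pthread_mutex_unlock(&so_lock);	/* ... then drop the lock */

		/* the long, preemptible reclaim work (RPC calls) happens here */

		demo_put_state(state);		/* may drop the last reference */
		pthread_mutex_lock(&so_lock);
		/*
		 * Advancing via state->next at this point would be the
		 * use-after-free described above if demo_put_state() freed
		 * the entry; the real code re-takes ->so_lock and restarts
		 * the whole scan instead, and the cleared flag keeps the
		 * restart from processing the same entry twice.
		 */
		goto restart;
	}
	pthread_mutex_unlock(&so_lock);
}

int main(void)
{
	demo_reclaim();		/* the list is empty here; the point is the locking shape */
	return 0;
}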

However to address my initial problem I have here a patch :) So it uses
a seqlock_t which ensures that there is only one writer at a time. So it
should be basically what is happening now plus a tiny tiny tiny lock
plus lockdep coverage. I tried to test this myself but I didn't manage to get
into this code path at all, so I might be doing something wrong.

Could you please check if this patch is working for you and whether my
list_for_each_entry() observation is correct or not?

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
fs/nfs/delegation.c | 4 ++--
fs/nfs/nfs4_fs.h | 2 +-
fs/nfs/nfs4proc.c | 4 ++--
fs/nfs/nfs4state.c | 10 ++++------
4 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index dff600ae0d74..d726d2e09353 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -150,11 +150,11 @@ static int nfs_delegation_claim_opens(struct inode *inode,
sp = state->owner;
/* Block nfs4_proc_unlck */
mutex_lock(&sp->so_delegreturn_mutex);
- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ seq = read_seqbegin(&sp->so_reclaim_seqlock);
err = nfs4_open_delegation_recall(ctx, state, stateid, type);
if (!err)
err = nfs_delegation_claim_locks(ctx, state, stateid);
- if (!err && read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
+ if (!err && read_seqretry(&sp->so_reclaim_seqlock, seq))
err = -EAGAIN;
mutex_unlock(&sp->so_delegreturn_mutex);
put_nfs_open_context(ctx);
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 9b3a82abab07..9b5089bf0f2e 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -111,7 +111,7 @@ struct nfs4_state_owner {
unsigned long so_flags;
struct list_head so_states;
struct nfs_seqid_counter so_seqid;
- seqcount_t so_reclaim_seqcount;
+ seqlock_t so_reclaim_seqlock;
struct mutex so_delegreturn_mutex;
};

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index ad917bd72b38..3d6be8405c08 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2685,7 +2685,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
unsigned int seq;
int ret;

- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ seq = raw_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);

ret = _nfs4_proc_open(opendata);
if (ret != 0)
@@ -2723,7 +2723,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
ctx->state = state;
if (d_inode(dentry) == state->inode) {
nfs_inode_attach_open_context(ctx);
- if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
+ if (read_seqretry(&sp->so_reclaim_seqlock, seq))
nfs4_schedule_stateid_recovery(server, state);
}
out:
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 5f4281ec5f72..74be6378ca08 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -488,7 +488,7 @@ nfs4_alloc_state_owner(struct nfs_server *server,
nfs4_init_seqid_counter(&sp->so_seqid);
atomic_set(&sp->so_count, 1);
INIT_LIST_HEAD(&sp->so_lru);
- seqcount_init(&sp->so_reclaim_seqcount);
+ seqlock_init(&sp->so_reclaim_seqlock);
mutex_init(&sp->so_delegreturn_mutex);
return sp;
}
@@ -1497,8 +1497,8 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
* recovering after a network partition or a reboot from a
* server that doesn't support a grace period.
*/
+ write_seqlock(&sp->so_reclaim_seqlock);
spin_lock(&sp->so_lock);
- raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
restart:
list_for_each_entry(state, &sp->so_states, open_states) {
if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
@@ -1566,14 +1566,12 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
spin_lock(&sp->so_lock);
goto restart;
}
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
spin_unlock(&sp->so_lock);
+ write_sequnlock(&sp->so_reclaim_seqlock);
return 0;
out_err:
nfs4_put_open_state(state);
- spin_lock(&sp->so_lock);
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
- spin_unlock(&sp->so_lock);
+ write_sequnlock(&sp->so_reclaim_seqlock);
return status;
}

--
2.9.3



Subject: [PATCH v2] NFSv4: replace seqcount_t with a seqlock_t

The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
because it maps to preempt_disable() in -RT which I can't have at this
point. So I took a look at the code.
The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
lockdep complained. The whole seqcount thing was introduced in commit
c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
being recovered").
I don't understand how it is possible that we don't end up with two
writers for the same resource because the `sp->so_lock' lock is dropped
so soon in the list_for_each_entry() loop. It might be the
test_and_clear_bit() check in nfs4_do_reclaim(), but it might clear one
bit on each iteration, so I *think* we could have two invocations of the
same struct nfs4_state_owner in nfs4_reclaim_open_state().
So there is that.

But back to the list_for_each_entry() macro.
It seems that this lock protects the ->so_states list among other
atomic_t & flags members. So at the beginning of the loop we inc ->count,
ensuring that this field is not removed while we use it. So we drop the
->so_lock lock during the loop it seems. And after nfs4_reclaim_locks()
invocation we nfs4_put_open_state() and grab the ->so_lock again. So if
we were the last user of this struct and we remove it, then the
following list_next_entry() invocation is a use-after-free. Even if we
use list_for_each_entry_safe() there is no guarantee that the following
member is still valid because it might have been removed by another
writer, right?
So there is this.

However to address my initial problem I have here a patch :) So it uses
a seqlock_t which ensures that there is only one writer at a time. So it
should be basically what is happening now plus a tiny tiny tiny lock
plus lockdep coverage. I tried to test this myself but I didn't manage to get
into this code path at all so I might be doing something wrong.

Could you please check if this patch is working for you and whether my
list_for_each_entry() observation is correct or not?

v1…v2: write_seqlock() disables preemption and some functions need it
enabled (kthread_run(), non-GFP_ATOMIC memory allocation). We don't want
the writer to be preempted because a preempted writer would leave the
readers spinning. This is a duct-tape mutex. Maybe the seqlock should go.
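
Since the note above hinges on what the readers do while a writer is inside
its section, here is a stripped-down userspace model of the seqcount
protocol (C11 atomics; the helper names are invented and the kernel's
memory barriers and lockdep hooks are left out):

#include <stdatomic.h>
#include <stdio.h>

static atomic_uint seq;		/* even: no writer active, odd: writer inside its section */
static int protected_data;

static void demo_write_begin(void)	/* cf. raw_write_seqcount_begin() */
{
	atomic_fetch_add(&seq, 1);	/* counter becomes odd */
}

static void demo_write_end(void)	/* cf. raw_write_seqcount_end() */
{
	atomic_fetch_add(&seq, 1);	/* counter becomes even again */
}

static unsigned int demo_read_begin(void)	/* cf. read_seqbegin() */
{
	unsigned int s;

	/* readers busy-wait while the counter is odd, i.e. while a writer
	 * is inside its section */
	while ((s = atomic_load(&seq)) & 1)
		;
	return s;
}

static int demo_read_retry(unsigned int start)	/* cf. read_seqretry() */
{
	return atomic_load(&seq) != start;	/* a writer ran: redo the read side */
}

int main(void)
{
	unsigned int s;
	int copy;

	demo_write_begin();
	protected_data = 42;
	demo_write_end();

	do {
		s = demo_read_begin();
		copy = protected_data;
	} while (demo_read_retry(s));

	printf("read %d\n", copy);
	return 0;
}

The busy-wait in demo_read_begin() is the stall the note worries about: if
the writer sleeps or is preempted between demo_write_begin() and
demo_write_end(), every reader spins there until the writer finishes.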

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
fs/nfs/delegation.c | 4 ++--
fs/nfs/nfs4_fs.h | 3 ++-
fs/nfs/nfs4proc.c | 4 ++--
fs/nfs/nfs4state.c | 23 +++++++++++++++++------
4 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index dff600ae0d74..d726d2e09353 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -150,11 +150,11 @@ static int nfs_delegation_claim_opens(struct inode *inode,
sp = state->owner;
/* Block nfs4_proc_unlck */
mutex_lock(&sp->so_delegreturn_mutex);
- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ seq = read_seqbegin(&sp->so_reclaim_seqlock);
err = nfs4_open_delegation_recall(ctx, state, stateid, type);
if (!err)
err = nfs_delegation_claim_locks(ctx, state, stateid);
- if (!err && read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
+ if (!err && read_seqretry(&sp->so_reclaim_seqlock, seq))
err = -EAGAIN;
mutex_unlock(&sp->so_delegreturn_mutex);
put_nfs_open_context(ctx);
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 9b3a82abab07..2fee1a2e8b57 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -111,7 +111,8 @@ struct nfs4_state_owner {
unsigned long so_flags;
struct list_head so_states;
struct nfs_seqid_counter so_seqid;
- seqcount_t so_reclaim_seqcount;
+ seqlock_t so_reclaim_seqlock;
+ struct mutex so_reclaim_seqlock_mutex;
struct mutex so_delegreturn_mutex;
};

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 7897826d7c51..9b9d53cd85f9 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2685,7 +2685,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
unsigned int seq;
int ret;

- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ seq = raw_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);

ret = _nfs4_proc_open(opendata);
if (ret != 0)
@@ -2723,7 +2723,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
ctx->state = state;
if (d_inode(dentry) == state->inode) {
nfs_inode_attach_open_context(ctx);
- if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
+ if (read_seqretry(&sp->so_reclaim_seqlock, seq))
nfs4_schedule_stateid_recovery(server, state);
}
out:
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 5f4281ec5f72..a442d9867942 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -488,7 +488,8 @@ nfs4_alloc_state_owner(struct nfs_server *server,
nfs4_init_seqid_counter(&sp->so_seqid);
atomic_set(&sp->so_count, 1);
INIT_LIST_HEAD(&sp->so_lru);
- seqcount_init(&sp->so_reclaim_seqcount);
+ seqlock_init(&sp->so_reclaim_seqlock);
+ mutex_init(&sp->so_reclaim_seqlock_mutex);
mutex_init(&sp->so_delegreturn_mutex);
return sp;
}
@@ -1497,8 +1498,18 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
* recovering after a network partition or a reboot from a
* server that doesn't support a grace period.
*/
+ /*
+ * XXX
+ * This mutex is wrong. It protects against multiple writer. However
+ * write_seqlock() should have been used for this task. This would avoid
+ * preemption while the seqlock is held which is good because the writer
+ * shouldn't be preempted as it would let the reader spin for no good
+ * reason. There are a few memory allocations and kthread_run() so we
+ * have this mutex now.
+ */
+ mutex_lock(&sp->so_reclaim_seqlock_mutex);
+ write_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);
spin_lock(&sp->so_lock);
- raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
restart:
list_for_each_entry(state, &sp->so_states, open_states) {
if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
@@ -1566,14 +1577,14 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
spin_lock(&sp->so_lock);
goto restart;
}
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
spin_unlock(&sp->so_lock);
+ write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);
+ mutex_unlock(&sp->so_reclaim_seqlock_mutex);
return 0;
out_err:
nfs4_put_open_state(state);
- spin_lock(&sp->so_lock);
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
- spin_unlock(&sp->so_lock);
+ write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);
+ mutex_unlock(&sp->so_reclaim_seqlock_mutex);
return status;
}

--
2.10.1


2016-10-28 22:24:59

by Trond Myklebust

Subject: Re: [PATCH v2] NFSv4: replace seqcount_t with a seqlock_t

> On Oct 28, 2016, at 17:05, Sebastian Andrzej Siewior <[email protected]> wrote:
> 
> + /*
> + * XXX
> + * This mutex is wrong. It protects against multiple writer. However

There is only 1 recovery thread per client/server pair, which is why we know there is only a single writer. No need for a mutex here.

> + * write_seqlock() should have been used for this task. This would avoid
> + * preemption while the seqlock is held which is good because the writer
> + * shouldn't be preempted as it would let the reader spin for no good

Recovery involves sending RPC calls to a server. There is no avoiding preemption if we have to recover state. The point of the sequence counter is to signal to processes that may have raced with this recovery thread that they may need to replay their locks.

> spin_unlock(&sp->so_lock);
> + write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);

This will reintroduce lockdep checking. We don’t need or want that...


Subject: Re: [PATCH v2] NFSv4: replace seqcount_t with a seqlock_t

On 2016-10-28 22:24:52 [+0000], Trond Myklebust wrote:
> > diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
> > index 5f4281ec5f72..a442d9867942 100644
> > --- a/fs/nfs/nfs4state.c
> > +++ b/fs/nfs/nfs4state.c
> > @@ -1497,8 +1498,18 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
> > * recovering after a network partition or a reboot from a
> > * server that doesn't support a grace period.
> > */
> > + /*
> > + * XXX
> > + * This mutex is wrong. It protects against multiple writer. However
>
> There is only 1 recovery thread per client/server pair, which is why we know there is only a single writer. No need for a mutex here.

This isn't documented but it should be.

> > + * write_seqlock() should have been used for this task. This would avoid
> > + * preemption while the seqlock is held which is good because the writer
> > + * shouldn't be preempted as it would let the reader spin for no good
>
> Recovery involves sending RPC calls to a server. There is no avoiding preemption if we have to recover state. The point of the sequence counter is to signal to processes that may have raced with this recovery thread that they may need to replay their locks.

Then the sequence counter is the wrong mechanism to use here. As I
explained: the call can be preempted, and once it is, the reader will
spin and waste cycles.
What is wrong with a rwsem? Are the readers not preemptible?

> > + * reason. There are a few memory allocations and kthread_run() so we
> > + * have this mutex now.
> > + */
> > + mutex_lock(&sp->so_reclaim_seqlock_mutex);
> > + write_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);
> > spin_lock(&sp->so_lock);
> > - raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
> > restart:
> > list_for_each_entry(state, &sp->so_states, open_states) {
> > if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
> > @@ -1566,14 +1577,14 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
> > spin_lock(&sp->so_lock);
> > goto restart;
> > }
> > - raw_write_seqcount_end(&sp->so_reclaim_seqcount);
> > spin_unlock(&sp->so_lock);
> > + write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);
>
> This will reintroduce lockdep checking. We don’t need or want that...

Why isn't lockdep welcome? And what kind of warning will it introduce? How
do I trigger it? I tried various things but never got close to this.

Sebastian

Subject: [PATCH v3] NFSv4: replace seqcount_t with a rw_semaphore

The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
because it maps to preempt_disable() in -RT which I can't have at this
point. So I took a look at the code.
The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
lockdep complained. The whole seqcount thing was introduced in commit
c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
being recovered").
Trond Myklebust confirms that there is only one writer thread at
nfs4_reclaim_open_state().

The list_for_each_entry() in nfs4_reclaim_open_state:
It seems that this lock protects the ->so_states list among other
atomic_t & flags members. So at the beginning of the loop we inc ->count,
ensuring that this field is not removed while we use it. So we drop the
->so_lock lock during the loop it seems. And after nfs4_reclaim_locks()
invocation we nfs4_put_open_state() and grab the ->so_lock again. So if
we were the last user of this struct and we remove it, then the
following list_next_entry() invocation is a use-after-free. Even if we
use list_for_each_entry_safe() there is no guarantee that the following
member is still valid because it might have been removed by someone
invoking nfs4_put_open_state() on it, right?
So there is this.

However to address my initial problem I have here a patch :) So it uses
a rw_semaphore which ensures that there is only one writer at a time or
multiple reader. So it should be basically what is happening now plus a
tiny tiny tiny lock plus lockdep coverage. I tried to this myself but I
don't manage to get into this code path at all so I might be doing
something wrong.

Could you please check if this patch is working for you and whether my
list_for_each_entry() observation is correct or not?

v2…v3: replace the seqlock with a RW semaphore.

v1…v2: write_seqlock() disables preemption and some functions need it
enabled (kthread_run(), non-GFP_ATOMIC memory allocation). We don't want
the writer to be preempted because a preempted writer would leave the
readers spinning. This is a duct-tape mutex. Maybe the seqlock should go.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
fs/nfs/delegation.c | 6 ++----
fs/nfs/nfs4_fs.h | 2 +-
fs/nfs/nfs4proc.c | 9 +++------
fs/nfs/nfs4state.c | 10 ++++------
4 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index dff600ae0d74..55d5531aedbb 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -130,7 +130,6 @@ static int nfs_delegation_claim_opens(struct inode *inode,
struct nfs_open_context *ctx;
struct nfs4_state_owner *sp;
struct nfs4_state *state;
- unsigned int seq;
int err;

again:
@@ -150,12 +149,11 @@ static int nfs_delegation_claim_opens(struct inode *inode,
sp = state->owner;
/* Block nfs4_proc_unlck */
mutex_lock(&sp->so_delegreturn_mutex);
- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ down_read(&sp->so_reclaim_rw);
err = nfs4_open_delegation_recall(ctx, state, stateid, type);
if (!err)
err = nfs_delegation_claim_locks(ctx, state, stateid);
- if (!err && read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
- err = -EAGAIN;
+ up_read(&sp->so_reclaim_rw);
mutex_unlock(&sp->so_delegreturn_mutex);
put_nfs_open_context(ctx);
if (err != 0)
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 9b3a82abab07..03d37826a183 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -111,7 +111,7 @@ struct nfs4_state_owner {
unsigned long so_flags;
struct list_head so_states;
struct nfs_seqid_counter so_seqid;
- seqcount_t so_reclaim_seqcount;
+ struct rw_semaphore so_reclaim_rw;
struct mutex so_delegreturn_mutex;
};

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 7897826d7c51..ba6589d1fd36 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2682,10 +2682,9 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
struct nfs_server *server = sp->so_server;
struct dentry *dentry;
struct nfs4_state *state;
- unsigned int seq;
int ret;

- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ down_read(&sp->so_reclaim_rw);

ret = _nfs4_proc_open(opendata);
if (ret != 0)
@@ -2721,12 +2720,10 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
goto out;

ctx->state = state;
- if (d_inode(dentry) == state->inode) {
+ if (d_inode(dentry) == state->inode)
nfs_inode_attach_open_context(ctx);
- if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
- nfs4_schedule_stateid_recovery(server, state);
- }
out:
+ up_read(&sp->so_reclaim_rw);
return ret;
}

diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 5f4281ec5f72..61c431fa2fe3 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -488,7 +488,7 @@ nfs4_alloc_state_owner(struct nfs_server *server,
nfs4_init_seqid_counter(&sp->so_seqid);
atomic_set(&sp->so_count, 1);
INIT_LIST_HEAD(&sp->so_lru);
- seqcount_init(&sp->so_reclaim_seqcount);
+ init_rwsem(&sp->so_reclaim_rw);
mutex_init(&sp->so_delegreturn_mutex);
return sp;
}
@@ -1497,8 +1497,8 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
* recovering after a network partition or a reboot from a
* server that doesn't support a grace period.
*/
+ down_write(&sp->so_reclaim_rw);
spin_lock(&sp->so_lock);
- raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
restart:
list_for_each_entry(state, &sp->so_states, open_states) {
if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
@@ -1566,14 +1566,12 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
spin_lock(&sp->so_lock);
goto restart;
}
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
spin_unlock(&sp->so_lock);
+ up_write(&sp->so_reclaim_rw);
return 0;
out_err:
nfs4_put_open_state(state);
- spin_lock(&sp->so_lock);
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
- spin_unlock(&sp->so_lock);
+ up_write(&sp->so_reclaim_rw);
return status;
}

--
2.10.2


2016-10-31 15:30:12

by Trond Myklebust

Subject: Re: [PATCH v3] NFSv4: replace seqcount_t with a rw_semaphore

> On Oct 31, 2016, at 09:19, Sebastian Andrzej Siewior <[email protected]> wrote:
> 
> Could you please check if this patch is working for you and whether my
> list_for_each_entry() observation is correct or not?
> 
> v2…v3: replace the seqlock with a RW semaphore.
> 

NACK. That will deadlock. The reason why we use a seqlock there is precisely because we cannot allow ordinary RPC calls to lock out the recovery thread. If the server reboots, then ordinary RPC calls will fail until the recovery thread has had a chance to re-establish the state.


Subject: Re: [PATCH v3] NFSv4: replace seqcount_t with a rw_semaphore

On 2016-10-31 15:30:02 [+0000], Trond Myklebust wrote:
> > On Oct 31, 2016, at 09:19, Sebastian Andrzej Siewior <[email protected]> wrote:
> > The list_for_each_entry() in nfs4_reclaim_open_state:
> > It seems that this lock protects the ->so_states list among other
> > atomic_t & flags members. So at the begin of the loop we inc ->count
> > ensuring that this field is not removed while we use it. So we drop the
> > ->so_lock loc during the loop it seems. And after nfs4_reclaim_locks()
> > invocation we nfs4_put_open_state() and grab the ->so_lock again. So if
> > we were the last user of this struct and we remove it, then the
> > following list_next_entry() invocation is a use-after-free. Even if we
> > use list_for_each_entry_safe() there is no guarantee that the following
> > member is still valid because it might have been removed by someone
> > invoking nfs4_put_open_state() on it, right?
> > So there is this.
> >
> > However to address my initial problem I have here a patch :) So it uses
> > a rw_semaphore which ensures that there is only one writer at a time or
> > multiple reader. So it should be basically what is happening now plus a
> > tiny tiny tiny lock plus lockdep coverage. I tried to this myself but I
> > don't manage to get into this code path at all so I might be doing
> > something wrong.
> >
> > Could you please check if this patch is working for you and whether my
> > list_for_each_entry() observation is correct or not?
> >
> > v2…v3: replace the seqlock with a RW semaphore.
> >
>
> NACK. That will deadlock. The reason why we use a seqlock there is precisely because we cannot allow ordinary RPC calls to lock out the recovery thread.
Hmmm. So this is getting invoked if I reboot the server? A restart of
nfs-kernel-server is the same thing?
Is the list_for_each_entry() observation I made correct?

>If the server reboots, then ordinary RPC calls will fail until the recovery thread has had a chance to re-establish the state.

This means that the ordinary RPC call won't return and fail but wait
with the lock held until the recovery thread did its thing?

Sebastian
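
To make the deadlock being described here concrete, a small userspace model
(a pthread rwlock stands in for the proposed rw_semaphore, a flag for
"state has been re-established"; all names are invented):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t reclaim_rw = PTHREAD_RWLOCK_INITIALIZER;
static atomic_int state_recovered;	/* only the recovery side ever sets this */

/* models an ordinary OPEN: takes the read side, then waits for recovery */
static void *open_path(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&reclaim_rw);	/* cf. down_read() in _nfs4_open_and_get_state() */
	while (!atomic_load(&state_recovered))	/* the RPC cannot complete before recovery ran */
		usleep(1000);
	pthread_rwlock_unlock(&reclaim_rw);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, open_path, NULL);
	sleep(1);	/* crude: give the open path time to take the read lock */

	/*
	 * The recovery side (cf. nfs4_reclaim_open_state()) now needs the
	 * write side.  With a rw_semaphore it would block forever, because
	 * the reader above cannot make progress until recovery has run: a
	 * cycle.  trywrlock lets the demo report the cycle instead of hanging.
	 */
	if (pthread_rwlock_trywrlock(&reclaim_rw) != 0)
		printf("recovery blocks behind a reader that is itself waiting for recovery\n");
	else
		pthread_rwlock_unlock(&reclaim_rw);

	atomic_store(&state_recovered, 1);	/* let the demo reader exit cleanly */
	pthread_join(t, NULL);
	return 0;
}

The seqcount/seqlock variants avoid this cycle because the read side never
blocks the writer; a racing reader only notices afterwards, via
read_seqretry(), that recovery ran and then redoes its work.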

2016-10-31 16:11:08

by Trond Myklebust

Subject: Re: [PATCH v3] NFSv4: replace seqcount_t with a rw_semaphore

> On Oct 31, 2016, at 11:56, Sebastian Andrzej Siewior <[email protected]> wrote:
> 
> On 2016-10-31 15:30:02 [+0000], Trond Myklebust wrote:
>> NACK. That will deadlock. The reason why we use a seqlock there is precisely because we cannot allow ordinary RPC calls to lock out the recovery thread.
> Hmmm. So this is getting invoked if I reboot the server? A restart of
> nfs-kernel-server is the same thing?
> Is the list_for_each_entry() observation I made correct?

Yes, and yes. We can’t rely on the list pointers remaining correct, so we restart the list scan and we use the ops->state_flag_bit to signal whether or not state has been recovered for the entry being scanned.

>> If the server reboots, then ordinary RPC calls will fail until the recovery thread has had a chance to re-establish the state.
> 
> This means that the ordinary RPC call won't return and fail but wait
> with the lock held until the recovery thread did its thing?

It uses the seqcount_t to figure out if a recovery occurred while an OPEN or CLOSE was being processed. If so, it schedules a new recovery of the stateids in question.


Subject: Re: [PATCH v3] NFSv4: replace seqcount_t with a rw_semaphore

On 2016-10-31 16:11:02 [+0000], Trond Myklebust wrote:
>
> Yes, and yes. We can’t rely on the list pointers remaining correct, so we restart the list scan and we use the ops->state_flag_bit to signal whether or not state has been recovered for the entry being scanned.

but this is tested at the top of the loop and by then you have already
followed the list's ->next pointer, which might be invalid.

Sebastian

Subject: [PATCH v4] NFSv4: replace seqcount_t with a seqlock_t

The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
because it maps to preempt_disable() in -RT which I can't have at this
point. So I took a look at the code.
The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
lockdep complained. The whole seqcount thing was introduced in commit
c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
being recovered").
There is only one recovery thread, so there is only one writer at a time.
write_seqlock() does not work on !RT because it disables preemption and the
writer side is preemptible (it has to remain so despite the fact that it
will block readers).

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---

Is the lockdep okay in the read path? I would like to see the lockdep
warning on the writer side but for now it is disabled.

fs/nfs/delegation.c | 4 ++--
fs/nfs/nfs4_fs.h | 2 +-
fs/nfs/nfs4proc.c | 4 ++--
fs/nfs/nfs4state.c | 10 ++++------
4 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index dff600ae0d74..d726d2e09353 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -150,11 +150,11 @@ static int nfs_delegation_claim_opens(struct inode *inode,
sp = state->owner;
/* Block nfs4_proc_unlck */
mutex_lock(&sp->so_delegreturn_mutex);
- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ seq = read_seqbegin(&sp->so_reclaim_seqlock);
err = nfs4_open_delegation_recall(ctx, state, stateid, type);
if (!err)
err = nfs_delegation_claim_locks(ctx, state, stateid);
- if (!err && read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
+ if (!err && read_seqretry(&sp->so_reclaim_seqlock, seq))
err = -EAGAIN;
mutex_unlock(&sp->so_delegreturn_mutex);
put_nfs_open_context(ctx);
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 9b3a82abab07..9b5089bf0f2e 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -111,7 +111,7 @@ struct nfs4_state_owner {
unsigned long so_flags;
struct list_head so_states;
struct nfs_seqid_counter so_seqid;
- seqcount_t so_reclaim_seqcount;
+ seqlock_t so_reclaim_seqlock;
struct mutex so_delegreturn_mutex;
};

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 7897826d7c51..9b9d53cd85f9 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2685,7 +2685,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
unsigned int seq;
int ret;

- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ seq = raw_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);

ret = _nfs4_proc_open(opendata);
if (ret != 0)
@@ -2723,7 +2723,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
ctx->state = state;
if (d_inode(dentry) == state->inode) {
nfs_inode_attach_open_context(ctx);
- if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
+ if (read_seqretry(&sp->so_reclaim_seqlock, seq))
nfs4_schedule_stateid_recovery(server, state);
}
out:
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 5f4281ec5f72..400a9d4186de 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -488,7 +488,7 @@ nfs4_alloc_state_owner(struct nfs_server *server,
nfs4_init_seqid_counter(&sp->so_seqid);
atomic_set(&sp->so_count, 1);
INIT_LIST_HEAD(&sp->so_lru);
- seqcount_init(&sp->so_reclaim_seqcount);
+ seqlock_init(&sp->so_reclaim_seqlock);
mutex_init(&sp->so_delegreturn_mutex);
return sp;
}
@@ -1497,8 +1497,8 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
* recovering after a network partition or a reboot from a
* server that doesn't support a grace period.
*/
+ raw_write_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);
spin_lock(&sp->so_lock);
- raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
restart:
list_for_each_entry(state, &sp->so_states, open_states) {
if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
@@ -1566,14 +1566,12 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
spin_lock(&sp->so_lock);
goto restart;
}
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
spin_unlock(&sp->so_lock);
+ raw_write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);
return 0;
out_err:
nfs4_put_open_state(state);
- spin_lock(&sp->so_lock);
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
- spin_unlock(&sp->so_lock);
+ raw_write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);
return status;
}

--
2.10.2


2016-11-02 17:37:43

by Trond Myklebust

Subject: Re: [PATCH v3] NFSv4: replace seqcount_t with a rw_semaphore

> On Nov 2, 2016, at 13:11, Sebastian Andrzej Siewior <[email protected]> wrote:
> 
> On 2016-10-31 16:11:02 [+0000], Trond Myklebust wrote:
>> 
>> Yes, and yes. We can’t rely on the list pointers remaining correct, so we restart the list scan and we use the ops->state_flag_bit to signal whether or not state has been recovered for the entry being scanned.
> 
> but this is tested at the top of the loop and by then you have already
> followed the list's ->next pointer, which might be invalid.

No. We ensure we restart the list scan if we release the spinlock. It’s safe…