Subject: [RFC PATCH] NFSv4: replace seqcount_t with a seqlock_t

The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
because it maps to preempt_disable() in -RT which I can't have at this
point. So I took a look at the code.
The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
lockdep complained. The whole seqcount thing was introduced in commit
c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
being recovered").
I don't understand how it is possible that we don't end up with two
writers for the same resource, because the `sp->so_lock' lock is dropped
so soon in the list_for_each_entry() loop. It might be prevented by the
test_and_clear_bit() check in nfs4_do_reclaim(), but that clears only one
bit on each iteration, so I *think* we could have two invocations of
nfs4_reclaim_open_state() for the same struct nfs4_state_owner.
So there is that.

But back to the list_for_each_entry() macro.
It seems that this `so_lock' lock protects the ->so_states list among
other atomic_t & flags members. At the beginning of the loop we increment
->count, ensuring that the entry is not freed while we use it, and we
drop the ->so_lock lock during the loop. After the nfs4_reclaim_locks()
invocation we call nfs4_put_open_state() and then grab the ->so_lock
again. So if we were the last user of this struct and it gets freed, the
following list_next_entry() invocation is a use-after-free. Even if we
used list_for_each_entry_safe() there would be no guarantee that the
following member is still valid, because it might have been removed by
something that invoked nfs4_put_open_state() on it, right?
So there is this.

However, to address my initial problem I have here a patch :) It uses
a seqlock_t, which ensures that there is only one writer at a time. So it
should be basically what is happening now plus a tiny tiny tiny lock
plus lockdep coverage. I tried to test this myself but I didn't manage to
get into this code path at all, so I might be doing something wrong.
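
For reference, a seqlock_t pairs the seqcount with an internal spinlock,
so concurrent writers serialize on that lock while readers stay lockless
and simply retry. A minimal usage sketch (not NFS code; 'demo_lock' and
'demo_val' are made up for illustration):

	static DEFINE_SEQLOCK(demo_lock);
	static int demo_val;

	static void demo_write(void)
	{
		write_seqlock(&demo_lock);	/* serializes writers */
		demo_val++;
		write_sequnlock(&demo_lock);
	}

	static int demo_read(void)
	{
		unsigned int seq;
		int val;

		do {
			seq = read_seqbegin(&demo_lock);
			val = demo_val;		/* retried on writer activity */
		} while (read_seqretry(&demo_lock, seq));
		return val;
	}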

Could you please check if this patch is working for you and whether my
list_for_each_entry() observation is correct or not?

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
fs/nfs/delegation.c |  4 ++--
fs/nfs/nfs4_fs.h    |  2 +-
fs/nfs/nfs4proc.c   |  4 ++--
fs/nfs/nfs4state.c  | 10 ++++------
4 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index dff600ae0d74..d726d2e09353 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -150,11 +150,11 @@ static int nfs_delegation_claim_opens(struct inode *inode,
sp = state->owner;
/* Block nfs4_proc_unlck */
mutex_lock(&sp->so_delegreturn_mutex);
- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ seq = read_seqbegin(&sp->so_reclaim_seqlock);
err = nfs4_open_delegation_recall(ctx, state, stateid, type);
if (!err)
err = nfs_delegation_claim_locks(ctx, state, stateid);
- if (!err && read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
+ if (!err && read_seqretry(&sp->so_reclaim_seqlock, seq))
err = -EAGAIN;
mutex_unlock(&sp->so_delegreturn_mutex);
put_nfs_open_context(ctx);
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 9b3a82abab07..9b5089bf0f2e 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -111,7 +111,7 @@ struct nfs4_state_owner {
unsigned long so_flags;
struct list_head so_states;
struct nfs_seqid_counter so_seqid;
- seqcount_t so_reclaim_seqcount;
+ seqlock_t so_reclaim_seqlock;
struct mutex so_delegreturn_mutex;
};

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index ad917bd72b38..3d6be8405c08 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2685,7 +2685,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
unsigned int seq;
int ret;

- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ seq = raw_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);

ret = _nfs4_proc_open(opendata);
if (ret != 0)
@@ -2723,7 +2723,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
ctx->state = state;
if (d_inode(dentry) == state->inode) {
nfs_inode_attach_open_context(ctx);
- if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
+ if (read_seqretry(&sp->so_reclaim_seqlock, seq))
nfs4_schedule_stateid_recovery(server, state);
}
out:
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 5f4281ec5f72..74be6378ca08 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -488,7 +488,7 @@ nfs4_alloc_state_owner(struct nfs_server *server,
nfs4_init_seqid_counter(&sp->so_seqid);
atomic_set(&sp->so_count, 1);
INIT_LIST_HEAD(&sp->so_lru);
- seqcount_init(&sp->so_reclaim_seqcount);
+ seqlock_init(&sp->so_reclaim_seqlock);
mutex_init(&sp->so_delegreturn_mutex);
return sp;
}
@@ -1497,8 +1497,8 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
* recovering after a network partition or a reboot from a
* server that doesn't support a grace period.
*/
+ write_seqlock(&sp->so_reclaim_seqlock);
spin_lock(&sp->so_lock);
- raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
restart:
list_for_each_entry(state, &sp->so_states, open_states) {
if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
@@ -1566,14 +1566,12 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
spin_lock(&sp->so_lock);
goto restart;
}
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
spin_unlock(&sp->so_lock);
+ write_sequnlock(&sp->so_reclaim_seqlock);
return 0;
out_err:
nfs4_put_open_state(state);
- spin_lock(&sp->so_lock);
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
- spin_unlock(&sp->so_lock);
+ write_sequnlock(&sp->so_reclaim_seqlock);
return status;
}

--
2.9.3


2016-10-25 06:53:27

by kernel test robot

Subject: [lkp] [NFSv4] 931437ee2c: BUG: sleeping function called from invalid context at mm/slab.h:393


FYI, we noticed the following commit:

https://github.com/0day-ci/linux Sebastian-Andrzej-Siewior/NFSv4-replace-seqcount_t-with-a-seqlock_t/20161022-013104
commit 931437ee2c100a50c36771c947ce3674f8160592 ("NFSv4: replace seqcount_t with a seqlock_t")

in testcase: ebizzy
with following parameters:

nr_threads: 200%
iterations: 100x
duration: 10s


ebizzy is designed to generate a workload resembling common web application server workloads.


on test machine: qemu-system-x86_64 -enable-kvm -cpu kvm64,+ssse3 -m 1G

caused below changes:


+----------------------------------------------------------------------+------------+------------+
| | 14155cafea | 931437ee2c |
+----------------------------------------------------------------------+------------+------------+
| boot_successes | 4 | 0 |
| boot_failures | 20 | 22 |
| invoked_oom-killer:gfp_mask=0x | 2 | 2 |
| Mem-Info | 2 | 2 |
| Kernel_panic-not_syncing:Out_of_memory_and_no_killable_processes | 2 | 2 |
| BUG:kernel_reboot-without-warning_in_test_stage | 18 | 1 |
| WARNING:at_fs/sysfs/dir.c:#sysfs_warn_dup | 0 | 2 |
| calltrace:parport_pc_init | 0 | 2 |
| calltrace:SyS_finit_module | 0 | 2 |
| WARNING:at_lib/kobject.c:#kobject_add_internal | 0 | 2 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h | 0 | 17 |
| calltrace:nfs4_run_state_manager | 0 | 17 |
| BUG:scheduling_while_atomic | 0 | 17 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/mempool.c | 0 | 17 |
| WARNING:at_kernel/time/timer.c:#del_timer_sync | 0 | 17 |
| BUG:sleeping_function_called_from_invalid_context_at_net/core/sock.c | 0 | 17 |
| Kernel_panic-not_syncing:Aiee,killing_interrupt_handler | 0 | 1 |
+----------------------------------------------------------------------+------------+------------+



[ 81.000854]
[ 81.126394] 2016-10-24 04:32:43 ./ebizzy -t 2 -S 10
[ 81.195131]
[ 90.526940] BUG: sleeping function called from invalid context at mm/slab.h:393
[ 90.565208] in_atomic(): 1, irqs_disabled(): 0, pid: 892, name: 192.168.1.1-man
[ 90.591099] CPU: 0 PID: 892 Comm: 192.168.1.1-man Not tainted 4.9.0-rc1-00004-g931437e #1
[ 90.618251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 90.646609] ffffc900007c7c50 ffffffff81465d39 ffff88003ce54900 0000000000000189
[ 90.681028] ffffc900007c7c68 ffffffff810a7bf3 ffffffff81ca1ad7 ffffc900007c7c90
[ 90.731868] ffffffff810a7c8a 0000000002408040 0000000002408040 ffff88002d001140
[ 90.800830] Call Trace:
[ 90.822833] [<ffffffff81465d39>] dump_stack+0x63/0x8a
[ 90.848179] [<ffffffff810a7bf3>] ___might_sleep+0xd3/0x120
[ 90.874357] [<ffffffff810a7c8a>] __might_sleep+0x4a/0x80
[ 90.890597] [<ffffffff811ebbde>] kmem_cache_alloc_trace+0x15e/0x1b0
[ 90.908456] [<ffffffffa00fea19>] nfs4_opendata_alloc+0x69/0x4f0 [nfsv4]
[ 90.926565] [<ffffffffa00feecf>] nfs4_open_recoverdata_alloc+0x2f/0x60 [nfsv4]
[ 90.953595] [<ffffffffa00ff65b>] nfs4_open_expired+0xab/0x190 [nfsv4]
[ 90.979588] [<ffffffffa00ff76c>] nfs40_open_expired+0x2c/0x30 [nfsv4]
[ 90.997767] [<ffffffffa0112a67>] nfs4_do_reclaim+0x177/0x700 [nfsv4]
[ 91.016039] [<ffffffffa01130c4>] nfs4_state_manager+0xd4/0x820 [nfsv4]
[ 91.034842] [<ffffffffa0113810>] ? nfs4_state_manager+0x820/0x820 [nfsv4]
[ 91.053886] [<ffffffffa0113834>] nfs4_run_state_manager+0x24/0x40 [nfsv4]
[ 91.072752] [<ffffffff810a05f5>] kthread+0xd5/0xf0
[ 91.087524] [<ffffffff810a0520>] ? kthread_park+0x60/0x60
[ 91.103832] [<ffffffff81955205>] ret_from_fork+0x25/0x30
[ 91.120094] BUG: scheduling while atomic: 192.168.1.1-man/892/0x00000002
[ 91.138151] Modules linked in: sr_mod cdrom sg ppdev rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver ata_generic pata_acpi snd_pcm snd_timer snd soundcore pcspkr ata_piix serio_raw i2c_piix4 libata parport_pc parport floppy acpi_cpufreq ip_tables
[ 91.231670] CPU: 0 PID: 892 Comm: 192.168.1.1-man Tainted: G W 4.9.0-rc1-00004-g931437e #1
[ 91.261226] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 91.288772] ffffc900007c7b00 ffffffff81465d39 0000000000000000 ffff88002d419300
[ 91.319323] ffffc900007c7b10 ffffffff810a77b4 ffffc900007c7b68 ffffffff8194fc84
[ 91.349108] ffff88003ccfe940 00ff88002d418b00 ffff88002d419300 0000000000014118
[ 91.388579] Call Trace:
[ 91.409483] [<ffffffff81465d39>] dump_stack+0x63/0x8a
[ 91.434764] [<ffffffff810a77b4>] __schedule_bug+0x54/0x70
[ 91.450511] [<ffffffff8194fc84>] __schedule+0x554/0x6f0
[ 91.466616] [<ffffffff819183d0>] ? __rpc_wait_for_completion_task+0x30/0x30
[ 91.485829] [<ffffffff8194fe5d>] schedule+0x3d/0x90
[ 91.501617] [<ffffffff819183f4>] rpc_wait_bit_killable+0x24/0xb0
[ 91.529513] [<ffffffff819502a8>] __wait_on_bit+0x58/0x90
[ 91.555378] [<ffffffff819183d0>] ? __rpc_wait_for_completion_task+0x30/0x30
[ 91.574619] [<ffffffff81950353>] out_of_line_wait_on_bit+0x73/0x80
[ 91.607997] [<ffffffff810c5f50>] ? wake_atomic_t_function+0x60/0x60
[ 91.632334] [<ffffffff819183cd>] __rpc_wait_for_completion_task+0x2d/0x30
[ 91.651104] [<ffffffffa00fc204>] nfs4_run_open_task+0x124/0x180 [nfsv4]
[ 91.669799] [<ffffffffa00fcf99>] nfs4_open_recover_helper+0x169/0x230 [nfsv4]
[ 91.695951] [<ffffffffa00fd0bd>] nfs4_open_recover+0x5d/0xf0 [nfsv4]
[ 91.722422] [<ffffffffa00feecf>] ? nfs4_open_recoverdata_alloc+0x2f/0x60 [nfsv4]
[ 91.770448] [<ffffffffa00ff674>] nfs4_open_expired+0xc4/0x190 [nfsv4]
[ 91.795957] [<ffffffffa00ff76c>] nfs40_open_expired+0x2c/0x30 [nfsv4]
[ 91.814298] [<ffffffffa0112a67>] nfs4_do_reclaim+0x177/0x700 [nfsv4]
[ 91.846548] [<ffffffffa01130c4>] nfs4_state_manager+0xd4/0x820 [nfsv4]
[ 91.864852] [<ffffffffa0113810>] ? nfs4_state_manager+0x820/0x820 [nfsv4]
[ 91.883782] [<ffffffffa0113834>] nfs4_run_state_manager+0x24/0x40 [nfsv4]
[ 91.902511] [<ffffffff810a05f5>] kthread+0xd5/0xf0
[ 91.918131] [<ffffffff810a0520>] ? kthread_park+0x60/0x60
[ 91.934622] [<ffffffff81955205>] ret_from_fork+0x25/0x30
[ 92.049501] 695 records/s 348 346
[ 92.105889]
[ 92.239333] real 11.00 s


To reproduce:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml



Thanks,
Xiaolong


Subject: Re: [lkp] [NFSv4] 931437ee2c: BUG: sleeping function called from invalid context at mm/slab.h:393

On 2016-10-25 14:52:39 [+0800], kernel test robot wrote:
>
> FYI, we noticed the following commit:
>
> https://github.com/0day-ci/linux Sebastian-Andrzej-Siewior/NFSv4-replace-seqcount_t-with-a-seqlock_t/20161022-013104
> commit 931437ee2c100a50c36771c947ce3674f8160592 ("NFSv4: replace seqcount_t with a seqlock_t")
>
> in testcase: ebizzy
> with following parameters:
>
> nr_threads: 200%
> iterations: 100x
> duration: 10s
>
>
> ebizzy is designed to generate a workload resembling common web application server workloads.
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu kvm64,+ssse3 -m 1G

I can't get it triggered. In fact I never get even close to the code
path. I tried NFS as rootfs and this test of yours, and I tried
"xfstests-dev -nfs" with no luck.
So you have NFS as your rootfs. Is there anything special on the server
side? It seems to be a basic NFSv4 export.

Sebastian

Subject: [PATCH v2] NFSv4: replace seqcount_t with a seqlock_t

The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
because it maps to preempt_disable() in -RT which I can't have at this
point. So I took a look at the code.
The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
lockdep complained. The whole seqcount thing was introduced in commit
c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
being recovered").
I don't understand how it is possible that we don't end up with two
writers for the same resource, because the `sp->so_lock' lock is dropped
so soon in the list_for_each_entry() loop. It might be prevented by the
test_and_clear_bit() check in nfs4_do_reclaim(), but that clears only one
bit on each iteration, so I *think* we could have two invocations of
nfs4_reclaim_open_state() for the same struct nfs4_state_owner.
So there is that.

But back to the list_for_each_entry() macro.
It seems that this lock protects the ->so_states list among other
atomic_t & flags members. At the beginning of the loop we increment
->count, ensuring that the entry is not freed while we use it, and we
drop the ->so_lock lock during the loop. After the nfs4_reclaim_locks()
invocation we call nfs4_put_open_state() and grab the ->so_lock again.
So if we were the last user of this struct and it gets freed, then the
following list_next_entry() invocation is a use-after-free. Even if we
used list_for_each_entry_safe() there would be no guarantee that the
following member is still valid, because it might have been removed by
another writer, right?
So there is this.

However, to address my initial problem I have here a patch :) It uses
a seqlock_t, which ensures that there is only one writer at a time. So it
should be basically what is happening now plus a tiny tiny tiny lock
plus lockdep coverage. I tried to test this myself but I didn't manage to
get into this code path at all, so I might be doing something wrong.

Could you please check if this patch is working for you and whether my
list_for_each_entry() observation is correct or not?

v1…v2: write_seqlock() disables preemption, and some functions need it
enabled (kthread_run(), non-GFP_ATOMIC memory allocation). On the other
hand we don't want the writer to be preempted, because a preempted
writer would leave the readers spinning. This is a duct tape mutex.
Maybe the seqlock should go.
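
For reference, write_seqlock() is essentially the following (a sketch of
the implementation in include/linux/seqlock.h at the time), which is why
the writer side ends up non-preemptible and the mutex below is used for
writer-vs-writer exclusion instead:

	static inline void write_seqlock(seqlock_t *sl)
	{
		spin_lock(&sl->lock);		/* disables preemption on !RT */
		write_seqcount_begin(&sl->seqcount);
	}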

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
fs/nfs/delegation.c |  4 ++--
fs/nfs/nfs4_fs.h    |  3 ++-
fs/nfs/nfs4proc.c   |  4 ++--
fs/nfs/nfs4state.c  | 23 +++++++++++++++++------
4 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index dff600ae0d74..d726d2e09353 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -150,11 +150,11 @@ static int nfs_delegation_claim_opens(struct inode *inode,
sp = state->owner;
/* Block nfs4_proc_unlck */
mutex_lock(&sp->so_delegreturn_mutex);
- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ seq = read_seqbegin(&sp->so_reclaim_seqlock);
err = nfs4_open_delegation_recall(ctx, state, stateid, type);
if (!err)
err = nfs_delegation_claim_locks(ctx, state, stateid);
- if (!err && read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
+ if (!err && read_seqretry(&sp->so_reclaim_seqlock, seq))
err = -EAGAIN;
mutex_unlock(&sp->so_delegreturn_mutex);
put_nfs_open_context(ctx);
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 9b3a82abab07..2fee1a2e8b57 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -111,7 +111,8 @@ struct nfs4_state_owner {
unsigned long so_flags;
struct list_head so_states;
struct nfs_seqid_counter so_seqid;
- seqcount_t so_reclaim_seqcount;
+ seqlock_t so_reclaim_seqlock;
+ struct mutex so_reclaim_seqlock_mutex;
struct mutex so_delegreturn_mutex;
};

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 7897826d7c51..9b9d53cd85f9 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2685,7 +2685,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
unsigned int seq;
int ret;

- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ seq = raw_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);

ret = _nfs4_proc_open(opendata);
if (ret != 0)
@@ -2723,7 +2723,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
ctx->state = state;
if (d_inode(dentry) == state->inode) {
nfs_inode_attach_open_context(ctx);
- if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
+ if (read_seqretry(&sp->so_reclaim_seqlock, seq))
nfs4_schedule_stateid_recovery(server, state);
}
out:
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 5f4281ec5f72..a442d9867942 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -488,7 +488,8 @@ nfs4_alloc_state_owner(struct nfs_server *server,
nfs4_init_seqid_counter(&sp->so_seqid);
atomic_set(&sp->so_count, 1);
INIT_LIST_HEAD(&sp->so_lru);
- seqcount_init(&sp->so_reclaim_seqcount);
+ seqlock_init(&sp->so_reclaim_seqlock);
+ mutex_init(&sp->so_reclaim_seqlock_mutex);
mutex_init(&sp->so_delegreturn_mutex);
return sp;
}
@@ -1497,8 +1498,18 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
* recovering after a network partition or a reboot from a
* server that doesn't support a grace period.
*/
+ /*
+ * XXX
+ * This mutex is wrong. It protects against multiple writer. However
+ * write_seqlock() should have been used for this task. This would avoid
+ * preemption while the seqlock is held which is good because the writer
+ * shouldn't be preempted as it would let the reader spin for no good
+ * reason. There are a few memory allocations and kthread_run() so we
+ * have this mutex now.
+ */
+ mutex_lock(&sp->so_reclaim_seqlock_mutex);
+ write_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);
spin_lock(&sp->so_lock);
- raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
restart:
list_for_each_entry(state, &sp->so_states, open_states) {
if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
@@ -1566,14 +1577,14 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
spin_lock(&sp->so_lock);
goto restart;
}
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
spin_unlock(&sp->so_lock);
+ write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);
+ mutex_unlock(&sp->so_reclaim_seqlock_mutex);
return 0;
out_err:
nfs4_put_open_state(state);
- spin_lock(&sp->so_lock);
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
- spin_unlock(&sp->so_lock);
+ write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);
+ mutex_unlock(&sp->so_reclaim_seqlock_mutex);
return status;
}

--
2.10.1

2016-10-28 22:25:01

by Trond Myklebust

Subject: Re: [PATCH v2] NFSv4: replace seqcount_t with a seqlock_t


> On Oct 28, 2016, at 17:05, Sebastian Andrzej Siewior <[email protected]> wrote:
>
> The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
> because it maps to preempt_disable() in -RT which I can't have at this
> point. So I took a look at the code.
> The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
> raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
> lockdep complained. The whole seqcount thing was introduced in commit
> c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
> being recovered").
> I don't understand how it is possible that we don't end up with two
> writers for the same resource, because the `sp->so_lock' lock is dropped
> so soon in the list_for_each_entry() loop. It might be prevented by the
> test_and_clear_bit() check in nfs4_do_reclaim(), but that clears only one
> bit on each iteration, so I *think* we could have two invocations of
> nfs4_reclaim_open_state() for the same struct nfs4_state_owner.
> So there is that.
>
> But back to the list_for_each_entry() macro.
> It seems that this lock protects the ->so_states list among other
> atomic_t & flags members. At the beginning of the loop we increment
> ->count, ensuring that the entry is not freed while we use it, and we
> drop the ->so_lock lock during the loop. After the nfs4_reclaim_locks()
> invocation we call nfs4_put_open_state() and grab the ->so_lock again.
> So if we were the last user of this struct and it gets freed, then the
> following list_next_entry() invocation is a use-after-free. Even if we
> used list_for_each_entry_safe() there would be no guarantee that the
> following member is still valid, because it might have been removed by
> another writer, right?
> So there is this.
>
> However, to address my initial problem I have here a patch :) It uses
> a seqlock_t, which ensures that there is only one writer at a time. So it
> should be basically what is happening now plus a tiny tiny tiny lock
> plus lockdep coverage. I tried to test this myself but I didn't manage to
> get into this code path at all, so I might be doing something wrong.
>
> Could you please check if this patch is working for you and whether my
> list_for_each_entry() observation is correct or not?
>
> v1…v2: write_seqlock() disables preemption, and some functions need it
> enabled (kthread_run(), non-GFP_ATOMIC memory allocation). On the other
> hand we don't want the writer to be preempted, because a preempted
> writer would leave the readers spinning. This is a duct tape mutex.
> Maybe the seqlock should go.
>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> fs/nfs/delegation.c | 4 ++--
> fs/nfs/nfs4_fs.h | 3 ++-
> fs/nfs/nfs4proc.c | 4 ++--
> fs/nfs/nfs4state.c | 23 +++++++++++++++++------
> 4 files changed, 23 insertions(+), 11 deletions(-)
>
> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
> index dff600ae0d74..d726d2e09353 100644
> --- a/fs/nfs/delegation.c
> +++ b/fs/nfs/delegation.c
> @@ -150,11 +150,11 @@ static int nfs_delegation_claim_opens(struct inode *inode,
> sp = state->owner;
> /* Block nfs4_proc_unlck */
> mutex_lock(&sp->so_delegreturn_mutex);
> - seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
> + seq = read_seqbegin(&sp->so_reclaim_seqlock);
> err = nfs4_open_delegation_recall(ctx, state, stateid, type);
> if (!err)
> err = nfs_delegation_claim_locks(ctx, state, stateid);
> - if (!err && read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
> + if (!err && read_seqretry(&sp->so_reclaim_seqlock, seq))
> err = -EAGAIN;
> mutex_unlock(&sp->so_delegreturn_mutex);
> put_nfs_open_context(ctx);
> diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
> index 9b3a82abab07..2fee1a2e8b57 100644
> --- a/fs/nfs/nfs4_fs.h
> +++ b/fs/nfs/nfs4_fs.h
> @@ -111,7 +111,8 @@ struct nfs4_state_owner {
> unsigned long so_flags;
> struct list_head so_states;
> struct nfs_seqid_counter so_seqid;
> - seqcount_t so_reclaim_seqcount;
> + seqlock_t so_reclaim_seqlock;
> + struct mutex so_reclaim_seqlock_mutex;
> struct mutex so_delegreturn_mutex;
> };
>
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 7897826d7c51..9b9d53cd85f9 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -2685,7 +2685,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
> unsigned int seq;
> int ret;
>
> - seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
> + seq = raw_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);
>
> ret = _nfs4_proc_open(opendata);
> if (ret != 0)
> @@ -2723,7 +2723,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
> ctx->state = state;
> if (d_inode(dentry) == state->inode) {
> nfs_inode_attach_open_context(ctx);
> - if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
> + if (read_seqretry(&sp->so_reclaim_seqlock, seq))
> nfs4_schedule_stateid_recovery(server, state);
> }
> out:
> diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
> index 5f4281ec5f72..a442d9867942 100644
> --- a/fs/nfs/nfs4state.c
> +++ b/fs/nfs/nfs4state.c
> @@ -488,7 +488,8 @@ nfs4_alloc_state_owner(struct nfs_server *server,
> nfs4_init_seqid_counter(&sp->so_seqid);
> atomic_set(&sp->so_count, 1);
> INIT_LIST_HEAD(&sp->so_lru);
> - seqcount_init(&sp->so_reclaim_seqcount);
> + seqlock_init(&sp->so_reclaim_seqlock);
> + mutex_init(&sp->so_reclaim_seqlock_mutex);
> mutex_init(&sp->so_delegreturn_mutex);
> return sp;
> }
> @@ -1497,8 +1498,18 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
> * recovering after a network partition or a reboot from a
> * server that doesn't support a grace period.
> */
> + /*
> + * XXX
> + * This mutex is wrong. It protects against multiple writer. However

There is only 1 recovery thread per client/server pair, which is why we know there is only a single writer. No need for a mutex here.

> + * write_seqlock() should have been used for this task. This would avoid
> + * preemption while the seqlock is held which is good because the writer
> + * shouldn't be preempted as it would let the reader spin for no good

Recovery involves sending RPC calls to a server. There is no avoiding preemption if we have to recover state. The point of the sequence counter is to signal to processes that may have raced with this recovery thread that they may need to replay their locks.
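
Schematically, that is the read-side pattern visible in
_nfs4_open_and_get_state() above (sketch):

	seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
	ret = _nfs4_proc_open(opendata);	/* may race with recovery */
	...
	if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
		nfs4_schedule_stateid_recovery(server, state);	/* replay */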

> + * reason. There are a few memory allocations and kthread_run() so we
> + * have this mutex now.
> + */
> + mutex_lock(&sp->so_reclaim_seqlock_mutex);
> + write_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);
> spin_lock(&sp->so_lock);
> - raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
> restart:
> list_for_each_entry(state, &sp->so_states, open_states) {
> if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
> @@ -1566,14 +1577,14 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
> spin_lock(&sp->so_lock);
> goto restart;
> }
> - raw_write_seqcount_end(&sp->so_reclaim_seqcount);
> spin_unlock(&sp->so_lock);
> + write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);

This will reintroduce lockdep checking. We don’t need or want that...

> + mutex_unlock(&sp->so_reclaim_seqlock_mutex);
> return 0;
> out_err:
> nfs4_put_open_state(state);
> - spin_lock(&sp->so_lock);
> - raw_write_seqcount_end(&sp->so_reclaim_seqcount);
> - spin_unlock(&sp->so_lock);
> + write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);
> + mutex_unlock(&sp->so_reclaim_seqlock_mutex);
> return status;
> }
>
> --
> 2.10.1
>


Subject: Re: [PATCH v2] NFSv4: replace seqcount_t with a seqlock_t

On 2016-10-28 22:24:52 [+0000], Trond Myklebust wrote:
> > diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
> > index 5f4281ec5f72..a442d9867942 100644
> > --- a/fs/nfs/nfs4state.c
> > +++ b/fs/nfs/nfs4state.c
> > @@ -1497,8 +1498,18 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
> > * recovering after a network partition or a reboot from a
> > * server that doesn't support a grace period.
> > */
> > + /*
> > + * XXX
> > + * This mutex is wrong. It protects against multiple writer. However
>
> There is only 1 recovery thread per client/server pair, which is why we know there is only a single writer. No need for a mutex here.

This isn't documented but it should be.

> > + * write_seqlock() should have been used for this task. This would avoid
> > + * preemption while the seqlock is held which is good because the writer
> > + * shouldn't be preempted as it would let the reader spin for no good
>
> Recovery involves sending RPC calls to a server. There is no avoiding preemption if we have to recover state. The point of the sequence counter is to signal to processes that may have raced with this recovery thread that they may need to replay their locks.

Then the sequence counter is the wrong mechanism here. As I
explained: the call can be preempted, and once it is, the readers will
spin and waste cycles.
What is wrong with a rwsem? Are the readers not preemptible?

> > + * reason. There are a few memory allocations and kthread_run() so we
> > + * have this mutex now.
> > + */
> > + mutex_lock(&sp->so_reclaim_seqlock_mutex);
> > + write_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);
> > spin_lock(&sp->so_lock);
> > - raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
> > restart:
> > list_for_each_entry(state, &sp->so_states, open_states) {
> > if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
> > @@ -1566,14 +1577,14 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
> > spin_lock(&sp->so_lock);
> > goto restart;
> > }
> > - raw_write_seqcount_end(&sp->so_reclaim_seqcount);
> > spin_unlock(&sp->so_lock);
> > + write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);
>
> This will reintroduce lockdep checking. We don’t need or want that...

Why isn't lockdep welcome? And what kind of warning will it introduce?
How do I trigger this? I tried various things but never got close to
this code path.

Sebastian

Subject: [PATCH v3] NFSv4: replace seqcount_t with a rw_semaphore

The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
because it maps to preempt_disable() in -RT which I can't have at this
point. So I took a look at the code.
The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
lockdep complained. The whole seqcount thing was introduced in commit
c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
being recovered").
Trond Myklebust confirms that there is only one writer thread in
nfs4_reclaim_open_state().

The list_for_each_entry() in nfs4_reclaim_open_state:
It seems that this lock protects the ->so_states list among other
atomic_t & flags members. At the beginning of the loop we increment
->count, ensuring that the entry is not freed while we use it, and we
drop the ->so_lock lock during the loop. After the nfs4_reclaim_locks()
invocation we call nfs4_put_open_state() and grab the ->so_lock again.
So if we were the last user of this struct and it gets freed, then the
following list_next_entry() invocation is a use-after-free. Even if we
used list_for_each_entry_safe() there would be no guarantee that the
following member is still valid, because it might have been removed by
someone invoking nfs4_put_open_state() on it, right?
So there is this.

However, to address my initial problem I have here a patch :) It uses
a rw_semaphore, which allows either one writer at a time or multiple
readers. So it should be basically what is happening now plus a tiny
tiny tiny lock plus lockdep coverage. I tried to test this myself but I
didn't manage to get into this code path at all, so I might be doing
something wrong.

Could you please check if this patch is working for you and whether my
list_for_each_entry() observation is correct or not?

v2…v3: replace the seqlock with a RW semaphore.

v1…v2: write_seqlock() disables preemption, and some functions need it
enabled (kthread_run(), non-GFP_ATOMIC memory allocation). On the other
hand we don't want the writer to be preempted, because a preempted
writer would leave the readers spinning. This is a duct tape mutex.
Maybe the seqlock should go.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
fs/nfs/delegation.c |  6 ++----
fs/nfs/nfs4_fs.h    |  2 +-
fs/nfs/nfs4proc.c   |  9 +++------
fs/nfs/nfs4state.c  | 10 ++++------
4 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index dff600ae0d74..55d5531aedbb 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -130,7 +130,6 @@ static int nfs_delegation_claim_opens(struct inode *inode,
struct nfs_open_context *ctx;
struct nfs4_state_owner *sp;
struct nfs4_state *state;
- unsigned int seq;
int err;

again:
@@ -150,12 +149,11 @@ static int nfs_delegation_claim_opens(struct inode *inode,
sp = state->owner;
/* Block nfs4_proc_unlck */
mutex_lock(&sp->so_delegreturn_mutex);
- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ down_read(&sp->so_reclaim_rw);
err = nfs4_open_delegation_recall(ctx, state, stateid, type);
if (!err)
err = nfs_delegation_claim_locks(ctx, state, stateid);
- if (!err && read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
- err = -EAGAIN;
+ up_read(&sp->so_reclaim_rw);
mutex_unlock(&sp->so_delegreturn_mutex);
put_nfs_open_context(ctx);
if (err != 0)
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 9b3a82abab07..03d37826a183 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -111,7 +111,7 @@ struct nfs4_state_owner {
unsigned long so_flags;
struct list_head so_states;
struct nfs_seqid_counter so_seqid;
- seqcount_t so_reclaim_seqcount;
+ struct rw_semaphore so_reclaim_rw;
struct mutex so_delegreturn_mutex;
};

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 7897826d7c51..ba6589d1fd36 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2682,10 +2682,9 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
struct nfs_server *server = sp->so_server;
struct dentry *dentry;
struct nfs4_state *state;
- unsigned int seq;
int ret;

- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ down_read(&sp->so_reclaim_rw);

ret = _nfs4_proc_open(opendata);
if (ret != 0)
@@ -2721,12 +2720,10 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
goto out;

ctx->state = state;
- if (d_inode(dentry) == state->inode) {
+ if (d_inode(dentry) == state->inode)
nfs_inode_attach_open_context(ctx);
- if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
- nfs4_schedule_stateid_recovery(server, state);
- }
out:
+ up_read(&sp->so_reclaim_rw);
return ret;
}

diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 5f4281ec5f72..61c431fa2fe3 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -488,7 +488,7 @@ nfs4_alloc_state_owner(struct nfs_server *server,
nfs4_init_seqid_counter(&sp->so_seqid);
atomic_set(&sp->so_count, 1);
INIT_LIST_HEAD(&sp->so_lru);
- seqcount_init(&sp->so_reclaim_seqcount);
+ init_rwsem(&sp->so_reclaim_rw);
mutex_init(&sp->so_delegreturn_mutex);
return sp;
}
@@ -1497,8 +1497,8 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
* recovering after a network partition or a reboot from a
* server that doesn't support a grace period.
*/
+ down_write(&sp->so_reclaim_rw);
spin_lock(&sp->so_lock);
- raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
restart:
list_for_each_entry(state, &sp->so_states, open_states) {
if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
@@ -1566,14 +1566,12 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
spin_lock(&sp->so_lock);
goto restart;
}
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
spin_unlock(&sp->so_lock);
+ up_write(&sp->so_reclaim_rw);
return 0;
out_err:
nfs4_put_open_state(state);
- spin_lock(&sp->so_lock);
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
- spin_unlock(&sp->so_lock);
+ up_write(&sp->so_reclaim_rw);
return status;
}

--
2.10.2

2016-10-31 15:30:16

by Trond Myklebust

Subject: Re: [PATCH v3] NFSv4: replace seqcount_t with a rw_semaphore


> On Oct 31, 2016, at 09:19, Sebastian Andrzej Siewior <[email protected]> wrote:
>
> The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
> because it maps to preempt_disable() in -RT which I can't have at this
> point. So I took a look at the code.
> The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
> raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
> lockdep complained. The whole seqcount thing was introduced in commit
> c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
> being recovered").
> Trond Myklebust confirms that there is only one writer thread in
> nfs4_reclaim_open_state().
>
> The list_for_each_entry() in nfs4_reclaim_open_state:
> It seems that this lock protects the ->so_states list among other
> atomic_t & flags members. At the beginning of the loop we increment
> ->count, ensuring that the entry is not freed while we use it, and we
> drop the ->so_lock lock during the loop. After the nfs4_reclaim_locks()
> invocation we call nfs4_put_open_state() and grab the ->so_lock again.
> So if we were the last user of this struct and it gets freed, then the
> following list_next_entry() invocation is a use-after-free. Even if we
> used list_for_each_entry_safe() there would be no guarantee that the
> following member is still valid, because it might have been removed by
> someone invoking nfs4_put_open_state() on it, right?
> So there is this.
>
> However, to address my initial problem I have here a patch :) It uses
> a rw_semaphore, which allows either one writer at a time or multiple
> readers. So it should be basically what is happening now plus a tiny
> tiny tiny lock plus lockdep coverage. I tried to test this myself but I
> didn't manage to get into this code path at all, so I might be doing
> something wrong.
>
> Could you please check if this patch is working for you and whether my
> list_for_each_entry() observation is correct or not?
>
> v2…v3: replace the seqlock with a RW semaphore.
>

NACK. That will deadlock. The reason why we use a seqlock there is precisely because we cannot allow ordinary RPC calls to lock out the recovery thread. If the server reboots, then ordinary RPC calls will fail until the recovery thread has had a chance to re-establish the state.
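
A sketch of the deadlock under that scenario (assuming the v3 rwsem):

	/* Ordinary RPC task:                  State manager thread:
	 *
	 * down_read(&sp->so_reclaim_rw);
	 * rpc_call(...)
	 *   server rebooted: the call
	 *   waits for state recovery          down_write(&sp->so_reclaim_rw);
	 *                                       blocks behind the reader;
	 *                                       recovery never runs and the
	 *                                       RPC never completes
	 */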



Subject: Re: [PATCH v3] NFSv4: replace seqcount_t with a rw_semaphore

On 2016-10-31 15:30:02 [+0000], Trond Myklebust wrote:
> > On Oct 31, 2016, at 09:19, Sebastian Andrzej Siewior <[email protected]> wrote:
> > The list_for_each_entry() in nfs4_reclaim_open_state:
> > It seems that this lock protects the ->so_states list among other
> > atomic_t & flags members. At the beginning of the loop we increment
> > ->count, ensuring that the entry is not freed while we use it, and we
> > drop the ->so_lock lock during the loop. After the nfs4_reclaim_locks()
> > invocation we call nfs4_put_open_state() and grab the ->so_lock again.
> > So if we were the last user of this struct and it gets freed, then the
> > following list_next_entry() invocation is a use-after-free. Even if we
> > used list_for_each_entry_safe() there would be no guarantee that the
> > following member is still valid, because it might have been removed by
> > someone invoking nfs4_put_open_state() on it, right?
> > So there is this.
> >
> > However, to address my initial problem I have here a patch :) It uses
> > a rw_semaphore, which allows either one writer at a time or multiple
> > readers. So it should be basically what is happening now plus a tiny
> > tiny tiny lock plus lockdep coverage. I tried to test this myself but I
> > didn't manage to get into this code path at all, so I might be doing
> > something wrong.
> >
> > Could you please check if this patch is working for you and whether my
> > list_for_each_entry() observation is correct or not?
> >
> > v2…v3: replace the seqlock with a RW semaphore.
> >
>
> NACK. That will deadlock. The reason why we use a seqlock there is precisely because we cannot allow ordinary RPC calls to lock out the recovery thread.
Hmmm. So this is getting invoked if I reboot the server? A restart of
nfs-kernel-server is the same thing?
Is the list_for_each_entry() observation I made correct?

> If the server reboots, then ordinary RPC calls will fail until the recovery thread has had a chance to re-establish the state.

This means that the ordinary RPC call won't return and fail but will
wait with the lock held until the recovery thread has done its thing?

Sebastian

2016-10-31 16:11:10

by Trond Myklebust

Subject: Re: [PATCH v3] NFSv4: replace seqcount_t with a rw_semaphore


> On Oct 31, 2016, at 11:56, Sebastian Andrzej Siewior <[email protected]> wrote:
>
> On 2016-10-31 15:30:02 [+0000], Trond Myklebust wrote:
>>> On Oct 31, 2016, at 09:19, Sebastian Andrzej Siewior <[email protected]> wrote:
>>> The list_for_each_entry() in nfs4_reclaim_open_state:
>>> It seems that this lock protects the ->so_states list among other
>>> atomic_t & flags members. At the beginning of the loop we increment
>>> ->count, ensuring that the entry is not freed while we use it, and we
>>> drop the ->so_lock lock during the loop. After the nfs4_reclaim_locks()
>>> invocation we call nfs4_put_open_state() and grab the ->so_lock again.
>>> So if we were the last user of this struct and it gets freed, then the
>>> following list_next_entry() invocation is a use-after-free. Even if we
>>> used list_for_each_entry_safe() there would be no guarantee that the
>>> following member is still valid, because it might have been removed by
>>> someone invoking nfs4_put_open_state() on it, right?
>>> So there is this.
>>>
>>> However, to address my initial problem I have here a patch :) It uses
>>> a rw_semaphore, which allows either one writer at a time or multiple
>>> readers. So it should be basically what is happening now plus a tiny
>>> tiny tiny lock plus lockdep coverage. I tried to test this myself but I
>>> didn't manage to get into this code path at all, so I might be doing
>>> something wrong.
>>>
>>> Could you please check if this patch is working for you and whether my
>>> list_for_each_entry() observation is correct or not?
>>>
>>> v2…v3: replace the seqlock with a RW semaphore.
>>>
>>
>> NACK. That will deadlock. The reason why we use a seqlock there is precisely because we cannot allow ordinary RPC calls to lock out the recovery thread.
> Hmmm. So this is getting invoked if I reboot the server? A restart of
> nfs-kernel-server is the same thing?
> Is the list_for_each_entry() observation I made correct?

Yes, and yes. We can’t rely on the list pointers remaining correct, so we restart the list scan and we use the ops->state_flag_bit to signal whether or not state has been recovered for the entry being scanned.

>
>> If the server reboots, then ordinary RPC calls will fail until the recovery thread has had a chance to re-establish the state.
>
> This means that the ordinary RPC call won't return and fail but wait
> with the lock held until the recovery thread did its thing?

It uses the seqcount_t to figure out if a recovery occurred while an OPEN or CLOSE was being processed. If so, it schedules a new recovery of the stateids in question.

Subject: Re: [PATCH v3] NFSv4: replace seqcount_t with a rw_semaphore

On 2016-10-31 16:11:02 [+0000], Trond Myklebust wrote:
>
> Yes, and yes. We can’t rely on the list pointers remaining correct, so we restart the list scan and we use the ops->state_flag_bit to signal whether or not state has been recovered for the entry being scanned.

but this is tested at the top of the loop, and by then you have already
looked at the list's ->next pointer, which might be invalid.

Sebastian

Subject: [PATCH v4] NFSv4: replace seqcount_t with a seqlock_t

The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
because it maps to preempt_disable() in -RT which I can't have at this
point. So I took a look at the code.
The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
lockdep complained. The whole seqcount thing was introduced in commit
c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
being recovered").
Only one recovery thread runs at a time.
write_seqlock() does not work because it disables preemption, and the
writer side is preemptible (it has to remain so, even though a preempted
writer will leave the readers spinning).

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---

Is lockdep okay in the read path? I would like to have lockdep coverage
on the writer side as well, but for now it is disabled.

fs/nfs/delegation.c |  4 ++--
fs/nfs/nfs4_fs.h    |  2 +-
fs/nfs/nfs4proc.c   |  4 ++--
fs/nfs/nfs4state.c  | 10 ++++------
4 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index dff600ae0d74..d726d2e09353 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -150,11 +150,11 @@ static int nfs_delegation_claim_opens(struct inode *inode,
sp = state->owner;
/* Block nfs4_proc_unlck */
mutex_lock(&sp->so_delegreturn_mutex);
- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ seq = read_seqbegin(&sp->so_reclaim_seqlock);
err = nfs4_open_delegation_recall(ctx, state, stateid, type);
if (!err)
err = nfs_delegation_claim_locks(ctx, state, stateid);
- if (!err && read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
+ if (!err && read_seqretry(&sp->so_reclaim_seqlock, seq))
err = -EAGAIN;
mutex_unlock(&sp->so_delegreturn_mutex);
put_nfs_open_context(ctx);
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 9b3a82abab07..9b5089bf0f2e 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -111,7 +111,7 @@ struct nfs4_state_owner {
unsigned long so_flags;
struct list_head so_states;
struct nfs_seqid_counter so_seqid;
- seqcount_t so_reclaim_seqcount;
+ seqlock_t so_reclaim_seqlock;
struct mutex so_delegreturn_mutex;
};

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 7897826d7c51..9b9d53cd85f9 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2685,7 +2685,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
unsigned int seq;
int ret;

- seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
+ seq = raw_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);

ret = _nfs4_proc_open(opendata);
if (ret != 0)
@@ -2723,7 +2723,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
ctx->state = state;
if (d_inode(dentry) == state->inode) {
nfs_inode_attach_open_context(ctx);
- if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
+ if (read_seqretry(&sp->so_reclaim_seqlock, seq))
nfs4_schedule_stateid_recovery(server, state);
}
out:
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 5f4281ec5f72..400a9d4186de 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -488,7 +488,7 @@ nfs4_alloc_state_owner(struct nfs_server *server,
nfs4_init_seqid_counter(&sp->so_seqid);
atomic_set(&sp->so_count, 1);
INIT_LIST_HEAD(&sp->so_lru);
- seqcount_init(&sp->so_reclaim_seqcount);
+ seqlock_init(&sp->so_reclaim_seqlock);
mutex_init(&sp->so_delegreturn_mutex);
return sp;
}
@@ -1497,8 +1497,8 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
* recovering after a network partition or a reboot from a
* server that doesn't support a grace period.
*/
+ raw_write_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);
spin_lock(&sp->so_lock);
- raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
restart:
list_for_each_entry(state, &sp->so_states, open_states) {
if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
@@ -1566,14 +1566,12 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
spin_lock(&sp->so_lock);
goto restart;
}
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
spin_unlock(&sp->so_lock);
+ raw_write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);
return 0;
out_err:
nfs4_put_open_state(state);
- spin_lock(&sp->so_lock);
- raw_write_seqcount_end(&sp->so_reclaim_seqcount);
- spin_unlock(&sp->so_lock);
+ raw_write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);
return status;
}

--
2.10.2

2016-11-02 17:37:45

by Trond Myklebust

Subject: Re: [PATCH v3] NFSv4: replace seqcount_t with a rw_semaphore


> On Nov 2, 2016, at 13:11, Sebastian Andrzej Siewior <[email protected]> wrote:
>
> On 2016-10-31 16:11:02 [+0000], Trond Myklebust wrote:
>>
>> Yes, and yes. We can’t rely on the list pointers remaining correct, so we restart the list scan and we use the ops->state_flag_bit to signal whether or not state has been recovered for the entry being scanned.
>
> but this is tested at the top of the loop and by then you look at
> lists' ->next pointer which might be invalid.
>

No. We ensure we restart the list scan if we release the spinlock. It’s safe…
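
Schematically, the invariant being described (a simplified sketch of the
loop in nfs4_reclaim_open_state()):

	spin_lock(&sp->so_lock);
restart:
	list_for_each_entry(state, &sp->so_states, open_states) {
		if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
			continue;	/* so_lock held since the last restart,
					 * so ->next is stable here */
		...
		spin_unlock(&sp->so_lock);
		...			/* recover; 'state' pinned via ->count */
		spin_lock(&sp->so_lock);
		goto restart;		/* rescan from the list head instead of
					 * following a possibly stale ->next */
	}
	spin_unlock(&sp->so_lock);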