2012-08-09 13:03:56

by Idan Kedar

[permalink] [raw]
Subject: return layout on error, BUG/deadlock

Hi,

As a result of some experiments, I wanted to see what happens when I
inject an error (hard coded) to the object layout driver. the patch is
at the bottom of this mail. the reason I did this is because when I
inject errors in my modified version of the object layout driver, I
get the same BUG Tigran reported about yesterday:
nfs4proc.c:6252 : BUG_ON(!list_empty(&lo->plh_segs));

In my modified version (based on kernel 3.3), the bug seems to be that
pnfs_ld_write_done calls pnfs_return_layout in the error path, even if
there is in-flight I/O.
I wanted to see if this is a result of my modifications, so I injected
errors to the ORE (on kernel 3.5) and ran Connectathon's basic tests,
and got a deadlock AND that BUG:

[ 112.659066] =================================
[ 112.659066] [ INFO: inconsistent lock state ]
[ 112.659066] 3.5.0-nfsobj+ #35 Not tainted
[ 112.659066] ---------------------------------
[ 112.659066] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[ 112.659066] kworker/0:2/456 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 112.659066] (&(&objlay->lock)->rlock){+.?...}, at:
[<ffffffffa02558fc>] objlayout_encode_layoutreturn+0x6c/0x420
[objlayoutdriver]
[ 112.659066] {IN-SOFTIRQ-W} state was registered at:
[ 112.659066] [<ffffffff810b2bd8>] __lock_acquire+0x5e8/0x1bb0
[ 112.659066] [<ffffffff810b47c2>] lock_acquire+0x92/0x120
[ 112.659066] [<ffffffff81713e16>] _raw_spin_lock+0x36/0x70
[ 112.659066] [<ffffffffa025622d>]
objlayout_iodone.part.1+0x25/0x56 [objlayoutdriver]
[ 112.659066] [<ffffffffa025573b>] objlayout_write_done+0xcb/0xd0
[objlayoutdriver]
[ 112.659066] [<ffffffffa02541c3>] _write_done+0x43/0x60 [objlayoutdriver]
[ 112.659066] [<ffffffffa023a013>] _last_io+0x13/0x20 [libore]
[ 112.659066] [<ffffffffa023a868>] _done_io+0x28/0x30 [libore]
[ 112.659066] [<ffffffffa0229f94>] osd_request_async_done+0xd4/0xf0 [libosd]
[ 112.659066] [<ffffffff812a4a38>] blk_finish_request+0xa8/0x2c0
[ 112.659066] [<ffffffff812a4c9f>] blk_end_bidi_request+0x4f/0x80
[ 112.659066] [<ffffffff812a4d10>] blk_end_request+0x10/0x20
[ 112.659066] [<ffffffff8138c36f>] scsi_io_completion+0xaf/0x670
[ 112.659066] [<ffffffff81382861>] scsi_finish_command+0xd1/0x130
[ 112.659066] [<ffffffff8138c1bf>] scsi_softirq_done+0x13f/0x160
[ 112.659066] [<ffffffff812abb3c>] blk_done_softirq+0x8c/0xa0
[ 112.659066] [<ffffffff81068be8>] __do_softirq+0xc8/0x250
[ 112.659066] [<ffffffff8171696c>] call_softirq+0x1c/0x30
[ 112.659066] [<ffffffff81015785>] do_softirq+0xa5/0xe0
[ 112.659066] [<ffffffff8106893b>] local_bh_enable_ip+0xeb/0x100
[ 112.659066] [<ffffffff81714244>] _raw_spin_unlock_bh+0x34/0x40
[ 112.659066] [<ffffffff815c8abc>] release_sock+0x14c/0x190
[ 112.659066] [<ffffffff81613005>] tcp_sendpage+0xc5/0x6e0
[ 112.659066] [<ffffffff81638703>] inet_sendpage+0xd3/0x120
[ 112.659066] [<ffffffffa0221586>]
iscsi_sw_tcp_pdu_xmit+0x116/0x290 [iscsi_tcp]
[ 112.659066] [<ffffffff814ba930>] iscsi_tcp_task_xmit+0xb0/0x2d0
[ 112.659066] [<ffffffff814354ae>] iscsi_xmit_task+0x5e/0xb0
[ 112.659066] [<ffffffff81437457>] iscsi_xmitworker+0x1f7/0x330
[ 112.659066] [<ffffffff8107d8af>] process_one_work+0x19f/0x530
[ 112.659066] [<ffffffff8107f5d9>] worker_thread+0x159/0x340
[ 112.659066] [<ffffffff810847a7>] kthread+0xb7/0xc0
[ 112.659066] [<ffffffff81716874>] kernel_thread_helper+0x4/0x10
[ 112.659066] irq event stamp: 56183
[ 112.659066] hardirqs last enabled at (56183): [<ffffffff810688e7>]
local_bh_enable_ip+0x97/0x100
[ 112.659066] hardirqs last disabled at (56181): [<ffffffff81068894>]
local_bh_enable_ip+0x44/0x100
[ 112.659066] softirqs last enabled at (56182): [<ffffffffa00344d0>]
xprt_prepare_transmit+0x70/0x90 [sunrpc]
[ 112.659066] softirqs last disabled at (56180): [<ffffffff81713ee8>]
_raw_spin_lock_bh+0x18/0x70
[ 112.659066]
[ 112.659066] other info that might help us debug this:
[ 112.659066] Possible unsafe locking scenario:
[ 112.659066]
[ 112.659066] CPU0
[ 112.659066] ----
[ 112.659066] lock(&(&objlay->lock)->rlock);
[ 112.659066] <Interrupt>
[ 112.659066] lock(&(&objlay->lock)->rlock);
[ 112.659066]
[ 112.659066] *** DEADLOCK ***
[ 112.659066]
[ 112.659066] 2 locks held by kworker/0:2/456:
[ 112.659066] #0: (events){.+.+.+}, at: [<ffffffff8107d84c>]
process_one_work+0x13c/0x530
[ 112.659066] #1: ((&wdata->task.u.tk_work)){+.+.+.}, at:
[<ffffffff8107d84c>] process_one_work+0x13c/0x530
[ 112.659066]
[ 112.659066] stack backtrace:
[ 112.659066] Pid: 456, comm: kworker/0:2 Not tainted 3.5.0-nfsobj+ #35
[ 112.659066] Call Trace:
[ 112.659066] [<ffffffff81703f08>] print_usage_bug+0x1f5/0x206
[ 112.659066] [<ffffffff8102242f>] ? save_stack_trace+0x2f/0x50
[ 112.659066] [<ffffffff810b2595>] mark_lock+0x295/0x2f0
[ 112.659066] [<ffffffff810b1af0>] ? check_usage_forwards+0x140/0x140
[ 112.659066] [<ffffffff810b2c3a>] __lock_acquire+0x64a/0x1bb0
[ 112.659066] [<ffffffff810aef9d>] ? trace_hardirqs_off+0xd/0x10
[ 112.659066] [<ffffffff8109852f>] ? local_clock+0x6f/0x80
[ 112.659066] [<ffffffff8101baf3>] ? native_sched_clock+0x13/0x80
[ 112.659066] [<ffffffff8101bb69>] ? sched_clock+0x9/0x10
[ 112.659066] [<ffffffff810982c5>] ? sched_clock_local+0x25/0x90
[ 112.659066] [<ffffffff81098458>] ? sched_clock_cpu+0xa8/0x110
[ 112.659066] [<ffffffffa02558fc>] ?
objlayout_encode_layoutreturn+0x6c/0x420 [objlayoutdriver]
[ 112.659066] [<ffffffff810b47c2>] lock_acquire+0x92/0x120
[ 112.659066] [<ffffffffa02558fc>] ?
objlayout_encode_layoutreturn+0x6c/0x420 [objlayoutdriver]
[ 112.659066] [<ffffffff8101bb69>] ? sched_clock+0x9/0x10
[ 112.659066] [<ffffffffa02551a0>] ?
__copy_nfsS_and_zero_terminate+0x90/0x90 [objlayoutdriver]
[ 112.659066] [<ffffffff81713e16>] _raw_spin_lock+0x36/0x70
[ 112.659066] [<ffffffffa02558fc>] ?
objlayout_encode_layoutreturn+0x6c/0x420 [objlayoutdriver]
[ 112.659066] [<ffffffff8101bb69>] ? sched_clock+0x9/0x10
[ 112.659066] [<ffffffffa02558fc>]
objlayout_encode_layoutreturn+0x6c/0x420 [objlayoutdriver]
[ 112.659066] [<ffffffff81098458>] ? sched_clock_cpu+0xa8/0x110
[ 112.659066] [<ffffffff810aef9d>] ? trace_hardirqs_off+0xd/0x10
[ 112.659066] [<ffffffff8109852f>] ? local_clock+0x6f/0x80
[ 112.659066] [<ffffffffa016541a>] ?
nfs4_xdr_enc_layoutreturn+0x11a/0x170 [nfs]
[ 112.659066] [<ffffffffa0045797>] ?
xdr_encode_opaque_fixed+0x47/0x80 [sunrpc]
[ 112.659066] [<ffffffffa02551a0>] ?
__copy_nfsS_and_zero_terminate+0x90/0x90 [objlayoutdriver]
[ 112.659066] [<ffffffffa0165449>] nfs4_xdr_enc_layoutreturn+0x149/0x170 [nfs]
[ 112.659066] [<ffffffff810688e7>] ? local_bh_enable_ip+0x97/0x100
[ 112.659066] [<ffffffffa00344d0>] ? xprt_prepare_transmit+0x70/0x90 [sunrpc]
[ 112.659066] [<ffffffffa0165300>] ? nfs4_xdr_enc_commit+0xe0/0xe0 [nfs]
[ 112.659066] [<ffffffffa003bc05>] rpcauth_wrap_req+0x65/0x70 [sunrpc]
[ 112.659066] [<ffffffffa0030dae>] call_transmit+0x18e/0x260 [sunrpc]
[ 112.659066] [<ffffffffa003a360>] __rpc_execute+0x70/0x280 [sunrpc]
[ 112.659066] [<ffffffffa0030c20>] ? call_connect+0x40/0x40 [sunrpc]
[ 112.659066] [<ffffffffa0030c20>] ? call_connect+0x40/0x40 [sunrpc]
[ 112.659066] [<ffffffff81084e9e>] ? wake_up_bit+0x2e/0x40
[ 112.659066] [<ffffffffa02551a0>] ?
__copy_nfsS_and_zero_terminate+0x90/0x90 [objlayoutdriver]
[ 112.659066] [<ffffffffa003ab4f>] rpc_execute+0x4f/0xb0 [sunrpc]
[ 112.659066] [<ffffffffa02551a0>] ?
__copy_nfsS_and_zero_terminate+0x90/0x90 [objlayoutdriver]
[ 112.659066] [<ffffffffa00327b5>] rpc_run_task+0x75/0x90 [sunrpc]
[ 112.659066] [<ffffffffa0161c96>] nfs4_proc_layoutreturn+0x86/0xb0 [nfs]
[ 112.659066] [<ffffffffa0174594>] _pnfs_return_layout+0x114/0x180 [nfs]
[ 112.659066] [<ffffffffa0174737>] pnfs_ld_write_done+0x67/0xe0 [nfs]
[ 112.659066] [<ffffffffa02551b5>] _rpc_write_complete+0x15/0x20
[objlayoutdriver]
[ 112.659066] [<ffffffff8107d8af>] process_one_work+0x19f/0x530
[ 112.659066] [<ffffffff8107d84c>] ? process_one_work+0x13c/0x530
[ 112.659066] [<ffffffff8107f5d9>] worker_thread+0x159/0x340
[ 112.659066] [<ffffffff8107f480>] ? manage_workers+0x230/0x230
[ 112.659066] [<ffffffff810847a7>] kthread+0xb7/0xc0
[ 112.659066] [<ffffffff810b5295>] ? trace_hardirqs_on_caller+0x105/0x190
[ 112.659066] [<ffffffff81716874>] kernel_thread_helper+0x4/0x10
[ 112.659066] [<ffffffff81714bb0>] ? retint_restore_args+0x13/0x13
[ 112.659066] [<ffffffff810846f0>] ? __init_kthread_worker+0x70/0x70
[ 112.659066] [<ffffffff81716870>] ? gs_change+0x13/0x13
[ 112.787051] ------------[ cut here ]------------
[ 112.787799] kernel BUG at fs/nfs/nfs4proc.c:6252!
[ 112.787799] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 112.787799] CPU 0
[ 112.787799] Modules linked in:[ 112.787799]
nfs_layout_nfsv41_files objlayoutdriver exofs libore osd libosd
iscsi_tcp netconsole nfs nfsd fscache lockd auth_rpcgss nfs_acl sunrpc
e1000 rtc_cmos serio_raw [last unloaded: scsi_wait_scan]

[ 112.787799] Pid: 456, comm: kworker/0:2 Not tainted 3.5.0-nfsobj+
#35 innotek GmbH VirtualBox
[ 112.787799] RIP: 0010:[<ffffffffa015a245>] [<ffffffffa015a245>]
nfs4_layoutreturn_done+0xd5/0xe0 [nfs]
[ 112.787799] RSP: 0018:ffff88003c7bdb50 EFLAGS: 00010206
[ 112.787799] RAX: ffff880034480b90 RBX: ffff88003cbb5860 RCX: ffff88003add4ca8
[ 112.787799] RDX: 0000000000000000 RSI: ffff8800354b1288 RDI: 0000000000000246
[ 112.787799] RBP: ffff88003c7bdb70 R08: 0000000000000002 R09: 0000000000000000
[ 112.787799] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003a782240
[ 112.787799] R13: ffff880034480b68 R14: ffff88003a787138 R15: ffffffffa02551a0
[ 112.787799] FS: 0000000000000000(0000) GS:ffff88003e200000(0000)
knlGS:0000000000000000
[ 112.787799] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 112.787799] CR2: 0000003f33c9b640 CR3: 0000000034070000 CR4: 00000000000006f0
[ 112.787799] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 112.787799] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 112.787799] Process kworker/0:2 (pid: 456, threadinfo
ffff88003c7bc000, task ffff88003add4620)
[ 112.787799] Stack:
[ 112.787799] ffff88003a787138 ffff88003a782240 ffff8800347c4858
ffff88003c7bdcf8
[ 112.787799] ffff88003c7bdb90 ffffffffa00391fc ffff88003a787138
ffff88003a782240
[ 112.787799] ffff88003c7bdc00 ffffffffa003a360 000000003c7bdbc0
ffff88003a7822b0
[ 112.787799] Call Trace:
[ 112.787799] [<ffffffffa00391fc>] rpc_exit_task+0x2c/0x90 [sunrpc]
[ 112.787799] [<ffffffffa003a360>] __rpc_execute+0x70/0x280 [sunrpc]
[ 112.787799] [<ffffffffa00391d0>] ? rpc_async_release+0x20/0x20 [sunrpc]
[ 112.787799] [<ffffffffa00391d0>] ? rpc_async_release+0x20/0x20 [sunrpc]
[ 112.787799] [<ffffffff81084e9e>] ? wake_up_bit+0x2e/0x40
[ 112.787799] [<ffffffffa02551a0>] ?
__copy_nfsS_and_zero_terminate+0x90/0x90 [objlayoutdriver]
[ 112.787799] [<ffffffffa003ab4f>] rpc_execute+0x4f/0xb0 [sunrpc]
[ 112.787799] [<ffffffffa02551a0>] ?
__copy_nfsS_and_zero_terminate+0x90/0x90 [objlayoutdriver]
[ 112.787799] [<ffffffffa00327b5>] rpc_run_task+0x75/0x90 [sunrpc]
[ 112.787799] [<ffffffffa0161c96>] nfs4_proc_layoutreturn+0x86/0xb0 [nfs]
[ 112.787799] [<ffffffffa0174594>] _pnfs_return_layout+0x114/0x180 [nfs]
[ 112.787799] [<ffffffffa0174737>] pnfs_ld_write_done+0x67/0xe0 [nfs]
[ 112.787799] [<ffffffffa02551b5>] _rpc_write_complete+0x15/0x20
[objlayoutdriver]
[ 112.787799] [<ffffffff8107d8af>] process_one_work+0x19f/0x530
[ 112.787799] [<ffffffff8107d84c>] ? process_one_work+0x13c/0x530
[ 112.787799] [<ffffffff8107f5d9>] worker_thread+0x159/0x340
[ 112.787799] [<ffffffff8107f480>] ? manage_workers+0x230/0x230
[ 112.787799] [<ffffffff810847a7>] kthread+0xb7/0xc0
[ 112.787799] [<ffffffff810b5295>] ? trace_hardirqs_on_caller+0x105/0x190
[ 112.787799] [<ffffffff81716874>] kernel_thread_helper+0x4/0x10
[ 112.787799] [<ffffffff81714bb0>] ? retint_restore_args+0x13/0x13
[ 112.787799] [<ffffffff810846f0>] ? __init_kthread_worker+0x70/0x70
[ 112.787799] [<ffffffff81716870>] ? gs_change+0x13/0x13
[ 112.787799] Code: c3 0f 1f 44 00 00 48 8d 73 64 ba 01 00 00 00 4c
89 ef e8 ff ab 01 00 eb c8 0f 1f 44 00 00 4c 89 e7 e8 e0 66 ed ff e9
5a ff ff ff <0f> 0b 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48 83 ec
08 66
[ 112.787799] RIP [<ffffffffa015a245>] nfs4_layoutreturn_done+0xd5/0xe0 [nfs]
[ 112.787799] RSP <ffff88003c7bdb50>
[ 112.872492] ---[ end trace 114822acc0612b2a ]---


My setup:

Data server is osd-osc emulator
MDS is nfsd-exofs on kernel 3.3 (Benny's kernel)

Is there any fix for this? Or is it not an issue and my injection is
just wrong and stupid?
I've seen some discussions on this in the mailing list, and I saw the
patch that removes the BUG, but I'm not sure if this is the right
thing to do for object layout.

---
fs/exofs/ore.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 24a49d4..e9d7b45 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -426,9 +426,14 @@ int ore_check_io(struct ore_io_state *ios,
ore_on_dev_error on_dev_error)
if (unlikely(!or))
continue;

- ret = osd_req_decode_sense(or, &osi);
- if (likely(!ret))
- continue;
+ if (jiffies % 6 == 0) {
+ osi.osd_err_pri = OSD_ERR_PRI_EIO;
+ ret = -EIO;
+ } else {
+ ret = osd_req_decode_sense(or, &osi);
+ if (likely(!ret))
+ continue;
+ }

if (OSD_ERR_PRI_CLEAR_PAGES == osi.osd_err_pri) {
/* start read offset passed endof file */
--
1.7.6.5


--
idank


2012-08-09 16:34:26

by Peng Tao

[permalink] [raw]
Subject: Re: return layout on error, BUG/deadlock

On Fri, Aug 10, 2012 at 12:06 AM, Myklebust, Trond
<[email protected]> wrote:
> On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:
>> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond
>> <[email protected]> wrote:
>> >> -----Original Message-----
>> >> From: [email protected] [mailto:linux-nfs-
>> >> [email protected]] On Behalf Of Idan Kedar
>> >> Sent: Thursday, August 09, 2012 9:03 AM
>> >> To: Boaz Harrosh; NFS list
>> >> Cc: Benny Halevy
>> >> Subject: return layout on error, BUG/deadlock
>> >>
>> >> Hi,
>> >>
>> >> As a result of some experiments, I wanted to see what happens when I
>> >> inject an error (hard coded) to the object layout driver. the patch is at the
>> >> bottom of this mail. the reason I did this is because when I inject errors in my
>> >> modified version of the object layout driver, I get the same BUG Tigran
>> >> reported about yesterday:
>> >> nfs4proc.c:6252 : BUG_ON(!list_empty(&lo->plh_segs));
>> >>
>> >> In my modified version (based on kernel 3.3), the bug seems to be that
>> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
>> >> is in-flight I/O.
>> >
>> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.
>>
>> to what change are you referring?
>
> As I stated in the changelog of the patch that I sent to the list
> yesterday, the behaviour is due to commit 0a57cdac3f.
>
>> >
>> > See the changelog in the patch that I sent to the list yesterday.
>> >
>>
>> I saw that, and if I'm not mistaken these races apply to object layout
>> as well, and in any case they apply in my case. However, it is not
>> easy to mess around with LAYOUTRETURN in object layout, and there have
>> been several discussions on the issue. In one of these discussions
>> Benny clarified that the object layout client must wait for all
>> in-flight I/O to end.
>
> If the problem is that the DS is failing to respond, how does the client
> know that the in-flight I/O has ended?
>
>> So for file layout it probably makes sense, but object layout (and if
>> I understand correctly, block layout as well) something else needs to
>> be done. I thought about sync wait when returning the layout on error,
>> but according to Boaz it will cause deadlocks (Boaz - can you please
>> elaborate?).
>
> The object layoutreturn has the ability to pass a timeout error value to
> the MDS precisely in order to allow the latter to deal with this kind of
> issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.
>
> The block layout is adding the same ability to layoutreturn in NFSv4.2
> (see draft-ietf-nfsv4-minorversion2-13.txt) via the struct
> layoutreturn_device_error4, so presumably they too have a plan for
> dealing with this kind of issue.
It is one thing to tell MDS that there is DS access error by sending
layoutreturn, and it is another thing to return a layout even if there
is overlapping in-flight DS IO...

I certainly agree that client is entitled to return layout to inform
MDS about DS errors and also avoid possible cb_layoutrecall. But it is
just an optimization and should only be done when there is no
in-flight IO (at least for block layout) IMHO.

Thanks,
Tao
>
>> And come to think of it, nfs4_proc_setattr also returns the layout
>> when there may be I/O in-flight (correct me if i'm wrong). So I guess
>> pnfs_return_layout should somehow solve this by itself by saying "if
>> this is fencing (a flag which will be set for file layout only), go
>> ahead, otherwise make the layout as 'needs to be returned' and when
>> the lseg lists gets empty return the layout".
>
> The only layout type that sets the PNFS_LAYOUTRET_ON_SETATTR flag is
> objects, so that question needs to be directed to Boaz.
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>

2012-08-09 15:49:48

by Idan Kedar

[permalink] [raw]
Subject: Re: return layout on error, BUG/deadlock

On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond
<[email protected]> wrote:
>> -----Original Message-----
>> From: [email protected] [mailto:linux-nfs-
>> [email protected]] On Behalf Of Idan Kedar
>> Sent: Thursday, August 09, 2012 9:03 AM
>> To: Boaz Harrosh; NFS list
>> Cc: Benny Halevy
>> Subject: return layout on error, BUG/deadlock
>>
>> Hi,
>>
>> As a result of some experiments, I wanted to see what happens when I
>> inject an error (hard coded) to the object layout driver. the patch is at the
>> bottom of this mail. the reason I did this is because when I inject errors in my
>> modified version of the object layout driver, I get the same BUG Tigran
>> reported about yesterday:
>> nfs4proc.c:6252 : BUG_ON(!list_empty(&lo->plh_segs));
>>
>> In my modified version (based on kernel 3.3), the bug seems to be that
>> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
>> is in-flight I/O.
>
> That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.

to what change are you referring?

>
> See the changelog in the patch that I sent to the list yesterday.
>

I saw that, and if I'm not mistaken these races apply to object layout
as well, and in any case they apply in my case. However, it is not
easy to mess around with LAYOUTRETURN in object layout, and there have
been several discussions on the issue. In one of these discussions
Benny clarified that the object layout client must wait for all
in-flight I/O to end.
So for file layout it probably makes sense, but object layout (and if
I understand correctly, block layout as well) something else needs to
be done. I thought about sync wait when returning the layout on error,
but according to Boaz it will cause deadlocks (Boaz - can you please
elaborate?).
And come to think of it, nfs4_proc_setattr also returns the layout
when there may be I/O in-flight (correct me if i'm wrong). So I guess
pnfs_return_layout should somehow solve this by itself by saying "if
this is fencing (a flag which will be set for file layout only), go
ahead, otherwise make the layout as 'needs to be returned' and when
the lseg lists gets empty return the layout".
Comments?

> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>
>

--
idank

2012-08-09 23:05:37

by Idan Kedar

[permalink] [raw]
Subject: Re: return layout on error, BUG/deadlock

On Thu, Aug 9, 2012 at 10:12 PM, Myklebust, Trond
<[email protected]> wrote:
> On Thu, 2012-08-09 at 19:54 +0300, Idan Kedar wrote:
>> On Thu, Aug 9, 2012 at 7:06 PM, Myklebust, Trond
>> <[email protected]> wrote:
>> > On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:
>> >> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond
>> >> <[email protected]> wrote:
>> >> >> -----Original Message-----
>> >> >> From: [email protected] [mailto:linux-nfs-
>> >> >> [email protected]] On Behalf Of Idan Kedar
>> >> >> Sent: Thursday, August 09, 2012 9:03 AM
>> >> >> To: Boaz Harrosh; NFS list
>> >> >> Cc: Benny Halevy
>> >> >> Subject: return layout on error, BUG/deadlock
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> As a result of some experiments, I wanted to see what happens when I
>> >> >> inject an error (hard coded) to the object layout driver. the patch is at the
>> >> >> bottom of this mail. the reason I did this is because when I inject errors in my
>> >> >> modified version of the object layout driver, I get the same BUG Tigran
>> >> >> reported about yesterday:
>> >> >> nfs4proc.c:6252 : BUG_ON(!list_empty(&lo->plh_segs));
>> >> >>
>> >> >> In my modified version (based on kernel 3.3), the bug seems to be that
>> >> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
>> >> >> is in-flight I/O.
>> >> >
>> >> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.
>> >>
>> >> to what change are you referring?
>> >
>> > As I stated in the changelog of the patch that I sent to the list
>> > yesterday, the behaviour is due to commit 0a57cdac3f.
>> >
>>
>> I'm not sure I'm following. I was talking about the state of the code
>> v3.5, not to any specific patch/change. since the client yelled
>> "BUG!!!" and crashed, there is some bug involved, either because the
>> BUG() is correct or because the presence of the BUG() is the bug
>> itself.
>>
>> >> >
>> >> > See the changelog in the patch that I sent to the list yesterday.
>> >> >
>> >>
>> >> I saw that, and if I'm not mistaken these races apply to object layout
>> >> as well, and in any case they apply in my case. However, it is not
>> >> easy to mess around with LAYOUTRETURN in object layout, and there have
>> >> been several discussions on the issue. In one of these discussions
>> >> Benny clarified that the object layout client must wait for all
>> >> in-flight I/O to end.
>> >
>> > If the problem is that the DS is failing to respond, how does the client
>> > know that the in-flight I/O has ended?
>> >
>>
>> after the last rpc_call_done is called there is no I/O from the
>> client's perspective an the layout can be safely returned, after which
>> I/O can be done through the MDS.
>
> No. You are confused.
>
> Being in rpc_call_done just tells you that the client thinks it is done.
> It tells you nothing about what the DS is doing. It may just slooooowly
> be processing the WRITE calls for all you know.
>

That's what I meant by "from the client's perspective". At this point,
the client can say "I'm not *using* this layout anymore" because even
if some I/O on its way to the DS originates in the client, the client
can't do anything about it. The fate of this I/O is meaningless to the
client at that point. This is the best the client can do to help the
MDS preserve integrity - keep the layout as long as it has information
about its I/O.

> Unless you get a response from the DS saying "I'm done", or you have
> some other mechanism to guarantee that the DS is done when you time out,
> then you have no guarantee that it is safe to send to MDS.
>

Exactly. What if I don't have such a mechanism? My options are:
* Ask the MDS to somehow fence in-flight I/O and resend through the MDS.
* Return an error to the user.
What if the server does not support fencing? In that case we shouldn't
send through MDS while there is in-flight I/O. Do we want to assume
fencing is implemented? And if fencing isn't implemented and we return
an error to the user - how do we know when it is safe to perform any
I/O again?
And even if fencing is implemented, I don't think the client should
rely on it, because the whole idea beind pNFS is a smart client that
offloads what it can from the servers.

>> Is there I/O pending from the DS perspective? maybe, but all you can
>> do is send I/O via MDS and hope it will not race. fencing doesn't
>> really save you, it just improves your odds where applicable, i.e.
>> file layout.

One correction to what I said: Fencing does save you, if it's
implemented. LAYOUTRETURN doesn't save you (when fencing is not
implemented).

>
> Actually, when reading the iSCSI spec, it implies that there is a known
> session timeout that is negotiated between the initiator and target.
> After the session expires, you are guaranteed that no further commands
> are running on the DS.
>
> The iFCP protocol attaches a timeout value to each frame, and is
> supposed to enforce that timeout.
>
> So at least those protocols should have a deterministic behaviour that
> can be used by the client.
>
>> >> So for file layout it probably makes sense, but object layout (and if
>> >> I understand correctly, block layout as well) something else needs to
>> >> be done. I thought about sync wait when returning the layout on error,
>> >> but according to Boaz it will cause deadlocks (Boaz - can you please
>> >> elaborate?).
>> >
>> > The object layoutreturn has the ability to pass a timeout error value to
>> > the MDS precisely in order to allow the latter to deal with this kind of
>> > issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.
>> >
>> > The block layout is adding the same ability to layoutreturn in NFSv4.2
>> > (see draft-ietf-nfsv4-minorversion2-13.txt) via the struct
>> > layoutreturn_device_error4, so presumably they too have a plan for
>> > dealing with this kind of issue.
>> >

Then on layoutreturn we could call a new layout driver method, end_io
or something returns only when there is no danger of race (either by
waiting for in-flight I/O to complete/timeout/error out or by fencing)
and only then return the layout. This works for me. I don't know if it
works for Boaz though.

>>
>> I'll wait for Boaz (or someone else) to explain what exactly is "this
>> kind of issue". I still don't know why sync wait would deadlock.
>>
>> >> And come to think of it, nfs4_proc_setattr also returns the layout
>> >> when there may be I/O in-flight (correct me if i'm wrong). So I guess
>> >> pnfs_return_layout should somehow solve this by itself by saying "if
>> >> this is fencing (a flag which will be set for file layout only), go
>> >> ahead, otherwise make the layout as 'needs to be returned' and when
>> >> the lseg lists gets empty return the layout".
>> >
>> > The only layout type that sets the PNFS_LAYOUTRET_ON_SETATTR flag is
>> > objects, so that question needs to be directed to Boaz.
>> >
>>
>> Yes, this entire thread is mainly directed to Boaz... And IIRC he did
>> want to implement something of that sort, and this thread was
>> basically for asking him "did you?".
>>
>> > --
>> > Trond Myklebust
>> > Linux NFS client maintainer
>> >
>> > NetApp
>> > [email protected]
>> > http://www.netapp.com
>> >
>>
>>
>>
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>



--
idank

2012-08-09 16:37:30

by Myklebust, Trond

[permalink] [raw]
Subject: Re: return layout on error, BUG/deadlock

T24gRnJpLCAyMDEyLTA4LTEwIGF0IDAwOjM0ICswODAwLCBQZW5nIFRhbyB3cm90ZToNCj4gT24g
RnJpLCBBdWcgMTAsIDIwMTIgYXQgMTI6MDYgQU0sIE15a2xlYnVzdCwgVHJvbmQNCj4gPFRyb25k
Lk15a2xlYnVzdEBuZXRhcHAuY29tPiB3cm90ZToNCj4gPiBPbiBUaHUsIDIwMTItMDgtMDkgYXQg
MTg6NDkgKzAzMDAsIElkYW4gS2VkYXIgd3JvdGU6DQo+ID4+IE9uIFRodSwgQXVnIDksIDIwMTIg
YXQgNTowNSBQTSwgTXlrbGVidXN0LCBUcm9uZA0KPiA+PiA8VHJvbmQuTXlrbGVidXN0QG5ldGFw
cC5jb20+IHdyb3RlOg0KPiA+PiA+PiAtLS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0tLQ0KPiA+PiA+
PiBGcm9tOiBsaW51eC1uZnMtb3duZXJAdmdlci5rZXJuZWwub3JnIFttYWlsdG86bGludXgtbmZz
LQ0KPiA+PiA+PiBvd25lckB2Z2VyLmtlcm5lbC5vcmddIE9uIEJlaGFsZiBPZiBJZGFuIEtlZGFy
DQo+ID4+ID4+IFNlbnQ6IFRodXJzZGF5LCBBdWd1c3QgMDksIDIwMTIgOTowMyBBTQ0KPiA+PiA+
PiBUbzogQm9heiBIYXJyb3NoOyBORlMgbGlzdA0KPiA+PiA+PiBDYzogQmVubnkgSGFsZXZ5DQo+
ID4+ID4+IFN1YmplY3Q6IHJldHVybiBsYXlvdXQgb24gZXJyb3IsIEJVRy9kZWFkbG9jaw0KPiA+
PiA+Pg0KPiA+PiA+PiBIaSwNCj4gPj4gPj4NCj4gPj4gPj4gQXMgYSByZXN1bHQgb2Ygc29tZSBl
eHBlcmltZW50cywgSSB3YW50ZWQgdG8gc2VlIHdoYXQgaGFwcGVucyB3aGVuIEkNCj4gPj4gPj4g
aW5qZWN0IGFuIGVycm9yIChoYXJkIGNvZGVkKSB0byB0aGUgb2JqZWN0IGxheW91dCBkcml2ZXIu
IHRoZSBwYXRjaCBpcyBhdCB0aGUNCj4gPj4gPj4gYm90dG9tIG9mIHRoaXMgbWFpbC4gdGhlIHJl
YXNvbiBJIGRpZCB0aGlzIGlzIGJlY2F1c2Ugd2hlbiBJIGluamVjdCBlcnJvcnMgaW4gbXkNCj4g
Pj4gPj4gbW9kaWZpZWQgdmVyc2lvbiBvZiB0aGUgb2JqZWN0IGxheW91dCBkcml2ZXIsIEkgZ2V0
IHRoZSBzYW1lIEJVRyBUaWdyYW4NCj4gPj4gPj4gcmVwb3J0ZWQgYWJvdXQgeWVzdGVyZGF5Og0K
PiA+PiA+PiBuZnM0cHJvYy5jOjYyNTIgOiAgIEJVR19PTighbGlzdF9lbXB0eSgmbG8tPnBsaF9z
ZWdzKSk7DQo+ID4+ID4+DQo+ID4+ID4+IEluIG15IG1vZGlmaWVkIHZlcnNpb24gKGJhc2VkIG9u
IGtlcm5lbCAzLjMpLCB0aGUgYnVnIHNlZW1zIHRvIGJlIHRoYXQNCj4gPj4gPj4gcG5mc19sZF93
cml0ZV9kb25lIGNhbGxzIHBuZnNfcmV0dXJuX2xheW91dCBpbiB0aGUgZXJyb3IgcGF0aCwgZXZl
biBpZiB0aGVyZQ0KPiA+PiA+PiBpcyBpbi1mbGlnaHQgSS9PLg0KPiA+PiA+DQo+ID4+ID4gVGhh
dCBpcyBub3QgYSBidWcuIEl0IGlzIGFuIGludGVudGlvbmFsIGNoYW5nZSBpbiBvcmRlciB0byBh
bGxvdyB0aGUgTURTIHRvIGZlbmNlIG9mZiB0aGUgb3V0c3RhbmRpbmcgd3JpdGVzIChpZiBpdCBj
YW4gZG8gc28pIGJlZm9yZSB3ZSByZXRyYW5zbWl0IHRoZW0gYXMgd3JpdGUtdGhyb3VnaC1NRFMu
IE90aGVyd2lzZSwgeW91IHJpc2sgcmFjZXMgYmV0d2VlbiB0aGUgb3V0c3RhbmRpbmcgd3JpdGVz
LXRvLURTIGFuZCB0aGUgbmV3IHdyaXRlcy10aHJvdWdoLU1EUy4NCj4gPj4NCj4gPj4gdG8gd2hh
dCBjaGFuZ2UgYXJlIHlvdSByZWZlcnJpbmc/DQo+ID4NCj4gPiBBcyBJIHN0YXRlZCBpbiB0aGUg
Y2hhbmdlbG9nIG9mIHRoZSBwYXRjaCB0aGF0IEkgc2VudCB0byB0aGUgbGlzdA0KPiA+IHllc3Rl
cmRheSwgdGhlIGJlaGF2aW91ciBpcyBkdWUgdG8gY29tbWl0IDBhNTdjZGFjM2YuDQo+ID4NCj4g
Pj4gPg0KPiA+PiA+IFNlZSB0aGUgY2hhbmdlbG9nIGluIHRoZSBwYXRjaCB0aGF0IEkgc2VudCB0
byB0aGUgbGlzdCB5ZXN0ZXJkYXkuDQo+ID4+ID4NCj4gPj4NCj4gPj4gSSBzYXcgdGhhdCwgYW5k
IGlmIEknbSBub3QgbWlzdGFrZW4gdGhlc2UgcmFjZXMgYXBwbHkgdG8gb2JqZWN0IGxheW91dA0K
PiA+PiBhcyB3ZWxsLCBhbmQgaW4gYW55IGNhc2UgdGhleSBhcHBseSBpbiBteSBjYXNlLiBIb3dl
dmVyLCBpdCBpcyBub3QNCj4gPj4gZWFzeSB0byBtZXNzIGFyb3VuZCB3aXRoIExBWU9VVFJFVFVS
TiBpbiBvYmplY3QgbGF5b3V0LCBhbmQgdGhlcmUgaGF2ZQ0KPiA+PiBiZWVuIHNldmVyYWwgZGlz
Y3Vzc2lvbnMgb24gdGhlIGlzc3VlLiBJbiBvbmUgb2YgdGhlc2UgZGlzY3Vzc2lvbnMNCj4gPj4g
QmVubnkgY2xhcmlmaWVkIHRoYXQgdGhlIG9iamVjdCBsYXlvdXQgY2xpZW50IG11c3Qgd2FpdCBm
b3IgYWxsDQo+ID4+IGluLWZsaWdodCBJL08gdG8gZW5kLg0KPiA+DQo+ID4gSWYgdGhlIHByb2Js
ZW0gaXMgdGhhdCB0aGUgRFMgaXMgZmFpbGluZyB0byByZXNwb25kLCBob3cgZG9lcyB0aGUgY2xp
ZW50DQo+ID4ga25vdyB0aGF0IHRoZSBpbi1mbGlnaHQgSS9PIGhhcyBlbmRlZD8NCj4gPg0KPiA+
PiBTbyBmb3IgZmlsZSBsYXlvdXQgaXQgcHJvYmFibHkgbWFrZXMgc2Vuc2UsIGJ1dCBvYmplY3Qg
bGF5b3V0IChhbmQgaWYNCj4gPj4gSSB1bmRlcnN0YW5kIGNvcnJlY3RseSwgYmxvY2sgbGF5b3V0
IGFzIHdlbGwpIHNvbWV0aGluZyBlbHNlIG5lZWRzIHRvDQo+ID4+IGJlIGRvbmUuIEkgdGhvdWdo
dCBhYm91dCBzeW5jIHdhaXQgd2hlbiByZXR1cm5pbmcgdGhlIGxheW91dCBvbiBlcnJvciwNCj4g
Pj4gYnV0IGFjY29yZGluZyB0byBCb2F6IGl0IHdpbGwgY2F1c2UgZGVhZGxvY2tzIChCb2F6IC0g
Y2FuIHlvdSBwbGVhc2UNCj4gPj4gZWxhYm9yYXRlPykuDQo+ID4NCj4gPiBUaGUgb2JqZWN0IGxh
eW91dHJldHVybiBoYXMgdGhlIGFiaWxpdHkgdG8gcGFzcyBhIHRpbWVvdXQgZXJyb3IgdmFsdWUg
dG8NCj4gPiB0aGUgTURTIHByZWNpc2VseSBpbiBvcmRlciB0byBhbGxvdyB0aGUgbGF0dGVyIHRv
IGRlYWwgd2l0aCB0aGlzIGtpbmQgb2YNCj4gPiBpc3N1ZS4gU2VlIHRoZSBkZXNjcmlwdGlvbiBv
ZiBzdHJ1Y3QgcG5mc19vc2RfaW9lcnI0IGluIHJmYzU2NjQuDQo+ID4NCj4gPiBUaGUgYmxvY2sg
bGF5b3V0IGlzIGFkZGluZyB0aGUgc2FtZSBhYmlsaXR5IHRvIGxheW91dHJldHVybiBpbiBORlN2
NC4yDQo+ID4gKHNlZSBkcmFmdC1pZXRmLW5mc3Y0LW1pbm9ydmVyc2lvbjItMTMudHh0KSB2aWEg
dGhlIHN0cnVjdA0KPiA+IGxheW91dHJldHVybl9kZXZpY2VfZXJyb3I0LCBzbyBwcmVzdW1hYmx5
IHRoZXkgdG9vIGhhdmUgYSBwbGFuIGZvcg0KPiA+IGRlYWxpbmcgd2l0aCB0aGlzIGtpbmQgb2Yg
aXNzdWUuDQo+IEl0IGlzIG9uZSB0aGluZyB0byB0ZWxsIE1EUyB0aGF0IHRoZXJlIGlzIERTIGFj
Y2VzcyBlcnJvciBieSBzZW5kaW5nDQo+IGxheW91dHJldHVybiwgYW5kIGl0IGlzIGFub3RoZXIg
dGhpbmcgdG8gcmV0dXJuIGEgbGF5b3V0IGV2ZW4gaWYgdGhlcmUNCj4gaXMgb3ZlcmxhcHBpbmcg
aW4tZmxpZ2h0IERTIElPLi4uDQo+IA0KPiBJIGNlcnRhaW5seSBhZ3JlZSB0aGF0IGNsaWVudCBp
cyBlbnRpdGxlZCB0byByZXR1cm4gbGF5b3V0IHRvIGluZm9ybQ0KPiBNRFMgYWJvdXQgRFMgZXJy
b3JzIGFuZCBhbHNvIGF2b2lkIHBvc3NpYmxlIGNiX2xheW91dHJlY2FsbC4gQnV0IGl0IGlzDQo+
IGp1c3QgYW4gb3B0aW1pemF0aW9uIGFuZCBzaG91bGQgb25seSBiZSBkb25lIHdoZW4gdGhlcmUg
aXMgbm8NCj4gaW4tZmxpZ2h0IElPIChhdCBsZWFzdCBmb3IgYmxvY2sgbGF5b3V0KSBJTUhPLg0K
DQpIT1cgRE8gWU9VIEdVQVJBTlRFRSBOTyBJTi1GTElHSFQgSU8/DQoNClJlcGVhdGluZyB0aGUg
c2FtZSBtYW50cmEgYWJvdXQgJ25vIGluLWZsaWdodCBJTycgdGhhdCBkb2Vzbid0IGFwcGx5IHRv
DQp0aW1lb3V0IHNpdHVhdGlvbnMgaXNuJ3QgaGVscGZ1bC4NCg0KQSBUSU1FT1VUIG1lYW5zIHRo
YXQgeW91IGhhdmUgTk8gSURFQSBpZiB0aGUgZGF0YSBpcyBzdGlsbCBpbiBmbGlnaHQgb3INCm5v
dC4gVGhhdCdzIHdoZW4geW91IG5lZWQgZmVuY2luZywgYW5kIHRoZSBvbmx5IHRoaW5nIHRoYXQg
Y2FuIHN1cHBseQ0KZmVuY2luZyBpbiB0aGF0IHNpdHVhdGlvbiBpcyB0aGUgTURTLg0KDQotLSAN
ClRyb25kIE15a2xlYnVzdA0KTGludXggTkZTIGNsaWVudCBtYWludGFpbmVyDQoNCk5ldEFwcA0K
VHJvbmQuTXlrbGVidXN0QG5ldGFwcC5jb20NCnd3dy5uZXRhcHAuY29tDQoNCg==

2012-08-09 19:31:45

by Myklebust, Trond

[permalink] [raw]
Subject: Re: return layout on error, BUG/deadlock

T24gRnJpLCAyMDEyLTA4LTEwIGF0IDAwOjQ4ICswODAwLCBQZW5nIFRhbyB3cm90ZToNCj4gT24g
RnJpLCBBdWcgMTAsIDIwMTIgYXQgMTI6MzcgQU0sIE15a2xlYnVzdCwgVHJvbmQNCj4gPFRyb25k
Lk15a2xlYnVzdEBuZXRhcHAuY29tPiB3cm90ZToNCj4gPiBPbiBGcmksIDIwMTItMDgtMTAgYXQg
MDA6MzQgKzA4MDAsIFBlbmcgVGFvIHdyb3RlOg0KPiA+PiBPbiBGcmksIEF1ZyAxMCwgMjAxMiBh
dCAxMjowNiBBTSwgTXlrbGVidXN0LCBUcm9uZA0KPiA+PiA8VHJvbmQuTXlrbGVidXN0QG5ldGFw
cC5jb20+IHdyb3RlOg0KPiA+PiA+IE9uIFRodSwgMjAxMi0wOC0wOSBhdCAxODo0OSArMDMwMCwg
SWRhbiBLZWRhciB3cm90ZToNCj4gPj4gPj4gT24gVGh1LCBBdWcgOSwgMjAxMiBhdCA1OjA1IFBN
LCBNeWtsZWJ1c3QsIFRyb25kDQo+ID4+ID4+IDxUcm9uZC5NeWtsZWJ1c3RAbmV0YXBwLmNvbT4g
d3JvdGU6DQo+ID4+ID4+ID4+IC0tLS0tT3JpZ2luYWwgTWVzc2FnZS0tLS0tDQo+ID4+ID4+ID4+
IEZyb206IGxpbnV4LW5mcy1vd25lckB2Z2VyLmtlcm5lbC5vcmcgW21haWx0bzpsaW51eC1uZnMt
DQo+ID4+ID4+ID4+IG93bmVyQHZnZXIua2VybmVsLm9yZ10gT24gQmVoYWxmIE9mIElkYW4gS2Vk
YXINCj4gPj4gPj4gPj4gU2VudDogVGh1cnNkYXksIEF1Z3VzdCAwOSwgMjAxMiA5OjAzIEFNDQo+
ID4+ID4+ID4+IFRvOiBCb2F6IEhhcnJvc2g7IE5GUyBsaXN0DQo+ID4+ID4+ID4+IENjOiBCZW5u
eSBIYWxldnkNCj4gPj4gPj4gPj4gU3ViamVjdDogcmV0dXJuIGxheW91dCBvbiBlcnJvciwgQlVH
L2RlYWRsb2NrDQo+ID4+ID4+ID4+DQo+ID4+ID4+ID4+IEhpLA0KPiA+PiA+PiA+Pg0KPiA+PiA+
PiA+PiBBcyBhIHJlc3VsdCBvZiBzb21lIGV4cGVyaW1lbnRzLCBJIHdhbnRlZCB0byBzZWUgd2hh
dCBoYXBwZW5zIHdoZW4gSQ0KPiA+PiA+PiA+PiBpbmplY3QgYW4gZXJyb3IgKGhhcmQgY29kZWQp
IHRvIHRoZSBvYmplY3QgbGF5b3V0IGRyaXZlci4gdGhlIHBhdGNoIGlzIGF0IHRoZQ0KPiA+PiA+
PiA+PiBib3R0b20gb2YgdGhpcyBtYWlsLiB0aGUgcmVhc29uIEkgZGlkIHRoaXMgaXMgYmVjYXVz
ZSB3aGVuIEkgaW5qZWN0IGVycm9ycyBpbiBteQ0KPiA+PiA+PiA+PiBtb2RpZmllZCB2ZXJzaW9u
IG9mIHRoZSBvYmplY3QgbGF5b3V0IGRyaXZlciwgSSBnZXQgdGhlIHNhbWUgQlVHIFRpZ3Jhbg0K
PiA+PiA+PiA+PiByZXBvcnRlZCBhYm91dCB5ZXN0ZXJkYXk6DQo+ID4+ID4+ID4+IG5mczRwcm9j
LmM6NjI1MiA6ICAgQlVHX09OKCFsaXN0X2VtcHR5KCZsby0+cGxoX3NlZ3MpKTsNCj4gPj4gPj4g
Pj4NCj4gPj4gPj4gPj4gSW4gbXkgbW9kaWZpZWQgdmVyc2lvbiAoYmFzZWQgb24ga2VybmVsIDMu
MyksIHRoZSBidWcgc2VlbXMgdG8gYmUgdGhhdA0KPiA+PiA+PiA+PiBwbmZzX2xkX3dyaXRlX2Rv
bmUgY2FsbHMgcG5mc19yZXR1cm5fbGF5b3V0IGluIHRoZSBlcnJvciBwYXRoLCBldmVuIGlmIHRo
ZXJlDQo+ID4+ID4+ID4+IGlzIGluLWZsaWdodCBJL08uDQo+ID4+ID4+ID4NCj4gPj4gPj4gPiBU
aGF0IGlzIG5vdCBhIGJ1Zy4gSXQgaXMgYW4gaW50ZW50aW9uYWwgY2hhbmdlIGluIG9yZGVyIHRv
IGFsbG93IHRoZSBNRFMgdG8gZmVuY2Ugb2ZmIHRoZSBvdXRzdGFuZGluZyB3cml0ZXMgKGlmIGl0
IGNhbiBkbyBzbykgYmVmb3JlIHdlIHJldHJhbnNtaXQgdGhlbSBhcyB3cml0ZS10aHJvdWdoLU1E
Uy4gT3RoZXJ3aXNlLCB5b3UgcmlzayByYWNlcyBiZXR3ZWVuIHRoZSBvdXRzdGFuZGluZyB3cml0
ZXMtdG8tRFMgYW5kIHRoZSBuZXcgd3JpdGVzLXRocm91Z2gtTURTLg0KPiA+PiA+Pg0KPiA+PiA+
PiB0byB3aGF0IGNoYW5nZSBhcmUgeW91IHJlZmVycmluZz8NCj4gPj4gPg0KPiA+PiA+IEFzIEkg
c3RhdGVkIGluIHRoZSBjaGFuZ2Vsb2cgb2YgdGhlIHBhdGNoIHRoYXQgSSBzZW50IHRvIHRoZSBs
aXN0DQo+ID4+ID4geWVzdGVyZGF5LCB0aGUgYmVoYXZpb3VyIGlzIGR1ZSB0byBjb21taXQgMGE1
N2NkYWMzZi4NCj4gPj4gPg0KPiA+PiA+PiA+DQo+ID4+ID4+ID4gU2VlIHRoZSBjaGFuZ2Vsb2cg
aW4gdGhlIHBhdGNoIHRoYXQgSSBzZW50IHRvIHRoZSBsaXN0IHllc3RlcmRheS4NCj4gPj4gPj4g
Pg0KPiA+PiA+Pg0KPiA+PiA+PiBJIHNhdyB0aGF0LCBhbmQgaWYgSSdtIG5vdCBtaXN0YWtlbiB0
aGVzZSByYWNlcyBhcHBseSB0byBvYmplY3QgbGF5b3V0DQo+ID4+ID4+IGFzIHdlbGwsIGFuZCBp
biBhbnkgY2FzZSB0aGV5IGFwcGx5IGluIG15IGNhc2UuIEhvd2V2ZXIsIGl0IGlzIG5vdA0KPiA+
PiA+PiBlYXN5IHRvIG1lc3MgYXJvdW5kIHdpdGggTEFZT1VUUkVUVVJOIGluIG9iamVjdCBsYXlv
dXQsIGFuZCB0aGVyZSBoYXZlDQo+ID4+ID4+IGJlZW4gc2V2ZXJhbCBkaXNjdXNzaW9ucyBvbiB0
aGUgaXNzdWUuIEluIG9uZSBvZiB0aGVzZSBkaXNjdXNzaW9ucw0KPiA+PiA+PiBCZW5ueSBjbGFy
aWZpZWQgdGhhdCB0aGUgb2JqZWN0IGxheW91dCBjbGllbnQgbXVzdCB3YWl0IGZvciBhbGwNCj4g
Pj4gPj4gaW4tZmxpZ2h0IEkvTyB0byBlbmQuDQo+ID4+ID4NCj4gPj4gPiBJZiB0aGUgcHJvYmxl
bSBpcyB0aGF0IHRoZSBEUyBpcyBmYWlsaW5nIHRvIHJlc3BvbmQsIGhvdyBkb2VzIHRoZSBjbGll
bnQNCj4gPj4gPiBrbm93IHRoYXQgdGhlIGluLWZsaWdodCBJL08gaGFzIGVuZGVkPw0KPiA+PiA+
DQo+ID4+ID4+IFNvIGZvciBmaWxlIGxheW91dCBpdCBwcm9iYWJseSBtYWtlcyBzZW5zZSwgYnV0
IG9iamVjdCBsYXlvdXQgKGFuZCBpZg0KPiA+PiA+PiBJIHVuZGVyc3RhbmQgY29ycmVjdGx5LCBi
bG9jayBsYXlvdXQgYXMgd2VsbCkgc29tZXRoaW5nIGVsc2UgbmVlZHMgdG8NCj4gPj4gPj4gYmUg
ZG9uZS4gSSB0aG91Z2h0IGFib3V0IHN5bmMgd2FpdCB3aGVuIHJldHVybmluZyB0aGUgbGF5b3V0
IG9uIGVycm9yLA0KPiA+PiA+PiBidXQgYWNjb3JkaW5nIHRvIEJvYXogaXQgd2lsbCBjYXVzZSBk
ZWFkbG9ja3MgKEJvYXogLSBjYW4geW91IHBsZWFzZQ0KPiA+PiA+PiBlbGFib3JhdGU/KS4NCj4g
Pj4gPg0KPiA+PiA+IFRoZSBvYmplY3QgbGF5b3V0cmV0dXJuIGhhcyB0aGUgYWJpbGl0eSB0byBw
YXNzIGEgdGltZW91dCBlcnJvciB2YWx1ZSB0bw0KPiA+PiA+IHRoZSBNRFMgcHJlY2lzZWx5IGlu
IG9yZGVyIHRvIGFsbG93IHRoZSBsYXR0ZXIgdG8gZGVhbCB3aXRoIHRoaXMga2luZCBvZg0KPiA+
PiA+IGlzc3VlLiBTZWUgdGhlIGRlc2NyaXB0aW9uIG9mIHN0cnVjdCBwbmZzX29zZF9pb2VycjQg
aW4gcmZjNTY2NC4NCj4gPj4gPg0KPiA+PiA+IFRoZSBibG9jayBsYXlvdXQgaXMgYWRkaW5nIHRo
ZSBzYW1lIGFiaWxpdHkgdG8gbGF5b3V0cmV0dXJuIGluIE5GU3Y0LjINCj4gPj4gPiAoc2VlIGRy
YWZ0LWlldGYtbmZzdjQtbWlub3J2ZXJzaW9uMi0xMy50eHQpIHZpYSB0aGUgc3RydWN0DQo+ID4+
ID4gbGF5b3V0cmV0dXJuX2RldmljZV9lcnJvcjQsIHNvIHByZXN1bWFibHkgdGhleSB0b28gaGF2
ZSBhIHBsYW4gZm9yDQo+ID4+ID4gZGVhbGluZyB3aXRoIHRoaXMga2luZCBvZiBpc3N1ZS4NCj4g
Pj4gSXQgaXMgb25lIHRoaW5nIHRvIHRlbGwgTURTIHRoYXQgdGhlcmUgaXMgRFMgYWNjZXNzIGVy
cm9yIGJ5IHNlbmRpbmcNCj4gPj4gbGF5b3V0cmV0dXJuLCBhbmQgaXQgaXMgYW5vdGhlciB0aGlu
ZyB0byByZXR1cm4gYSBsYXlvdXQgZXZlbiBpZiB0aGVyZQ0KPiA+PiBpcyBvdmVybGFwcGluZyBp
bi1mbGlnaHQgRFMgSU8uLi4NCj4gPj4NCj4gPj4gSSBjZXJ0YWlubHkgYWdyZWUgdGhhdCBjbGll
bnQgaXMgZW50aXRsZWQgdG8gcmV0dXJuIGxheW91dCB0byBpbmZvcm0NCj4gPj4gTURTIGFib3V0
IERTIGVycm9ycyBhbmQgYWxzbyBhdm9pZCBwb3NzaWJsZSBjYl9sYXlvdXRyZWNhbGwuIEJ1dCBp
dCBpcw0KPiA+PiBqdXN0IGFuIG9wdGltaXphdGlvbiBhbmQgc2hvdWxkIG9ubHkgYmUgZG9uZSB3
aGVuIHRoZXJlIGlzIG5vDQo+ID4+IGluLWZsaWdodCBJTyAoYXQgbGVhc3QgZm9yIGJsb2NrIGxh
eW91dCkgSU1ITy4NCj4gPg0KPiA+IEhPVyBETyBZT1UgR1VBUkFOVEVFIE5PIElOLUZMSUdIVCBJ
Tz8NCj4gPg0KPiBJIGRvbid0LiBUaGF0J3Mgd2h5IEkgZG9uJ3QgcmV0dXJuIGxheW91dCBpbiBw
bmZzX2xkX3dyaXRlX2RvbmUoKS4gQW5kDQo+IGZvciBsYXlvdXRyZXR1cm4gdXBvbiBjYl9sYXlv
dXRyZXR1cm4sIGJsb2NrIGxheW91dCBjbGllbnQgbmVlZHMgdG8gZG8NCj4gdGltZWQtbGVhc2Ug
SU8gZmVuY2luZyBwZXIgcmZjNTY2MywgYnV0IGl0IGlzIG5vdCBpbXBsZW1lbnRlZCBpbiBMaW51
eA0KPiBjbGllbnQuDQoNClRoZSB0aW1lZC1sZWFzZSBJTyBmZW5jaW5nIGRlc2NyaWJlZCBpbiBy
ZmM1NjYzIGlzIGFib3V0IGluZm9ybWluZyB0aGUNCnNlcnZlciBhYm91dCBob3cgbG9uZyB0aGUg
Y2xpZW50IGV4cGVjdHMgYSBjb21tYW5kIHRvIHN1Y2NlZWQgb3IgZmFpbC4NCkl0IGRvZXNuJ3Qg
b2ZmZXIgYW55IGFkdmljZSBmb3IgaG93IHRoZSBjbGllbnQgaXMgdG8gZGVhbCB3aXRoIGFuDQp1
bnJlc3BvbnNpdmUgRFMuDQoNCldoYXQgeW91IG5lZWQgaGVyZSBpcyBoZWxwIGZyb20gdGhlIHVu
ZGVybHlpbmcgdHJhbnNwb3J0IHByb3RvY29sLiBBcyBJDQpzYWlkIGluIHRoZSBlbWFpbCB0byBJ
ZGFuLCB3aGVuIHJlc2VhcmNoaW5nIGlTQ1NJIGFuZCBpRkNQLCBJIGZvdW5kIHdoYXQNCmFwcGVh
cnMgdG8gYmUgbWVjaGFuaXNtcyBmb3IgcmVsaWFibHkgdGltaW5nIG91dC4NCg0KLS0gDQpUcm9u
ZCBNeWtsZWJ1c3QNCkxpbnV4IE5GUyBjbGllbnQgbWFpbnRhaW5lcg0KDQpOZXRBcHANClRyb25k
Lk15a2xlYnVzdEBuZXRhcHAuY29tDQp3d3cubmV0YXBwLmNvbQ0KDQo=

2012-08-09 14:06:15

by Myklebust, Trond

[permalink] [raw]
Subject: RE: return layout on error, BUG/deadlock

> -----Original Message-----
> From: [email protected] [mailto:linux-nfs-
> [email protected]] On Behalf Of Idan Kedar
> Sent: Thursday, August 09, 2012 9:03 AM
> To: Boaz Harrosh; NFS list
> Cc: Benny Halevy
> Subject: return layout on error, BUG/deadlock
>
> Hi,
>
> As a result of some experiments, I wanted to see what happens when I
> inject an error (hard coded) to the object layout driver. the patch is at the
> bottom of this mail. the reason I did this is because when I inject errors in my
> modified version of the object layout driver, I get the same BUG Tigran
> reported about yesterday:
> nfs4proc.c:6252 : BUG_ON(!list_empty(&lo->plh_segs));
>
> In my modified version (based on kernel 3.3), the bug seems to be that
> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
> is in-flight I/O.

That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.

See the changelog in the patch that I sent to the list yesterday.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com



2012-08-09 16:54:53

by Idan Kedar

[permalink] [raw]
Subject: Re: return layout on error, BUG/deadlock

On Thu, Aug 9, 2012 at 7:06 PM, Myklebust, Trond
<[email protected]> wrote:
> On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:
>> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond
>> <[email protected]> wrote:
>> >> -----Original Message-----
>> >> From: [email protected] [mailto:linux-nfs-
>> >> [email protected]] On Behalf Of Idan Kedar
>> >> Sent: Thursday, August 09, 2012 9:03 AM
>> >> To: Boaz Harrosh; NFS list
>> >> Cc: Benny Halevy
>> >> Subject: return layout on error, BUG/deadlock
>> >>
>> >> Hi,
>> >>
>> >> As a result of some experiments, I wanted to see what happens when I
>> >> inject an error (hard coded) to the object layout driver. the patch is at the
>> >> bottom of this mail. the reason I did this is because when I inject errors in my
>> >> modified version of the object layout driver, I get the same BUG Tigran
>> >> reported about yesterday:
>> >> nfs4proc.c:6252 : BUG_ON(!list_empty(&lo->plh_segs));
>> >>
>> >> In my modified version (based on kernel 3.3), the bug seems to be that
>> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
>> >> is in-flight I/O.
>> >
>> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.
>>
>> to what change are you referring?
>
> As I stated in the changelog of the patch that I sent to the list
> yesterday, the behaviour is due to commit 0a57cdac3f.
>

I'm not sure I'm following. I was talking about the state of the code
v3.5, not to any specific patch/change. since the client yelled
"BUG!!!" and crashed, there is some bug involved, either because the
BUG() is correct or because the presence of the BUG() is the bug
itself.

>> >
>> > See the changelog in the patch that I sent to the list yesterday.
>> >
>>
>> I saw that, and if I'm not mistaken these races apply to object layout
>> as well, and in any case they apply in my case. However, it is not
>> easy to mess around with LAYOUTRETURN in object layout, and there have
>> been several discussions on the issue. In one of these discussions
>> Benny clarified that the object layout client must wait for all
>> in-flight I/O to end.
>
> If the problem is that the DS is failing to respond, how does the client
> know that the in-flight I/O has ended?
>

after the last rpc_call_done is called there is no I/O from the
client's perspective an the layout can be safely returned, after which
I/O can be done through the MDS.
Is there I/O pending from the DS perspective? maybe, but all you can
do is send I/O via MDS and hope it will not race. fencing doesn't
really save you, it just improves your odds where applicable, i.e.
file layout.

>> So for file layout it probably makes sense, but object layout (and if
>> I understand correctly, block layout as well) something else needs to
>> be done. I thought about sync wait when returning the layout on error,
>> but according to Boaz it will cause deadlocks (Boaz - can you please
>> elaborate?).
>
> The object layoutreturn has the ability to pass a timeout error value to
> the MDS precisely in order to allow the latter to deal with this kind of
> issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.
>
> The block layout is adding the same ability to layoutreturn in NFSv4.2
> (see draft-ietf-nfsv4-minorversion2-13.txt) via the struct
> layoutreturn_device_error4, so presumably they too have a plan for
> dealing with this kind of issue.
>

I'll wait for Boaz (or someone else) to explain what exactly is "this
kind of issue". I still don't know why sync wait would deadlock.

>> And come to think of it, nfs4_proc_setattr also returns the layout
>> when there may be I/O in-flight (correct me if i'm wrong). So I guess
>> pnfs_return_layout should somehow solve this by itself by saying "if
>> this is fencing (a flag which will be set for file layout only), go
>> ahead, otherwise make the layout as 'needs to be returned' and when
>> the lseg lists gets empty return the layout".
>
> The only layout type that sets the PNFS_LAYOUTRET_ON_SETATTR flag is
> objects, so that question needs to be directed to Boaz.
>

Yes, this entire thread is mainly directed to Boaz... And IIRC he did
want to implement something of that sort, and this thread was
basically for asking him "did you?".

> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>



--
idank

2012-08-09 19:13:32

by Myklebust, Trond

[permalink] [raw]
Subject: Re: return layout on error, BUG/deadlock

T24gVGh1LCAyMDEyLTA4LTA5IGF0IDE5OjU0ICswMzAwLCBJZGFuIEtlZGFyIHdyb3RlOg0KPiBP
biBUaHUsIEF1ZyA5LCAyMDEyIGF0IDc6MDYgUE0sIE15a2xlYnVzdCwgVHJvbmQNCj4gPFRyb25k
Lk15a2xlYnVzdEBuZXRhcHAuY29tPiB3cm90ZToNCj4gPiBPbiBUaHUsIDIwMTItMDgtMDkgYXQg
MTg6NDkgKzAzMDAsIElkYW4gS2VkYXIgd3JvdGU6DQo+ID4+IE9uIFRodSwgQXVnIDksIDIwMTIg
YXQgNTowNSBQTSwgTXlrbGVidXN0LCBUcm9uZA0KPiA+PiA8VHJvbmQuTXlrbGVidXN0QG5ldGFw
cC5jb20+IHdyb3RlOg0KPiA+PiA+PiAtLS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0tLQ0KPiA+PiA+
PiBGcm9tOiBsaW51eC1uZnMtb3duZXJAdmdlci5rZXJuZWwub3JnIFttYWlsdG86bGludXgtbmZz
LQ0KPiA+PiA+PiBvd25lckB2Z2VyLmtlcm5lbC5vcmddIE9uIEJlaGFsZiBPZiBJZGFuIEtlZGFy
DQo+ID4+ID4+IFNlbnQ6IFRodXJzZGF5LCBBdWd1c3QgMDksIDIwMTIgOTowMyBBTQ0KPiA+PiA+
PiBUbzogQm9heiBIYXJyb3NoOyBORlMgbGlzdA0KPiA+PiA+PiBDYzogQmVubnkgSGFsZXZ5DQo+
ID4+ID4+IFN1YmplY3Q6IHJldHVybiBsYXlvdXQgb24gZXJyb3IsIEJVRy9kZWFkbG9jaw0KPiA+
PiA+Pg0KPiA+PiA+PiBIaSwNCj4gPj4gPj4NCj4gPj4gPj4gQXMgYSByZXN1bHQgb2Ygc29tZSBl
eHBlcmltZW50cywgSSB3YW50ZWQgdG8gc2VlIHdoYXQgaGFwcGVucyB3aGVuIEkNCj4gPj4gPj4g
aW5qZWN0IGFuIGVycm9yIChoYXJkIGNvZGVkKSB0byB0aGUgb2JqZWN0IGxheW91dCBkcml2ZXIu
IHRoZSBwYXRjaCBpcyBhdCB0aGUNCj4gPj4gPj4gYm90dG9tIG9mIHRoaXMgbWFpbC4gdGhlIHJl
YXNvbiBJIGRpZCB0aGlzIGlzIGJlY2F1c2Ugd2hlbiBJIGluamVjdCBlcnJvcnMgaW4gbXkNCj4g
Pj4gPj4gbW9kaWZpZWQgdmVyc2lvbiBvZiB0aGUgb2JqZWN0IGxheW91dCBkcml2ZXIsIEkgZ2V0
IHRoZSBzYW1lIEJVRyBUaWdyYW4NCj4gPj4gPj4gcmVwb3J0ZWQgYWJvdXQgeWVzdGVyZGF5Og0K
PiA+PiA+PiBuZnM0cHJvYy5jOjYyNTIgOiAgIEJVR19PTighbGlzdF9lbXB0eSgmbG8tPnBsaF9z
ZWdzKSk7DQo+ID4+ID4+DQo+ID4+ID4+IEluIG15IG1vZGlmaWVkIHZlcnNpb24gKGJhc2VkIG9u
IGtlcm5lbCAzLjMpLCB0aGUgYnVnIHNlZW1zIHRvIGJlIHRoYXQNCj4gPj4gPj4gcG5mc19sZF93
cml0ZV9kb25lIGNhbGxzIHBuZnNfcmV0dXJuX2xheW91dCBpbiB0aGUgZXJyb3IgcGF0aCwgZXZl
biBpZiB0aGVyZQ0KPiA+PiA+PiBpcyBpbi1mbGlnaHQgSS9PLg0KPiA+PiA+DQo+ID4+ID4gVGhh
dCBpcyBub3QgYSBidWcuIEl0IGlzIGFuIGludGVudGlvbmFsIGNoYW5nZSBpbiBvcmRlciB0byBh
bGxvdyB0aGUgTURTIHRvIGZlbmNlIG9mZiB0aGUgb3V0c3RhbmRpbmcgd3JpdGVzIChpZiBpdCBj
YW4gZG8gc28pIGJlZm9yZSB3ZSByZXRyYW5zbWl0IHRoZW0gYXMgd3JpdGUtdGhyb3VnaC1NRFMu
IE90aGVyd2lzZSwgeW91IHJpc2sgcmFjZXMgYmV0d2VlbiB0aGUgb3V0c3RhbmRpbmcgd3JpdGVz
LXRvLURTIGFuZCB0aGUgbmV3IHdyaXRlcy10aHJvdWdoLU1EUy4NCj4gPj4NCj4gPj4gdG8gd2hh
dCBjaGFuZ2UgYXJlIHlvdSByZWZlcnJpbmc/DQo+ID4NCj4gPiBBcyBJIHN0YXRlZCBpbiB0aGUg
Y2hhbmdlbG9nIG9mIHRoZSBwYXRjaCB0aGF0IEkgc2VudCB0byB0aGUgbGlzdA0KPiA+IHllc3Rl
cmRheSwgdGhlIGJlaGF2aW91ciBpcyBkdWUgdG8gY29tbWl0IDBhNTdjZGFjM2YuDQo+ID4NCj4g
DQo+IEknbSBub3Qgc3VyZSBJJ20gZm9sbG93aW5nLiBJIHdhcyB0YWxraW5nIGFib3V0IHRoZSBz
dGF0ZSBvZiB0aGUgY29kZQ0KPiB2My41LCBub3QgdG8gYW55IHNwZWNpZmljIHBhdGNoL2NoYW5n
ZS4gc2luY2UgdGhlIGNsaWVudCB5ZWxsZWQNCj4gIkJVRyEhISIgYW5kIGNyYXNoZWQsIHRoZXJl
IGlzIHNvbWUgYnVnIGludm9sdmVkLCBlaXRoZXIgYmVjYXVzZSB0aGUNCj4gQlVHKCkgaXMgY29y
cmVjdCBvciBiZWNhdXNlIHRoZSBwcmVzZW5jZSBvZiB0aGUgQlVHKCkgaXMgdGhlIGJ1Zw0KPiBp
dHNlbGYuDQo+IA0KPiA+PiA+DQo+ID4+ID4gU2VlIHRoZSBjaGFuZ2Vsb2cgaW4gdGhlIHBhdGNo
IHRoYXQgSSBzZW50IHRvIHRoZSBsaXN0IHllc3RlcmRheS4NCj4gPj4gPg0KPiA+Pg0KPiA+PiBJ
IHNhdyB0aGF0LCBhbmQgaWYgSSdtIG5vdCBtaXN0YWtlbiB0aGVzZSByYWNlcyBhcHBseSB0byBv
YmplY3QgbGF5b3V0DQo+ID4+IGFzIHdlbGwsIGFuZCBpbiBhbnkgY2FzZSB0aGV5IGFwcGx5IGlu
IG15IGNhc2UuIEhvd2V2ZXIsIGl0IGlzIG5vdA0KPiA+PiBlYXN5IHRvIG1lc3MgYXJvdW5kIHdp
dGggTEFZT1VUUkVUVVJOIGluIG9iamVjdCBsYXlvdXQsIGFuZCB0aGVyZSBoYXZlDQo+ID4+IGJl
ZW4gc2V2ZXJhbCBkaXNjdXNzaW9ucyBvbiB0aGUgaXNzdWUuIEluIG9uZSBvZiB0aGVzZSBkaXNj
dXNzaW9ucw0KPiA+PiBCZW5ueSBjbGFyaWZpZWQgdGhhdCB0aGUgb2JqZWN0IGxheW91dCBjbGll
bnQgbXVzdCB3YWl0IGZvciBhbGwNCj4gPj4gaW4tZmxpZ2h0IEkvTyB0byBlbmQuDQo+ID4NCj4g
PiBJZiB0aGUgcHJvYmxlbSBpcyB0aGF0IHRoZSBEUyBpcyBmYWlsaW5nIHRvIHJlc3BvbmQsIGhv
dyBkb2VzIHRoZSBjbGllbnQNCj4gPiBrbm93IHRoYXQgdGhlIGluLWZsaWdodCBJL08gaGFzIGVu
ZGVkPw0KPiA+DQo+IA0KPiBhZnRlciB0aGUgbGFzdCBycGNfY2FsbF9kb25lIGlzIGNhbGxlZCB0
aGVyZSBpcyBubyBJL08gZnJvbSB0aGUNCj4gY2xpZW50J3MgcGVyc3BlY3RpdmUgYW4gdGhlIGxh
eW91dCBjYW4gYmUgc2FmZWx5IHJldHVybmVkLCBhZnRlciB3aGljaA0KPiBJL08gY2FuIGJlIGRv
bmUgdGhyb3VnaCB0aGUgTURTLg0KDQpOby4gWW91IGFyZSBjb25mdXNlZC4NCg0KQmVpbmcgaW4g
cnBjX2NhbGxfZG9uZSBqdXN0IHRlbGxzIHlvdSB0aGF0IHRoZSBjbGllbnQgdGhpbmtzIGl0IGlz
IGRvbmUuDQpJdCB0ZWxscyB5b3Ugbm90aGluZyBhYm91dCB3aGF0IHRoZSBEUyBpcyBkb2luZy4g
SXQgbWF5IGp1c3Qgc2xvb29vb3dseQ0KYmUgcHJvY2Vzc2luZyB0aGUgV1JJVEUgY2FsbHMgZm9y
IGFsbCB5b3Uga25vdy4NCg0KVW5sZXNzIHlvdSBnZXQgYSByZXNwb25zZSBmcm9tIHRoZSBEUyBz
YXlpbmcgIkknbSBkb25lIiwgb3IgeW91IGhhdmUNCnNvbWUgb3RoZXIgbWVjaGFuaXNtIHRvIGd1
YXJhbnRlZSB0aGF0IHRoZSBEUyBpcyBkb25lIHdoZW4geW91IHRpbWUgb3V0LA0KdGhlbiB5b3Ug
aGF2ZSBubyBndWFyYW50ZWUgdGhhdCBpdCBpcyBzYWZlIHRvIHNlbmQgdG8gTURTLg0KDQo+IElz
IHRoZXJlIEkvTyBwZW5kaW5nIGZyb20gdGhlIERTIHBlcnNwZWN0aXZlPyBtYXliZSwgYnV0IGFs
bCB5b3UgY2FuDQo+IGRvIGlzIHNlbmQgSS9PIHZpYSBNRFMgYW5kIGhvcGUgaXQgd2lsbCBub3Qg
cmFjZS4gZmVuY2luZyBkb2Vzbid0DQo+IHJlYWxseSBzYXZlIHlvdSwgaXQganVzdCBpbXByb3Zl
cyB5b3VyIG9kZHMgd2hlcmUgYXBwbGljYWJsZSwgaS5lLg0KPiBmaWxlIGxheW91dC4NCg0KQWN0
dWFsbHksIHdoZW4gcmVhZGluZyB0aGUgaVNDU0kgc3BlYywgaXQgaW1wbGllcyB0aGF0IHRoZXJl
IGlzIGEga25vd24NCnNlc3Npb24gdGltZW91dCB0aGF0IGlzIG5lZ290aWF0ZWQgYmV0d2VlbiB0
aGUgaW5pdGlhdG9yIGFuZCB0YXJnZXQuDQpBZnRlciB0aGUgc2Vzc2lvbiBleHBpcmVzLCB5b3Ug
YXJlIGd1YXJhbnRlZWQgdGhhdCBubyBmdXJ0aGVyIGNvbW1hbmRzDQphcmUgcnVubmluZyBvbiB0
aGUgRFMuDQoNClRoZSBpRkNQIHByb3RvY29sIGF0dGFjaGVzIGEgdGltZW91dCB2YWx1ZSB0byBl
YWNoIGZyYW1lLCBhbmQgaXMNCnN1cHBvc2VkIHRvIGVuZm9yY2UgdGhhdCB0aW1lb3V0Lg0KDQpT
byBhdCBsZWFzdCB0aG9zZSBwcm90b2NvbHMgc2hvdWxkIGhhdmUgYSBkZXRlcm1pbmlzdGljIGJl
aGF2aW91ciB0aGF0DQpjYW4gYmUgdXNlZCBieSB0aGUgY2xpZW50Lg0KDQo+ID4+IFNvIGZvciBm
aWxlIGxheW91dCBpdCBwcm9iYWJseSBtYWtlcyBzZW5zZSwgYnV0IG9iamVjdCBsYXlvdXQgKGFu
ZCBpZg0KPiA+PiBJIHVuZGVyc3RhbmQgY29ycmVjdGx5LCBibG9jayBsYXlvdXQgYXMgd2VsbCkg
c29tZXRoaW5nIGVsc2UgbmVlZHMgdG8NCj4gPj4gYmUgZG9uZS4gSSB0aG91Z2h0IGFib3V0IHN5
bmMgd2FpdCB3aGVuIHJldHVybmluZyB0aGUgbGF5b3V0IG9uIGVycm9yLA0KPiA+PiBidXQgYWNj
b3JkaW5nIHRvIEJvYXogaXQgd2lsbCBjYXVzZSBkZWFkbG9ja3MgKEJvYXogLSBjYW4geW91IHBs
ZWFzZQ0KPiA+PiBlbGFib3JhdGU/KS4NCj4gPg0KPiA+IFRoZSBvYmplY3QgbGF5b3V0cmV0dXJu
IGhhcyB0aGUgYWJpbGl0eSB0byBwYXNzIGEgdGltZW91dCBlcnJvciB2YWx1ZSB0bw0KPiA+IHRo
ZSBNRFMgcHJlY2lzZWx5IGluIG9yZGVyIHRvIGFsbG93IHRoZSBsYXR0ZXIgdG8gZGVhbCB3aXRo
IHRoaXMga2luZCBvZg0KPiA+IGlzc3VlLiBTZWUgdGhlIGRlc2NyaXB0aW9uIG9mIHN0cnVjdCBw
bmZzX29zZF9pb2VycjQgaW4gcmZjNTY2NC4NCj4gPg0KPiA+IFRoZSBibG9jayBsYXlvdXQgaXMg
YWRkaW5nIHRoZSBzYW1lIGFiaWxpdHkgdG8gbGF5b3V0cmV0dXJuIGluIE5GU3Y0LjINCj4gPiAo
c2VlIGRyYWZ0LWlldGYtbmZzdjQtbWlub3J2ZXJzaW9uMi0xMy50eHQpIHZpYSB0aGUgc3RydWN0
DQo+ID4gbGF5b3V0cmV0dXJuX2RldmljZV9lcnJvcjQsIHNvIHByZXN1bWFibHkgdGhleSB0b28g
aGF2ZSBhIHBsYW4gZm9yDQo+ID4gZGVhbGluZyB3aXRoIHRoaXMga2luZCBvZiBpc3N1ZS4NCj4g
Pg0KPiANCj4gSSdsbCB3YWl0IGZvciBCb2F6IChvciBzb21lb25lIGVsc2UpIHRvIGV4cGxhaW4g
d2hhdCBleGFjdGx5IGlzICJ0aGlzDQo+IGtpbmQgb2YgaXNzdWUiLiBJIHN0aWxsIGRvbid0IGtu
b3cgd2h5IHN5bmMgd2FpdCB3b3VsZCBkZWFkbG9jay4NCj4gDQo+ID4+IEFuZCBjb21lIHRvIHRo
aW5rIG9mIGl0LCBuZnM0X3Byb2Nfc2V0YXR0ciBhbHNvIHJldHVybnMgdGhlIGxheW91dA0KPiA+
PiB3aGVuIHRoZXJlIG1heSBiZSBJL08gaW4tZmxpZ2h0IChjb3JyZWN0IG1lIGlmIGknbSB3cm9u
ZykuIFNvIEkgZ3Vlc3MNCj4gPj4gcG5mc19yZXR1cm5fbGF5b3V0IHNob3VsZCBzb21laG93IHNv
bHZlIHRoaXMgYnkgaXRzZWxmIGJ5IHNheWluZyAiaWYNCj4gPj4gdGhpcyBpcyBmZW5jaW5nIChh
IGZsYWcgd2hpY2ggd2lsbCBiZSBzZXQgZm9yIGZpbGUgbGF5b3V0IG9ubHkpLCBnbw0KPiA+PiBh
aGVhZCwgb3RoZXJ3aXNlIG1ha2UgdGhlIGxheW91dCBhcyAnbmVlZHMgdG8gYmUgcmV0dXJuZWQn
IGFuZCB3aGVuDQo+ID4+IHRoZSBsc2VnIGxpc3RzIGdldHMgZW1wdHkgcmV0dXJuIHRoZSBsYXlv
dXQiLg0KPiA+DQo+ID4gVGhlIG9ubHkgbGF5b3V0IHR5cGUgdGhhdCBzZXRzIHRoZSBQTkZTX0xB
WU9VVFJFVF9PTl9TRVRBVFRSIGZsYWcgaXMNCj4gPiBvYmplY3RzLCBzbyB0aGF0IHF1ZXN0aW9u
IG5lZWRzIHRvIGJlIGRpcmVjdGVkIHRvIEJvYXouDQo+ID4NCj4gDQo+IFllcywgdGhpcyBlbnRp
cmUgdGhyZWFkIGlzIG1haW5seSBkaXJlY3RlZCB0byBCb2F6Li4uIEFuZCBJSVJDIGhlIGRpZA0K
PiB3YW50IHRvIGltcGxlbWVudCBzb21ldGhpbmcgb2YgdGhhdCBzb3J0LCBhbmQgdGhpcyB0aHJl
YWQgd2FzDQo+IGJhc2ljYWxseSBmb3IgYXNraW5nIGhpbSAiZGlkIHlvdT8iLg0KPiANCj4gPiAt
LQ0KPiA+IFRyb25kIE15a2xlYnVzdA0KPiA+IExpbnV4IE5GUyBjbGllbnQgbWFpbnRhaW5lcg0K
PiA+DQo+ID4gTmV0QXBwDQo+ID4gVHJvbmQuTXlrbGVidXN0QG5ldGFwcC5jb20NCj4gPiB3d3cu
bmV0YXBwLmNvbQ0KPiA+DQo+IA0KPiANCj4gDQoNCi0tIA0KVHJvbmQgTXlrbGVidXN0DQpMaW51
eCBORlMgY2xpZW50IG1haW50YWluZXINCg0KTmV0QXBwDQpUcm9uZC5NeWtsZWJ1c3RAbmV0YXBw
LmNvbQ0Kd3d3Lm5ldGFwcC5jb20NCg0K

2012-08-09 16:48:52

by Peng Tao

[permalink] [raw]
Subject: Re: return layout on error, BUG/deadlock

On Fri, Aug 10, 2012 at 12:37 AM, Myklebust, Trond
<[email protected]> wrote:
> On Fri, 2012-08-10 at 00:34 +0800, Peng Tao wrote:
>> On Fri, Aug 10, 2012 at 12:06 AM, Myklebust, Trond
>> <[email protected]> wrote:
>> > On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:
>> >> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond
>> >> <[email protected]> wrote:
>> >> >> -----Original Message-----
>> >> >> From: [email protected] [mailto:linux-nfs-
>> >> >> [email protected]] On Behalf Of Idan Kedar
>> >> >> Sent: Thursday, August 09, 2012 9:03 AM
>> >> >> To: Boaz Harrosh; NFS list
>> >> >> Cc: Benny Halevy
>> >> >> Subject: return layout on error, BUG/deadlock
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> As a result of some experiments, I wanted to see what happens when I
>> >> >> inject an error (hard coded) to the object layout driver. the patch is at the
>> >> >> bottom of this mail. the reason I did this is because when I inject errors in my
>> >> >> modified version of the object layout driver, I get the same BUG Tigran
>> >> >> reported about yesterday:
>> >> >> nfs4proc.c:6252 : BUG_ON(!list_empty(&lo->plh_segs));
>> >> >>
>> >> >> In my modified version (based on kernel 3.3), the bug seems to be that
>> >> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
>> >> >> is in-flight I/O.
>> >> >
>> >> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.
>> >>
>> >> to what change are you referring?
>> >
>> > As I stated in the changelog of the patch that I sent to the list
>> > yesterday, the behaviour is due to commit 0a57cdac3f.
>> >
>> >> >
>> >> > See the changelog in the patch that I sent to the list yesterday.
>> >> >
>> >>
>> >> I saw that, and if I'm not mistaken these races apply to object layout
>> >> as well, and in any case they apply in my case. However, it is not
>> >> easy to mess around with LAYOUTRETURN in object layout, and there have
>> >> been several discussions on the issue. In one of these discussions
>> >> Benny clarified that the object layout client must wait for all
>> >> in-flight I/O to end.
>> >
>> > If the problem is that the DS is failing to respond, how does the client
>> > know that the in-flight I/O has ended?
>> >
>> >> So for file layout it probably makes sense, but object layout (and if
>> >> I understand correctly, block layout as well) something else needs to
>> >> be done. I thought about sync wait when returning the layout on error,
>> >> but according to Boaz it will cause deadlocks (Boaz - can you please
>> >> elaborate?).
>> >
>> > The object layoutreturn has the ability to pass a timeout error value to
>> > the MDS precisely in order to allow the latter to deal with this kind of
>> > issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.
>> >
>> > The block layout is adding the same ability to layoutreturn in NFSv4.2
>> > (see draft-ietf-nfsv4-minorversion2-13.txt) via the struct
>> > layoutreturn_device_error4, so presumably they too have a plan for
>> > dealing with this kind of issue.
>> It is one thing to tell MDS that there is DS access error by sending
>> layoutreturn, and it is another thing to return a layout even if there
>> is overlapping in-flight DS IO...
>>
>> I certainly agree that client is entitled to return layout to inform
>> MDS about DS errors and also avoid possible cb_layoutrecall. But it is
>> just an optimization and should only be done when there is no
>> in-flight IO (at least for block layout) IMHO.
>
> HOW DO YOU GUARANTEE NO IN-FLIGHT IO?
>
I don't. That's why I don't return layout in pnfs_ld_write_done(). And
for layoutreturn upon cb_layoutreturn, block layout client needs to do
timed-lease IO fencing per rfc5663, but it is not implemented in Linux
client.

> Repeating the same mantra about 'no in-flight IO' that doesn't apply to
> timeout situations isn't helpful.
>
> A TIMEOUT means that you have NO IDEA if the data is still in flight or
> not. That's when you need fencing, and the only thing that can supply
> fencing in that situation is the MDS.

>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>



--
Thanks,
Tao

2012-08-09 16:06:20

by Myklebust, Trond

[permalink] [raw]
Subject: Re: return layout on error, BUG/deadlock

T24gVGh1LCAyMDEyLTA4LTA5IGF0IDE4OjQ5ICswMzAwLCBJZGFuIEtlZGFyIHdyb3RlOg0KPiBP
biBUaHUsIEF1ZyA5LCAyMDEyIGF0IDU6MDUgUE0sIE15a2xlYnVzdCwgVHJvbmQNCj4gPFRyb25k
Lk15a2xlYnVzdEBuZXRhcHAuY29tPiB3cm90ZToNCj4gPj4gLS0tLS1PcmlnaW5hbCBNZXNzYWdl
LS0tLS0NCj4gPj4gRnJvbTogbGludXgtbmZzLW93bmVyQHZnZXIua2VybmVsLm9yZyBbbWFpbHRv
OmxpbnV4LW5mcy0NCj4gPj4gb3duZXJAdmdlci5rZXJuZWwub3JnXSBPbiBCZWhhbGYgT2YgSWRh
biBLZWRhcg0KPiA+PiBTZW50OiBUaHVyc2RheSwgQXVndXN0IDA5LCAyMDEyIDk6MDMgQU0NCj4g
Pj4gVG86IEJvYXogSGFycm9zaDsgTkZTIGxpc3QNCj4gPj4gQ2M6IEJlbm55IEhhbGV2eQ0KPiA+
PiBTdWJqZWN0OiByZXR1cm4gbGF5b3V0IG9uIGVycm9yLCBCVUcvZGVhZGxvY2sNCj4gPj4NCj4g
Pj4gSGksDQo+ID4+DQo+ID4+IEFzIGEgcmVzdWx0IG9mIHNvbWUgZXhwZXJpbWVudHMsIEkgd2Fu
dGVkIHRvIHNlZSB3aGF0IGhhcHBlbnMgd2hlbiBJDQo+ID4+IGluamVjdCBhbiBlcnJvciAoaGFy
ZCBjb2RlZCkgdG8gdGhlIG9iamVjdCBsYXlvdXQgZHJpdmVyLiB0aGUgcGF0Y2ggaXMgYXQgdGhl
DQo+ID4+IGJvdHRvbSBvZiB0aGlzIG1haWwuIHRoZSByZWFzb24gSSBkaWQgdGhpcyBpcyBiZWNh
dXNlIHdoZW4gSSBpbmplY3QgZXJyb3JzIGluIG15DQo+ID4+IG1vZGlmaWVkIHZlcnNpb24gb2Yg
dGhlIG9iamVjdCBsYXlvdXQgZHJpdmVyLCBJIGdldCB0aGUgc2FtZSBCVUcgVGlncmFuDQo+ID4+
IHJlcG9ydGVkIGFib3V0IHllc3RlcmRheToNCj4gPj4gbmZzNHByb2MuYzo2MjUyIDogICBCVUdf
T04oIWxpc3RfZW1wdHkoJmxvLT5wbGhfc2VncykpOw0KPiA+Pg0KPiA+PiBJbiBteSBtb2RpZmll
ZCB2ZXJzaW9uIChiYXNlZCBvbiBrZXJuZWwgMy4zKSwgdGhlIGJ1ZyBzZWVtcyB0byBiZSB0aGF0
DQo+ID4+IHBuZnNfbGRfd3JpdGVfZG9uZSBjYWxscyBwbmZzX3JldHVybl9sYXlvdXQgaW4gdGhl
IGVycm9yIHBhdGgsIGV2ZW4gaWYgdGhlcmUNCj4gPj4gaXMgaW4tZmxpZ2h0IEkvTy4NCj4gPg0K
PiA+IFRoYXQgaXMgbm90IGEgYnVnLiBJdCBpcyBhbiBpbnRlbnRpb25hbCBjaGFuZ2UgaW4gb3Jk
ZXIgdG8gYWxsb3cgdGhlIE1EUyB0byBmZW5jZSBvZmYgdGhlIG91dHN0YW5kaW5nIHdyaXRlcyAo
aWYgaXQgY2FuIGRvIHNvKSBiZWZvcmUgd2UgcmV0cmFuc21pdCB0aGVtIGFzIHdyaXRlLXRocm91
Z2gtTURTLiBPdGhlcndpc2UsIHlvdSByaXNrIHJhY2VzIGJldHdlZW4gdGhlIG91dHN0YW5kaW5n
IHdyaXRlcy10by1EUyBhbmQgdGhlIG5ldyB3cml0ZXMtdGhyb3VnaC1NRFMuDQo+IA0KPiB0byB3
aGF0IGNoYW5nZSBhcmUgeW91IHJlZmVycmluZz8NCg0KQXMgSSBzdGF0ZWQgaW4gdGhlIGNoYW5n
ZWxvZyBvZiB0aGUgcGF0Y2ggdGhhdCBJIHNlbnQgdG8gdGhlIGxpc3QNCnllc3RlcmRheSwgdGhl
IGJlaGF2aW91ciBpcyBkdWUgdG8gY29tbWl0IDBhNTdjZGFjM2YuDQoNCj4gPg0KPiA+IFNlZSB0
aGUgY2hhbmdlbG9nIGluIHRoZSBwYXRjaCB0aGF0IEkgc2VudCB0byB0aGUgbGlzdCB5ZXN0ZXJk
YXkuDQo+ID4NCj4gDQo+IEkgc2F3IHRoYXQsIGFuZCBpZiBJJ20gbm90IG1pc3Rha2VuIHRoZXNl
IHJhY2VzIGFwcGx5IHRvIG9iamVjdCBsYXlvdXQNCj4gYXMgd2VsbCwgYW5kIGluIGFueSBjYXNl
IHRoZXkgYXBwbHkgaW4gbXkgY2FzZS4gSG93ZXZlciwgaXQgaXMgbm90DQo+IGVhc3kgdG8gbWVz
cyBhcm91bmQgd2l0aCBMQVlPVVRSRVRVUk4gaW4gb2JqZWN0IGxheW91dCwgYW5kIHRoZXJlIGhh
dmUNCj4gYmVlbiBzZXZlcmFsIGRpc2N1c3Npb25zIG9uIHRoZSBpc3N1ZS4gSW4gb25lIG9mIHRo
ZXNlIGRpc2N1c3Npb25zDQo+IEJlbm55IGNsYXJpZmllZCB0aGF0IHRoZSBvYmplY3QgbGF5b3V0
IGNsaWVudCBtdXN0IHdhaXQgZm9yIGFsbA0KPiBpbi1mbGlnaHQgSS9PIHRvIGVuZC4NCg0KSWYg
dGhlIHByb2JsZW0gaXMgdGhhdCB0aGUgRFMgaXMgZmFpbGluZyB0byByZXNwb25kLCBob3cgZG9l
cyB0aGUgY2xpZW50DQprbm93IHRoYXQgdGhlIGluLWZsaWdodCBJL08gaGFzIGVuZGVkPw0KDQo+
IFNvIGZvciBmaWxlIGxheW91dCBpdCBwcm9iYWJseSBtYWtlcyBzZW5zZSwgYnV0IG9iamVjdCBs
YXlvdXQgKGFuZCBpZg0KPiBJIHVuZGVyc3RhbmQgY29ycmVjdGx5LCBibG9jayBsYXlvdXQgYXMg
d2VsbCkgc29tZXRoaW5nIGVsc2UgbmVlZHMgdG8NCj4gYmUgZG9uZS4gSSB0aG91Z2h0IGFib3V0
IHN5bmMgd2FpdCB3aGVuIHJldHVybmluZyB0aGUgbGF5b3V0IG9uIGVycm9yLA0KPiBidXQgYWNj
b3JkaW5nIHRvIEJvYXogaXQgd2lsbCBjYXVzZSBkZWFkbG9ja3MgKEJvYXogLSBjYW4geW91IHBs
ZWFzZQ0KPiBlbGFib3JhdGU/KS4NCg0KVGhlIG9iamVjdCBsYXlvdXRyZXR1cm4gaGFzIHRoZSBh
YmlsaXR5IHRvIHBhc3MgYSB0aW1lb3V0IGVycm9yIHZhbHVlIHRvDQp0aGUgTURTIHByZWNpc2Vs
eSBpbiBvcmRlciB0byBhbGxvdyB0aGUgbGF0dGVyIHRvIGRlYWwgd2l0aCB0aGlzIGtpbmQgb2YN
Cmlzc3VlLiBTZWUgdGhlIGRlc2NyaXB0aW9uIG9mIHN0cnVjdCBwbmZzX29zZF9pb2VycjQgaW4g
cmZjNTY2NC4NCg0KVGhlIGJsb2NrIGxheW91dCBpcyBhZGRpbmcgdGhlIHNhbWUgYWJpbGl0eSB0
byBsYXlvdXRyZXR1cm4gaW4gTkZTdjQuMg0KKHNlZSBkcmFmdC1pZXRmLW5mc3Y0LW1pbm9ydmVy
c2lvbjItMTMudHh0KSB2aWEgdGhlIHN0cnVjdA0KbGF5b3V0cmV0dXJuX2RldmljZV9lcnJvcjQs
IHNvIHByZXN1bWFibHkgdGhleSB0b28gaGF2ZSBhIHBsYW4gZm9yDQpkZWFsaW5nIHdpdGggdGhp
cyBraW5kIG9mIGlzc3VlLg0KDQo+IEFuZCBjb21lIHRvIHRoaW5rIG9mIGl0LCBuZnM0X3Byb2Nf
c2V0YXR0ciBhbHNvIHJldHVybnMgdGhlIGxheW91dA0KPiB3aGVuIHRoZXJlIG1heSBiZSBJL08g
aW4tZmxpZ2h0IChjb3JyZWN0IG1lIGlmIGknbSB3cm9uZykuIFNvIEkgZ3Vlc3MNCj4gcG5mc19y
ZXR1cm5fbGF5b3V0IHNob3VsZCBzb21laG93IHNvbHZlIHRoaXMgYnkgaXRzZWxmIGJ5IHNheWlu
ZyAiaWYNCj4gdGhpcyBpcyBmZW5jaW5nIChhIGZsYWcgd2hpY2ggd2lsbCBiZSBzZXQgZm9yIGZp
bGUgbGF5b3V0IG9ubHkpLCBnbw0KPiBhaGVhZCwgb3RoZXJ3aXNlIG1ha2UgdGhlIGxheW91dCBh
cyAnbmVlZHMgdG8gYmUgcmV0dXJuZWQnIGFuZCB3aGVuDQo+IHRoZSBsc2VnIGxpc3RzIGdldHMg
ZW1wdHkgcmV0dXJuIHRoZSBsYXlvdXQiLg0KDQpUaGUgb25seSBsYXlvdXQgdHlwZSB0aGF0IHNl
dHMgdGhlIFBORlNfTEFZT1VUUkVUX09OX1NFVEFUVFIgZmxhZyBpcw0Kb2JqZWN0cywgc28gdGhh
dCBxdWVzdGlvbiBuZWVkcyB0byBiZSBkaXJlY3RlZCB0byBCb2F6Lg0KDQotLSANClRyb25kIE15
a2xlYnVzdA0KTGludXggTkZTIGNsaWVudCBtYWludGFpbmVyDQoNCk5ldEFwcA0KVHJvbmQuTXlr
bGVidXN0QG5ldGFwcC5jb20NCnd3dy5uZXRhcHAuY29tDQoNCg==

2012-08-10 02:47:09

by Peng Tao

[permalink] [raw]
Subject: Re: return layout on error, BUG/deadlock

On Fri, Aug 10, 2012 at 3:31 AM, Myklebust, Trond
<[email protected]> wrote:
> On Fri, 2012-08-10 at 00:48 +0800, Peng Tao wrote:
>> On Fri, Aug 10, 2012 at 12:37 AM, Myklebust, Trond
>> <[email protected]> wrote:
>> > On Fri, 2012-08-10 at 00:34 +0800, Peng Tao wrote:
>> >> On Fri, Aug 10, 2012 at 12:06 AM, Myklebust, Trond
>> >> <[email protected]> wrote:
>> >> > On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:
>> >> >> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond
>> >> >> <[email protected]> wrote:
>> >> >> >> -----Original Message-----
>> >> >> >> From: [email protected] [mailto:linux-nfs-
>> >> >> >> [email protected]] On Behalf Of Idan Kedar
>> >> >> >> Sent: Thursday, August 09, 2012 9:03 AM
>> >> >> >> To: Boaz Harrosh; NFS list
>> >> >> >> Cc: Benny Halevy
>> >> >> >> Subject: return layout on error, BUG/deadlock
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> As a result of some experiments, I wanted to see what happens when I
>> >> >> >> inject an error (hard coded) to the object layout driver. the patch is at the
>> >> >> >> bottom of this mail. the reason I did this is because when I inject errors in my
>> >> >> >> modified version of the object layout driver, I get the same BUG Tigran
>> >> >> >> reported about yesterday:
>> >> >> >> nfs4proc.c:6252 : BUG_ON(!list_empty(&lo->plh_segs));
>> >> >> >>
>> >> >> >> In my modified version (based on kernel 3.3), the bug seems to be that
>> >> >> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
>> >> >> >> is in-flight I/O.
>> >> >> >
>> >> >> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.
>> >> >>
>> >> >> to what change are you referring?
>> >> >
>> >> > As I stated in the changelog of the patch that I sent to the list
>> >> > yesterday, the behaviour is due to commit 0a57cdac3f.
>> >> >
>> >> >> >
>> >> >> > See the changelog in the patch that I sent to the list yesterday.
>> >> >> >
>> >> >>
>> >> >> I saw that, and if I'm not mistaken these races apply to object layout
>> >> >> as well, and in any case they apply in my case. However, it is not
>> >> >> easy to mess around with LAYOUTRETURN in object layout, and there have
>> >> >> been several discussions on the issue. In one of these discussions
>> >> >> Benny clarified that the object layout client must wait for all
>> >> >> in-flight I/O to end.
>> >> >
>> >> > If the problem is that the DS is failing to respond, how does the client
>> >> > know that the in-flight I/O has ended?
>> >> >
>> >> >> So for file layout it probably makes sense, but object layout (and if
>> >> >> I understand correctly, block layout as well) something else needs to
>> >> >> be done. I thought about sync wait when returning the layout on error,
>> >> >> but according to Boaz it will cause deadlocks (Boaz - can you please
>> >> >> elaborate?).
>> >> >
>> >> > The object layoutreturn has the ability to pass a timeout error value to
>> >> > the MDS precisely in order to allow the latter to deal with this kind of
>> >> > issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.
>> >> >
>> >> > The block layout is adding the same ability to layoutreturn in NFSv4.2
>> >> > (see draft-ietf-nfsv4-minorversion2-13.txt) via the struct
>> >> > layoutreturn_device_error4, so presumably they too have a plan for
>> >> > dealing with this kind of issue.
>> >> It is one thing to tell MDS that there is DS access error by sending
>> >> layoutreturn, and it is another thing to return a layout even if there
>> >> is overlapping in-flight DS IO...
>> >>
>> >> I certainly agree that client is entitled to return layout to inform
>> >> MDS about DS errors and also avoid possible cb_layoutrecall. But it is
>> >> just an optimization and should only be done when there is no
>> >> in-flight IO (at least for block layout) IMHO.
>> >
>> > HOW DO YOU GUARANTEE NO IN-FLIGHT IO?
>> >
>> I don't. That's why I don't return layout in pnfs_ld_write_done(). And
>> for layoutreturn upon cb_layoutreturn, block layout client needs to do
>> timed-lease IO fencing per rfc5663, but it is not implemented in Linux
>> client.
>
> The timed-lease IO fencing described in rfc5663 is about informing the
> server about how long the client expects a command to succeed or fail.
> It doesn't offer any advice for how the client is to deal with an
> unresponsive DS.
>
> What you need here is help from the underlying transport protocol. As I
> said in the email to Idan, when researching iSCSI and iFCP, I found what
> appears to be mechanisms for reliably timing out.
Just checked and found that the layoutreturn-on-error behavior only
affects object and file layout. So block layout stays out and safe.
That's all I would ask for. Thanks for your explanation.

Best,
Tao