2011-02-26 02:42:29

by Boaz Harrosh

[permalink] [raw]
Subject: Client still crashing on 2nd callback after NFS4_ERR_DELAY

I get the below crash on the 2nd recall after the client returned
NFS_ERR_DELAY, and the server sends a retry.

602678a8: [<6001585b>] panic_exit+0x2f/0x45
602678c8: [<60049a3e>] notifier_call_chain+0x32/0x5e
60267908: [<60049a8c>] atomic_notifier_call_chain+0x13/0x15
60267918: [<601b3b9b>] panic+0x105/0x1dc
602679c8: [<601b6056>] _raw_spin_unlock_irqrestore+0x18/0x1c
602679e8: [<60016e5f>] free_irqs+0x74/0xde
60267a18: [<60015162>] relay_signal+0x38/0x79
60267a28: [<60012cef>] sigio_handler+0x5a/0x5f
60267a48: [<600224d0>] sig_handler_common+0x84/0x98
60267a68: [<6002257d>] real_alarm_handler+0x3c/0x3e
60267af0: [<6012fc02>] radix_tree_delete+0x101/0x1d7
60267b38: [<6002252f>] unblock_signals+0x4b/0x5d
60267b78: [<60022616>] sig_handler+0x30/0x3b
60267b98: [<60022848>] handle_signal+0x6d/0xa3
60267be8: [<600241c8>] hard_handler+0x10/0x14
60267ca8: [<7c1d5dfb>] do_callback_layoutrecall+0x457/0x993 [nfs]


(gdb) list *(do_callback_layoutrecall+0x457)
0x2de1f is in do_callback_layoutrecall (/usr0/export/dev/bharrosh/git/pub/linux-pnfs/fs/nfs/callback_proc.c:282).
277 dprintk("%s %d\n", __func__, __LINE__);
278 spin_unlock(&clp->cl_lock);
279 dprintk("%s %d\n", __func__, __LINE__);
280
281 BUG_ON(!lo);
282 BUG_ON(!lo->plh_inode);
283 spin_lock(&lo->plh_inode->i_lock);
284 dprintk("%s %d\n", __func__, __LINE__);
285 if (rv == NFS4_OK) {
286 lo->plh_block_lgets++;

Prints and BUG_ONs added by me.

Any body wants to fix this. How would lo->plh_inode be null, and why only the
second time.

Thanks
Boaz


2011-02-27 22:09:04

by Fred Isaman

[permalink] [raw]
Subject: Re: [PATCH] SQUASHME: pnfs: FIX stupid recall_layout BUG

On Sun, Feb 27, 2011 at 7:04 AM, Boaz Harrosh <[email protected]> wrote:
>
> OK, who wrote this code? He did not stop to think for even a
> second. And surely it was never tested, since it is 100%
> repeatable. Smack yourself on the head!
>

I wrote the code. But note that the tree you are using is taking the
tested, reasonably mature code that is currently in the downstream
kernel and reverting it back to ancient buggy prototype code. My
understanding was that Benny was rightly ripping out the reversions at
bakeathon.

Fred

2011-02-27 15:04:24

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH] SQUASHME: pnfs: FIX stupid recall_layout BUG


OK, who wrote this code? He did not stop to think for even a
second. And surely it was never tested, since it is 100%
repeatable. Smack yourself on the head!

Simple: layout is in use, a recall comes in. _DELAY is returned
meanwhile, a flag is set and no new IOs are allowed. The last
IO dereference the layout, there are no more layouts. The server
persists, sends a second recall, boom below code crash.

* While at it fix an if();else {} style violation
* The added BUG_ON is important because the address taken from
a member of a null pointer is not null, and the crash is not
at all clean.

It no longer crashes my trunc_test, Now I need to verify the recall
is actually honoured.

Signed-off-by: Boaz Harrosh <Boaz Harrosh [email protected]>
---
fs/nfs/callback_proc.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index 12ab7b3..34aee16 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -252,9 +252,9 @@ static int initiate_layout_draining(struct pnfs_cb_lrecall_info *cb_info)
if (nfs_compare_fh(&args->cbl_fh,
&NFS_I(lo->plh_inode)->fh))
continue;
- if (test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags))
+ if (test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags)) {
rv = NFS4ERR_DELAY;
- else {
+ } else {
/* FIXME I need to better understand igrab and
* does having a layout ref keep ino around?
* It should.
@@ -270,6 +270,11 @@ static int initiate_layout_draining(struct pnfs_cb_lrecall_info *cb_info)
}
spin_unlock(&clp->cl_lock);

+ if (rv == NFS4ERR_NOMATCHING_LAYOUT)
+ /* Nothing found */
+ return rv;
+
+ BUG_ON(!lo->plh_inode);
spin_lock(&lo->plh_inode->i_lock);
if (rv == NFS4_OK) {
lo->plh_block_lgets++;
--
1.7.3.4



2011-02-28 18:24:40

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH] SQUASHME: pnfs: FIX stupid recall_layout BUG

On 02/28/2011 12:09 AM, Fred Isaman wrote:
> On Sun, Feb 27, 2011 at 7:04 AM, Boaz Harrosh <[email protected]> wrote:
>>
>> OK, who wrote this code? He did not stop to think for even a
>> second. And surely it was never tested, since it is 100%
>> repeatable. Smack yourself on the head!
>>
>
> I wrote the code. But note that the tree you are using is taking the
> tested, reasonably mature code that is currently in the downstream
> kernel and reverting it back to ancient buggy prototype code. My
> understanding was that Benny was rightly ripping out the reversions at
> bakeathon.
>

OK that figures then. I was using the only tree I could use,
I guess I'll have to wait out until things settle.

> Fred

However with the tree released by benny few hours ago I get
a new crash. The uml backtrace is really bad I'll try to print
out more context to see where was the last put done.
This time it takes a long time to trigger. It feels like
a race.

Call Trace:
602678a8: [<6001585b>] panic_exit+0x2f/0x45
602678c8: [<60049a3e>] notifier_call_chain+0x32/0x5e
60267908: [<60049a8c>] atomic_notifier_call_chain+0x13/0x15
60267918: [<601b3b9b>] panic+0x105/0x1dc
602679c8: [<601b6056>] _raw_spin_unlock_irqrestore+0x18/0x1c
602679e8: [<60016e5f>] free_irqs+0x74/0xde
60267a18: [<60015162>] relay_signal+0x38/0x79
60267a28: [<60012cef>] sigio_handler+0x5a/0x5f
60267a48: [<600224d0>] sig_handler_common+0x84/0x98
60267a68: [<6002257d>] real_alarm_handler+0x3c/0x3e
60267af0: [<60186de1>] tcp_rcv_established+0x107/0x5fa
60267b78: [<60022616>] sig_handler+0x30/0x3b
60267b98: [<60022848>] handle_signal+0x6d/0xa3
60267be8: [<600241c8>] hard_handler+0x10/0x14
60267ca8: [<7b9cdaa3>] destroy_layout_hdr+0x33/0x52 [nfs]

(gdb) list *(destroy_layout_hdr+0x33)
0x2eac7 is in destroy_layout_hdr (/usr0/export/dev/bharrosh/git/pub/linux-pnfs/fs/nfs/pnfs.c:262).
257
258 static void
259 destroy_layout_hdr(struct pnfs_layout_hdr *lo)
260 {
261 dprintk("%s: freeing layout cache %p\n", __func__, lo);
262 BUG_ON(!list_empty(&lo->plh_layouts));
263 NFS_I(lo->plh_inode)->layout = NULL;
264 pnfs_free_layout_hdr(lo);
265 }
266