2006-02-14 19:54:22

by Dave Kleikamp

[permalink] [raw]
Subject: Re: Additional questions on your hang

On Tue, 2006-02-14 at 18:17 +0100, Adrian Bunk wrote:
> On Tue, Feb 14, 2006 at 11:09:20AM -0600, Steve French wrote:
>
> > Adrian,
>
> Hi Steve,
>
> > If your problem turns out to be serious - I do want to find a way to
> > solve this before 2.6.16 goes out
>
> :-)
>
> > A couple of other datapoints that might be helpful ... Although we don't
> > know for sure that the hang is related to
> > SMB read (despite the EAGAIN warning logged on a read) - it might be
> > useful to know whether the problem
> > occurred when mount option "forcedirectio" is specified (since reads in
> > that case will bypass the pagecache).,
>
> With forcedirectio, it's still failing after some time (this time not
> until nearly 1 GB was transferred), but instead of killing the machine
> it's giving a useful error message ("Process mc" since I was copying
> with GNU Midnight Commander):
>
> Feb 14 18:03:16 r063144 kernel: CIFS VFS: No response to cmd 46 mid 42374
> Feb 14 18:03:16 r063144 kernel: CIFS VFS: Send error in read = -11
> Feb 14 18:03:19 r063144 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000031
> Feb 14 18:03:19 r063144 kernel: printing eip:
> Feb 14 18:03:19 r063144 kernel: c0182caa
> Feb 14 18:03:19 r063144 kernel: *pde = 00000000
> Feb 14 18:03:19 r063144 kernel: Oops: 0000 [#1]
> Feb 14 18:03:19 r063144 kernel: Modules linked in:
> Feb 14 18:03:19 r063144 kernel: CPU: 0
> Feb 14 18:03:19 r063144 kernel: EIP: 0060:[cifs_user_read+319/539] Not tainted VLI
> Feb 14 18:03:19 r063144 kernel: EFLAGS: 00210296 (2.6.16-rc3 #1)
> Feb 14 18:03:19 r063144 kernel: EIP is at cifs_user_read+0x13f/0x21b
> Feb 14 18:03:19 r063144 kernel: eax: fffffff5 ebx: 0001b746 ecx: 00000000 edx: 00000000
> Feb 14 18:03:19 r063144 kernel: esi: 00002000 edi: fffffff5 ebp: d3be5b20 esp: c842ff38
> Feb 14 18:03:19 r063144 kernel: ds: 007b es: 007b ss: 0068
> Feb 14 18:03:19 r063144 kernel: Process mc (pid: 29402, threadinfo=c842e000 task=cdbfe030)
> Feb 14 18:03:19 r063144 kernel: Stack: <0>00002000 08dd2000 00000000 c842ff70 c842ff6c c842ff68 00002000 da1725e0
> Feb 14 18:03:19 r063144 kernel: 00000000 c21f9920 c5224aa0 08184fb0 00000000 00000000 00000000 d3be5b20
> Feb 14 18:03:19 r063144 kernel: 00002000 c0182b6b 08184fb0 c013d736 c842ffa4 d3be5b20 fffffff7 08184fb0
> Feb 14 18:03:19 r063144 kernel: Call Trace:
> Feb 14 18:03:19 r063144 kernel: [cifs_user_read+0/539] cifs_user_read+0x0/0x21b
> Feb 14 18:03:19 r063144 kernel: [vfs_read+125/223] vfs_read+0x7d/0xdf
> Feb 14 18:03:19 r063144 kernel: [sys_read+60/99] sys_read+0x3c/0x63
> Feb 14 18:03:19 r063144 kernel: [sysenter_past_esp+84/117] sysenter_past_esp+0x54/0x75
> Feb 14 18:03:19 r063144 kernel: Code: 18 50 8d 44 24 20 50 8d 44 24 28 50 8b 54 24 44 89 d8 ff 72 04 ff 32 56 8b 54 24 28 e8 4b 6d ff ff 8b 54 24 34 89 c7 8b 4c 24 38 <0f> b7 42 31 8d 54 02 04 8b 44 24 2c e8 ce 97 01 00 83 c4 18 85
> Feb 14 18:03:19 r063144 kernel: <3> CIFS VFS: Send error in Close = -9

I wasn't able to trace it to an exact line of code, but I think I see
what the problem is.

This patch moves the copy_to_user from smb_read_data after the test
whether smb_read_data is null. It's good not to dereference a pointer
if you have a reason to test it for null afterward.

This patch has not been compiled or tested.

Signed-off-by: Dave Kleikamp <[email protected]>

diff -urp linux-2.6.16-rc3/fs/cifs/file.c linux/fs/cifs/file.c
--- linux-2.6.16-rc3/fs/cifs/file.c 2006-02-13 07:28:51.000000000 -0600
+++ linux/fs/cifs/file.c 2006-02-14 13:45:09.000000000 -0600
@@ -1441,14 +1441,16 @@ ssize_t cifs_user_read(struct file *file
current_read_size, *poffset,
&bytes_read, &smb_read_data,
&buf_type);
- pSMBr = (struct smb_com_read_rsp *)smb_read_data;
- if (copy_to_user(current_offset,
- smb_read_data + 4 /* RFC1001 hdr */
- + le16_to_cpu(pSMBr->DataOffset),
- bytes_read)) {
- rc = -EFAULT;
- }
if (smb_read_data) {
+ pSMBr = (struct smb_com_read_rsp *)
+ smb_read_data;
+ if (copy_to_user(current_offset,
+ smb_read_data +
+ 4 + /* RFC1001 hdr */
+ le16_to_cpu(pSMBr->DataOffset),
+ bytes_read)) {
+ rc = -EFAULT;
+ }
if(buf_type == CIFS_SMALL_BUFFER)
cifs_small_buf_release(smb_read_data);
else if(buf_type == CIFS_LARGE_BUFFER)

--
David Kleikamp
IBM Linux Technology Center


2006-02-15 22:28:00

by Adrian Bunk

[permalink] [raw]
Subject: Re: Additional questions on your hang

On Tue, Feb 14, 2006 at 01:54:11PM -0600, Dave Kleikamp wrote:
>...
> I wasn't able to trace it to an exact line of code, but I think I see
> what the problem is.
>
> This patch moves the copy_to_user from smb_read_data after the test
> whether smb_read_data is null. It's good not to dereference a pointer
> if you have a reason to test it for null afterward.
>
> This patch has not been compiled or tested.
>...

Thanks, this patch works fine for me and after disassembling it seems
you've found exactly the place where the kernel Oops'ed for me in the
forcedirectio case.

But even with this patch applied, I had a freeze in the
non-forcedirectio case, although much later than usual.
There seems to be more than one bug. :-(

I'm currently trying whether there are still any problems in the
forcedirectio case, and I'll report back as soon as I am able to provide
any additional information.

> David Kleikamp

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed