2003-08-07 14:04:59

by Bernd Schubert

Subject: [2.4.21]: nbd ksymoops-report

Hi,

Every time nbd-client disconnects an nbd device, the oops decoded below
occurs.
This has only happened since we upgraded from 2.4.20 to 2.4.21,
so I guess the update backported from 2.5.50 causes it.
Since the changelog for 2.4.22-rc1 doesn't mention any updates to nbd,
I think this also applies to that kernel version. I will check
later this evening.

ksymoops 2.4.8 on i686 2.4.21-tc2. Options used
-v /usr/src/System.maps/vmlinux__2.4.21-tc2 (specified)
-k /proc/ksyms (specified)
-l /proc/modules (default)
-o /lib/modules/2.4.21-tc2/ (default)
-m /usr/src/System.maps/System.map__2.4.21-tc2 (specified)

Aug 6 17:24:31 goedel kernel: d89e2be7
Aug 6 17:24:31 goedel kernel: Oops: 0000
Aug 6 17:24:31 goedel kernel: CPU: 0
Aug 6 17:24:31 goedel kernel: EIP: 1010:[<d89e2be7>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Aug 6 17:24:31 goedel kernel: EFLAGS: 00010282
Aug 6 17:24:31 goedel kernel: eax: 00000000 ebx: d89e43c4 ecx: 00000001 edx: 00000001
Aug 6 17:24:31 goedel kernel: esi: 00000000 edi: d89e43a0 ebp: 00000000 esp: d61a5f14
Aug 6 17:24:31 goedel kernel: ds: 1018 es: 1018 ss: 1018
Aug 6 17:24:31 goedel kernel: Process nbd-client (pid: 650, stackpage=d61a5000)
Aug 6 17:24:31 goedel kernel: Stack: d89e367c d4cd56e0 00000400 0000ab03 ffffffe7 00000000 d61a4000 d7fe44fc
Aug 6 17:24:31 goedel kernel: d61a4000 00098c93 00098c94 00030002 00098c96 00098c97 00098d55 00098d56
Aug 6 17:24:31 goedel kernel: 00098d57 00098d58 00098d59 00098d5a 00098d5b 00098d5c 00098d5d 00098e1b
Aug 6 17:24:31 goedel kernel: Call Trace: [<d89e367c>] [<c0143f94>] [<c014c157>] [<c010a013>]
Aug 6 17:24:31 goedel kernel: Code: 8b 50 08 6a 03 50 8b 42 28 ff d0 c7 86 ac 43 9e d8 00 00 00


>>EIP; d89e2be7 <[nbd]nbd_ioctl+353/480> <=====

>>ebx; d89e43c4 <[nbd].data.end+a4d/96e9>
>>edi; d89e43a0 <[nbd].data.end+a29/96e9>
>>esp; d61a5f14 <_end+15e07790/185558dc>

Trace; d89e367c <[nbd]__module_license+5db/78b>
Trace; c0143f94 <blkdev_ioctl+28/34>
Trace; c014c157 <sys_ioctl+1bb/1f7>
Trace; c010a013 <system_call+33/40>

Code; d89e2be7 <[nbd]nbd_ioctl+353/480>
00000000 <_EIP>:
Code; d89e2be7 <[nbd]nbd_ioctl+353/480> <=====
0: 8b 50 08 mov 0x8(%eax),%edx <=====
Code; d89e2bea <[nbd]nbd_ioctl+356/480>
3: 6a 03 push $0x3
Code; d89e2bec <[nbd]nbd_ioctl+358/480>
5: 50 push %eax
Code; d89e2bed <[nbd]nbd_ioctl+359/480>
6: 8b 42 28 mov 0x28(%edx),%eax
Code; d89e2bf0 <[nbd]nbd_ioctl+35c/480>
9: ff d0 call *%eax
Code; d89e2bf2 <[nbd]nbd_ioctl+35e/480>
b: c7 86 ac 43 9e d8 00 movl $0x0,0xd89e43ac(%esi)
Code; d89e2bf9 <[nbd]nbd_ioctl+365/480>
12: 00 00 00



--
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: [email protected]


2003-08-07 14:48:07

by Lou Langholtz

Subject: Re: [2.4.21]: nbd ksymoops-report

Bernd Schubert wrote:

>Hi,
>
>every time when nbd-client disconnects a nbd-device the decoded oops
>from below will happen.
>This only happens after we upgraded from 2.4.20 to 2.4.21,
>so I guess the backported update from 2.5.50 causes this.
>Since the changelog for 2.4.22-rc1 doesn't describe any updates to nbd,
>I think this will be also valid for this kernel version. I will check this
>later on this evening.
>
>
>. . .
>
>
I've also seen oopses on nbd disconnect in 2.4 (using the nbd driver
distributed with the standard Linux kernel) when some blocks were still
being flushed. I don't know of any back-ports to 2.4 of the nbd fixes
I've been introducing into the 2.5+ kernels, and I have no idea what
could have changed between 2.4.20 and 2.4.21 to cause the difference
you've seen (unless you simply never tried disconnecting while blocks
still had to be flushed). But many of the nbd fixes going into 2.5+
could very well close races and eliminate oopses in 2.4 too. Exposing
these fixes in the 2.5+ kernels first has made sense, since those
kernels aren't expected to be as stable and testing there is more
acceptable, but at some point back-porting starts to make sense too.
Are we at that point yet? I don't know. Paul Clements is now the NBD
maintainer. We should see what he says (I've CC'd him on this email).

Stay in touch.

2003-08-07 16:55:07

by Paul Clements

Subject: Re: [2.4.21]: nbd ksymoops-report

On Thu, 7 Aug 2003, Bernd Schubert wrote:

> every time when nbd-client disconnects a nbd-device the decoded oops
> from below will happen.
> This only happens after we upgraded from 2.4.20 to 2.4.21,
> so I guess the backported update from 2.5.50 causes this.

Yes, it's definitely related to this...


> Aug 6 17:24:31 goedel kernel: Process nbd-client (pid: 650, stackpage=d61a5000)

Are you using the v2.0 nbd-client from nbd.sf.net?


> Code; d89e2be7 <[nbd]nbd_ioctl+353/480>
> 00000000 <_EIP>:
> Code; d89e2be7 <[nbd]nbd_ioctl+353/480> <=====
> 0: 8b 50 08 mov 0x8(%eax),%edx <=====
> Code; d89e2bea <[nbd]nbd_ioctl+356/480>
> 3: 6a 03 push $0x3
> Code; d89e2bec <[nbd]nbd_ioctl+358/480>
> 5: 50 push %eax
> Code; d89e2bed <[nbd]nbd_ioctl+359/480>
> 6: 8b 42 28 mov 0x28(%edx),%eax
> Code; d89e2bf0 <[nbd]nbd_ioctl+35c/480>
> 9: ff d0 call *%eax


This corresponds to the following source:

lo->sock->ops->shutdown(lo->sock, SEND_SHUTDOWN|RCV_SHUTDOWN);

Somehow, lo->sock is NULL here. The only way I see that this could
happen is if NBD_CLEAR_SOCK got called out of order (or you're
using some non-standard nbd-client).
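
As a cross-check of the disassembly against that source line, here is a
minimal user-space sketch. The struct names and layouts are hypothetical
stand-ins (every member is modelled as 4 bytes, as on i386), not the
kernel's definitions; they only reproduce the two displacements seen in
the oops. With eax = 00000000 in the register dump, the first load at
offset 0x8 is the one that faults.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit stand-ins, NOT the kernel's struct definitions.
 * Each member is 4 bytes wide so the offsets match the i386 oops. */
struct layout_proto_ops {
    uint32_t members_before_shutdown[10]; /* 0x00..0x27 */
    uint32_t shutdown;                    /* 0x28: mov 0x28(%edx),%eax */
};

struct layout_socket {
    uint32_t members_before_ops[2];       /* 0x00..0x07 */
    uint32_t ops;                         /* 0x08: mov 0x8(%eax),%edx */
};

/* Flag values assumed from the kernel headers; their OR matches the
 * "push $0x3" in the decoded code. */
enum { LAYOUT_RCV_SHUTDOWN = 1, LAYOUT_SEND_SHUTDOWN = 2 };

/* Displacement of the first load, taken relative to lo->sock (eax). */
size_t layout_ops_offset(void)
{
    return offsetof(struct layout_socket, ops);
}

/* Displacement of the second load, taken relative to sock->ops (edx). */
size_t layout_shutdown_offset(void)
{
    return offsetof(struct layout_proto_ops, shutdown);
}

/* The "how" argument pushed before the indirect call. */
unsigned layout_shutdown_how(void)
{
    return (unsigned)(LAYOUT_SEND_SHUTDOWN | LAYOUT_RCV_SHUTDOWN);
}
```

The three helpers return 0x8, 0x28 and 0x3, which line up with the three
operands in the decoded instructions.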

I guess it would be best to protect the NULLing of lo->sock
in NBD_CLEAR_SOCK just in case, anyway.
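
To illustrate what protecting the NULLing could look like, here is a
user-space sketch using a pthread mutex. It is not the contents of any
patch in this thread, and all names (guard_sock, guard_nbd_device,
sock_lock and the two functions) are hypothetical; the kernel driver
would use its own locking primitives.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket; counts shutdown requests. */
struct guard_sock {
    int shutdown_calls;
};

/* Hypothetical stand-in for the nbd device state. */
struct guard_nbd_device {
    pthread_mutex_t sock_lock;   /* assumed lock guarding ->sock */
    struct guard_sock *sock;
};

/* Models the shutdown step: touch ->sock only while holding the lock,
 * and only if it is still set. Returns 1 if a shutdown was issued,
 * 0 if the socket had already been cleared. */
int guard_shutdown_sock(struct guard_nbd_device *lo)
{
    int issued = 0;

    pthread_mutex_lock(&lo->sock_lock);
    if (lo->sock != NULL) {
        lo->sock->shutdown_calls++;
        issued = 1;
    }
    pthread_mutex_unlock(&lo->sock_lock);
    return issued;
}

/* Models the clearing step: NULL ->sock under the same lock, so it
 * cannot slip in between the check and the use above. */
void guard_clear_sock(struct guard_nbd_device *lo)
{
    pthread_mutex_lock(&lo->sock_lock);
    lo->sock = NULL;
    pthread_mutex_unlock(&lo->sock_lock);
}
```

In this model, a shutdown requested after the socket has been cleared
becomes a harmless no-op instead of a NULL dereference.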

Would you be willing to test a patch against 2.4.21?

--
Paul

2003-08-07 17:40:37

by Lou Langholtz

Subject: Re: [2.4.21]: nbd ksymoops-report

Paul Clements wrote:

>On Thu, 7 Aug 2003, Bernd Schubert wrote:
>
>
>
>>every time when nbd-client disconnects a nbd-device the decoded oops
>>from below will happen.
>>This only happens after we upgraded from 2.4.20 to 2.4.21,
>>so I guess the backported update from 2.5.50 causes this.
>>
>>
>
>Yes, it's definitely related to this...
>
>
>
>
>>Aug 6 17:24:31 goedel kernel: Process nbd-client (pid: 650, stackpage=d61a5000)
>>
>>
>
>Are you using the v2.0 nbd-client from nbd.sf.net?
>
>
>
>
>>Code; d89e2be7 <[nbd]nbd_ioctl+353/480>
>>00000000 <_EIP>:
>>Code; d89e2be7 <[nbd]nbd_ioctl+353/480> <=====
>> 0: 8b 50 08 mov 0x8(%eax),%edx <=====
>>Code; d89e2bea <[nbd]nbd_ioctl+356/480>
>> 3: 6a 03 push $0x3
>>Code; d89e2bec <[nbd]nbd_ioctl+358/480>
>> 5: 50 push %eax
>>Code; d89e2bed <[nbd]nbd_ioctl+359/480>
>> 6: 8b 42 28 mov 0x28(%edx),%eax
>>Code; d89e2bf0 <[nbd]nbd_ioctl+35c/480>
>> 9: ff d0 call *%eax
>>
>>
>
>
>This corresponds to the following source:
>
>lo->sock->ops->shutdown(lo->sock, SEND_SHUTDOWN|RCV_SHUTDOWN);
>
>Somehow, lo->sock is NULL here. The only way I see that this could
>happen is if NBD_CLEAR_SOCK got called out of order (or you're
>using some non-standard nbd-client).
>
The out-of-order problem is due to "nbd-client -d" (the disconnect
thread) winning a race with "nbd-client": it sets sock = NULL after
nbd_do_it has returned but before NBD_DO_IT enters its down'd region
and calls shutdown. This is the hazardous race that I was having a hard
time remembering and explaining before, and it needs locking as well.
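
The interleaving can be replayed deterministically in user space. This
is only a sketch of the ordering described above: the race_* names are
hypothetical, and where the driver would dereference a NULL sock the
model returns RACE_WOULD_OOPS instead of crashing.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the socket and the nbd device state. */
struct race_sock {
    int shutdown_calls;
};

struct race_nbd_device {
    struct race_sock *sock;
};

enum race_result { RACE_SHUTDOWN_OK, RACE_WOULD_OOPS };

/* Step taken by "nbd-client -d": clear the socket pointer. */
static void race_disconnect_clears_sock(struct race_nbd_device *lo)
{
    lo->sock = NULL;
}

/* Step taken on the NBD_DO_IT path after nbd_do_it has returned.
 * The quoted source line uses lo->sock without a check; the model
 * reports the NULL case rather than faulting. */
static enum race_result race_do_it_shutdown(struct race_nbd_device *lo)
{
    if (lo->sock == NULL)
        return RACE_WOULD_OOPS;
    lo->sock->shutdown_calls++;
    return RACE_SHUTDOWN_OK;
}

/* Replays one ordering of the two steps. If disconnect_wins is
 * nonzero, the clear runs before the shutdown call, which is the
 * ordering that produces the oops. */
enum race_result race_replay(int disconnect_wins)
{
    struct race_sock s = { 0 };
    struct race_nbd_device lo = { &s };

    if (disconnect_wins)
        race_disconnect_clears_sock(&lo);
    return race_do_it_shutdown(&lo);
}
```

race_replay(0) models the ordering that works, and race_replay(1) the
ordering in which the disconnect side wins.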

2003-08-07 17:36:27

by Paul Clements

Subject: Re: [2.4.21]: nbd ksymoops-report

Paul Clements wrote:
>
> On Thu, 7 Aug 2003, Bernd Schubert wrote:
>
> > every time when nbd-client disconnects a nbd-device the decoded oops
> > from below will happen.
> > This only happens after we upgraded from 2.4.20 to 2.4.21,
> > so I guess the backported update from 2.5.50 causes this.

[snip]

> This corresponds to the following source:
>
> lo->sock->ops->shutdown(lo->sock, SEND_SHUTDOWN|RCV_SHUTDOWN);
>
> Somehow, lo->sock is NULL here. The only way I see that this could

Alright, looking back over the nbd-client source I now see what's going
on. You're calling "nbd-client -d" to manually disconnect?


> Would you be willing to test a patch against 2.4.21?

If you're willing to test the attached patch, I'd be grateful. Otherwise
I'll test it in the next few days and forward on to Marcelo...


Thanks,
Paul


Attachments:
nbd_sock_null_race_fix_2_4_21.diff (1.07 kB)

2003-08-07 18:41:11

by Bernd Schubert

Subject: Re: [2.4.21]: nbd ksymoops-report

Hello!

Yes, we are using the nbd-client from sf.net (because of other problems we
replaced Debian's (non-standard) sf.net binary with one we compiled ourselves).

On Thursday 07 August 2003 19:34, you wrote:
> Paul Clements wrote:
> > On Thu, 7 Aug 2003, Bernd Schubert wrote:
> > > every time when nbd-client disconnects a nbd-device the decoded oops
> > > from below will happen.
> > > This only happens after we upgraded from 2.4.20 to 2.4.21,
> > > so I guess the backported update from 2.5.50 causes this.
>
> [snip]
>
> > This corresponds to the following source:
> >
> > lo->sock->ops->shutdown(lo->sock, SEND_SHUTDOWN|RCV_SHUTDOWN);
> >
> > Somehow, lo->sock is NULL here. The only way I see that this could
>
> Alright, looking back over the nbd-client source I now see what's going
> on. You're calling "nbd-client -d" to manually disconnect?

The Debian /etc/init.d/nbd-client script calls this when stopping nbd.
To make nbd work again after this oops we now always need to reboot (I found
this out after my first mail), so I'm really looking for an alternative way
of stopping nbd. Would 'killall nbd-client' work?

>
> > Would you be willing to test a patch against 2.4.21?
>
> If you're willing to test the attached patch, I'd be grateful. Otherwise
> I'll test it in the next few days and forward on to Marcelo...

I will first test it at home. Unfortunately my laptop is in for repair at IBM,
so I can only use nbd via localhost.
If there is a way to prevent the reboot of the client, I can test it on monday
on our cluster at work.

Thanks a lot for your very fast help. Since we are using nbd to have a
fallback server of our main server, we really need a working solution.


Thanks again and best regards,
Bernd

--
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: [email protected]

2003-08-07 18:47:22

by Paul Clements

Subject: Re: [2.4.21]: nbd ksymoops-report

Bernd Schubert wrote:

> The Debian /etc/init.d/nbd-client script calls this when stopping nbd.
> To make nbd work again after this oops we now always need to reboot (I found
> this out after my first mail), so I'm really looking for an alternative way
> of stopping nbd. Would 'killall nbd-client' work?

Yes, "killall -9 nbd-client" would work, and would avoid this problem.
This is how I generally stop nbd-client.


> If there is a way to prevent the reboot of the client, I can test it on monday
> on our cluster at work.

With the patch, you'll no longer see this oops or need to reboot, and
"nbd-client -d" will work as intended.


--
Paul

2003-08-07 22:26:55

by Paul Clements

Subject: Re: [2.4.21]: nbd ksymoops-report

Paul Clements wrote:
>
> Paul Clements wrote:
> >
> > On Thu, 7 Aug 2003, Bernd Schubert wrote:
> >
> > > every time when nbd-client disconnects a nbd-device the decoded oops
> > > from below will happen.
> > > This only happens after we upgraded from 2.4.20 to 2.4.21,
> > > so I guess the backported update from 2.5.50 causes this.

[snip]

> > Would you be willing to test a patch against 2.4.21?
>
> If you're willing to test the attached patch, I'd be grateful. Otherwise
> I'll test it in the next few days and forward on to Marcelo...

OK, the previous patch didn't quite do it. The attached should work (I
got a chance to test it, finally).

Thanks,
Paul


Attachments:
nbd_sock_null_race_fix_2_4_21-2.diff (1.82 kB)

2003-08-08 13:10:51

by Bernd Schubert

Subject: Re: [2.4.21]: nbd ksymoops-report

On Friday 08 August 2003 00:25, you wrote:
> Paul Clements wrote:
> > Paul Clements wrote:
> > > On Thu, 7 Aug 2003, Bernd Schubert wrote:
> > > > every time when nbd-client disconnects a nbd-device the decoded oops
> > > > from below will happen.
> > > > This only happens after we upgraded from 2.4.20 to 2.4.21,
> > > > so I guess the backported update from 2.5.50 causes this.
>
> [snip]
>
> > > Would you be willing to test a patch against 2.4.21?
> >
> > If you're willing to test the attached patch, I'd be grateful. Otherwise
> > I'll test it in the next few days and forward on to Marcelo...
>
> OK, the previous patch didn't quite do it. The attached should work (I
> got a chance to test it, finally).

Hello Paul,

I just tested the patch, and now 'nbd-client -d device' works fine! When I'm
back at work I will update our nbd clients to the new module. (Now that you
have told me that 'kill -9 pid' works even with the old module, that won't be
a problem.)


Thanks a lot,
Bernd