Dear Marcel and guys,
I have stock 2.6.0 kernel, fairly recent Bluez utilities and MSI USB
Bluetooth dongle. I have dund running, and if I try to start a PPP
session from a handheld shortly after the previous one has terminated, I
get kernel oops, pretty much every time.
Funny and probably unrelated thing is that the Palm (TT3) from which the
session was initiated often soft-resets itself in a minute after such
event...
Oops included (manually typed in from a photo of the screen).
Tell me if I can help in further investigation.
Eugene
=================
ksymoops 2.4.9 on i686 2.6.0. Options used
-V (default)
-k /proc/kallsyms (specified)
-l /proc/modules (default)
-o /lib/modules/2.6.0/ (default)
-m /usr/src/linux-2.6/System.map (specified)
Warning (read_ksyms): no kernel symbols in ksyms, is /proc/kallsyms a valid ksyms file?
No modules in ksyms, skipping objects
No ksyms, skipping lsmod
ds: 007b es: 007b ss: 0068
Stack: d0b04b40 00000000 00000024 00000020 c0da4000 cefb3994 c0da5f00 d0b01f53
00000000 00000000 00000020 cf822ab0 00000001 00001000 0efb21c0 00000000
00000000 c0da5d65 ffffffff c030d580 00000000 cf7670f0 00000009 00000000
Call trace:
[<d0b01f53>] l2cap_recv_frame+0x6d3/0xfb0 [l2cap]
[<c017cf7d>] proc_alloc_inode_0x4d/0x80
[<c0207128>] get_device+0x18/0x30
[<d0a58d64>] uhci_submit_common+0x214/0x2e0 [uhci_hcd]
[<c0258c1e>] hcd_submit_urb+0xfe/0x170
[<c0259685>] usb_submit_urb+0x1c5/0x240
[<c010f86a>] do_gettimeofday_0x2a/0xc0
[<d0a532b4>] hci_usb_rx_complete+0x1a4/0x690 [hci_usb]
[<d0b02da4>] l2cap_recv_acldata+0x184/0x350 [l2cap]
[<d0a856c6>] hci_rx_task+0x2d6/0x370 [bluetooth]
[<c0122d46>] tasklet_action+0x46/0x70
[<c0122b59>] do_softirq+0x99/0xa0
[<c010bb32>] do_IRQ+0x122/0x180
[<c0109e48>] common_interrupt+0x18/0x20
Code: 0f 0b 28 01 89 32 b0 d0 eb 93 0f 0b cb 01 a0 32 b0 d0 e9 67
Using defaults from ksymoops -t elf32-i386 -a i386
Trace; d0b01f53 <_end+106e1ec3/3fbddf70>
Trace; c017cf7d <proc_alloc_inode+4d/80>
Trace; c0207128 <get_device+18/30>
Trace; d0a58d64 <_end+10638cd4/3fbddf70>
Trace; c0258c1e <hcd_submit_urb+fe/170>
Trace; c0259685 <usb_submit_urb+1c5/240>
Trace; c010f86a <do_gettimeofday+2a/c0>
Trace; d0a532b4 <_end+10633224/3fbddf70>
Trace; d0b02da4 <_end+106e2d14/3fbddf70>
Trace; d0a856c6 <_end+10665636/3fbddf70>
Trace; c0122d46 <tasklet_action+46/70>
Trace; c0122b59 <do_softirq+99/a0>
Trace; c010bb32 <do_IRQ+122/180>
Trace; c0109e48 <common_interrupt+18/20>
Code; 00000000 Before first symbol
00000000 <_EIP>:
Code; 00000000 Before first symbol
0: 0f 0b ud2a
Code; 00000002 Before first symbol
2: 28 01 sub %al,(%ecx)
Code; 00000004 Before first symbol
4: 89 32 mov %esi,(%edx)
Code; 00000006 Before first symbol
6: b0 d0 mov $0xd0,%al
Code; 00000008 Before first symbol
8: eb 93 jmp ffffff9d <_EIP+0xffffff9d>
Code; 0000000a Before first symbol
a: 0f 0b ud2a
Code; 0000000c Before first symbol
c: cb lret
Code; 0000000d Before first symbol
d: 01 a0 32 b0 d0 e9 add %esp,0xe9d0b032(%eax)
Code; 00000013 Before first symbol
13: 67 00 00 addr16 add %al,(%bx,%si)
<0>Kernel panic: Fatal exception in interrupt
1 warning issued. Results may not be reliable.
Hi Eugene,
> > actually it is not a wrong owner entry. But the RFCOMM core layer have
> > to increment the module reference count of the L2CAP module, when we
> > accept a new incoming L2CAP connection. I am not 100% that this is the
> > right way to fix this problem, but for me it works. Please check if the
> > attached patch helps and don't breaks any other of your setups.
>
> Cool! At least for me, everything works like a charm.
> Thanks a lot, Marcel!
you are welcome. But the thanks go to you, because you found the bug and
kept on investigating.
Now the inquiry problem after boot is the only leftover bug for now ;)
Regards
Marcel
-------------------------------------------------------
This SF.net email is sponsored by: Perforce Software.
Perforce is the Fast Software Configuration Management System offering
advanced branching capabilities and atomic changes on 50+ platforms.
Free Eval! http://www.perforce.com/perforce/loadprog.html
_______________________________________________
Bluez-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bluez-devel
On Tue, 2004-01-13 at 22:33, Marcel Holtmann wrote:
> actually it is not a wrong owner entry. But the RFCOMM core layer have
> to increment the module reference count of the L2CAP module, when we
> accept a new incoming L2CAP connection. I am not 100% that this is the
> right way to fix this problem, but for me it works. Please check if the
> attached patch helps and don't breaks any other of your setups.
Cool! At least for me, everything works like a charm.
Thanks a lot, Marcel!
Eugene
Hi Eugene,
> This means that the termination of an incoming RFCOMM connections
> decrements the reference counting of the L2CAP module. This should not
> happen and I think we have a wrong owner entry somewhere in the RFCOMM
> socket code, but at the moment I don't know where.
actually it is not a wrong owner entry. But the RFCOMM core layer have
to increment the module reference count of the L2CAP module, when we
accept a new incoming L2CAP connection. I am not 100% that this is the
right way to fix this problem, but for me it works. Please check if the
attached patch helps and don't breaks any other of your setups.
Regards
Marcel
Hi Eugene,
> while right now I only have one host where l2test can be run, I tried
> it:
>
> l2test -s 00:01:02:03:04:05
> - l2cap count increased by 2, then I get "Host is down" and count
> decreased by 2
>
> l2test -r
> - count increased by 3, program waits. When I hit ctrl-break, count
> decreased by 3
ok, I ran my own tests and found that the L2CAP module is working fine
and this explains why I never got this problem while working on HID.
Even the PAN and CMTP module do correct reference counting, but if you
use RFCOMM on top of L2CAP and got an incoming RFCOMM connection the
refcounting gets wrong.
This means that the termination of an incoming RFCOMM connections
decrements the reference counting of the L2CAP module. This should not
happen and I think we have a wrong owner entry somewhere in the RFCOMM
socket code, but at the moment I don't know where.
Regards
Marcel
-------------------------------------------------------
This SF.net email is sponsored by: Perforce Software.
Perforce is the Fast Software Configuration Management System offering
advanced branching capabilities and atomic changes on 50+ platforms.
Free Eval! http://www.perforce.com/perforce/loadprog.html
_______________________________________________
Bluez-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bluez-devel
On Tue, 2004-01-13 at 00:35, Marcel Holtmann wrote:
> first use l2test only and don't load any other highlevel Bluetooth
> protocol. I will also try to reproduce it with my test machines. If it
> is a L2CAP only problem it is much more easier to find the bug.
while right now I only have one host where l2test can be run, I tried
it:
l2test -s 00:01:02:03:04:05
- l2cap count increased by 2, then I get "Host is down" and count
decreased by 2
l2test -r
- count increased by 3, program waits. When I hit ctrl-break, count
decreased by 3
Eugene
Hi Eugene,
> I'll ask a friend and we'll try pan during this week.
first use l2test only and don't load any other highlevel Bluetooth
protocol. I will also try to reproduce it with my test machines. If it
is a L2CAP only problem it is much more easier to find the bug.
> In the meanwhile, I tried incoming and outgoing OBEX with a cellphone.
> This is how refcounts change:
>
> incoming outgoing
> start stop start stop
>
> rfcomm +1 -1 +3 -3
> l2cap +1 -2 +2 -2
This is actually what I already thought, the problem exists only on
incoming connections. The RFCOMM refcounting is a bit more complex,
because it involves the TTY part that also modifies the refcounting. Let
us try to start on L2CAP level and then step by step narrow the problem.
Regards
Marcel
-------------------------------------------------------
This SF.net email is sponsored by: Perforce Software.
Perforce is the Fast Software Configuration Management System offering
advanced branching capabilities and atomic changes on 50+ platforms.
Free Eval! http://www.perforce.com/perforce/loadprog.html
_______________________________________________
Bluez-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bluez-devel
On Mon, 2004-01-12 at 18:15, Marcel Holtmann wrote:
> > I can try to get a second usb dongle and connect from another linux
> > machine, but that will take time to arrange.
>
> please do so.
I'll ask a friend and we'll try pan during this week.
In the meanwhile, I tried incoming and outgoing OBEX with a cellphone.
This is how refcounts change:
incoming outgoing
start stop start stop
rfcomm +1 -1 +3 -3
l2cap +1 -2 +2 -2
Eugene
Hi Eugene,
> I can try to get a second usb dongle and connect from another linux
> machine, but that will take time to arrange.
please do so.
> OTOH I could throw a few printk()s into the kernel but I don't know
> where to put them. The only place in l2cap.c that looks relevant to
> refcount management that I can see is at line 353 -
> "sk_set_owner(sk, THIS_MODULE);", but there must be other places too...
there are other places, but they are in the core socket layer.
Regards
Marcel
-------------------------------------------------------
This SF.net email is sponsored by: Perforce Software.
Perforce is the Fast Software Configuration Management System offering
advanced branching capabilities and atomic changes on 50+ platforms.
Free Eval! http://www.perforce.com/perforce/loadprog.html
_______________________________________________
Bluez-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bluez-devel
On Mon, 2004-01-12 at 16:22, Marcel Holtmann wrote:
> > Interesting that running "rfcomm bind" or "rfcomm listen" does not
> > increase l2cap's refcount. It gets increased at the moment when
> > incoming connection actually takes place. And is decreased (by 2) when
> > it terminates.
[...]
> Let us summarize this problem a little bit. We know that this is
> refcount bug and it happens in __module_get(). It also seems that we
> only get hit by this bug if we got in incoming connection. Is this
> right? Do the refcounting work correct on outgoing? My question is if
> this is a L2CAP only problem or if it only happens in conjunction with
> RFCOMM? Can you reproduce it with the l2test program only? Don't load
> the rfcomm.ko module and don't start the sdpd, so you have a better
> starting refcount.
I don't think that I have a device capable of anything but rfcomm... I
have a cellphone (can use obex that works over rfcomm) and a palm (ppp
or hotsync over rfcomm).
I'll check outgoing connections to the cellphone (obex push) and report.
I can try to get a second usb dongle and connect from another linux
machine, but that will take time to arrange.
OTOH I could throw a few printk()s into the kernel but I don't know
where to put them. The only place in l2cap.c that looks relevant to
refcount management that I can see is at line 353 -
"sk_set_owner(sk, THIS_MODULE);", but there must be other places too...
Eugene
Hi Eugene,
> Interesting that running "rfcomm bind" or "rfcomm listen" does not
> increase l2cap's refcount. It gets increased at the moment when
> incoming connection actually takes place. And is decreased (by 2) when
> it terminates.
it is ok that the L2CAP refcount is increased, because every RFCOMM
connection needs an underlaying L2CAP link. But of course the increase
and the decrease should be the same ;)
> I do not understand how refcount is maintained well enough to do further
> investigation myself. Anything I can do to help you (or anybody) to fix
> the problem?
Let us summarize this problem a little bit. We know that this is
refcount bug and it happens in __module_get(). It also seems that we
only get hit by this bug if we got in incoming connection. Is this
right? Do the refcounting work correct on outgoing? My question is if
this is a L2CAP only problem or if it only happens in conjunction with
RFCOMM? Can you reproduce it with the l2test program only? Don't load
the rfcomm.ko module and don't start the sdpd, so you have a better
starting refcount.
Regards
Marcel
-------------------------------------------------------
This SF.net email is sponsored by: Perforce Software.
Perforce is the Fast Software Configuration Management System offering
advanced branching capabilities and atomic changes on 50+ platforms.
Free Eval! http://www.perforce.com/perforce/loadprog.html
_______________________________________________
Bluez-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bluez-devel
On Sun, 2004-01-11 at 17:59, Marcel Holtmann wrote:
> > > As far as I can tell from analyzing the dump, and objdump of l2cap.o
> > > object, BUG() happens when "sk_set_owner(sk, THIS_MODULE);" in
> > > net/bluetooth/l2cap.c line 353 encounters zero refcount of the module.
> > > So, probably somebody incorrectly decreases the l2cap's refcount
> > > elsewhere...
> >
> > yes, this seems to be true (and explains the fixed number of attempts).
> > Reference count increases by one every time session starts, and
> > decreases by *two* when it is finished. Why initial refcount is 5 is
> > still unclear to me...
>
> I thought all refcount bugs were fixed, but the initial refcount is easy
> to explain. It should be raised by the RFCOMM module and the sdpd. Try
> to not load RFCOMM and disable the starting of the sdpd.
Interesting that running "rfcomm bind" or "rfcomm listen" does not
increase l2cap's refcount. It gets increased at the moment when
incoming connection actually takes place. And is decreased (by 2) when
it terminates.
I do not understand how refcount is maintained well enough to do further
investigation myself. Anything I can do to help you (or anybody) to fix
the problem?
Eugene
Hi Eugene,
> > As far as I can tell from analyzing the dump, and objdump of l2cap.o
> > object, BUG() happens when "sk_set_owner(sk, THIS_MODULE);" in
> > net/bluetooth/l2cap.c line 353 encounters zero refcount of the module.
> > So, probably somebody incorrectly decreases the l2cap's refcount
> > elsewhere...
>
> yes, this seems to be true (and explains the fixed number of attempts).
> Reference count increases by one every time session starts, and
> decreases by *two* when it is finished. Why initial refcount is 5 is
> still unclear to me...
I thought all refcount bugs were fixed, but the initial refcount is easy
to explain. It should be raised by the RFCOMM module and the sdpd. Try
to not load RFCOMM and disable the starting of the sdpd.
Regards
Marcel
-------------------------------------------------------
This SF.net email is sponsored by: Perforce Software.
Perforce is the Fast Software Configuration Management System offering
advanced branching capabilities and atomic changes on 50+ platforms.
Free Eval! http://www.perforce.com/perforce/loadprog.html
_______________________________________________
Bluez-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bluez-devel
Continuing investigation:
> As far as I can tell from analyzing the dump, and objdump of l2cap.o
> object, BUG() happens when "sk_set_owner(sk, THIS_MODULE);" in
> net/bluetooth/l2cap.c line 353 encounters zero refcount of the module.
> So, probably somebody incorrectly decreases the l2cap's refcount
> elsewhere...
yes, this seems to be true (and explains the fixed number of attempts).
Reference count increases by one every time session starts, and
decreases by *two* when it is finished. Why initial refcount is 5 is
still unclear to me...
root@pccross:/# lsmod|grep l2cap
l2cap 21952 5 rfcomm
bluetooth 44004 7 rfcomm,l2cap,hci_usb
root@pccross:/# PPP opened
root@pccross:/# lsmod|grep l2cap
l2cap 21952 6 rfcomm
bluetooth 44004 7 rfcomm,l2cap,hci_usb
root@pccross:/# PPP closed
root@pccross:/# lsmod|grep l2cap
l2cap 21952 4 rfcomm
bluetooth 44004 7 rfcomm,l2cap,hci_usb
root@pccross:/# PPP opened
root@pccross:/# lsmod|grep l2cap
l2cap 21952 5 rfcomm
bluetooth 44004 7 rfcomm,l2cap,hci_usb
root@pccross:/# PPP closed
root@pccross:/# lsmod|grep l2cap
l2cap 21952 3 rfcomm
bluetooth 44004 7 rfcomm,l2cap,hci_usb
Eugene
Marcel,
With 2.6.1 out, I did some further investigation of the oops when
establishing PPP connection. First of all, apparently oops happens on
7th attempt, after 6 successful sessions. Next, I selected console VGA
mode with smaller font, and got the full Oops dump:
http://www.average.org/btoops/
(first picture is vanilla 2.6.1, second is 2.6.1-mh1)
As far as I can tell from analyzing the dump, and objdump of l2cap.o
object, BUG() happens when "sk_set_owner(sk, THIS_MODULE);" in
net/bluetooth/l2cap.c line 353 encounters zero refcount of the module.
So, probably somebody incorrectly decreases the l2cap's refcount
elsewhere...
Any ideas on that?
Eugene
On Wed, 2004-01-07 at 12:15, Marcel Holtmann wrote:
> Edd, try this too for your lockup.
No change, sadly.
-- Edd
On Wed, 2004-01-07 at 15:15, Marcel Holtmann wrote:
> > I have stock 2.6.0 kernel, fairly recent Bluez utilities and MSI USB
> > Bluetooth dongle. I have dund running, and if I try to start a PPP
> > session from a handheld shortly after the previous one has terminated, I
> > get kernel oops, pretty much every time.
> to be sure that this is not a problem with the different proc code in
> 2.6 apply the attached patch and try again.
No change.
A small correction: oops does not happen *every* time, I have to do 5 or
6 connects/disconnects until it happens. (this must have been the same
before the patch).
I am including another oops trace (photo of the screen). Unfortunately
the top of the stack scrolls off the screen and cannot be seen. If you
need it, I might compile the kernel with framebuffer and use smaller
font.
Eugene
Hi Eugene,
> I have stock 2.6.0 kernel, fairly recent Bluez utilities and MSI USB
> Bluetooth dongle. I have dund running, and if I try to start a PPP
> session from a handheld shortly after the previous one has terminated, I
> get kernel oops, pretty much every time.
>
> Funny and probably unrelated thing is that the Palm (TT3) from which the
> session was initiated often soft-resets itself in a minute after such
> event...
>
> Oops included (manually typed in from a photo of the screen).
> Tell me if I can help in further investigation.
to be sure that this is not a problem with the different proc code in
2.6 apply the attached patch and try again.
Edd, try this too for your lockup.
Regards
Marcel