2009-09-11 07:53:58

by Lan Zhu

Subject: kernel panic happens when disconnecting Bluetooth headset

Marcel or others,

We ran into an issue where the kernel panics when disconnecting certain
Bluetooth headsets. We did some analysis and made changes to the kernel
code that avoid the panic. Could you please check whether our analysis
and fix are correct?

=============
Issue description
=============
On the Android platform (kernel 2.6.29), disconnecting a Bluetooth headset
may cause a kernel panic under certain conditions.

(Precondition: Android is paired with the headset.)
Connection initiated from Android, disconnected from Android: OK.
Connection initiated from Android, disconnected from the headset: OK.
Connection initiated from the headset, disconnected from the headset: OK.
Connection initiated from the headset, disconnected from Android,
Motorola H12 headset: OK.
Connection initiated from the headset, disconnected from Android,
Motorola H620/H560 headset: kernel panic.

=============
Kernel panic point
=============
The kernel panics in __list_del() called from rfcomm_session_del(). The
panic reason is "Unable to handle kernel paging request at virtual
address 00200200".

=============
Kernel log analysis
=============
rfcomm_session_del() is called a second time after the session entry
has already been removed from the list, so __list_del() dereferences
the poisoned list pointers and the kernel panics. This happens when
rfcomm_recv_ua() is called while the socket state is BT_CLOSED, so we
need to find out why the socket state becomes BT_CLOSED before
rfcomm_recv_ua() is called.

# [ 171.677429] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
[ 171.683532] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 14
[ 171.689422] rfcomm_process_rx: session cc751be0 state 1 qlen 0
[ 171.695709] rfcomm_process_rx: @@@ @@@ sk_state = 1
[ 171.701110] rfcomm_process_dlcs: session cc751be0 state 1
[ 171.706939] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40 rx_credits 33 tx_credits 31
[ 171.715515] rfcomm_send_frame: session cc751be0 len 18
[ 171.721130] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
[ 174.127807] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
[ 174.134490] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 6
[ 174.141540] rfcomm_process_rx: session cc751be0 state 1 qlen 0
[ 174.148498] rfcomm_process_rx: @@@ @@@ sk_state = 1
[ 174.154968] rfcomm_process_dlcs: session cc751be0 state 1
[ 174.161437] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40 rx_credits 33 tx_credits 30
[ 174.171173] rfcomm_send_frame: session cc751be0 len 10
[ 174.177642] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
[ 174.205932] rfcomm_sock_release: sock ce9a0960, sk cc5c4c00
[ 174.212707] rfcomm_sock_shutdown: sock ce9a0960, sk cc5c4c00
[ 174.220031] __rfcomm_sock_close: sk cc5c4c00 state 1 socket ce9a0960
[ 174.227508] __rfcomm_dlc_close: dlc cd3fe920 state 1 dlci 20 err 0 session cc751be0
[ 174.236877] rfcomm_send_disc: cc751be0 dlci 20
[ 174.242706] rfcomm_send_frame: session cc751be0 len 4
[ 174.248962] rfcomm_dlc_set_timer: dlc cd3fe920 state 8 timeout 2560
[ 174.256835] rfcomm_sock_kill: sk cc5c4c00 state 1 refcnt 2
[ 174.263336] rfcomm_sock_destruct: sk cc5c4c00 dlc cd3fe920
[ 174.399444] rfcomm_l2data_ready: ccf70400 bytes 4
[ 174.404724] rfcomm_process_rx: session cc751be0 state 1 qlen 1
[ 174.411010] rfcomm_process_rx: @@@ @@@ sk_state = 1
[ 174.416412] rfcomm_recv_ua: session cc751be0 state 1 dlci 20
[ 174.422515] __rfcomm_dlc_close: dlc cd3fe920 state 9 dlci 20 err 0 session cc751be0
[ 174.430816] rfcomm_dlc_clear_timer: dlc cd3fe920 state 9
[ 174.436553] rfcomm_dlc_unlink: dlc cd3fe920 refcnt 1 session cc751be0
[ 174.443572] rfcomm_dlc_free: cd3fe920
[ 174.447570] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
[ 174.454528] rfcomm_send_disc: cc751be0 dlci 0
[ 174.459259] rfcomm_send_frame: session cc751be0 len 4
[ 174.464904] rfcomm_process_dlcs: session cc751be0 state 8
[ 174.470703] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
[ 174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
[ 174.903442] rfcomm_l2state_change: ccf70400 state 9
[ 174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
[ 174.915130] rfcomm_process_rx: @@@ @@@ sk_state = 9
[ 174.920562] rfcomm_recv_ua: session cc751be0 state 8 dlci 0
[ 174.926574] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
[ 174.933532] rfcomm_process_rx: @@@ @@@ sk_state == BT_CLOSED , s->initiator=0
[ 174.941253] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
[ 174.948211] rfcomm_session_del: session cc751be0 state 8
[ 174.953918] @@@@ in rfcomm_session_del()
[ 174.958312] @@@@ s->list = cc751be0
[ 174.962097] @@@@ s->list.next = ccbfe9a0
[ 174.966369] @@@@ s->list.prev = c047d524
[ 174.970733] @@@@ list is valid, call list_del()
[ 174.975646] @@@@ after list_del()
[ 174.979278] @@@@ s->list = cc751be0
[ 174.983184] @@@@ s->list.next = 00100100
[ 174.987457] @@@@ s->list.prev = 00200200
[ 174.991729] rfcomm_session_close: session cc751be0 state 8 err 104
[ 174.998504] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
[ 175.005310] rfcomm_session_del: session cc751be0 state 9
[ 175.011169] @@@@ in rfcomm_session_del()
[ 175.015441] @@@@ s->list = cc751be0
[ 175.019409] @@@@ s->list.next = 00100100
[ 175.023651] @@@@ s->list.prev = 00200200
[ 175.027923] @@@@ list is valid, call list_del()
[ 175.032958] Unable to handle kernel paging request at virtual address 00200200
[ 175.040679] pgd = c0004000
[ 175.043792] [00200200] *pgd=00000000
[ 175.047821] Internal error: Oops: 817 [#1]
[ 175.052246] Modules linked in:
[ 175.055725] CPU: 0 Not tainted (2.6.29-omap1-dirty #34)
[ 175.061859] PC is at rfcomm_session_del+0x6c/0x108
[ 175.067047] LR is at release_console_sem+0x190/0x1a0
[ 175.072509] pc : [<c033ded8>] lr : [<c0066308>] psr: 60000013
[ 175.072509] sp : cc1abf38 ip : cc1abe68 fp : cc1abf4c
[ 175.084960] r10: cc751c04 r9 : c036d2fc r8 : cc751be0
[ 175.090545] r7 : 00000068 r6 : cc751c04 r5 : 00000009 r4 : cc751be0
[ 175.097656] r3 : 00100100 r2 : 00100100 r1 : 00200200 r0 : c0422876
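The values 00100100 and 00200200 printed after the first list_del() are list poison values: after unlinking an entry, the kernel's list_del() overwrites the entry's next/prev pointers with them so that any later use faults at a recognizable address. The sketch below is a minimal userspace re-implementation, not kernel code; init_list_head() and list_entry_is_poisoned() are hypothetical names, the poison constants are taken from the log rather than from the kernel headers, and the assert() guard is an addition the kernel does not have. It illustrates why the second rfcomm_session_del(), which calls list_del() on the same session again, ends up writing through these addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Poison values as printed in the log above (assumed to match the
 * LIST_POISON1/LIST_POISON2 definitions of this kernel build). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100u)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200u)

struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical userspace equivalent of INIT_LIST_HEAD(). */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert entry right after head, like the kernel's list_add(). */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Hypothetical helper: true if entry still carries the poison values. */
static int list_entry_is_poisoned(const struct list_head *entry)
{
    return entry->next == LIST_POISON1 && entry->prev == LIST_POISON2;
}

/* Unlink entry, then poison it, as the kernel's list_del() does. */
static void list_del(struct list_head *entry)
{
    /* Not in the kernel: catch a second delete before it writes
     * through the poison pointers (which is what produced the Oops). */
    assert(!list_entry_is_poisoned(entry));

    entry->next->prev = entry->prev;   /* __list_del(prev, next) */
    entry->prev->next = entry->next;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}
```

Without the guard, a second list_del() on the same entry would run __list_del(LIST_POISON2, LIST_POISON1) and store to those unmapped low addresses, which is consistent with the "virtual address 00200200" Oops and with r1 = 00200200 and r2/r3 = 00100100 in the register dump.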

=============
HCI log analysis
=============
Comparing the hcidump log of the correct case with that of the panic
case, we found only one difference in the message sequence. In the
panic case, the headset sends an L2CAP Disconn_Req immediately after
sending the RFCOMM UA frame to Android. We think this is what causes
the socket state to become BT_CLOSED.

Please compare the two logs and pay attention to the direction of the
last Disconn_Req (in hcidump output, "<" marks a packet sent by the
local host, i.e. Android, and ">" marks a packet received from the
headset).


Log of correct case:
----------------------------


009-09-10 15:27:28.963519 < ACL data: handle 1 flags 0x02 dlen 22
  L2CAP(d): cid 0x0047 len 18 [psm 3]
    RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
    0000: 0d 0a 2b 43 49 45 56 3a 20 37 2c 33 0d 0a ..+CIEV: 7,3..
009-09-10 15:27:28.967272 > HCI Event: Number of Completed Packets (0x13) plen
  handle 1 packets 1
009-09-10 15:27:29.243945 < ACL data: handle 1 flags 0x02 dlen 8
  L2CAP(d): cid 0x0047 len 4 [psm 3]
    RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
009-09-10 15:27:29.247363 > HCI Event: Number of Completed Packets (0x13) plen
  handle 1 packets 1
009-09-10 15:27:29.274890 > ACL data: handle 1 flags 0x02 dlen 8
  L2CAP(d): cid 0x0040 len 4 [psm 3]
    RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
009-09-10 15:27:29.296343 < ACL data: handle 1 flags 0x02 dlen 8
  L2CAP(d): cid 0x0047 len 4 [psm 3]
    RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
009-09-10 15:27:29.298480 > HCI Event: Number of Completed Packets (0x13) plen
  handle 1 packets 1
009-09-10 15:27:29.319873 > ACL data: handle 1 flags 0x02 dlen 8
  L2CAP(d): cid 0x0040 len 4 [psm 3]
    RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
009-09-10 15:27:29.320727 < ACL data: handle 1 flags 0x02 dlen 12
  L2CAP(s): Disconn req: dcid 0x0047 scid 0x0040
009-09-10 15:27:29.323474 > HCI Event: Number of Completed Packets (0x13) plen
  handle 1 packets 1
009-09-10 15:27:29.337237 > ACL data: handle 1 flags 0x02 dlen 12
  L2CAP(s): Disconn rsp: dcid 0x0047 scid 0x0040




Log of panic case:
------------------------



2009-09-10 13:34:24.020208 < ACL data: handle 1 flags 0x02 dlen 22
  L2CAP(d): cid 0x0041 len 18 [psm 3]
    RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
    0000: 0d 0a 2b 43 49 45 56 3a 20 37 2c 33 0d 0a ..+CIEV: 7,3..
2009-09-10 13:34:24.281256 > HCI Event: Number of Completed Packets (0x13) plen 5
  handle 1 packets 1
2009-09-10 13:34:24.083580 < ACL data: handle 1 flags 0x02 dlen 8
  L2CAP(d): cid 0x0041 len 4 [psm 3]
    RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
2009-09-10 13:34:24.529442 > HCI Event: Number of Completed Packets (0x13) plen 5
  handle 1 packets 1
2009-09-10 13:34:24.531914 > ACL data: handle 1 flags 0x02 dlen 8
  L2CAP(d): cid 0x0040 len 4 [psm 3]
    RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
2009-09-10 13:34:24.533135 < ACL data: handle 1 flags 0x02 dlen 8
  L2CAP(d): cid 0x0041 len 4 [psm 3]
    RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
2009-09-10 13:34:25.028649 > HCI Event: Number of Completed Packets (0x13) plen 5
  handle 1 packets 1
2009-09-10 13:34:25.032128 > ACL data: handle 1 flags 0x02 dlen 8
  L2CAP(d): cid 0x0040 len 4 [psm 3]
    RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
2009-09-10 13:34:25.032341 > ACL data: handle 1 flags 0x02 dlen 12
  L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
2009-09-10 13:34:25.032646 < ACL data: handle 1 flags 0x02 dlen 12
  L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041

=============
Analysis Result
=============
Some Bluetooth headsets, such as the Motorola H560/H620, which are based
on the BCM2044S, send an L2CAP Disconn_Req command right after sending
the RFCOMM UA frame. This L2CAP Disconn_Req moves the L2CAP socket used
by the RFCOMM session to BT_CLOSED before the UA frame has been
completely handled, which leads to the kernel panic. I think we can
ignore received RFCOMM frames if the socket state is BT_CLOSED, because
they are meaningless in that state.


=============
Changed Code
=============
We changed rfcomm_process_rx() in net/bluetooth/rfcomm/core.c to check
the socket state before handling the received frames. If the socket
state is BT_CLOSED, we don't handle any RFCOMM frames but just close
the session.

The change is as follows:

+       if (sk->sk_state != BT_CLOSED) {
        /* Get data directly from socket receive queue without copying it. */
        while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
                skb_orphan(skb);
                rfcomm_recv_frame(s, skb);
        }
-
-       if (sk->sk_state == BT_CLOSED) {
+       } else {
                if (!s->initiator)
                        rfcomm_session_put(s);

                rfcomm_session_close(s, sk->sk_err);
        }
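To check this reasoning against the log, here is a small userspace model of the session reference counting. It is a sketch under several assumptions: struct model_session and the model_* functions are hypothetical stand-ins, not kernel code; the starting count (one reference, plus one assumed to be taken by rfcomm_process_sessions() around rfcomm_process_rx()) and the hold/put inside rfcomm_session_close() are inferred from the refcnt values printed in the log; and each rfcomm_session_del() call is only counted, never freed. It replays one pass over a session whose queue holds the UA for DLCI 0, with the L2CAP socket either still open or already BT_CLOSED, using either the original code or the change above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct rfcomm_session. */
struct model_session {
    int refcnt;      /* models s->refcnt */
    bool initiator;  /* models s->initiator (0 in the log) */
    bool disconn;    /* BT_DISCONN, waiting for UA on DLCI 0 */
    int del_count;   /* times rfcomm_session_del() would run */
};

/* Stands in for rfcomm_session_del(): a second call here is the
 * double list_del() seen in the Oops. */
static void model_session_del(struct model_session *s)
{
    s->del_count++;
}

static void model_hold(struct model_session *s)
{
    s->refcnt++;
}

static void model_put(struct model_session *s)
{
    assert(s->refcnt > 0);
    if (--s->refcnt == 0)
        model_session_del(s);
}

/* rfcomm_recv_ua() on DLCI 0: in BT_DISCONN it drops a reference. */
static void model_recv_ua(struct model_session *s)
{
    if (s->disconn)
        model_put(s);
}

/* rfcomm_session_close(): assumed hold, close DLCs, put. */
static void model_session_close(struct model_session *s)
{
    model_hold(s);
    model_put(s);
}

/* l2cap_closed: Disconn_Req already moved sk_state to BT_CLOSED.
 * patched: use the proposed if/else. Returns the number of
 * rfcomm_session_del() calls. */
static int model_pass(bool l2cap_closed, bool patched)
{
    struct model_session s = { .refcnt = 1, .initiator = false,
                               .disconn = true, .del_count = 0 };

    model_hold(&s);                 /* rfcomm_process_sessions() */

    if (!patched || !l2cap_closed)
        model_recv_ua(&s);          /* dequeue and handle the UA */

    if (l2cap_closed) {             /* sk->sk_state == BT_CLOSED */
        if (!s.initiator)
            model_put(&s);
        model_session_close(&s);
    }

    /* Final put in rfcomm_process_sessions(); skipped here if the
     * session is already gone (in the kernel it would be freed). */
    if (s.del_count == 0)
        model_put(&s);

    return s.del_count;
}
```

Under these assumptions, model_pass(true, false) returns 2, matching the two rfcomm_session_del() calls in the panic log, while model_pass(false, false), model_pass(true, true) and model_pass(false, true) each return 1. The model only tracks reference counts; it does not capture other effects of skipping frames, such as the UA frame being left unprocessed in the receive queue.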




Thanks,
Zhu Lan


2009-09-23 07:22:26

by Dave Young

Subject: Re: kernel panic happens when disconnecting Bluetooth headset

On Tue, Sep 22, 2009 at 01:18:26PM -0700, Nick Pelly wrote:
> On Mon, Sep 21, 2009 at 6:29 PM, Nick Pelly wrote:
> > On Mon, Sep 21, 2009 at 5:52 PM, Nick Pelly wrote:
> >> On Mon, Sep 14, 2009 at 2:10 AM, Lan Zhu wrote:
> >>> Hi Marcel,
> >>>
> >>> 2009/9/12 Marcel Holtmann:
> >>>> Hi Zhu,
> >>>>
> >>>>> >> [...]
> >>>>> >>
> >>>>> >> The change is like below
> >>>>> >>
> >>>>> >> +       if (sk->sk_state != BT_CLOSED) {
> >>>>> >>         /* Get data directly from socket receive queue without copying it. */
> >>>>> >>         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
> >>>>> >>                 skb_orphan(skb);
> >>>>> >>                 rfcomm_recv_frame(s, skb);
> >>>>> >>         }
> >>>>> >> -
> >>>>> >> -       if (sk->sk_state == BT_CLOSED) {
> >>>>> >> +       } else {
> >>>>> >>                 if (!s->initiator)
> >>>>> >>                         rfcomm_session_put(s);
> >>>>> >>
> >>>>> >>                 rfcomm_session_close(s, sk->sk_err);
> >>>>> >>         }
> >>>>> >
> >>>>> > so I do see the issue here, but I don't agree with the fix since it
> >>>>> > changes behavior that might cause other issues. So in case the frame
> >>>>> > processing leads to sk->sk_state == BT_CLOSED we are not closing the
> >>>>> > connection anymore if we make it depend on a state before the frame
> >>>>> > processing. And nothing guarantees that rfcomm_process_rx gets scheduled
> >>>>> > again.
> >>>>> >
> >>>>> > diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
> >>>>> > index 94b3388..606143b 100644
> >>>>> > --- a/net/bluetooth/rfcomm/core.c
> >>>>> > +++ b/net/bluetooth/rfcomm/core.c
> >>>>> > @@ -1798,12 +1798,16 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)
> >>>>> >
> >>>>> >         BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->sk_receive_queue));
> >>>>> >
> >>>>> > +       rfcomm_session_hold(s);
> >>>>> > +
> >>>>> >         /* Get data directly from socket receive queue without copying it. */
> >>>>> >         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
> >>>>> >                 skb_orphan(skb);
> >>>>> >                 rfcomm_recv_frame(s, skb);
> >>>>> >         }
> >>>>> >
> >>>>> > +       rfcomm_session_put(s);
> >>>>> > +
> >>>>> >         if (sk->sk_state == BT_CLOSED) {
> >>>>> >                 if (!s->initiator)
> >>>>> >                         rfcomm_session_put(s);
> >>>>> >
> >>>>> > What does the above patch do for you? Since if I read it correctly, then
> >>>>> > the rfcomm_recv_ua causes the rfcomm_session_put to trigger the closing
> >>>>> > of the session. And then in this case it is delayed until after all
> >>>>> > frames are processed.
> >>>>> >
> >>>>> >
> >>>>>
> >>>>> I've tried your patch, but unfortunately the kernel panic still happened.
> >>>>>
> >>>>> From the log I noticed that if rfcomm_l2state_change is called before
> >>>>> rfcomm_process_rx, the kernel panic always happens.
> >>>>>
> >>>>> The lines below are from the correct log:
> >>>>>
> >>>>> [ 139.323852] rfcomm_l2data_ready: ccf70000 bytes 4
> >>>>> [ 139.346252] rfcomm_process_rx: session ccb94ce0 state 8 qlen 1
> >>>>> ...
> >>>>> [ 139.457519] rfcomm_l2state_change: ccf70000 state 9
> >>>>> (disconnect ok)
> >>>>>
> >>>>> In the above case, when process_rx runs, the code under the condition
> >>>>> "if (sk->sk_state == BT_CLOSED)" never runs.
> >>>>>
> >>>>> The lines below are from the panic log:
> >>>>>
> >>>>> [ 174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
> >>>>> [ 174.903442] rfcomm_l2state_change: ccf70400 state 9
> >>>>> [ 174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
> >>>>> ...
> >>>>> (then panic)
> >>>>>
> >>>>> In the above case, sk_state is first changed to 9 (BT_CLOSED) and only
> >>>>> then does process_rx run, so the code under "if (sk->sk_state ==
> >>>>> BT_CLOSED)" runs and calls session_put twice. I think this is the root
> >>>>> cause of the panic.
> >>>>
> >>>> I know why it happens, that is not the problem. My point is not to break
> >>>> current scheduling assumptions.
> >>>>
> >>>> So if you move the rfcomm_session_put() now at the end of the function,
> >>>> then it should be fine, right?
> >>>>
> >>>> Regards
> >>>>
> >>>> Marcel
> >>>>
> >>>>
> >>>>
> >>>
> >>> You are right. I moved the rfcomm_session_put() to the end of
> >>> rfcomm_process_rx(), and the kernel panic doesn't happen any longer.
> >>>
> >>> The changed code is as follows:
> >>>
> >>> @@ -1796,6 +1796,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
> >>>
> >>>         BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->s
> >>>
> >>> +       rfcomm_session_hold(s);
> >>> +
> >>>         /* Get data directly from socket receive queue without copying it. */
> >>>         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
> >>>                 skb_orphan(skb);
> >>> @@ -1808,6 +1810,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
> >>>
> >>>                 rfcomm_session_close(s, sk->sk_err);
> >>>         }
> >>> +
> >>> +       rfcomm_session_put(s);
> >>>  }
> >>>
> >>>  static inline void rfcomm_accept_connection(struct rfcomm_session *s)
> >>>
> >>> Please submit this change for inclusion in the BlueZ release.
> >>
> >>
> >> Unfortunately, with this change I get a panic disconnecting from
> >> Motorola H270 in the case that the headset initiated RFCOMM and we
> >> disconnect RFCOMM.
> >>
> >> Here is the hcidump:
> >>
> >> 2009-09-21 17:22:37.384811 < ACL data: handle 1 flags 0x02 dlen 22
> >>   L2CAP(d): cid 0x0041 len 18 [psm 3]
> >>     RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
> >>     0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
> >> 2009-09-21 17:22:37.502273 > HCI Event: Number of Completed Packets (0x13) plen 5
> >>   handle 1 packets 1
> >> 2009-09-21 17:22:37.788895 < ACL data: handle 1 flags 0x02 dlen 8
> >>   L2CAP(d): cid 0x0041 len 4 [psm 3]
> >>     RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
> >> 2009-09-21 17:22:37.906204 > HCI Event: Number of Completed Packets (0x13) plen 5
> >>   handle 1 packets 1
> >> 2009-09-21 17:22:37.933090 > ACL data: handle 1 flags 0x02 dlen 8
> >>   L2CAP(d): cid 0x0040 len 4 [psm 3]
> >>     RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
> >> 2009-09-21 17:22:38.636764 < ACL data: handle 1 flags 0x02 dlen 8
> >>   L2CAP(d): cid 0x0041 len 4 [psm 3]
> >>     RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
> >> 2009-09-21 17:22:38.744125 > HCI Event: Number of Completed Packets (0x13) plen 5
> >>   handle 1 packets 1
> >> 2009-09-21 17:22:38.763687 > ACL data: handle 1 flags 0x02 dlen 8
> >>   L2CAP(d): cid 0x0040 len 4 [psm 3]
> >>     RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
> >> 2009-09-21 17:22:38.783554 > ACL data: handle 1 flags 0x02 dlen 12
> >>   L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
> >> 2009-09-21 17:22:39.029526 < ACL data: handle 1 flags 0x02 dlen 12
> >>   L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
> >> 2009-09-21 17:22:39.136581 > HCI Event: Number of Completed Packets (0x13) plen 5
> >>   handle 1 packets 1
> >> 2009-09-21 17:22:41.337203 > HCI Event: Disconn Complete (0x05) plen 4
> >>   status 0x00 handle 1 reason 0x13
> >>   Reason: Remote User Terminated Connection
> >>
> >> And the panic:
> >>
> >> <7>[ 3161.665557] rfcomm:rfcomm_session_del: session c9c06ad0 state 9
> >> <7>[ 3161.671905] l2cap:l2cap_sock_release: sock cea04360, sk c97f02f8
> >> <7>[ 3161.678497] l2cap:l2cap_sock_shutdown: sock cea04360, sk c97f02f8
> >> <7>[ 3161.685028] l2cap:l2cap_sock_kill: sk c97f02f8 state 9
> >> <7>[ 3161.695587] l2cap:l2cap_sock_destruct: sk c97f02f8
> >> <4>[ 3161.700805] npelly 1911 rfcomm_process_sessions session c9c06ad0 refcnt 1802201963
> >> <7>[ 3161.709014] rfcomm:rfcomm_process_dlcs: session c9c06ad0 state 1802201963
> >> <7>[ 3161.716308] rfcomm:rfcomm_process_dlcs: session c9c06ad0 dlc 6b6b6b6b
> >> <1>[ 3161.726776] Unable to handle kernel paging request at virtual address 6b6b6b6b
> >> <1>[ 3161.734619] pgd = c0004000
> >> <1>[ 3161.737609] [6b6b6b6b] *pgd=00000000
> >> <4>[ 3161.741638] Internal error: Oops: 5 [#1] PREEMPT
> >> <4>[ 3161.746734] Modules linked in:
> >> <4>[ 3161.750213] CPU: 0    Not tainted  (2.6.29-omap1-07358-g9a3fd55-dirty #206)
> >> <4>[ 3161.757629] PC is at rfcomm_process_dlcs+0x108/0x590
> >> <4>[ 3161.762969] LR is at preempt_schedule+0x44/0x54
> >> <4>[ 3161.767852] pc : [<c03911f4>]    lr : [<c03a27c4>]    psr: 60000113
> >> <4>[ 3161.767883] sp : ccdf9e80  ip : ccdf9dd8  fp : ccdf9edc
> >> <4>[ 3161.780273] r10: 00000000  r9 : c9c06af4  r8 : c9c06ad0
> >> <4>[ 3161.786010] r7 : 00000000  r6 : c9c06ad0  r5 : c4c68680  r4 : 6b6b6b6b
> >> <4>[ 3161.792968] r3 : c9c06ae0  r2 : ccdf8000  r1 : c61a8940  r0 : 0000004c
> >> <4>[ 3161.800079] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
> >> <4>[ 3161.807983] Control: 10c5387d  Table: 86db8019  DAC: 00000017
> >> <4>[ 3161.814147]
> >> <4>[ 3161.814147] PC: 0xc0391174:
> >> [...]
> >> <4>[ 3162.973175] Backtrace:
> >> <4>[ 3162.976013] [<c03910ec>] (rfcomm_process_dlcs+0x0/0x590) from [<c03930b0>] (rfcomm_process_sessions+0x1a34/0x1a9c)
> >> <4>[ 3162.987579] [<c039167c>] (rfcomm_process_sessions+0x0/0x1a9c) from [<c03932ec>] (rfcomm_run+0x1d4/0x2ac)
> >> <4>[ 3162.998199] [<c0393118>] (rfcomm_run+0x0/0x2ac) from [<c008e7d8>] (kthread+0x5c/0x94)
> >> <4>[ 3163.013763] [<c008e77c>] (kthread+0x0/0x94) from [<c007c998>] (do_exit+0x0/0x714)
> >>
> >>
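The refcnt and state values of 1802201963 and the dlc pointer 6b6b6b6b in the panic log are the same 32-bit pattern. The short C sketch below, which assumes the kernel was built with slab poisoning enabled (the pattern itself suggests so), checks that 1802201963 is the byte 0x6b, POISON_FREE in include/linux/poison.h, repeated four times. Read that way, rfcomm_process_sessions() and rfcomm_process_dlcs() were walking a session that had already been freed. The helper names are illustrative only, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Byte that slab debugging writes over freed objects (POISON_FREE in
 * include/linux/poison.h). */
#define POISON_FREE 0x6b

/* Hypothetical helper: the 32-bit word a field of a freed, poisoned
 * object reads back as. */
static uint32_t poisoned_word(void)
{
	return (uint32_t)POISON_FREE * 0x01010101u;
}

/* Hypothetical helper: 1 if a value read from an object looks like
 * freed-slab poison rather than live data. */
static int looks_freed(uint32_t value)
{
	return value == poisoned_word();
}
```

For example, looks_freed(1802201963u) returns 1, which is consistent with the session already having been freed when rfcomm_process_dlcs() dereferenced it.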
> >> Seems like this fix avoids the panic due to calling
> >> rfcomm_session_close() on a deleted session, but does not always
> >> address the unbalanced rfcomm_session_put() which may be the root
> >> cause.
> >>
> >> Lan Zhu suspected this in the original post, and his original fix does
> >> in fact fix this panic as well as the originally reported panic,
> >> because it avoids the unbalanced rfcomm_session_put().
> >>
> >> Marcel I know you are concerned about the original fix changing
> >> scheduling assumptions, are you able to comment on this further?
> >>
> >> Are there any other suggestions for patches for this issue? I have
> >> spent the best part of the day trying to figure this one out, but the
> >> refcounting in the rfcomm core is quite subtle and I think it really
> >> needs someone familiar with the code to have a quick look and come up
> >> with the safest patch. I can run tests.
> >>
> >> In the mean time, I am doing some testing of Lan Zhu's original fix
> >> and if there are no better suggestions we will run with that one for
> >> Android.
> >>
> >> Nick
> >>
> >>
> >> Some more analysis:
> >>
> >> With the RFCOMM connection in idle there are 2 references on s->refcnt
> >>
> >> However three references are removed during disconnect with the H270
> >> - rfcomm_process_sessions() -> __rfcomm_dlc_close() -> rfcomm_dlc_unlink()
> >> - rfcomm_process_sessions() -> rfcomm_process_rx() -> rfcomm_recv_ua()
> >> with dlci = 0 and s->state = BT_DISCONN
> >> - rfcomm_process_sessions() -> rfcomm_process_rx() with sk_state =
> >> BT_CLOSED and s->initiator = 0
> >>
> >> in that order.
> >>
> >> On another headset, for example the Moto H350, we only see the first
> >> two references removed during disconnect.
> >>
> >> - rfcomm_process_sessions() -> __rfcomm_dlc_close() -> rfcomm_dlc_unlink()
> >> - rfcomm_process_sessions() -> rfcomm_process_rx() -> rfcomm_recv_ua()
> >> with dlci = 0 and s->state = BT_DISCONN
> >>
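The reference accounting above can be replayed with a toy model. This is a sketch, not the RFCOMM code: it assumes the idle count of 2 reported above, ignores the temporary hold that rfcomm_process_sessions() takes around each pass, and the struct session, session_put() and replay_disconnect() names are hypothetical. Two drops (the H350 sequence) take the count to zero exactly once; three drops (the H270 sequence) leave one put landing on a session that is already gone.

```c
#include <assert.h>

/* Toy stand-in for struct rfcomm_session: only the counts matter here. */
struct session {
	int refcnt;     /* models s->refcnt */
	int deleted;    /* set when the count reaches zero (rfcomm_session_del) */
	int extra_puts; /* puts that arrive after the session was deleted */
};

static void session_put(struct session *s)
{
	if (s->deleted) {
		/* In the kernel this put would operate on freed memory. */
		s->extra_puts++;
		return;
	}
	assert(s->refcnt > 0);
	if (--s->refcnt == 0)
		s->deleted = 1;
}

/* Replay `puts` reference drops starting from the idle count of 2.
 * puts == 2 matches the H350 sequence, puts == 3 the H270 sequence. */
static struct session replay_disconnect(int puts)
{
	struct session s = { .refcnt = 2, .deleted = 0, .extra_puts = 0 };
	int i;

	for (i = 0; i < puts; i++)
		session_put(&s);
	return s;
}
```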
> >
> > How about this. We still call rfcomm_process_rx(), but avoid the
> > rfcomm_session_put() due to RFCOMM UA when the socket state is
> > BT_CLOSED.
> >
> > It is less invasive, so might address Marcel's concerns with regard to
> > scheduling changes.
> >
> >
> > diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
> > index c109a3a..333c6e9 100644
> > --- a/net/bluetooth/rfcomm/core.c
> > +++ b/net/bluetooth/rfcomm/core.c
> > @@ -1105,6 +1105,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
> >  		}
> >  	} else {
> >  		/* Control channel */
> > +		struct socket *sock = s->sock;
> > +		struct sock *sk = sock->sk;
> >  		switch (s->state) {
> >  		case BT_CONNECT:
> >  			s->state = BT_CONNECTED;
> > @@ -1112,7 +1114,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
> >  			break;
> >
> >  		case BT_DISCONN:
> > -			rfcomm_session_put(s);
> > +			if (sk->sk_state != BT_CLOSED)
> > +				rfcomm_session_put(s);
> >  			break;
> >  		}
> >  	}
> >
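The effect of the proposed guard can be checked with a similarly reduced model. Assumptions: the DLC reference has already been dropped, so one session reference remains; the session is in BT_DISCONN; the headset initiated the connection, so s->initiator is 0 on our side; and BT_CONNECTED, BT_DISCONN and BT_CLOSED take the values 1, 8 and 9 seen in the debug logs. recv_ua_ctrl() stands in for the control-channel branch of rfcomm_recv_ua() and process_rx_tail() for the BT_CLOSED branch of rfcomm_process_rx(); both are hypothetical simplifications.

```c
#include <assert.h>

/* Values as printed in the debug logs (socket states from bluetooth.h). */
enum { BT_CONNECTED = 1, BT_DISCONN = 8, BT_CLOSED = 9 };

struct model {
	int refcnt;        /* session references left after the DLC was unlinked */
	int session_state; /* s->state */
	int sk_state;      /* state of the underlying L2CAP socket */
	int initiator;     /* s->initiator */
};

/* Control-channel UA while disconnecting, optionally with the proposed guard. */
static void recv_ua_ctrl(struct model *m, int guarded)
{
	if (m->session_state != BT_DISCONN)
		return;
	if (!guarded || m->sk_state != BT_CLOSED)
		m->refcnt--;
}

/* BT_CLOSED branch at the end of rfcomm_process_rx(). */
static void process_rx_tail(struct model *m)
{
	if (m->sk_state == BT_CLOSED && !m->initiator)
		m->refcnt--;
}

/* One receive pass with the UA for dlci 0 queued. l2cap_closed_first models
 * the H270 ordering, where the L2CAP socket is already BT_CLOSED when the
 * UA is processed. */
static int final_refcnt(int l2cap_closed_first, int guarded)
{
	struct model m = {
		.refcnt = 1,
		.session_state = BT_DISCONN,
		.sk_state = l2cap_closed_first ? BT_CLOSED : BT_CONNECTED,
		.initiator = 0,
	};

	recv_ua_ctrl(&m, guarded);
	process_rx_tail(&m);
	return m.refcnt;
}
```

With the guard, the H270-like ordering ends at 0 instead of -1, while the H350-like ordering (socket still connected) is unchanged.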
>
> I made a minor style improvement and added commit message. Patch available from
>
> http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=1048e007842da2d6440679e1ca80f45438a6369d
>

What about following fix? (building test only)


--- linux-2.6.orig/net/bluetooth/rfcomm/core.c 2009-09-19 15:25:23.000000000 +0800
+++ linux-2.6/net/bluetooth/rfcomm/core.c 2009-09-23 15:15:30.000000000 +0800
@@ -1927,16 +1927,15 @@ static inline void rfcomm_process_sessio

 		rfcomm_session_hold(s);

-		switch (s->state) {
-		case BT_BOUND:
-			rfcomm_check_connection(s);
-			break;
-
-		default:
-			rfcomm_process_rx(s);
-			break;
+		rfcomm_check_connection(s);
+
+		if (s->state == BT_CLOSED) {
+			rfcomm_session_put(s);
+			continue;
 		}

+		rfcomm_process_rx(s);
+
 		rfcomm_process_dlcs(s);

 		rfcomm_session_put(s);

2009-09-22 20:18:26

by Nick Pelly

[permalink] [raw]
Subject: Re: kernel panic happens when disconnecting Bluetooth headset

On Mon, Sep 21, 2009 at 6:29 PM, Nick Pelly <[email protected]> wrote:
> On Mon, Sep 21, 2009 at 5:52 PM, Nick Pelly <[email protected]> wrote:
>> On Mon, Sep 14, 2009 at 2:10 AM, Lan Zhu <[email protected]> wrote:
>>> Hi Marcel,
>>>
>>> 2009/9/12 Marcel Holtmann <[email protected]>:
>>>> Hi Zhu,
>>>>
>>>>> >> We met a issue that kernel panic happens when disconnecting some kinds
>>>>> >> of Bluetooth headset, then we did some analysis and made some changes
>>>>> >> on kernel code which have avoided the panic happening. Would you
>>>>> >> please help to check if our analysis and fix is correct?
>>>>> >>
>>>>> >> =============
>>>>> >> Issue description
>>>>> >> =============
>>>>> >> On Android platform(kernel 2.6.29), disconnecting Bluetooth headset
>>>>> >> may cause kernel panic on certain conditions.
>>>>> >>
>>>>> >> (Pre-condition is android paired with headset.)
>>>>> >> Initiate the connection from android, disconnect it from android, result is OK.
>>>>> >> Initiate the connection from android, disconnect it from headset, result is OK.
>>>>> >> Initiate the connection from headset, disconnect it from headset, result is OK.
>>>>> >> Initiate the connection from headset, disconnect it from android, for
>>>>> >> Motorola H12 headset, result is OK.
>>>>> >> Initiate the connection from headset, disconnect it from android, for
>>>>> >> Motorola H620/560 headset, result is kernel panic.
>>>>> >>
>>>>> >> =============
>>>>> >> Kernel panic point
>>>>> >> =============
>>>>> >> kernel panic at __list_del() in the function rfcomm_session_del() ,
>>>>> >> panic reason is "Unable to handle kernel paging request at virtual
>>>>> >> address 00200200"
>>>>> >>
>>>>> >> =============
>>>>> >> Kernel log analysis
>>>>> >> =============
>>>>> >> rfcomm_session_del() is still called after the session entry is
>>>>> >> removed from the list. Then __list_del() will cause kernel panic
>>>>> >> because of the incorrect pointer. This situation occurs when calling
>>>>> >> rfcomm_recv_ua() when the socket state is BT_CLOSED . So we need to
>>>>> >> find out why the socket state become BT_CLOSED before we calling
>>>>> >> rfcomm_recv_ua().
>>>>> >>
>>>>> >> # [  171.677429] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>>>>> >> [  171.683532] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 14
>>>>> >> [  171.689422] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>>>>> >> [  171.695709] rfcomm_process_rx: @@@ @@@ sk_state = 1
>>>>> >> [  171.701110] rfcomm_process_dlcs: session cc751be0 state 1
>>>>> >> [  171.706939] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40 rx_credits 33 tx_credits 31
>>>>> >> [  171.715515] rfcomm_send_frame: session cc751be0 len 18
>>>>> >> [  171.721130] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>>>>> >> [  174.127807] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>>>>> >> [  174.134490] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 6
>>>>> >> [  174.141540] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>>>>> >> [  174.148498] rfcomm_process_rx: @@@ @@@ sk_state = 1
>>>>> >> [  174.154968] rfcomm_process_dlcs: session cc751be0 state 1
>>>>> >> [  174.161437] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40 rx_credits 33 tx_credits 30
>>>>> >> [  174.171173] rfcomm_send_frame: session cc751be0 len 10
>>>>> >> [  174.177642] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>>>>> >> [  174.205932] rfcomm_sock_release: sock ce9a0960, sk cc5c4c00
>>>>> >> [  174.212707] rfcomm_sock_shutdown: sock ce9a0960, sk cc5c4c00
>>>>> >> [  174.220031] __rfcomm_sock_close: sk cc5c4c00 state 1 socket ce9a0960
>>>>> >> [  174.227508] __rfcomm_dlc_close: dlc cd3fe920 state 1 dlci 20 err 0 session cc751be0
>>>>> >> [  174.236877] rfcomm_send_disc: cc751be0 dlci 20
>>>>> >> [  174.242706] rfcomm_send_frame: session cc751be0 len 4
>>>>> >> [  174.248962] rfcomm_dlc_set_timer: dlc cd3fe920 state 8 timeout 2560
>>>>> >> [  174.256835] rfcomm_sock_kill: sk cc5c4c00 state 1 refcnt 2
>>>>> >> [  174.263336] rfcomm_sock_destruct: sk cc5c4c00 dlc cd3fe920
>>>>> >> [  174.399444] rfcomm_l2data_ready: ccf70400 bytes 4
>>>>> >> [  174.404724] rfcomm_process_rx: session cc751be0 state 1 qlen 1
>>>>> >> [  174.411010] rfcomm_process_rx: @@@ @@@ sk_state = 1
>>>>> >> [  174.416412] rfcomm_recv_ua: session cc751be0 state 1 dlci 20
>>>>> >> [  174.422515] __rfcomm_dlc_close: dlc cd3fe920 state 9 dlci 20 err 0 session cc751be0
>>>>> >> [  174.430816] rfcomm_dlc_clear_timer: dlc cd3fe920 state 9
>>>>> >> [  174.436553] rfcomm_dlc_unlink: dlc cd3fe920 refcnt 1 session cc751be0
>>>>> >> [  174.443572] rfcomm_dlc_free: cd3fe920
>>>>> >> [  174.447570] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>>>>> >> [  174.454528] rfcomm_send_disc: cc751be0 dlci 0
>>>>> >> [  174.459259] rfcomm_send_frame: session cc751be0 len 4
>>>>> >> [  174.464904] rfcomm_process_dlcs: session cc751be0 state 8
>>>>> >> [  174.470703] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>>>>> >> [  174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>>>>> >> [  174.903442] rfcomm_l2state_change: ccf70400 state 9
>>>>> >> [  174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>>>>> >> [  174.915130] rfcomm_process_rx: @@@ @@@ sk_state = 9
>>>>> >> [  174.920562] rfcomm_recv_ua: session cc751be0 state 8 dlci 0
>>>>> >> [  174.926574] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>>>>> >> [  174.933532] rfcomm_process_rx: @@@ @@@ sk_state == BT_CLOSED , s->initiator=0
>>>>> >> [  174.941253] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>>>>> >> [  174.948211] rfcomm_session_del: session cc751be0 state 8
>>>>> >> [  174.953918] @@@@ in rfcomm_session_del()
>>>>> >> [  174.958312] @@@@ s->list = cc751be0
>>>>> >> [  174.962097] @@@@ s->list.next = ccbfe9a0
>>>>> >> [  174.966369] @@@@ s->list.prev = c047d524
>>>>> >> [  174.970733] @@@@ list is valid, call list_del()
>>>>> >> [  174.975646] @@@@ after list_del()
>>>>> >> [  174.979278] @@@@ s->list = cc751be0
>>>>> >> [  174.983184] @@@@ s->list.next = 00100100
>>>>> >> [  174.987457] @@@@ s->list.prev = 00200200
>>>>> >> [  174.991729] rfcomm_session_close: session cc751be0 state 8 err 104
>>>>> >> [  174.998504] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>>>>> >> [  175.005310] rfcomm_session_del: session cc751be0 state 9
>>>>> >> [  175.011169] @@@@ in rfcomm_session_del()
>>>>> >> [  175.015441] @@@@ s->list = cc751be0
>>>>> >> [  175.019409] @@@@ s->list.next = 00100100
>>>>> >> [  175.023651] @@@@ s->list.prev = 00200200
>>>>> >> [  175.027923] @@@@ list is valid, call list_del()
>>>>> >> [  175.032958] Unable to handle kernel paging request at virtual address 00200200
>>>>> >> [  175.040679] pgd = c0004000
>>>>> >> [  175.043792] [00200200] *pgd=00000000
>>>>> >> [  175.047821] Internal error: Oops: 817 [#1]
>>>>> >> [  175.052246] Modules linked in:
>>>>> >> [  175.055725] CPU: 0    Not tainted  (2.6.29-omap1-dirty #34)
>>>>> >> [  175.061859] PC is at rfcomm_session_del+0x6c/0x108
>>>>> >> [  175.067047] LR is at release_console_sem+0x190/0x1a0
>>>>> >> [  175.072509] pc : [<c033ded8>]    lr : [<c0066308>]    psr: 60000013
>>>>> >> [  175.072509] sp : cc1abf38  ip : cc1abe68  fp : cc1abf4c
>>>>> >> [  175.084960] r10: cc751c04  r9 : c036d2fc  r8 : cc751be0
>>>>> >> [  175.090545] r7 : 00000068  r6 : cc751c04  r5 : 00000009  r4 : cc751be0
>>>>> >> [  175.097656] r3 : 00100100  r2 : 00100100  r1 : 00200200  r0 : c0422876
>>>>> >>
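The 00100100 and 00200200 values printed after the first list_del() are the list poison pointers (LIST_POISON1 and LIST_POISON2) that list_del() writes into a removed entry. A second list_del() on the same entry therefore dereferences these poison values, and the oops above reports a fault at 00200200, the LIST_POISON2 value. The userspace sketch below reimplements a minimal list_head to show the poisoning; list_del_checked() is a hypothetical debug helper, not a kernel API, that refuses to unlink an entry that is already poisoned.

```c
#include <assert.h>
#include <stdint.h>

/* Values as printed in the trace; they match LIST_POISON1/LIST_POISON2 in
 * include/linux/poison.h (newer kernels may add POISON_POINTER_DELTA). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Same shape as the kernel's list_del(): unlink, then poison the entry. */
static void list_del(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

/* Hypothetical debug helper: a second list_del() on a poisoned entry would
 * write through the poison addresses, so refuse instead of crashing. */
static int list_del_checked(struct list_head *entry)
{
	if (entry->next == LIST_POISON1 || entry->prev == LIST_POISON2)
		return -1;
	list_del(entry);
	return 0;
}
```

A check on the poison values, rather than a generic "list is valid" test, is what would have caught the second rfcomm_session_del() call in the trace before it faulted.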
>>>>> >> =============
>>>>> >> HCI log analysis
>>>>> >> =============
>>>>> >> Compare the hcidump log of the correct case with the one of the panic
>>>>> >> case, we found there is only one difference in the message sequence.
>>>>> >> In the panic case, headset send L2CAP Disconn_Req immediately after
>>>>> >> sending rfcomm UA frame to android. We think this is the reason that
>>>>> >> cause the socket state become BT_CLOSED.
>>>>> >>
>>>>> >> Please compare these two log, pay attention to the message direction
>>>>> >> of the last Disconn_Req.
>>>>> >>
>>>>> >>
>>>>> >> Log of correct case:
>>>>> >> ----------------------------
>>>>> >>
>>>>> >>
>>>>> >> 009-09-10 15:27:28.963519 < ACL data: handle 1 flags 0x02 dlen 22
>>>>> >>     L2CAP(d): cid 0x0047 len 18 [psm 3]
>>>>> >>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>>>>> >>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>>>>> >> 009-09-10 15:27:28.967272 > HCI Event: Number of Completed Packets (0x13) plen
>>>>> >>
>>>>> >>     handle 1 packets 1
>>>>> >> 009-09-10 15:27:29.243945 < ACL data: handle 1 flags 0x02 dlen 8
>>>>> >>     L2CAP(d): cid 0x0047 len 4 [psm 3]
>>>>> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>>>>> >> 009-09-10 15:27:29.247363 > HCI Event: Number of Completed Packets (0x13) plen
>>>>> >>
>>>>> >>     handle 1 packets 1
>>>>> >> 009-09-10 15:27:29.274890 > ACL data: handle 1 flags 0x02 dlen 8
>>>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>>> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>>>>> >> 009-09-10 15:27:29.296343 < ACL data: handle 1 flags 0x02 dlen 8
>>>>> >>     L2CAP(d): cid 0x0047 len 4 [psm 3]
>>>>> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>>>>> >> 009-09-10 15:27:29.298480 > HCI Event: Number of Completed Packets (0x13) plen
>>>>> >>
>>>>> >>     handle 1 packets 1
>>>>> >> 009-09-10 15:27:29.319873 > ACL data: handle 1 flags 0x02 dlen 8
>>>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>>> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>>>>> >> 009-09-10 15:27:29.320727 < ACL data: handle 1 flags 0x02 dlen 12
>>>>> >>     L2CAP(s): Disconn req: dcid 0x0047 scid 0x0040
>>>>> >> 009-09-10 15:27:29.323474 > HCI Event: Number of Completed Packets (0x13) plen
>>>>> >>
>>>>> >>     handle 1 packets 1
>>>>> >> 009-09-10 15:27:29.337237 > ACL data: handle 1 flags 0x02 dlen 12
>>>>> >>     L2CAP(s): Disconn rsp: dcid 0x0047 scid 0x0040
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> log of panic case:
>>>>> >> ------------------------
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> 2009-09-10 13:34:24.020208 < ACL data: handle 1 flags 0x02 dlen 22
>>>>> >>     L2CAP(d): cid 0x0041 len 18 [psm 3]
>>>>> >>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>>>>> >>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>>>>> >> 2009-09-10 13:34:24.281256 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>>> >>     handle 1 packets 1
>>>>> >> 2009-09-10 13:34:24.083580 < ACL data: handle 1 flags 0x02 dlen 8
>>>>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>>>> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>>>>> >> 2009-09-10 13:34:24.529442 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>>> >>     handle 1 packets 1
>>>>> >> 2009-09-10 13:34:24.531914 > ACL data: handle 1 flags 0x02 dlen 8
>>>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>>> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>>>>> >> 2009-09-10 13:34:24.533135 < ACL data: handle 1 flags 0x02 dlen 8
>>>>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>>>> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>>>>> >> 2009-09-10 13:34:25.028649 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>>> >>     handle 1 packets 1
>>>>> >> 2009-09-10 13:34:25.032128 > ACL data: handle 1 flags 0x02 dlen 8
>>>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>>> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>>>>> >> 2009-09-10 13:34:25.032341 > ACL data: handle 1 flags 0x02 dlen 12
>>>>> >>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>>>>> >> 2009-09-10 13:34:25.032646 < ACL data: handle 1 flags 0x02 dlen 12
>>>>> >>     L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
>>>>> >>
>>>>> >> =============
>>>>> >> Analysis Result
>>>>> >> =============
>>>>> >> For some kinds of Bluetooth headset such as Motorola H560/H620 which
>>>>> >> are based on BCM2044S, they will send L2CAP Disconn_Req command right
>>>>> >> after sending rfcomm UA frame. This L2CAP Disconn_Req will cause the
>>>>> >> rfcomm socket state become BT_CLOSED before completely handling UA
>>>>> >> frame, thus it will cause kernel panic. I think we can ignore the
>>>>> >> received rfcomm frames if socket state is BT_CLOSED, because it
>>>>> >> doesn't make sense in the BT_CLOSED state.
>>>>> >>
>>>>> >>
>>>>> >> =============
>>>>> >> Changed Code
>>>>> >> =============
>>>>> >> We changed the code in the function rfcomm_process_rx() in
>>>>> >> net/bluetooth/rfcomm/core.c, check the socket state first before
>>>>> >> handling the received frames. If the socket state is BT_CLOSED, we
>>>>> >> don't handle any rfcomm frames but just close the session.
>>>>> >>
>>>>> >> The change is like below
>>>>> >>
>>>>> >> +	if (sk->sk_state != BT_CLOSED) {
>>>>> >>  	/* Get data directly from socket receive queue without copying it. */
>>>>> >>  	while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>>>>> >>  		skb_orphan(skb);
>>>>> >>  		rfcomm_recv_frame(s, skb);
>>>>> >>  	}
>>>>> >> -
>>>>> >> -	if (sk->sk_state == BT_CLOSED) {
>>>>> >> +	} else {
>>>>> >>  		if (!s->initiator)
>>>>> >>  			rfcomm_session_put(s);
>>>>> >>
>>>>> >>  		rfcomm_session_close(s, sk->sk_err);
>>>>> >>  	}
>>>>> >
>>>>> > so I do see the issue here, but I don't agree with the fix since it
>>>>> > changes behavior that might cause other issues. So in case the frame
>>>>> > processing leads to sk->sk_state == BT_CLOSED we are not closing the
>>>>> > connection anymore if we make it depend on a state before the frame
>>>>> > processing. And nothing guarantees that rfcomm_process_rx gets scheduled
>>>>> > again.
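Marcel's objection is about ordering: the proposed change decides whether to close based on the socket state sampled before the queued frames are handled. The toy model below assumes, as the objection implies, that handling a queued frame can itself leave sk->sk_state at BT_CLOSED; the closes_link flag, struct rx_model and all function names are hypothetical. Checking after the loop reaches the close in the same pass, while checking before it leaves the session open until rfcomm_process_rx() happens to run again.

```c
#include <assert.h>

enum { BT_CONNECTED = 1, BT_CLOSED = 9 };

struct rx_model {
	int sk_state;       /* L2CAP socket state */
	int frames;         /* frames queued on sk_receive_queue */
	int closes_link;    /* assumption: handling a frame leaves sk_state BT_CLOSED */
	int session_closed; /* set when the rfcomm_session_close() path is taken */
};

static void handle_frames(struct rx_model *m)
{
	while (m->frames > 0) {
		m->frames--;
		if (m->closes_link)
			m->sk_state = BT_CLOSED;
	}
}

/* Existing order: drain the queue, then test the state. */
static void rx_check_after(struct rx_model *m)
{
	handle_frames(m);
	if (m->sk_state == BT_CLOSED)
		m->session_closed = 1;
}

/* Proposed order: branch on the state seen before draining the queue. */
static void rx_check_before(struct rx_model *m)
{
	if (m->sk_state != BT_CLOSED)
		handle_frames(m);
	else
		m->session_closed = 1;
}
```

In the second variant the socket ends up BT_CLOSED but the close path is never taken in that pass.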
>>>>> >
>>>>> > diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
>>>>> > index 94b3388..606143b 100644
>>>>> > --- a/net/bluetooth/rfcomm/core.c
>>>>> > +++ b/net/bluetooth/rfcomm/core.c
>>>>> > @@ -1798,12 +1798,16 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)
>>>>> >
>>>>> >  	BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->sk_receive_queue));
>>>>> >
>>>>> > +	rfcomm_session_hold(s);
>>>>> > +
>>>>> >  	/* Get data directly from socket receive queue without copying it. */
>>>>> >  	while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>>>>> >  		skb_orphan(skb);
>>>>> >  		rfcomm_recv_frame(s, skb);
>>>>> >  	}
>>>>> >
>>>>> > +	rfcomm_session_put(s);
>>>>> > +
>>>>> >  	if (sk->sk_state == BT_CLOSED) {
>>>>> >  		if (!s->initiator)
>>>>> >  			rfcomm_session_put(s);
>>>>> >
>>>>> > What does the above patch do for you? Since if I read it correctly, then
>>>>> > the rfcomm_recv_ua causes the rfcomm_session_put to trigger the closing
>>>>> > of the session. And then in this case it is delayed until after all
>>>>> > frames are processed.
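The hold/put bracket in this patch follows a common reference-counting pattern: take a temporary reference before a section that may drop the last one, so that the release is deferred to the matching put at the end. A reduced sketch, with hypothetical names and a plain int in place of the kernel's atomic_t, contrasting the two behaviours:

```c
#include <assert.h>

struct session {
	int refcnt;
	int freed;           /* set by the final put, like rfcomm_session_del() */
	int used_after_free; /* set if the session is touched after being freed */
};

static void hold(struct session *s)
{
	s->refcnt++;
}

static void put(struct session *s)
{
	assert(s->refcnt > 0);
	if (--s->refcnt == 0)
		s->freed = 1;
}

static void touch(struct session *s)
{
	if (s->freed)
		s->used_after_free = 1;
}

/* A receive pass in which a frame handler (the UA on dlci 0 in BT_DISCONN)
 * drops what would otherwise be the last reference, after which the same
 * pass keeps using the session. */
static void process_rx(struct session *s, int with_hold)
{
	if (with_hold)
		hold(s);
	put(s);   /* rfcomm_recv_ua() -> rfcomm_session_put() */
	touch(s); /* later code in the same pass still uses s */
	if (with_hold)
		put(s); /* the deferred release happens here instead */
}
```

Without the bracket the session is released inside the pass and then touched again; with it, the final put at the end performs the release.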
>>>>> >
>>>>> >
>>>>>
>>>>> I've tried your patch but unfortunately kernel panic still happened.
>>>>>
>>>>> From the log I noticed that if rfcomm_l2state_change is called before
>>>>> rfcomm_process_rx, kernel panic will happen definitely.
>>>>>
>>>>> Below lines are in the correct log,
>>>>>
>>>>> [  139.323852] rfcomm_l2data_ready: ccf70000 bytes 4
>>>>> [  139.346252] rfcomm_process_rx: session ccb94ce0 state 8 qlen 1
>>>>> ...
>>>>> [  139.457519] rfcomm_l2state_change: ccf70000 state 9
>>>>> (disconnect ok)
>>>>>
>>>>> In the above case, when process_rx, the code in the condition "if
>>>>> (sk->sk_state == BT_CLOSED)" will never run.
>>>>>
>>>>> Below lines are in the panic log,
>>>>>
>>>>> [  174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>>>>> [  174.903442] rfcomm_l2state_change: ccf70400 state 9
>>>>> [  174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>>>>> ...
>>>>> ( then panic)
>>>>>
>>>>> In the above case, sk_state is changed to 9 (BT_CLOSED) firstly, then
>>>>> process_rx, so the code in the condition "if (sk->sk_state ==
>>>>> BT_CLOSED) " will be run, it will call session_put twice. I think this
>>>>> is the root cause of panic.
>>>>
>>>> I know why it happens, that is not the problem. My point is not to break
>>>> current scheduling assumptions.
>>>>
>>>> So if you move the rfcomm_session_put() now at the end of the function,
>>>> then it should be fine, right?
>>>>
>>>> Regards
>>>>
>>>> Marcel
>>>>
>>>>
>>>>
>>>
>>> You are right. I moved the rfcomm_session_put() at the end of
>>> rfcomm_process_rx() then kernel panic doesn't happen any longer.
>>>
>>> The changed code is like below,
>>>
>>> @@ -1796,6 +1796,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
>>>
>>>  	BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->s
>>>
>>> +	rfcomm_session_hold(s);
>>> +
>>>  	/* Get data directly from socket receive queue without copying it. */
>>>  	while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>>>  		skb_orphan(skb);
>>> @@ -1808,6 +1810,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
>>>
>>>  		rfcomm_session_close(s, sk->sk_err);
>>>  	}
>>> +
>>> +	rfcomm_session_put(s);
>>>  }
>>>
>>>  static inline void rfcomm_accept_connection(struct rfcomm_session *s)
>>>
>>> Please submit this change to bluez release.
>>
>>
>> Unfortunately, with this change I get a panic disconnecting from
>> Motorola H270 in the case that the headset initiated RFCOMM and we
>> disconnect RFCOMM.
>>
>> Here is the hcidump:
>>
>> 2009-09-21 17:22:37.384811 < ACL data: handle 1 flags 0x02 dlen 22
>>     L2CAP(d): cid 0x0041 len 18 [psm 3]
>>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>> 2009-09-21 17:22:37.502273 > HCI Event: Number of Completed Packets (0x13) plen 5
>>     handle 1 packets 1
>> 2009-09-21 17:22:37.788895 < ACL data: handle 1 flags 0x02 dlen 8
>>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>> 2009-09-21 17:22:37.906204 > HCI Event: Number of Completed Packets (0x13) plen 5
>>     handle 1 packets 1
>> 2009-09-21 17:22:37.933090 > ACL data: handle 1 flags 0x02 dlen 8
>>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>> 2009-09-21 17:22:38.636764 < ACL data: handle 1 flags 0x02 dlen 8
>>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>> 2009-09-21 17:22:38.744125 > HCI Event: Number of Completed Packets (0x13) plen 5
>>     handle 1 packets 1
>> 2009-09-21 17:22:38.763687 > ACL data: handle 1 flags 0x02 dlen 8
>>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>> 2009-09-21 17:22:38.783554 > ACL data: handle 1 flags 0x02 dlen 12
>>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>> 2009-09-21 17:22:39.029526 < ACL data: handle 1 flags 0x02 dlen 12
>>     L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
>> 2009-09-21 17:22:39.136581 > HCI Event: Number of Completed Packets (0x13) plen 5
>>     handle 1 packets 1
>> 2009-09-21 17:22:41.337203 > HCI Event: Disconn Complete (0x05) plen 4
>>     status 0x00 handle 1 reason 0x13
>>     Reason: Remote User Terminated Connection
>>
>> And the panic:
>>
>> <7>[ 3161.665557] rfcomm:rfcomm_session_del: session c9c06ad0 state 9
>> <7>[ 3161.671905] l2cap:l2cap_sock_release: sock cea04360, sk c97f02f8
>> <7>[ 3161.678497] l2cap:l2cap_sock_shutdown: sock cea04360, sk c97f02f8
>> <7>[ 3161.685028] l2cap:l2cap_sock_kill: sk c97f02f8 state 9
>> <7>[ 3161.695587] l2cap:l2cap_sock_destruct: sk c97f02f8
>> <4>[ 3161.700805] npelly 1911 rfcomm_process_sessions session c9c06ad0 refcnt 1802201963
>> <7>[ 3161.709014] rfcomm:rfcomm_process_dlcs: session c9c06ad0 state 1802201963
>> <7>[ 3161.716308] rfcomm:rfcomm_process_dlcs: session c9c06ad0 dlc 6b6b6b6b
>> <1>[ 3161.726776] Unable to handle kernel paging request at virtual address 6b6b6b6b
>> <1>[ 3161.734619] pgd = c0004000
>> <1>[ 3161.737609] [6b6b6b6b] *pgd=00000000
>> <4>[ 3161.741638] Internal error: Oops: 5 [#1] PREEMPT
>> <4>[ 3161.746734] Modules linked in:
>> <4>[ 3161.750213] CPU: 0    Not tainted  (2.6.29-omap1-07358-g9a3fd55-dirty #206)
>> <4>[ 3161.757629] PC is at rfcomm_process_dlcs+0x108/0x590
>> <4>[ 3161.762969] LR is at preempt_schedule+0x44/0x54
>> <4>[ 3161.767852] pc : [<c03911f4>]    lr : [<c03a27c4>]    psr: 60000113
>> <4>[ 3161.767883] sp : ccdf9e80  ip : ccdf9dd8  fp : ccdf9edc
>> <4>[ 3161.780273] r10: 00000000  r9 : c9c06af4  r8 : c9c06ad0
>> <4>[ 3161.786010] r7 : 00000000  r6 : c9c06ad0  r5 : c4c68680  r4 : 6b6b6b6b
>> <4>[ 3161.792968] r3 : c9c06ae0  r2 : ccdf8000  r1 : c61a8940  r0 : 0000004c
>> <4>[ 3161.800079] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
>> <4>[ 3161.807983] Control: 10c5387d  Table: 86db8019  DAC: 00000017
>> <4>[ 3161.814147]
>> <4>[ 3161.814147] PC: 0xc0391174:
>> [...]
>> <4>[ 3162.973175] Backtrace:
>> <4>[ 3162.976013] [<c03910ec>] (rfcomm_process_dlcs+0x0/0x590) from
>> [<c03930b0>] (rfcomm_process_sessions+0x1a34/0x1a9c)
>> <4>[ 3162.987579] [<c039167c>] (rfcomm_process_sessions+0x0/0x1a9c)
>> from [<c03932ec>] (rfcomm_run+0x1d4/0x2ac)
>> <4>[ 3162.998199] [<c0393118>] (rfcomm_run+0x0/0x2ac) from
>> [<c008e7d8>] (kthread+0x5c/0x94)
>> <4>[ 3163.013763] [<c008e77c>] (kthread+0x0/0x94) from [<c007c998>]
>> (do_exit+0x0/0x714)
>>
>>
>> Seems like this fix avoids the panic due to calling
>> rfcomm_session_close() on a deleted session, but does not always
>> address the unbalanced rfcomm_session_put() which may be the root
>> cause.
>>
>> Lan Zhu suspected this in the original post, and his original fix does
>> in fact fix this panic as well as the originally reported panic,
>> because it avoids the unbalanced rfcomm_session_put().
>>
>> Marcel I know you are concerned about the original fix changing
>> scheduling assumptions, are you able to comment on this further?
>>
>> Are there any other suggestions for patches for this issue? I have
>> spent the best part of the day trying to figure this one out, but the
>> refcounting in the rfcomm core is quite subtle and I think it really
>> needs someone familiar with the code to have a quick look and come up
>> with the safest patch. I can run tests.
>>
>> In the mean time, I am doing some testing of Lan Zhu's original fix
>> and if there are no better suggestions we will run with that one for
>> Android.
>>
>> Nick
>>
>>
>> Some more analysis:
>>
>> With the RFCOMM connection in idle there are 2 references on s->refcnt
>>
>> However three references are removed during disconnect with the H270
>> - rfcomm_process_sessions() -> __rfcomm_dlc_close() -> rfcomm_dlc_unlink()
>> - rfcomm_process_sessions() -> rfcomm_process_rx() -> rfcomm_recv_ua()
>> with dlci = 0 and s->state = BT_DISCONN
>> - rfcomm_process_sessions() -> rfcomm_process_rx() with sk_state =
>> BT_CLOSED and s->initiator = 0
>>
>> in that order.
>>
>> On another headset, for example the Moto H350, we only see the first
>> two references removed during disconnect.
>>
>> - rfcomm_process_sessions() -> __rfcomm_dlc_close() -> rfcomm_dlc_unlink()
>> - rfcomm_process_sessions() -> rfcomm_process_rx() -> rfcomm_recv_ua()
>> with dlci = 0 and s->state = BT_DISCONN
>>
>
> How about this. We still call rfcomm_process_rx(), but avoid the
> rfcomm_session_put() due to RFCOMM UA when the socket state is
> BT_CLOSED.
>
> It is less invasive, so might address Marcel's concerns with regard to
> scheduling changes.
>
>
> diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
> index c109a3a..333c6e9 100644
> --- a/net/bluetooth/rfcomm/core.c
> +++ b/net/bluetooth/rfcomm/core.c
> @@ -1105,6 +1105,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
>  		}
>  	} else {
>  		/* Control channel */
> +		struct socket *sock = s->sock;
> +		struct sock *sk = sock->sk;
>  		switch (s->state) {
>  		case BT_CONNECT:
>  			s->state = BT_CONNECTED;
> @@ -1112,7 +1114,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
>  			break;
>
>  		case BT_DISCONN:
> -			rfcomm_session_put(s);
> +			if (sk->sk_state != BT_CLOSED)
> +				rfcomm_session_put(s);
>  			break;
>  		}
>  	}
>

I made a minor style improvement and added commit message. Patch available from

http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=1048e007842da2d6440679e1ca80f45438a6369d

Nick

2009-09-22 01:29:56

by Nick Pelly

[permalink] [raw]
Subject: Re: kernel panic happens when disconnecting Bluetooth headset

On Mon, Sep 21, 2009 at 5:52 PM, Nick Pelly <[email protected]> wrote:
> On Mon, Sep 14, 2009 at 2:10 AM, Lan Zhu <[email protected]> wrote:
>> Hi Marcel,
>>
>> 2009/9/12 Marcel Holtmann <[email protected]>:
>>> Hi Zhu,
>>>
>>>> >> We met a issue that kernel panic happens when disconnecting some kinds
>>>> >> of Bluetooth headset, then we did some analysis and made some changes
>>>> >> on kernel code which have avoided the panic happening. Would you
>>>> >> please help to check if our analysis and fix is correct?
>>>> >>
>>>> >> =============
>>>> >> Issue description
>>>> >> =============
>>>> >> On Android platform(kernel 2.6.29), disconnecting Bluetooth headset
>>>> >> may cause kernel panic on certain conditions.
>>>> >>
>>>> >> (Pre-condition is android paired with headset.)
>>>> >> Initiate the connection from android, disconnect it from android, result is OK.
>>>> >> Initiate the connection from android, disconnect it from headset, result is OK.
>>>> >> Initiate the connection from headset, disconnect it from headset, result is OK.
>>>> >> Initiate the connection from headset, disconnect it from android, for
>>>> >> Motorola H12 headset, result is OK.
>>>> >> Initiate the connection from headset, disconnect it from android, for
>>>> >> Motorola H620/560 headset, result is kernel panic.
>>>> >>
>>>> >> =============
>>>> >> Kernel panic point
>>>> >> =============
>>>> >> kernel panic at __list_del() in the function rfcomm_session_del() ,
>>>> >> panic reason is "Unable to handle kernel paging request at virtual
>>>> >> address 00200200"
>>>> >>
>>>> >> =============
>>>> >> Kernel log analysis
>>>> >> =============
>>>> >> rfcomm_session_del() is still called after the session entry is
>>>> >> removed from the list. Then __list_del() will cause kernel panic
>>>> >> because of the incorrect pointer. This situation occurs when calling
>>>> >> rfcomm_recv_ua() when the socket state is BT_CLOSED . So we need to
>>>> >> find out why the socket state become BT_CLOSED before we calling
>>>> >> rfcomm_recv_ua().
>>>> >>
>>>> >> # [  171.677429] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>>>> >> [  171.683532] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 14
>>>> >> [  171.689422] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>>>> >> [  171.695709] rfcomm_process_rx: @@@ @@@ sk_state = 1
>>>> >> [  171.701110] rfcomm_process_dlcs: session cc751be0 state 1
>>>> >> [  171.706939] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40
>>>> >> rx_credits 33 tx_credits 31
>>>> >> [  171.715515] rfcomm_send_frame: session cc751be0 len 18
>>>> >> [  171.721130] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>>>> >> [  174.127807] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>>>> >> [  174.134490] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 6
>>>> >> [  174.141540] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>>>> >> [  174.148498] rfcomm_process_rx: @@@ @@@ sk_state = 1
>>>> >> [  174.154968] rfcomm_process_dlcs: session cc751be0 state 1
>>>> >> [  174.161437] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40
>>>> >> rx_credits 33 tx_credits 30
>>>> >> [  174.171173] rfcomm_send_frame: session cc751be0 len 10
>>>> >> [  174.177642] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>>>> >> [  174.205932] rfcomm_sock_release: sock ce9a0960, sk cc5c4c00
>>>> >> [  174.212707] rfcomm_sock_shutdown: sock ce9a0960, sk cc5c4c00
>>>> >> [  174.220031] __rfcomm_sock_close: sk cc5c4c00 state 1 socket ce9a0960
>>>> >> [  174.227508] __rfcomm_dlc_close: dlc cd3fe920 state 1 dlci 20 err 0
>>>> >> session cc751be0
>>>> >> [  174.236877] rfcomm_send_disc: cc751be0 dlci 20
>>>> >> [  174.242706] rfcomm_send_frame: session cc751be0 len 4
>>>> >> [  174.248962] rfcomm_dlc_set_timer: dlc cd3fe920 state 8 timeout 2560
>>>> >> [  174.256835] rfcomm_sock_kill: sk cc5c4c00 state 1 refcnt 2
>>>> >> [  174.263336] rfcomm_sock_destruct: sk cc5c4c00 dlc cd3fe920
>>>> >> [  174.399444] rfcomm_l2data_ready: ccf70400 bytes 4
>>>> >> [  174.404724] rfcomm_process_rx: session cc751be0 state 1 qlen 1
>>>> >> [  174.411010] rfcomm_process_rx: @@@ @@@ sk_state = 1
>>>> >> [  174.416412] rfcomm_recv_ua: session cc751be0 state 1 dlci 20
>>>> >> [  174.422515] __rfcomm_dlc_close: dlc cd3fe920 state 9 dlci 20 err 0
>>>> >> session cc751be0
>>>> >> [  174.430816] rfcomm_dlc_clear_timer: dlc cd3fe920 state 9
>>>> >> [  174.436553] rfcomm_dlc_unlink: dlc cd3fe920 refcnt 1 session cc751be0
>>>> >> [  174.443572] rfcomm_dlc_free: cd3fe920
>>>> >> [  174.447570] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>>>> >> [  174.454528] rfcomm_send_disc: cc751be0 dlci 0
>>>> >> [  174.459259] rfcomm_send_frame: session cc751be0 len 4
>>>> >> [  174.464904] rfcomm_process_dlcs: session cc751be0 state 8
>>>> >> [  174.470703] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>>>> >> [  174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>>>> >> [  174.903442] rfcomm_l2state_change: ccf70400 state 9
>>>> >> [  174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>>>> >> [  174.915130] rfcomm_process_rx: @@@ @@@ sk_state = 9
>>>> >> [  174.920562] rfcomm_recv_ua: session cc751be0 state 8 dlci 0
>>>> >> [  174.926574] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>>>> >> [  174.933532] rfcomm_process_rx: @@@ @@@ sk_state == BT_CLOSED , s->initiator=0
>>>> >> [  174.941253] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>>>> >> [  174.948211] rfcomm_session_del: session cc751be0 state 8
>>>> >> [  174.953918] @@@@ in rfcomm_session_del()
>>>> >> [  174.958312] @@@@ s->list = cc751be0
>>>> >> [  174.962097] @@@@ s->list.next = ccbfe9a0
>>>> >> [  174.966369] @@@@ s->list.prev = c047d524
>>>> >> [  174.970733] @@@@ list is valid, call list_del()
>>>> >> [  174.975646] @@@@ after list_del()
>>>> >> [  174.979278] @@@@ s->list = cc751be0
>>>> >> [  174.983184] @@@@ s->list.next = 00100100
>>>> >> [  174.987457] @@@@ s->list.prev = 00200200
>>>> >> [  174.991729] rfcomm_session_close: session cc751be0 state 8 err 104
>>>> >> [  174.998504] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>>>> >> [  175.005310] rfcomm_session_del: session cc751be0 state 9
>>>> >> [  175.011169] @@@@ in rfcomm_session_del()
>>>> >> [  175.015441] @@@@ s->list = cc751be0
>>>> >> [  175.019409] @@@@ s->list.next = 00100100
>>>> >> [  175.023651] @@@@ s->list.prev = 00200200
>>>> >> [  175.027923] @@@@ list is valid, call list_del()
>>>> >> [  175.032958] Unable to handle kernel paging request at virtual
>>>> >> address 00200200
>>>> >> [  175.040679] pgd = c0004000
>>>> >> [  175.043792] [00200200] *pgd=00000000
>>>> >> [  175.047821] Internal error: Oops: 817 [#1]
>>>> >> [  175.052246] Modules linked in:
>>>> >> [  175.055725] CPU: 0    Not tainted  (2.6.29-omap1-dirty #34)
>>>> >> [  175.061859] PC is at rfcomm_session_del+0x6c/0x108
>>>> >> [  175.067047] LR is at release_console_sem+0x190/0x1a0
>>>> >> [  175.072509] pc : [<c033ded8>]    lr : [<c0066308>]    psr: 60000013
>>>> >> [  175.072509] sp : cc1abf38  ip : cc1abe68  fp : cc1abf4c
>>>> >> [  175.084960] r10: cc751c04  r9 : c036d2fc  r8 : cc751be0
>>>> >> [  175.090545] r7 : 00000068  r6 : cc751c04  r5 : 00000009  r4 : cc751be0
>>>> >> [  175.097656] r3 : 00100100  r2 : 00100100  r1 : 00200200  r0 : c0422876
>>>> >>
>>>> >> =============
>>>> >> HCI log analysis
>>>> >> =============
>>>> >> Compare the hcidump log of the correct case with the one of the panic
>>>> >> case, we found there is only one difference in the message sequence.
>>>> >> In the panic case, headset send L2CAP Disconn_Req immediately after
>>>> >> sending rfcomm UA frame to android. We think this is the reason that
>>>> >> cause the socket state become BT_CLOSED.
>>>> >>
>>>> >> Please compare these two log, pay attention to the message direction
>>>> >> of the last Disconn_Req.
>>>> >>
>>>> >>
>>>> >> Log of correct case:
>>>> >> ----------------------------
>>>> >>
>>>> >>
>>>> >> 009-09-10 15:27:28.963519 < ACL data: handle 1 flags 0x02 dlen 22
>>>> >>    L2CAP(d): cid 0x0047 len 18 [psm 3]
>>>> >>      RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>>>> >>      0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>>>> >> 009-09-10 15:27:28.967272 > HCI Event: Number of Completed Packets (0x13) plen
>>>> >>
>>>> >>    handle 1 packets 1
>>>> >> 009-09-10 15:27:29.243945 < ACL data: handle 1 flags 0x02 dlen 8
>>>> >>    L2CAP(d): cid 0x0047 len 4 [psm 3]
>>>> >>      RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>>>> >> 009-09-10 15:27:29.247363 > HCI Event: Number of Completed Packets (0x13) plen
>>>> >>
>>>> >>    handle 1 packets 1
>>>> >> 009-09-10 15:27:29.274890 > ACL data: handle 1 flags 0x02 dlen 8
>>>> >>    L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>> >>      RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>>>> >> 009-09-10 15:27:29.296343 < ACL data: handle 1 flags 0x02 dlen 8
>>>> >>    L2CAP(d): cid 0x0047 len 4 [psm 3]
>>>> >>      RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>>>> >> 009-09-10 15:27:29.298480 > HCI Event: Number of Completed Packets (0x13) plen
>>>> >>
>>>> >>    handle 1 packets 1
>>>> >> 009-09-10 15:27:29.319873 > ACL data: handle 1 flags 0x02 dlen 8
>>>> >>    L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>> >>      RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>>>> >> 009-09-10 15:27:29.320727 < ACL data: handle 1 flags 0x02 dlen 12
>>>> >>    L2CAP(s): Disconn req: dcid 0x0047 scid 0x0040
>>>> >> 009-09-10 15:27:29.323474 > HCI Event: Number of Completed Packets (0x13) plen
>>>> >>
>>>> >>    handle 1 packets 1
>>>> >> 009-09-10 15:27:29.337237 > ACL data: handle 1 flags 0x02 dlen 12
>>>> >>    L2CAP(s): Disconn rsp: dcid 0x0047 scid 0x0040
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> log of panic case:
>>>> >> ------------------------
>>>> >>
>>>> >>
>>>> >>
>>>> >> 2009-09-10 13:34:24.020208 < ACL data: handle 1 flags 0x02 dlen 22
>>>> >>     L2CAP(d): cid 0x0041 len 18 [psm 3]
>>>> >>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>>>> >>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>>>> >> 2009-09-10 13:34:24.281256 > HCI Event: Number of Completed Packets (0x13) plen
>>>> >> 5
>>>> >>     handle 1 packets 1
>>>> >> 2009-09-10 13:34:24.083580 < ACL data: handle 1 flags 0x02 dlen 8
>>>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>>> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>>>> >> 2009-09-10 13:34:24.529442 > HCI Event: Number of Completed Packets (0x13) plen
>>>> >> 5
>>>> >>     handle 1 packets 1
>>>> >> 2009-09-10 13:34:24.531914 > ACL data: handle 1 flags 0x02 dlen 8
>>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>>>> >> 2009-09-10 13:34:24.533135 < ACL data: handle 1 flags 0x02 dlen 8
>>>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>>> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>>>> >> 2009-09-10 13:34:25.028649 > HCI Event: Number of Completed Packets (0x13) plen
>>>> >> 5
>>>> >>     handle 1 packets 1
>>>> >> 2009-09-10 13:34:25.032128 > ACL data: handle 1 flags 0x02 dlen 8
>>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>>>> >> 2009-09-10 13:34:25.032341 > ACL data: handle 1 flags 0x02 dlen 12
>>>> >>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>>>> >> 2009-09-10 13:34:25.032646 < ACL data: handle 1 flags 0x02 dlen 12
>>>> >>     L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
>>>> >>
>>>> >> =============
>>>> >> Analysis Result
>>>> >> =============
>>>> >> For some kinds of Bluetooth headset such as Motorola H560/H620 which
>>>> >> are based on BCM2044S, they will send L2CAP Disconn_Req command right
>>>> >> after sending rfcomm UA frame. This L2CAP Disconn_Req will cause the
>>>> >> rfcomm socket state become BT_CLOSED before completely handling UA
>>>> >> frame, thus it will cause kernel panic. I think we can ignore the
>>>> >> received rfcomm frames if socket state is BT_CLOSED, because it
>>>> >> doesn't make sense in the BT_CLOSED state.
>>>> >>
>>>> >>
>>>> >> =============
>>>> >> Changed Code
>>>> >> =============
>>>> >> We changed the code in the function rfcomm_process_rx() in
>>>> >> net/bluetooth/rfcomm/core.c, check the socket state first before
>>>> >> handling the received frames. If the socket state is BT_CLOSED, we
>>>> >> don't handle any rfcomm frames but just close the session.
>>>> >>
>>>> >> The change is like below
>>>> >>
>>>> >> +	if (sk->sk_state != BT_CLOSED) {
>>>> >>  	/* Get data directly from socket receive queue without copying it. */
>>>> >>  	while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>>>> >>  		skb_orphan(skb);
>>>> >>  		rfcomm_recv_frame(s, skb);
>>>> >>  	}
>>>> >> -
>>>> >> -	if (sk->sk_state == BT_CLOSED) {
>>>> >> +	} else {
>>>> >>  		if (!s->initiator)
>>>> >>  			rfcomm_session_put(s);
>>>> >>
>>>> >>  		rfcomm_session_close(s, sk->sk_err);
>>>> >>  	}
>>>> >
>>>> > so I do see the issue here, but I don't agree with the fix since it
>>>> > changes behavior that might cause other issues. So in case the frame
>>>> > processing leads to sk->sk_state == BT_CLOSED we are not closing the
>>>> > connection anymore if we make it depend on a state before the frame
>>>> > processing. And nothing guarantees that rfcomm_process_rx gets scheduled
>>>> > again.
>>>> >
>>>> > diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
>>>> > index 94b3388..606143b 100644
>>>> > --- a/net/bluetooth/rfcomm/core.c
>>>> > +++ b/net/bluetooth/rfcomm/core.c
>>>> > @@ -1798,12 +1798,16 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)
>>>> >
>>>> >  	BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->sk_receive_queue));
>>>> >
>>>> > +	rfcomm_session_hold(s);
>>>> > +
>>>> >  	/* Get data directly from socket receive queue without copying it. */
>>>> >  	while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>>>> >  		skb_orphan(skb);
>>>> >  		rfcomm_recv_frame(s, skb);
>>>> >  	}
>>>> >
>>>> > +	rfcomm_session_put(s);
>>>> > +
>>>> >  	if (sk->sk_state == BT_CLOSED) {
>>>> >  		if (!s->initiator)
>>>> >  			rfcomm_session_put(s);
>>>> >
>>>> > What does the above patch do for you? Since if I read it correctly, then
>>>> > the rfcomm_recv_ua causes the rfcomm_session_put to trigger the closing
>>>> > of the session. And then in this case it is delayed until after all
>>>> > frames are processed.
>>>> >
>>>> >
>>>>
>>>> I've tried your patch but unfortunately kernel panic still happened.
>>>>
>>>> From the log I noticed that if rfcomm_l2state_change is called before
>>>> rfcomm_process_rx, kernel panic will happen definitely.
>>>>
>>>> Below lines are in the correct log,
>>>>
>>>> [  139.323852] rfcomm_l2data_ready: ccf70000 bytes 4
>>>> [  139.346252] rfcomm_process_rx: session ccb94ce0 state 8 qlen 1
>>>> ...
>>>> [  139.457519] rfcomm_l2state_change: ccf70000 state 9
>>>> (disconnect ok)
>>>>
>>>> In the above case, when process_rx, the code in the condition "if
>>>> (sk->sk_state == BT_CLOSED)" will never run.
>>>>
>>>> Below lines are in the panic log,
>>>>
>>>> [  174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>>>> [  174.903442] rfcomm_l2state_change: ccf70400 state 9
>>>> [  174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>>>> ...
>>>> ( then panic)
>>>>
>>>> In the above case, sk_state is changed to 9 (BT_CLOSED) firstly,  then
>>>> process_rx, so the code in the condition "if (sk->sk_state ==
>>>> BT_CLOSED) " will be run, it will call session_put twice. I think this
>>>> is the root cause of panic.
>>>
>>> I know why it happens, that is not the problem. My point is not to break
>>> current scheduling assumptions.
>>>
>>> So if you move the rfcomm_session_put() now at the end of the function,
>>> then it should be fine, right?
>>>
>>> Regards
>>>
>>> Marcel
>>>
>>>
>>>
>>
>> You are right. I moved the rfcomm_session_put() at the end of
>> rfcomm_process_tx() then kernel panic doesn't happen any longer.
>>
>> The changed code is like below,
>>
>> @@ -1796,6 +1796,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
>>
>>  	BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->s
>>
>> +	rfcomm_session_hold(s);
>> +
>>  	/* Get data directly from socket receive queue without copying it. */
>>  	while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>>  		skb_orphan(skb);
>> @@ -1808,6 +1810,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
>>
>>  		rfcomm_session_close(s, sk->sk_err);
>>  	}
>> +
>> +	rfcomm_session_put(s);
>>  }
>>
>>  static inline void rfcomm_accept_connection(struct rfcomm_session *s)
>>
>> Please submit this change to bluez release.
>
>
> Unfortunately, with this change I get a panic disconnecting from
> Motorola H270 in the case that the headset initiated RFCOMM and we
> disconnect RFCOMM.
>
> Here is the hcidump:
>
> 2009-09-21 17:22:37.384811 < ACL data: handle 1 flags 0x02 dlen 22
>    L2CAP(d): cid 0x0041 len 18 [psm 3]
>      RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>      0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
> 2009-09-21 17:22:37.502273 > HCI Event: Number of Completed Packets
> (0x13) plen 5
>    handle 1 packets 1
> 2009-09-21 17:22:37.788895 < ACL data: handle 1 flags 0x02 dlen 8
>    L2CAP(d): cid 0x0041 len 4 [psm 3]
>      RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
> 2009-09-21 17:22:37.906204 > HCI Event: Number of Completed Packets
> (0x13) plen 5
>    handle 1 packets 1
> 2009-09-21 17:22:37.933090 > ACL data: handle 1 flags 0x02 dlen 8
>    L2CAP(d): cid 0x0040 len 4 [psm 3]
>      RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
> 2009-09-21 17:22:38.636764 < ACL data: handle 1 flags 0x02 dlen 8
>    L2CAP(d): cid 0x0041 len 4 [psm 3]
>      RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
> 2009-09-21 17:22:38.744125 > HCI Event: Number of Completed Packets
> (0x13) plen 5
>    handle 1 packets 1
> 2009-09-21 17:22:38.763687 > ACL data: handle 1 flags 0x02 dlen 8
>    L2CAP(d): cid 0x0040 len 4 [psm 3]
>      RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
> 2009-09-21 17:22:38.783554 > ACL data: handle 1 flags 0x02 dlen 12
>    L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
> 2009-09-21 17:22:39.029526 < ACL data: handle 1 flags 0x02 dlen 12
>    L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
> 2009-09-21 17:22:39.136581 > HCI Event: Number of Completed Packets
> (0x13) plen 5
>    handle 1 packets 1
> 2009-09-21 17:22:41.337203 > HCI Event: Disconn Complete (0x05) plen 4
>    status 0x00 handle 1 reason 0x13
>    Reason: Remote User Terminated Connection
>
> And the panic:
>
> <7>[ 3161.665557] rfcomm:rfcomm_session_del: session c9c06ad0 state 9
> <7>[ 3161.671905] l2cap:l2cap_sock_release: sock cea04360, sk c97f02f8
> <7>[ 3161.678497] l2cap:l2cap_sock_shutdown: sock cea04360, sk c97f02f8
> <7>[ 3161.685028] l2cap:l2cap_sock_kill: sk c97f02f8 state 9
> <7>[ 3161.695587] l2cap:l2cap_sock_destruct: sk c97f02f8
> <4>[ 3161.700805] npelly 1911 rfcomm_process_sessions session c9c06ad0
> refcnt 1802201963
> <7>[ 3161.709014] rfcomm:rfcomm_process_dlcs: session c9c06ad0 state 1802201963
> <7>[ 3161.716308] rfcomm:rfcomm_process_dlcs: session c9c06ad0 dlc 6b6b6b6b
> <1>[ 3161.726776] Unable to handle kernel paging request at virtual
> address 6b6b6b6b
> <1>[ 3161.734619] pgd = c0004000
> <1>[ 3161.737609] [6b6b6b6b] *pgd=00000000
> <4>[ 3161.741638] Internal error: Oops: 5 [#1] PREEMPT
> <4>[ 3161.746734] Modules linked in:
> <4>[ 3161.750213] CPU: 0    Not tainted
> (2.6.29-omap1-07358-g9a3fd55-dirty #206)
> <4>[ 3161.757629] PC is at rfcomm_process_dlcs+0x108/0x590
> <4>[ 3161.762969] LR is at preempt_schedule+0x44/0x54
> <4>[ 3161.767852] pc : [<c03911f4>]    lr : [<c03a27c4>]    psr: 60000113
> <4>[ 3161.767883] sp : ccdf9e80  ip : ccdf9dd8  fp : ccdf9edc
> <4>[ 3161.780273] r10: 00000000  r9 : c9c06af4  r8 : c9c06ad0
> <4>[ 3161.786010] r7 : 00000000  r6 : c9c06ad0  r5 : c4c68680  r4 : 6b6b6b6b
> <4>[ 3161.792968] r3 : c9c06ae0  r2 : ccdf8000  r1 : c61a8940  r0 : 0000004c
> <4>[ 3161.800079] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
> Segment kernel
> <4>[ 3161.807983] Control: 10c5387d  Table: 86db8019  DAC: 00000017
> <4>[ 3161.814147]
> <4>[ 3161.814147] PC: 0xc0391174:
> [...]
> <4>[ 3162.973175] Backtrace:
> <4>[ 3162.976013] [<c03910ec>] (rfcomm_process_dlcs+0x0/0x590) from
> [<c03930b0>] (rfcomm_process_sessions+0x1a34/0x1a9c)
> <4>[ 3162.987579] [<c039167c>] (rfcomm_process_sessions+0x0/0x1a9c)
> from [<c03932ec>] (rfcomm_run+0x1d4/0x2ac)
> <4>[ 3162.998199] [<c0393118>] (rfcomm_run+0x0/0x2ac) from
> [<c008e7d8>] (kthread+0x5c/0x94)
> <4>[ 3163.013763] [<c008e77c>] (kthread+0x0/0x94) from [<c007c998>]
> (do_exit+0x0/0x714)
>
>
> Seems like this fix avoids the panic due to calling
> rfcomm_session_close() on a deleted session, but does not always
> address the unbalanced rfcomm_session_put() which may be the root
> cause.
>
> Lan Zhu suspected this in the original post, and his original fix does
> in fact fix this panic as well as the originally reported panic,
> because it avoids the unbalanced rfcomm_session_put().
>
> Marcel I know you are concerned about the original fix changing
> scheduling assumptions, are you able to comment on this further?
>
> Are there any other suggestions for patches for this issue? I have
> spent the best part of the day trying to figure this one out, but the
> refcounting in the rfcomm core is quite subtle and I think it really
> needs someone familiar with the code to have a quick look and come up
> with the safest patch. I can run tests.
>
> In the meantime, I am doing some testing of Lan Zhu's original fix
> and if there are no better suggestions we will run with that one for
> Android.
>
> Nick
>
>
> Some more analysis:
>
> With the RFCOMM connection in idle there are 2 references on s->refcnt
>
> However three references are removed during disconnect with the H270
> - rfcomm_process_sessions() -> __rfcomm_dlc_close() -> rfcomm_dlc_unlink()
> - rfcomm_process_sessions() -> rfcomm_process_rx() -> rfcomm_recv_ua()
> with dlci = 0 and s->state = BT_DISCONN
> - rfcomm_process_sessions() -> rfcomm_process_rx() with sk_state =
> BT_CLOSED and s->initiator = 0
>
> in that order.
>
> On another headset, for example the Moto H350, we only see the first
> two references removed during disconnect.
>
> - rfcomm_process_sessions() -> __rfcomm_dlc_close() -> rfcomm_dlc_unlink()
> - rfcomm_process_sessions() -> rfcomm_process_rx() -> rfcomm_recv_ua()
> with dlci = 0 and s->state = BT_DISCONN
>
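To make the imbalance in the analysis above concrete, here is a small user-space C model. It is a hypothetical sketch (struct refcnt_model, model_put() and model_replay() are invented names), not the rfcomm code: it only counts references the way the analysis does, starting from the 2 idle references and treating a put that drops the count to zero as the call to rfcomm_session_del(). It ignores the temporary hold/put pairs taken elsewhere in the rfcomm core. Any put after the count has reached zero is flagged as a use-after-free.

```c
#include <assert.h>

/* Hypothetical model of the session refcount; not kernel code. */
struct refcnt_model {
	int refcnt;          /* stands in for s->refcnt */
	int del_calls;       /* how often rfcomm_session_del() would run */
	int freed;           /* set once the count has reached zero */
	int use_after_free;  /* set if a put touches an already-freed session */
};

/* Models rfcomm_session_put(): drop one reference and "delete" the
 * session when the count reaches zero. */
static void model_put(struct refcnt_model *m)
{
	if (m->freed)
		m->use_after_free = 1;  /* the real code would touch freed memory */
	if (--m->refcnt == 0) {
		m->del_calls++;
		m->freed = 1;
	}
}

/* Replays `puts` reference drops against a session that starts with
 * `initial_refs` references. */
static struct refcnt_model model_replay(int initial_refs, int puts)
{
	struct refcnt_model m = { initial_refs, 0, 0, 0 };

	for (int i = 0; i < puts; i++)
		model_put(&m);
	return m;
}
```

With the counts from the analysis, model_replay(2, 3) (H270: dlc unlink, UA on dlci 0, then BT_CLOSED with s->initiator = 0) deletes the session once and then flags a use-after-free on the third put, while model_replay(2, 2) (H350) deletes the session exactly once and touches nothing afterwards.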

How about this. We still call rfcomm_process_rx(), but avoid the
rfcomm_session_put() due to RFCOMM UA when the socket state is
BT_CLOSED.

It is less invasive, so it might address Marcel's concerns with regard to
scheduling changes.


diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
index c109a3a..333c6e9 100644
--- a/net/bluetooth/rfcomm/core.c
+++ b/net/bluetooth/rfcomm/core.c
@@ -1105,6 +1105,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
 		}
 	} else {
 		/* Control channel */
+		struct socket *sock = s->sock;
+		struct sock *sk = sock->sk;
 		switch (s->state) {
 		case BT_CONNECT:
 			s->state = BT_CONNECTED;
@@ -1112,7 +1114,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
 			break;
 
 		case BT_DISCONN:
-			rfcomm_session_put(s);
+			if (sk->sk_state != BT_CLOSED)
+				rfcomm_session_put(s);
 			break;
 		}
 	}
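The effect of the BT_CLOSED guard in this patch can be checked with a simple counting model. The C sketch below is hypothetical: patched_model and the model_* helpers are invented names, and MODEL_BT_CONNECTED = 1 / MODEL_BT_CLOSED = 9 simply mirror the socket states printed in the logs. It replays the dlc unlink, the UA for dlci 0, and the BT_CLOSED branch at the end of rfcomm_process_rx(), with and without the sk->sk_state != BT_CLOSED check. It deliberately ignores the extra hold/put pairs taken around processing and in rfcomm_session_close(), so it only illustrates how the three puts balance against two references, not the full session lifetime.

```c
#include <assert.h>

/* Mirrors the socket states seen in the logs; hypothetical model only. */
enum { MODEL_BT_CONNECTED = 1, MODEL_BT_CLOSED = 9 };

struct patched_model {
	int refcnt;
	int initiator;       /* s->initiator */
	int freed;           /* refcount reached zero (session deleted) */
	int use_after_free;  /* a put hit an already-deleted session */
};

static void model_put(struct patched_model *m)
{
	if (m->freed)
		m->use_after_free = 1;
	if (--m->refcnt == 0)
		m->freed = 1;
}

/* BT_DISCONN branch of rfcomm_recv_ua() for the control channel; with
 * `patched` set, the put is skipped once the L2CAP socket is closed. */
static void model_recv_ua_dlci0(struct patched_model *m, int sk_state,
				int patched)
{
	if (!patched || sk_state != MODEL_BT_CLOSED)
		model_put(m);
}

/* BT_CLOSED branch at the end of rfcomm_process_rx(). */
static void model_rx_closed(struct patched_model *m)
{
	if (!m->initiator)
		model_put(m);
}

/* Replays one disconnect: dlc unlink, UA on dlci 0 seen with the given
 * L2CAP socket state, then (optionally) the BT_CLOSED branch. */
static struct patched_model model_disconnect(int sk_state_at_ua,
					     int rx_sees_closed, int patched)
{
	struct patched_model m = { .refcnt = 2, .initiator = 0 };

	model_put(&m);                 /* rfcomm_dlc_unlink() */
	model_recv_ua_dlci0(&m, sk_state_at_ua, patched);
	if (rx_sees_closed)
		model_rx_closed(&m);
	return m;
}
```

In this model, model_disconnect(MODEL_BT_CLOSED, 1, 0) reproduces the unpatched H270 order and flags a use-after-free, while the same order with patched = 1 deletes the session once and cleanly. model_disconnect(MODEL_BT_CONNECTED, 0, 1) approximates the H350 order, where the UA put still runs because the socket is not yet closed.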

2009-09-22 00:52:59

by Nick Pelly

[permalink] [raw]
Subject: Re: kernel panic happens when disconnecting Bluetooth headset

On Mon, Sep 14, 2009 at 2:10 AM, Lan Zhu <[email protected]> wrote:
> Hi Marcel,
>
> 2009/9/12 Marcel Holtmann <[email protected]>:
>> Hi Zhu,
>>
>>> >> We met a issue that kernel panic happens when disconnecting some kinds
>>> >> of Bluetooth headset, then we did some analysis and made some changes
>>> >> on kernel code which have avoided the panic happening. Would you
>>> >> please help to check if our analysis and fix is correct?
>>> >>
>>> >> =============
>>> >> Issue description
>>> >> =============
>>> >> On Android platform(kernel 2.6.29), disconnecting Bluetooth headset
>>> >> may cause kernel panic on certain conditions.
>>> >>
>>> >> (Pre-condition is android paired with headset.)
>>> >> Initiate the connection from android, disconnect it from android, result is OK.
>>> >> Initiate the connection from android, disconnect it from headset, result is OK.
>>> >> Initiate the connection from headset, disconnect it from headset, result is OK.
>>> >> Initiate the connection from headset, disconnect it from android, for
>>> >> Motorola H12 headset, result is OK.
>>> >> Initiate the connection from headset, disconnect it from android, for
>>> >> Motorola H620/560 headset, result is kernel panic.
>>> >>
>>> >> =============
>>> >> Kernel panic point
>>> >> =============
>>> >> kernel panic at __list_del() in the function rfcomm_session_del() ,
>>> >> panic reason is "Unable to handle kernel paging request at virtual
>>> >> address 00200200"
>>> >>
>>> >> =============
>>> >> Kernel log analysis
>>> >> =============
>>> >> rfcomm_session_del() is still called after the session entry is
>>> >> removed from the list. Then __list_del() will cause kernel panic
>>> >> because of the incorrect pointer. This situation occurs when calling
>>> >> rfcomm_recv_ua() when the socket state is BT_CLOSED . So we need to
>>> >> find out why the socket state become BT_CLOSED before we calling
>>> >> rfcomm_recv_ua().
>>> >>
>>> >> # [  171.677429] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>>> >> [  171.683532] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 14
>>> >> [  171.689422] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>>> >> [  171.695709] rfcomm_process_rx: @@@ @@@ sk_state = 1
>>> >> [  171.701110] rfcomm_process_dlcs: session cc751be0 state 1
>>> >> [  171.706939] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40
>>> >> rx_credits 33 tx_credits 31
>>> >> [  171.715515] rfcomm_send_frame: session cc751be0 len 18
>>> >> [  171.721130] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>>> >> [  174.127807] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>>> >> [  174.134490] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 6
>>> >> [  174.141540] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>>> >> [  174.148498] rfcomm_process_rx: @@@ @@@ sk_state = 1
>>> >> [  174.154968] rfcomm_process_dlcs: session cc751be0 state 1
>>> >> [  174.161437] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40
>>> >> rx_credits 33 tx_credits 30
>>> >> [  174.171173] rfcomm_send_frame: session cc751be0 len 10
>>> >> [  174.177642] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>>> >> [  174.205932] rfcomm_sock_release: sock ce9a0960, sk cc5c4c00
>>> >> [  174.212707] rfcomm_sock_shutdown: sock ce9a0960, sk cc5c4c00
>>> >> [  174.220031] __rfcomm_sock_close: sk cc5c4c00 state 1 socket ce9a0960
>>> >> [  174.227508] __rfcomm_dlc_close: dlc cd3fe920 state 1 dlci 20 err 0
>>> >> session cc751be0
>>> >> [  174.236877] rfcomm_send_disc: cc751be0 dlci 20
>>> >> [  174.242706] rfcomm_send_frame: session cc751be0 len 4
>>> >> [  174.248962] rfcomm_dlc_set_timer: dlc cd3fe920 state 8 timeout 2560
>>> >> [  174.256835] rfcomm_sock_kill: sk cc5c4c00 state 1 refcnt 2
>>> >> [  174.263336] rfcomm_sock_destruct: sk cc5c4c00 dlc cd3fe920
>>> >> [  174.399444] rfcomm_l2data_ready: ccf70400 bytes 4
>>> >> [  174.404724] rfcomm_process_rx: session cc751be0 state 1 qlen 1
>>> >> [  174.411010] rfcomm_process_rx: @@@ @@@ sk_state = 1
>>> >> [  174.416412] rfcomm_recv_ua: session cc751be0 state 1 dlci 20
>>> >> [  174.422515] __rfcomm_dlc_close: dlc cd3fe920 state 9 dlci 20 err 0
>>> >> session cc751be0
>>> >> [  174.430816] rfcomm_dlc_clear_timer: dlc cd3fe920 state 9
>>> >> [  174.436553] rfcomm_dlc_unlink: dlc cd3fe920 refcnt 1 session cc751be0
>>> >> [  174.443572] rfcomm_dlc_free: cd3fe920
>>> >> [  174.447570] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>>> >> [  174.454528] rfcomm_send_disc: cc751be0 dlci 0
>>> >> [  174.459259] rfcomm_send_frame: session cc751be0 len 4
>>> >> [  174.464904] rfcomm_process_dlcs: session cc751be0 state 8
>>> >> [  174.470703] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>>> >> [  174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>>> >> [  174.903442] rfcomm_l2state_change: ccf70400 state 9
>>> >> [  174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>>> >> [  174.915130] rfcomm_process_rx: @@@ @@@ sk_state = 9
>>> >> [  174.920562] rfcomm_recv_ua: session cc751be0 state 8 dlci 0
>>> >> [  174.926574] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>>> >> [  174.933532] rfcomm_process_rx: @@@ @@@ sk_state == BT_CLOSED , s->initiator=0
>>> >> [  174.941253] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>>> >> [  174.948211] rfcomm_session_del: session cc751be0 state 8
>>> >> [  174.953918] @@@@ in rfcomm_session_del()
>>> >> [  174.958312] @@@@ s->list = cc751be0
>>> >> [  174.962097] @@@@ s->list.next = ccbfe9a0
>>> >> [  174.966369] @@@@ s->list.prev = c047d524
>>> >> [  174.970733] @@@@ list is valid, call list_del()
>>> >> [  174.975646] @@@@ after list_del()
>>> >> [  174.979278] @@@@ s->list = cc751be0
>>> >> [  174.983184] @@@@ s->list.next = 00100100
>>> >> [  174.987457] @@@@ s->list.prev = 00200200
>>> >> [  174.991729] rfcomm_session_close: session cc751be0 state 8 err 104
>>> >> [  174.998504] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>>> >> [  175.005310] rfcomm_session_del: session cc751be0 state 9
>>> >> [  175.011169] @@@@ in rfcomm_session_del()
>>> >> [  175.015441] @@@@ s->list = cc751be0
>>> >> [  175.019409] @@@@ s->list.next = 00100100
>>> >> [  175.023651] @@@@ s->list.prev = 00200200
>>> >> [  175.027923] @@@@ list is valid, call list_del()
>>> >> [  175.032958] Unable to handle kernel paging request at virtual
>>> >> address 00200200
>>> >> [  175.040679] pgd = c0004000
>>> >> [  175.043792] [00200200] *pgd=00000000
>>> >> [  175.047821] Internal error: Oops: 817 [#1]
>>> >> [  175.052246] Modules linked in:
>>> >> [  175.055725] CPU: 0    Not tainted  (2.6.29-omap1-dirty #34)
>>> >> [  175.061859] PC is at rfcomm_session_del+0x6c/0x108
>>> >> [  175.067047] LR is at release_console_sem+0x190/0x1a0
>>> >> [  175.072509] pc : [<c033ded8>]    lr : [<c0066308>]    psr: 60000013
>>> >> [  175.072509] sp : cc1abf38  ip : cc1abe68  fp : cc1abf4c
>>> >> [  175.084960] r10: cc751c04  r9 : c036d2fc  r8 : cc751be0
>>> >> [  175.090545] r7 : 00000068  r6 : cc751c04  r5 : 00000009  r4 : cc751be0
>>> >> [  175.097656] r3 : 00100100  r2 : 00100100  r1 : 00200200  r0 : c0422876
>>> >>
>>> >> =============
>>> >> HCI log analysis
>>> >> =============
>>> >> Comparing the hcidump log of the correct case with that of the panic
>>> >> case, we found only one difference in the message sequence.
>>> >> In the panic case, the headset sends an L2CAP Disconn_Req immediately after
>>> >> sending the RFCOMM UA frame to Android. We think this is what
>>> >> causes the socket state to become BT_CLOSED.
>>> >>
>>> >> Please compare these two logs, paying attention to the direction
>>> >> of the last Disconn_Req.
>>> >>
>>> >>
>>> >> Log of correct case:
>>> >> ----------------------------
>>> >>
>>> >>
>>> >> 009-09-10 15:27:28.963519 < ACL data: handle 1 flags 0x02 dlen 22
>>> >>     L2CAP(d): cid 0x0047 len 18 [psm 3]
>>> >>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>>> >>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>>> >> 009-09-10 15:27:28.967272 > HCI Event: Number of Completed Packets (0x13) plen
>>> >>
>>> >>     handle 1 packets 1
>>> >> 009-09-10 15:27:29.243945 < ACL data: handle 1 flags 0x02 dlen 8
>>> >>     L2CAP(d): cid 0x0047 len 4 [psm 3]
>>> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>>> >> 009-09-10 15:27:29.247363 > HCI Event: Number of Completed Packets (0x13) plen
>>> >>
>>> >>     handle 1 packets 1
>>> >> 009-09-10 15:27:29.274890 > ACL data: handle 1 flags 0x02 dlen 8
>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>>> >> 009-09-10 15:27:29.296343 < ACL data: handle 1 flags 0x02 dlen 8
>>> >>     L2CAP(d): cid 0x0047 len 4 [psm 3]
>>> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>>> >> 009-09-10 15:27:29.298480 > HCI Event: Number of Completed Packets (0x13) plen
>>> >>
>>> >>     handle 1 packets 1
>>> >> 009-09-10 15:27:29.319873 > ACL data: handle 1 flags 0x02 dlen 8
>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>>> >> 009-09-10 15:27:29.320727 < ACL data: handle 1 flags 0x02 dlen 12
>>> >>     L2CAP(s): Disconn req: dcid 0x0047 scid 0x0040
>>> >> 009-09-10 15:27:29.323474 > HCI Event: Number of Completed Packets (0x13) plen
>>> >>
>>> >>     handle 1 packets 1
>>> >> 009-09-10 15:27:29.337237 > ACL data: handle 1 flags 0x02 dlen 12
>>> >>     L2CAP(s): Disconn rsp: dcid 0x0047 scid 0x0040
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Log of panic case:
>>> >> ------------------------
>>> >>
>>> >>
>>> >>
>>> >> 2009-09-10 13:34:24.020208 < ACL data: handle 1 flags 0x02 dlen 22
>>> >>     L2CAP(d): cid 0x0041 len 18 [psm 3]
>>> >>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>>> >>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>>> >> 2009-09-10 13:34:24.281256 > HCI Event: Number of Completed Packets (0x13) plen 5
>>> >>     handle 1 packets 1
>>> >> 2009-09-10 13:34:24.083580 < ACL data: handle 1 flags 0x02 dlen 8
>>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>>> >> 2009-09-10 13:34:24.529442 > HCI Event: Number of Completed Packets (0x13) plen 5
>>> >>     handle 1 packets 1
>>> >> 2009-09-10 13:34:24.531914 > ACL data: handle 1 flags 0x02 dlen 8
>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>>> >> 2009-09-10 13:34:24.533135 < ACL data: handle 1 flags 0x02 dlen 8
>>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>>> >> 2009-09-10 13:34:25.028649 > HCI Event: Number of Completed Packets (0x13) plen 5
>>> >>     handle 1 packets 1
>>> >> 2009-09-10 13:34:25.032128 > ACL data: handle 1 flags 0x02 dlen 8
>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>>> >> 2009-09-10 13:34:25.032341 > ACL data: handle 1 flags 0x02 dlen 12
>>> >>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>>> >> 2009-09-10 13:34:25.032646 < ACL data: handle 1 flags 0x02 dlen 12
>>> >>     L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
>>> >>
>>> >> =============
>>> >> Analysis Result
>>> >> =============
>>> >> Some Bluetooth headsets, such as the Motorola H560/H620, which
>>> >> are based on the BCM2044S, send an L2CAP Disconn_Req command right
>>> >> after sending the RFCOMM UA frame. This L2CAP Disconn_Req causes the
>>> >> RFCOMM socket state to become BT_CLOSED before the UA frame has been
>>> >> completely handled, which leads to the kernel panic. I think we can ignore
>>> >> received RFCOMM frames if the socket state is BT_CLOSED, because they
>>> >> don't make sense in the BT_CLOSED state.
>>> >>
>>> >>
>>> >> =============
>>> >> Changed Code
>>> >> =============
>>> >> We changed the code in the function rfcomm_process_rx() in
>>> >> net/bluetooth/rfcomm/core.c to check the socket state before
>>> >> handling the received frames. If the socket state is BT_CLOSED, we
>>> >> don't handle any RFCOMM frames but just close the session.
>>> >>
>>> >> The change is as follows:
>>> >>
>>> >> +       if (sk->sk_state != BT_CLOSED) {
>>> >>         /* Get data directly from socket receive queue without copying it. */
>>> >>         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>>> >>                 skb_orphan(skb);
>>> >>                 rfcomm_recv_frame(s, skb);
>>> >>         }
>>> >> -
>>> >> -       if (sk->sk_state == BT_CLOSED) {
>>> >> +       } else {
>>> >>                 if (!s->initiator)
>>> >>                         rfcomm_session_put(s);
>>> >>
>>> >>                 rfcomm_session_close(s, sk->sk_err);
>>> >>         }
>>> >
>>> > so I do see the issue here, but I don't agree with the fix since it
>>> > changes behavior that might cause other issues. So in case the frame
>>> > processing leads to sk->sk_state == BT_CLOSED we are not closing the
>>> > connection anymore if we make it depend on a state before the frame
>>> > processing. And nothing guarantees that rfcomm_process_rx gets scheduled
>>> > again.
>>> >
>>> > diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
>>> > index 94b3388..606143b 100644
>>> > --- a/net/bluetooth/rfcomm/core.c
>>> > +++ b/net/bluetooth/rfcomm/core.c
>>> > @@ -1798,12 +1798,16 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)
>>> >
>>> >         BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->sk_receive_queue));
>>> >
>>> > +       rfcomm_session_hold(s);
>>> > +
>>> >         /* Get data directly from socket receive queue without copying it. */
>>> >         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>>> >                 skb_orphan(skb);
>>> >                 rfcomm_recv_frame(s, skb);
>>> >         }
>>> >
>>> > +       rfcomm_session_put(s);
>>> > +
>>> >         if (sk->sk_state == BT_CLOSED) {
>>> >                 if (!s->initiator)
>>> >                         rfcomm_session_put(s);
>>> >
>>> > What does the above patch do for you? Since if I read it correctly, then
>>> > the rfcomm_recv_ua causes the rfcomm_session_put to trigger the closing
>>> > of the session. And then in this case it is delayed until after all
>>> > frames are processed.
>>> >
>>> >
>>>
>>> I tried your patch, but unfortunately the kernel panic still happened.
>>>
>>> From the logs I noticed that if rfcomm_l2state_change is called before
>>> rfcomm_process_rx, the kernel panic always happens.
>>>
>>> The lines below are from the correct log:
>>>
>>> [ 139.323852] rfcomm_l2data_ready: ccf70000 bytes 4
>>> [ 139.346252] rfcomm_process_rx: session ccb94ce0 state 8 qlen 1
>>> ...
>>> [ 139.457519] rfcomm_l2state_change: ccf70000 state 9
>>> (disconnect ok)
>>>
>>> In this case, when process_rx runs, the code under the condition "if
>>> (sk->sk_state == BT_CLOSED)" never runs.
>>>
>>> The lines below are from the panic log:
>>>
>>> [ 174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>>> [ 174.903442] rfcomm_l2state_change: ccf70400 state 9
>>> [ 174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>>> ...
>>> ( then panic)
>>>
>>> In this case, sk_state changes to 9 (BT_CLOSED) first and process_rx
>>> runs afterwards, so the code under the condition "if (sk->sk_state ==
>>> BT_CLOSED)" runs and session_put is called twice. I think this is the
>>> root cause of the panic.
>>
>> I know why it happens, that is not the problem. My point is not to break
>> current scheduling assumptions.
>>
>> So if you now move the rfcomm_session_put() to the end of the function,
>> then it should be fine, right?
>>
>> Regards
>>
>> Marcel
>>
>>
>>
>
> You are right. I moved the rfcomm_session_put() to the end of
> rfcomm_process_rx() and the kernel panic doesn't happen any longer.
>
> The changed code is as follows:
>
> @@ -1796,6 +1796,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
>
>         BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->s
>
> +       rfcomm_session_hold(s);
> +
>         /* Get data directly from socket receive queue without copying it. */
>         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>                 skb_orphan(skb);
> @@ -1808,6 +1810,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
>
>                 rfcomm_session_close(s, sk->sk_err);
>         }
> +
> +       rfcomm_session_put(s);
>  }
>
> =A0static inline void rfcomm_accept_connection(struct rfcomm_session *s)
>
> Please submit this change to the BlueZ release.
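The reference-count sequence in the H620 log above can be replayed in a small userspace model. This is a sketch under stated assumptions, not kernel code: session_hold(), session_put(), session_close(), original_rx() and bracketed_rx() are hypothetical stand-ins for rfcomm_session_hold(), rfcomm_session_put(), rfcomm_session_close() and the two versions of rfcomm_process_rx(). The model starts from the refcnt of 2 that the debug print shows just before rfcomm_recv_ua(), assumes rfcomm_session_close() brackets its work with its own hold/put (the log's close, put, second rfcomm_session_del sequence suggests this), and ignores every other holder of the session.

```c
#include <assert.h>

/* Hypothetical userspace model of the RFCOMM session refcount.
 * Not kernel code: no locking, no other holders of the session. */
struct session {
	int refcnt;
	int deletions; /* how many times the model's session_del() ran */
};

static void session_del(struct session *s)
{
	s->deletions++; /* the kernel would list_del() and free here */
}

static void session_hold(struct session *s)
{
	s->refcnt++;
}

static void session_put(struct session *s)
{
	if (--s->refcnt == 0)
		session_del(s);
}

/* Assumption: close brackets its work with hold/put, as the log suggests. */
static void session_close(struct session *s)
{
	session_hold(s);
	/* ... mark the session BT_CLOSED and close all DLCs ... */
	session_put(s);
}

/* Original ordering from the panic log: the UA for dlci 0 is processed
 * after the socket is already BT_CLOSED, with s->initiator == 0. */
static int original_rx(int refcnt)
{
	struct session s = { refcnt, 0 };

	assert(refcnt > 0);
	session_put(&s);   /* rfcomm_recv_ua(), dlci 0, BT_DISCONN */
	session_put(&s);   /* sk_state == BT_CLOSED && !s->initiator */
	session_close(&s); /* hold/put on a session that was already deleted */
	return s.deletions;
}

/* Same ordering with a hold at the top of the function and the put
 * moved to the very end, as in the last patch above. */
static int bracketed_rx(int refcnt)
{
	struct session s = { refcnt, 0 };

	assert(refcnt > 0);
	session_hold(&s);  /* added at the top */
	session_put(&s);   /* rfcomm_recv_ua() */
	session_put(&s);   /* BT_CLOSED branch */
	session_close(&s);
	session_put(&s);   /* added at the end */
	return s.deletions;
}
```

Starting from 2, original_rx() deletes the session twice, matching the two rfcomm_session_del lines in the log, while bracketed_rx() deletes it once. Note that in this model the single deletion happens at the final put, so a caller that still uses the session after rfcomm_process_rx() returns would touch freed memory unless it holds its own reference.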


Unfortunately, with this change I get a panic when disconnecting from a
Motorola H270 in the case where the headset initiated the RFCOMM
connection and we disconnect RFCOMM.

Here is the hcidump:

2009-09-21 17:22:37.384811 < ACL data: handle 1 flags 0x02 dlen 22
L2CAP(d): cid 0x0041 len 18 [psm 3]
RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
0000: 0d 0a 2b 43 49 45 56 3a 20 37 2c 33 0d 0a ..+CIEV: 7,3..
2009-09-21 17:22:37.502273 > HCI Event: Number of Completed Packets
(0x13) plen 5
handle 1 packets 1
2009-09-21 17:22:37.788895 < ACL data: handle 1 flags 0x02 dlen 8
L2CAP(d): cid 0x0041 len 4 [psm 3]
RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
2009-09-21 17:22:37.906204 > HCI Event: Number of Completed Packets
(0x13) plen 5
handle 1 packets 1
2009-09-21 17:22:37.933090 > ACL data: handle 1 flags 0x02 dlen 8
L2CAP(d): cid 0x0040 len 4 [psm 3]
RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
2009-09-21 17:22:38.636764 < ACL data: handle 1 flags 0x02 dlen 8
L2CAP(d): cid 0x0041 len 4 [psm 3]
RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
2009-09-21 17:22:38.744125 > HCI Event: Number of Completed Packets
(0x13) plen 5
handle 1 packets 1
2009-09-21 17:22:38.763687 > ACL data: handle 1 flags 0x02 dlen 8
L2CAP(d): cid 0x0040 len 4 [psm 3]
RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
2009-09-21 17:22:38.783554 > ACL data: handle 1 flags 0x02 dlen 12
L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
2009-09-21 17:22:39.029526 < ACL data: handle 1 flags 0x02 dlen 12
L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
2009-09-21 17:22:39.136581 > HCI Event: Number of Completed Packets
(0x13) plen 5
handle 1 packets 1
2009-09-21 17:22:41.337203 > HCI Event: Disconn Complete (0x05) plen 4
status 0x00 handle 1 reason 0x13
Reason: Remote User Terminated Connection

And the panic:

<7>[ 3161.665557] rfcomm:rfcomm_session_del: session c9c06ad0 state 9
<7>[ 3161.671905] l2cap:l2cap_sock_release: sock cea04360, sk c97f02f8
<7>[ 3161.678497] l2cap:l2cap_sock_shutdown: sock cea04360, sk c97f02f8
<7>[ 3161.685028] l2cap:l2cap_sock_kill: sk c97f02f8 state 9
<7>[ 3161.695587] l2cap:l2cap_sock_destruct: sk c97f02f8
<4>[ 3161.700805] npelly 1911 rfcomm_process_sessions session c9c06ad0
refcnt 1802201963
<7>[ 3161.709014] rfcomm:rfcomm_process_dlcs: session c9c06ad0 state 1802201963
<7>[ 3161.716308] rfcomm:rfcomm_process_dlcs: session c9c06ad0 dlc 6b6b6b6b
<1>[ 3161.726776] Unable to handle kernel paging request at virtual
address 6b6b6b6b
<1>[ 3161.734619] pgd = c0004000
<1>[ 3161.737609] [6b6b6b6b] *pgd=00000000
<4>[ 3161.741638] Internal error: Oops: 5 [#1] PREEMPT
<4>[ 3161.746734] Modules linked in:
<4>[ 3161.750213] CPU: 0 Not tainted
(2.6.29-omap1-07358-g9a3fd55-dirty #206)
<4>[ 3161.757629] PC is at rfcomm_process_dlcs+0x108/0x590
<4>[ 3161.762969] LR is at preempt_schedule+0x44/0x54
<4>[ 3161.767852] pc : [<c03911f4>] lr : [<c03a27c4>] psr: 60000113
<4>[ 3161.767883] sp : ccdf9e80 ip : ccdf9dd8 fp : ccdf9edc
<4>[ 3161.780273] r10: 00000000 r9 : c9c06af4 r8 : c9c06ad0
<4>[ 3161.786010] r7 : 00000000 r6 : c9c06ad0 r5 : c4c68680 r4 : 6b6b6b6b
<4>[ 3161.792968] r3 : c9c06ae0 r2 : ccdf8000 r1 : c61a8940 r0 : 0000004c
<4>[ 3161.800079] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM
Segment kernel
<4>[ 3161.807983] Control: 10c5387d Table: 86db8019 DAC: 00000017
<4>[ 3161.814147]
<4>[ 3161.814147] PC: 0xc0391174:
[...]
<4>[ 3162.973175] Backtrace:
<4>[ 3162.976013] [<c03910ec>] (rfcomm_process_dlcs+0x0/0x590) from
[<c03930b0>] (rfcomm_process_sessions+0x1a34/0x1a9c)
<4>[ 3162.987579] [<c039167c>] (rfcomm_process_sessions+0x0/0x1a9c)
from [<c03932ec>] (rfcomm_run+0x1d4/0x2ac)
<4>[ 3162.998199] [<c0393118>] (rfcomm_run+0x0/0x2ac) from
[<c008e7d8>] (kthread+0x5c/0x94)
<4>[ 3163.013763] [<c008e77c>] (kthread+0x0/0x94) from [<c007c998>]
(do_exit+0x0/0x714)
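The values in this oops look like slab poisoning: 1802201963 is 0x6b6b6b6b, and 0x6b is POISON_FREE from include/linux/poison.h, the byte that slab debugging writes over freed objects when poisoning is enabled. So refcnt, state and the dlc pointer 6b6b6b6b are most likely being read out of a session that had already been freed, rather than out of a live but corrupted one. The sketch below only demonstrates the arithmetic; struct fake_session, poison_free() and refcnt_after_free() are hypothetical and merely mimic what a poisoning allocator does, and the real struct rfcomm_session layout differs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte written over freed objects when slab poisoning is enabled
 * (POISON_FREE in include/linux/poison.h). */
#define POISON_FREE 0x6b

/* Hypothetical stand-in for a session with 32-bit fields, as on the
 * ARM target. Not the real struct rfcomm_session layout. */
struct fake_session {
	uint32_t refcnt;
	uint32_t state;
	uint32_t dlcs_next; /* first pointer of the dlc list head */
};

/* Mimic what a poisoning allocator does to an object when it is freed. */
static void poison_free(void *obj, size_t len)
{
	assert(obj != NULL);
	memset(obj, POISON_FREE, len);
}

/* Read a field back after the object has been "freed". */
static uint32_t refcnt_after_free(void)
{
	struct fake_session s = { 1, 9 /* BT_CLOSED */, 0 };

	poison_free(&s, sizeof(s));
	return s.refcnt;
}
```

Every 32-bit field of a poisoned object reads back as 0x6b6b6b6b (1802201963 in decimal) regardless of byte order, which is exactly what the refcnt and state prints and r4 show above.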


Seems like this fix avoids the panic due to calling
rfcomm_session_close() on a deleted session, but does not always
address the unbalanced rfcomm_session_put() which may be the root
cause.
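The first panic, in the H620 log, shows the other side of the same imbalance. After the first rfcomm_session_del(), the log prints s->list.next = 00100100 and s->list.prev = 00200200; those are LIST_POISON1 and LIST_POISON2 from include/linux/poison.h, which list_del() writes into a removed entry. A second list_del() on the same session therefore dereferences poison, and the oops address 00200200 is LIST_POISON2. Below is a minimal userspace re-implementation of the kernel list operations for illustration, assuming the poison values exactly as the log shows them; the assert in this list_del() is an addition for the sketch only, since the plain kernel list_del() does no such check.

```c
#include <assert.h>

/* Poison values as seen in the log after list_del(). */
#define LIST_POISON1 ((void *) 0x00100100)
#define LIST_POISON2 ((void *) 0x00200200)

struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static void list_del(struct list_head *entry)
{
	/* Sketch-only check: refuse a second deletion instead of writing
	 * through the poison pointers as the kernel would. */
	assert(entry->next != LIST_POISON1 && entry->prev != LIST_POISON2);

	/* __list_del(): unlink the entry from its neighbours */
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;

	/* Poison the entry so any later use faults at a known address */
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}
```

After one list_del() the list is empty again and the removed entry holds the two poison values, just like s->list in the log. Calling list_del() on it a second time trips the assert here; without the assert, the unlink step would dereference the poison pointers, which is what the second rfcomm_session_del() does in the H620 log.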

Lan Zhu suspected this in the original post, and his original fix does
in fact fix this panic as well as the originally reported panic,
because it avoids the unbalanced rfcomm_session_put().

Marcel, I know you are concerned about the original fix changing
scheduling assumptions; are you able to comment on this further?

Are there any other suggestions for patches for this issue? I have
spent the best part of the day trying to figure this one out, but the
refcounting in the RFCOMM core is quite subtle and I think it really
needs someone familiar with the code to take a quick look and come up
with the safest patch. I can run tests.

In the meantime, I am doing some testing of Lan Zhu's original fix,
and if there are no better suggestions we will run with that one for
Android.

Nick


Some more analysis:

With the RFCOMM connection idle, there are 2 references on s->refcnt.

However, three references are removed, in this order, during disconnect
with the H270:
- rfcomm_process_sessions() -> __rfcomm_dlc_close() -> rfcomm_dlc_unlink()
- rfcomm_process_sessions() -> rfcomm_process_rx() -> rfcomm_recv_ua()
  with dlci = 0 and s->state = BT_DISCONN
- rfcomm_process_sessions() -> rfcomm_process_rx() with
  sk_state = BT_CLOSED and s->initiator = 0

On another headset, for example the Moto H350, we only see the first
two references removed during disconnect:

- rfcomm_process_sessions() -> __rfcomm_dlc_close() -> rfcomm_dlc_unlink()
- rfcomm_process_sessions() -> rfcomm_process_rx() -> rfcomm_recv_ua()
  with dlci = 0 and s->state = BT_DISCONN
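The counting above can be checked with a trivial model: start from the two references an idle session holds and apply the puts listed for each headset. The strings are only labels copied from the analysis; apply_puts() is a hypothetical helper, not anything in the RFCOMM code, and it ignores temporary holds such as the one in the hold/put patch.

```c
#include <assert.h>
#include <stdio.h>

/* Puts observed during disconnect, per the analysis above (labels only).
 * The H270 shows all three; the H350 only the first two. */
static const char *const disconnect_puts[] = {
	"__rfcomm_dlc_close() -> rfcomm_dlc_unlink()",
	"rfcomm_recv_ua(), dlci 0, BT_DISCONN",
	"rfcomm_process_rx(), BT_CLOSED, !s->initiator",
};

enum { H350_NPUTS = 2, H270_NPUTS = 3 };

/* Apply the first nputs puts to a count starting at refcnt, print each
 * step and return the final count. A negative result means there were
 * more puts than references: every put after the one that reached zero
 * operates on a session that has already been deleted. */
static int apply_puts(int refcnt, const char *const *puts, int nputs)
{
	assert(nputs >= 0);
	for (int i = 0; i < nputs; i++) {
		refcnt--;
		printf("%-46s refcnt -> %d%s\n", puts[i], refcnt,
		       refcnt == 0 ? " (session deleted)" :
		       refcnt < 0 ? " (put on deleted session)" : "");
	}
	return refcnt;
}
```

With two references, the H350 sequence ends exactly at zero, one deletion, while the H270 sequence ends at -1: its third put lands on a session that is already gone. That is consistent with, though not proof of, the use after free visible as 0x6b6b6b6b in the oops above.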

2009-09-18 15:24:13

by Marcel Holtmann

[permalink] [raw]
Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Hi Andrei,

> >> >>> >> We hit an issue where a kernel panic happens when disconnecting some kinds
> >> >>> >> of Bluetooth headsets. We did some analysis and made some changes
> >> >>> >> to the kernel code which avoid the panic. Would you
> >> >>> >> please help check whether our analysis and fix are correct?
> >> >>> >>
> >> >>> >> =============
> >> >>> >> Issue description
> >> >>> >> =============
> >> >>> >> On the Android platform (kernel 2.6.29), disconnecting a Bluetooth headset
> >> >>> >> may cause a kernel panic under certain conditions.
> >> >>> >>
> >> >>> >> (Precondition: Android is paired with the headset.)
> >> >>> >> Initiate the connection from Android, disconnect it from Android: result is OK.
> >> >>> >> Initiate the connection from Android, disconnect it from the headset: result is OK.
> >> >>> >> Initiate the connection from the headset, disconnect it from the headset: result is OK.
> >> >>> >> Initiate the connection from the headset, disconnect it from Android, with a
> >> >>> >> Motorola H12 headset: result is OK.
> >> >>> >> Initiate the connection from the headset, disconnect it from Android, with a
> >> >>> >> Motorola H620/560 headset: result is a kernel panic.
> >> >>> >>
> >> >>> >> =============
> >> >>> >> Kernel panic point
> >> >>> >> =============
> >> >>> >> The kernel panics at __list_del() in rfcomm_session_del();
> >> >>> >> the panic reason is "Unable to handle kernel paging request at virtual
> >> >>> >> address 00200200".
> >> >>> >>
> >> >>> >> =============
> >> >>> >> Kernel log analysis
> >> >>> >> =============
> >> >>> >> rfcomm_session_del() is called a second time after the session entry
> >> >>> >> has already been removed from the list. __list_del() then causes a kernel
> >> >>> >> panic because the entry's list pointers are already poisoned
> >> >>> >> (00100100/00200200). This happens when rfcomm_recv_ua() is called while
> >> >>> >> the socket state is BT_CLOSED, so we need to find out why the socket
> >> >>> >> state becomes BT_CLOSED before rfcomm_recv_ua() is called.
> >> >>> >>
> >> >>> >> # [ 171.677429] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
> >> >>> >> [ 171.683532] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 14
> >> >>> >> [ 171.689422] rfcomm_process_rx: session cc751be0 state 1 qlen 0
> >> >>> >> [ 171.695709] rfcomm_process_rx: @@@ @@@ sk_state = 1
> >> >>> >> [ 171.701110] rfcomm_process_dlcs: session cc751be0 state 1
> >> >>> >> [ 171.706939] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40
> >> >>> >> rx_credits 33 tx_credits 31
> >> >>> >> [ 171.715515] rfcomm_send_frame: session cc751be0 len 18
> >> >>> >> [ 171.721130] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
> >> >>> >> [ 174.127807] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
> >> >>> >> [ 174.134490] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 6
> >> >>> >> [ 174.141540] rfcomm_process_rx: session cc751be0 state 1 qlen 0
> >> >>> >> [ 174.148498] rfcomm_process_rx: @@@ @@@ sk_state = 1
> >> >>> >> [ 174.154968] rfcomm_process_dlcs: session cc751be0 state 1
> >> >>> >> [ 174.161437] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40
> >> >>> >> rx_credits 33 tx_credits 30
> >> >>> >> [ 174.171173] rfcomm_send_frame: session cc751be0 len 10
> >> >>> >> [ 174.177642] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
> >> >>> >> [ 174.205932] rfcomm_sock_release: sock ce9a0960, sk cc5c4c00
> >> >>> >> [ 174.212707] rfcomm_sock_shutdown: sock ce9a0960, sk cc5c4c00
> >> >>> >> [ 174.220031] __rfcomm_sock_close: sk cc5c4c00 state 1 socket ce9a0960
> >> >>> >> [ 174.227508] __rfcomm_dlc_close: dlc cd3fe920 state 1 dlci 20 err 0
> >> >>> >> session cc751be0
> >> >>> >> [ 174.236877] rfcomm_send_disc: cc751be0 dlci 20
> >> >>> >> [ 174.242706] rfcomm_send_frame: session cc751be0 len 4
> >> >>> >> [ 174.248962] rfcomm_dlc_set_timer: dlc cd3fe920 state 8 timeout 2560
> >> >>> >> [ 174.256835] rfcomm_sock_kill: sk cc5c4c00 state 1 refcnt 2
> >> >>> >> [ 174.263336] rfcomm_sock_destruct: sk cc5c4c00 dlc cd3fe920
> >> >>> >> [ 174.399444] rfcomm_l2data_ready: ccf70400 bytes 4
> >> >>> >> [ 174.404724] rfcomm_process_rx: session cc751be0 state 1 qlen 1
> >> >>> >> [ 174.411010] rfcomm_process_rx: @@@ @@@ sk_state = 1
> >> >>> >> [ 174.416412] rfcomm_recv_ua: session cc751be0 state 1 dlci 20
> >> >>> >> [ 174.422515] __rfcomm_dlc_close: dlc cd3fe920 state 9 dlci 20 err 0
> >> >>> >> session cc751be0
> >> >>> >> [ 174.430816] rfcomm_dlc_clear_timer: dlc cd3fe920 state 9
> >> >>> >> [ 174.436553] rfcomm_dlc_unlink: dlc cd3fe920 refcnt 1 session cc751be0
> >> >>> >> [ 174.443572] rfcomm_dlc_free: cd3fe920
> >> >>> >> [ 174.447570] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
> >> >>> >> [ 174.454528] rfcomm_send_disc: cc751be0 dlci 0
> >> >>> >> [ 174.459259] rfcomm_send_frame: session cc751be0 len 4
> >> >>> >> [ 174.464904] rfcomm_process_dlcs: session cc751be0 state 8
> >> >>> >> [ 174.470703] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
> >> >>> >> [ 174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
> >> >>> >> [ 174.903442] rfcomm_l2state_change: ccf70400 state 9
> >> >>> >> [ 174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
> >> >>> >> [ 174.915130] rfcomm_process_rx: @@@ @@@ sk_state = 9
> >> >>> >> [ 174.920562] rfcomm_recv_ua: session cc751be0 state 8 dlci 0
> >> >>> >> [ 174.926574] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
> >> >>> >> [ 174.933532] rfcomm_process_rx: @@@ @@@ sk_state == BT_CLOSED , s->initiator=0
> >> >>> >> [ 174.941253] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
> >> >>> >> [ 174.948211] rfcomm_session_del: session cc751be0 state 8
> >> >>> >> [ 174.953918] @@@@ in rfcomm_session_del()
> >> >>> >> [ 174.958312] @@@@ s->list = cc751be0
> >> >>> >> [ 174.962097] @@@@ s->list.next = ccbfe9a0
> >> >>> >> [ 174.966369] @@@@ s->list.prev = c047d524
> >> >>> >> [ 174.970733] @@@@ list is valid, call list_del()
> >> >>> >> [ 174.975646] @@@@ after list_del()
> >> >>> >> [ 174.979278] @@@@ s->list = cc751be0
> >> >>> >> [ 174.983184] @@@@ s->list.next = 00100100
> >> >>> >> [ 174.987457] @@@@ s->list.prev = 00200200
> >> >>> >> [ 174.991729] rfcomm_session_close: session cc751be0 state 8 err 104
> >> >>> >> [ 174.998504] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
> >> >>> >> [ 175.005310] rfcomm_session_del: session cc751be0 state 9
> >> >>> >> [ 175.011169] @@@@ in rfcomm_session_del()
> >> >>> >> [ 175.015441] @@@@ s->list = cc751be0
> >> >>> >> [ 175.019409] @@@@ s->list.next = 00100100
> >> >>> >> [ 175.023651] @@@@ s->list.prev = 00200200
> >> >>> >> [ 175.027923] @@@@ list is valid, call list_del()
> >> >>> >> [ 175.032958] Unable to handle kernel paging request at virtual
> >> >>> >> address 00200200
> >> >>> >> [ 175.040679] pgd = c0004000
> >> >>> >> [ 175.043792] [00200200] *pgd=00000000
> >> >>> >> [ 175.047821] Internal error: Oops: 817 [#1]
> >> >>> >> [ 175.052246] Modules linked in:
> >> >>> >> [ 175.055725] CPU: 0 Not tainted (2.6.29-omap1-dirty #34)
> >> >>> >> [ 175.061859] PC is at rfcomm_session_del+0x6c/0x108
> >> >>> >> [ 175.067047] LR is at release_console_sem+0x190/0x1a0
> >> >>> >> [ 175.072509] pc : [<c033ded8>] lr : [<c0066308>] psr: 60000013
> >> >>> >> [ 175.072509] sp : cc1abf38 ip : cc1abe68 fp : cc1abf4c
> >> >>> >> [ 175.084960] r10: cc751c04 r9 : c036d2fc r8 : cc751be0
> >> >>> >> [ 175.090545] r7 : 00000068 r6 : cc751c04 r5 : 00000009 r4 : cc751be0
> >> >>> >> [ 175.097656] r3 : 00100100 r2 : 00100100 r1 : 00200200 r0 : c0422876
> >> >>> >>
> >> >>> >> =============
> >> >>> >> HCI log analysis
> >> >>> >> =============
> >> >>> >> Comparing the hcidump log of the correct case with that of the panic
> >> >>> >> case, we found only one difference in the message sequence.
> >> >>> >> In the panic case, the headset sends an L2CAP Disconn_Req immediately after
> >> >>> >> sending the RFCOMM UA frame to Android. We think this is what
> >> >>> >> causes the socket state to become BT_CLOSED.
> >> >>> >>
> >> >>> >> Please compare these two logs, paying attention to the direction
> >> >>> >> of the last Disconn_Req.
> >> >>> >>
> >> >>> >>
> >> >>> >> Log of correct case:
> >> >>> >> ----------------------------
> >> >>> >>
> >> >>> >>
> >> >>> >> 009-09-10 15:27:28.963519 < ACL data: handle 1 flags 0x02 dlen 22
> >> >>> >> L2CAP(d): cid 0x0047 len 18 [psm 3]
> >> >>> >> RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
> >> >>> >> 0000: 0d 0a 2b 43 49 45 56 3a 20 37 2c 33 0d 0a ..+CIEV: 7,3..
> >> >>> >> 009-09-10 15:27:28.967272 > HCI Event: Number of Completed Packets (0x13) plen
> >> >>> >>
> >> >>> >> handle 1 packets 1
> >> >>> >> 009-09-10 15:27:29.243945 < ACL data: handle 1 flags 0x02 dlen 8
> >> >>> >> L2CAP(d): cid 0x0047 len 4 [psm 3]
> >> >>> >> RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
> >> >>> >> 009-09-10 15:27:29.247363 > HCI Event: Number of Completed Packets (0x13) plen
> >> >>> >>
> >> >>> >> handle 1 packets 1
> >> >>> >> 009-09-10 15:27:29.274890 > ACL data: handle 1 flags 0x02 dlen 8
> >> >>> >> L2CAP(d): cid 0x0040 len 4 [psm 3]
> >> >>> >> RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
> >> >>> >> 009-09-10 15:27:29.296343 < ACL data: handle 1 flags 0x02 dlen 8
> >> >>> >> L2CAP(d): cid 0x0047 len 4 [psm 3]
> >> >>> >> RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
> >> >>> >> 009-09-10 15:27:29.298480 > HCI Event: Number of Completed Packets (0x13) plen
> >> >>> >>
> >> >>> >> handle 1 packets 1
> >> >>> >> 009-09-10 15:27:29.319873 > ACL data: handle 1 flags 0x02 dlen 8
> >> >>> >> L2CAP(d): cid 0x0040 len 4 [psm 3]
> >> >>> >> RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
> >> >>> >> 009-09-10 15:27:29.320727 < ACL data: handle 1 flags 0x02 dlen 12
> >> >>> >> L2CAP(s): Disconn req: dcid 0x0047 scid 0x0040
> >> >>> >> 009-09-10 15:27:29.323474 > HCI Event: Number of Completed Packets (0x13) plen
> >> >>> >>
> >> >>> >> handle 1 packets 1
> >> >>> >> 009-09-10 15:27:29.337237 > ACL data: handle 1 flags 0x02 dlen 12
> >> >>> >> L2CAP(s): Disconn rsp: dcid 0x0047 scid 0x0040
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >> Log of panic case:
> >> >>> >> ------------------------
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >> 2009-09-10 13:34:24.020208 < ACL data: handle 1 flags 0x02 dlen 22
> >> >>> >> L2CAP(d): cid 0x0041 len 18 [psm 3]
> >> >>> >> RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
> >> >>> >> 0000: 0d 0a 2b 43 49 45 56 3a 20 37 2c 33 0d 0a ..+CIEV: 7,3..
> >> >>> >> 2009-09-10 13:34:24.281256 > HCI Event: Number of Completed Packets (0x13) plen
> >> >>> >> 5
> >> >>> >> handle 1 packets 1
> >> >>> >> 2009-09-10 13:34:24.083580 < ACL data: handle 1 flags 0x02 dlen 8
> >> >>> >> L2CAP(d): cid 0x0041 len 4 [psm 3]
> >> >>> >> RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
> >> >>> >> 2009-09-10 13:34:24.529442 > HCI Event: Number of Completed Packets (0x13) plen
> >> >>> >> 5
> >> >>> >> handle 1 packets 1
> >> >>> >> 2009-09-10 13:34:24.531914 > ACL data: handle 1 flags 0x02 dlen 8
> >> >>> >> L2CAP(d): cid 0x0040 len 4 [psm 3]
> >> >>> >> RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
> >> >>> >> 2009-09-10 13:34:24.533135 < ACL data: handle 1 flags 0x02 dlen 8
> >> >>> >> L2CAP(d): cid 0x0041 len 4 [psm 3]
> >> >>> >> RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
> >> >>> >> 2009-09-10 13:34:25.028649 > HCI Event: Number of Completed Packets (0x13) plen
> >> >>> >> 5
> >> >>> >> handle 1 packets 1
> >> >>> >> 2009-09-10 13:34:25.032128 > ACL data: handle 1 flags 0x02 dlen 8
> >> >>> >> L2CAP(d): cid 0x0040 len 4 [psm 3]
> >> >>> >> RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
> >> >>> >> 2009-09-10 13:34:25.032341 > ACL data: handle 1 flags 0x02 dlen 12
> >> >>> >> L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
> >> >>> >> 2009-09-10 13:34:25.032646 < ACL data: handle 1 flags 0x02 dlen 12
> >> >>> >> L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
> >> >>> >>
> >> >>> >> =============
> >> >>> >> Analysis Result
> >> >>> >> =============
> >> >>> >> Some Bluetooth headsets, such as the Motorola H560/H620, which
> >> >>> >> are based on the BCM2044S, send an L2CAP Disconn_Req command right
> >> >>> >> after sending the RFCOMM UA frame. This L2CAP Disconn_Req causes the
> >> >>> >> RFCOMM socket state to become BT_CLOSED before the UA frame has been
> >> >>> >> completely handled, which leads to the kernel panic. I think we can ignore
> >> >>> >> received RFCOMM frames if the socket state is BT_CLOSED, because they
> >> >>> >> don't make sense in the BT_CLOSED state.
> >> >>> >>
> >> >>> >>
> >> >>> >> =============
> >> >>> >> Changed Code
> >> >>> >> =============
> >> >>> >> We changed the code in the function rfcomm_process_rx() in
> >> >>> >> net/bluetooth/rfcomm/core.c to check the socket state before
> >> >>> >> handling the received frames. If the socket state is BT_CLOSED, we
> >> >>> >> don't handle any RFCOMM frames but just close the session.
> >> >>> >>
> >> >>> >> The change is as follows:
> >> >>> >>
> >> >>> >> + if (sk->sk_state != BT_CLOSED) {
> >> >>> >> /* Get data directly from socket receive queue without copying it. */
> >> >>> >> while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
> >> >>> >> skb_orphan(skb);
> >> >>> >> rfcomm_recv_frame(s, skb);
> >> >>> >> }
> >> >>> >> -
> >> >>> >> - if (sk->sk_state == BT_CLOSED) {
> >> >>> >> + } else {
> >> >>> >> if (!s->initiator)
> >> >>> >> rfcomm_session_put(s);
> >> >>> >>
> >> >>> >> rfcomm_session_close(s, sk->sk_err);
> >> >>> >> }
> >> >>> >
> >> >>> > so I do see the issue here, but I don't agree with the fix since it
> >> >>> > changes behavior that might cause other issues. So in case the frame
> >> >>> > processing leads to sk->sk_state == BT_CLOSED we are not closing the
> >> >>> > connection anymore if we make it depend on a state before the frame
> >> >>> > processing. And nothing guarantees that rfcomm_process_rx gets scheduled
> >> >>> > again.
> >> >>> >
> >> >>> > diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
> >> >>> > index 94b3388..606143b 100644
> >> >>> > --- a/net/bluetooth/rfcomm/core.c
> >> >>> > +++ b/net/bluetooth/rfcomm/core.c
> >> >>> > @@ -1798,12 +1798,16 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)
> >> >>> >
> >> >>> > BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->sk_receive_queue));
> >> >>> >
> >> >>> > + rfcomm_session_hold(s);
> >> >>> > +
> >> >>> > /* Get data directly from socket receive queue without copying it. */
> >> >>> > while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
> >> >>> > skb_orphan(skb);
> >> >>> > rfcomm_recv_frame(s, skb);
> >> >>> > }
> >> >>> >
> >> >>> > + rfcomm_session_put(s);
> >> >>> > +
> >> >>> > if (sk->sk_state == BT_CLOSED) {
> >> >>> > if (!s->initiator)
> >> >>> > rfcomm_session_put(s);
> >> >>> >
> >> >>> > What does the above patch do for you? Since if I read it correctly, then
> >> >>> > the rfcomm_recv_ua causes the rfcomm_session_put to trigger the closing
> >> >>> > of the session. And then in this case it is delayed until after all
> >> >>> > frames are processed.
> >> >>> >
> >> >>> >
> >> >>>
> >> >>> I've tried your patch but unfortunately kernel panic still happened.
> >> >>>
> >> >>> From the log I noticed that if rfcomm_l2state_change is called before
> >> >>> rfcomm_process_rx, kernel panic will happen definitely.
> >> >>>
> >> >>> Below lines are in the correct log,
> >> >>>
> >> >>> [ 139.323852] rfcomm_l2data_ready: ccf70000 bytes 4
> >> >>> [ 139.346252] rfcomm_process_rx: session ccb94ce0 state 8 qlen 1
> >> >>> ...
> >> >>> [ 139.457519] rfcomm_l2state_change: ccf70000 state 9
> >> >>> (disconnect ok)
> >> >>>
> >> >>> In the above case, when rfcomm_process_rx runs, the code inside "if
> >> >>> (sk->sk_state == BT_CLOSED)" never runs.
> >> >>>
> >> >>> The following lines are from the panic log:
> >> >>>
> >> >>> [ 174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
> >> >>> [ 174.903442] rfcomm_l2state_change: ccf70400 state 9
> >> >>> [ 174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
> >> >>> ...
> >> >>> ( then panic)
> >> >>>
> >> >>> In the above case, sk_state is changed to 9 (BT_CLOSED) first, and only then
> >> >>> does rfcomm_process_rx run, so the code inside "if (sk->sk_state ==
> >> >>> BT_CLOSED)" is executed and rfcomm_session_put is called twice. I think this
> >> >>> is the root cause of the panic.
> >> >>
> >> >> I know why it happens, that is not the problem. My point is not to break
> >> >> current scheduling assumptions.
> >> >>
> >> >> So if you move the rfcomm_session_put() now at the end of the function,
> >> >> then it should be fine, right?
> >> >>
> >> >> Regards
> >> >>
> >> >> Marcel
> >> >>
> >> >>
> >> >>
> >> >
> >> > You are right. I moved the rfcomm_session_put() to the end of
> >> > rfcomm_process_rx(), and the kernel panic doesn't happen any longer.
> >> >
> >> > The changed code is below:
> >> >
> >> > @@ -1796,6 +1796,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
> >> >
> >> >         BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->s
> >> >
> >> > +       rfcomm_session_hold(s);
> >> > +
> >> >         /* Get data directly from socket receive queue without copying it. */
> >> >         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
> >> >                 skb_orphan(skb);
> >> > @@ -1808,6 +1810,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
> >> >
> >> >                 rfcomm_session_close(s, sk->sk_err);
> >> >         }
> >> > +
> >> > +       rfcomm_session_put(s);
> >> >  }
> >> >
> >> >  static inline void rfcomm_accept_connection(struct rfcomm_session *s)
> >> >
> >> > Please submit this change to the BlueZ release.
> >> >
> >> > Thank you,
> >> > Zhu Lan
> >>
> >> http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=6f505dbe5337e49302574f8d2e65fd83e30f9117
> >>
> >> If Bluez wants to cherry-pick it.
> >
> > if it would have a proper commit message and author details, I would,
> > but that is not the case. The whole patch + commit message is missing
> > the background on why we have to do it.
>
> Guys, how are we going to proceed with it? I think this patch is
> important, as we have also seen a couple of crashes recently. Is the commit to
> Android missing some information?

what do you mean by some? You guys must be kidding me here. You really
think that I am going to commit a patch like this without at least three
lengthy paragraphs of commit message and a comment inside the code. If
so then you must be joking. Writing the proper commit message is as
important as the patch. Since I already wrote the patch, you could have
at least spent the time for the commit message, but it seems that I have
to do that by myself, too.

Regards

Marcel



2009-09-18 08:06:52

by Andrei Emeltchenko

[permalink] [raw]
Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Hi,

On Thu, Sep 17, 2009 at 5:17 AM, Marcel Holtmann <[email protected]> wrote:
> Hi Nick,
>
>> >>> >> We ran into an issue where a kernel panic happens when disconnecting some kinds
>> >>> >> of Bluetooth headsets. We did some analysis and made some changes
>> >>> >> to the kernel code which avoid the panic. Could you
>> >>> >> please help check whether our analysis and fix are correct?
>> >>> >>
>> >>> >> =============
>> >>> >> Issue description
>> >>> >> =============
>> >>> >> On the Android platform (kernel 2.6.29), disconnecting a Bluetooth headset
>> >>> >> may cause a kernel panic under certain conditions.
>> >>> >>
>> >>> >> (Precondition: Android is paired with the headset.)
>> >>> >> Initiate the connection from android, disconnect it from android, result is OK.
>> >>> >> Initiate the connection from android, disconnect it from headset, result is OK.
>> >>> >> Initiate the connection from headset, disconnect it from headset, result is OK.
>> >>> >> Initiate the connection from headset, disconnect it from android, for
>> >>> >> Motorola H12 headset, result is OK.
>> >>> >> Initiate the connection from headset, disconnect it from android, for
>> >>> >> Motorola H620/560 headset, result is kernel panic.
>> >>> >>
>> >>> >> =============
>> >>> >> Kernel panic point
>> >>> >> =============
>> >>> >> The kernel panics at __list_del() in rfcomm_session_del();
>> >>> >> the panic reason is "Unable to handle kernel paging request at virtual
>> >>> >> address 00200200".
>> >>> >>
>> >>> >> =============
>> >>> >> Kernel log analysis
>> >>> >> =============
>> >>> >> rfcomm_session_del() is called again after the session entry has
>> >>> >> already been removed from the list. __list_del() then causes a kernel panic
>> >>> >> because of the invalid pointer. This situation occurs when
>> >>> >> rfcomm_recv_ua() is called while the socket state is BT_CLOSED. So we need to
>> >>> >> find out why the socket state becomes BT_CLOSED before we call
>> >>> >> rfcomm_recv_ua().
>> >>> >>
>> >>> >> # [  171.677429] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>> >>> >> [  171.683532] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 14
>> >>> >> [  171.689422] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>> >>> >> [  171.695709] rfcomm_process_rx: @@@ @@@ sk_state = 1
>> >>> >> [  171.701110] rfcomm_process_dlcs: session cc751be0 state 1
>> >>> >> [  171.706939] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40
>> >>> >> rx_credits 33 tx_credits 31
>> >>> >> [  171.715515] rfcomm_send_frame: session cc751be0 len 18
>> >>> >> [  171.721130] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>> >>> >> [  174.127807] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>> >>> >> [  174.134490] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 6
>> >>> >> [  174.141540] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>> >>> >> [  174.148498] rfcomm_process_rx: @@@ @@@ sk_state = 1
>> >>> >> [  174.154968] rfcomm_process_dlcs: session cc751be0 state 1
>> >>> >> [  174.161437] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40
>> >>> >> rx_credits 33 tx_credits 30
>> >>> >> [  174.171173] rfcomm_send_frame: session cc751be0 len 10
>> >>> >> [  174.177642] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>> >>> >> [  174.205932] rfcomm_sock_release: sock ce9a0960, sk cc5c4c00
>> >>> >> [  174.212707] rfcomm_sock_shutdown: sock ce9a0960, sk cc5c4c00
>> >>> >> [  174.220031] __rfcomm_sock_close: sk cc5c4c00 state 1 socket ce9a0960
>> >>> >> [  174.227508] __rfcomm_dlc_close: dlc cd3fe920 state 1 dlci 20 err 0
>> >>> >> session cc751be0
>> >>> >> [  174.236877] rfcomm_send_disc: cc751be0 dlci 20
>> >>> >> [  174.242706] rfcomm_send_frame: session cc751be0 len 4
>> >>> >> [  174.248962] rfcomm_dlc_set_timer: dlc cd3fe920 state 8 timeout 2560
>> >>> >> [  174.256835] rfcomm_sock_kill: sk cc5c4c00 state 1 refcnt 2
>> >>> >> [  174.263336] rfcomm_sock_destruct: sk cc5c4c00 dlc cd3fe920
>> >>> >> [  174.399444] rfcomm_l2data_ready: ccf70400 bytes 4
>> >>> >> [  174.404724] rfcomm_process_rx: session cc751be0 state 1 qlen 1
>> >>> >> [  174.411010] rfcomm_process_rx: @@@ @@@ sk_state = 1
>> >>> >> [  174.416412] rfcomm_recv_ua: session cc751be0 state 1 dlci 20
>> >>> >> [  174.422515] __rfcomm_dlc_close: dlc cd3fe920 state 9 dlci 20 err 0
>> >>> >> session cc751be0
>> >>> >> [  174.430816] rfcomm_dlc_clear_timer: dlc cd3fe920 state 9
>> >>> >> [  174.436553] rfcomm_dlc_unlink: dlc cd3fe920 refcnt 1 session cc751be0
>> >>> >> [  174.443572] rfcomm_dlc_free: cd3fe920
>> >>> >> [  174.447570] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>> >>> >> [  174.454528] rfcomm_send_disc: cc751be0 dlci 0
>> >>> >> [  174.459259] rfcomm_send_frame: session cc751be0 len 4
>> >>> >> [  174.464904] rfcomm_process_dlcs: session cc751be0 state 8
>> >>> >> [  174.470703] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>> >>> >> [  174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>> >>> >> [  174.903442] rfcomm_l2state_change: ccf70400 state 9
>> >>> >> [  174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>> >>> >> [  174.915130] rfcomm_process_rx: @@@ @@@ sk_state = 9
>> >>> >> [  174.920562] rfcomm_recv_ua: session cc751be0 state 8 dlci 0
>> >>> >> [  174.926574] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>> >>> >> [  174.933532] rfcomm_process_rx: @@@ @@@ sk_state == BT_CLOSED , s->initiator=0
>> >>> >> [  174.941253] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>> >>> >> [  174.948211] rfcomm_session_del: session cc751be0 state 8
>> >>> >> [  174.953918] @@@@ in rfcomm_session_del()
>> >>> >> [  174.958312] @@@@ s->list = cc751be0
>> >>> >> [  174.962097] @@@@ s->list.next = ccbfe9a0
>> >>> >> [  174.966369] @@@@ s->list.prev = c047d524
>> >>> >> [  174.970733] @@@@ list is valid, call list_del()
>> >>> >> [  174.975646] @@@@ after list_del()
>> >>> >> [  174.979278] @@@@ s->list = cc751be0
>> >>> >> [  174.983184] @@@@ s->list.next = 00100100
>> >>> >> [  174.987457] @@@@ s->list.prev = 00200200
>> >>> >> [  174.991729] rfcomm_session_close: session cc751be0 state 8 err 104
>> >>> >> [  174.998504] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>> >>> >> [  175.005310] rfcomm_session_del: session cc751be0 state 9
>> >>> >> [  175.011169] @@@@ in rfcomm_session_del()
>> >>> >> [  175.015441] @@@@ s->list = cc751be0
>> >>> >> [  175.019409] @@@@ s->list.next = 00100100
>> >>> >> [  175.023651] @@@@ s->list.prev = 00200200
>> >>> >> [  175.027923] @@@@ list is valid, call list_del()
>> >>> >> [  175.032958] Unable to handle kernel paging request at virtual
>> >>> >> address 00200200
>> >>> >> [  175.040679] pgd = c0004000
>> >>> >> [  175.043792] [00200200] *pgd=00000000
>> >>> >> [  175.047821] Internal error: Oops: 817 [#1]
>> >>> >> [  175.052246] Modules linked in:
>> >>> >> [  175.055725] CPU: 0    Not tainted  (2.6.29-omap1-dirty #34)
>> >>> >> [  175.061859] PC is at rfcomm_session_del+0x6c/0x108
>> >>> >> [  175.067047] LR is at release_console_sem+0x190/0x1a0
>> >>> >> [  175.072509] pc : [<c033ded8>]    lr : [<c0066308>]    psr: 60000013
>> >>> >> [  175.072509] sp : cc1abf38  ip : cc1abe68  fp : cc1abf4c
>> >>> >> [  175.084960] r10: cc751c04  r9 : c036d2fc  r8 : cc751be0
>> >>> >> [  175.090545] r7 : 00000068  r6 : cc751c04  r5 : 00000009  r4 : cc751be0
>> >>> >> [  175.097656] r3 : 00100100  r2 : 00100100  r1 : 00200200  r0 : c0422876
>> >>> >>
>> >>> >> =============
>> >>> >> HCI log analysis
>> >>> >> =============
>> >>> >> Comparing the hcidump log of the correct case with that of the panic
>> >>> >> case, we found only one difference in the message sequence.
>> >>> >> In the panic case, the headset sends an L2CAP Disconn_Req immediately after
>> >>> >> sending the RFCOMM UA frame to Android. We think this is what
>> >>> >> causes the socket state to become BT_CLOSED.
>> >>> >>
>> >>> >> Please compare these two logs, paying attention to the message direction
>> >>> >> of the last Disconn_Req.
>> >>> >>
>> >>> >>
>> >>> >> Log of correct case:
>> >>> >> ----------------------------
>> >>> >>
>> >>> >>
>> >>> >> 009-09-10 15:27:28.963519 < ACL data: handle 1 flags 0x02 dlen 22
>> >>> >>    L2CAP(d): cid 0x0047 len 18 [psm 3]
>> >>> >>      RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>> >>> >>      0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>> >>> >> 009-09-10 15:27:28.967272 > HCI Event: Number of Completed Packets (0x13) plen
>> >>> >>
>> >>> >>    handle 1 packets 1
>> >>> >> 009-09-10 15:27:29.243945 < ACL data: handle 1 flags 0x02 dlen 8
>> >>> >>    L2CAP(d): cid 0x0047 len 4 [psm 3]
>> >>> >>      RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>> >>> >> 009-09-10 15:27:29.247363 > HCI Event: Number of Completed Packets (0x13) plen
>> >>> >>
>> >>> >>    handle 1 packets 1
>> >>> >> 009-09-10 15:27:29.274890 > ACL data: handle 1 flags 0x02 dlen 8
>> >>> >>    L2CAP(d): cid 0x0040 len 4 [psm 3]
>> >>> >>      RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>> >>> >> 009-09-10 15:27:29.296343 < ACL data: handle 1 flags 0x02 dlen 8
>> >>> >>    L2CAP(d): cid 0x0047 len 4 [psm 3]
>> >>> >>      RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>> >>> >> 009-09-10 15:27:29.298480 > HCI Event: Number of Completed Packets (0x13) plen
>> >>> >>
>> >>> >>    handle 1 packets 1
>> >>> >> 009-09-10 15:27:29.319873 > ACL data: handle 1 flags 0x02 dlen 8
>> >>> >>    L2CAP(d): cid 0x0040 len 4 [psm 3]
>> >>> >>      RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>> >>> >> 009-09-10 15:27:29.320727 < ACL data: handle 1 flags 0x02 dlen 12
>> >>> >>    L2CAP(s): Disconn req: dcid 0x0047 scid 0x0040
>> >>> >> 009-09-10 15:27:29.323474 > HCI Event: Number of Completed Packets (0x13) plen
>> >>> >>
>> >>> >>    handle 1 packets 1
>> >>> >> 009-09-10 15:27:29.337237 > ACL data: handle 1 flags 0x02 dlen 12
>> >>> >>    L2CAP(s): Disconn rsp: dcid 0x0047 scid 0x0040
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> log of panic case:
>> >>> >> ------------------------
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> 2009-09-10 13:34:24.020208 < ACL data: handle 1 flags 0x02 dlen 22
>> >>> >>     L2CAP(d): cid 0x0041 len 18 [psm 3]
>> >>> >>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>> >>> >>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>> >>> >> 2009-09-10 13:34:24.281256 > HCI Event: Number of Completed Packets (0x13) plen
>> >>> >> 5
>> >>> >>     handle 1 packets 1
>> >>> >> 2009-09-10 13:34:24.083580 < ACL data: handle 1 flags 0x02 dlen 8
>> >>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>> >>> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>> >>> >> 2009-09-10 13:34:24.529442 > HCI Event: Number of Completed Packets (0x13) plen
>> >>> >> 5
>> >>> >>     handle 1 packets 1
>> >>> >> 2009-09-10 13:34:24.531914 > ACL data: handle 1 flags 0x02 dlen 8
>> >>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>> >>> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>> >>> >> 2009-09-10 13:34:24.533135 < ACL data: handle 1 flags 0x02 dlen 8
>> >>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>> >>> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>> >>> >> 2009-09-10 13:34:25.028649 > HCI Event: Number of Completed Packets (0x13) plen
>> >>> >> 5
>> >>> >>     handle 1 packets 1
>> >>> >> 2009-09-10 13:34:25.032128 > ACL data: handle 1 flags 0x02 dlen 8
>> >>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>> >>> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>> >>> >> 2009-09-10 13:34:25.032341 > ACL data: handle 1 flags 0x02 dlen 12
>> >>> >>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>> >>> >> 2009-09-10 13:34:25.032646 < ACL data: handle 1 flags 0x02 dlen 12
>> >>> >>     L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
>> >>> >>
>> >>> >> =============
>> >>> >> Analysis Result
>> >>> >> =============
>> >>> >> Some kinds of Bluetooth headsets, such as the Motorola H560/H620, which
>> >>> >> are based on the BCM2044S, send an L2CAP Disconn_Req command right
>> >>> >> after sending the RFCOMM UA frame. This L2CAP Disconn_Req causes the
>> >>> >> socket state to become BT_CLOSED before the UA frame has been
>> >>> >> completely handled, which leads to the kernel panic. I think we can ignore
>> >>> >> received RFCOMM frames if the socket state is BT_CLOSED, because processing
>> >>> >> them doesn't make sense in the BT_CLOSED state.
>> >>> >>
>> >>> >>
>> >>> >> =============
>> >>> >> Changed Code
>> >>> >> =============
>> >>> >> We changed rfcomm_process_rx() in
>> >>> >> net/bluetooth/rfcomm/core.c to check the socket state before
>> >>> >> handling the received frames. If the socket state is BT_CLOSED, we
>> >>> >> don't handle any RFCOMM frames and just close the session.
>> >>> >>
>> >>> >> The change is below:
>> >>> >>
>> >>> >> +       if (sk->sk_state != BT_CLOSED) {
>> >>> >>         /* Get data directly from socket receive queue without copying it. */
>> >>> >>         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>> >>> >>                 skb_orphan(skb);
>> >>> >>                 rfcomm_recv_frame(s, skb);
>> >>> >>         }
>> >>> >> -
>> >>> >> -       if (sk->sk_state == BT_CLOSED) {
>> >>> >> +       } else {
>> >>> >>                 if (!s->initiator)
>> >>> >>                         rfcomm_session_put(s);
>> >>> >>
>> >>> >>                 rfcomm_session_close(s, sk->sk_err);
>> >>> >>         }
>> >>> >
>> >>> > so I do see the issue here, but I don't agree with the fix since it
>> >>> > changes behavior that might cause other issues. So in case the frame
>> >>> > processing leads to sk->sk_state == BT_CLOSED we are not closing the
>> >>> > connection anymore if we make it depend on a state before the frame
>> >>> > processing. And nothing guarantees that rfcomm_process_rx gets scheduled
>> >>> > again.
>> >>> >
>> >>> > diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
>> >>> > index 94b3388..606143b 100644
>> >>> > --- a/net/bluetooth/rfcomm/core.c
>> >>> > +++ b/net/bluetooth/rfcomm/core.c
>> >>> > @@ -1798,12 +1798,16 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)
>> >>> >
>> >>> >         BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->sk_receive_queue));
>> >>> >
>> >>> > +       rfcomm_session_hold(s);
>> >>> > +
>> >>> >         /* Get data directly from socket receive queue without copying it. */
>> >>> >         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>> >>> >                 skb_orphan(skb);
>> >>> >                 rfcomm_recv_frame(s, skb);
>> >>> >         }
>> >>> >
>> >>> > +       rfcomm_session_put(s);
>> >>> > +
>> >>> >         if (sk->sk_state == BT_CLOSED) {
>> >>> >                 if (!s->initiator)
>> >>> >                         rfcomm_session_put(s);
>> >>> >
>> >>> > What does the above patch do for you? Since if I read it correctly, then
>> >>> > the rfcomm_recv_ua causes the rfcomm_session_put to trigger the closing
>> >>> > of the session. And then in this case it is delayed until after all
>> >>> > frames are processed.
>> >>> >
>> >>> >
>> >>>
>> >>> I've tried your patch, but unfortunately the kernel panic still happened.
>> >>>
>> >>> From the log I noticed that if rfcomm_l2state_change is called before
>> >>> rfcomm_process_rx, the kernel panic always happens.
>> >>>
>> >>> The following lines are from the correct log:
>> >>>
>> >>> [  139.323852] rfcomm_l2data_ready: ccf70000 bytes 4
>> >>> [  139.346252] rfcomm_process_rx: session ccb94ce0 state 8 qlen 1
>> >>> ...
>> >>> [  139.457519] rfcomm_l2state_change: ccf70000 state 9
>> >>> (disconnect ok)
>> >>>
>> >>> In the above case, when rfcomm_process_rx runs, the code inside "if
>> >>> (sk->sk_state == BT_CLOSED)" never runs.
>> >>>
>> >>> The following lines are from the panic log:
>> >>>
>> >>> [  174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>> >>> [  174.903442] rfcomm_l2state_change: ccf70400 state 9
>> >>> [  174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>> >>> ...
>> >>> (then panic)
>> >>>
>> >>> In the above case, sk_state is changed to 9 (BT_CLOSED) first, and only then
>> >>> does rfcomm_process_rx run, so the code inside "if (sk->sk_state ==
>> >>> BT_CLOSED)" is executed and rfcomm_session_put is called twice. I think this
>> >>> is the root cause of the panic.
>> >>
>> >> I know why it happens, that is not the problem. My point is not to break
>> >> current scheduling assumptions.
>> >>
>> >> So if you move the rfcomm_session_put() now at the end of the function,
>> >> then it should be fine, right?
>> >>
>> >> Regards
>> >>
>> >> Marcel
>> >>
>> >>
>> >>
>> >
>> > You are right. I moved the rfcomm_session_put() to the end of
>> > rfcomm_process_rx(), and the kernel panic doesn't happen any longer.
>> >
>> > The changed code is below:
>> >
>> > @@ -1796,6 +1796,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
>> >
>> >         BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->s
>> >
>> > +       rfcomm_session_hold(s);
>> > +
>> >         /* Get data directly from socket receive queue without copying it. */
>> >         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>> >                 skb_orphan(skb);
>> > @@ -1808,6 +1810,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
>> >
>> >                 rfcomm_session_close(s, sk->sk_err);
>> >         }
>> > +
>> > +       rfcomm_session_put(s);
>> >  }
>> >
>> >  static inline void rfcomm_accept_connection(struct rfcomm_session *s)
>> >
>> > Please submit this change to the BlueZ release.
>> >
>> > Thank you,
>> > Zhu Lan
>>
>> http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=6f505dbe5337e49302574f8d2e65fd83e30f9117
>>
>> If Bluez wants to cherry-pick it.
>
> if it would have a proper commit message and author details, I would,
> but that is not the case. The whole patch + commit message is missing
> the background on why we have to do it.

Guys, how are we going to proceed with it? I think this patch is
important, as we have also seen a couple of crashes recently. Is the commit to
Android missing some information?

Regards,
Andrei

2009-09-17 02:17:48

by Marcel Holtmann

[permalink] [raw]
Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Hi Nick,

> >>> >> We ran into an issue where a kernel panic happens when disconnecting some kinds
> >>> >> of Bluetooth headsets. We did some analysis and made some changes
> >>> >> to the kernel code which avoid the panic. Could you
> >>> >> please help check whether our analysis and fix are correct?
> >>> >>
> >>> >> =============
> >>> >> Issue description
> >>> >> =============
> >>> >> On the Android platform (kernel 2.6.29), disconnecting a Bluetooth headset
> >>> >> may cause a kernel panic under certain conditions.
> >>> >>
> >>> >> (Precondition: Android is paired with the headset.)
> >>> >> Initiate the connection from android, disconnect it from android, result is OK.
> >>> >> Initiate the connection from android, disconnect it from headset, result is OK.
> >>> >> Initiate the connection from headset, disconnect it from headset, result is OK.
> >>> >> Initiate the connection from headset, disconnect it from android, for
> >>> >> Motorola H12 headset, result is OK.
> >>> >> Initiate the connection from headset, disconnect it from android, for
> >>> >> Motorola H620/560 headset, result is kernel panic.
> >>> >>
> >>> >> =============
> >>> >> Kernel panic point
> >>> >> =============
> >>> >> The kernel panics at __list_del() in rfcomm_session_del();
> >>> >> the panic reason is "Unable to handle kernel paging request at virtual
> >>> >> address 00200200".
> >>> >>
> >>> >> =============
> >>> >> Kernel log analysis
> >>> >> =============
> >>> >> rfcomm_session_del() is called again after the session entry has
> >>> >> already been removed from the list. __list_del() then causes a kernel panic
> >>> >> because of the invalid pointer. This situation occurs when
> >>> >> rfcomm_recv_ua() is called while the socket state is BT_CLOSED. So we need to
> >>> >> find out why the socket state becomes BT_CLOSED before we call
> >>> >> rfcomm_recv_ua().
> >>> >>
> >>> >> # [ 171.677429] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
> >>> >> [ 171.683532] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 14
> >>> >> [ 171.689422] rfcomm_process_rx: session cc751be0 state 1 qlen 0
> >>> >> [ 171.695709] rfcomm_process_rx: @@@ @@@ sk_state = 1
> >>> >> [ 171.701110] rfcomm_process_dlcs: session cc751be0 state 1
> >>> >> [ 171.706939] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40
> >>> >> rx_credits 33 tx_credits 31
> >>> >> [ 171.715515] rfcomm_send_frame: session cc751be0 len 18
> >>> >> [ 171.721130] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
> >>> >> [ 174.127807] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
> >>> >> [ 174.134490] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 6
> >>> >> [ 174.141540] rfcomm_process_rx: session cc751be0 state 1 qlen 0
> >>> >> [ 174.148498] rfcomm_process_rx: @@@ @@@ sk_state = 1
> >>> >> [ 174.154968] rfcomm_process_dlcs: session cc751be0 state 1
> >>> >> [ 174.161437] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40
> >>> >> rx_credits 33 tx_credits 30
> >>> >> [ 174.171173] rfcomm_send_frame: session cc751be0 len 10
> >>> >> [ 174.177642] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
> >>> >> [ 174.205932] rfcomm_sock_release: sock ce9a0960, sk cc5c4c00
> >>> >> [ 174.212707] rfcomm_sock_shutdown: sock ce9a0960, sk cc5c4c00
> >>> >> [ 174.220031] __rfcomm_sock_close: sk cc5c4c00 state 1 socket ce9a0960
> >>> >> [ 174.227508] __rfcomm_dlc_close: dlc cd3fe920 state 1 dlci 20 err 0
> >>> >> session cc751be0
> >>> >> [ 174.236877] rfcomm_send_disc: cc751be0 dlci 20
> >>> >> [ 174.242706] rfcomm_send_frame: session cc751be0 len 4
> >>> >> [ 174.248962] rfcomm_dlc_set_timer: dlc cd3fe920 state 8 timeout 2560
> >>> >> [ 174.256835] rfcomm_sock_kill: sk cc5c4c00 state 1 refcnt 2
> >>> >> [ 174.263336] rfcomm_sock_destruct: sk cc5c4c00 dlc cd3fe920
> >>> >> [ 174.399444] rfcomm_l2data_ready: ccf70400 bytes 4
> >>> >> [ 174.404724] rfcomm_process_rx: session cc751be0 state 1 qlen 1
> >>> >> [ 174.411010] rfcomm_process_rx: @@@ @@@ sk_state = 1
> >>> >> [ 174.416412] rfcomm_recv_ua: session cc751be0 state 1 dlci 20
> >>> >> [ 174.422515] __rfcomm_dlc_close: dlc cd3fe920 state 9 dlci 20 err 0
> >>> >> session cc751be0
> >>> >> [ 174.430816] rfcomm_dlc_clear_timer: dlc cd3fe920 state 9
> >>> >> [ 174.436553] rfcomm_dlc_unlink: dlc cd3fe920 refcnt 1 session cc751be0
> >>> >> [ 174.443572] rfcomm_dlc_free: cd3fe920
> >>> >> [ 174.447570] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
> >>> >> [ 174.454528] rfcomm_send_disc: cc751be0 dlci 0
> >>> >> [ 174.459259] rfcomm_send_frame: session cc751be0 len 4
> >>> >> [ 174.464904] rfcomm_process_dlcs: session cc751be0 state 8
> >>> >> [ 174.470703] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
> >>> >> [ 174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
> >>> >> [ 174.903442] rfcomm_l2state_change: ccf70400 state 9
> >>> >> [ 174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
> >>> >> [ 174.915130] rfcomm_process_rx: @@@ @@@ sk_state = 9
> >>> >> [ 174.920562] rfcomm_recv_ua: session cc751be0 state 8 dlci 0
> >>> >> [ 174.926574] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
> >>> >> [ 174.933532] rfcomm_process_rx: @@@ @@@ sk_state == BT_CLOSED , s->initiator=0
> >>> >> [ 174.941253] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
> >>> >> [ 174.948211] rfcomm_session_del: session cc751be0 state 8
> >>> >> [ 174.953918] @@@@ in rfcomm_session_del()
> >>> >> [ 174.958312] @@@@ s->list = cc751be0
> >>> >> [ 174.962097] @@@@ s->list.next = ccbfe9a0
> >>> >> [ 174.966369] @@@@ s->list.prev = c047d524
> >>> >> [ 174.970733] @@@@ list is valid, call list_del()
> >>> >> [ 174.975646] @@@@ after list_del()
> >>> >> [ 174.979278] @@@@ s->list = cc751be0
> >>> >> [ 174.983184] @@@@ s->list.next = 00100100
> >>> >> [ 174.987457] @@@@ s->list.prev = 00200200
> >>> >> [ 174.991729] rfcomm_session_close: session cc751be0 state 8 err 104
> >>> >> [ 174.998504] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
> >>> >> [ 175.005310] rfcomm_session_del: session cc751be0 state 9
> >>> >> [ 175.011169] @@@@ in rfcomm_session_del()
> >>> >> [ 175.015441] @@@@ s->list = cc751be0
> >>> >> [ 175.019409] @@@@ s->list.next = 00100100
> >>> >> [ 175.023651] @@@@ s->list.prev = 00200200
> >>> >> [ 175.027923] @@@@ list is valid, call list_del()
> >>> >> [ 175.032958] Unable to handle kernel paging request at virtual
> >>> >> address 00200200
> >>> >> [ 175.040679] pgd = c0004000
> >>> >> [ 175.043792] [00200200] *pgd=00000000
> >>> >> [ 175.047821] Internal error: Oops: 817 [#1]
> >>> >> [ 175.052246] Modules linked in:
> >>> >> [ 175.055725] CPU: 0 Not tainted (2.6.29-omap1-dirty #34)
> >>> >> [ 175.061859] PC is at rfcomm_session_del+0x6c/0x108
> >>> >> [ 175.067047] LR is at release_console_sem+0x190/0x1a0
> >>> >> [ 175.072509] pc : [<c033ded8>] lr : [<c0066308>] psr: 60000013
> >>> >> [ 175.072509] sp : cc1abf38 ip : cc1abe68 fp : cc1abf4c
> >>> >> [ 175.084960] r10: cc751c04 r9 : c036d2fc r8 : cc751be0
> >>> >> [ 175.090545] r7 : 00000068 r6 : cc751c04 r5 : 00000009 r4 : cc751be0
> >>> >> [ 175.097656] r3 : 00100100 r2 : 00100100 r1 : 00200200 r0 : c0422876
> >>> >>
> >>> >> =============
> >>> >> HCI log analysis
> >>> >> =============
> >>> >> Comparing the hcidump log of the correct case with that of the panic
> >>> >> case, we found only one difference in the message sequence.
> >>> >> In the panic case, the headset sends an L2CAP Disconn_Req immediately after
> >>> >> sending the RFCOMM UA frame to Android. We think this is what
> >>> >> causes the socket state to become BT_CLOSED.
> >>> >>
> >>> >> Please compare these two logs, paying attention to the message direction
> >>> >> of the last Disconn_Req.
> >>> >>
> >>> >>
> >>> >> Log of correct case:
> >>> >> ----------------------------
> >>> >>
> >>> >>
> >>> >> 2009-09-10 15:27:28.963519 < ACL data: handle 1 flags 0x02 dlen 22
> >>> >>     L2CAP(d): cid 0x0047 len 18 [psm 3]
> >>> >>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
> >>> >>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
> >>> >> 2009-09-10 15:27:28.967272 > HCI Event: Number of Completed Packets (0x13) plen
> >>> >>     handle 1 packets 1
> >>> >> 2009-09-10 15:27:29.243945 < ACL data: handle 1 flags 0x02 dlen 8
> >>> >>     L2CAP(d): cid 0x0047 len 4 [psm 3]
> >>> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
> >>> >> 2009-09-10 15:27:29.247363 > HCI Event: Number of Completed Packets (0x13) plen
> >>> >>     handle 1 packets 1
> >>> >> 2009-09-10 15:27:29.274890 > ACL data: handle 1 flags 0x02 dlen 8
> >>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
> >>> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
> >>> >> 2009-09-10 15:27:29.296343 < ACL data: handle 1 flags 0x02 dlen 8
> >>> >>     L2CAP(d): cid 0x0047 len 4 [psm 3]
> >>> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
> >>> >> 2009-09-10 15:27:29.298480 > HCI Event: Number of Completed Packets (0x13) plen
> >>> >>     handle 1 packets 1
> >>> >> 2009-09-10 15:27:29.319873 > ACL data: handle 1 flags 0x02 dlen 8
> >>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
> >>> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
> >>> >> 2009-09-10 15:27:29.320727 < ACL data: handle 1 flags 0x02 dlen 12
> >>> >>     L2CAP(s): Disconn req: dcid 0x0047 scid 0x0040
> >>> >> 2009-09-10 15:27:29.323474 > HCI Event: Number of Completed Packets (0x13) plen
> >>> >>     handle 1 packets 1
> >>> >> 2009-09-10 15:27:29.337237 > ACL data: handle 1 flags 0x02 dlen 12
> >>> >>     L2CAP(s): Disconn rsp: dcid 0x0047 scid 0x0040
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> log of panic case:
> >>> >> ------------------------
> >>> >>
> >>> >>
> >>> >>
> >>> >> 2009-09-10 13:34:24.020208 < ACL data: handle 1 flags 0x02 dlen 22
> >>> >>     L2CAP(d): cid 0x0041 len 18 [psm 3]
> >>> >>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
> >>> >>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
> >>> >> 2009-09-10 13:34:24.281256 > HCI Event: Number of Completed Packets (0x13) plen 5
> >>> >>     handle 1 packets 1
> >>> >> 2009-09-10 13:34:24.083580 < ACL data: handle 1 flags 0x02 dlen 8
> >>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
> >>> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
> >>> >> 2009-09-10 13:34:24.529442 > HCI Event: Number of Completed Packets (0x13) plen 5
> >>> >>     handle 1 packets 1
> >>> >> 2009-09-10 13:34:24.531914 > ACL data: handle 1 flags 0x02 dlen 8
> >>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
> >>> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
> >>> >> 2009-09-10 13:34:24.533135 < ACL data: handle 1 flags 0x02 dlen 8
> >>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
> >>> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
> >>> >> 2009-09-10 13:34:25.028649 > HCI Event: Number of Completed Packets (0x13) plen 5
> >>> >>     handle 1 packets 1
> >>> >> 2009-09-10 13:34:25.032128 > ACL data: handle 1 flags 0x02 dlen 8
> >>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
> >>> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
> >>> >> 2009-09-10 13:34:25.032341 > ACL data: handle 1 flags 0x02 dlen 12
> >>> >>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
> >>> >> 2009-09-10 13:34:25.032646 < ACL data: handle 1 flags 0x02 dlen 12
> >>> >>     L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
> >>> >>
> >>> >> =============
> >>> >> Analysis Result
> >>> >> =============
> >>> >> Some Bluetooth headsets, such as the Motorola H560/H620, which are
> >>> >> based on the BCM2044S, send an L2CAP Disconn_Req command right
> >>> >> after sending the RFCOMM UA frame. This L2CAP Disconn_Req causes the
> >>> >> RFCOMM socket state to become BT_CLOSED before the UA frame has been
> >>> >> completely handled, which leads to the kernel panic. I think we can
> >>> >> ignore received RFCOMM frames when the socket state is BT_CLOSED,
> >>> >> because they don't make sense in the BT_CLOSED state.
> >>> >>
> >>> >>
> >>> >> =============
> >>> >> Changed Code
> >>> >> =============
> >>> >> We changed rfcomm_process_rx() in net/bluetooth/rfcomm/core.c to
> >>> >> check the socket state before handling the received frames. If the
> >>> >> socket state is BT_CLOSED, we don't handle any RFCOMM frames but just
> >>> >> close the session.
> >>> >>
> >>> >> The change is as follows:
> >>> >>
> >>> >> +       if (sk->sk_state != BT_CLOSED) {
> >>> >>         /* Get data directly from socket receive queue without copying it. */
> >>> >>         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
> >>> >>                 skb_orphan(skb);
> >>> >>                 rfcomm_recv_frame(s, skb);
> >>> >>         }
> >>> >> -
> >>> >> -       if (sk->sk_state == BT_CLOSED) {
> >>> >> +       } else {
> >>> >>                 if (!s->initiator)
> >>> >>                         rfcomm_session_put(s);
> >>> >>
> >>> >>                 rfcomm_session_close(s, sk->sk_err);
> >>> >>         }
> >>> >
> >>> > So I do see the issue here, but I don't agree with the fix, since it
> >>> > changes behavior in a way that might cause other issues. If the frame
> >>> > processing leads to sk->sk_state == BT_CLOSED, we no longer close the
> >>> > connection, because the check now depends on the state from before the
> >>> > frame processing. And nothing guarantees that rfcomm_process_rx gets
> >>> > scheduled again.
> >>> >
> >>> > diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
> >>> > index 94b3388..606143b 100644
> >>> > --- a/net/bluetooth/rfcomm/core.c
> >>> > +++ b/net/bluetooth/rfcomm/core.c
> >>> > @@ -1798,12 +1798,16 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)
> >>> >
> >>> >         BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->sk_receive_queue));
> >>> >
> >>> > +       rfcomm_session_hold(s);
> >>> > +
> >>> >         /* Get data directly from socket receive queue without copying it. */
> >>> >         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
> >>> >                 skb_orphan(skb);
> >>> >                 rfcomm_recv_frame(s, skb);
> >>> >         }
> >>> >
> >>> > +       rfcomm_session_put(s);
> >>> > +
> >>> >         if (sk->sk_state == BT_CLOSED) {
> >>> >                 if (!s->initiator)
> >>> >                         rfcomm_session_put(s);
> >>> >
> >>> > What does the above patch do for you? If I read it correctly,
> >>> > rfcomm_recv_ua causes the rfcomm_session_put that triggers the closing
> >>> > of the session, and with this patch that is delayed until after all
> >>> > frames are processed.
> >>> >
> >>> >
> >>>
> >>> I've tried your patch, but unfortunately the kernel panic still happened.
> >>>
> >>> From the log I noticed that whenever rfcomm_l2state_change is called
> >>> before rfcomm_process_rx, the kernel panic happens.
> >>>
> >>> The lines below are from the correct log:
> >>>
> >>> [ 139.323852] rfcomm_l2data_ready: ccf70000 bytes 4
> >>> [ 139.346252] rfcomm_process_rx: session ccb94ce0 state 8 qlen 1
> >>> ...
> >>> [ 139.457519] rfcomm_l2state_change: ccf70000 state 9
> >>> (disconnect ok)
> >>>
> >>> In the above case, when process_rx runs, the code under the condition
> >>> "if (sk->sk_state == BT_CLOSED)" never runs.
> >>>
> >>> The lines below are from the panic log:
> >>>
> >>> [ 174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
> >>> [ 174.903442] rfcomm_l2state_change: ccf70400 state 9
> >>> [ 174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
> >>> ...
> >>> (then panic)
> >>>
> >>> In the above case, sk_state changes to 9 (BT_CLOSED) first and then
> >>> process_rx runs, so the code under the condition "if (sk->sk_state ==
> >>> BT_CLOSED)" runs and rfcomm_session_put is called twice. I think this
> >>> is the root cause of the panic.
> >>
> >> I know why it happens; that is not the problem. My point is not to
> >> break the current scheduling assumptions.
> >>
> >> So if you now move the rfcomm_session_put() to the end of the function,
> >> it should be fine, right?
> >>
> >> Regards
> >>
> >> Marcel
> >>
> >>
> >>
> >
> > You are right. I moved the rfcomm_session_put() to the end of
> > rfcomm_process_rx(), and the kernel panic no longer happens.
> >
> > The changed code is below:
> >
> > @@ -1796,6 +1796,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
> >
> >         BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->s
> >
> > +       rfcomm_session_hold(s);
> > +
> >         /* Get data directly from socket receive queue without copying it. */
> >         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
> >                 skb_orphan(skb);
> > @@ -1808,6 +1810,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
> >
> >                 rfcomm_session_close(s, sk->sk_err);
> >         }
> > +
> > +       rfcomm_session_put(s);
> >  }
> >
> >  static inline void rfcomm_accept_connection(struct rfcomm_session *s)
> >
> > Please submit this change to the BlueZ release.
> >
> > Thank you,
> > Zhu Lan
>
> http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=6f505dbe5337e49302574f8d2e65fd83e30f9117
>
> If Bluez wants to cherry-pick it.

If it had a proper commit message and author details, I would, but that
is not the case. The whole patch and commit message are missing the
background on why we have to do this.

Regards

Marcel



2009-09-17 01:21:39

by Nick Pelly

Subject: Re: kernel panic happens when disconnecting Bluetooth headset

On Mon, Sep 14, 2009 at 2:10 AM, Lan Zhu <[email protected]> wrote:
> Please submit this change to bluez release.
>
> Thank you,
> Zhu Lan

http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=6f505dbe5337e49302574f8d2e65fd83e30f9117

If Bluez wants to cherry-pick it.

Nick

2009-09-14 09:10:41

by Lan Zhu

Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Hi Marcel,

2009/9/12 Marcel Holtmann <[email protected]>:
> Hi Zhu,
>
>> >> We met an issue where a kernel panic happens when disconnecting some
>> >> kinds of Bluetooth headset. We did some analysis and made some changes
>> >> to the kernel code which avoid the panic. Would you please help check
>> >> whether our analysis and fix are correct?
>> >>
>> >> =============
>> >> Issue description
>> >> =============
>> >> On the Android platform (kernel 2.6.29), disconnecting a Bluetooth
>> >> headset may cause a kernel panic under certain conditions.
>> >>
>> >> (Precondition: Android is paired with the headset.)
>> >> Initiate the connection from Android, disconnect it from Android: result is OK.
>> >> Initiate the connection from Android, disconnect it from the headset: result is OK.
>> >> Initiate the connection from the headset, disconnect it from the headset: result is OK.
>> >> Initiate the connection from the headset, disconnect it from Android, for
>> >> the Motorola H12 headset: result is OK.
>> >> Initiate the connection from the headset, disconnect it from Android, for
>> >> the Motorola H620/560 headset: result is a kernel panic.
>> >>
>> >> =============
>> >> Kernel panic point
>> >> =============
>> >> The kernel panics at __list_del() in rfcomm_session_del(); the panic
>> >> reason is "Unable to handle kernel paging request at virtual
>> >> address 00200200".
>> >>
>> >> =============
>> >> Kernel log analysis
>> >> =============
>> >> rfcomm_session_del() is called again after the session entry has
>> >> already been removed from the list, so __list_del() causes a kernel
>> >> panic because of the invalid pointer. This happens when rfcomm_recv_ua()
>> >> is called while the socket state is BT_CLOSED, so we need to find out
>> >> why the socket state becomes BT_CLOSED before rfcomm_recv_ua() is called.
>> >>
>> >> # [  171.677429] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>> >> [  171.683532] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 14
>> >> [  171.689422] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>> >> [  171.695709] rfcomm_process_rx: @@@ @@@ sk_state = 1
>> >> [  171.701110] rfcomm_process_dlcs: session cc751be0 state 1
>> >> [  171.706939] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40 rx_credits 33 tx_credits 31
>> >> [  171.715515] rfcomm_send_frame: session cc751be0 len 18
>> >> [  171.721130] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>> >> [  174.127807] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>> >> [  174.134490] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 6
>> >> [  174.141540] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>> >> [  174.148498] rfcomm_process_rx: @@@ @@@ sk_state = 1
>> >> [  174.154968] rfcomm_process_dlcs: session cc751be0 state 1
>> >> [  174.161437] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40 rx_credits 33 tx_credits 30
>> >> [  174.171173] rfcomm_send_frame: session cc751be0 len 10
>> >> [  174.177642] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>> >> [  174.205932] rfcomm_sock_release: sock ce9a0960, sk cc5c4c00
>> >> [  174.212707] rfcomm_sock_shutdown: sock ce9a0960, sk cc5c4c00
>> >> [  174.220031] __rfcomm_sock_close: sk cc5c4c00 state 1 socket ce9a0960
>> >> [  174.227508] __rfcomm_dlc_close: dlc cd3fe920 state 1 dlci 20 err 0 session cc751be0
>> >> [  174.236877] rfcomm_send_disc: cc751be0 dlci 20
>> >> [  174.242706] rfcomm_send_frame: session cc751be0 len 4
>> >> [  174.248962] rfcomm_dlc_set_timer: dlc cd3fe920 state 8 timeout 2560
>> >> [  174.256835] rfcomm_sock_kill: sk cc5c4c00 state 1 refcnt 2
>> >> [  174.263336] rfcomm_sock_destruct: sk cc5c4c00 dlc cd3fe920
>> >> [  174.399444] rfcomm_l2data_ready: ccf70400 bytes 4
>> >> [  174.404724] rfcomm_process_rx: session cc751be0 state 1 qlen 1
>> >> [  174.411010] rfcomm_process_rx: @@@ @@@ sk_state = 1
>> >> [  174.416412] rfcomm_recv_ua: session cc751be0 state 1 dlci 20
>> >> [  174.422515] __rfcomm_dlc_close: dlc cd3fe920 state 9 dlci 20 err 0 session cc751be0
>> >> [  174.430816] rfcomm_dlc_clear_timer: dlc cd3fe920 state 9
>> >> [  174.436553] rfcomm_dlc_unlink: dlc cd3fe920 refcnt 1 session cc751be0
>> >> [  174.443572] rfcomm_dlc_free: cd3fe920
>> >> [  174.447570] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>> >> [  174.454528] rfcomm_send_disc: cc751be0 dlci 0
>> >> [  174.459259] rfcomm_send_frame: session cc751be0 len 4
>> >> [  174.464904] rfcomm_process_dlcs: session cc751be0 state 8
>> >> [  174.470703] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>> >> [  174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>> >> [  174.903442] rfcomm_l2state_change: ccf70400 state 9
>> >> [  174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>> >> [  174.915130] rfcomm_process_rx: @@@ @@@ sk_state = 9
>> >> [  174.920562] rfcomm_recv_ua: session cc751be0 state 8 dlci 0
>> >> [  174.926574] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>> >> [  174.933532] rfcomm_process_rx: @@@ @@@ sk_state == BT_CLOSED , s->initiator=0
>> >> [  174.941253] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>> >> [  174.948211] rfcomm_session_del: session cc751be0 state 8
>> >> [  174.953918] @@@@ in rfcomm_session_del()
>> >> [  174.958312] @@@@ s->list = cc751be0
>> >> [  174.962097] @@@@ s->list.next = ccbfe9a0
>> >> [  174.966369] @@@@ s->list.prev = c047d524
>> >> [  174.970733] @@@@ list is valid, call list_del()
>> >> [  174.975646] @@@@ after list_del()
>> >> [  174.979278] @@@@ s->list = cc751be0
>> >> [  174.983184] @@@@ s->list.next = 00100100
>> >> [  174.987457] @@@@ s->list.prev = 00200200
>> >> [  174.991729] rfcomm_session_close: session cc751be0 state 8 err 104
>> >> [  174.998504] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>> >> [  175.005310] rfcomm_session_del: session cc751be0 state 9
>> >> [  175.011169] @@@@ in rfcomm_session_del()
>> >> [  175.015441] @@@@ s->list = cc751be0
>> >> [  175.019409] @@@@ s->list.next = 00100100
>> >> [  175.023651] @@@@ s->list.prev = 00200200
>> >> [  175.027923] @@@@ list is valid, call list_del()
>> >> [  175.032958] Unable to handle kernel paging request at virtual address 00200200
>> >> [  175.040679] pgd = c0004000
>> >> [  175.043792] [00200200] *pgd=00000000
>> >> [  175.047821] Internal error: Oops: 817 [#1]
>> >> [  175.052246] Modules linked in:
>> >> [  175.055725] CPU: 0    Not tainted  (2.6.29-omap1-dirty #34)
>> >> [  175.061859] PC is at rfcomm_session_del+0x6c/0x108
>> >> [  175.067047] LR is at release_console_sem+0x190/0x1a0
>> >> [  175.072509] pc : [<c033ded8>]    lr : [<c0066308>]    psr: 60000013
>> >> [  175.072509] sp : cc1abf38  ip : cc1abe68  fp : cc1abf4c
>> >> [  175.084960] r10: cc751c04  r9 : c036d2fc  r8 : cc751be0
>> >> [  175.090545] r7 : 00000068  r6 : cc751c04  r5 : 00000009  r4 : cc751be0
>> >> [  175.097656] r3 : 00100100  r2 : 00100100  r1 : 00200200  r0 : c0422876
>> >>
>> >> =============
>> >> HCI log analysis
>> >> =============
>> >> Compare the hcidump log of the correct case with the one of the panic
>> >> case, we found there is only one difference in the message sequence.
>> >> In the panic case, headset send L2CAP Disconn_Req immediately after
>> >> sending rfcomm UA frame to android. We think this is the reason that
>> >> cause the socket state become BT_CLOSED.
>> >>
>> >> Please compare these two log, pay attention to the message direction
>> >> of the last Disconn_Req.
>> >>
>> >>
>> >> Log of correct case:
>> >> ----------------------------
>> >>
>> >>
>> >> 009-09-10 15:27:28.963519 < ACL data: handle 1 flags 0x02 dlen 22
>> >>    L2CAP(d): cid 0x0047 len 18 [psm 3]
>> >>      RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>> >>      0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>> >> 009-09-10 15:27:28.967272 > HCI Event: Number of Completed Packets (0x13) plen
>> >>
>> >>    handle 1 packets 1
>> >> 009-09-10 15:27:29.243945 < ACL data: handle 1 flags 0x02 dlen 8
>> >>    L2CAP(d): cid 0x0047 len 4 [psm 3]
>> >>      RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>> >> 009-09-10 15:27:29.247363 > HCI Event: Number of Completed Packets (0x13) plen
>> >>
>> >>    handle 1 packets 1
>> >> 009-09-10 15:27:29.274890 > ACL data: handle 1 flags 0x02 dlen 8
>> >>    L2CAP(d): cid 0x0040 len 4 [psm 3]
>> >>      RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>> >> 009-09-10 15:27:29.296343 < ACL data: handle 1 flags 0x02 dlen 8
>> >>    L2CAP(d): cid 0x0047 len 4 [psm 3]
>> >>      RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>> >> 009-09-10 15:27:29.298480 > HCI Event: Number of Completed Packets (0x13) plen
>> >>
>> >>    handle 1 packets 1
>> >> 009-09-10 15:27:29.319873 > ACL data: handle 1 flags 0x02 dlen 8
>> >>    L2CAP(d): cid 0x0040 len 4 [psm 3]
>> >>      RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>> >> 009-09-10 15:27:29.320727 < ACL data: handle 1 flags 0x02 dlen 12
>> >>    L2CAP(s): Disconn req: dcid 0x0047 scid 0x0040
>> >> 009-09-10 15:27:29.323474 > HCI Event: Number of Completed Packets (0x13) plen
>> >>
>> >>    handle 1 packets 1
>> >> 009-09-10 15:27:29.337237 > ACL data: handle 1 flags 0x02 dlen 12
>> >>    L2CAP(s): Disconn rsp: dcid 0x0047 scid 0x0040
>> >>
>> >>
>> >>
>> >>
>> >> log of panic case:
>> >> ------------------------
>> >>
>> >>
>> >>
>> >> 2009-09-10 13:34:24.020208 < ACL data: handle 1 flags 0x02 dlen 22
>> >>     L2CAP(d): cid 0x0041 len 18 [psm 3]
>> >>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>> >>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>> >> 2009-09-10 13:34:24.281256 > HCI Event: Number of Completed Packets (0x13) plen
>> >> 5
>> >>     handle 1 packets 1
>> >> 2009-09-10 13:34:24.083580 < ACL data: handle 1 flags 0x02 dlen 8
>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>> >> 2009-09-10 13:34:24.529442 > HCI Event: Number of Completed Packets (0x13) plen
>> >> 5
>> >>     handle 1 packets 1
>> >> 2009-09-10 13:34:24.531914 > ACL data: handle 1 flags 0x02 dlen 8
>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>> >> 2009-09-10 13:34:24.533135 < ACL data: handle 1 flags 0x02 dlen 8
>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>> >> 2009-09-10 13:34:25.028649 > HCI Event: Number of Completed Packets (0x13) plen
>> >> 5
>> >>     handle 1 packets 1
>> >> 2009-09-10 13:34:25.032128 > ACL data: handle 1 flags 0x02 dlen 8
>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>> >> 2009-09-10 13:34:25.032341 > ACL data: handle 1 flags 0x02 dlen 12
>> >>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>> >> 2009-09-10 13:34:25.032646 < ACL data: handle 1 flags 0x02 dlen 12
>> >>     L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
>> >>
>> >> =============
>> >> Analysis Result
>> >> =============
>> >> For some kinds of Bluetooth headset such as Motorola H560/H620 which
>> >> are based on BCM2044S, they will send L2CAP Disconn_Req command right
>> >> after sending rfcomm UA frame. This L2CAP Disconn_Req will cause the
>> >> rfcomm socket state become BT_CLOSED before completely handling UA
>> >> frame, thus it will cause kernel panic. I think we can ignore the
>> >> received rfcomm frames if socket state is BT_CLOSED, because it
>> >> doesn't make sense in the BT_CLOSED state.
>> >>
>> >>
>> >> =============
>> >> Changed Code
>> >> =============
>> >> We changed the code in the function rfcomm_process_rx() in
>> >> net/bluetooth/rfcomm/core.c, check the socket state first before
>> >> handling the received framew. If the socket state is BT_CLOSED, we
>> >> don't handle any rfcomm frames but just close the session.
>> >>
>> >> The change is like below
>> >>
>> >> +       if (sk->sk_state != BT_CLOSED) {
>> >>         /* Get data directly from socket receive queue without copying it. */
>> >>         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>> >>                 skb_orphan(skb);
>> >>                 rfcomm_recv_frame(s, skb);
>> >>         }
>> >> -
>> >> -       if (sk->sk_state == BT_CLOSED) {
>> >> +       } else {
>> >>                 if (!s->initiator)
>> >>                         rfcomm_session_put(s);
>> >>
>> >>                 rfcomm_session_close(s, sk->sk_err);
>> >>         }
>> >
>> > so I do see the issue here, but I don't agree with the fix since it
>> > changes behavior that might cause other issues. So in case the frame
>> > processing leads to sk->sk_state == BT_CLOSED we are not closing the
>> > connection anymore if we make it depend on a state before the frame
>> > processing. And nothing guarantees that rfcomm_process_rx gets scheduled
>> > again.
>> >
>> > diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
>> > index 94b3388..606143b 100644
>> > --- a/net/bluetooth/rfcomm/core.c
>> > +++ b/net/bluetooth/rfcomm/core.c
>> > @@ -1798,12 +1798,16 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)
>> >
>> >         BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->sk_receive_queue));
>> >
>> > +       rfcomm_session_hold(s);
>> > +
>> >         /* Get data directly from socket receive queue without copying it. */
>> >         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>> >                 skb_orphan(skb);
>> >                 rfcomm_recv_frame(s, skb);
>> >         }
>> >
>> > +       rfcomm_session_put(s);
>> > +
>> >         if (sk->sk_state == BT_CLOSED) {
>> >                 if (!s->initiator)
>> >                         rfcomm_session_put(s);
>> >
>> > What does the above patch do for you? Since if I read it correctly, then
>> > the rfcomm_recv_ua causes the rfcomm_session_put to trigger the closing
>> > of the session. And then in this case it is delayed until after all
>> > frames are processed.
>> >
>> >
>>
>> I've tried your patch but unfortunately kernel panic still happened.
>>
>> From the log I noticed that if rfcomm_l2state_change is called before
>> rfcomm_process_rx, kernel panic will happen definitely.
>>
>> Below lines are in the correct log,
>>
>> [ 139.323852] rfcomm_l2data_ready: ccf70000 bytes 4
>> [ 139.346252] rfcomm_process_rx: session ccb94ce0 state 8 qlen 1
>> ...
>> [ 139.457519] rfcomm_l2state_change: ccf70000 state 9
>> (disconnect ok)
>>
>> In the above case, when process_rx, the code in the condition "if
>> (sk->sk_state == BT_CLOSED)" will never run.
>>
>> Below lines are in the panic log,
>>
>> [ 174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>> [ 174.903442] rfcomm_l2state_change: ccf70400 state 9
>> [ 174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>> ...
>> ( then panic)
>>
>> In the above case, sk_state is changed to 9 (BT_CLOSED) firstly, then
>> process_rx, so the code in the condition "if (sk->sk_state ==
>> BT_CLOSED) " will be run, it will call session_put twice. I think this
>> is the root cause of panic.
>
> I know why it happens, that is not the problem. My point is not to break
> current scheduling assumptions.
>
> So if you move the rfcomm_session_put() now at the end of the function,
> then it should be fine, right?
>
> Regards
>
> Marcel
>
>
>

You are right. I moved the rfcomm_session_put() to the end of
rfcomm_process_rx(), and the kernel panic no longer happens.

The change is as follows:

@@ -1796,6 +1796,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)

 	BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->sk_receive_queue));

+	rfcomm_session_hold(s);
+
 	/* Get data directly from socket receive queue without copying it. */
 	while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
 		skb_orphan(skb);
@@ -1808,6 +1810,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)

 		rfcomm_session_close(s, sk->sk_err);
 	}
+
+	rfcomm_session_put(s);
 }

static inline void rfcomm_accept_connection(struct rfcomm_session *s)

Please submit this change for inclusion in the BlueZ release.

Thank you,
Zhu Lan
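
Why the position of the final rfcomm_session_put() matters can be checked with a small user-space model of the reference counting in the panic trace. This is a hypothetical sketch, not the net/bluetooth/rfcomm/core.c code: the model_* names and the enum are invented for illustration, plain ints stand in for atomic_t, the starting count of 2 is read off the log, and it assumes (as the log suggests) that receiving the UA on DLCI 0 drops one reference and that rfcomm_session_close() takes and drops one of its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the session refcounting in the panic
 * trace. Not the kernel code: plain ints replace atomic_t, nothing is freed. */
struct model_session {
	int refcnt;
	int del_calls;   /* how many times the "delete" path ran */
	bool initiator;  /* false here: the headset initiated the connection */
};

/* Stands in for rfcomm_session_del(); in the kernel this runs list_del()
 * and frees the session, so a second call touches poisoned/freed memory. */
static void model_session_del(struct model_session *s)
{
	s->del_calls++;
}

static void model_session_hold(struct model_session *s)
{
	s->refcnt++;
}

static void model_session_put(struct model_session *s)
{
	if (--s->refcnt == 0)
		model_session_del(s);
}

/* Assumption read from the log: rfcomm_session_close() takes a reference
 * and drops it again before returning. */
static void model_session_close(struct model_session *s)
{
	model_session_hold(s);
	model_session_put(s);
}

enum put_placement {
	NO_GUARD,                /* original code: no extra hold/put */
	PUT_BEFORE_CLOSED_CHECK, /* hold at start, put right after the rx loop */
	PUT_AT_END               /* hold at start, put at the end of the function */
};

/* Models one rfcomm_process_rx() pass in the failing order: the L2CAP socket
 * is already BT_CLOSED and the queued frame is the UA for DLCI 0.
 * Returns how many times the delete path ran. */
int model_process_rx(enum put_placement p, int start_refcnt)
{
	struct model_session s = { start_refcnt, 0, false };

	if (p != NO_GUARD)
		model_session_hold(&s);

	model_session_put(&s);          /* rfcomm_recv_ua() on DLCI 0 */

	if (p == PUT_BEFORE_CLOSED_CHECK)
		model_session_put(&s);

	/* sk->sk_state == BT_CLOSED branch */
	if (!s.initiator)
		model_session_put(&s);
	model_session_close(&s);

	if (p == PUT_AT_END)
		model_session_put(&s);

	return s.del_calls;
}
```

With a starting count of 2, model_process_rx(NO_GUARD, 2) and model_process_rx(PUT_BEFORE_CLOSED_CHECK, 2) both return 2, i.e. the delete path runs twice as in the oops, while model_process_rx(PUT_AT_END, 2) returns 1. Holding the extra reference until after the BT_CLOSED branch keeps the count above zero while rfcomm_session_close() runs.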

2009-09-11 16:45:14

by Marcel Holtmann

Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Hi Zhu,

> I've tried your patch but unfortunately kernel panic still happened.
>
> From the log I noticed that if rfcomm_l2state_change is called before
> rfcomm_process_rx, kernel panic will happen definitely.
>
> Below lines are in the correct log,
>
> [ 139.323852] rfcomm_l2data_ready: ccf70000 bytes 4
> [ 139.346252] rfcomm_process_rx: session ccb94ce0 state 8 qlen 1
> ...
> [ 139.457519] rfcomm_l2state_change: ccf70000 state 9
> (disconnect ok)
>
> In the above case, when process_rx, the code in the condition "if
> (sk->sk_state == BT_CLOSED)" will never run.
>
> Below lines are in the panic log,
>
> [ 174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
> [ 174.903442] rfcomm_l2state_change: ccf70400 state 9
> [ 174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
> ...
> ( then panic)
>
> In the above case, sk_state is changed to 9 (BT_CLOSED) firstly, then
> process_rx, so the code in the condition "if (sk->sk_state ==
> BT_CLOSED) " will be run, it will call session_put twice. I think this
> is the root cause of panic.

I know why it happens, that is not the problem. My point is not to break
current scheduling assumptions.

So if you move the rfcomm_session_put() now at the end of the function,
then it should be fine, right?

Regards

Marcel
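
The addresses in the oops are the list-poison values that list_del() writes into a removed entry: after the first rfcomm_session_del() the log shows s->list.next = 00100100 and s->list.prev = 00200200, and the second rfcomm_session_del() then faults at 00200200. The sketch below reproduces that behaviour in user space. The sketch_* names are hypothetical, only the two poison values come from the log, and the poison check in sketch_list_del_checked() is an illustrative guard that the kernel's list_del() does not perform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Poison values as printed in the log after the first list_del(); the
 * SKETCH_ prefix marks these as local stand-ins, not the kernel header. */
#define SKETCH_LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100u)
#define SKETCH_LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200u)

struct list_head {
	struct list_head *next, *prev;
};

static inline void sketch_list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head. */
static inline void sketch_list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static inline bool sketch_list_is_poisoned(const struct list_head *entry)
{
	return entry->next == SKETCH_LIST_POISON1 &&
	       entry->prev == SKETCH_LIST_POISON2;
}

/* Unlink entry and poison its pointers like list_del() does. The poison
 * check is a hypothetical guard: without it, a second call would write
 * through 0x00100100/0x00200200. Returns false if already deleted. */
static inline bool sketch_list_del_checked(struct list_head *entry)
{
	if (sketch_list_is_poisoned(entry))
		return false;
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = SKETCH_LIST_POISON1;
	entry->prev = SKETCH_LIST_POISON2;
	return true;
}
```

Deleting the same entry twice returns true and then false; without the guard, the second call would dereference the poison pointers, which corresponds to the "Unable to handle kernel paging request at virtual address 00200200" oops. The guard only makes the double deletion visible; the reference-count change discussed in this thread is what stops rfcomm_session_del() from running twice.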



2009-09-11 15:28:20

by Lan Zhu

Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Hi Marcel,

2009/9/11 Marcel Holtmann wrote:
> Hi Zhu,
>
>> We met a issue that kernel panic happens when disconnecting some kinds
>> of Bluetooth headset, then we did some analysis and made some changes
>> on kernel code which have avoided the panic happening. Would you
>> please help to check if our analysis and fix is correct?
>>
>> =============
>> Issue description
>> =============
>> On Android platform(kernel 2.6.29), disconnecting Bluetooth headset
>> may cause kernel panic on certain conditions.
>>
>> (Pre-condition is android paired with headset.)
>> Initiate the connection from android, disconnect it from android, result is OK.
>> Initiate the connection from android, disconnect it from headset, result is OK.
>> Initiate the connection from headset, disconnect it from headset, result is OK.
>> Initiate the connection from headset, disconnect it from android, for
>> Motorola H12 headset, result is OK.
>> Initiate the connection from headset, disconnect it from android, for
>> Motorola H620/560 headset, result is kernel panic.
>>
>> =============
>> Kernel panic point
>> =============
>> kernel panic at __list_del() in the function rfcomm_session_del() ,
>> panic reason is "Unable to handle kernel paging request at virtual
>> address 00200200"
>>
>> =============
>> Kernel log analysis
>> =============
>> rfcomm_session_del() is still called after the session entry is
>> removed from the list. Then __list_del() will cause kernel panic
>> because of the incorrect pointer. This situation occurs when calling
>> rfcomm_recv_ua() when the socket state is BT_CLOSED . So we need to
>> find out why the socket state become BT_CLOSED before we calling
>> rfcomm_recv_ua().
>>
>> # [ 171.677429] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>> [ 171.683532] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 14
>> [ 171.689422] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>> [ 171.695709] rfcomm_process_rx: @@@ @@@ sk_state = 1
>> [ 171.701110] rfcomm_process_dlcs: session cc751be0 state 1
>> [ 171.706939] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40
>> rx_credits 33 tx_credits 31
>> [ 171.715515] rfcomm_send_frame: session cc751be0 len 18
>> [ 171.721130] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>> [ 174.127807] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>> [ 174.134490] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 6
>> [ 174.141540] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>> [ 174.148498] rfcomm_process_rx: @@@ @@@ sk_state = 1
>> [ 174.154968] rfcomm_process_dlcs: session cc751be0 state 1
>> [ 174.161437] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40
>> rx_credits 33 tx_credits 30
>> [ 174.171173] rfcomm_send_frame: session cc751be0 len 10
>> [ 174.177642] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>> [ 174.205932] rfcomm_sock_release: sock ce9a0960, sk cc5c4c00
>> [ 174.212707] rfcomm_sock_shutdown: sock ce9a0960, sk cc5c4c00
>> [ 174.220031] __rfcomm_sock_close: sk cc5c4c00 state 1 socket ce9a0960
>> [ 174.227508] __rfcomm_dlc_close: dlc cd3fe920 state 1 dlci 20 err 0
>> session cc751be0
>> [ 174.236877] rfcomm_send_disc: cc751be0 dlci 20
>> [ 174.242706] rfcomm_send_frame: session cc751be0 len 4
>> [ 174.248962] rfcomm_dlc_set_timer: dlc cd3fe920 state 8 timeout 2560
>> [ 174.256835] rfcomm_sock_kill: sk cc5c4c00 state 1 refcnt 2
>> [ 174.263336] rfcomm_sock_destruct: sk cc5c4c00 dlc cd3fe920
>> [ 174.399444] rfcomm_l2data_ready: ccf70400 bytes 4
>> [ 174.404724] rfcomm_process_rx: session cc751be0 state 1 qlen 1
>> [ 174.411010] rfcomm_process_rx: @@@ @@@ sk_state = 1
>> [ 174.416412] rfcomm_recv_ua: session cc751be0 state 1 dlci 20
>> [ 174.422515] __rfcomm_dlc_close: dlc cd3fe920 state 9 dlci 20 err 0
>> session cc751be0
>> [ 174.430816] rfcomm_dlc_clear_timer: dlc cd3fe920 state 9
>> [ 174.436553] rfcomm_dlc_unlink: dlc cd3fe920 refcnt 1 session cc751be0
>> [ 174.443572] rfcomm_dlc_free: cd3fe920
>> [ 174.447570] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>> [ 174.454528] rfcomm_send_disc: cc751be0 dlci 0
>> [ 174.459259] rfcomm_send_frame: session cc751be0 len 4
>> [ 174.464904] rfcomm_process_dlcs: session cc751be0 state 8
>> [ 174.470703] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>> [ 174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>> [ 174.903442] rfcomm_l2state_change: ccf70400 state 9
>> [ 174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>> [ 174.915130] rfcomm_process_rx: @@@ @@@ sk_state = 9
>> [ 174.920562] rfcomm_recv_ua: session cc751be0 state 8 dlci 0
>> [ 174.926574] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>> [ 174.933532] rfcomm_process_rx: @@@ @@@ sk_state == BT_CLOSED , s->initiator=0
>> [ 174.941253] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>> [ 174.948211] rfcomm_session_del: session cc751be0 state 8
>> [ 174.953918] @@@@ in rfcomm_session_del()
>> [ 174.958312] @@@@ s->list = cc751be0
>> [ 174.962097] @@@@ s->list.next = ccbfe9a0
>> [ 174.966369] @@@@ s->list.prev = c047d524
>> [ 174.970733] @@@@ list is valid, call list_del()
>> [ 174.975646] @@@@ after list_del()
>> [ 174.979278] @@@@ s->list = cc751be0
>> [ 174.983184] @@@@ s->list.next = 00100100
>> [ 174.987457] @@@@ s->list.prev = 00200200
>> [ 174.991729] rfcomm_session_close: session cc751be0 state 8 err 104
>> [ 174.998504] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>> [ 175.005310] rfcomm_session_del: session cc751be0 state 9
>> [ 175.011169] @@@@ in rfcomm_session_del()
>> [ 175.015441] @@@@ s->list = cc751be0
>> [ 175.019409] @@@@ s->list.next = 00100100
>> [ 175.023651] @@@@ s->list.prev = 00200200
>> [ 175.027923] @@@@ list is valid, call list_del()
>> [ 175.032958] Unable to handle kernel paging request at virtual
>> address 00200200
>> [ 175.040679] pgd = c0004000
>> [ 175.043792] [00200200] *pgd=00000000
>> [ 175.047821] Internal error: Oops: 817 [#1]
>> [ 175.052246] Modules linked in:
>> [ 175.055725] CPU: 0    Not tainted  (2.6.29-omap1-dirty #34)
>> [ 175.061859] PC is at rfcomm_session_del+0x6c/0x108
>> [ 175.067047] LR is at release_console_sem+0x190/0x1a0
>> [ 175.072509] pc : [<c033ded8>]    lr : [<c0066308>]    psr: 60000013
>> [ 175.072509] sp : cc1abf38  ip : cc1abe68  fp : cc1abf4c
>> [ 175.084960] r10: cc751c04  r9 : c036d2fc  r8 : cc751be0
>> [ 175.090545] r7 : 00000068  r6 : cc751c04  r5 : 00000009  r4 : cc751be0
>> [ 175.097656] r3 : 00100100  r2 : 00100100  r1 : 00200200  r0 : c0422876
>>
>> =============
>> HCI log analysis
>> =============
>> Comparing the hcidump log of the correct case with that of the panic
>> case, we found only one difference in the message sequence: in the
>> panic case, the headset sends an L2CAP Disconn_Req immediately after
>> sending the RFCOMM UA frame to Android. We think this is what causes
>> the socket state to become BT_CLOSED.
>>
>> Please compare these two logs, paying attention to the direction of
>> the last Disconn_Req.
>>
>>
>> Log of correct case:
>> ----------------------------
>>
>>
>> 009-09-10 15:27:28.963519 < ACL data: handle 1 flags 0x02 dlen 22
>>     L2CAP(d): cid 0x0047 len 18 [psm 3]
>>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>> 009-09-10 15:27:28.967272 > HCI Event: Number of Completed Packets (0x13) plen
>>     handle 1 packets 1
>> 009-09-10 15:27:29.243945 < ACL data: handle 1 flags 0x02 dlen 8
>>     L2CAP(d): cid 0x0047 len 4 [psm 3]
>>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>> 009-09-10 15:27:29.247363 > HCI Event: Number of Completed Packets (0x13) plen
>>     handle 1 packets 1
>> 009-09-10 15:27:29.274890 > ACL data: handle 1 flags 0x02 dlen 8
>>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>> 009-09-10 15:27:29.296343 < ACL data: handle 1 flags 0x02 dlen 8
>>     L2CAP(d): cid 0x0047 len 4 [psm 3]
>>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>> 009-09-10 15:27:29.298480 > HCI Event: Number of Completed Packets (0x13) plen
>>     handle 1 packets 1
>> 009-09-10 15:27:29.319873 > ACL data: handle 1 flags 0x02 dlen 8
>>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>> 009-09-10 15:27:29.320727 < ACL data: handle 1 flags 0x02 dlen 12
>>     L2CAP(s): Disconn req: dcid 0x0047 scid 0x0040
>> 009-09-10 15:27:29.323474 > HCI Event: Number of Completed Packets (0x13) plen
>>     handle 1 packets 1
>> 009-09-10 15:27:29.337237 > ACL data: handle 1 flags 0x02 dlen 12
>>     L2CAP(s): Disconn rsp: dcid 0x0047 scid 0x0040
>>
>>
>>
>>
>> log of panic case:
>> ------------------------
>>
>>
>>
>> 2009-09-10 13:34:24.020208 < ACL data: handle 1 flags 0x02 dlen 22
>>     L2CAP(d): cid 0x0041 len 18 [psm 3]
>>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>> 2009-09-10 13:34:24.281256 > HCI Event: Number of Completed Packets (0x13) plen 5
>>     handle 1 packets 1
>> 2009-09-10 13:34:24.083580 < ACL data: handle 1 flags 0x02 dlen 8
>>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>> 2009-09-10 13:34:24.529442 > HCI Event: Number of Completed Packets (0x13) plen 5
>>     handle 1 packets 1
>> 2009-09-10 13:34:24.531914 > ACL data: handle 1 flags 0x02 dlen 8
>>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>> 2009-09-10 13:34:24.533135 < ACL data: handle 1 flags 0x02 dlen 8
>>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>> 2009-09-10 13:34:25.028649 > HCI Event: Number of Completed Packets (0x13) plen 5
>>     handle 1 packets 1
>> 2009-09-10 13:34:25.032128 > ACL data: handle 1 flags 0x02 dlen 8
>>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>> 2009-09-10 13:34:25.032341 > ACL data: handle 1 flags 0x02 dlen 12
>>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>> 2009-09-10 13:34:25.032646 < ACL data: handle 1 flags 0x02 dlen 12
>>     L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
>>
>> =============
>> Analysis Result
>> =============
>> Some Bluetooth headsets, such as the Motorola H560/H620, which are
>> based on the BCM2044S, send an L2CAP Disconn_Req right after sending
>> the RFCOMM UA frame. This L2CAP Disconn_Req moves the RFCOMM session's
>> socket to BT_CLOSED before the UA frame has been completely handled,
>> which causes the kernel panic. I think we can ignore received RFCOMM
>> frames when the socket state is BT_CLOSED, because handling them makes
>> no sense in that state.
>>
>>
>> =============
>> Changed Code
>> =============
>> We changed rfcomm_process_rx() in net/bluetooth/rfcomm/core.c to
>> check the socket state before handling the received frames. If the
>> socket state is BT_CLOSED, we don't handle any RFCOMM frames and just
>> close the session.
>>
>> The change is as follows:
>>
>> +        if (sk->sk_state != BT_CLOSED) {
>>          /* Get data directly from socket receive queue without copying it. */
>>          while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>>                  skb_orphan(skb);
>>                  rfcomm_recv_frame(s, skb);
>>          }
>> -
>> -        if (sk->sk_state == BT_CLOSED) {
>> +        } else {
>>                  if (!s->initiator)
>>                          rfcomm_session_put(s);
>>
>>                  rfcomm_session_close(s, sk->sk_err);
>>          }
>
> So I do see the issue here, but I don't agree with the fix, since it
> changes behavior in a way that might cause other issues. If the frame
> processing itself leads to sk->sk_state == BT_CLOSED, making the close
> depend on the state from before the frame processing means we no longer
> close the connection. And nothing guarantees that rfcomm_process_rx gets
> scheduled again.
>
> diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
> index 94b3388..606143b 100644
> --- a/net/bluetooth/rfcomm/core.c
> +++ b/net/bluetooth/rfcomm/core.c
> @@ -1798,12 +1798,16 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)
>
>          BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->sk_receive_queue));
>
> +        rfcomm_session_hold(s);
> +
>          /* Get data directly from socket receive queue without copying it. */
>          while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>                  skb_orphan(skb);
>                  rfcomm_recv_frame(s, skb);
>          }
>
> +        rfcomm_session_put(s);
> +
>          if (sk->sk_state == BT_CLOSED) {
>                  if (!s->initiator)
>                          rfcomm_session_put(s);
>
> What does the above patch do for you? If I read it correctly,
> rfcomm_recv_ua() calls rfcomm_session_put(), which triggers the closing
> of the session. With this patch, that is delayed until after all
> frames have been processed.
>
>

I've tried your patch, but unfortunately the kernel panic still happened.

From the log I noticed that if rfcomm_l2state_change is called before
rfcomm_process_rx, the kernel panic always happens.

The following lines are from the correct log:

[ 139.323852] rfcomm_l2data_ready: ccf70000 bytes 4
[ 139.346252] rfcomm_process_rx: session ccb94ce0 state 8 qlen 1
...
[ 139.457519] rfcomm_l2state_change: ccf70000 state 9
(disconnect ok)

In this case, when rfcomm_process_rx runs, the body of the
"if (sk->sk_state == BT_CLOSED)" check is never executed.

The following lines are from the panic log:

[ 174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
[ 174.903442] rfcomm_l2state_change: ccf70400 state 9
[ 174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
...
(then panic)

In this case, sk_state is changed to 9 (BT_CLOSED) first and
rfcomm_process_rx runs afterwards, so the body of the
"if (sk->sk_state == BT_CLOSED)" check is executed. Together with the
put in rfcomm_recv_ua, rfcomm_session_put is therefore called twice. I
think this is the root cause of the panic.
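To make the double release concrete, here is a minimal user-space model of the sequence in the panic log. It is a sketch, not kernel code: struct session, session_put() and list_del_checked() are hypothetical, simplified stand-ins for rfcomm_session, rfcomm_session_put() and list_del(), and the model counts a second deletion instead of freeing the session or faulting. It assumes that the instrumented rfcomm_session_put() prints the count before decrementing it, and that rfcomm_session_close() takes and drops its own reference; both assumptions fit the log lines (2, 1, then 1 again before the second rfcomm_session_del). The poison values are the LIST_POISON1/LIST_POISON2 constants from include/linux/poison.h with a zero pointer delta, which is why the Oops reports address 00200200.

```c
#include <assert.h>
#include <stdint.h>

/* Values list_del() stores in a removed entry (include/linux/poison.h,
 * assuming a zero POISON_POINTER_DELTA as the log values suggest). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add(struct list_head *e, struct list_head *h)
{
	e->next = h->next;
	e->prev = h;
	h->next->prev = e;
	h->next = e;
}

/* Like list_del(), but returns -1 instead of dereferencing the poison
 * pointers when the entry has already been removed (the kernel Oopses). */
static int list_del_checked(struct list_head *e)
{
	if (e->next == LIST_POISON1 || e->prev == LIST_POISON2)
		return -1;
	e->next->prev = e->prev;
	e->prev->next = e->next;
	e->next = LIST_POISON1;
	e->prev = LIST_POISON2;
	return 0;
}

/* Hypothetical stand-in for struct rfcomm_session. */
struct session {
	struct list_head list;
	int refcnt;
	int faults;     /* deletions that would have crashed the kernel */
};

static void session_del(struct session *s)
{
	/* The real rfcomm_session_del() also frees the session; the model
	 * keeps it alive so the second deletion can be observed. */
	if (list_del_checked(&s->list) < 0)
		s->faults++;
}

static void session_hold(struct session *s)
{
	s->refcnt++;
}

static void session_put(struct session *s)
{
	assert(s->refcnt > 0);
	if (--s->refcnt == 0)
		session_del(s);
}

/* Replays the panic log: refcnt is 2 when the UA for DLCI 0 is handled
 * on a socket that is already BT_CLOSED and we are not the initiator. */
static int replay_panic_sequence(void)
{
	struct list_head sessions;
	struct session s = { .refcnt = 2 };

	list_init(&sessions);
	list_add(&s.list, &sessions);

	session_put(&s);   /* rfcomm_recv_ua(), BT_DISCONN case       */
	session_put(&s);   /* rfcomm_process_rx(), BT_CLOSED branch   */
	session_hold(&s);  /* rfcomm_session_close(): hold ...        */
	session_put(&s);   /* ... and put, which deletes a second time */
	return s.faults;
}
```

Calling replay_panic_sequence() returns 1: the second deletion finds the poisoned pointers, which in the kernel is the fault at 00200200.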

Thanks,
Zhu Lan

2009-09-11 08:23:56

by Marcel Holtmann

Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Hi Zhu,

> We hit an issue where a kernel panic happens when disconnecting some
> kinds of Bluetooth headsets. We did some analysis and made some changes
> to the kernel code which avoid the panic. Would you please help to
> check whether our analysis and fix are correct?
>
> =============
> Issue description
> =============
> On the Android platform (kernel 2.6.29), disconnecting a Bluetooth
> headset may cause a kernel panic under certain conditions.
>
> (Precondition: Android is paired with the headset.)
> Connection initiated from Android, disconnected from Android: OK.
> Connection initiated from Android, disconnected from headset: OK.
> Connection initiated from headset, disconnected from headset: OK.
> Connection initiated from headset, disconnected from Android, with a
> Motorola H12 headset: OK.
> Connection initiated from headset, disconnected from Android, with a
> Motorola H620/560 headset: kernel panic.
>
> =============
> Kernel panic point
> =============
> The kernel panics in __list_del() within rfcomm_session_del(); the
> panic reason is "Unable to handle kernel paging request at virtual
> address 00200200".
>
> =============
> Kernel log analysis
> =============
> rfcomm_session_del() is called again after the session entry has
> already been removed from the list, so __list_del() dereferences the
> poisoned list pointers and the kernel panics. This happens when
> rfcomm_recv_ua() is called while the socket state is BT_CLOSED, so we
> need to find out why the socket state becomes BT_CLOSED before
> rfcomm_recv_ua() is called.
>
> # [ 171.677429] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
> [ 171.683532] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 14
> [ 171.689422] rfcomm_process_rx: session cc751be0 state 1 qlen 0
> [ 171.695709] rfcomm_process_rx: @@@ @@@ sk_state = 1
> [ 171.701110] rfcomm_process_dlcs: session cc751be0 state 1
> [ 171.706939] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40 rx_credits 33 tx_credits 31
> [ 171.715515] rfcomm_send_frame: session cc751be0 len 18
> [ 171.721130] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
> [ 174.127807] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
> [ 174.134490] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 6
> [ 174.141540] rfcomm_process_rx: session cc751be0 state 1 qlen 0
> [ 174.148498] rfcomm_process_rx: @@@ @@@ sk_state = 1
> [ 174.154968] rfcomm_process_dlcs: session cc751be0 state 1
> [ 174.161437] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40 rx_credits 33 tx_credits 30
> [ 174.171173] rfcomm_send_frame: session cc751be0 len 10
> [ 174.177642] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
> [ 174.205932] rfcomm_sock_release: sock ce9a0960, sk cc5c4c00
> [ 174.212707] rfcomm_sock_shutdown: sock ce9a0960, sk cc5c4c00
> [ 174.220031] __rfcomm_sock_close: sk cc5c4c00 state 1 socket ce9a0960
> [ 174.227508] __rfcomm_dlc_close: dlc cd3fe920 state 1 dlci 20 err 0 session cc751be0
> [ 174.236877] rfcomm_send_disc: cc751be0 dlci 20
> [ 174.242706] rfcomm_send_frame: session cc751be0 len 4
> [ 174.248962] rfcomm_dlc_set_timer: dlc cd3fe920 state 8 timeout 2560
> [ 174.256835] rfcomm_sock_kill: sk cc5c4c00 state 1 refcnt 2
> [ 174.263336] rfcomm_sock_destruct: sk cc5c4c00 dlc cd3fe920
> [ 174.399444] rfcomm_l2data_ready: ccf70400 bytes 4
> [ 174.404724] rfcomm_process_rx: session cc751be0 state 1 qlen 1
> [ 174.411010] rfcomm_process_rx: @@@ @@@ sk_state = 1
> [ 174.416412] rfcomm_recv_ua: session cc751be0 state 1 dlci 20
> [ 174.422515] __rfcomm_dlc_close: dlc cd3fe920 state 9 dlci 20 err 0 session cc751be0
> [ 174.430816] rfcomm_dlc_clear_timer: dlc cd3fe920 state 9
> [ 174.436553] rfcomm_dlc_unlink: dlc cd3fe920 refcnt 1 session cc751be0
> [ 174.443572] rfcomm_dlc_free: cd3fe920
> [ 174.447570] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
> [ 174.454528] rfcomm_send_disc: cc751be0 dlci 0
> [ 174.459259] rfcomm_send_frame: session cc751be0 len 4
> [ 174.464904] rfcomm_process_dlcs: session cc751be0 state 8
> [ 174.470703] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
> [ 174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
> [ 174.903442] rfcomm_l2state_change: ccf70400 state 9
> [ 174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
> [ 174.915130] rfcomm_process_rx: @@@ @@@ sk_state = 9
> [ 174.920562] rfcomm_recv_ua: session cc751be0 state 8 dlci 0
> [ 174.926574] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
> [ 174.933532] rfcomm_process_rx: @@@ @@@ sk_state == BT_CLOSED , s->initiator=0
> [ 174.941253] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
> [ 174.948211] rfcomm_session_del: session cc751be0 state 8
> [ 174.953918] @@@@ in rfcomm_session_del()
> [ 174.958312] @@@@ s->list = cc751be0
> [ 174.962097] @@@@ s->list.next = ccbfe9a0
> [ 174.966369] @@@@ s->list.prev = c047d524
> [ 174.970733] @@@@ list is valid, call list_del()
> [ 174.975646] @@@@ after list_del()
> [ 174.979278] @@@@ s->list = cc751be0
> [ 174.983184] @@@@ s->list.next = 00100100
> [ 174.987457] @@@@ s->list.prev = 00200200
> [ 174.991729] rfcomm_session_close: session cc751be0 state 8 err 104
> [ 174.998504] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
> [ 175.005310] rfcomm_session_del: session cc751be0 state 9
> [ 175.011169] @@@@ in rfcomm_session_del()
> [ 175.015441] @@@@ s->list = cc751be0
> [ 175.019409] @@@@ s->list.next = 00100100
> [ 175.023651] @@@@ s->list.prev = 00200200
> [ 175.027923] @@@@ list is valid, call list_del()
> [ 175.032958] Unable to handle kernel paging request at virtual address 00200200
> [ 175.040679] pgd = c0004000
> [ 175.043792] [00200200] *pgd=00000000
> [ 175.047821] Internal error: Oops: 817 [#1]
> [ 175.052246] Modules linked in:
> [ 175.055725] CPU: 0 Not tainted (2.6.29-omap1-dirty #34)
> [ 175.061859] PC is at rfcomm_session_del+0x6c/0x108
> [ 175.067047] LR is at release_console_sem+0x190/0x1a0
> [ 175.072509] pc : [<c033ded8>] lr : [<c0066308>] psr: 60000013
> [ 175.072509] sp : cc1abf38 ip : cc1abe68 fp : cc1abf4c
> [ 175.084960] r10: cc751c04 r9 : c036d2fc r8 : cc751be0
> [ 175.090545] r7 : 00000068 r6 : cc751c04 r5 : 00000009 r4 : cc751be0
> [ 175.097656] r3 : 00100100 r2 : 00100100 r1 : 00200200 r0 : c0422876
>
> =============
> HCI log analysis
> =============
> Comparing the hcidump log of the correct case with that of the panic
> case, we found only one difference in the message sequence: in the
> panic case, the headset sends an L2CAP Disconn_Req immediately after
> sending the RFCOMM UA frame to Android. We think this is what causes
> the socket state to become BT_CLOSED.
>
> Please compare these two logs, paying attention to the direction of
> the last Disconn_Req.
>
>
> Log of correct case:
> ----------------------------
>
>
> 009-09-10 15:27:28.963519 < ACL data: handle 1 flags 0x02 dlen 22
>     L2CAP(d): cid 0x0047 len 18 [psm 3]
>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
> 009-09-10 15:27:28.967272 > HCI Event: Number of Completed Packets (0x13) plen
>     handle 1 packets 1
> 009-09-10 15:27:29.243945 < ACL data: handle 1 flags 0x02 dlen 8
>     L2CAP(d): cid 0x0047 len 4 [psm 3]
>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
> 009-09-10 15:27:29.247363 > HCI Event: Number of Completed Packets (0x13) plen
>     handle 1 packets 1
> 009-09-10 15:27:29.274890 > ACL data: handle 1 flags 0x02 dlen 8
>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
> 009-09-10 15:27:29.296343 < ACL data: handle 1 flags 0x02 dlen 8
>     L2CAP(d): cid 0x0047 len 4 [psm 3]
>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
> 009-09-10 15:27:29.298480 > HCI Event: Number of Completed Packets (0x13) plen
>     handle 1 packets 1
> 009-09-10 15:27:29.319873 > ACL data: handle 1 flags 0x02 dlen 8
>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
> 009-09-10 15:27:29.320727 < ACL data: handle 1 flags 0x02 dlen 12
>     L2CAP(s): Disconn req: dcid 0x0047 scid 0x0040
> 009-09-10 15:27:29.323474 > HCI Event: Number of Completed Packets (0x13) plen
>     handle 1 packets 1
> 009-09-10 15:27:29.337237 > ACL data: handle 1 flags 0x02 dlen 12
>     L2CAP(s): Disconn rsp: dcid 0x0047 scid 0x0040
>
>
>
>
> log of panic case:
> ------------------------
>
>
>
> 2009-09-10 13:34:24.020208 < ACL data: handle 1 flags 0x02 dlen 22
>     L2CAP(d): cid 0x0041 len 18 [psm 3]
>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
> 2009-09-10 13:34:24.281256 > HCI Event: Number of Completed Packets (0x13) plen 5
>     handle 1 packets 1
> 2009-09-10 13:34:24.083580 < ACL data: handle 1 flags 0x02 dlen 8
>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
> 2009-09-10 13:34:24.529442 > HCI Event: Number of Completed Packets (0x13) plen 5
>     handle 1 packets 1
> 2009-09-10 13:34:24.531914 > ACL data: handle 1 flags 0x02 dlen 8
>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
> 2009-09-10 13:34:24.533135 < ACL data: handle 1 flags 0x02 dlen 8
>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
> 2009-09-10 13:34:25.028649 > HCI Event: Number of Completed Packets (0x13) plen 5
>     handle 1 packets 1
> 2009-09-10 13:34:25.032128 > ACL data: handle 1 flags 0x02 dlen 8
>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
> 2009-09-10 13:34:25.032341 > ACL data: handle 1 flags 0x02 dlen 12
>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
> 2009-09-10 13:34:25.032646 < ACL data: handle 1 flags 0x02 dlen 12
>     L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
>
> =============
> Analysis Result
> =============
> Some Bluetooth headsets, such as the Motorola H560/H620, which are
> based on the BCM2044S, send an L2CAP Disconn_Req right after sending
> the RFCOMM UA frame. This L2CAP Disconn_Req moves the RFCOMM session's
> socket to BT_CLOSED before the UA frame has been completely handled,
> which causes the kernel panic. I think we can ignore received RFCOMM
> frames when the socket state is BT_CLOSED, because handling them makes
> no sense in that state.
>
>
> =============
> Changed Code
> =============
> We changed rfcomm_process_rx() in net/bluetooth/rfcomm/core.c to
> check the socket state before handling the received frames. If the
> socket state is BT_CLOSED, we don't handle any RFCOMM frames and just
> close the session.
>
> The change is as follows:
>
> +        if (sk->sk_state != BT_CLOSED) {
>          /* Get data directly from socket receive queue without copying it. */
>          while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>                  skb_orphan(skb);
>                  rfcomm_recv_frame(s, skb);
>          }
> -
> -        if (sk->sk_state == BT_CLOSED) {
> +        } else {
>                  if (!s->initiator)
>                          rfcomm_session_put(s);
>
>                  rfcomm_session_close(s, sk->sk_err);
>          }

So I do see the issue here, but I don't agree with the fix, since it
changes behavior in a way that might cause other issues. If the frame
processing itself leads to sk->sk_state == BT_CLOSED, making the close
depend on the state from before the frame processing means we no longer
close the connection. And nothing guarantees that rfcomm_process_rx gets
scheduled again.
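A minimal sketch of this concern, assuming a hypothetical model in which handling a queued frame can itself flip the socket to closed. It compares the proposed ordering (state sampled once, before the receive loop) with the existing one (state checked after the loop); struct rx_model and the helper names are illustrative only, not RFCOMM code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a single rfcomm_process_rx() pass. */
struct rx_model {
	bool sock_closed;    /* stands in for sk->sk_state == BT_CLOSED */
	int  queued;         /* frames waiting in sk_receive_queue */
	int  closing_frame;  /* index of a frame that closes the socket, or -1 */
	bool session_closed; /* set when the close path runs */
};

static void recv_frame(struct rx_model *m, int idx)
{
	assert(idx >= 0 && idx < m->queued);
	if (idx == m->closing_frame)
		m->sock_closed = true;
}

/* Proposed patch: the state is sampled before the receive loop. */
static void process_rx_check_before(struct rx_model *m)
{
	if (!m->sock_closed) {
		for (int i = 0; i < m->queued; i++)
			recv_frame(m, i);
		m->queued = 0;
	} else {
		m->session_closed = true;
	}
}

/* Existing code: the state is checked after the receive loop. */
static void process_rx_check_after(struct rx_model *m)
{
	for (int i = 0; i < m->queued; i++)
		recv_frame(m, i);
	m->queued = 0;
	if (m->sock_closed)
		m->session_closed = true;
}
```

With one queued frame that closes the socket, the "check before" variant ends the pass with the socket closed but without running the close path; unless something schedules the function again, the session is never closed. The "check after" variant closes it in the same pass.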

diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
index 94b3388..606143b 100644
--- a/net/bluetooth/rfcomm/core.c
+++ b/net/bluetooth/rfcomm/core.c
@@ -1798,12 +1798,16 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)

         BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->sk_receive_queue));

+        rfcomm_session_hold(s);
+
         /* Get data directly from socket receive queue without copying it. */
         while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
                 skb_orphan(skb);
                 rfcomm_recv_frame(s, skb);
         }

+        rfcomm_session_put(s);
+
         if (sk->sk_state == BT_CLOSED) {
                 if (!s->initiator)
                         rfcomm_session_put(s);

What does the above patch do for you? If I read it correctly,
rfcomm_recv_ua() calls rfcomm_session_put(), which triggers the closing
of the session. With this patch, that is delayed until after all
frames have been processed.
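The refcount arithmetic behind this question can be modelled in a few lines. The sketch below is hypothetical user-space code, not the kernel implementation: it replays the panic-case sequence from the log (a count of 2 when the DLCI 0 UA arrives, the put in rfcomm_recv_ua(), the put in the BT_CLOSED branch, and a hold/put pair assumed inside rfcomm_session_close()), with and without the extra hold/put pair around the receive loop, and counts how often the delete path would run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct rfcomm_session. */
struct session {
	int refcnt;
	int del_calls;  /* times the rfcomm_session_del() path would run */
};

static void session_hold(struct session *s)
{
	s->refcnt++;
}

static void session_put(struct session *s)
{
	assert(s->refcnt > 0);
	if (--s->refcnt == 0)
		s->del_calls++;
}

/* Panic case: socket already BT_CLOSED, we are not the initiator. */
static int panic_case(bool extra_hold)
{
	struct session s = { .refcnt = 2 };

	if (extra_hold)
		session_hold(&s);   /* proposed hold before the receive loop */
	session_put(&s);            /* rfcomm_recv_ua(), BT_DISCONN case */
	if (extra_hold)
		session_put(&s);    /* proposed put after the receive loop */

	session_put(&s);            /* BT_CLOSED branch, !s->initiator */
	session_hold(&s);           /* rfcomm_session_close() */
	session_put(&s);
	return s.del_calls;
}
```

Under these assumptions both variants reach the delete path twice: a balanced hold/put pair shifts where the count reaches zero but cannot remove the surplus put, which is consistent with the report that the panic still happened with this patch.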

Regards

Marcel



2009-12-30 14:22:35

by Luiz Pena

Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Hi all.
I was wondering whether you eventually solved the problem, and if so, how?
In addition, you mentioned that a specific in-kernel L2CAP API should be
added so that direct use of L2CAP sockets won't be necessary anymore.
Is anybody working on that?
Thanks in advance, and happy New Year's Eve.

2009-12-22 16:20:50

by Andrei Emeltchenko

Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Hi Marcel,

On Sat, Dec 19, 2009 at 1:02 AM, Marcel Holtmann <[email protected]> wrote:
> Hi Nick,
>
>> >> Processing an RFCOMM UA frame when the socket is closed and we were not
>> >> the RFCOMM initiator would cause rfcomm_session_put() to be called
>> >> twice during rfcomm_process_rx(). This would cause a kernel panic in
>> >> rfcomm_session_close.
>> >>
>> >> This could be easily reproduced during disconnect with devices such as
>> >> the Motorola H270 that send an RFCOMM UA followed quickly by an L2CAP
>> >> disconnect request.
>> >> The hcidump for this looks like:
>> >>
>> >> 2009-09-21 17:22:37.788895 < ACL data: handle 1 flags 0x02 dlen 8
>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>> >> 2009-09-21 17:22:37.906204 > HCI Event: Number of Completed Packets (0x13) plen 5
>> >>     handle 1 packets 1
>> >> 2009-09-21 17:22:37.933090 > ACL data: handle 1 flags 0x02 dlen 8
>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>> >> 2009-09-21 17:22:38.636764 < ACL data: handle 1 flags 0x02 dlen 8
>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>> >> 2009-09-21 17:22:38.744125 > HCI Event: Number of Completed Packets (0x13) plen 5
>> >>     handle 1 packets 1
>> >> 2009-09-21 17:22:38.763687 > ACL data: handle 1 flags 0x02 dlen 8
>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>> >> 2009-09-21 17:22:38.783554 > ACL data: handle 1 flags 0x02 dlen 12
>> >>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>> >>
>> >> Avoid calling rfcomm_session_put() twice by skipping this call
>> >> in rfcomm_recv_ua() if the socket is closed.
>> >>
>> >> Picked from:
>> >> http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=1048e007842da2d6440679e1ca80f45438a6369d
>> >>
>> >> Signed-off-by: Nick Pelly <[email protected]>
>> >> Signed-off-by: Andrei Emeltchenko <[email protected]>
>> >> ---
>> >>  net/bluetooth/rfcomm/core.c |    3 ++-
>> >>  1 files changed, 2 insertions(+), 1 deletions(-)
>> >>
>> >> diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
>> >> index 0313e88..56ffcb8 100644
>> >> --- a/net/bluetooth/rfcomm/core.c
>> >> +++ b/net/bluetooth/rfcomm/core.c
>> >> @@ -1148,7 +1148,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
>> >>                          break;
>> >>
>> >>                  case BT_DISCONN:
>> >> -                        rfcomm_session_put(s);
>> >> +                        if (s->sock->sk->sk_state != BT_CLOSED)
>> >> +                                rfcomm_session_put(s);
>> >>                          break;
>> >>                  }
>> >>          }
>> >
>> > I am not a big fan of conditionally decreasing reference counts. I do
>> > think it would be better to fix this by holding an extra pair of
>> > reference counts or actually fixing the imbalance. What about the other
>> > patches I proposed?
>>
>> Your proposed patch was to add an extra hold() / put() reference count
>> around the offending put(). I did test this patch, and found it does
>> not fix the underlying imbalance, it just moves the kernel panic
>> somewhere else.
>>
>> As best I can tell, my patch does address the underlying imbalance. It
>> is in production on Android phones and seems to work well. As best I
>> can tell, there is not a cleaner solution that does not involve
>> significant refactoring of rfcomm refcounting.

We also have this patch in the Nokia N900 phone, and it was the best
solution for the problem mentioned.

> The RFCOMM reference counting is nasty and needs to be rewritten. One
> thing that needs to happen is that we stop using the L2CAP sockets
> directly. We have to put a proper in-kernel L2CAP API in between that
> ensures we are not mixing things up. That is the one issue we have
> always had in this area.
>
> Before applying this patch, I would like to additionally have a comment
> in front of this conditional put call that briefly explains the problem
> area here. The long explanation with logs etc. should be in the commit
> message. I have to make sure that we fully understand what is going on
> here and why we did it.

What do you think about the following comment?

--- a/net/bluetooth/rfcomm/core.c
+++ b/net/bluetooth/rfcomm/core.c
@@ -1151,7 +1151,11 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
                         break;

                 case BT_DISCONN:
-                        rfcomm_session_put(s);
+                        /* When the socket is closed and we are not the
+                         * RFCOMM initiator, rfcomm_process_rx already
+                         * calls rfcomm_session_put */
+                        if (s->sock->sk->sk_state != BT_CLOSED)
+                                rfcomm_session_put(s);
                         break;
                 }
         }
-- 


Regards
Andrei

2009-12-18 23:02:27

by Marcel Holtmann

Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Hi Nick,

> >> Processing an RFCOMM UA frame when the socket is closed and we were not
> >> the RFCOMM initiator would cause rfcomm_session_put() to be called
> >> twice during rfcomm_process_rx(). This would cause a kernel panic in
> >> rfcomm_session_close.
> >>
> >> This could be easily reproduced during disconnect with devices such as
> >> the Motorola H270 that send an RFCOMM UA followed quickly by an L2CAP
> >> disconnect request.
> >> The hcidump for this looks like:
> >>
> >> 2009-09-21 17:22:37.788895 < ACL data: handle 1 flags 0x02 dlen 8
> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
> >> 2009-09-21 17:22:37.906204 > HCI Event: Number of Completed Packets (0x13) plen 5
> >>     handle 1 packets 1
> >> 2009-09-21 17:22:37.933090 > ACL data: handle 1 flags 0x02 dlen 8
> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
> >> 2009-09-21 17:22:38.636764 < ACL data: handle 1 flags 0x02 dlen 8
> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
> >> 2009-09-21 17:22:38.744125 > HCI Event: Number of Completed Packets (0x13) plen 5
> >>     handle 1 packets 1
> >> 2009-09-21 17:22:38.763687 > ACL data: handle 1 flags 0x02 dlen 8
> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
> >> 2009-09-21 17:22:38.783554 > ACL data: handle 1 flags 0x02 dlen 12
> >>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
> >>
> >> Avoid calling rfcomm_session_put() twice by skipping this call
> >> in rfcomm_recv_ua() if the socket is closed.
> >>
> >> Picked from:
> >> http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=1048e007842da2d6440679e1ca80f45438a6369d
> >>
> >> Signed-off-by: Nick Pelly <[email protected]>
> >> Signed-off-by: Andrei Emeltchenko <[email protected]>
> >> ---
> >>  net/bluetooth/rfcomm/core.c |    3 ++-
> >>  1 files changed, 2 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
> >> index 0313e88..56ffcb8 100644
> >> --- a/net/bluetooth/rfcomm/core.c
> >> +++ b/net/bluetooth/rfcomm/core.c
> >> @@ -1148,7 +1148,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
> >>                          break;
> >>
> >>                  case BT_DISCONN:
> >> -                        rfcomm_session_put(s);
> >> +                        if (s->sock->sk->sk_state != BT_CLOSED)
> >> +                                rfcomm_session_put(s);
> >>                          break;
> >>                  }
> >>          }
> >
> > I am not a big fan of conditionally decreasing reference counts. I do
> > think it would be better to fix this by holding an extra pair of
> > reference counts or actually fixing the imbalance. What about the other
> > patches I proposed?
>
> Your proposed patch was to add an extra hold() / put() reference count
> around the offending put(). I did test this patch, and found it does
> not fix the underlying imbalance, it just moves the kernel panic
> somewhere else.
>
> As best I can tell, my patch does address the underlying imbalance. It
> is in production on Android phones and seems to work well. As best I
> can tell, there is not a cleaner solution that does not involve
> significant refactoring of rfcomm refcounting.

The RFCOMM reference counting is nasty and needs to be rewritten. One
thing that needs to happen is that we stop using the L2CAP sockets
directly. We have to put a proper in-kernel L2CAP API in between that
ensures we are not mixing things up. That is the one issue we have
always had in this area.

Before applying this patch, I would like to additionally have a comment
in front of this conditional put call that briefly explains the problem
area here. The long explanation with logs etc. should be in the commit
message. I have to make sure that we fully understand what is going on
here and why we did it.

Regards

Marcel



2009-12-18 22:30:56

by Nick Pelly

Subject: Re: kernel panic happens when disconnecting Bluetooth headset

On Fri, Dec 18, 2009 at 1:59 PM, Marcel Holtmann <[email protected]> wrote:
> Hi Andrei,
>
>> Processing a RFCOMM UA frame when the socket is closed and we were not
>> the
>> RFCOMM initiator would cause rfcomm_session_put() to be called twice
>> during
>> rfcomm_process_rx(). This would cause a kernel panic in
>> rfcomm_session_close.
>>
>> This could be easily reproduced during disconnect with devices such as
>> Motorola H270 that send RFCOMM UA followed quickly by L2CAP disconnect
>> request.
>> This hcidump for this looks like:
>>
>> 2009-09-21 17:22:37.788895 < ACL data: handle 1 flags 0x02 dlen 8
>> =A0 =A0L2CAP(d): cid 0x0041 len 4 [psm 3]
>> =A0 =A0 =A0RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>> 2009-09-21 17:22:37.906204 > HCI Event: Number of Completed Packets
>> (0x13)
>> plen 5
>> =A0 =A0handle 1 packets 1
>> 2009-09-21 17:22:37.933090 > ACL data: handle 1 flags 0x02 dlen 8
>> =A0 =A0L2CAP(d): cid 0x0040 len 4 [psm 3]
>> =A0 =A0 =A0RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>> 2009-09-21 17:22:38.636764 < ACL data: handle 1 flags 0x02 dlen 8
>> =A0 =A0L2CAP(d): cid 0x0041 len 4 [psm 3]
>> =A0 =A0 =A0RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>> 2009-09-21 17:22:38.744125 > HCI Event: Number of Completed Packets
>> (0x13)
>> plen 5
>> =A0 =A0handle 1 packets 1
>> 2009-09-21 17:22:38.763687 > ACL data: handle 1 flags 0x02 dlen 8
>> =A0 =A0L2CAP(d): cid 0x0040 len 4 [psm 3]
>> =A0 =A0 =A0RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>> 2009-09-21 17:22:38.783554 > ACL data: handle 1 flags 0x02 dlen 12
>> =A0 =A0L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>>
>> Avoid calling rfcomm_session_put() twice by skipping this call
>> in rfcomm_recv_ua() if the socket is closed.
>>
>> Picked from:
>> http://android.git.kernel.org/?p=3Dkernel/common.git;a=3Dcommit;h=3D1048=
e007842da2d6440679e1ca80f45438a6369d
>>
>> Signed-off-by: Nick Pelly <[email protected]>
>> Signed-off-by: Andrei Emeltchenko <[email protected]>
>> ---
>> =A0net/bluetooth/rfcomm/core.c | =A0 =A03 ++-
>> =A01 files changed, 2 insertions(+), 1 deletions(-)
>>
>> diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
>> index 0313e88..56ffcb8 100644
>> --- a/net/bluetooth/rfcomm/core.c
>> +++ b/net/bluetooth/rfcomm/core.c
>> @@ -1148,7 +1148,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
>>  			break;
>>
>>  		case BT_DISCONN:
>> -			rfcomm_session_put(s);
>> +			if (s->sock->sk->sk_state != BT_CLOSED)
>> +				rfcomm_session_put(s);
>>  			break;
>>  		}
>>  	}
>
> I am not a big fan of conditionally decreasing reference counts. I do
> think it would be better to fix this by holding an extra pair of
> reference counts or actually fixing the imbalance. What about the other
> patches I proposed?

Your proposed patch was to add an extra hold() / put() reference count
around the offending put(). I did test this patch, and found it does
not fix the underlying imbalance, it just moves the kernel panic
somewhere else.

As best I can tell, my patch does address the underlying imbalance. It
is in production on Android phones and seems to work well. As best I
can tell, there is not a cleaner solution that does not involve
significant refactoring of rfcomm refcounting.

Nick

2009-12-18 21:59:15

by Marcel Holtmann

[permalink] [raw]
Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Hi Andrei,

> Processing an RFCOMM UA frame when the socket is closed and we were not
> the RFCOMM initiator would cause rfcomm_session_put() to be called twice
> during rfcomm_process_rx(). This would cause a kernel panic in
> rfcomm_session_close.
>
> This can easily be reproduced during disconnect with devices such as
> Motorola H270 that send RFCOMM UA followed quickly by an L2CAP
> disconnect request. The hcidump for this looks like:
>
> 2009-09-21 17:22:37.788895 < ACL data: handle 1 flags 0x02 dlen 8
>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
> 2009-09-21 17:22:37.906204 > HCI Event: Number of Completed Packets (0x13) plen 5
>     handle 1 packets 1
> 2009-09-21 17:22:37.933090 > ACL data: handle 1 flags 0x02 dlen 8
>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
> 2009-09-21 17:22:38.636764 < ACL data: handle 1 flags 0x02 dlen 8
>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
> 2009-09-21 17:22:38.744125 > HCI Event: Number of Completed Packets (0x13) plen 5
>     handle 1 packets 1
> 2009-09-21 17:22:38.763687 > ACL data: handle 1 flags 0x02 dlen 8
>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
> 2009-09-21 17:22:38.783554 > ACL data: handle 1 flags 0x02 dlen 12
>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>
> Avoid calling rfcomm_session_put() twice by skipping this call
> in rfcomm_recv_ua() if the socket is closed.
>
> Picked from:
> http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=1048e007842da2d6440679e1ca80f45438a6369d
>
> Signed-off-by: Nick Pelly <[email protected]>
> Signed-off-by: Andrei Emeltchenko <[email protected]>
> ---
> net/bluetooth/rfcomm/core.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
> index 0313e88..56ffcb8 100644
> --- a/net/bluetooth/rfcomm/core.c
> +++ b/net/bluetooth/rfcomm/core.c
> @@ -1148,7 +1148,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
>  			break;
>
>  		case BT_DISCONN:
> -			rfcomm_session_put(s);
> +			if (s->sock->sk->sk_state != BT_CLOSED)
> +				rfcomm_session_put(s);
>  			break;
>  		}
>  	}

I am not a big fan of conditionally decreasing reference counts. I do
think it would be better to fix this by holding an extra pair of
reference counts or actually fixing the imbalance. What about the other
patches I proposed?

Regards

Marcel



2009-12-18 14:20:17

by Andrei Emeltchenko

[permalink] [raw]
Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Marcel,

On Tue, Sep 22, 2009 at 10:18 PM, Nick Pelly <[email protected]> wrote:
> On Mon, Sep 21, 2009 at 6:29 PM, Nick Pelly <[email protected]> wrote:
>> On Mon, Sep 21, 2009 at 5:52 PM, Nick Pelly <[email protected]> wrote:
>>> On Mon, Sep 14, 2009 at 2:10 AM, Lan Zhu <[email protected]> wrote:
>>>> Hi Marcel,
>>>>
>>>> 2009/9/12 Marcel Holtmann <[email protected]>:
>>>>> Hi Zhu,
>>>>>
>>>>>> >> We hit an issue where a kernel panic happens when disconnecting some
>>>>>> >> kinds of Bluetooth headsets. We did some analysis and made some changes
>>>>>> >> to the kernel code which avoid the panic. Would you please help check
>>>>>> >> whether our analysis and fix are correct?
>>>>>> >>
>>>>>> >> =============
>>>>>> >> Issue description
>>>>>> >> =============
>>>>>> >> On the Android platform (kernel 2.6.29), disconnecting a Bluetooth
>>>>>> >> headset may cause a kernel panic under certain conditions.
>>>>>> >>
>>>>>> >> (Pre-condition is android paired with headset.)
>>>>>> >> Initiate the connection from android, disconnect it from android, result is OK.
>>>>>> >> Initiate the connection from android, disconnect it from headset, result is OK.
>>>>>> >> Initiate the connection from headset, disconnect it from headset, result is OK.
>>>>>> >> Initiate the connection from headset, disconnect it from android, for
>>>>>> >> Motorola H12 headset, result is OK.
>>>>>> >> Initiate the connection from headset, disconnect it from android, for
>>>>>> >> Motorola H620/560 headset, result is kernel panic.
>>>>>> >>
>>>>>> >> =============
>>>>>> >> Kernel panic point
>>>>>> >> =============
>>>>>> >> The kernel panics in __list_del() called from rfcomm_session_del();
>>>>>> >> the panic reason is "Unable to handle kernel paging request at virtual
>>>>>> >> address 00200200".
>>>>>> >>
>>>>>> >> =============
>>>>>> >> Kernel log analysis
>>>>>> >> =============
>>>>>> >> rfcomm_session_del() is called again after the session entry has
>>>>>> >> already been removed from the list, so the second __list_del() uses
>>>>>> >> invalid pointers and the kernel panics. This happens when
>>>>>> >> rfcomm_recv_ua() is called while the socket state is BT_CLOSED, so we
>>>>>> >> need to find out why the socket state becomes BT_CLOSED before
>>>>>> >> rfcomm_recv_ua() is called.
>>>>>> >>
>>>>>> >> # [  171.677429] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>>>>>> >> [  171.683532] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 14
>>>>>> >> [  171.689422] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>>>>>> >> [  171.695709] rfcomm_process_rx: @@@ @@@ sk_state = 1
>>>>>> >> [  171.701110] rfcomm_process_dlcs: session cc751be0 state 1
>>>>>> >> [  171.706939] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40 rx_credits 33 tx_credits 31
>>>>>> >> [  171.715515] rfcomm_send_frame: session cc751be0 len 18
>>>>>> >> [  171.721130] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>>>>>> >> [  174.127807] rfcomm_sock_sendmsg: sock ce9a0960, sk cc5c4c00
>>>>>> >> [  174.134490] rfcomm_dlc_send: dlc cd3fe920 mtu 255 len 6
>>>>>> >> [  174.141540] rfcomm_process_rx: session cc751be0 state 1 qlen 0
>>>>>> >> [  174.148498] rfcomm_process_rx: @@@ @@@ sk_state = 1
>>>>>> >> [  174.154968] rfcomm_process_dlcs: session cc751be0 state 1
>>>>>> >> [  174.161437] rfcomm_process_tx: dlc cd3fe920 state 1 cfc 40 rx_credits 33 tx_credits 30
>>>>>> >> [  174.171173] rfcomm_send_frame: session cc751be0 len 10
>>>>>> >> [  174.177642] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>>>>>> >> [  174.205932] rfcomm_sock_release: sock ce9a0960, sk cc5c4c00
>>>>>> >> [  174.212707] rfcomm_sock_shutdown: sock ce9a0960, sk cc5c4c00
>>>>>> >> [  174.220031] __rfcomm_sock_close: sk cc5c4c00 state 1 socket ce9a0960
>>>>>> >> [  174.227508] __rfcomm_dlc_close: dlc cd3fe920 state 1 dlci 20 err 0 session cc751be0
>>>>>> >> [  174.236877] rfcomm_send_disc: cc751be0 dlci 20
>>>>>> >> [  174.242706] rfcomm_send_frame: session cc751be0 len 4
>>>>>> >> [  174.248962] rfcomm_dlc_set_timer: dlc cd3fe920 state 8 timeout 2560
>>>>>> >> [  174.256835] rfcomm_sock_kill: sk cc5c4c00 state 1 refcnt 2
>>>>>> >> [  174.263336] rfcomm_sock_destruct: sk cc5c4c00 dlc cd3fe920
>>>>>> >> [  174.399444] rfcomm_l2data_ready: ccf70400 bytes 4
>>>>>> >> [  174.404724] rfcomm_process_rx: session cc751be0 state 1 qlen 1
>>>>>> >> [  174.411010] rfcomm_process_rx: @@@ @@@ sk_state = 1
>>>>>> >> [  174.416412] rfcomm_recv_ua: session cc751be0 state 1 dlci 20
>>>>>> >> [  174.422515] __rfcomm_dlc_close: dlc cd3fe920 state 9 dlci 20 err 0 session cc751be0
>>>>>> >> [  174.430816] rfcomm_dlc_clear_timer: dlc cd3fe920 state 9
>>>>>> >> [  174.436553] rfcomm_dlc_unlink: dlc cd3fe920 refcnt 1 session cc751be0
>>>>>> >> [  174.443572] rfcomm_dlc_free: cd3fe920
>>>>>> >> [  174.447570] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 3
>>>>>> >> [  174.454528] rfcomm_send_disc: cc751be0 dlci 0
>>>>>> >> [  174.459259] rfcomm_send_frame: session cc751be0 len 4
>>>>>> >> [  174.464904] rfcomm_process_dlcs: session cc751be0 state 8
>>>>>> >> [  174.470703] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>>>>>> >> [  174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>>>>>> >> [  174.903442] rfcomm_l2state_change: ccf70400 state 9
>>>>>> >> [  174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>>>>>> >> [  174.915130] rfcomm_process_rx: @@@ @@@ sk_state = 9
>>>>>> >> [  174.920562] rfcomm_recv_ua: session cc751be0 state 8 dlci 0
>>>>>> >> [  174.926574] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 2
>>>>>> >> [  174.933532] rfcomm_process_rx: @@@ @@@ sk_state == BT_CLOSED , s->initiator=0
>>>>>> >> [  174.941253] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>>>>>> >> [  174.948211] rfcomm_session_del: session cc751be0 state 8
>>>>>> >> [  174.953918] @@@@ in rfcomm_session_del()
>>>>>> >> [  174.958312] @@@@ s->list = cc751be0
>>>>>> >> [  174.962097] @@@@ s->list.next = ccbfe9a0
>>>>>> >> [  174.966369] @@@@ s->list.prev = c047d524
>>>>>> >> [  174.970733] @@@@ list is valid, call list_del()
>>>>>> >> [  174.975646] @@@@ after list_del()
>>>>>> >> [  174.979278] @@@@ s->list = cc751be0
>>>>>> >> [  174.983184] @@@@ s->list.next = 00100100
>>>>>> >> [  174.987457] @@@@ s->list.prev = 00200200
>>>>>> >> [  174.991729] rfcomm_session_close: session cc751be0 state 8 err 104
>>>>>> >> [  174.998504] rfcomm_session_put: in rfcomm_session_put, s->refcnt = 1
>>>>>> >> [  175.005310] rfcomm_session_del: session cc751be0 state 9
>>>>>> >> [  175.011169] @@@@ in rfcomm_session_del()
>>>>>> >> [  175.015441] @@@@ s->list = cc751be0
>>>>>> >> [  175.019409] @@@@ s->list.next = 00100100
>>>>>> >> [  175.023651] @@@@ s->list.prev = 00200200
>>>>>> >> [  175.027923] @@@@ list is valid, call list_del()
>>>>>> >> [  175.032958] Unable to handle kernel paging request at virtual address 00200200
>>>>>> >> [  175.040679] pgd = c0004000
>>>>>> >> [  175.043792] [00200200] *pgd=00000000
>>>>>> >> [  175.047821] Internal error: Oops: 817 [#1]
>>>>>> >> [  175.052246] Modules linked in:
>>>>>> >> [  175.055725] CPU: 0    Not tainted  (2.6.29-omap1-dirty #34)
>>>>>> >> [  175.061859] PC is at rfcomm_session_del+0x6c/0x108
>>>>>> >> [  175.067047] LR is at release_console_sem+0x190/0x1a0
>>>>>> >> [  175.072509] pc : [<c033ded8>]    lr : [<c0066308>]    psr: 60000013
>>>>>> >> [  175.072509] sp : cc1abf38  ip : cc1abe68  fp : cc1abf4c
>>>>>> >> [  175.084960] r10: cc751c04  r9 : c036d2fc  r8 : cc751be0
>>>>>> >> [  175.090545] r7 : 00000068  r6 : cc751c04  r5 : 00000009  r4 : cc751be0
>>>>>> >> [  175.097656] r3 : 00100100  r2 : 00100100  r1 : 00200200  r0 : c0422876
>>>>>> >>
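The `00100100` / `00200200` values printed after the first `list_del()` are the list poison pointers (`LIST_POISON1` / `LIST_POISON2` on this kernel). `list_del()` unlinks the entry and then overwrites its `next`/`prev` with these values, so a check that only rejects NULL pointers still reports the list as valid, as the debug output does, and the second deletion then faults at `00200200`. A minimal userspace sketch of that behaviour follows; the `model_*` names are hypothetical, and only the two constants are taken from the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal doubly linked list node, shaped like the kernel's list_head. */
struct model_list {
	struct model_list *next, *prev;
};

/* Poison values taken from the log above (LIST_POISON1/LIST_POISON2 on
 * this 2.6.29 ARM build). They are only compared, never dereferenced. */
#define MODEL_POISON1 ((struct model_list *)(uintptr_t)0x00100100)
#define MODEL_POISON2 ((struct model_list *)(uintptr_t)0x00200200)

static void model_list_init(struct model_list *head)
{
	head->next = head;
	head->prev = head;
}

static void model_list_add(struct model_list *entry, struct model_list *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Same idea as list_del(): unlink the entry, then poison its pointers so
 * that a second deletion faults instead of silently corrupting the list. */
static void model_list_del(struct model_list *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = MODEL_POISON1;
	entry->prev = MODEL_POISON2;
}

/* A non-NULL test cannot tell a deleted entry from a linked one; only a
 * comparison against the poison values can. */
static int model_entry_deleted(const struct model_list *entry)
{
	assert(entry != NULL);
	return entry->next == MODEL_POISON1 && entry->prev == MODEL_POISON2;
}
```

After `model_list_add(&a, &head)` followed by `model_list_del(&a)`, `a.next` and `a.prev` are still non-NULL, but `model_entry_deleted(&a)` returns 1. Calling `model_list_del(&a)` a second time would dereference the poison values, which is what the second `rfcomm_session_del()` does in the oops.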
>>>>>> >> =============
>>>>>> >> HCI log analysis
>>>>>> >> =============
>>>>>> >> Comparing the hcidump log of the correct case with that of the panic
>>>>>> >> case, we found only one difference in the message sequence: in the
>>>>>> >> panic case, the headset sends an L2CAP Disconn_Req immediately after
>>>>>> >> sending the RFCOMM UA frame to Android. We think this is what causes
>>>>>> >> the socket state to become BT_CLOSED.
>>>>>> >>
>>>>>> >> Please compare these two logs, paying attention to the direction of
>>>>>> >> the last Disconn_Req.
>>>>>> >>
>>>>>> >>
>>>>>> >> Log of correct case:
>>>>>> >> ----------------------------
>>>>>> >>
>>>>>> >>
>>>>>> >> 2009-09-10 15:27:28.963519 < ACL data: handle 1 flags 0x02 dlen 22
>>>>>> >>     L2CAP(d): cid 0x0047 len 18 [psm 3]
>>>>>> >>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>>>>>> >>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>>>>>> >> 2009-09-10 15:27:28.967272 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>>>> >>     handle 1 packets 1
>>>>>> >> 2009-09-10 15:27:29.243945 < ACL data: handle 1 flags 0x02 dlen 8
>>>>>> >>     L2CAP(d): cid 0x0047 len 4 [psm 3]
>>>>>> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>>>>>> >> 2009-09-10 15:27:29.247363 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>>>> >>     handle 1 packets 1
>>>>>> >> 2009-09-10 15:27:29.274890 > ACL data: handle 1 flags 0x02 dlen 8
>>>>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>>>> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>>>>>> >> 2009-09-10 15:27:29.296343 < ACL data: handle 1 flags 0x02 dlen 8
>>>>>> >>     L2CAP(d): cid 0x0047 len 4 [psm 3]
>>>>>> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>>>>>> >> 2009-09-10 15:27:29.298480 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>>>> >>     handle 1 packets 1
>>>>>> >> 2009-09-10 15:27:29.319873 > ACL data: handle 1 flags 0x02 dlen 8
>>>>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>>>> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>>>>>> >> 2009-09-10 15:27:29.320727 < ACL data: handle 1 flags 0x02 dlen 12
>>>>>> >>     L2CAP(s): Disconn req: dcid 0x0047 scid 0x0040
>>>>>> >> 2009-09-10 15:27:29.323474 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>>>> >>     handle 1 packets 1
>>>>>> >> 2009-09-10 15:27:29.337237 > ACL data: handle 1 flags 0x02 dlen 12
>>>>>> >>     L2CAP(s): Disconn rsp: dcid 0x0047 scid 0x0040
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> Log of panic case:
>>>>>> >> ------------------------
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> 2009-09-10 13:34:24.020208 < ACL data: handle 1 flags 0x02 dlen 22
>>>>>> >>     L2CAP(d): cid 0x0041 len 18 [psm 3]
>>>>>> >>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>>>>>> >>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>>>>>> >> 2009-09-10 13:34:24.281256 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>>>> >>     handle 1 packets 1
>>>>>> >> 2009-09-10 13:34:24.083580 < ACL data: handle 1 flags 0x02 dlen 8
>>>>>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>>>>> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>>>>>> >> 2009-09-10 13:34:24.529442 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>>>> >>     handle 1 packets 1
>>>>>> >> 2009-09-10 13:34:24.531914 > ACL data: handle 1 flags 0x02 dlen 8
>>>>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>>>> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>>>>>> >> 2009-09-10 13:34:24.533135 < ACL data: handle 1 flags 0x02 dlen 8
>>>>>> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>>>>> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>>>>>> >> 2009-09-10 13:34:25.028649 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>>>> >>     handle 1 packets 1
>>>>>> >> 2009-09-10 13:34:25.032128 > ACL data: handle 1 flags 0x02 dlen 8
>>>>>> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>>>> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>>>>>> >> 2009-09-10 13:34:25.032341 > ACL data: handle 1 flags 0x02 dlen 12
>>>>>> >>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>>>>>> >> 2009-09-10 13:34:25.032646 < ACL data: handle 1 flags 0x02 dlen 12
>>>>>> >>     L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
>>>>>> >>
>>>>>> >> =============
>>>>>> >> Analysis Result
>>>>>> >> =============
>>>>>> >> Some Bluetooth headsets, such as the Motorola H560/H620 which are
>>>>>> >> based on the BCM2044S, send an L2CAP Disconn_Req command right after
>>>>>> >> sending the RFCOMM UA frame. This L2CAP Disconn_Req moves the state
>>>>>> >> of the session's L2CAP socket to BT_CLOSED before the UA frame has
>>>>>> >> been completely handled, which leads to the kernel panic. I think we
>>>>>> >> can ignore received RFCOMM frames if the socket state is BT_CLOSED,
>>>>>> >> because processing them makes no sense in that state.
>>>>>> >>
>>>>>> >>
>>>>>> >> =============
>>>>>> >> Changed Code
>>>>>> >> =============
>>>>>> >> We changed the code in the function rfcomm_process_rx() in
>>>>>> >> net/bluetooth/rfcomm/core.c to check the socket state before
>>>>>> >> handling the received frames. If the socket state is BT_CLOSED, we
>>>>>> >> don't handle any RFCOMM frames but just close the session.
>>>>>> >>
>>>>>> >> The change is as follows:
>>>>>> >>
>>>>>> >> +	if (sk->sk_state != BT_CLOSED) {
>>>>>> >>  	/* Get data directly from socket receive queue without copying it. */
>>>>>> >>  	while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>>>>>> >>  		skb_orphan(skb);
>>>>>> >>  		rfcomm_recv_frame(s, skb);
>>>>>> >>  	}
>>>>>> >> -
>>>>>> >> -	if (sk->sk_state == BT_CLOSED) {
>>>>>> >> +	} else {
>>>>>> >>  		if (!s->initiator)
>>>>>> >>  			rfcomm_session_put(s);
>>>>>> >>
>>>>>> >>  		rfcomm_session_close(s, sk->sk_err);
>>>>>> >>  	}
>>>>>> >
>>>>>> > So I do see the issue here, but I don't agree with the fix, since it
>>>>>> > changes behavior in a way that might cause other issues. If the frame
>>>>>> > processing leads to sk->sk_state == BT_CLOSED, we no longer close the
>>>>>> > connection, because the decision now depends on the state before the
>>>>>> > frame processing. And nothing guarantees that rfcomm_process_rx gets
>>>>>> > scheduled again.
>>>>>> >
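The scheduling objection can be made concrete with a toy model of one `rfcomm_process_rx()` run. This is a hypothetical sketch, not kernel code: `struct model_rx`, its fields, and the notion of a frame after which the state flips are illustrative only. It shows that if the state is sampled before the queue is drained, a transition to `BT_CLOSED` that happens during frame processing is not acted on in that run, so closing the session then depends on the function being scheduled again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one rfcomm_process_rx() run. sk_closed stands for
 * sk->sk_state == BT_CLOSED; closes_at_frame is the index of the queued
 * frame after which that state flips (-1: it never flips in this run). */
struct model_rx {
	int queued;
	int closes_at_frame;
	bool sk_closed;
	int frames_handled;
	bool session_closed;
};

static void model_drain_queue(struct model_rx *r)
{
	assert(r != NULL);
	for (int i = 0; i < r->queued; i++) {
		r->frames_handled++;            /* rfcomm_recv_frame() */
		if (i == r->closes_at_frame)
			r->sk_closed = true;
	}
	r->queued = 0;
}

/* Existing order: process the frames, then look at the state. */
static void model_rx_check_after(struct model_rx *r)
{
	model_drain_queue(r);
	if (r->sk_closed)
		r->session_closed = true;       /* rfcomm_session_close() */
}

/* Order proposed in the first report: look at the state first, then
 * either process frames or close, never both in the same run. */
static void model_rx_check_before(struct model_rx *r)
{
	assert(r != NULL);
	if (!r->sk_closed)
		model_drain_queue(r);
	else
		r->session_closed = true;
}
```

With one queued frame that flips the state, `model_rx_check_after()` handles the frame and closes the session in the same run, while `model_rx_check_before()` handles the frame but leaves the session open even though the socket is now closed.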
>>>>>> > diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
>>>>>> > index 94b3388..606143b 100644
>>>>>> > --- a/net/bluetooth/rfcomm/core.c
>>>>>> > +++ b/net/bluetooth/rfcomm/core.c
>>>>>> > @@ -1798,12 +1798,16 @@ static inline void rfcomm_process_rx(struct rfcomm_session *s)
>>>>>> >
>>>>>> >  	BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->sk_receive_queue));
>>>>>> >
>>>>>> > +	rfcomm_session_hold(s);
>>>>>> > +
>>>>>> >  	/* Get data directly from socket receive queue without copying it. */
>>>>>> >  	while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>>>>>> >  		skb_orphan(skb);
>>>>>> >  		rfcomm_recv_frame(s, skb);
>>>>>> >  	}
>>>>>> >
>>>>>> > +	rfcomm_session_put(s);
>>>>>> > +
>>>>>> >  	if (sk->sk_state == BT_CLOSED) {
>>>>>> >  		if (!s->initiator)
>>>>>> >  			rfcomm_session_put(s);
>>>>>> >
>>>>>> > What does the above patch do for you? Since if I read it correctly, then
>>>>>> > the rfcomm_recv_ua causes the rfcomm_session_put to trigger the closing
>>>>>> > of the session. And then in this case it is delayed until after all
>>>>>> > frames are processed.
>>>>>> >
>>>>>> >
>>>>>>
>>>>>> I've tried your patch, but unfortunately the kernel panic still happened.
>>>>>>
>>>>>> From the log I noticed that if rfcomm_l2state_change is called before
>>>>>> rfcomm_process_rx, the kernel panic always happens.
>>>>>>
>>>>>> The following lines are from the correct log:
>>>>>>
>>>>>> [  139.323852] rfcomm_l2data_ready: ccf70000 bytes 4
>>>>>> [  139.346252] rfcomm_process_rx: session ccb94ce0 state 8 qlen 1
>>>>>> ...
>>>>>> [  139.457519] rfcomm_l2state_change: ccf70000 state 9
>>>>>> (disconnect ok)
>>>>>>
>>>>>> In the above case, when process_rx runs, the code under the condition
>>>>>> "if (sk->sk_state == BT_CLOSED)" is never executed.
>>>>>>
>>>>>> The following lines are from the panic log:
>>>>>>
>>>>>> [  174.898284] rfcomm_l2data_ready: ccf70400 bytes 4
>>>>>> [  174.903442] rfcomm_l2state_change: ccf70400 state 9
>>>>>> [  174.908874] rfcomm_process_rx: session cc751be0 state 8 qlen 1
>>>>>> ...
>>>>>> (then panic)
>>>>>>
>>>>>> In the above case, sk_state is first changed to 9 (BT_CLOSED) and only
>>>>>> then does process_rx run, so the code under the condition
>>>>>> "if (sk->sk_state == BT_CLOSED)" is executed and session_put is called
>>>>>> twice. I think this is the root cause of the panic.
>>>>>
>>>>> I know why it happens, that is not the problem. My point is not to break
>>>>> current scheduling assumptions.
>>>>>
>>>>> So if you now move the rfcomm_session_put() to the end of the function,
>>>>> then it should be fine, right?
>>>>>
>>>>> Regards
>>>>>
>>>>> Marcel
>>>>>
>>>>>
>>>>>
>>>>
>>>> You are right. I moved the rfcomm_session_put() to the end of
>>>> rfcomm_process_rx(), and the kernel panic no longer happens.
>>>>
>>>> The changed code is as follows:
>>>>
>>>> @@ -1796,6 +1796,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
>>>>
>>>>  	BT_DBG("session %p state %ld qlen %d", s, s->state, skb_queue_len(&sk->s
>>>>
>>>> +	rfcomm_session_hold(s);
>>>> +
>>>>  	/* Get data directly from socket receive queue without copying it. */
>>>>  	while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
>>>>  		skb_orphan(skb);
>>>> @@ -1808,6 +1810,8 @@ static inline void rfcomm_process_rx(struct rfcomm_session
>>>>
>>>>  		rfcomm_session_close(s, sk->sk_err);
>>>>  	}
>>>> +
>>>> +	rfcomm_session_put(s);
>>>>  }
>>>>
>>>>  static inline void rfcomm_accept_connection(struct rfcomm_session *s)
>>>>
>>>> Please submit this change to bluez release.
>>>
>>>
>>> Unfortunately, with this change I get a panic disconnecting from
>>> Motorola H270 in the case that the headset initiated RFCOMM and we
>>> disconnect RFCOMM.
>>>
>>> Here is the hcidump:
>>>
>>> 2009-09-21 17:22:37.384811 < ACL data: handle 1 flags 0x02 dlen 22
>>>     L2CAP(d): cid 0x0041 len 18 [psm 3]
>>>       RFCOMM(d): UIH: cr 0 dlci 20 pf 0 ilen 14 fcs 0xeb
>>>       0000: 0d 0a 2b 43 49 45 56 3a  20 37 2c 33 0d 0a        ..+CIEV: 7,3..
>>> 2009-09-21 17:22:37.502273 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>     handle 1 packets 1
>>> 2009-09-21 17:22:37.788895 < ACL data: handle 1 flags 0x02 dlen 8
>>>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>>> 2009-09-21 17:22:37.906204 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>     handle 1 packets 1
>>> 2009-09-21 17:22:37.933090 > ACL data: handle 1 flags 0x02 dlen 8
>>>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>>> 2009-09-21 17:22:38.636764 < ACL data: handle 1 flags 0x02 dlen 8
>>>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>>>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>>> 2009-09-21 17:22:38.744125 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>     handle 1 packets 1
>>> 2009-09-21 17:22:38.763687 > ACL data: handle 1 flags 0x02 dlen 8
>>>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>>>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>>> 2009-09-21 17:22:38.783554 > ACL data: handle 1 flags 0x02 dlen 12
>>>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>>> 2009-09-21 17:22:39.029526 < ACL data: handle 1 flags 0x02 dlen 12
>>>     L2CAP(s): Disconn rsp: dcid 0x0040 scid 0x0041
>>> 2009-09-21 17:22:39.136581 > HCI Event: Number of Completed Packets (0x13) plen 5
>>>     handle 1 packets 1
>>> 2009-09-21 17:22:41.337203 > HCI Event: Disconn Complete (0x05) plen 4
>>>     status 0x00 handle 1 reason 0x13
>>>     Reason: Remote User Terminated Connection
>>>
>>> And the panic:
>>>
>>> <7>[ 3161.665557] rfcomm:rfcomm_session_del: session c9c06ad0 state 9
>>> <7>[ 3161.671905] l2cap:l2cap_sock_release: sock cea04360, sk c97f02f8
>>> <7>[ 3161.678497] l2cap:l2cap_sock_shutdown: sock cea04360, sk c97f02f8
>>> <7>[ 3161.685028] l2cap:l2cap_sock_kill: sk c97f02f8 state 9
>>> <7>[ 3161.695587] l2cap:l2cap_sock_destruct: sk c97f02f8
>>> <4>[ 3161.700805] npelly 1911 rfcomm_process_sessions session c9c06ad0 refcnt 1802201963
>>> <7>[ 3161.709014] rfcomm:rfcomm_process_dlcs: session c9c06ad0 state 1802201963
>>> <7>[ 3161.716308] rfcomm:rfcomm_process_dlcs: session c9c06ad0 dlc 6b6b6b6b
>>> <1>[ 3161.726776] Unable to handle kernel paging request at virtual address 6b6b6b6b
>>> <1>[ 3161.734619] pgd = c0004000
>>> <1>[ 3161.737609] [6b6b6b6b] *pgd=00000000
>>> <4>[ 3161.741638] Internal error: Oops: 5 [#1] PREEMPT
>>> <4>[ 3161.746734] Modules linked in:
>>> <4>[ 3161.750213] CPU: 0    Not tainted  (2.6.29-omap1-07358-g9a3fd55-dirty #206)
>>> <4>[ 3161.757629] PC is at rfcomm_process_dlcs+0x108/0x590
>>> <4>[ 3161.762969] LR is at preempt_schedule+0x44/0x54
>>> <4>[ 3161.767852] pc : [<c03911f4>]    lr : [<c03a27c4>]    psr: 60000113
>>> <4>[ 3161.767883] sp : ccdf9e80  ip : ccdf9dd8  fp : ccdf9edc
>>> <4>[ 3161.780273] r10: 00000000  r9 : c9c06af4  r8 : c9c06ad0
>>> <4>[ 3161.786010] r7 : 00000000  r6 : c9c06ad0  r5 : c4c68680  r4 : 6b6b6b6b
>>> <4>[ 3161.792968] r3 : c9c06ae0  r2 : ccdf8000  r1 : c61a8940  r0 : 0000004c
>>> <4>[ 3161.800079] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
>>> <4>[ 3161.807983] Control: 10c5387d  Table: 86db8019  DAC: 00000017
>>> <4>[ 3161.814147]
>>> <4>[ 3161.814147] PC: 0xc0391174:
>>> [...]
>>> <4>[ 3162.973175] Backtrace:
>>> <4>[ 3162.976013] [<c03910ec>] (rfcomm_process_dlcs+0x0/0x590) from [<c03930b0>] (rfcomm_process_sessions+0x1a34/0x1a9c)
>>> <4>[ 3162.987579] [<c039167c>] (rfcomm_process_sessions+0x0/0x1a9c) from [<c03932ec>] (rfcomm_run+0x1d4/0x2ac)
>>> <4>[ 3162.998199] [<c0393118>] (rfcomm_run+0x0/0x2ac) from [<c008e7d8>] (kthread+0x5c/0x94)
>>> <4>[ 3163.013763] [<c008e77c>] (kthread+0x0/0x94) from [<c007c998>] (do_exit+0x0/0x714)
>>>
>>>
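The `refcnt 1802201963` and `dlc 6b6b6b6b` values in this oops are the same number: 1802201963 is 0x6b6b6b6b. 0x6b is the byte that slab debugging (`POISON_FREE`) writes over freed objects, so, assuming slab poisoning is enabled on this build (which the pattern suggests), `rfcomm_process_sessions()` and `rfcomm_process_dlcs()` were reading a session that had already been freed. A quick userspace check of that arithmetic; the helper name is hypothetical and the 0x6b value is the kernel's `POISON_FREE` byte:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Value of the kernel's POISON_FREE byte (include/linux/poison.h), which
 * slab debugging writes over freed objects. */
#define MODEL_POISON_FREE 0x6b

/* What a 32-bit field such as s->refcnt reads as once its object has
 * been freed and poisoned. All four bytes are equal, so byte order does
 * not matter. */
static int32_t model_poisoned_field(void)
{
	int32_t field;

	memset(&field, MODEL_POISON_FREE, sizeof(field));
	return field;
}
```

`model_poisoned_field()` returns 0x6b6b6b6b, which printed as a signed decimal is 1802201963, matching the `refcnt` and `state` values in the log.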
>>> Seems like this fix avoids the panic due to calling
>>> rfcomm_session_close() on a deleted session, but does not always
>>> address the unbalanced rfcomm_session_put() which may be the root
>>> cause.
>>>
>>> Lan Zhu suspected this in the original post, and his original fix does
>>> in fact fix this panic as well as the originally reported panic,
>>> because it avoids the unbalanced rfcomm_session_put().
>>>
>>> Marcel I know you are concerned about the original fix changing
>>> scheduling assumptions, are you able to comment on this further?
>>>
>>> Are there any other suggestions for patches for this issue? I have
>>> spent the best part of the day trying to figure this one out, but the
>>> refcounting in the rfcomm core is quite subtle and I think it really
>>> needs someone familiar with the code to have a quick look and come up
>>> with the safest patch. I can run tests.
>>>
>>> In the meantime, I am doing some testing of Lan Zhu's original fix
>>> and if there are no better suggestions we will run with that one for
>>> Android.
>>>
>>> Nick
>>>
>>>
>>> Some more analysis:
>>>
>>> With the RFCOMM connection in idle there are 2 references on s->refcnt
>>>
>>> However three references are removed during disconnect with the H270
>>> - rfcomm_process_sessions() -> __rfcomm_dlc_close() -> rfcomm_dlc_unlink()
>>> - rfcomm_process_sessions() -> rfcomm_process_rx() -> rfcomm_recv_ua()
>>> with dlci = 0 and s->state = BT_DISCONN
>>> - rfcomm_process_sessions() -> rfcomm_process_rx() with sk_state =
>>> BT_CLOSED and s->initiator = 0
>>>
>>> in that order.
>>>
>>> On another headset, for example the Moto H350, we only see the first
>>> two references removed during disconnect.
>>>
>>> - rfcomm_process_sessions() -> __rfcomm_dlc_close() -> rfcomm_dlc_unlink()
>>> - rfcomm_process_sessions() -> rfcomm_process_rx() -> rfcomm_recv_ua()
>>> with dlci = 0 and s->state = BT_DISCONN
>>>
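The reference counts described above can be replayed with a small userspace model. This is a sketch under stated assumptions, not the kernel code: `struct model_session`, `replay_ua_dlci0()` and the variant names are hypothetical. The starting count of one (the reference left after the DLC has been unlinked), the hold/put that `rfcomm_process_sessions()` wraps around each pass, and the hold/put inside `rfcomm_session_close()` are inferred from the `s->refcnt` prints in the first report. It replays only the pass that handles the UA for DLCI 0, and compares the unpatched code, Marcel's hold/put around the receive loop (`HOLD_AROUND_LOOP`), Lan Zhu's variant with the put moved to the end of `rfcomm_process_rx()` (`HOLD_TO_END`), and the conditional put (`SKIP_UA_PUT`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rfcomm_session: only the refcount is
 * modelled. dels counts how often the count reaches zero (where the real
 * rfcomm_session_put() would call rfcomm_session_del()); uaf records any
 * hold or put issued after the first delete. */
struct model_session {
	int refcnt;
	int dels;
	bool uaf;
};

enum model_variant {
	CURRENT_CODE,     /* code as in the original report */
	HOLD_AROUND_LOOP, /* extra hold/put bracketing only the rx loop */
	HOLD_TO_END,      /* extra hold before the loop, put at function end */
	SKIP_UA_PUT,      /* skip the UA put when sk_state is BT_CLOSED */
};

static void m_hold(struct model_session *s)
{
	assert(s != NULL);
	if (s->dels > 0)
		s->uaf = true;
	s->refcnt++;
}

static void m_put(struct model_session *s)
{
	assert(s != NULL);
	if (s->dels > 0)
		s->uaf = true;
	if (--s->refcnt == 0)
		s->dels++;
}

/* Replays the rfcomm_process_sessions() pass that handles the UA for
 * DLCI 0. sk_closed is true when the L2CAP socket is already BT_CLOSED
 * (H270/H620 ordering) and false otherwise (H350 ordering). */
static struct model_session replay_ua_dlci0(enum model_variant v, bool sk_closed)
{
	struct model_session s = { .refcnt = 1, .dels = 0, .uaf = false };

	m_hold(&s);                   /* rfcomm_process_sessions() */
	if (v == HOLD_AROUND_LOOP || v == HOLD_TO_END)
		m_hold(&s);               /* extra hold before the rx loop */

	if (!(v == SKIP_UA_PUT && sk_closed))
		m_put(&s);                /* rfcomm_recv_ua(), BT_DISCONN */

	if (v == HOLD_AROUND_LOOP)
		m_put(&s);                /* extra put right after the loop */

	if (sk_closed) {              /* BT_CLOSED branch, !s->initiator */
		m_put(&s);
		m_hold(&s);               /* rfcomm_session_close() */
		m_put(&s);
	}

	if (v == HOLD_TO_END)
		m_put(&s);                /* extra put at end of rx */

	m_put(&s);                    /* rfcomm_process_sessions() */
	return s;
}
```

Under these assumptions, the unpatched code reaches zero twice with the H270 ordering (the double `rfcomm_session_del()` in the first oops) and once with the H350 ordering. The hold/put around the loop still reaches zero twice. Moving the put to the end of `rfcomm_process_rx()` reaches zero once, but the caller then touches the freed session, which matches the `rfcomm_process_dlcs()` oops above. Skipping the UA put when the socket is already `BT_CLOSED` reaches zero exactly once in both orderings.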
>>
>> How about this. We still call rfcomm_process_rx(), but avoid the
>> rfcomm_session_put() due to RFCOMM UA when the socket state is
>> BT_CLOSED.
>>
>> It is less invasive, so might address Marcel's concerns with regard to
>> scheduling changes.
>>
>>
>> diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
>
> I made a minor style improvement and added commit message. Patch available from
>
> http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=1048e007842da2d6440679e1ca80f45438a6369d
>

We have tested the patch and found that it fixes the problem described in
this thread.

I see that this patch has not been applied yet, so I am sending a properly
formatted version of the Android commit.


Regards,
Andrei Emeltchenko


Attachments:
0001-Bluetooth-Do-not-call-rfcomm_session_put-due-to-R.patch (2.41 kB)

2010-02-04 00:19:01

by Nick Pelly

[permalink] [raw]
Subject: Re: kernel panic happens when disconnecting Bluetooth headset

On Wed, Feb 3, 2010 at 12:21 PM, Marcel Holtmann <[email protected]> wrote:
> Hi Andrei,
>
>> >> >> Processing an RFCOMM UA frame when the socket is closed and we were not
>> >> >> the RFCOMM initiator would cause rfcomm_session_put() to be called twice
>> >> >> during rfcomm_process_rx(). This would cause a kernel panic in
>> >> >> rfcomm_session_close.
>> >> >>
>> >> >> This can easily be reproduced during disconnect with devices such as
>> >> >> Motorola H270 that send RFCOMM UA followed quickly by an L2CAP
>> >> >> disconnect request. The hcidump for this looks like:
>> >> >>
>> >> >> 2009-09-21 17:22:37.788895 < ACL data: handle 1 flags 0x02 dlen 8
>> >> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>> >> >>       RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>> >> >> 2009-09-21 17:22:37.906204 > HCI Event: Number of Completed Packets (0x13) plen 5
>> >> >>     handle 1 packets 1
>> >> >> 2009-09-21 17:22:37.933090 > ACL data: handle 1 flags 0x02 dlen 8
>> >> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>> >> >>       RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>> >> >> 2009-09-21 17:22:38.636764 < ACL data: handle 1 flags 0x02 dlen 8
>> >> >>     L2CAP(d): cid 0x0041 len 4 [psm 3]
>> >> >>       RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>> >> >> 2009-09-21 17:22:38.744125 > HCI Event: Number of Completed Packets (0x13) plen 5
>> >> >>     handle 1 packets 1
>> >> >> 2009-09-21 17:22:38.763687 > ACL data: handle 1 flags 0x02 dlen 8
>> >> >>     L2CAP(d): cid 0x0040 len 4 [psm 3]
>> >> >>       RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>> >> >> 2009-09-21 17:22:38.783554 > ACL data: handle 1 flags 0x02 dlen 12
>> >> >>     L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>> >> >>
>> >> >> Avoid calling rfcomm_session_put() twice by skipping this call
>> >> >> in rfcomm_recv_ua() if the socket is closed.
>> >> >>
>> >> >> Picked from:
>> >> >> http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=1048e007842da2d6440679e1ca80f45438a6369d
>> >> >>
>> >> >> Signed-off-by: Nick Pelly <[email protected]>
>> >> >> Signed-off-by: Andrei Emeltchenko <[email protected]>
>> >> >> ---
>> >> >>  net/bluetooth/rfcomm/core.c |    3 ++-
>> >> >>  1 files changed, 2 insertions(+), 1 deletions(-)
>> >> >>
>> >> >> diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
>> >> >> index 0313e88..56ffcb8 100644
>> >> >> --- a/net/bluetooth/rfcomm/core.c
>> >> >> +++ b/net/bluetooth/rfcomm/core.c
>> >> >> @@ -1148,7 +1148,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
>> >> >>  			break;
>> >> >>
>> >> >>  		case BT_DISCONN:
>> >> >> -			rfcomm_session_put(s);
>> >> >> +			if (s->sock->sk->sk_state != BT_CLOSED)
>> >> >> +				rfcomm_session_put(s);
>> >> >>  			break;
>> >> >>  		}
>> >> >>  	}
>> >> >
>> >> > I am not a big fan of conditionally decreasing reference counts. I do
>> >> > think it would be better to fix this by holding an extra pair of
>> >> > reference counts or actually fixing the imbalance. What about the other
>> >> > patches I proposed?
>> >>
>> >> Your proposed patch was to add an extra hold() / put() reference count
>> >> around the offending put(). I did test this patch, and found it does
>> >> not fix the underlying imbalance, it just moves the kernel panic
>> >> somewhere else.
>> >>
>> >> As best I can tell, my patch does address the underlying imbalance. It
>> >> is in production on Android phones and seems to work well. As best I
>> >> can tell, there is not a cleaner solution that does not involve
>> >> significant refactoring of rfcomm refcounting.
>>
>> We also have this patch in the Nokia N900 phone, and it was the best solution
>> for the problem mentioned.
>>
>> > The RFCOMM reference counting is nasty and it does need to be
>> > rewritten. One thing that needs to happen is that we stop using the L2CAP
>> > sockets directly. We have to put a proper in-kernel L2CAP API
>> > in between that ensures we are not mixing things. That is the one issue
>> > that we have always had in this area.
>> >
>> > Before applying this patch, I would like to additionally have a comment in
>> > front of this conditional put call that briefly explains the
>> > problem area here. The long explanation with logs etc. should be in the
>> > commit message. I have to make sure that we fully understand what is
>> > going on here and why we did it.
>>
>> What do you think about the following comment:
>>
>> --- a/net/bluetooth/rfcomm/core.c
>> +++ b/net/bluetooth/rfcomm/core.c
>> @@ -1151,7 +1151,11 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
>>  			break;
>>
>>  		case BT_DISCONN:
>> -			rfcomm_session_put(s);
>> +			/* When socket is closed and we are not RFCOMM
>> +			 * initiator rfcomm_process_rx already calls
>> +			 * rfcomm_session_put */
>> +			if (s->sock->sk->sk_state != BT_CLOSED)
>> +				rfcomm_session_put(s);
>>  			break;
>>  		}
>>  	}
>
> looks good. Just turn this into a proper patch and send it to the
> mailing list so I can apply it.

Sent.

Nick

2010-02-03 20:21:27

by Marcel Holtmann

[permalink] [raw]
Subject: Re: kernel panic happens when disconnecting Bluetooth headset

Hi Andrei,

> >> >> Processing an RFCOMM UA frame when the socket is closed and we were not
> >> >> the RFCOMM initiator would cause rfcomm_session_put() to be called twice
> >> >> during rfcomm_process_rx(). This would cause a kernel panic in
> >> >> rfcomm_session_close().
> >> >>
> >> >> This could be easily reproduced during disconnect with devices such as
> >> >> the Motorola H270 that send an RFCOMM UA followed quickly by an L2CAP
> >> >> disconnect request. The hcidump for this looks like:
> >> >>
> >> >> 2009-09-21 17:22:37.788895 < ACL data: handle 1 flags 0x02 dlen 8
> >> >>   L2CAP(d): cid 0x0041 len 4 [psm 3]
> >> >>     RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
> >> >> 2009-09-21 17:22:37.906204 > HCI Event: Number of Completed Packets (0x13) plen 5
> >> >>   handle 1 packets 1
> >> >> 2009-09-21 17:22:37.933090 > ACL data: handle 1 flags 0x02 dlen 8
> >> >>   L2CAP(d): cid 0x0040 len 4 [psm 3]
> >> >>     RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
> >> >> 2009-09-21 17:22:38.636764 < ACL data: handle 1 flags 0x02 dlen 8
> >> >>   L2CAP(d): cid 0x0041 len 4 [psm 3]
> >> >>     RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
> >> >> 2009-09-21 17:22:38.744125 > HCI Event: Number of Completed Packets (0x13) plen 5
> >> >>   handle 1 packets 1
> >> >> 2009-09-21 17:22:38.763687 > ACL data: handle 1 flags 0x02 dlen 8
> >> >>   L2CAP(d): cid 0x0040 len 4 [psm 3]
> >> >>     RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
> >> >> 2009-09-21 17:22:38.783554 > ACL data: handle 1 flags 0x02 dlen 12
> >> >>   L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
> >> >>
> >> >> Avoid calling rfcomm_session_put() twice by skipping this call
> >> >> in rfcomm_recv_ua() if the socket is closed.
> >> >>
> >> >> Picked from:
> >> >> http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=1048e007842da2d6440679e1ca80f45438a6369d
> >> >>
> >> >> Signed-off-by: Nick Pelly <[email protected]>
> >> >> Signed-off-by: Andrei Emeltchenko <[email protected]>
> >> >> ---
> >> >>  net/bluetooth/rfcomm/core.c |    3 ++-
> >> >>  1 files changed, 2 insertions(+), 1 deletions(-)
> >> >>
> >> >> diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
> >> >> index 0313e88..56ffcb8 100644
> >> >> --- a/net/bluetooth/rfcomm/core.c
> >> >> +++ b/net/bluetooth/rfcomm/core.c
> >> >> @@ -1148,7 +1148,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
> >> >>  			break;
> >> >>
> >> >>  		case BT_DISCONN:
> >> >> -			rfcomm_session_put(s);
> >> >> +			if (s->sock->sk->sk_state != BT_CLOSED)
> >> >> +				rfcomm_session_put(s);
> >> >>  			break;
> >> >>  		}
> >> >>  	}
> >> >
> >> > I am not a big fan of conditionally decreasing reference counts. I do
> >> > think it would be better to fix this by holding an extra pair of
> >> > reference counts or actually fixing the imbalance. What about the other
> >> > patches I proposed?
> >>
> >> Your proposed patch was to add an extra hold() / put() reference count
> >> around the offending put(). I did test this patch, and found it does
> >> not fix the underlying imbalance, it just moves the kernel panic
> >> somewhere else.
> >>
> >> As best I can tell, my patch does address the underlying imbalance. It
> >> is in production on Android phones and seems to work well. As best I
> >> can tell, there is not a cleaner solution that does not involve
> >> significant refactoring of rfcomm refcounting.
>
> We also have this patch in the Nokia N900 phone, and it was the best solution
> for the problem mentioned.
>
> > The RFCOMM reference counting is nasty and it does need to be
> > rewritten. One thing that needs to happen is that we stop using the L2CAP
> > sockets directly. We have to put a proper in-kernel L2CAP API
> > in between that ensures we are not mixing things. That is the one issue
> > that we have always had in this area.
> >
> > Before applying this patch, I would like to additionally have a comment in
> > front of this conditional put call that briefly explains the
> > problem area here. The long explanation with logs etc. should be in the
> > commit message. I have to make sure that we fully understand what is
> > going on here and why we did it.
>
> What do you think about the following comment:
>
> --- a/net/bluetooth/rfcomm/core.c
> +++ b/net/bluetooth/rfcomm/core.c
> @@ -1151,7 +1151,11 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
>  			break;
>
>  		case BT_DISCONN:
> -			rfcomm_session_put(s);
> +			/* When socket is closed and we are not RFCOMM
> +			 * initiator rfcomm_process_rx already calls
> +			 * rfcomm_session_put */
> +			if (s->sock->sk->sk_state != BT_CLOSED)
> +				rfcomm_session_put(s);
>  			break;
>  		}
>  	}

looks good. Just turn this into a proper patch and send it to the
mailing list so I can apply it.

Regards

Marcel



2010-02-03 02:11:21

by Nick Pelly

[permalink] [raw]
Subject: Re: kernel panic happens when disconnecting Bluetooth headset

On Tue, Dec 22, 2009 at 8:20 AM, Andrei Emeltchenko
<[email protected]> wrote:
> Hi Marcel,
>
> On Sat, Dec 19, 2009 at 1:02 AM, Marcel Holtmann <[email protected]> wrote:
>> Hi Nick,
>>
>>> >> Processing an RFCOMM UA frame when the socket is closed and we were not
>>> >> the RFCOMM initiator would cause rfcomm_session_put() to be called twice
>>> >> during rfcomm_process_rx(). This would cause a kernel panic in
>>> >> rfcomm_session_close().
>>> >>
>>> >> This could be easily reproduced during disconnect with devices such as
>>> >> the Motorola H270 that send an RFCOMM UA followed quickly by an L2CAP
>>> >> disconnect request. The hcidump for this looks like:
>>> >>
>>> >> 2009-09-21 17:22:37.788895 < ACL data: handle 1 flags 0x02 dlen 8
>>> >>   L2CAP(d): cid 0x0041 len 4 [psm 3]
>>> >>     RFCOMM(s): DISC: cr 0 dlci 20 pf 1 ilen 0 fcs 0x7d
>>> >> 2009-09-21 17:22:37.906204 > HCI Event: Number of Completed Packets (0x13) plen 5
>>> >>   handle 1 packets 1
>>> >> 2009-09-21 17:22:37.933090 > ACL data: handle 1 flags 0x02 dlen 8
>>> >>   L2CAP(d): cid 0x0040 len 4 [psm 3]
>>> >>     RFCOMM(s): UA: cr 0 dlci 20 pf 1 ilen 0 fcs 0x57
>>> >> 2009-09-21 17:22:38.636764 < ACL data: handle 1 flags 0x02 dlen 8
>>> >>   L2CAP(d): cid 0x0041 len 4 [psm 3]
>>> >>     RFCOMM(s): DISC: cr 0 dlci 0 pf 1 ilen 0 fcs 0x9c
>>> >> 2009-09-21 17:22:38.744125 > HCI Event: Number of Completed Packets (0x13) plen 5
>>> >>   handle 1 packets 1
>>> >> 2009-09-21 17:22:38.763687 > ACL data: handle 1 flags 0x02 dlen 8
>>> >>   L2CAP(d): cid 0x0040 len 4 [psm 3]
>>> >>     RFCOMM(s): UA: cr 0 dlci 0 pf 1 ilen 0 fcs 0xb6
>>> >> 2009-09-21 17:22:38.783554 > ACL data: handle 1 flags 0x02 dlen 12
>>> >>   L2CAP(s): Disconn req: dcid 0x0040 scid 0x0041
>>> >>
>>> >> Avoid calling rfcomm_session_put() twice by skipping this call
>>> >> in rfcomm_recv_ua() if the socket is closed.
>>> >>
>>> >> Picked from:
>>> >> http://android.git.kernel.org/?p=kernel/common.git;a=commit;h=1048e007842da2d6440679e1ca80f45438a6369d
>>> >>
>>> >> Signed-off-by: Nick Pelly <[email protected]>
>>> >> Signed-off-by: Andrei Emeltchenko <[email protected]>
>>> >> ---
>>> >>  net/bluetooth/rfcomm/core.c |    3 ++-
>>> >>  1 files changed, 2 insertions(+), 1 deletions(-)
>>> >>
>>> >> diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
>>> >> index 0313e88..56ffcb8 100644
>>> >> --- a/net/bluetooth/rfcomm/core.c
>>> >> +++ b/net/bluetooth/rfcomm/core.c
>>> >> @@ -1148,7 +1148,8 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
>>> >>  			break;
>>> >>
>>> >>  		case BT_DISCONN:
>>> >> -			rfcomm_session_put(s);
>>> >> +			if (s->sock->sk->sk_state != BT_CLOSED)
>>> >> +				rfcomm_session_put(s);
>>> >>  			break;
>>> >>  		}
>>> >>  	}
>>> >
>>> > I am not a big fan of conditionally decreasing reference counts. I do
>>> > think it would be better to fix this by holding an extra pair of
>>> > reference counts or actually fixing the imbalance. What about the other
>>> > patches I proposed?
>>>
>>> Your proposed patch was to add an extra hold() / put() reference count
>>> around the offending put(). I did test this patch, and found it does
>>> not fix the underlying imbalance, it just moves the kernel panic
>>> somewhere else.
>>>
>>> As best I can tell, my patch does address the underlying imbalance. It
>>> is in production on Android phones and seems to work well. As best I
>>> can tell, there is not a cleaner solution that does not involve
>>> significant refactoring of rfcomm refcounting.
>
> We also have this patch in the Nokia N900 phone, and it was the best solution
> for the problem mentioned.
>
>> The RFCOMM reference counting is nasty and it does need to be
>> rewritten. One thing that needs to happen is that we stop using the L2CAP
>> sockets directly. We have to put a proper in-kernel L2CAP API
>> in between that ensures we are not mixing things. That is the one issue
>> that we have always had in this area.
>>
>> Before applying this patch, I would like to additionally have a comment in
>> front of this conditional put call that briefly explains the
>> problem area here. The long explanation with logs etc. should be in the
>> commit message. I have to make sure that we fully understand what is
>> going on here and why we did it.
>
> What do you think about the following comment:
>
> --- a/net/bluetooth/rfcomm/core.c
> +++ b/net/bluetooth/rfcomm/core.c
> @@ -1151,7 +1151,11 @@ static int rfcomm_recv_ua(struct rfcomm_session *s, u8 dlci)
>  			break;
>
>  		case BT_DISCONN:
> -			rfcomm_session_put(s);
> +			/* When socket is closed and we are not RFCOMM
> +			 * initiator rfcomm_process_rx already calls
> +			 * rfcomm_session_put */
> +			if (s->sock->sk->sk_state != BT_CLOSED)
> +				rfcomm_session_put(s);
>  			break;
>  		}
>  	}
> --
>

Ping.

Nick