2013-06-20 00:56:32

by Alex Deymo

[permalink] [raw]
Subject: [BUG] HCI_RESET and Num_HCI_Command_Packets limit

Hello,

I'm experiencing a problem that wedges the bluetooth adapter from time
to time. The repro case can be hit with this scenario:

run bluetoothd
while hciconfig hci0 up && hciconfig hci0 down; do echo -n .; done

In few iterations you will have a timeout while trying to bring up the
interface. This is just a repro case, the real scenario just happens
from time to time if you for instance restart bluetoothd at the right
(or wrong!) moment.

The problem is the following:
Setting up the interface, the host will send several HCI commands to
set things like the device class or the local name. If it happens that
the hciconfig makes the host send a HCI_Reset command after another
command was sent but before the corresponding event was received, then
the adapter is blocked for a while. Recovery it from that situation is
kind of random. A few attempts of hcidump hci0 up may or not work....
but the problem was before.

In the hci trace you see below, the last HCI Event received has a
value of Num_HCI_Command_Packets of 1, which according to the Core
spec (Vol2, part E, sec 7.7.14) means that we can only send 1 HCI
command until we receive the next event; but the trace shows that we
sent 2: Write Local Name and a Reset. So the host here doesn't follow
the spec... am I right?

Even worse, after the HCI_Reset the host shall not send additional HCI
commands before the Command Complete event is received (sec 7.3.2).
According to the spec, if you want to reset the adapter, you have to
wait until the completion of any command if you already reached the
num_HCI_Command_Packets limit??? There's no alternative/exception to
reset the adapter? :-(

How can I workaround this problem?

Thanks,
Alex.

< HCI Command: Write Class of Device (0x03|0x0024) plen 3 94.190995
Class: 0x080104
Major class: Computer (desktop, notebook, PDA, organizers)
Minor class: Desktop workstation
Capturing (Scanner, Microphone)
> HCI Event: Command Complete (0x0e) plen 4 94.191963
Write Class of Device (0x03|0x0024) ncmd 1
Status: Success (0x00)
< HCI Command: Write Local Name (0x03|0x0013) plen 248 94.191997
Name: Chromebook
< HCI Command: Reset (0x03|0x0003) plen 0 95.200150
< HCI Command: Reset (0x03|0x0003) plen 0 106.228014
< HCI Command: Reset (0x03|0x0003) plen 0 117.261787
< HCI Command: Reset (0x03|0x0003) plen 0 128.293874


2013-06-26 05:38:30

by Johan Hedberg

[permalink] [raw]
Subject: Re: [BUG] HCI_RESET and Num_HCI_Command_Packets limit

Hi Alex,

On Tue, Jun 25, 2013, Alex Deymo wrote:
> I'm in 3.8.11 and I don't have that patch... and as I can see from the
> log in mgmt.c it is not going to apply... there are already several
> changes from this kernel to that point. But anyway... that won't fix
> the problem. The scenario is "send a command, btmgmt power off, btmgmt
> power on". The fact that the command was sent because a previous power
> on is not relevant, and bluetoothd normally sends more commands when
> it sees an interface going up that aren't covered by this lock (like
> Write EIR, Write Class of Device, Write Local Name)... right?

Not quite. What we do is program these values to the kernel as soon as
we discover a new adapter and before powering it on. After this
mgmt_set_powered takes care of actually "applying" the values as part of
the power on process. The patch I mentioned should ensure that the
kernel doesn't reply to the mgmt command before all of these HCI
commands have completed.

Of course this still doesn't fix the situation of an HCI command that's
pending that's not part of the power on process (e.g. changing the name
while powered on), so something will probably need to be done about that
anyway.

Johan

2013-06-25 20:34:33

by Alex Deymo

[permalink] [raw]
Subject: Re: [BUG] HCI_RESET and Num_HCI_Command_Packets limit

Hi Marcel,

On Tue, Jun 25, 2013 at 6:33 AM, Marcel Holtmann <[email protected]> wrot=
e:
>> I enabled dynamic debug on this kernel. Again, I'm running 3.8.11. I
>> added a debug line to check HCI_QUIRK_RESET_ON_CLOSE and
>> "test_bit(HCI_QUIRK_RESET_ON_CLOSE, &hdev->quirks)" evaluates to 0; so
>> the reset is sent during device up. Nevertheless, the bug is because a
>> combination of both:
>>
>> The timelime for the bug is the following.

>> 5. hci_dev_do_close clears the cmd, rx and tx queues in the kernel,
>> and sets the cmd_cnt to 1, even if the current command
>> (HCI_Write_Class_of_Device in this case) did not finish. From this
>> point we have an inconsistent state, since cmd_cnt is 1, but we are
>> not allowed to send a command to the adapter.
>
> I think this forcefully setting cmd_cnt to 1 is actually a problem when w=
e have pending commands.

Yes, but the problem is that if the driver powers down the controller
on close or the command complete event is in the rx_work queue (that
we just flushed) then we need to set this cmd_cnt to 1.

> You can trigger this via ioctl and also rfkill since both go via hci_dev_=
do_close to shutdown the handling of the device.
>
> We need to wait for the command timeout of the pending command. While we =
could pretend that the ioctl is deprecated and should not be used, but rfki=
ll clearly isn't. And honestly hciconfig hci0 up/down is way to useful for =
low-level testing that it might not go away any time soon.
>
>> I agree with Marcel that the right fix should be to wait until we
>> receive the completion event from the adapter. I'm certainly not very
>> familiar with this code, but I'll try to hack it =3D) Suggestions are
>> welcomed!
>
> Check if you have the patch that Johan mentioned in your tree. That shoul=
d at least fix the problem when using mgmt. It is a bit funky, but at least=
mgmt power off will be blocked until you mgmt power on finished and thus s=
hould make this race a lot harder trigger.

I'm in 3.8.11 and I don't have that patch... and as I can see from the
log in mgmt.c it is not going to apply... there are already several
changes from this kernel to that point. But anyway... that won't fix
the problem. The scenario is "send a command, btmgmt power off, btmgmt
power on". The fact that the command was sent because a previous power
on is not relevant, and bluetoothd normally sends more commands when
it sees an interface going up that aren't covered by this lock (like
Write EIR, Write Class of Device, Write Local Name)... right?

I'll try to test that on a linux desktop with a recent compat-drivers
and an usb adapter.

Alex.

2013-06-25 13:33:00

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [BUG] HCI_RESET and Num_HCI_Command_Packets limit

Hi Alex,

>>>> I can't say this for sure, but if this is our fault and not a hardware issue then this seems to be a pretty nasty race condition.
>>>>
>>>> To debug this you might need to work with dynamic_debug for Bluetooth core and add a few more DBG statements so we get timing information that we can compare with the btmon trace.
>>>>
>>>> So in theory all modern chips should send HCI_Reset on devup. Only a few old broken ones will send HCI_Reset on devdown. Can you check that the ath3k does not send a quirk here and really does HCI_Reset on devup. And that the ath3k firmware loading part not accidentally gets in the way.
>>>>
>>>> It would be also good to verify if devup or devdown is the root cause here.
>>>
>>> I don't understand how this could be a hardware problem. I must be
>>> missing something. The host is sending to the controller two
>>> consecutive commands, the second one is a HCI_Reset caused by the
>>> power off/on, but the spec says that we should not send more than one
>>> command at that point (ncmd of the last event is 1). So, since we are
>>> out of spec the firmware is happy to block itself in a bad way :)
>>> right?
>>
>> It is most likely our bug. Sometimes the hardware does however behave funny.
>>
>> My guess right now is that you are hitting a really nasty race conditions and in ath3k chips is shows up easier since the command processing is slower.
>>
>> So in hci_do_dev_close() we forcefully set the cmd_cnt to 1. And in case of HCI_QUIRK_RESET_ON_CLOSE we are sending a HCI Reset from close.
>>
>> Before we go ahead with this, I like you to confirm what HCI Reset behavior you actually have for your hardware. HCI Reset on devup or on devdown. Check your full btmon where we send the HCI Reset.
>>
>> What we might need to do different here is to check if we have a pending command, flush all other commands and then wait for this command to complete before sending HCI Reset.
>>
>> So in theory that HCI Request Lock should protect the whole init and shutdown procedures and all mgmt and ioctl interfaces should block each other. Why this is not the case is something else to investigate. Maybe we have some issue there as well.
>
> I enabled dynamic debug on this kernel. Again, I'm running 3.8.11. I
> added a debug line to check HCI_QUIRK_RESET_ON_CLOSE and
> "test_bit(HCI_QUIRK_RESET_ON_CLOSE, &hdev->quirks)" evaluates to 0; so
> the reset is sent during device up. Nevertheless, the bug is because a
> combination of both:
>
> The timelime for the bug is the following.
> 1. Due a previous "btmgmt power on" we are interacting with the
> bluetooth controller. There are 3 commands queued. The last command
> sent was a HCI_Write_Scan_Enable.
> 2. The Command complete event for HCI_Write_Scan_Enable is received,
> with status = 0 and ncmd = 1
> 3. The following command on the queue goes out (a
> HCI_Write_Class_of_Device, but is irrelevant). Two other commands
> remain in the queue.
> 4. We have bad luck and "btmgmt power off" is executed. This opens a
> mgmt socket and sends the power off. hci_dev_do_close gets called as a
> consequence of this.
> 5. hci_dev_do_close clears the cmd, rx and tx queues in the kernel,
> and sets the cmd_cnt to 1, even if the current command
> (HCI_Write_Class_of_Device in this case) did not finish. From this
> point we have an inconsistent state, since cmd_cnt is 1, but we are
> not allowed to send a command to the adapter.

I think this forcefully setting cmd_cnt to 1 is actually a problem when we have pending commands.

You can trigger this via ioctl and also rfkill since both go via hci_dev_do_close to shutdown the handling of the device.

We need to wait for the command timeout of the pending command. While we could pretend that the ioctl is deprecated and should not be used, but rfkill clearly isn't. And honestly hciconfig hci0 up/down is way to useful for low-level testing that it might not go away any time soon.

> 6. We have even more bad luck, and "btmgmt power on" is executed. This
> causes, eventually, hci_dev_open to be called which schedules a
> HCI_Reset command (and a few more that are queued).
> 7. The HCI_Reset command is sent to the Bluetooth Controller since cmd_cnt is 1.
>
> The dmesg log (with opcode anotations) is included below. I don't see
> any hci_rx_work after step 2 in the log.
> Running this with usbmon shows that the two commands are being sent
> without the event received in the middle.
>
> I agree with Marcel that the right fix should be to wait until we
> receive the completion event from the adapter. I'm certainly not very
> familiar with this code, but I'll try to hack it =) Suggestions are
> welcomed!

Check if you have the patch that Johan mentioned in your tree. That should at least fix the problem when using mgmt. It is a bit funky, but at least mgmt power off will be blocked until you mgmt power on finished and thus should make this race a lot harder trigger.

Regards

Marcel


2013-06-25 04:55:12

by Johan Hedberg

[permalink] [raw]
Subject: Re: [BUG] HCI_RESET and Num_HCI_Command_Packets limit

Hi Alex,

On Mon, Jun 24, 2013, Alex Deymo wrote:
> The timelime for the bug is the following.
> 1. Due a previous "btmgmt power on" we are interacting with the
> bluetooth controller. There are 3 commands queued. The last command
> sent was a HCI_Write_Scan_Enable.

The kernel should only respond to the mgmt_set_powered once *all* HCI
commands are completed. Maybe you're missing the following patch from
your kernel:

commit 229ab39caf8c1321527e408725c1350f7c9aaa84
Author: Johan Hedberg <[email protected]>
Date: Fri Mar 15 17:06:53 2013 -0500

Bluetooth: Wait for HCI command completion with mgmt_set_powered

Johan

2013-06-25 01:02:26

by Alex Deymo

[permalink] [raw]
Subject: Re: [BUG] HCI_RESET and Num_HCI_Command_Packets limit

Hi Marcel,

On Fri, Jun 21, 2013 at 6:22 AM, Marcel Holtmann <[email protected]> wrot=
e:
>>> I can't say this for sure, but if this is our fault and not a hardware =
issue then this seems to be a pretty nasty race condition.
>>>
>>> To debug this you might need to work with dynamic_debug for Bluetooth c=
ore and add a few more DBG statements so we get timing information that we =
can compare with the btmon trace.
>>>
>>> So in theory all modern chips should send HCI_Reset on devup. Only a fe=
w old broken ones will send HCI_Reset on devdown. Can you check that the at=
h3k does not send a quirk here and really does HCI_Reset on devup. And that=
the ath3k firmware loading part not accidentally gets in the way.
>>>
>>> It would be also good to verify if devup or devdown is the root cause h=
ere.
>>
>> I don't understand how this could be a hardware problem. I must be
>> missing something. The host is sending to the controller two
>> consecutive commands, the second one is a HCI_Reset caused by the
>> power off/on, but the spec says that we should not send more than one
>> command at that point (ncmd of the last event is 1). So, since we are
>> out of spec the firmware is happy to block itself in a bad way :)
>> right?
>
> It is most likely our bug. Sometimes the hardware does however behave fun=
ny.
>
> My guess right now is that you are hitting a really nasty race conditions=
and in ath3k chips is shows up easier since the command processing is slow=
er.
>
> So in hci_do_dev_close() we forcefully set the cmd_cnt to 1. And in case =
of HCI_QUIRK_RESET_ON_CLOSE we are sending a HCI Reset from close.
>
> Before we go ahead with this, I like you to confirm what HCI Reset behavi=
or you actually have for your hardware. HCI Reset on devup or on devdown. C=
heck your full btmon where we send the HCI Reset.
>
> What we might need to do different here is to check if we have a pending =
command, flush all other commands and then wait for this command to complet=
e before sending HCI Reset.
>
> So in theory that HCI Request Lock should protect the whole init and shut=
down procedures and all mgmt and ioctl interfaces should block each other. =
Why this is not the case is something else to investigate. Maybe we have so=
me issue there as well.

I enabled dynamic debug on this kernel. Again, I'm running 3.8.11. I
added a debug line to check HCI_QUIRK_RESET_ON_CLOSE and
"test_bit(HCI_QUIRK_RESET_ON_CLOSE, &hdev->quirks)" evaluates to 0; so
the reset is sent during device up. Nevertheless, the bug is because a
combination of both:

The timelime for the bug is the following.
1. Due a previous "btmgmt power on" we are interacting with the
bluetooth controller. There are 3 commands queued. The last command
sent was a HCI_Write_Scan_Enable.
2. The Command complete event for HCI_Write_Scan_Enable is received,
with status =3D 0 and ncmd =3D 1
3. The following command on the queue goes out (a
HCI_Write_Class_of_Device, but is irrelevant). Two other commands
remain in the queue.
4. We have bad luck and "btmgmt power off" is executed. This opens a
mgmt socket and sends the power off. hci_dev_do_close gets called as a
consequence of this.
5. hci_dev_do_close clears the cmd, rx and tx queues in the kernel,
and sets the cmd_cnt to 1, even if the current command
(HCI_Write_Class_of_Device in this case) did not finish. From this
point we have an inconsistent state, since cmd_cnt is 1, but we are
not allowed to send a command to the adapter.
6. We have even more bad luck, and "btmgmt power on" is executed. This
causes, eventually, hci_dev_open to be called which schedules a
HCI_Reset command (and a few more that are queued).
7. The HCI_Reset command is sent to the Bluetooth Controller since cmd_cnt =
is 1.

The dmesg log (with opcode anotations) is included below. I don't see
any hci_rx_work after step 2 in the log.
Running this with usbmon shows that the two commands are being sent
without the event received in the middle.

I agree with Marcel that the right fix should be to wait until we
receive the completion event from the adapter. I'm certainly not very
familiar with this code, but I'll try to hack it =3D) Suggestions are
welcomed!

Thanks,
Alex.


[ 333.345868] hci_rx_work:2860: hci0
[ 333.345876] hci_rx_work:2889: hci0 Event packet
[ 333.345883] hci_cc_write_scan_enable:306: hci0 status 0x00
[ 333.345890] hci_sent_cmd_data:2204: hci0 opcode 0x0c1a #######
HCI_Write_Scan_Enable
[ 333.345898] hci_req_complete:62: hci0 command 0x0c1a result 0x00
[ 333.345914] hci_cmd_work:2916: hci0 cmd_cnt 1 cmd queued 3
[ 333.345923] hci_send_frame:2136: hci0 type 1 len 6
[ 333.346889] hci_rx_work:2860: hci0
[ 333.346897] hci_rx_work:2889: hci0 Event packet
[ 333.346904] hci_cc_write_class_of_dev:369: hci0 status 0x00
[ 333.346910] hci_sent_cmd_data:2204: hci0 opcode 0x0c24 #######
HCI_Write_Class_of_Device
[ 333.346919] hci_send_to_control:147: len 9
[ 333.346938] hci_cmd_work:2916: hci0 cmd_cnt 1 cmd queued 2
[ 333.346947] hci_send_frame:2136: hci0 type 1 len 251
[ 333.347267] hci_sock_create:1066: sock ffff8801183e2c00
[ 333.347286] hci_sock_bind:598: sock ffff8801183e2c00 sk ffff88011a158800
[ 333.347477] hci_sock_sendmsg:794: sock ffff8801183e2c00 sk ffff88011a158=
800
[ 333.347483] mgmt_control:2766: got 7 bytes
[ 333.347489] hci_dev_get:287: 0
[ 333.347494] hci_dev_hold:655: hci0 orig refcnt 5
[ 333.347499] set_powered:778: request for hci0
[ 333.347512] hci_dev_put:647: hci0 orig refcnt 6
[ 333.347523] hci_power_off:1160: hci0
[ 333.347528] hci_dev_do_close:785: hci0 ffff88014a967000
[ 333.347534] hci_req_cancel:100: hci0 err 0x13
[ 333.347540] hci_conn_hash_flush:812: hdev hci0
[ 333.347545] hci_sock_dev_event:376: hdev hci0 event 4
[ 333.347553] hci_send_to_sock:76: hdev (null) len 8
[ 333.350268] cmd_complete:248: sock ffff88011a158800
[ 333.350289] hci_send_to_control:147: len 10
[ 333.350301] hci_dev_put:647: hci0 orig refcnt 5
[ 333.350309] hci_sock_recvmsg:748: sock ffff8801183e2c00, sk ffff88011a15=
8800
[ 333.350595] hci_sock_release:421: sock ffff8801183e2c00 sk ffff88011a158=
800
[ 333.352418] hci_sock_create:1066: sock ffff8801183e1b80
[ 333.352438] hci_sock_bind:598: sock ffff8801183e1b80 sk ffff88011a15a800
[ 333.352594] hci_sock_sendmsg:794: sock ffff8801183e1b80 sk ffff88011a15a=
800
[ 333.352600] mgmt_control:2766: got 7 bytes
[ 333.352604] hci_dev_get:287: 0
[ 333.352609] hci_dev_hold:655: hci0 orig refcnt 4
[ 333.352613] set_powered:778: request for hci0
[ 333.352622] hci_dev_put:647: hci0 orig refcnt 5
[ 333.352632] hci_power_on:1143: hci0
[ 333.352635] hci_dev_get:287: 0
[ 333.352639] hci_dev_hold:655: hci0 orig refcnt 4
[ 333.352643] hci_dev_open:703: hci0 ffff88014a967000
[ 333.352865] __hci_request:117: hci0 start
[ 333.352880] hci_init_req:208: hci0 0
[ 333.352893] hci_reset_req:172: hci0 0
[ 333.352906] hci_send_cmd:2162: hci0 opcode 0x0c03 plen 0 ######## HCI Re=
set
[ 333.352916] hci_send_cmd:2177: skb len 3
[ 333.352929] hci_send_cmd:2162: hci0 opcode 0x1003 plen 0 ########
HCI Read local supported features
[ 333.352935] hci_cmd_work:2916: hci0 cmd_cnt 1 cmd queued 1
[ 333.352937] hci_send_cmd:2177: skb len 3
[ 333.352944] hci_send_cmd:2162: hci0 opcode 0x1001 plen 0 ########
HCI Read local version information
[ 333.352949] hci_send_frame:2136: hci0 type 1 len 3
[ 333.352955] hci_send_cmd:2177: skb len 3
[ 333.352965] hci_cmd_work:2916: hci0 cmd_cnt 0 cmd queued 2
[ 333.362660] bt_sock_poll:417: sock ffff8801402b42c0, sk ffff88011bcdfc00
[ 333.362669] bt_sock_poll:417: sock ffff88012679d080, sk ffff88011e063400
[ 333.362674] bt_sock_poll:417: sock ffff88012679c580, sk ffff88011e062800
[ 333.362678] bt_sock_poll:417: sock ffff88012679c2c0, sk ffff88011e063000
[ 333.362682] bt_sock_poll:417: sock ffff88012679c000, sk ffff880146445c00
[ 333.362686] bt_sock_poll:417: sock ffff88012679cb00, sk ffff880146446800
[ 333.362735] hci_sock_recvmsg:748: sock ffff8801402cf440, sk ffff88011b41=
f800
[ 333.362751] bt_sock_poll:417: sock ffff8801402b42c0, sk ffff88011bcdfc00
[ 333.362754] bt_sock_poll:417: sock ffff88012679d080, sk ffff88011e063400
[ 333.362756] bt_sock_poll:417: sock ffff88012679c580, sk ffff88011e062800
[ 333.362758] bt_sock_poll:417: sock ffff88012679c2c0, sk ffff88011e063000
[ 333.362760] bt_sock_poll:417: sock ffff88012679c000, sk ffff880146445c00
[ 333.362762] bt_sock_poll:417: sock ffff88012679cb00, sk ffff880146446800
[ 333.362768] hci_sock_recvmsg:748: sock ffff8801402cf440, sk ffff88011b41=
f800
[ 333.370467] bt_sock_poll:417: sock ffff8801402b42c0, sk ffff88011bcdfc00
[ 333.370476] bt_sock_poll:417: sock ffff88012679d080, sk ffff88011e063400
[ 333.370480] bt_sock_poll:417: sock ffff88012679c580, sk ffff88011e062800
[ 333.370485] bt_sock_poll:417: sock ffff88012679c2c0, sk ffff88011e063000
[ 333.370489] bt_sock_poll:417: sock ffff88012679c000, sk ffff880146445c00
[ 333.370493] bt_sock_poll:417: sock ffff88012679cb00, sk ffff880146446800
[ 333.370623] bt_sock_poll:417: sock ffff8801402b42c0, sk ffff88011bcdfc00
[ 333.370628] bt_sock_poll:417: sock ffff88012679d080, sk ffff88011e063400
[ 333.370633] bt_sock_poll:417: sock ffff88012679c580, sk ffff88011e062800
[ 333.370637] bt_sock_poll:417: sock ffff88012679c2c0, sk ffff88011e063000
[ 333.370641] bt_sock_poll:417: sock ffff88012679c000, sk ffff880146445c00
[ 333.370645] bt_sock_poll:417: sock ffff88012679cb00, sk ffff880146446800
[ 343.349711] __hci_request:148: hci0 end: err -110
[ 343.352606] hci_dev_put:647: hci0 orig refcnt 5

2013-06-21 13:22:17

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [BUG] HCI_RESET and Num_HCI_Command_Packets limit

Hi Alex,

>>> I'm running kernel 3.8.11 on a x86_64 and BlueZ 5.4. The hardware is a
>>> chromebook (I saw it with different hardware and I can also repro it
>>> on my ubuntu) that uses the ath9k and ath3k drivers for the wifi/bt
>>> chip (MD222)
>>
>> I do not have a Chromebook or actually looked into what Bluetooth chip it uses. Is that one also an ath3k one? Hope the hardware does not have a bug here. It is on USB, right?
>
> Is an ath3k, and bluetooth is conected to the usb.
>
>> Can you reproduce this with a off the shelf CSR or Broadcom chip. Maybe I should send you some of the Intel chips so you can test on our silicon as well ;)
>
> I just tried it again on my desktop with a Broadcom usb 0a5c:21e8 (I
> think that one is a BCM20702) and I was able to reproduce it with
> btmgmt as well (bluez compiled from tip of tree). It took more
> iterations (about 50) but eventually the repro case works (in a
> different point, see the btmon traces below). I have a bunch of
> different Broadcom, Atheros and CSR usb bluetooth adapters here, but
> none from Intel ;)

let me see if we have some final production chips for you. I looked at my stack and it only has pre-production units in there. I rather not send these around.

>> I can't say this for sure, but if this is our fault and not a hardware issue then this seems to be a pretty nasty race condition.
>>
>> To debug this you might need to work with dynamic_debug for Bluetooth core and add a few more DBG statements so we get timing information that we can compare with the btmon trace.
>>
>> So in theory all modern chips should send HCI_Reset on devup. Only a few old broken ones will send HCI_Reset on devdown. Can you check that the ath3k does not send a quirk here and really does HCI_Reset on devup. And that the ath3k firmware loading part not accidentally gets in the way.
>>
>> It would be also good to verify if devup or devdown is the root cause here.
>
> I don't understand how this could be a hardware problem. I must be
> missing something. The host is sending to the controller two
> consecutive commands, the second one is a HCI_Reset caused by the
> power off/on, but the spec says that we should not send more than one
> command at that point (ncmd of the last event is 1). So, since we are
> out of spec the firmware is happy to block itself in a bad way :)
> right?

It is most likely our bug. Sometimes the hardware does however behave funny.

My guess right now is that you are hitting a really nasty race conditions and in ath3k chips is shows up easier since the command processing is slower.

So in hci_do_dev_close() we forcefully set the cmd_cnt to 1. And in case of HCI_QUIRK_RESET_ON_CLOSE we are sending a HCI Reset from close.

Before we go ahead with this, I like you to confirm what HCI Reset behavior you actually have for your hardware. HCI Reset on devup or on devdown. Check your full btmon where we send the HCI Reset.

What we might need to do different here is to check if we have a pending command, flush all other commands and then wait for this command to complete before sending HCI Reset.

So in theory that HCI Request Lock should protect the whole init and shutdown procedures and all mgmt and ioctl interfaces should block each other. Why this is not the case is something else to investigate. Maybe we have some issue there as well.

Regards

Marcel


2013-06-20 18:42:40

by Alex Deymo

[permalink] [raw]
Subject: Re: [BUG] HCI_RESET and Num_HCI_Command_Packets limit

Hi Marcel,

>> I'm running kernel 3.8.11 on a x86_64 and BlueZ 5.4. The hardware is a
>> chromebook (I saw it with different hardware and I can also repro it
>> on my ubuntu) that uses the ath9k and ath3k drivers for the wifi/bt
>> chip (MD222)
>
> I do not have a Chromebook or actually looked into what Bluetooth chip it=
uses. Is that one also an ath3k one? Hope the hardware does not have a bug=
here. It is on USB, right?

Is an ath3k, and bluetooth is conected to the usb.

> Can you reproduce this with a off the shelf CSR or Broadcom chip. Maybe I=
should send you some of the Intel chips so you can test on our silicon as =
well ;)

I just tried it again on my desktop with a Broadcom usb 0a5c:21e8 (I
think that one is a BCM20702) and I was able to reproduce it with
btmgmt as well (bluez compiled from tip of tree). It took more
iterations (about 50) but eventually the repro case works (in a
different point, see the btmon traces below). I have a bunch of
different Broadcom, Atheros and CSR usb bluetooth adapters here, but
none from Intel ;)

> I can't say this for sure, but if this is our fault and not a hardware is=
sue then this seems to be a pretty nasty race condition.
>
> To debug this you might need to work with dynamic_debug for Bluetooth cor=
e and add a few more DBG statements so we get timing information that we ca=
n compare with the btmon trace.
>
> So in theory all modern chips should send HCI_Reset on devup. Only a few =
old broken ones will send HCI_Reset on devdown. Can you check that the ath3=
k does not send a quirk here and really does HCI_Reset on devup. And that t=
he ath3k firmware loading part not accidentally gets in the way.
>
> It would be also good to verify if devup or devdown is the root cause her=
e.

I don't understand how this could be a hardware problem. I must be
missing something. The host is sending to the controller two
consecutive commands, the second one is a HCI_Reset caused by the
power off/on, but the spec says that we should not send more than one
command at that point (ncmd of the last event is 1). So, since we are
out of spec the firmware is happy to block itself in a bad way :)
right?

Regards,
Alex

< HCI Command: Write LE Host Supported (0x03|0x006d) plen 2
[hci0] 58988.081654
Supported: 0x01
Simultaneous: 0x01
> HCI Event: Command Complete (0x0e) plen 4 =
[hci0] 58988.084618
Write LE Host Supported (0x03|0x006d) ncmd 1
Status: Success (0x00)
< HCI Command: LE Set Advertising Data (0x08|0x0008) plen 32
[hci0] 58988.084713
Length: 18
Flags: 0x18
Simultaneous LE and BR/EDR (Controller)
Simultaneous LE and BR/EDR (Host)
TX power: 4 dBm
Name (complete): Drink Mate
@ New Settings: 0x02d3
powered connectable pairable ssp br/edr le
@ New Settings: 0x02d2
connectable pairable ssp br/edr le
< HCI Command: Reset (0x03|0x0003) plen 0
[hci0] 58988.087811

---------------------------------------------------------------------------=
--------------------------------------------------------
other case:

< HCI Command: Write LE Host Supported (0x03|0x006d) plen 2
[hci0] 61133.028188
Supported: 0x01
Simultaneous: 0x01
> HCI Event: Command Complete (0x0e) plen 4 =
[hci0] 61133.031169
Write LE Host Supported (0x03|0x006d) ncmd 1
Status: Success (0x00)
@ New Settings: 0x02d3
powered connectable pairable ssp br/edr le
< HCI Command: LE Set Advertising Data (0x08|0x0008) plen 32
[hci0] 61133.031261
Length: 18
Flags: 0x18
Simultaneous LE and BR/EDR (Controller)
Simultaneous LE and BR/EDR (Host)
TX power: 4 dBm
Name (complete): Drink Mate
@ New Settings: 0x02d2
connectable pairable ssp br/edr le
< HCI Command: Reset (0x03|0x0003) plen 0
[hci0] 61133.035268

2013-06-20 12:31:24

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [BUG] HCI_RESET and Num_HCI_Command_Packets limit

Hi Alex,

>>> I'm experiencing a problem that wedges the bluetooth adapter from time
>>> to time. The repro case can be hit with this scenario:
>>>
>>> run bluetoothd
>>> while hciconfig hci0 up && hciconfig hci0 down; do echo -n .; done
>>>
>>> In few iterations you will have a timeout while trying to bring up the
>>> interface. This is just a repro case, the real scenario just happens
>>> from time to time if you for instance restart bluetoothd at the right
>>> (or wrong!) moment.
>>
>> can you give us at least a few details on the kernel you are running and what hardware this is?
>
> I'm running kernel 3.8.11 on a x86_64 and BlueZ 5.4. The hardware is a
> chromebook (I saw it with different hardware and I can also repro it
> on my ubuntu) that uses the ath9k and ath3k drivers for the wifi/bt
> chip (MD222)

I do not have a Chromebook or actually looked into what Bluetooth chip it uses. Is that one also an ath3k one? Hope the hardware does not have a bug here. It is on USB, right?

Can you reproduce this with a off the shelf CSR or Broadcom chip. Maybe I should send you some of the Intel chips so you can test on our silicon as well ;)

>> If you do not use hciconfig and instead use btmgmt or bluetoothctl, do you see the same issue?
>
> Using btmgmt I see the same issue with a very similar hcidump (just
> the timing is different).
> while btmgmt power on && btmgmt power off; do echo -n; done
> btmgmt power on after about 4 iterations is blocked and never returns.
>
> Using the dbus interface to the running bluetoothd apparently doesn't
> hit the problem (ran fine for a few hundred times)
> while dbus-send --system --print-reply --dest=org.bluez
> /org/bluez/hci0 org.freedesktop.DBus.Properties.Set
> string:org.bluez.Adapter1 string:Powered variant:boolean:true &&
> dbus-send --system --print-reply --dest=org.bluez /org/bluez/hci0
> org.freedesktop.DBus.Properties.Set string:org.bluez.Adapter1
> string:Powered variant:boolean:false; do echo -n; done

I can't say this for sure, but if this is our fault and not a hardware issue then this seems to be a pretty nasty race condition.

To debug this you might need to work with dynamic_debug for Bluetooth core and add a few more DBG statements so we get timing information that we can compare with the btmon trace.

So in theory all modern chips should send HCI_Reset on devup. Only a few old broken ones will send HCI_Reset on devdown. Can you check that the ath3k does not send a quirk here and really does HCI_Reset on devup. And that the ath3k firmware loading part not accidentally gets in the way.

It would be also good to verify if devup or devdown is the root cause here.

Regards

Marcel


2013-06-20 09:13:38

by Andrei Emeltchenko

[permalink] [raw]
Subject: Re: [BUG] HCI_RESET and Num_HCI_Command_Packets limit

Hi Alex,

On Wed, Jun 19, 2013 at 08:18:09PM -0700, Alex Deymo wrote:
> Hi Marcel,
>
> On Wed, Jun 19, 2013 at 6:04 PM, Marcel Holtmann <[email protected]> wrote:
> > Hi Alex,
> >
> >> I'm experiencing a problem that wedges the bluetooth adapter from time
> >> to time. The repro case can be hit with this scenario:
> >>
> >> run bluetoothd
> >> while hciconfig hci0 up && hciconfig hci0 down; do echo -n .; done
> >>
> >> In few iterations you will have a timeout while trying to bring up the
> >> interface. This is just a repro case, the real scenario just happens
> >> from time to time if you for instance restart bluetoothd at the right
> >> (or wrong!) moment.
> >
> > can you give us at least a few details on the kernel you are running and what hardware this is?
>
> I'm running kernel 3.8.11 on a x86_64 and BlueZ 5.4. The hardware is a
> chromebook (I saw it with different hardware and I can also repro it
> on my ubuntu) that uses the ath9k and ath3k drivers for the wifi/bt
> chip (MD222)
>
> > If you do not use hciconfig and instead use btmgmt or bluetoothctl, do you see the same issue?
>
> Using btmgmt I see the same issue with a very similar hcidump (just
> the timing is different).
> while btmgmt power on && btmgmt power off; do echo -n; done
> btmgmt power on after about 4 iterations is blocked and never returns.
>
> Using the dbus interface to the running bluetoothd apparently doesn't
> hit the problem (ran fine for a few hundred times)
> while dbus-send --system --print-reply --dest=org.bluez
> /org/bluez/hci0 org.freedesktop.DBus.Properties.Set
> string:org.bluez.Adapter1 string:Powered variant:boolean:true &&
> dbus-send --system --print-reply --dest=org.bluez /org/bluez/hci0
> org.freedesktop.DBus.Properties.Set string:org.bluez.Adapter1
> string:Powered variant:boolean:false; do echo -n; done

The D-Bus interface is very slow so you never face race condition.

Best regards
Andrei Emeltchenko

2013-06-20 03:18:09

by Alex Deymo

[permalink] [raw]
Subject: Re: [BUG] HCI_RESET and Num_HCI_Command_Packets limit

Hi Marcel,

On Wed, Jun 19, 2013 at 6:04 PM, Marcel Holtmann <[email protected]> wrote:
> Hi Alex,
>
>> I'm experiencing a problem that wedges the bluetooth adapter from time
>> to time. The repro case can be hit with this scenario:
>>
>> run bluetoothd
>> while hciconfig hci0 up && hciconfig hci0 down; do echo -n .; done
>>
>> In few iterations you will have a timeout while trying to bring up the
>> interface. This is just a repro case, the real scenario just happens
>> from time to time if you for instance restart bluetoothd at the right
>> (or wrong!) moment.
>
> can you give us at least a few details on the kernel you are running and what hardware this is?

I'm running kernel 3.8.11 on a x86_64 and BlueZ 5.4. The hardware is a
chromebook (I saw it with different hardware and I can also repro it
on my ubuntu) that uses the ath9k and ath3k drivers for the wifi/bt
chip (MD222)

> If you do not use hciconfig and instead use btmgmt or bluetoothctl, do you see the same issue?

Using btmgmt I see the same issue with a very similar hcidump (just
the timing is different).
while btmgmt power on && btmgmt power off; do echo -n; done
btmgmt power on after about 4 iterations is blocked and never returns.

Using the dbus interface to the running bluetoothd apparently doesn't
hit the problem (ran fine for a few hundred times)
while dbus-send --system --print-reply --dest=org.bluez
/org/bluez/hci0 org.freedesktop.DBus.Properties.Set
string:org.bluez.Adapter1 string:Powered variant:boolean:true &&
dbus-send --system --print-reply --dest=org.bluez /org/bluez/hci0
org.freedesktop.DBus.Properties.Set string:org.bluez.Adapter1
string:Powered variant:boolean:false; do echo -n; done

Regards,
Alex.

2013-06-20 01:04:16

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [BUG] HCI_RESET and Num_HCI_Command_Packets limit

Hi Alex,

> I'm experiencing a problem that wedges the bluetooth adapter from time
> to time. The repro case can be hit with this scenario:
>
> run bluetoothd
> while hciconfig hci0 up && hciconfig hci0 down; do echo -n .; done
>
> In few iterations you will have a timeout while trying to bring up the
> interface. This is just a repro case, the real scenario just happens
> from time to time if you for instance restart bluetoothd at the right
> (or wrong!) moment.

can you give us at least a few details on the kernel you are running and what hardware this is?

If you do not use hciconfig and instead use btmgmt or bluetoothctl, do you see the same issue?

Regards

Marcel