2015-10-26 04:41:28

by Jakub Pawlowski

[permalink] [raw]
Subject: [RFC] Fix scan enable/disable kernel issue for opening socket

Issue (on all kernels, including bluetooth-next and latest 4.3 kernel):
1. in bluetoothd pair to two BLE devices using random address
(AA:BB:CC:DD:EE:FF and FF:EE:DD:CC:BB:AA), that don't use autoconnect.
2. Make sure all those devices don't advertise.
3. call Device1.Connect() on both devices at same time, with scan
initially disabled.
4. Second request will fail right away, instead of timeout in 45
seconds (if it doesn't fail repeat, it's a race condition and happens
once every time, trying with more, i.e. 5 devices at same time gives
bigger change).

When calling Device1.Connect() twice, bluetoothd will try to open two
sockets, each to different random address device.
1 Inside kernel, for first socket, scan state is checked,and
HCI_OP_SET_SCAN_ENABLE is scheduled, but not yet executed.
2. Same happens for second socket - current scan state is checked, and
HCI_OP_SET_SCAN_ENABLE is scheduled.
3. Now both HCI_OP_SET_SCAN_ENABLE are executed, second will fail,
because first one already enabled scan.


To solve this kind of issues, I want to propose adding inside kernel
virtual HCI commands, that might be scheduled in hci_request. Virtual
HCI command will be able to do some very basic check (i.e. check scan
state) and might cause real HCI command to be issued.

Commands I'd like to add:
VIRT_HCI_SCAN_DISABLE_START - check scan state, remember it, and send
HCI_OP_SET_SCAN_ENABLE(enable=false) if it was enabled
VIRT_HCI_SCAN_DISABLE_STOP - bring scan state back to state remembered
in VIRT_HCI_SCAN_DISABLE_START
Those commands must be both scheduled in hci_request, having only one
of them in hci_request would be error.

VIRT_HCI_SCAN_ENABLE - this command will check current scan state, and
send HCI_OP_SET_SCAN_ENABLE(enable=true) if scan was disabled
VIRT_HCI_SCAN_DISABLE - this command will check current scan state,
and send HCI_OP_SET_SCAN_ENABLE(enable=false) if scan was enabled


2015-11-01 06:51:41

by Jakub Pawlowski

[permalink] [raw]
Subject: Re: [RFC] Fix scan enable/disable kernel issue for opening socket

Hi Marcel,

On Fri, Oct 30, 2015 at 11:51 PM, Jakub Pawlowski <[email protected]> w=
rote:
> Hi Marcel,
>
> On Fri, Oct 30, 2015 at 8:56 PM, Marcel Holtmann <[email protected]> wr=
ote:
>> Hi Jakub,
>>
>>> Issue (on all kernels, including bluetooth-next and latest 4.3 kernel):
>>> 1. in bluetoothd pair to two BLE devices using random address
>>> (AA:BB:CC:DD:EE:FF and FF:EE:DD:CC:BB:AA), that don't use autoconnect.
>>> 2. Make sure all those devices don't advertise.
>>> 3. call Device1.Connect() on both devices at same time, with scan
>>> initially disabled.
>>> 4. Second request will fail right away, instead of timeout in 45
>>> seconds (if it doesn't fail repeat, it's a race condition and happens
>>> once every time, trying with more, i.e. 5 devices at same time gives
>>> bigger change).
>>>
>>> When calling Device1.Connect() twice, bluetoothd will try to open two
>>> sockets, each to different random address device.
>>> 1 Inside kernel, for first socket, scan state is checked,and
>>> HCI_OP_SET_SCAN_ENABLE is scheduled, but not yet executed.
>>> 2. Same happens for second socket - current scan state is checked, and
>>> HCI_OP_SET_SCAN_ENABLE is scheduled.
>>> 3. Now both HCI_OP_SET_SCAN_ENABLE are executed, second will fail,
>>> because first one already enabled scan.
>>
>> I think that is the fundamental problem here. The state machine is bogus=
. I said this many time before, in general the scanning needs to be disable=
d first, whitelist re-programmed and then scanning resumed. The fact that a=
n L2CAP connect() for ATT might be triggered at the same time is really not=
any reason to not have a proper state machine behind this.
>>
>
> The state machine is good, it properly checks all conditions and
> schedule commands that are correct at THAT time, but they're being
> send to controller later, at different moment in time, at which they
> might no longer be correct, because of other commands that were
> executed in mean time.
>
>>> To solve this kind of issues, I want to propose adding inside kernel
>>> virtual HCI commands, that might be scheduled in hci_request. Virtual
>>> HCI command will be able to do some very basic check (i.e. check scan
>>> state) and might cause real HCI command to be issued.
>>>
>>> Commands I'd like to add:
>>> VIRT_HCI_SCAN_DISABLE_START - check scan state, remember it, and send
>>> HCI_OP_SET_SCAN_ENABLE(enable=3Dfalse) if it was enabled
>>> VIRT_HCI_SCAN_DISABLE_STOP - bring scan state back to state remembered
>>> in VIRT_HCI_SCAN_DISABLE_START
>>> Those commands must be both scheduled in hci_request, having only one
>>> of them in hci_request would be error.
>>>
>>> VIRT_HCI_SCAN_ENABLE - this command will check current scan state, and
>>> send HCI_OP_SET_SCAN_ENABLE(enable=3Dtrue) if scan was disabled
>>> VIRT_HCI_SCAN_DISABLE - this command will check current scan state,
>>> and send HCI_OP_SET_SCAN_ENABLE(enable=3Dfalse) if scan was enabled
>>
>> This approach does not sound right at all. And honestly I do not see a n=
eed for adding such a complex overhead. It is like fixing symptoms. What we=
need to fix is the root cause.
>
> So after all playing I've done with that right now, only one command
> was needed, HCI_OP_VIRT_LE_SCAN_DISABLED - it solves all problems so
> far.
>
>>
>> And what I like to see is that we actually have mgmt-tester and l2cap-te=
ster test cases first that expose this issue. Then we can ensure that we ne=
ver break this behavior anymore.
>
> I'll add mgmt-tester test case for that. Going through it will explain
> the problem much better.
> Hope that when going through it you'll either get convinced that it's
> really needed, or proof me wrong.

So way to reproduce the problem is like that:

tools/l2test -V le_public -m 11:22:33:44:55:66 22:33:44:55:66:77

It'll try to add to white list during scanning, which will result in
"Command Disallowed"

> HCI Event: Command Complete (0x0e) plen 4 [hci0]=
57.987306
LE Add Device To White List (0x08|0x0011) ncmd 1
Status: Command Disallowed (0x0c)

here are full logs from executing at my machine: ix.io/lKo

I tried to write a test case for that, but when running against vhci I
was unable to reproduce, will work more on that.

Here's explanation why this fail:

connection to 11:22:33:44:55:66 - lines starting with ***
connection to 11:22:33:44:55:66 - lines starting with &&&

Initially scan, and background scan is off.
*** l2cap_chan_connect: locks hdev, and call hci_connect_le_scan,
which call hci_update_background_scan
*** hci_update_background_scan check scan state (scan is off) so scan
disabling is not scheduled
*** schedule ADD TO WHITELIST in req1
*** schedule SET SCAN PARAMS in req1
*** schedule SET_SCAN_ENABLE in req1
*** schedule req1 for execution
*** unlocks hdev in l2cap_chan_connect

&&& l2cap_chan_connect: locks hdev, and call hci_connect_le_scan,
which call hci_update_background_scan
&&& hci_update_background_scan check scan state (scan is off) so scan
disabling is not scheduled
&&& schedule ADD TO WHITELIST in req2
&&& schedule SET SCAN PARAMS in req2
&&& schedule SET_SCAN_ENABLE in req2
&&& schedule req2 for execution
&&& unlocks hdev in l2cap_chan_connect

*** req1 is succesfully executed, leaves controller scanning
&&& req2 is starting execution (it assumed scan is off during
scheduling), first command - ADD TO WHITELIST fails to be properly
executed


does this make more sense now ? If my solution with "virtual" command
is wrong, how should I fix that ?
It's not only two "connect" commands interfering with each other, it
can be "connect" and some mgmt command that is trying to update
background scan, like "add device".

Jakub

>
> Jakub
>
>>
>> Regards
>>
>> Marcel
>>

2015-10-31 06:51:20

by Jakub Pawlowski

[permalink] [raw]
Subject: Re: [RFC] Fix scan enable/disable kernel issue for opening socket

Hi Marcel,

On Fri, Oct 30, 2015 at 8:56 PM, Marcel Holtmann <[email protected]> wrot=
e:
> Hi Jakub,
>
>> Issue (on all kernels, including bluetooth-next and latest 4.3 kernel):
>> 1. in bluetoothd pair to two BLE devices using random address
>> (AA:BB:CC:DD:EE:FF and FF:EE:DD:CC:BB:AA), that don't use autoconnect.
>> 2. Make sure all those devices don't advertise.
>> 3. call Device1.Connect() on both devices at same time, with scan
>> initially disabled.
>> 4. Second request will fail right away, instead of timeout in 45
>> seconds (if it doesn't fail repeat, it's a race condition and happens
>> once every time, trying with more, i.e. 5 devices at same time gives
>> bigger change).
>>
>> When calling Device1.Connect() twice, bluetoothd will try to open two
>> sockets, each to different random address device.
>> 1 Inside kernel, for first socket, scan state is checked,and
>> HCI_OP_SET_SCAN_ENABLE is scheduled, but not yet executed.
>> 2. Same happens for second socket - current scan state is checked, and
>> HCI_OP_SET_SCAN_ENABLE is scheduled.
>> 3. Now both HCI_OP_SET_SCAN_ENABLE are executed, second will fail,
>> because first one already enabled scan.
>
> I think that is the fundamental problem here. The state machine is bogus.=
I said this many time before, in general the scanning needs to be disabled=
first, whitelist re-programmed and then scanning resumed. The fact that an=
L2CAP connect() for ATT might be triggered at the same time is really not =
any reason to not have a proper state machine behind this.
>

The state machine is good, it properly checks all conditions and
schedule commands that are correct at THAT time, but they're being
send to controller later, at different moment in time, at which they
might no longer be correct, because of other commands that were
executed in mean time.

>> To solve this kind of issues, I want to propose adding inside kernel
>> virtual HCI commands, that might be scheduled in hci_request. Virtual
>> HCI command will be able to do some very basic check (i.e. check scan
>> state) and might cause real HCI command to be issued.
>>
>> Commands I'd like to add:
>> VIRT_HCI_SCAN_DISABLE_START - check scan state, remember it, and send
>> HCI_OP_SET_SCAN_ENABLE(enable=3Dfalse) if it was enabled
>> VIRT_HCI_SCAN_DISABLE_STOP - bring scan state back to state remembered
>> in VIRT_HCI_SCAN_DISABLE_START
>> Those commands must be both scheduled in hci_request, having only one
>> of them in hci_request would be error.
>>
>> VIRT_HCI_SCAN_ENABLE - this command will check current scan state, and
>> send HCI_OP_SET_SCAN_ENABLE(enable=3Dtrue) if scan was disabled
>> VIRT_HCI_SCAN_DISABLE - this command will check current scan state,
>> and send HCI_OP_SET_SCAN_ENABLE(enable=3Dfalse) if scan was enabled
>
> This approach does not sound right at all. And honestly I do not see a ne=
ed for adding such a complex overhead. It is like fixing symptoms. What we =
need to fix is the root cause.

So after all playing I've done with that right now, only one command
was needed, HCI_OP_VIRT_LE_SCAN_DISABLED - it solves all problems so
far.

>
> And what I like to see is that we actually have mgmt-tester and l2cap-tes=
ter test cases first that expose this issue. Then we can ensure that we nev=
er break this behavior anymore.

I'll add mgmt-tester test case for that. Going through it will explain
the problem much better.
Hope that when going through it you'll either get convinced that it's
really needed, or proof me wrong.

Jakub

>
> Regards
>
> Marcel
>

2015-10-31 03:56:49

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [RFC] Fix scan enable/disable kernel issue for opening socket

Hi Jakub,

> Issue (on all kernels, including bluetooth-next and latest 4.3 kernel):
> 1. in bluetoothd pair to two BLE devices using random address
> (AA:BB:CC:DD:EE:FF and FF:EE:DD:CC:BB:AA), that don't use autoconnect.
> 2. Make sure all those devices don't advertise.
> 3. call Device1.Connect() on both devices at same time, with scan
> initially disabled.
> 4. Second request will fail right away, instead of timeout in 45
> seconds (if it doesn't fail repeat, it's a race condition and happens
> once every time, trying with more, i.e. 5 devices at same time gives
> bigger change).
>
> When calling Device1.Connect() twice, bluetoothd will try to open two
> sockets, each to different random address device.
> 1 Inside kernel, for first socket, scan state is checked,and
> HCI_OP_SET_SCAN_ENABLE is scheduled, but not yet executed.
> 2. Same happens for second socket - current scan state is checked, and
> HCI_OP_SET_SCAN_ENABLE is scheduled.
> 3. Now both HCI_OP_SET_SCAN_ENABLE are executed, second will fail,
> because first one already enabled scan.

I think that is the fundamental problem here. The state machine is bogus. I said this many time before, in general the scanning needs to be disabled first, whitelist re-programmed and then scanning resumed. The fact that an L2CAP connect() for ATT might be triggered at the same time is really not any reason to not have a proper state machine behind this.

> To solve this kind of issues, I want to propose adding inside kernel
> virtual HCI commands, that might be scheduled in hci_request. Virtual
> HCI command will be able to do some very basic check (i.e. check scan
> state) and might cause real HCI command to be issued.
>
> Commands I'd like to add:
> VIRT_HCI_SCAN_DISABLE_START - check scan state, remember it, and send
> HCI_OP_SET_SCAN_ENABLE(enable=false) if it was enabled
> VIRT_HCI_SCAN_DISABLE_STOP - bring scan state back to state remembered
> in VIRT_HCI_SCAN_DISABLE_START
> Those commands must be both scheduled in hci_request, having only one
> of them in hci_request would be error.
>
> VIRT_HCI_SCAN_ENABLE - this command will check current scan state, and
> send HCI_OP_SET_SCAN_ENABLE(enable=true) if scan was disabled
> VIRT_HCI_SCAN_DISABLE - this command will check current scan state,
> and send HCI_OP_SET_SCAN_ENABLE(enable=false) if scan was enabled

This approach does not sound right at all. And honestly I do not see a need for adding such a complex overhead. It is like fixing symptoms. What we need to fix is the root cause.

And what I like to see is that we actually have mgmt-tester and l2cap-tester test cases first that expose this issue. Then we can ensure that we never break this behavior anymore.

Regards

Marcel