2016-08-30 12:13:42

by Northfield Stuart

Subject: Two second pending connection timeout prevents connection to devices with long advertising interval

We are currently working with a BLE device which (for power consumption reasons) uses an unusually large advertisement period of ten seconds (unusual, but allowed within the BLE specification). We don’t have the option of reducing the advertising interval for this product.

This works with older kernels (e.g. 4.2.6 in RH F23), but on later kernels it appears that the kernel times out the connection attempt after only two seconds.

I believe I have tracked down the change responsible to a patch from Johan Hedberg <[email protected]> on 2014-07-06, which appears to split the BLE connection timeout into two variants: HCI_LE_CONN_TIMEOUT, which remains at 20 seconds, and the newly added HCI_LE_AUTOCONN_TIMEOUT, which is set to two seconds.
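
To put rough numbers on why the two second value hurts a ten second advertiser, here is a minimal sketch in C. The macro and function names are illustrative, not kernel identifiers, and the model assumes the attempt starts at a uniformly random point in the advertising period and that the initiator listens continuously; neither assumption holds exactly in practice.

```c
/* Timeout values quoted in this thread, in milliseconds. */
#define LE_CONN_TIMEOUT_MS     20000u  /* matches HCI_LE_CONN_TIMEOUT */
#define LE_AUTOCONN_TIMEOUT_MS  2000u  /* matches HCI_LE_AUTOCONN_TIMEOUT */

/*
 * Approximate chance, in percent, that at least one advertising event
 * falls inside the connect timeout. Hypothetical helper: it ignores the
 * small random advertising delay and any scan duty cycle.
 */
unsigned int hit_chance_percent(unsigned int timeout_ms,
                                unsigned int adv_interval_ms)
{
    if (adv_interval_ms == 0 || timeout_ms >= adv_interval_ms)
        return 100;
    return (timeout_ms * 100u) / adv_interval_ms;
}
```

Under these assumptions, `hit_chance_percent(LE_AUTOCONN_TIMEOUT_MS, 10000)` gives 20, while the 20 second timeout always spans at least one advertising event.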

I understand, from other threads touching on this subject (see links below - I am at least not the only person to have hit this problem), that this 2s timeout was chosen to avoid blocking other connections, and I agree that the average user probably doesn’t need to handle such slow devices. However, is there any reason why this timeout is hardcoded in the source rather than exposed as a tuneable parameter? That would at least allow those of us who do need to interact with such devices to configure the Linux Bluetooth stack suitably.

Personally, I would regard a change which prevents interoperability with a conformant device as a regression, but I can see why it was done, and why it isn’t an issue for the vast majority of users and devices.

This is my third attempt to raise this issue on the list - I would appreciate if someone could please explain what more I need to do to actually get a response (or even some progress) on this issue?

Thread discussing identical issue with slow advertising device (no apparent solution though):

<http://marc.info/?l=linux-bluetooth&m=146361213623520&w=2>

Thread referenced from above which states why the 2s timeout is expected behaviour:

<http://marc.info/?l=linux-bluetooth&m=144830298701744&w=2>

Regards

Stu


Attachments:
smime.p7s (1.60 kB)

2016-08-31 19:19:18

by Marcel Holtmann

Subject: Re: Two second pending connection timeout prevents connection to devices with long advertising interval

Hi Stu,

>> the problem is that in order to send a CONNECT_REQ, the HCI_LE_Create_Connection command needs to see a connectable advertising packet one more time. So the longer you give HCI_LE_Create_Connection to find it, the longer everything else in the system is blocked, since only one HCI_LE_Create_Connection can be running at a time.
>
> We understand that, but for reasons which I’ll explain below, I suspect our only realistic option is a method of controlling this from our applications, not by modifying system behaviour automatically based on the device.
>
>> Now if the peripheral in question would actually include the «Advertising Interval» AD type in its advertising, then we could automatically adjust the scan window/interval and timeout when connecting to such a device. Can you run a btmon trace and show the advertising data you are getting?
>
> It’s a nice idea, but I can tell you now that the advertisement data from the device consists of purely the flags and manufacturer specific data fields filling the entire advertisement frame. There is no remaining space in the advertisement frame to add the advertising interval field! I can’t omit the flags field because the device is BLE only, thus some flag bits must be set.
>
> This product is pushing at the boundaries of what is achievable. Without disclosing the exact nature of the product, it is a portable device equipped with screen, environmental sensors (temperature, accelerometers, etc), GPS, cellular modem, UHF transceiver and BLE (the Nordic BLE device acts as the system main processor as well). The requirement is for a (non-rechargeable, non-replaceable) battery life of 15 years (on average), potentially out in remote field locations for long periods of time. Naturally, physical battery space is limited too. It is an extremely challenging project, and the onerous battery life requirements have forced us to squeeze every last bit of data in to the adverts to minimise the requirements for connections. It was rather an unpleasant surprise to discover that moving development of the infrastructure tools forward to a later distribution/kernel stopped them working completely!
>
> I’m sure you can appreciate that while I agree your suggestion is almost certainly the ‘proper’ solution to this issue, almost any solution which requires modifying the device behaviour has a severe impact on the power budget. For example, I could put the advertising interval field in a scan response, but enabling scan responses guarantees more time spent transmitting and a corresponding reduction in battery life.

Scan responses are not going to help here either, since background scanning is passive scanning. We do not want to add to the mess that is active background scanning. That some phone OSes are doing this is already bad enough.

> Unfortunately, we will not be in a position to build bespoke patched kernel images for every linux platform expected to interact with these devices (some, yes, but I believe the COTS linux tablets will be out of our control), so we were really looking for a solution which would allow use of a modern distribution/kernel but still allow interoperability with a device such as ours by configuring or tuning the central behaviour from our bespoke applications.
>
> At the moment the linux platforms in use on product trials are still running an early enough kernel that the issue is not affecting the trials, but it would be unrealistic to expect this to remain the case going forwards.
>
> Any other suggestions?
>
> It’s a while since I worked on Linux kernel level stuff (I’m mostly embedded/OS X/iOS at the moment), but if it would be considered for inclusion then I’m prepared to put the effort in to a patch to provide some form of tunability around the timeouts (suggestions and guidance on preferred mechanisms welcome). We are in an environment where all the devices are ‘slow’ and we both understand and accept the implications of such a stack reconfiguration.
>
> If it’s likely to be rejected out of hand, then that makes life considerably more tedious and we will have to have a serious re-think on how we move forward :(

The problem here is that we have to make this fly without harming any other user of the system. One peripheral should not block all the rest. And the problem here is really the re-connection time of for example a HID device where low-latency is what counts.

One solution would be to keep the long timeout with HCI_LE_Create_Connection if we have controllers that allow us to keep scanning. Meaning a combination of Passive Scanning State and Initiating State. This is something we need to find out with trial and error and see if it can be done.

As background: currently we stop scanning when we see a device we need to connect to, then connect to it. If there are other devices on the "to be connected" list, we re-enable scanning once the connection attempt has succeeded or failed.

Essentially you want to change this into this:

a) Found device we want to connect to
b) If more devices are on the auto-connect list, keep scanning, otherwise disable scanning
c) Send connect request and wait for its completion
d) For the first 2 seconds that connect attempt is exclusive
e) After that cancel it if we see another auto-connect device and try that device
f) Start over
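
The steps above can be sketched as a small decision function. This is only an illustration under stated assumptions: the enum, the function names and the two constants are hypothetical and do not exist in the kernel, and the 20 second overall limit is simply the existing long timeout reused.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXCLUSIVE_WINDOW_MS  2000u  /* step d): first attempt is exclusive */
#define LONG_TIMEOUT_MS     20000u  /* assumed overall limit for the attempt */

enum conn_action {
    CONN_KEEP_WAITING,   /* leave the pending attempt alone */
    CONN_SWITCH_DEVICE,  /* step e): cancel it and try the device just seen */
    CONN_GIVE_UP         /* overall timeout expired, start over */
};

/* Step b): keep scanning only while other auto-connect entries remain. */
bool keep_scanning_while_connecting(unsigned int other_autoconn_entries)
{
    return other_autoconn_entries > 0;
}

/* Steps c) to f): decide what to do with the pending connect attempt. */
enum conn_action pending_conn_action(uint32_t elapsed_ms,
                                     bool other_autoconn_seen)
{
    if (elapsed_ms >= LONG_TIMEOUT_MS)
        return CONN_GIVE_UP;
    if (elapsed_ms < EXCLUSIVE_WINDOW_MS)
        return CONN_KEEP_WAITING;
    return other_autoconn_seen ? CONN_SWITCH_DEVICE : CONN_KEEP_WAITING;
}
```

A slow peripheral then keeps its attempt for the full long timeout unless another auto-connect device shows up after the first two seconds.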

Similar things then apply to when to re-enable scanning after connection termination, but I doubt that will actually have to change.

What this means in simple terms: only disable background scanning when the auto-connect list is empty. Otherwise keep it active and let the controller deal with the two instances of state machines by itself.

Now we need to check if that would work or not. We have quirks like HCI_QUIRK_SIMULTANEOUS_DISCOVERY and this might need another one. Not sure if we want to go with blacklist or whitelist here. I would do blacklist and actually check the supported states. Since this is LE only, the controller should not lie to us.

If you want to work on this, then try this simple approach:

a) Read the supported states and extract support for passive scanning + initiating
b) Use a long timeout
c) Only disable scanning when no other device is left on auto-connect list
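
A minimal sketch of the first step, assuming the 8-octet bitmask returned by HCI_LE_Read_Supported_States is already available as a byte array. The bit index used for "Passive Scanning State and Initiating State" (22) is an assumption taken from the LE_States table in the Core Specification and should be checked against the spec before relying on it; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit index for "Passive Scanning + Initiating" in LE_States. */
#define LE_STATE_PASSIVE_SCAN_AND_INITIATING 22u

/* LE_States is little-endian: bit 0 is the least significant bit of byte 0. */
bool le_state_supported(const uint8_t states[8], unsigned int bit)
{
    if (bit >= 64)
        return false;
    return ((states[bit / 8] >> (bit % 8)) & 1u) != 0;
}

/* True if the controller claims it can keep passive scanning while initiating. */
bool can_scan_while_initiating(const uint8_t states[8])
{
    return le_state_supported(states, LE_STATE_PASSIVE_SCAN_AND_INITIATING);
}
```

If this returns false, the existing behaviour (stop scanning, then connect with the short timeout) would remain the fallback.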

If this basically works, then the only other thing we have to do is be smart about concurrent connections, meaning that a long-running attempt can be cancelled and replaced with something we see in the 2-x second window. As I said above, the 0-2 second window should be exclusive to the first attempt. We can tune these values; it is just that the 20 second one is killing low-latency reconnects by HID devices.

However, there is a case to be made that only directed advertising should be able to interrupt it, which would satisfy the low-latency HID requirement. The advantage here is that directed adverts are high duty cycle and would show up right away, so you are not really losing out on your slow connection attempt.

But this idea really stands and falls with the passive scanning + initiating state support in the controller.

Regards

Marcel


2016-08-31 15:08:16

by Northfield Stuart

Subject: Re: Two second pending connection timeout prevents connection to devices with long advertising interval

Hi Marcel,

Thank you for your reply.

> the problem is that in order to send a CONNECT_REQ, the HCI_LE_Create_Connection command needs to see a connectable advertising packet one more time. So the longer you give HCI_LE_Create_Connection to find it, the longer everything else in the system is blocked, since only one HCI_LE_Create_Connection can be running at a time.

We understand that, but for reasons which I’ll explain below, I suspect our only realistic option is a method of controlling this from our applications, not by modifying system behaviour automatically based on the device.

> Now if the peripheral in question would actually include the «Advertising Interval» AD type in its advertising, then we could automatically adjust the scan window/interval and timeout when connecting to such a device. Can you run a btmon trace and show the advertising data you are getting?

It’s a nice idea, but I can tell you now that the advertisement data from the device consists of purely the flags and manufacturer specific data fields filling the entire advertisement frame. There is no remaining space in the advertisement frame to add the advertising interval field! I can’t omit the flags field because the device is BLE only, thus some flag bits must be set.
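
The space constraint can be made concrete with a little arithmetic. The sketch below assumes a 31-byte legacy advertising payload in which every AD structure costs one length byte and one type byte on top of its data; the function names are illustrative only.

```c
#include <stddef.h>

#define LEGACY_ADV_DATA_MAX 31u  /* bytes of AD data in a legacy advert */

/* Each AD structure is: 1 length byte + 1 type byte + payload. */
size_t ad_structure_size(size_t payload_len)
{
    return 2 + payload_len;
}

/*
 * Bytes left for the manufacturer specific payload (company ID included)
 * once the Flags field, and optionally a 2-byte Advertising Interval
 * field, have been accounted for.
 */
size_t manufacturer_payload_room(int with_adv_interval)
{
    size_t used = ad_structure_size(1);        /* Flags: 3 bytes */

    if (with_adv_interval)
        used += ad_structure_size(2);          /* Advertising Interval: 4 bytes */

    return LEGACY_ADV_DATA_MAX - used - 2;     /* minus manufacturer header */
}
```

Adding the Advertising Interval field would shrink the manufacturer payload from 26 to 22 bytes, which is the four bytes the advert cannot spare.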

This product is pushing at the boundaries of what is achievable. Without disclosing the exact nature of the product, it is a portable device equipped with screen, environmental sensors (temperature, accelerometers, etc), GPS, cellular modem, UHF transceiver and BLE (the Nordic BLE device acts as the system main processor as well). The requirement is for a (non-rechargeable, non-replaceable) battery life of 15 years (on average), potentially out in remote field locations for long periods of time. Naturally, physical battery space is limited too. It is an extremely challenging project, and the onerous battery life requirements have forced us to squeeze every last bit of data in to the adverts to minimise the requirements for connections. It was rather an unpleasant surprise to discover that moving development of the infrastructure tools forward to a later distribution/kernel stopped them working completely!

I’m sure you can appreciate that while I agree your suggestion is almost certainly the ‘proper’ solution to this issue, almost any solution which requires modifying the device behaviour has a severe impact on the power budget. For example, I could put the advertising interval field in a scan response, but enabling scan responses guarantees more time spent transmitting and a corresponding reduction in battery life.

Unfortunately, we will not be in a position to build bespoke patched kernel images for every linux platform expected to interact with these devices (some, yes, but I believe the COTS linux tablets will be out of our control), so we were really looking for a solution which would allow use of a modern distribution/kernel but still allow interoperability with a device such as ours by configuring or tuning the central behaviour from our bespoke applications.

At the moment the linux platforms in use on product trials are still running an early enough kernel that the issue is not affecting the trials, but it would be unrealistic to expect this to remain the case going forwards.

Any other suggestions?

It’s a while since I worked on Linux kernel level stuff (I’m mostly embedded/OS X/iOS at the moment), but if it would be considered for inclusion then I’m prepared to put the effort in to a patch to provide some form of tunability around the timeouts (suggestions and guidance on preferred mechanisms welcome). We are in an environment where all the devices are ‘slow’ and we both understand and accept the implications of such a stack reconfiguration.

If it’s likely to be rejected out of hand, then that makes life considerably more tedious and we will have to have a serious re-think on how we move forward :(

Regards

Stu

--
Stuart Northfield
+44 (0) 1223 566728 (Direct), +44 (0) 1223 566727 (Fax)
Metanate Limited. Registered in England No 4046086 at:
Lincoln House, Station Court, Great Shelford, Cambridge CB22 5NE, UK
http://www.metanate.com (Consultancy) http://www.schemus.com (Data synchronisation)

This e-mail and all attachments it may contain is confidential and
intended solely for the use of the individual to whom it is addressed.
Any views or opinions presented are those of the author and do not
necessarily represent those of Metanate Ltd. If you are not the
intended recipient, be advised that you have received this e-mail in
error and that any use, dissemination, printing, forwarding or copying
of this e-mail is strictly prohibited. Please contact the sender if
you have received this e-mail in error.

2016-08-31 12:58:21

by Marcel Holtmann

Subject: Re: Two second pending connection timeout prevents connection to devices with long advertising interval

Hi Stu,

> We are currently working with a BLE device which (for power consumption reasons) uses an unusually large advertisement period of ten seconds (unusual, but allowed within the BLE specification). We don’t have the option of reducing the advertising interval for this product.
>
> This works with older kernels (e.g. 4.2.6 in RH F23), but on later kernels it appears that the kernel times out the connection attempt after only two seconds.
>
> I believe I have tracked down the change responsible to a patch from Johan Hedberg <[email protected]> on 2014-07-06, which appears to split the BLE connection timeout in to two variants, HCI_LE_CONN_TIMEOUT which remains at 20 seconds, and the newly added one, HCI_LE_AUTOCONN_TIMEOUT, which has been reduced down to two seconds.
>
> I understand, from other threads touching on this subject (see links below - I am at least not the only person to have hit this problem) that this 2s timeout is chosen to avoid blocking other connections, and agree that the average user probably doesn’t need to be able to handle such slow devices. However, is there any reason why this timeout is hardcoded in the source rather than a tuneable parameter, which would at least allow those of us who do need to interact with such devices to be able to configure the linux bluetooth stack suitably.
>
> Personally, I would regard a change which prevents interoperability with a conformant device as a regression, but I can see why it was done, and why it isn’t an issue for the vast majority of users and devices.

The problem is that in order to send a CONNECT_REQ, the HCI_LE_Create_Connection command needs to see a connectable advertising packet one more time. So the longer you give HCI_LE_Create_Connection to find it, the longer everything else in the system is blocked, since only one HCI_LE_Create_Connection can be running at a time.

Now if the peripheral in question would actually include the «Advertising Interval» AD type in its advertising, then we could automatically adjust the scan window/interval and timeout when connecting to such a device. Can you run a btmon trace and show the advertising data you are getting?
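
For reference, a minimal sketch of how a central could pull that field out of raw advertising data. It assumes the Advertising Interval AD type is 0x1A with a 16-bit little-endian value in units of 0.625 ms; the function name is hypothetical and this is not how the kernel or btmon implements it.

```c
#include <stddef.h>
#include <stdint.h>

#define AD_TYPE_ADV_INTERVAL 0x1A  /* assumed AD type for Advertising Interval */

/*
 * Walk the length/type/value structures in the advertising data and
 * return the advertised interval in milliseconds, or -1 if the field
 * is absent or the data is malformed.
 */
long adv_interval_ms(const uint8_t *data, size_t len)
{
    size_t pos = 0;

    while (pos < len) {
        uint8_t field_len = data[pos];  /* covers the type byte + payload */

        if (field_len == 0 || pos + 1 + field_len > len)
            break;

        if (data[pos + 1] == AD_TYPE_ADV_INTERVAL && field_len == 3) {
            uint16_t units = (uint16_t)(data[pos + 2] | (data[pos + 3] << 8));
            return (long)units * 625 / 1000;  /* 0.625 ms per unit */
        }
        pos += 1 + (size_t)field_len;
    }
    return -1;
}
```

A device advertising every ten seconds would carry the value 0x3E80 (16000 units) in this field.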

Regards

Marcel


2016-09-01 14:04:24

by Marcel Holtmann

Subject: Re: Two second pending connection timeout prevents connection to devices with long advertising interval

Hi Stu,

>> The problem here is that we have to make this fly without harming any other user of the system. One peripheral should not block all the rest. And the problem here is really the re-connection time of for example a HID device where low-latency is what counts.
>
> Understood.
>
>> One solution would be to keep the long timeout with HCI_LE_Create_Connection if we have controllers that allow us to keep scanning. Meaning a combination of Passive Scanning State and Initiating State. This is something we need to find out with trial and error and see if it can be done.
>>
>> As a background here. Currently we stop scanning when we see a device we need to connect to, then connect to it. And if there are other devices on the "to be connected" list, we enable scanning after successful or failure of the connection attempt.
>
> Presumably this is per controller rather than global across controllers? One thing I forgot to mention is that some of our infrastructure code is capable of using multiple BLE controllers and (I guess due to the above) we usually keep one controller scanning while using the other(s) for connections. (NB I personally didn’t write any of this linux application code - I was merely tasked with investigating the 2s connection attempt termination issue.)

In Linux every controller / radio is treated independently. For all I care you can attach 64k of them to a system. That is really the only limit :)

I mean in theory we can always go with a super long timeout. However if the controller supports passive scanning and initiating state at the same time, we should use it. Limited controllers are limited. That's just it. If they are supporting it, then we should use it to make it low-latency if possible.

>> Essentially you want to change this into this:
>>
>> a) Found device we want to connect to
>> b) If more devices are on the auto-connect list, keep scanning, otherwise disable scanning
>> c) Send connect request and wait for its completion
>> d) For the first 2 seconds that connect attempt is exclusive
>> e) After that cancel it if we see another auto-connect device and try that device
>> f) Start over
>>
>> Similar things then apply to when to re-enable scanning after connection termination, but I doubt that will actually have to change.
>>
>> What this means in simple terms: only disable background scanning when the auto-connect list is empty. Otherwise keep it active and let the controller deal with the two instances of state machines by itself.
>>
>> Now we need to check if that would work or not. We have quirks like HCI_QUIRK_SIMULTANEOUS_DISCOVERY and this might need another one. Not sure if we want to go with blacklist or whitelist here. I would do blacklist and actually check the supported states. Since this is LE only, the controller should not lie to us.
>>
>> If you want to work on this, then try this simple approach:
>>
>> a) Read the supported states and extract support for passive scanning + initiating
>> b) Use a long timeout
>> c) Only disable scanning when no other device is left on auto-connect list
>>
>> If this basically works, then the only other thing we have to do is be smart about concurrent connection. Meaning that a long running one can be cancelled and replaced with something we see in the 2-x second window. As I said above, the 0-2 second window should be exclusive to the first attempt. We can tune these values, it is just the 20 second one is killing low-latency reconnect by HID device.
>>
>> However there is one case to be made that we might only consider direct advertising to be able to interrupt it. Which would satisfy the HID requirement with low-latency. The advantage here is that they are high duty cycle and would show up right away. So you are not really losing out on your slow-connection attempt.
>>
>> But this idea really stands and falls with the passive scanning + initiating state support in the controller.
>
> OK - I’ll investigate the controllers we have (we are using at least two different chipsets I believe, possibly three) and then have a go at this simple test - probably won’t be before next week at the moment (trying to get a new release candidate out at the moment) and it might take me a day or two to re-familiarise myself and get to grips with the code :)

I took a random Broadcom dongle off the shelf and it seems to actually do fine with initiating state + passive scanning.

Btw. this is your main entry point to look at:

    /* If controller is scanning, we stop it since some controllers are
     * not able to scan and connect at the same time. Also set the
     * HCI_LE_SCAN_INTERRUPTED flag so that the command complete
     * handler for scan disabling knows to set the correct discovery
     * state.
     */
    if (hci_dev_test_flag(hdev, HCI_LE_SCAN)) {
        hci_req_add_le_scan_disable(&req);
        hci_dev_set_flag(hdev, HCI_LE_SCAN_INTERRUPTED);
    }

    hci_req_add_le_create_conn(&req, conn);

It is not as easy as just adding extra checks around it. There is a bit more logic that needs handling, mainly since just disabling scanning is not going to do it. You actually need to update the white list. So now the fun part is whether you can update the white list while scanning, or whether you need to stop scanning, update the white list and then restart scanning.

I wonder if we should start using hci-tester or some new tool (via HCI User Channel) to allow us to do a quick check on what the controller supports.

Regards

Marcel


2016-09-01 09:41:50

by Northfield Stuart

Subject: Re: Two second pending connection timeout prevents connection to devices with long advertising interval

Hi Marcel,

> The problem here is that we have to make this fly without harming any other user of the system. One peripheral should not block all the rest. And the problem here is really the re-connection time of for example a HID device where low-latency is what counts.

Understood.

> One solution would be to keep the long timeout with HCI_LE_Create_Connection if we have controllers that allow us to keep scanning. Meaning a combination of Passive Scanning State and Initiating State. This is something we need to find out with trial and error and see if it can be done.
>
> As a background here. Currently we stop scanning when we see a device we need to connect to, then connect to it. And if there are other devices on the "to be connected" list, we enable scanning after successful or failure of the connection attempt.

Presumably this is per controller rather than global across controllers? One thing I forgot to mention is that some of our infrastructure code is capable of using multiple BLE controllers and (I guess due to the above) we usually keep one controller scanning while using the other(s) for connections. (NB I personally didn’t write any of this linux application code - I was merely tasked with investigating the 2s connection attempt termination issue.)

> Essentially you want to change this into this:
>
> a) Found device we want to connect to
> b) If more devices are on the auto-connect list, keep scanning, otherwise disable scanning
> c) Send connect request and wait for its completion
> d) For the first 2 seconds that connect attempt is exclusive
> e) After that cancel it if we see another auto-connect device and try that device
> f) Start over
>
> Similar things then apply to when to re-enable scanning after connection termination, but I doubt that will actually have to change.
>
> What this means in simple terms: only disable background scanning when the auto-connect list is empty. Otherwise keep it active and let the controller deal with the two instances of state machines by itself.
>
> Now we need to check if that would work or not. We have quirks like HCI_QUIRK_SIMULTANEOUS_DISCOVERY and this might need another one. Not sure if we want to go with blacklist or whitelist here. I would do blacklist and actually check the supported states. Since this is LE only, the controller should not lie to us.
>
> If you want to work on this, then try this simple approach:
>
> a) Read the supported states and extract support for passive scanning + initiating
> b) Use a long timeout
> c) Only disable scanning when no other device is left on auto-connect list
>
> If this basically works, then the only other thing we have to do is be smart about concurrent connection. Meaning that a long running one can be cancelled and replaced with something we see in the 2-x second window. As I said above, the 0-2 second window should be exclusive to the first attempt. We can tune these values, it is just the 20 second one is killing low-latency reconnect by HID device.
>
> However there is one case to be made that we might only consider direct advertising to be able to interrupt it. Which would satisfy the HID requirement with low-latency. The advantage here is that they are high duty cycle and would show up right away. So you are not really losing out on your slow-connection attempt.
>
> But this idea really stands and falls with the passive scanning + initiating state support in the controller.

OK - I’ll investigate the controllers we have (we are using at least two different chipsets I believe, possibly three) and then have a go at this simple test - probably won’t be before next week at the moment (trying to get a new release candidate out at the moment) and it might take me a day or two to re-familiarise myself and get to grips with the code :)

Many thanks for your assistance and input so far.

Regards

Stu
