2017-01-10 17:09:46

by Edward Rosten

[permalink] [raw]
Subject: Adding EAGAIN on 0x3e (Connection faile to be established) in net/bluetooth/lib.c/bt_to_errno()?

Hi All,

I've been doing some bluetooth LE work on an RPi 3. I've been getting
occasional ENOSYS replies from connect()---actually from retreiving
the error code via getsockopt since it's an async call. I'm apparently
not the only person, though it seems to be very rare (link below).
Drilling down, the relevant part of the HCI log (parsed by btmon) is
this:


> HCI Event: LE Meta Event (0x3e) plen 12 [hci0] 461.313621
LE Read Remote Used Features (0x04)
Status: Connection Failed to be Established (0x3e)
Handle: 64
Features: 0x1f 0x00 0x00 0x00 0x00 0x00 0x00 0x00
LE Encryption
Connection Parameter Request Procedure
Extended Reject Indication
Slave-initiated Features Exchange
LE Ping

You can see the error code is 0x3e (coincidently the same as the LE
Meta Event code!). According to the Core v4.0 spec (Volume 2, Part B,
Section 1.3, Table 1.1) the code 0x3e which according to the spec is:

2.59 CONNECTION FAILED TO BE ESTABLISHED (0X3E)
The Connection Failed to be Established error code indicates that the LL initi-
ated a connection but the connection has failed to be established.

In other words, it appears to be an error along the lines of
"something wrong happened". On my machine it's very intermittent and
I've found some other people reporting the same
(https://www.raspberrypi.org/forums/viewtopic.php?f=63&t=145144&p=962612).
I modified my code to retry 3 times (with a 500ms pause) on one of
these errors and so far it's never failed on the second attempt.

In net/bluetooth/lib.c, there's a function, bt_to_errno(), which maps
HCI codes to errno numbers, there's no entry for 0x3e.

I was going to submit a 2 line patch to return a sensible error code,
but I've come here to ask what the best choice would be:

I currently think EAGAIN (Try Again---since trying again is usually
the appropriate choice), but this is the same number as EWOULDBLOCK. I
think it ought to be possible to distinguish all cases
EAGAIN/EWOULDBLOCK on a blocking socket is this kind of error. Getting
it on a non blocking socket would only ever mean "operation in
progress". Getting EAGAIN/EWOULDBLOCK from getsockopt(fd, SOL_SOCKET,
SO_ERROR, ...) would also only mean "try again".

Does any one have any input on this? There's not a huge choice when it
comes to error codes, and so I think this is the best one, but I'm not
really sure.

Regards,

-Ed


2017-01-12 12:44:08

by Luiz Augusto von Dentz

[permalink] [raw]
Subject: Re: 5.43 Experimental Status?

Hi Travis,

On Thu, Jan 12, 2017 at 12:38 AM, Travis Griggs <[email protected]> wr=
ote:
> I=E2=80=99m trying to bring my linux SBC acting as a peripheral up to deb=
ian stretch. I had been running on jessie, with my own build of bluez 5.41.=
I had been modifying the bluetooth.service to execute:
>
> /usr/lib/bluetooth/bluetoothd -d -E --noplugin=3D*
>
> The latest stretch/testing bluez seems to be 5.43 though:
>
> dpkg -l | grep bluez
> ii bluez 5.43-1 armhf =
Bluetooth tools and daemons
>
> I thought the -E flag was supposed to no longer be necessary after 5.42. =
So I removed it. I updated my own python code to match the small changes I =
found in example-advertisement. But I=E2=80=99m getting the following error=
:
>
> ERROR:dbus.connection:Unable to set arguments (dbus.ObjectPath('/nels=
on_irrigation/twig_pilot'), {}) according to signature None: <class 'ValueE=
rror'>: Unable to guess signature from an empty dict

Well RegisterAdvertisement is still using
GDBUS_EXPERIMENTAL_ASYNC_METHOD, what has been marked as stable is the
GATT APIs so for advertisement you still need -E.

> Which after googling, was the same error I found a while back when I had =
forgotten to turn -E on with 5.40.
>
> So I added the -E back in and it started working as hoped.
>
> Can someone explain why the -E is still needed with 5.43, for dbus code l=
ike that found in the test/examples? Did it not really get promoted after a=
ll? Or did I misunderstand the semantics of what was changing?--
> To unsubscribe from this list: send the line "unsubscribe linux-bluetooth=
" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--=20
Luiz Augusto von Dentz

2017-01-11 22:38:19

by Travis Griggs

[permalink] [raw]
Subject: 5.43 Experimental Status?

I=E2=80=99m trying to bring my linux SBC acting as a peripheral up to =
debian stretch. I had been running on jessie, with my own build of bluez =
5.41. I had been modifying the bluetooth.service to execute:

/usr/lib/bluetooth/bluetoothd -d -E --noplugin=3D*

The latest stretch/testing bluez seems to be 5.43 though:

dpkg -l | grep bluez
ii bluez 5.43-1 armhf =
Bluetooth tools and daemons

I thought the -E flag was supposed to no longer be necessary after 5.42. =
So I removed it. I updated my own python code to match the small changes =
I found in example-advertisement. But I=E2=80=99m getting the following =
error:

ERROR:dbus.connection:Unable to set arguments =
(dbus.ObjectPath('/nelson_irrigation/twig_pilot'), {}) according to =
signature None: <class 'ValueError'>: Unable to guess signature from an =
empty dict

Which after googling, was the same error I found a while back when I had =
forgotten to turn -E on with 5.40.=20

So I added the -E back in and it started working as hoped.

Can someone explain why the -E is still needed with 5.43, for dbus code =
like that found in the test/examples? Did it not really get promoted =
after all? Or did I misunderstand the semantics of what was changing?=

2017-01-11 01:50:06

by Marcel Holtmann

[permalink] [raw]
Subject: Re: Adding EAGAIN on 0x3e (Connection faile to be established) in net/bluetooth/lib.c/bt_to_errno()?

Hi Emil,

>>>> I bet that if you take a sniffer and look at the raw air packets, then this means that CONNECT_REQ has been sent. The initiator then moves into connection state, but reality is that only after receiving the first data packet, the connection is fully established. Between the CONNECT_REQ and the first packet, things can actually go wrong.
>
> At the same moment the controller sends the CONNECT_REQ over the air,
> it sends the LE Connection Complete Event to the host. This means that
> the connection is considered created even though the slave has not yet
> acknowledged.
> After this the master link layer starts to send out packets at every
> connection interval. If it doesn't get any response after 6 connection
> intervals, it gives up and the link gets reported as disconnected.
>
>>>
>>> I don't have a hardware sniffer: all I've got are tools to parse the HCI dump.
>>>
>>>> Can you send the whole trace for it? I wonder if we get a disconnect event as well, or just an indication that the request for the remote used features did not complete. And at a later LL connection event, the connection would successfully establish.
>>>
>>> Here's the complete HCI log, from Create Connection onwards:
>>>
>>> < HCI Command: LE Create Connection (0x08|0x000d) plen 25
>>>
>>> [hci0] 460.552896
>>> Scan interval: 60.000 msec (0x0060)
>>> Scan window: 60.000 msec (0x0060)
>>> Filter policy: White list is not used (0x00)
>>> Peer address type: Public (0x00)
>>> Peer address: 00:07:80:CF:3E:94 (Bluegiga Technologies OY)
>>> Own address type: Public (0x00)
>>> Min connection interval: 50.00 msec (0x0028)
>>> Max connection interval: 70.00 msec (0x0038)
>>> Connection latency: 0x0000
>>> Supervision timeout: 420 msec (0x002a)
>>> Min connection length: 0.000 msec (0x0000)
>>> Max connection length: 0.000 msec (0x0000)
>>>> HCI Event: Command Status (0x0f) plen 4 [hci0] 460.553725
>>> LE Create Connection (0x08|0x000d) ncmd 1
>>> Status: Success (0x00)
>>>> HCI Event: LE Meta Event (0x3e) plen 19 [hci0] 460.882366
>>> LE Connection Complete (0x01)
>>> Status: Success (0x00)
>>> Handle: 64
>>> Role: Master (0x00)
>>> Peer address type: Public (0x00)
>>> Peer address: 00:07:80:CF:3E:94 (Bluegiga Technologies OY)
>>> Connection interval: 67.50 msec (0x0036)
>>> Connection latency: 0.00 msec (0x0000)
>>> Supervision timeout: 420 msec (0x002a)
>>> Master clock accuracy: 0x00
>>> < HCI Command: LE Read Remote Used Features (0x08|0x0016) plen 2
>>>
>>> [hci0] 460.882691
>>> Handle: 64
>>> @ Device Connected: 00:07:80:CF:3E:94 (1) flags 0x0000
>>> 02 01 06 11 06 64 97 81 d1 ed ba 6b ac 11 4c 9d .....d.....k..L.
>>> 34 3e 20 09 73 4> .s
>>>> HCI Event: Command Status (0x0f) plen 4 [hci0] 460.883763
>>> LE Read Remote Used Features (0x08|0x0016) ncmd 1
>>> Status: Success (0x00)
>>>> HCI Event: LE Meta Event (0x3e) plen 12 [hci0] 461.313621
>>> LE Read Remote Used Features (0x04)
>>> Status: Connection Failed to be Established (0x3e)
>>> Handle: 64
>>> Features: 0x1f 0x00 0x00 0x00 0x00 0x00 0x00 0x00
>>> LE Encryption
>>> Connection Parameter Request Procedure
>>> Extended Reject Indication
>>> Slave-initiated Features Exchange
>>> LE Ping
>>>> HCI Event: Disconnect Complete (0x05) plen 4 [hci0] 461.314141
>>> Status: Success (0x00)
>>> Handle: 64
>>> Reason: Connection Failed to be Established (0x3e)
>>> @ Device Disconnected: 00:07:80:CF:3E:94 (1) reason 0
>>
>> never mind then. So we actually get disconnected here. Which means the connection never fully established. That can actually happen even after a successful LE Connection Complete. Mainly through interference etc.
>
> Yes, always when the controller sends a disconnected event due to
> Connection Failed To Be Established, the other pending commands (e.g.
> read remote features / read remote version) are also completed with
> the same status code (prior to the disconnected event).
>
>>
>> I am fine treating the error 0x3e special, but we need to figure out what is the best socket error code to map it to. I am reluctant to use EAGAIN since that is kinda dedicated meaning for non-blocking operation. Keep in mind that EAGAIN and EWOULDBLOCK are the same number.
>>
>> This needs some research on how normal network sockets would handle this. If they do it at all. However things like ECANCELED or ECONNABORTED might be a good choice.
>>
>>>> If that is the case, then I have to say that it is a bit sad if this leaks through via HCI. I would have expected that the controllers hides this from the host. In case you have an Intel or Broadcom dongle, can you enable LL traces via /sys/kernel/debug/bluetooth/hci0/vendor_diag (just echo 1 > into it). That way we also see the LL traces going over the air and can analyse this.
>>>
>>> I'm afraid not: it's the BCM43438 chip build into the RPi 3 and
>>> doesn't provide that option.
>>
>> It does support it, but if you use Rasparian or similar, then they use the old hciattach hacked up code for getting the chip recognised. The newly btattach tool and kernel drivers for Broadcom chips will enable that. It is pretty useful.
>>
>> Do you want to work on using hci_bcm.c on the RPi3? Would be useful to have it natively integrated. I tried Fedora 25, but the SD card controller the card sits on wasn’t upstream yet.
>>
>>>> However my bet is that on this specific error, we should just send the LE Read Remote Used Features command at least one more time before we give up on the connection. If that would work, I do not want to send an EAGIN to the userspace socket. We want to at least hide that part in the kernel if can handle it.
>
> There is never a reason to retry a Read Remote Features command and
> similar since if the controller fails for any reason, it sends a
> disconnect event immediately after.
>
>>>
>>> What should be the error message on failure? It seems to be a very
>>> sporadic error. Also, I'm 99% sure it's not the remote hardware, since
>>> I've tested it a lot on Linux with a Broadcom dongle and never
>>> received an error.
>>
>> As I said, I am pretty sure that is some interference or scheduling issue locally and the controller missing the LL connection event that confirm the connection is really alive and active. We know that occasionally that can happen. A newer firmware from Broadcom might also make this more robust. Nevertheless, reporting this better to userspace is a good fix. Just need to figure out the best error code.
>
> There are basically 4 reasons the connection fails to establish:
> 1. Radio interference. If the CONNECT_REQ is not received correctly by
> the advertiser, the advertiser never enters the connection state.
> Since there is no retransmission of the CONNECT_REQ in case of
> failure, this is a more common cause of the problem than when a data
> packet is transmitted (there are 6 attempts to exchange data packets).
> Also, if a scanner sends a SCAN_REQ at the same time someone sends the
> CONNECT_REQ, the packets interfere and the advertiser just gets
> garbage and fails to create the connection.
> 2. The clock in either the master or the slave is outside the allowed
> range, causing missed packets.
> 3. The advertiser uses the white list feature and the initiator is
> currently not in the white list, causing the advertiser to drop the
> CONNECT_REQ.
> 4. The advertiser already has a connection to the initiator and
> therefore drops the CONNECT_REQ. The standard says "If an advertiser
> receives a connection request from an initiator it is already
> connected to, it shall ignore that request.". (Not all controllers
> obey this rule since it was introduced later than the 4.0 spec was
> released and some other companies have missed this requirement.) If
> the initiator host makes sure it doesn't create a new connection to a
> device it is already connected to, this shouldn't happen however.
>
> The Bluetooth spec treats a connection to be created even though the
> slave has not yet acknowledged, in contrast to for example TCP.
> Therefore the Connection Failed To Be Established error should be
> treated similar to supervision timeout. The Bluetooth spec says "The
> only difference between a connection that is created and a connection
> that is established is the Link Layer connection supervision timeout
> value that is used". But due to cause 3 above, I would still treat
> them different.
> In comparison, in Android Lollipop and up, the user does not get
> informed about that the connection has been successfully created until
> both the Read Remote Features and Read Remote Version commands have
> completed.
>
> As an application writer, you normally would like to distinguish
> Connection Failed To Be Established to for example Connection Limit
> Exceeded since you would normally like to try again a few times when
> you get the first error, unless you are aware of that the advertising
> device might use the white list.

the ENOSYS error returning to the socket is currently bad. We should choose a better error code at least.

However you are right, I think in case of the 0x3e error returned from the controller, the kernel should try to establish the connection again. And only after x failed attempts report an error back to the socket.

Regards

Marcel


2017-01-10 21:27:14

by Emil Lenngren

[permalink] [raw]
Subject: Re: Adding EAGAIN on 0x3e (Connection faile to be established) in net/bluetooth/lib.c/bt_to_errno()?

Hi,

2017-01-10 20:22 GMT+01:00 Marcel Holtmann <[email protected]>:
> Hi Edward,
>
>>> I bet that if you take a sniffer and look at the raw air packets, then =
this means that CONNECT_REQ has been sent. The initiator then moves into co=
nnection state, but reality is that only after receiving the first data pac=
ket, the connection is fully established. Between the CONNECT_REQ and the f=
irst packet, things can actually go wrong.

At the same moment the controller sends the CONNECT_REQ over the air,
it sends the LE Connection Complete Event to the host. This means that
the connection is considered created even though the slave has not yet
acknowledged.
After this the master link layer starts to send out packets at every
connection interval. If it doesn't get any response after 6 connection
intervals, it gives up and the link gets reported as disconnected.

>>
>> I don't have a hardware sniffer: all I've got are tools to parse the HCI=
dump.
>>
>>> Can you send the whole trace for it? I wonder if we get a disconnect ev=
ent as well, or just an indication that the request for the remote used fea=
tures did not complete. And at a later LL connection event, the connection =
would successfully establish.
>>
>> Here's the complete HCI log, from Create Connection onwards:
>>
>> < HCI Command: LE Create Connection (0x08|0x000d) plen 25
>>
>> [hci0] 460.552896
>> Scan interval: 60.000 msec (0x0060)
>> Scan window: 60.000 msec (0x0060)
>> Filter policy: White list is not used (0x00)
>> Peer address type: Public (0x00)
>> Peer address: 00:07:80:CF:3E:94 (Bluegiga Technologies OY)
>> Own address type: Public (0x00)
>> Min connection interval: 50.00 msec (0x0028)
>> Max connection interval: 70.00 msec (0x0038)
>> Connection latency: 0x0000
>> Supervision timeout: 420 msec (0x002a)
>> Min connection length: 0.000 msec (0x0000)
>> Max connection length: 0.000 msec (0x0000)
>>> HCI Event: Command Status (0x0f) plen 4 =
=
[hci0] 460.553725
>> LE Create Connection (0x08|0x000d) ncmd 1
>> Status: Success (0x00)
>>> HCI Event: LE Meta Event (0x3e) plen 19 =
=
[hci0] 460.882366
>> LE Connection Complete (0x01)
>> Status: Success (0x00)
>> Handle: 64
>> Role: Master (0x00)
>> Peer address type: Public (0x00)
>> Peer address: 00:07:80:CF:3E:94 (Bluegiga Technologies OY)
>> Connection interval: 67.50 msec (0x0036)
>> Connection latency: 0.00 msec (0x0000)
>> Supervision timeout: 420 msec (0x002a)
>> Master clock accuracy: 0x00
>> < HCI Command: LE Read Remote Used Features (0x08|0x0016) plen 2
>>
>> [hci0] 460.882691
>> Handle: 64
>> @ Device Connected: 00:07:80:CF:3E:94 (1) flags 0x0000
>> 02 01 06 11 06 64 97 81 d1 ed ba 6b ac 11 4c 9d .....d.....k..L.
>> 34 3e 20 09 73 4> .s
>>> HCI Event: Command Status (0x0f) plen 4 =
=
[hci0] 460.883763
>> LE Read Remote Used Features (0x08|0x0016) ncmd 1
>> Status: Success (0x00)
>>> HCI Event: LE Meta Event (0x3e) plen 12 =
=
[hci0] 461.313621
>> LE Read Remote Used Features (0x04)
>> Status: Connection Failed to be Established (0x3e)
>> Handle: 64
>> Features: 0x1f 0x00 0x00 0x00 0x00 0x00 0x00 0x00
>> LE Encryption
>> Connection Parameter Request Procedure
>> Extended Reject Indication
>> Slave-initiated Features Exchange
>> LE Ping
>>> HCI Event: Disconnect Complete (0x05) plen 4 =
=
[hci0] 461.314141
>> Status: Success (0x00)
>> Handle: 64
>> Reason: Connection Failed to be Established (0x3e)
>> @ Device Disconnected: 00:07:80:CF:3E:94 (1) reason 0
>
> never mind then. So we actually get disconnected here. Which means the co=
nnection never fully established. That can actually happen even after a suc=
cessful LE Connection Complete. Mainly through interference etc.

Yes, always when the controller sends a disconnected event due to
Connection Failed To Be Established, the other pending commands (e.g.
read remote features / read remote version) are also completed with
the same status code (prior to the disconnected event).

>
> I am fine treating the error 0x3e special, but we need to figure out what=
is the best socket error code to map it to. I am reluctant to use EAGAIN s=
ince that is kinda dedicated meaning for non-blocking operation. Keep in mi=
nd that EAGAIN and EWOULDBLOCK are the same number.
>
> This needs some research on how normal network sockets would handle this.=
If they do it at all. However things like ECANCELED or ECONNABORTED might =
be a good choice.
>
>>> If that is the case, then I have to say that it is a bit sad if this le=
aks through via HCI. I would have expected that the controllers hides this =
from the host. In case you have an Intel or Broadcom dongle, can you enable=
LL traces via /sys/kernel/debug/bluetooth/hci0/vendor_diag (just echo 1 > =
into it). That way we also see the LL traces going over the air and can ana=
lyse this.
>>
>> I'm afraid not: it's the BCM43438 chip build into the RPi 3 and
>> doesn't provide that option.
>
> It does support it, but if you use Rasparian or similar, then they use th=
e old hciattach hacked up code for getting the chip recognised. The newly b=
tattach tool and kernel drivers for Broadcom chips will enable that. It is =
pretty useful.
>
> Do you want to work on using hci_bcm.c on the RPi3? Would be useful to ha=
ve it natively integrated. I tried Fedora 25, but the SD card controller th=
e card sits on wasn=E2=80=99t upstream yet.
>
>>> However my bet is that on this specific error, we should just send the =
LE Read Remote Used Features command at least one more time before we give =
up on the connection. If that would work, I do not want to send an EAGIN to=
the userspace socket. We want to at least hide that part in the kernel if =
can handle it.

There is never a reason to retry a Read Remote Features command and
similar since if the controller fails for any reason, it sends a
disconnect event immediately after.

>>
>> What should be the error message on failure? It seems to be a very
>> sporadic error. Also, I'm 99% sure it's not the remote hardware, since
>> I've tested it a lot on Linux with a Broadcom dongle and never
>> received an error.
>
> As I said, I am pretty sure that is some interference or scheduling issue=
locally and the controller missing the LL connection event that confirm th=
e connection is really alive and active. We know that occasionally that can=
happen. A newer firmware from Broadcom might also make this more robust. N=
evertheless, reporting this better to userspace is a good fix. Just need to=
figure out the best error code.

There are basically 4 reasons the connection fails to establish:
1. Radio interference. If the CONNECT_REQ is not received correctly by
the advertiser, the advertiser never enters the connection state.
Since there is no retransmission of the CONNECT_REQ in case of
failure, this is a more common cause of the problem than when a data
packet is transmitted (there are 6 attempts to exchange data packets).
Also, if a scanner sends a SCAN_REQ at the same time someone sends the
CONNECT_REQ, the packets interfere and the advertiser just gets
garbage and fails to create the connection.
2. The clock in either the master or the slave is outside the allowed
range, causing missed packets.
3. The advertiser uses the white list feature and the initiator is
currently not in the white list, causing the advertiser to drop the
CONNECT_REQ.
4. The advertiser already has a connection to the initiator and
therefore drops the CONNECT_REQ. The standard says "If an advertiser
receives a connection request from an initiator it is already
connected to, it shall ignore that request.". (Not all controllers
obey this rule since it was introduced later than the 4.0 spec was
released and some other companies have missed this requirement.) If
the initiator host makes sure it doesn't create a new connection to a
device it is already connected to, this shouldn't happen however.

The Bluetooth spec treats a connection to be created even though the
slave has not yet acknowledged, in contrast to for example TCP.
Therefore the Connection Failed To Be Established error should be
treated similar to supervision timeout. The Bluetooth spec says "The
only difference between a connection that is created and a connection
that is established is the Link Layer connection supervision timeout
value that is used". But due to cause 3 above, I would still treat
them different.
In comparison, in Android Lollipop and up, the user does not get
informed about that the connection has been successfully created until
both the Read Remote Features and Read Remote Version commands have
completed.

As an application writer, you normally would like to distinguish
Connection Failed To Be Established to for example Connection Limit
Exceeded since you would normally like to try again a few times when
you get the first error, unless you are aware of that the advertising
device might use the white list.

>
> Regards
>
> Marcel

2017-01-10 19:22:43

by Marcel Holtmann

[permalink] [raw]
Subject: Re: Adding EAGAIN on 0x3e (Connection faile to be established) in net/bluetooth/lib.c/bt_to_errno()?

Hi Edward,

>> I bet that if you take a sniffer and look at the raw air packets, then this means that CONNECT_REQ has been sent. The initiator then moves into connection state, but reality is that only after receiving the first data packet, the connection is fully established. Between the CONNECT_REQ and the first packet, things can actually go wrong.
>
> I don't have a hardware sniffer: all I've got are tools to parse the HCI dump.
>
>> Can you send the whole trace for it? I wonder if we get a disconnect event as well, or just an indication that the request for the remote used features did not complete. And at a later LL connection event, the connection would successfully establish.
>
> Here's the complete HCI log, from Create Connection onwards:
>
> < HCI Command: LE Create Connection (0x08|0x000d) plen 25
>
> [hci0] 460.552896
> Scan interval: 60.000 msec (0x0060)
> Scan window: 60.000 msec (0x0060)
> Filter policy: White list is not used (0x00)
> Peer address type: Public (0x00)
> Peer address: 00:07:80:CF:3E:94 (Bluegiga Technologies OY)
> Own address type: Public (0x00)
> Min connection interval: 50.00 msec (0x0028)
> Max connection interval: 70.00 msec (0x0038)
> Connection latency: 0x0000
> Supervision timeout: 420 msec (0x002a)
> Min connection length: 0.000 msec (0x0000)
> Max connection length: 0.000 msec (0x0000)
>> HCI Event: Command Status (0x0f) plen 4 [hci0] 460.553725
> LE Create Connection (0x08|0x000d) ncmd 1
> Status: Success (0x00)
>> HCI Event: LE Meta Event (0x3e) plen 19 [hci0] 460.882366
> LE Connection Complete (0x01)
> Status: Success (0x00)
> Handle: 64
> Role: Master (0x00)
> Peer address type: Public (0x00)
> Peer address: 00:07:80:CF:3E:94 (Bluegiga Technologies OY)
> Connection interval: 67.50 msec (0x0036)
> Connection latency: 0.00 msec (0x0000)
> Supervision timeout: 420 msec (0x002a)
> Master clock accuracy: 0x00
> < HCI Command: LE Read Remote Used Features (0x08|0x0016) plen 2
>
> [hci0] 460.882691
> Handle: 64
> @ Device Connected: 00:07:80:CF:3E:94 (1) flags 0x0000
> 02 01 06 11 06 64 97 81 d1 ed ba 6b ac 11 4c 9d .....d.....k..L.
> 34 3e 20 09 73 4> .s
>> HCI Event: Command Status (0x0f) plen 4 [hci0] 460.883763
> LE Read Remote Used Features (0x08|0x0016) ncmd 1
> Status: Success (0x00)
>> HCI Event: LE Meta Event (0x3e) plen 12 [hci0] 461.313621
> LE Read Remote Used Features (0x04)
> Status: Connection Failed to be Established (0x3e)
> Handle: 64
> Features: 0x1f 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> LE Encryption
> Connection Parameter Request Procedure
> Extended Reject Indication
> Slave-initiated Features Exchange
> LE Ping
>> HCI Event: Disconnect Complete (0x05) plen 4 [hci0] 461.314141
> Status: Success (0x00)
> Handle: 64
> Reason: Connection Failed to be Established (0x3e)
> @ Device Disconnected: 00:07:80:CF:3E:94 (1) reason 0

never mind then. So we actually get disconnected here. Which means the connection never fully established. That can actually happen even after a successful LE Connection Complete. Mainly through interference etc.

I am fine treating the error 0x3e special, but we need to figure out what is the best socket error code to map it to. I am reluctant to use EAGAIN since that is kinda dedicated meaning for non-blocking operation. Keep in mind that EAGAIN and EWOULDBLOCK are the same number.

This needs some research on how normal network sockets would handle this. If they do it at all. However things like ECANCELED or ECONNABORTED might be a good choice.

>> If that is the case, then I have to say that it is a bit sad if this leaks through via HCI. I would have expected that the controllers hides this from the host. In case you have an Intel or Broadcom dongle, can you enable LL traces via /sys/kernel/debug/bluetooth/hci0/vendor_diag (just echo 1 > into it). That way we also see the LL traces going over the air and can analyse this.
>
> I'm afraid not: it's the BCM43438 chip build into the RPi 3 and
> doesn't provide that option.

It does support it, but if you use Rasparian or similar, then they use the old hciattach hacked up code for getting the chip recognised. The newly btattach tool and kernel drivers for Broadcom chips will enable that. It is pretty useful.

Do you want to work on using hci_bcm.c on the RPi3? Would be useful to have it natively integrated. I tried Fedora 25, but the SD card controller the card sits on wasn’t upstream yet.

>> However my bet is that on this specific error, we should just send the LE Read Remote Used Features command at least one more time before we give up on the connection. If that would work, I do not want to send an EAGIN to the userspace socket. We want to at least hide that part in the kernel if can handle it.
>
> What should be the error message on failure? It seems to be a very
> sporadic error. Also, I'm 99% sure it's not the remote hardware, since
> I've tested it a lot on Linux with a Broadcom dongle and never
> received an error.

As I said, I am pretty sure that is some interference or scheduling issue locally and the controller missing the LL connection event that confirm the connection is really alive and active. We know that occasionally that can happen. A newer firmware from Broadcom might also make this more robust. Nevertheless, reporting this better to userspace is a good fix. Just need to figure out the best error code.

Regards

Marcel


2017-01-10 19:05:06

by Edward Rosten

[permalink] [raw]
Subject: Re: Adding EAGAIN on 0x3e (Connection faile to be established) in net/bluetooth/lib.c/bt_to_errno()?

On 10 January 2017 at 18:49, Marcel Holtmann <[email protected]> wrote:
> Hi Edward,
>
> I bet that if you take a sniffer and look at the raw air packets, then th=
is means that CONNECT_REQ has been sent. The initiator then moves into conn=
ection state, but reality is that only after receiving the first data packe=
t, the connection is fully established. Between the CONNECT_REQ and the fir=
st packet, things can actually go wrong.

I don't have a hardware sniffer: all I've got are tools to parse the HCI du=
mp.

> Can you send the whole trace for it? I wonder if we get a disconnect even=
t as well, or just an indication that the request for the remote used featu=
res did not complete. And at a later LL connection event, the connection wo=
uld successfully establish.

Here's the complete HCI log, from Create Connection onwards:

< HCI Command: LE Create Connection (0x08|0x000d) plen 25

[hci0] 460.552896
Scan interval: 60.000 msec (0x0060)
Scan window: 60.000 msec (0x0060)
Filter policy: White list is not used (0x00)
Peer address type: Public (0x00)
Peer address: 00:07:80:CF:3E:94 (Bluegiga Technologies OY)
Own address type: Public (0x00)
Min connection interval: 50.00 msec (0x0028)
Max connection interval: 70.00 msec (0x0038)
Connection latency: 0x0000
Supervision timeout: 420 msec (0x002a)
Min connection length: 0.000 msec (0x0000)
Max connection length: 0.000 msec (0x0000)
> HCI Event: Command Status (0x0f) plen 4 =
=
[hci0] 460.553725
LE Create Connection (0x08|0x000d) ncmd 1
Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 19 =
=
[hci0] 460.882366
LE Connection Complete (0x01)
Status: Success (0x00)
Handle: 64
Role: Master (0x00)
Peer address type: Public (0x00)
Peer address: 00:07:80:CF:3E:94 (Bluegiga Technologies OY)
Connection interval: 67.50 msec (0x0036)
Connection latency: 0.00 msec (0x0000)
Supervision timeout: 420 msec (0x002a)
Master clock accuracy: 0x00
< HCI Command: LE Read Remote Used Features (0x08|0x0016) plen 2

[hci0] 460.882691
Handle: 64
@ Device Connected: 00:07:80:CF:3E:94 (1) flags 0x0000
02 01 06 11 06 64 97 81 d1 ed ba 6b ac 11 4c 9d .....d.....k..L.
34 3e 20 09 73 4> .s
> HCI Event: Command Status (0x0f) plen 4 =
=
[hci0] 460.883763
LE Read Remote Used Features (0x08|0x0016) ncmd 1
Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 12 =
=
[hci0] 461.313621
LE Read Remote Used Features (0x04)
Status: Connection Failed to be Established (0x3e)
Handle: 64
Features: 0x1f 0x00 0x00 0x00 0x00 0x00 0x00 0x00
LE Encryption
Connection Parameter Request Procedure
Extended Reject Indication
Slave-initiated Features Exchange
LE Ping
> HCI Event: Disconnect Complete (0x05) plen 4 =
=
[hci0] 461.314141
Status: Success (0x00)
Handle: 64
Reason: Connection Failed to be Established (0x3e)
@ Device Disconnected: 00:07:80:CF:3E:94 (1) reason 0


>
> If that is the case, then I have to say that it is a bit sad if this leak=
s through via HCI. I would have expected that the controllers hides this fr=
om the host. In case you have an Intel or Broadcom dongle, can you enable L=
L traces via /sys/kernel/debug/bluetooth/hci0/vendor_diag (just echo 1 > in=
to it). That way we also see the LL traces going over the air and can analy=
se this.

I'm afraid not: it's the BCM43438 chip build into the RPi 3 and
doesn't provide that option.

> However my bet is that on this specific error, we should just send the LE=
Read Remote Used Features command at least one more time before we give up=
on the connection. If that would work, I do not want to send an EAGIN to t=
he userspace socket. We want to at least hide that part in the kernel if ca=
n handle it.

What should be the error message on failure? It seems to be a very
sporadic error. Also, I'm 99% sure it's not the remote hardware, since
I've tested it a lot on Linux with a Broadcom dongle and never
received an error.

-Ed


>
>> In net/bluetooth/lib.c, there's a function, bt_to_errno(), which maps
>> HCI codes to errno numbers, there's no entry for 0x3e.
>>
>> I was going to submit a 2 line patch to return a sensible error code,
>> but I've come here to ask what the best choice would be:
>>
>> I currently think EAGAIN (Try Again---since trying again is usually
>> the appropriate choice), but this is the same number as EWOULDBLOCK. I
>> think it ought to be possible to distinguish all cases
>> EAGAIN/EWOULDBLOCK on a blocking socket is this kind of error. Getting
>> it on a non blocking socket would only ever mean "operation in
>> progress". Getting EAGAIN/EWOULDBLOCK from getsockopt(fd, SOL_SOCKET,
>> SO_ERROR, ...) would also only mean "try again".
>>
>> Does any one have any input on this? There's not a huge choice when it
>> comes to error codes, and so I think this is the best one, but I'm not
>> really sure.
>
> I need to see the full trace and if possible with LL tracing enabled, the=
n we can see what to do. As said above, if we can just handle it inside the=
kernel, then lets not bother the socket with it.
>
> Regards
>
> Marcel
>

2017-01-10 18:49:40

by Marcel Holtmann

[permalink] [raw]
Subject: Re: Adding EAGAIN on 0x3e (Connection faile to be established) in net/bluetooth/lib.c/bt_to_errno()?

Hi Edward,

> I've been doing some bluetooth LE work on an RPi 3. I've been getting
> occasional ENOSYS replies from connect()---actually from retreiving
> the error code via getsockopt since it's an async call. I'm apparently
> not the only person, though it seems to be very rare (link below).
> Drilling down, the relevant part of the HCI log (parsed by btmon) is
> this:
>
>
>> HCI Event: LE Meta Event (0x3e) plen 12 [hci0] 461.313621
> LE Read Remote Used Features (0x04)
> Status: Connection Failed to be Established (0x3e)
> Handle: 64
> Features: 0x1f 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> LE Encryption
> Connection Parameter Request Procedure
> Extended Reject Indication
> Slave-initiated Features Exchange
> LE Ping
>
> You can see the error code is 0x3e (coincidently the same as the LE
> Meta Event code!). According to the Core v4.0 spec (Volume 2, Part B,
> Section 1.3, Table 1.1) the code 0x3e which according to the spec is:
>
> 2.59 CONNECTION FAILED TO BE ESTABLISHED (0X3E)
> The Connection Failed to be Established error code indicates that the LL initi-
> ated a connection but the connection has failed to be established.
>
> In other words, it appears to be an error along the lines of
> "something wrong happened". On my machine it's very intermittent and
> I've found some other people reporting the same
> (https://www.raspberrypi.org/forums/viewtopic.php?f=63&t=145144&p=962612).
> I modified my code to retry 3 times (with a 500ms pause) on one of
> these errors and so far it's never failed on the second attempt.

I bet that if you take a sniffer and look at the raw air packets, then this means that CONNECT_REQ has been sent. The initiator then moves into connection state, but reality is that only after receiving the first data packet, the connection is fully established. Between the CONNECT_REQ and the first packet, things can actually go wrong.

Can you send the whole trace for it? I wonder if we get a disconnect event as well, or just an indication that the request for the remote used features did not complete. And at a later LL connection event, the connection would successfully establish.

If that is the case, then I have to say that it is a bit sad if this leaks through via HCI. I would have expected that the controllers hides this from the host. In case you have an Intel or Broadcom dongle, can you enable LL traces via /sys/kernel/debug/bluetooth/hci0/vendor_diag (just echo 1 > into it). That way we also see the LL traces going over the air and can analyse this.

However my bet is that on this specific error, we should just send the LE Read Remote Used Features command at least one more time before we give up on the connection. If that would work, I do not want to send an EAGIN to the userspace socket. We want to at least hide that part in the kernel if can handle it.

> In net/bluetooth/lib.c, there's a function, bt_to_errno(), which maps
> HCI codes to errno numbers, there's no entry for 0x3e.
>
> I was going to submit a 2 line patch to return a sensible error code,
> but I've come here to ask what the best choice would be:
>
> I currently think EAGAIN (Try Again---since trying again is usually
> the appropriate choice), but this is the same number as EWOULDBLOCK. I
> think it ought to be possible to distinguish all cases
> EAGAIN/EWOULDBLOCK on a blocking socket is this kind of error. Getting
> it on a non blocking socket would only ever mean "operation in
> progress". Getting EAGAIN/EWOULDBLOCK from getsockopt(fd, SOL_SOCKET,
> SO_ERROR, ...) would also only mean "try again".
>
> Does any one have any input on this? There's not a huge choice when it
> comes to error codes, and so I think this is the best one, but I'm not
> really sure.

I need to see the full trace and if possible with LL tracing enabled, then we can see what to do. As said above, if we can just handle it inside the kernel, then lets not bother the socket with it.

Regards

Marcel