Return-Path: Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.2 \(3259\)) Subject: Re: Adding EAGAIN on 0x3e (Connection faile to be established) in net/bluetooth/lib.c/bt_to_errno()? From: Marcel Holtmann In-Reply-To: Date: Wed, 11 Jan 2017 02:50:06 +0100 Cc: Bluez mailing list Message-Id: <9AE508C4-7C2E-4126-AF76-DC551C6142D4@holtmann.org> References: <1BE3F300-B719-44E3-971F-5015C5F65A78@holtmann.org> <49A36183-9C83-43B4-B462-3838E28E21C9@holtmann.org> To: Emil Lenngren Sender: linux-bluetooth-owner@vger.kernel.org List-ID: Hi Emil, >>>> I bet that if you take a sniffer and look at the raw air packets, then this means that CONNECT_REQ has been sent. The initiator then moves into connection state, but reality is that only after receiving the first data packet, the connection is fully established. Between the CONNECT_REQ and the first packet, things can actually go wrong. > > At the same moment the controller sends the CONNECT_REQ over the air, > it sends the LE Connection Complete Event to the host. This means that > the connection is considered created even though the slave has not yet > acknowledged. > After this the master link layer starts to send out packets at every > connection interval. If it doesn't get any response after 6 connection > intervals, it gives up and the link gets reported as disconnected. > >>> >>> I don't have a hardware sniffer: all I've got are tools to parse the HCI dump. >>> >>>> Can you send the whole trace for it? I wonder if we get a disconnect event as well, or just an indication that the request for the remote used features did not complete. And at a later LL connection event, the connection would successfully establish. >>> >>> Here's the complete HCI log, from Create Connection onwards: >>> >>> < HCI Command: LE Create Connection (0x08|0x000d) plen 25 >>> >>> [hci0] 460.552896 >>> Scan interval: 60.000 msec (0x0060) >>> Scan window: 60.000 msec (0x0060) >>> Filter policy: White list is not used (0x00) >>> Peer address type: Public (0x00) >>> Peer address: 00:07:80:CF:3E:94 (Bluegiga Technologies OY) >>> Own address type: Public (0x00) >>> Min connection interval: 50.00 msec (0x0028) >>> Max connection interval: 70.00 msec (0x0038) >>> Connection latency: 0x0000 >>> Supervision timeout: 420 msec (0x002a) >>> Min connection length: 0.000 msec (0x0000) >>> Max connection length: 0.000 msec (0x0000) >>>> HCI Event: Command Status (0x0f) plen 4 [hci0] 460.553725 >>> LE Create Connection (0x08|0x000d) ncmd 1 >>> Status: Success (0x00) >>>> HCI Event: LE Meta Event (0x3e) plen 19 [hci0] 460.882366 >>> LE Connection Complete (0x01) >>> Status: Success (0x00) >>> Handle: 64 >>> Role: Master (0x00) >>> Peer address type: Public (0x00) >>> Peer address: 00:07:80:CF:3E:94 (Bluegiga Technologies OY) >>> Connection interval: 67.50 msec (0x0036) >>> Connection latency: 0.00 msec (0x0000) >>> Supervision timeout: 420 msec (0x002a) >>> Master clock accuracy: 0x00 >>> < HCI Command: LE Read Remote Used Features (0x08|0x0016) plen 2 >>> >>> [hci0] 460.882691 >>> Handle: 64 >>> @ Device Connected: 00:07:80:CF:3E:94 (1) flags 0x0000 >>> 02 01 06 11 06 64 97 81 d1 ed ba 6b ac 11 4c 9d .....d.....k..L. >>> 34 3e 20 09 73 4> .s >>>> HCI Event: Command Status (0x0f) plen 4 [hci0] 460.883763 >>> LE Read Remote Used Features (0x08|0x0016) ncmd 1 >>> Status: Success (0x00) >>>> HCI Event: LE Meta Event (0x3e) plen 12 [hci0] 461.313621 >>> LE Read Remote Used Features (0x04) >>> Status: Connection Failed to be Established (0x3e) >>> Handle: 64 >>> Features: 0x1f 0x00 0x00 0x00 0x00 0x00 0x00 0x00 >>> LE Encryption >>> Connection Parameter Request Procedure >>> Extended Reject Indication >>> Slave-initiated Features Exchange >>> LE Ping >>>> HCI Event: Disconnect Complete (0x05) plen 4 [hci0] 461.314141 >>> Status: Success (0x00) >>> Handle: 64 >>> Reason: Connection Failed to be Established (0x3e) >>> @ Device Disconnected: 00:07:80:CF:3E:94 (1) reason 0 >> >> never mind then. So we actually get disconnected here. Which means the connection never fully established. That can actually happen even after a successful LE Connection Complete. Mainly through interference etc. > > Yes, always when the controller sends a disconnected event due to > Connection Failed To Be Established, the other pending commands (e.g. > read remote features / read remote version) are also completed with > the same status code (prior to the disconnected event). > >> >> I am fine treating the error 0x3e special, but we need to figure out what is the best socket error code to map it to. I am reluctant to use EAGAIN since that is kinda dedicated meaning for non-blocking operation. Keep in mind that EAGAIN and EWOULDBLOCK are the same number. >> >> This needs some research on how normal network sockets would handle this. If they do it at all. However things like ECANCELED or ECONNABORTED might be a good choice. >> >>>> If that is the case, then I have to say that it is a bit sad if this leaks through via HCI. I would have expected that the controllers hides this from the host. In case you have an Intel or Broadcom dongle, can you enable LL traces via /sys/kernel/debug/bluetooth/hci0/vendor_diag (just echo 1 > into it). That way we also see the LL traces going over the air and can analyse this. >>> >>> I'm afraid not: it's the BCM43438 chip build into the RPi 3 and >>> doesn't provide that option. >> >> It does support it, but if you use Rasparian or similar, then they use the old hciattach hacked up code for getting the chip recognised. The newly btattach tool and kernel drivers for Broadcom chips will enable that. It is pretty useful. >> >> Do you want to work on using hci_bcm.c on the RPi3? Would be useful to have it natively integrated. I tried Fedora 25, but the SD card controller the card sits on wasn’t upstream yet. >> >>>> However my bet is that on this specific error, we should just send the LE Read Remote Used Features command at least one more time before we give up on the connection. If that would work, I do not want to send an EAGIN to the userspace socket. We want to at least hide that part in the kernel if can handle it. > > There is never a reason to retry a Read Remote Features command and > similar since if the controller fails for any reason, it sends a > disconnect event immediately after. > >>> >>> What should be the error message on failure? It seems to be a very >>> sporadic error. Also, I'm 99% sure it's not the remote hardware, since >>> I've tested it a lot on Linux with a Broadcom dongle and never >>> received an error. >> >> As I said, I am pretty sure that is some interference or scheduling issue locally and the controller missing the LL connection event that confirm the connection is really alive and active. We know that occasionally that can happen. A newer firmware from Broadcom might also make this more robust. Nevertheless, reporting this better to userspace is a good fix. Just need to figure out the best error code. > > There are basically 4 reasons the connection fails to establish: > 1. Radio interference. If the CONNECT_REQ is not received correctly by > the advertiser, the advertiser never enters the connection state. > Since there is no retransmission of the CONNECT_REQ in case of > failure, this is a more common cause of the problem than when a data > packet is transmitted (there are 6 attempts to exchange data packets). > Also, if a scanner sends a SCAN_REQ at the same time someone sends the > CONNECT_REQ, the packets interfere and the advertiser just gets > garbage and fails to create the connection. > 2. The clock in either the master or the slave is outside the allowed > range, causing missed packets. > 3. The advertiser uses the white list feature and the initiator is > currently not in the white list, causing the advertiser to drop the > CONNECT_REQ. > 4. The advertiser already has a connection to the initiator and > therefore drops the CONNECT_REQ. The standard says "If an advertiser > receives a connection request from an initiator it is already > connected to, it shall ignore that request.". (Not all controllers > obey this rule since it was introduced later than the 4.0 spec was > released and some other companies have missed this requirement.) If > the initiator host makes sure it doesn't create a new connection to a > device it is already connected to, this shouldn't happen however. > > The Bluetooth spec treats a connection to be created even though the > slave has not yet acknowledged, in contrast to for example TCP. > Therefore the Connection Failed To Be Established error should be > treated similar to supervision timeout. The Bluetooth spec says "The > only difference between a connection that is created and a connection > that is established is the Link Layer connection supervision timeout > value that is used". But due to cause 3 above, I would still treat > them different. > In comparison, in Android Lollipop and up, the user does not get > informed about that the connection has been successfully created until > both the Read Remote Features and Read Remote Version commands have > completed. > > As an application writer, you normally would like to distinguish > Connection Failed To Be Established to for example Connection Limit > Exceeded since you would normally like to try again a few times when > you get the first error, unless you are aware of that the advertising > device might use the white list. the ENOSYS error returning to the socket is currently bad. We should choose a better error code at least. However you are right, I think in case of the 0x3e error returned from the controller, the kernel should try to establish the connection again. And only after x failed attempts report an error back to the socket. Regards Marcel