LinuxLists.cc - detecting dead link

2010-04-14 08:10:48

Subject: detecting dead link

Hi,

It seems that most of the chips take a lot of time to detect that a
link in dead, due to out of range or power off (no hci disconnect),
the most common problem regarding this is when we try to connect SCO:

2010-03-31 18:40:19.975128 < HCI Command: Setup Synchronous Connection
(0x01|0x0028) plen 17
handle 11 voice setting 0x0060
2010-03-31 18:40:19.981780 > HCI Event: Command Status (0x0f) plen 4
Setup Synchronous Connection (0x01|0x0028) status 0x00 ncmd 1
2010-03-31 18:40:33.318023 < ACL data: handle 11 flags 0x02 dlen 22
L2CAP(d): cid 0x0073 len 18 [psm 0]
0000: 69 ef 1d 0d 0a 2b 43 49 45 56 3a 20 32 2c 33 0d i....+CIEV: 2,3.
0010: 0a 3e .>
2010-03-31 18:40:33.318145 < HCI Command: Exit Sniff Mode (0x02|0x0004) plen 2
handle 11
2010-03-31 18:40:33.324615 > HCI Event: Command Status (0x0f) plen 4
Exit Sniff Mode (0x02|0x0004) status 0x00 ncmd 1
2010-03-31 18:40:51.207885 < ACL data: handle 11 flags 0x02 dlen 22
L2CAP(d): cid 0x0073 len 18 [psm 0]
0000: 69 ef 1d 0d 0a 2b 43 49 45 56 3a 20 32 2c 32 0d i....+CIEV: 2,2.
0010: 0a 3e .>
2010-03-31 18:40:54.965362 > HCI Event: Max Slots Change (0x1b) plen 3
handle 11 slots 5
2010-03-31 18:40:54.966094 > HCI Event: Synchronous Connect Complete (0x2c)
plen 17
status 0x08 handle 0 bdaddr 00:1E:DC:B3:40:9D type eSCO
Error: Connection Timeout
2010-03-31 18:40:54.968353 > HCI Event: Number of Completed Packets (0x13) plen
5
handle 11 packets 2
2010-03-31 18:40:54.968658 > HCI Event: Disconn Complete (0x05) plen 4
status 0x00 handle 11 reason 0x08
Reason: Connection Timeout

It took 35 sec to timeout where connections attempt normally only take
5 sec due to page timeout (configurable in main.conf), it is probably
due to being in sniff mode but even after exiting it takes another 20
sec to timeout (probably lmp timeout).

So I was wondering why this timeout is not derived from page timeout,
to me it seems quite obvious that if the connection doesn't complete
within page timeout there is something wrong since that is supposed to
be much faster on an active link, also if that assumption is correct
what would be the impact to have such a logic that takes page timeout
as timeout for any connection attempt in active/dead links dropping
them if the timeout is reached, of course this has to take into
account the user authorization framework/defer setup so it only
timeout if there is no activity in the link.

--
Luiz Augusto von Dentz
Computer Engineer

2010-04-16 13:19:13

by Luiz Augusto von Dentz

[permalink] [raw]

Subject: Re: detecting dead link

Hi,

On Thu, Apr 15, 2010 at 12:09 AM, Mike Tsai <[email protected]> wrote:
>>>Ok, so I do miss understanding the situation. In that case, host can program link supervision timeout to a shorter time (like 3 or 4 seconds) as suggested. So the link supervision timeout shall kick in far ahead of LMP response timeout. The user will get disconnect notification sooner,

It seems that somebody already give us a favor and patented it:

http://www.freepatentsonline.com/y2009/0258596.html

That is exactly what I was looking for, depending on the link usage
(a2dp, sco) reduce the link supervision timeout to detect the link
loss faster. Well I guess the only option now is to do this
statically, which I don't think will address the problem completely
since we can only change the timeout once due to this patent.

Grrr, it is so obvious why people patent such generic usage, I really
hate software patents. Very frustrating.

--
Luiz Augusto von Dentz
Computer Engineer

2010-04-15 07:54:43

by Andrei Emeltchenko

[permalink] [raw]

Subject: Re: detecting dead link

Hi,

On Wed, Apr 14, 2010 at 7:31 PM, Mike Tsai <[email protected]> wrote:
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Luiz Augusto von Dentz
> Sent: Wednesday, April 14, 2010 1:11 AM
> To: [email protected]
> Subject: detecting dead link
>
> Hi,
>
> It seems that most of the chips take a lot of time to detect that a
> link in dead, due to out of range or power off (no hci disconnect),
> the most common problem regarding this is when we try to connect SCO:
>
> 2010-03-31 18:40:19.975128 < HCI Command: Setup Synchronous Connection
> (0x01|0x0028) plen 17
> ? ?handle 11 voice setting 0x0060
> 2010-03-31 18:40:19.981780 > HCI Event: Command Status (0x0f) plen 4
> ? ?Setup Synchronous Connection (0x01|0x0028) status 0x00 ncmd 1
> 2010-03-31 18:40:33.318023 < ACL data: handle 11 flags 0x02 dlen 22
> ? ?L2CAP(d): cid 0x0073 len 18 [psm 0]
> ? ? ?0000: 69 ef 1d 0d 0a 2b 43 49 ?45 56 3a 20 32 2c 33 0d ?i....+CIEV: 2,3.
> ? ? ?0010: 0a 3e ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? .>
> 2010-03-31 18:40:33.318145 < HCI Command: Exit Sniff Mode (0x02|0x0004) plen 2
> ? ?handle 11
> 2010-03-31 18:40:33.324615 > HCI Event: Command Status (0x0f) plen 4
> ? ?Exit Sniff Mode (0x02|0x0004) status 0x00 ncmd 1
> 2010-03-31 18:40:51.207885 < ACL data: handle 11 flags 0x02 dlen 22
> ? ?L2CAP(d): cid 0x0073 len 18 [psm 0]
> ? ? ?0000: 69 ef 1d 0d 0a 2b 43 49 ?45 56 3a 20 32 2c 32 0d ?i....+CIEV: 2,2.
> ? ? ?0010: 0a 3e ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? .>
> 2010-03-31 18:40:54.965362 > HCI Event: Max Slots Change (0x1b) plen 3
> ? ?handle 11 slots 5
> 2010-03-31 18:40:54.966094 > HCI Event: Synchronous Connect Complete (0x2c)
> plen 17
> ? ?status 0x08 handle 0 bdaddr 00:1E:DC:B3:40:9D type eSCO
> ? ?Error: Connection Timeout
> 2010-03-31 18:40:54.968353 > HCI Event: Number of Completed Packets (0x13) plen
> 5
> ? ?handle 11 packets 2
> 2010-03-31 18:40:54.968658 > HCI Event: Disconn Complete (0x05) plen 4
> ? ?status 0x00 handle 11 reason 0x08
> ? ?Reason: Connection Timeout
>
> It took 35 sec to timeout where connections attempt normally only take
> 5 sec due to page timeout (configurable in main.conf), it is probably
> due to being in sniff mode but even after exiting it takes another 20
> sec to timeout (probably lmp timeout).
>
> So I was wondering why this timeout is not derived from page timeout,
> to me it seems quite obvious that if the connection doesn't complete
> within page timeout there is something wrong since that is supposed to
> be much faster on an active link, also if that assumption is correct
> what would be the impact to have such a logic that takes page timeout
> as timeout for any connection attempt in active/dead links dropping
> them if the timeout is reached, of course this has to take into
> account the user authorization framework/defer setup so it only
> timeout if there is no activity in the link.
>
> Hi Luiz,
> [MT] Page timeout is for creating ACL data link. In this case, ACL link is already created so page timeout is not involved. The 20 second timeout is the default link supervision timeout and the 35 second timeout is the LMP timeout (actually 30 seconds). I think if you have the sniff log for the LMP, it would appear that one of the LMP might be lost during SCO negotiation and it is quite possibly due to the existing sniff mode. A proper implementation of host stack shall probably exit sniff mode first before start the SCO link creation. So the log would look like this,
>
> ? ? ? ?HCI command: Exit Sniff Mode
> ? ? ? ?HCI event: ? Command status, Exit Sniff Mode
> ? ? ? ?HCI event: ? Mode changed
> ? ? ? ?HCI command: Setup Synchronous Connection
>
> I think the host was sending the commands out of order. It asks controller to set up SCO link before it asks controller to exit sniff mode. Note that both commands are async commands (meaning it will request controller to go through several state transition for those commands), so the controller might have hard time executing the host commands.
>

I think we have recently applied the patch which makes this sequence correct:

http://git.kernel.org/?p=linux/kernel/git/holtmann/bluetooth-testing.git;a=commit;h=c390216b3e868b16d8154939f4b6f8c16dbd9a9f

Regards,
Andrei

2010-04-14 21:09:42

by Mike Tsai

[permalink] [raw]

Subject: RE: detecting dead link

-----Original Message-----
From: Luiz Augusto von Dentz [mailto:[email protected]]
Sent: Wednesday, April 14, 2010 1:34 PM
To: Mike Tsai
Cc: [email protected]
Subject: Re: detecting dead link

Hi,

On Wed, Apr 14, 2010 at 10:49 PM, Mike Tsai <[email protected]> wrote:
>>> I agree that 30 second LMP timeout is really too long, however it was defined in the spec. and is not changeable unless we go through errata process and get everyone in the Bluetooth CSWG to agree with it.
>
>>> There is another way to shorten the timeout which is to change the link supervision timeout to less than that, for instance, 2 seconds. But based on the log, it would seem to me that the link supervision timeout was not detected prior to LMP timeout (because disconnect event didn't come at 20 seconds).

Im not saying we have to change the LMP timeout, that is bellow hci
layer which I guess we have no control in BlueZ, obviously you are
misunderstanding me, there is no need to change the LMP timeout
whatsoever it is just that we don't need to wait for it.

> I guess it doesn't matter the order, as long as it detects the link
> loss faster, things like PulseAudio cannot afford 30 sec of the stream
> lost because the controller has lost the link, if the headset has been
> moved away or shutdown in the middle of the playback we want to route
> it back to the speaker as soon as possible.
>
>>> Here I am just stating the fact that multiple async commands from host can cause difficulty on the controller side because there is only 1 physical link available and the media needs to be shared. I guess it matters if we want less frustration from end user and better connection quality. This is a common practice from many popular Bluetooth host stacks. I think that we won't loss the link if host does the command order as suggested. Because shorten the link disconnection timeout will just let user realize that the link establishment failed sooner, does not change the fact that it is still a failure and user can't get what they want,

Sorry, again I guess you are misunderstanding me, the link is _lost_,
the device has being moved away or has been shutdown so I guess it
pretty impossible to give anything else but disconnect to the user, or
there is any way to maintain a link to an unreachable device?

>>Ok, so I do miss understanding the situation. In that case, host can program link supervision timeout to a shorter time (like 3 or 4 seconds) as suggested. So the link supervision timeout shall kick in far ahead of LMP response timeout. The user will get disconnect notification sooner,

Btw, the trace I gave here was from a carkit which when turned off
doesn't send any hci disconnect.

--
Luiz Augusto von Dentz
Computer Engineer

2010-04-14 20:34:26

by Luiz Augusto von Dentz

[permalink] [raw]

Subject: Re: detecting dead link

Hi,

On Wed, Apr 14, 2010 at 10:49 PM, Mike Tsai <[email protected]> wrote:
>>> I agree that 30 second LMP timeout is really too long, however it was defined in the spec. and is not changeable unless we go through errata process and get everyone in the Bluetooth CSWG to agree with it.
>
>>> There is another way to shorten the timeout which is to change the link supervision timeout to less than that, for instance, 2 seconds. But based on the log, it would seem to me that the link supervision timeout was not detected prior to LMP timeout (because disconnect event didn't come at 20 seconds).

Im not saying we have to change the LMP timeout, that is bellow hci
layer which I guess we have no control in BlueZ, obviously you are
misunderstanding me, there is no need to change the LMP timeout
whatsoever it is just that we don't need to wait for it.

> I guess it doesn't matter the order, as long as it detects the link
> loss faster, things like PulseAudio cannot afford 30 sec of the stream
> lost because the controller has lost the link, if the headset has been
> moved away or shutdown in the middle of the playback we want to route
> it back to the speaker as soon as possible.
>
>>> Here I am just stating the fact that multiple async commands from host can cause difficulty on the controller side because there is only 1 physical link available and the media needs to be shared. I guess it matters if we want less frustration from end user and better connection quality. This is a common practice from many popular Bluetooth host stacks. I think that we won't loss the link if host does the command order as suggested. Because shorten the link disconnection timeout will just let user realize that the link establishment failed sooner, does not change the fact that it is still a failure and user can't get what they want,

Sorry, again I guess you are misunderstanding me, the link is _lost_,
the device has being moved away or has been shutdown so I guess it
pretty impossible to give anything else but disconnect to the user, or
there is any way to maintain a link to an unreachable device?

Btw, the trace I gave here was from a carkit which when turned off
doesn't send any hci disconnect.

--
Luiz Augusto von Dentz
Computer Engineer

2010-04-14 19:49:11

by Mike Tsai

[permalink] [raw]

Subject: RE: detecting dead link

-----Original Message-----
From: Luiz Augusto von Dentz [mailto:[email protected]]
Sent: Wednesday, April 14, 2010 12:30 PM
To: Mike Tsai
Cc: [email protected]
Subject: Re: detecting dead link

Hi,

On Wed, Apr 14, 2010 at 7:31 PM, Mike Tsai <[email protected]> wrote:
> Hi Luiz,
> [MT] Page timeout is for creating ACL data link. In this case, ACL link is already created so page timeout is not involved. The 20 second timeout is the default link supervision timeout and the 35 second timeout is the LMP timeout (actually 30 seconds). I think if you have the sniff log for the LMP, it would appear that one of the LMP might be lost during SCO negotiation and it is quite possibly due to the existing sniff mode. A proper implementation of host stack shall probably exit sniff mode first before start the SCO link creation. So the log would look like this,
>
> ? ? ? ?HCI command: Exit Sniff Mode
> ? ? ? ?HCI event: ? Command status, Exit Sniff Mode
> ? ? ? ?HCI event: ? Mode changed
> ? ? ? ?HCI command: Setup Synchronous Connection

That exactly what Im arguing about, we have a 30 sec LMP timeout
compared to 5 sec of page timeout , if we are not willing to wait more
than 5 sec for a new ACL link why would we wait 30 sec? This is
completely disproportional and we should probably need to respond way
before that thus my suggestion to derive the LMP timeout for the page
timeout. This not only affect SCO connection but any attempt to
transfer data in the middle of the link loss.

>> I agree that 30 second LMP timeout is really too long, however it was defined in the spec. and is not changeable unless we go through errata process and get everyone in the Bluetooth CSWG to agree with it.

>> There is another way to shorten the timeout which is to change the link supervision timeout to less than that, for instance, 2 seconds. But based on the log, it would seem to me that the link supervision timeout was not detected prior to LMP timeout (because disconnect event didn't come at 20 seconds).

> I think the host was sending the commands out of order. It asks controller to set up SCO link before it asks controller to exit sniff mode. Note that both commands are async commands (meaning it will request controller to go through several state transition for those commands), so the controller might have hard time executing the host commands.

I guess it doesn't matter the order, as long as it detects the link
loss faster, things like PulseAudio cannot afford 30 sec of the stream
lost because the controller has lost the link, if the headset has been
moved away or shutdown in the middle of the playback we want to route
it back to the speaker as soon as possible.

>> Here I am just stating the fact that multiple async commands from host can cause difficulty on the controller side because there is only 1 physical link available and the media needs to be shared. I guess it matters if we want less frustration from end user and better connection quality. This is a common practice from many popular Bluetooth host stacks. I think that we won't loss the link if host does the command order as suggested. Because shorten the link disconnection timeout will just let user realize that the link establishment failed sooner, does not change the fact that it is still a failure and user can't get what they want,

--
Luiz Augusto von Dentz
Computer Engineer

2010-04-14 19:29:47

by Luiz Augusto von Dentz

[permalink] [raw]

Subject: Re: detecting dead link

Hi,

On Wed, Apr 14, 2010 at 7:31 PM, Mike Tsai <[email protected]> wrote:
> Hi Luiz,
> [MT] Page timeout is for creating ACL data link. In this case, ACL link is already created so page timeout is not involved. The 20 second timeout is the default link supervision timeout and the 35 second timeout is the LMP timeout (actually 30 seconds). I think if you have the sniff log for the LMP, it would appear that one of the LMP might be lost during SCO negotiation and it is quite possibly due to the existing sniff mode. A proper implementation of host stack shall probably exit sniff mode first before start the SCO link creation. So the log would look like this,
>
> ? ? ? ?HCI command: Exit Sniff Mode
> ? ? ? ?HCI event: ? Command status, Exit Sniff Mode
> ? ? ? ?HCI event: ? Mode changed
> ? ? ? ?HCI command: Setup Synchronous Connection

That exactly what Im arguing about, we have a 30 sec LMP timeout
compared to 5 sec of page timeout , if we are not willing to wait more
than 5 sec for a new ACL link why would we wait 30 sec? This is
completely disproportional and we should probably need to respond way
before that thus my suggestion to derive the LMP timeout for the page
timeout. This not only affect SCO connection but any attempt to
transfer data in the middle of the link loss.

> I think the host was sending the commands out of order. It asks controller to set up SCO link before it asks controller to exit sniff mode. Note that both commands are async commands (meaning it will request controller to go through several state transition for those commands), so the controller might have hard time executing the host commands.

I guess it doesn't matter the order, as long as it detects the link
loss faster, things like PulseAudio cannot afford 30 sec of the stream
lost because the controller has lost the link, if the headset has been
moved away or shutdown in the middle of the playback we want to route
it back to the speaker as soon as possible.

--
Luiz Augusto von Dentz
Computer Engineer

2010-04-14 16:31:40

by Mike Tsai

[permalink] [raw]

Subject: RE: detecting dead link

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Luiz Augusto von Dentz
Sent: Wednesday, April 14, 2010 1:11 AM
To: [email protected]
Subject: detecting dead link

Hi,

It seems that most of the chips take a lot of time to detect that a
link in dead, due to out of range or power off (no hci disconnect),
the most common problem regarding this is when we try to connect SCO:

2010-03-31 18:40:19.975128 < HCI Command: Setup Synchronous Connection
(0x01|0x0028) plen 17
handle 11 voice setting 0x0060
2010-03-31 18:40:19.981780 > HCI Event: Command Status (0x0f) plen 4
Setup Synchronous Connection (0x01|0x0028) status 0x00 ncmd 1
2010-03-31 18:40:33.318023 < ACL data: handle 11 flags 0x02 dlen 22
L2CAP(d): cid 0x0073 len 18 [psm 0]
0000: 69 ef 1d 0d 0a 2b 43 49 45 56 3a 20 32 2c 33 0d i....+CIEV: 2,3.
0010: 0a 3e .>
2010-03-31 18:40:33.318145 < HCI Command: Exit Sniff Mode (0x02|0x0004) plen 2
handle 11
2010-03-31 18:40:33.324615 > HCI Event: Command Status (0x0f) plen 4
Exit Sniff Mode (0x02|0x0004) status 0x00 ncmd 1
2010-03-31 18:40:51.207885 < ACL data: handle 11 flags 0x02 dlen 22
L2CAP(d): cid 0x0073 len 18 [psm 0]
0000: 69 ef 1d 0d 0a 2b 43 49 45 56 3a 20 32 2c 32 0d i....+CIEV: 2,2.
0010: 0a 3e .>
2010-03-31 18:40:54.965362 > HCI Event: Max Slots Change (0x1b) plen 3
handle 11 slots 5
2010-03-31 18:40:54.966094 > HCI Event: Synchronous Connect Complete (0x2c)
plen 17
status 0x08 handle 0 bdaddr 00:1E:DC:B3:40:9D type eSCO
Error: Connection Timeout
2010-03-31 18:40:54.968353 > HCI Event: Number of Completed Packets (0x13) plen
5
handle 11 packets 2
2010-03-31 18:40:54.968658 > HCI Event: Disconn Complete (0x05) plen 4
status 0x00 handle 11 reason 0x08
Reason: Connection Timeout

It took 35 sec to timeout where connections attempt normally only take
5 sec due to page timeout (configurable in main.conf), it is probably
due to being in sniff mode but even after exiting it takes another 20
sec to timeout (probably lmp timeout).

So I was wondering why this timeout is not derived from page timeout,
to me it seems quite obvious that if the connection doesn't complete
within page timeout there is something wrong since that is supposed to
be much faster on an active link, also if that assumption is correct
what would be the impact to have such a logic that takes page timeout
as timeout for any connection attempt in active/dead links dropping
them if the timeout is reached, of course this has to take into
account the user authorization framework/defer setup so it only
timeout if there is no activity in the link.

Hi Luiz,
[MT] Page timeout is for creating ACL data link. In this case, ACL link is already created so page timeout is not involved. The 20 second timeout is the default link supervision timeout and the 35 second timeout is the LMP timeout (actually 30 seconds). I think if you have the sniff log for the LMP, it would appear that one of the LMP might be lost during SCO negotiation and it is quite possibly due to the existing sniff mode. A proper implementation of host stack shall probably exit sniff mode first before start the SCO link creation. So the log would look like this,

HCI command: Exit Sniff Mode
HCI event: Command status, Exit Sniff Mode
HCI event: Mode changed
HCI command: Setup Synchronous Connection

I think the host was sending the commands out of order. It asks controller to set up SCO link before it asks controller to exit sniff mode. Note that both commands are async commands (meaning it will request controller to go through several state transition for those commands), so the controller might have hard time executing the host commands.

--
Luiz Augusto von Dentz
Computer Engineer