2012-08-03 12:15:17

by Frederic Danis

[permalink] [raw]
Subject: RFCOMM disconnection problem

Hello,

When I try to disconnect Blue Stereo 200 headset (from mrhandsfree) from
my PC, if headset was the initiator of the connection, the low level
(ACL) is not disconnected.
The device is still visible whith "hcitool con".

This happens with BlueZ 4.98 or from git upstream, and kernel 3.4.6.
There is no disconnection problem with kernel 3.2.

I also try to revert change "Bluetooth: Fix RFCOMM session reference
counting issue" (commit cf33e7 in linux-stable.git) from Octavian
Purdila in kernel 3.4.6.
With this kernel I do not get disconnection problem.

You can find attached hcidump traces.

Fred

--
Frederic Danis Open Source Technology Center
[email protected] Intel Corporation



Attachments:
hcidump.txt (64.49 kB)

2012-08-10 09:02:53

by Dean Jenkins

[permalink] [raw]
Subject: Re: RFCOMM disconnection problem

On 9 August 2012 15:46, Andrei Emeltchenko
<[email protected]> wrote:
> Hi Dean,
>
> On Thu, Aug 09, 2012 at 03:29:18PM +0100, Dean Jenkins wrote:
>> One obvious failure of the rfcomm session refcnt is that the refcnt
>> counter either starts with a value of 0 or 1 depending on which peer
>> initiated the connection request, that is wrong. The initiator
>> direction is not relevant for the session as connect and disconnect
>> are independent events. The refcnt should start with a value of 1 in
>> all cases.
>>
>> I am using a 2-core ARM environment that is under high processor
>> loading. The rfcomm session refcnt caused kernel crashes. I used a
>> 2.6.34 kernel but the latest 3.5-RC1 still has the poor rfcomm code.
>> My solution was to remove the rfcomm session refcnt and to ensure that
>> the freeing of the rfcomm session pointer was propagated through-out
>> the rfcomm core code. Some kernel crashes were due to reuse of the
>> freed rfcomm session pointer.
>
> Maybe it does make sense to fix refcounting instead of removing?
>
Hi Andrei,

The existing rfcomm session state machine is capable of determining
when to delete the rfcomm session structure. IMHO the refcnt is an
unnecessary complication. I have fixed rfcomm for our project by
removing the refcnt and ensuring no code can reuse a freed rfcomm
session pointer. This code I have already sent to the mailing list.

In order to get rfcomm to fail, high processor load is necessary
during disconnections to ensure that some of the loops in rfcomm
process more than 1 thing. It is because there are multiple copies of
the rfcomm session pointer s that failure occurs eg. double rfcomm
session free.

Here is an example in net/bluetooth/rfcomm/core.c:

static inline void rfcomm_process_rx(struct rfcomm_session *s)
{
struct socket *sock = s->sock;
struct sock *sk = sock->sk;
struct sk_buff *skb;

BT_DBG("session %p state %ld qlen %d", s, s->state,
skb_queue_len(&sk->sk_receive_queue));

/* Get data directly from socket receive queue without copying it. */
while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
skb_orphan(skb);
if (!skb_linearize(skb))
rfcomm_recv_frame(s, skb);
else
kfree_skb(skb);
}

if (sk->sk_state == BT_CLOSED) {
if (!s->initiator)
rfcomm_session_put(s);

rfcomm_session_close(s, sk->sk_err);
}
}

In rfcomm_process_rx(), you can see there is a while loop that causes
frames to be analysed and actioned in rfcomm_recv_frame(). This action
may cause the rfcomm session pointer to be freed because the rfcomm
session refcnt has reached zero. Therefore, the s pointer is now
invalid in the scope of rfcomm_process_rx() because the underlying
session structure was freed. Unfortunately, if the socket state is
also detected as BT_CLOSED then !s->initiator may cause a crash or at
least accesses wrong data as the s pointer is invalid (pointing to
freed memory or reallocated memory). The refcnt in
rfcomm_session_put(s) is also invalid as s is invalid. In fact looking
at the initiator value is wrong for the rfcomm session. I suspect this
was a workaround that did not fix the root cause.

In practice, this failure scenario is difficult to reproduce (we hit
it in our embedded environment). It requires the socket to be in the
BT_CLOSED at the same time as the rfcomm session has been freed in a
single run of rfcomm_process_rx(). This is why high processor loading
triggers the malfunction because rfcomm_process_rx() maybe pre-empted
or time-sliced so rfcomm_process_rx() takes longer to complete than
normal. This gives the opportunity for the socket state to change to
BT_CLOSED during the run-time of rfcomm_process_rx().

In other words, the refcnt going to zero does not prevent copies of
the rfcomm session pointer s that now points to freed memory (or
reallocated memory) from being erroneously reused.

You can also see

if (!s->initiator)
rfcomm_session_put(s);

rfcomm_session_close(s, sk->sk_err);

If rfcomm_session_put(s) succeeds in causing the refcnt to go to zero,
the rfcomm session structure will be freed. s is now invalid.
Unfortunately, rfcomm_session_close(s, sk->sk_err) reuses the s
pointer with a potential for a crash.

The refcnt could be fixed but fixing it will show it to be unnecessary
code IMHO. One of the weaknesses in the code is not managing the s
pointer after the session has been freed.

I do intend to release a patchset of all our rfcomm changes.

I welcome any comments.

Regards,
Dean

> Best regards
> Andrei Emeltchenko
>
>>
>> I intend to release a patchset of rfcomm fixes.
>>
>> Therefore, IMHO the fix "Bluetooth: Fix RFCOMM session reference
>> counting issue" (commit cf33e7 in linux-stable.git) from Octavian
>> Purdila in kernel 3.4.6 is in fact not fixing the root cause and
>> introduces a misbehaviour of the refcnt. In our project, we rejected
>> this commit because disconnections failed.
>>
>


--
Dean Jenkins
Embedded Software Engineer
Professional Services UK/EMEA
MontaVista Software, LLC

2012-08-09 14:46:02

by Andrei Emeltchenko

[permalink] [raw]
Subject: Re: RFCOMM disconnection problem

Hi Dean,

On Thu, Aug 09, 2012 at 03:29:18PM +0100, Dean Jenkins wrote:
> One obvious failure of the rfcomm session refcnt is that the refcnt
> counter either starts with a value of 0 or 1 depending on which peer
> initiated the connection request, that is wrong. The initiator
> direction is not relevant for the session as connect and disconnect
> are independent events. The refcnt should start with a value of 1 in
> all cases.
>
> I am using a 2-core ARM environment that is under high processor
> loading. The rfcomm session refcnt caused kernel crashes. I used a
> 2.6.34 kernel but the latest 3.5-RC1 still has the poor rfcomm code.
> My solution was to remove the rfcomm session refcnt and to ensure that
> the freeing of the rfcomm session pointer was propagated through-out
> the rfcomm core code. Some kernel crashes were due to reuse of the
> freed rfcomm session pointer.

Maybe it does make sense to fix refcounting instead of removing?

Best regards
Andrei Emeltchenko

>
> I intend to release a patchset of rfcomm fixes.
>
> Therefore, IMHO the fix "Bluetooth: Fix RFCOMM session reference
> counting issue" (commit cf33e7 in linux-stable.git) from Octavian
> Purdila in kernel 3.4.6 is in fact not fixing the root cause and
> introduces a misbehaviour of the refcnt. In our project, we rejected
> this commit because disconnections failed.
>


2012-08-09 14:29:18

by Dean Jenkins

[permalink] [raw]
Subject: Re: RFCOMM disconnection problem

On 3 August 2012 13:22, Frederic Danis <[email protected]> wrote:
> On 03/08/2012 14:15, Frederic Danis wrote:
>>
>> Hello,
>>
>> When I try to disconnect Blue Stereo 200 headset (from mrhandsfree) from
>> my PC, if headset was the initiator of the connection, the low level
>> (ACL) is not disconnected.
>> The device is still visible whith "hcitool con".
>>
>> This happens with BlueZ 4.98 or from git upstream, and kernel 3.4.6.
>> There is no disconnection problem with kernel 3.2.
>>
>> I also try to revert change "Bluetooth: Fix RFCOMM session reference
>> counting issue" (commit cf33e7 in linux-stable.git) from Octavian
>> Purdila in kernel 3.4.6.
>> With this kernel I do not get disconnection problem.
>>
>> You can find attached hcidump traces.
>
>
> Make a mistake, this was kernel log for l2cap and rfcomm.
>
> Find attached hcidump traces.
>
> Fred
>
>
> --
> Frederic Danis Open Source Technology Center
> [email protected] Intel Corporation
>

Hi Fred,

IMHO, kernel.org's rfcomm session refcnt is fundamentally broken and
is in fact redundant. Search back in the mailing list for my previous
proposed fixes.

One obvious failure of the rfcomm session refcnt is that the refcnt
counter either starts with a value of 0 or 1 depending on which peer
initiated the connection request, that is wrong. The initiator
direction is not relevant for the session as connect and disconnect
are independent events. The refcnt should start with a value of 1 in
all cases.

I am using a 2-core ARM environment that is under high processor
loading. The rfcomm session refcnt caused kernel crashes. I used a
2.6.34 kernel but the latest 3.5-RC1 still has the poor rfcomm code.
My solution was to remove the rfcomm session refcnt and to ensure that
the freeing of the rfcomm session pointer was propagated through-out
the rfcomm core code. Some kernel crashes were due to reuse of the
freed rfcomm session pointer.

I intend to release a patchset of rfcomm fixes.

Therefore, IMHO the fix "Bluetooth: Fix RFCOMM session reference
counting issue" (commit cf33e7 in linux-stable.git) from Octavian
Purdila in kernel 3.4.6 is in fact not fixing the root cause and
introduces a misbehaviour of the refcnt. In our project, we rejected
this commit because disconnections failed.

Regards,
Dean

--
Dean Jenkins
Embedded Software Engineer
Professional Services UK/EMEA
MontaVista Software, LLC

2012-08-03 12:22:26

by Frederic Danis

[permalink] [raw]
Subject: Re: RFCOMM disconnection problem

On 03/08/2012 14:15, Frederic Danis wrote:
> Hello,
>
> When I try to disconnect Blue Stereo 200 headset (from mrhandsfree) from
> my PC, if headset was the initiator of the connection, the low level
> (ACL) is not disconnected.
> The device is still visible whith "hcitool con".
>
> This happens with BlueZ 4.98 or from git upstream, and kernel 3.4.6.
> There is no disconnection problem with kernel 3.2.
>
> I also try to revert change "Bluetooth: Fix RFCOMM session reference
> counting issue" (commit cf33e7 in linux-stable.git) from Octavian
> Purdila in kernel 3.4.6.
> With this kernel I do not get disconnection problem.
>
> You can find attached hcidump traces.

Make a mistake, this was kernel log for l2cap and rfcomm.
Find attached hcidump traces.

Fred


--
Frederic Danis Open Source Technology Center
[email protected] Intel Corporation


Attachments:
hcidump.txt (19.62 kB)