2015-06-05 23:11:09

by Dean Jenkins

[permalink] [raw]
Subject: [PATCH v1 0/7] Avoid L2CAP ERTM shutdown hung tasks

Details of the issue are described in the thread
[RFC] Bluetooth ERTM L2CAP deadlocks (hung tasks) due to l2cap_sock_shutdown()
and kernel.org Bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=99301

Summary
=======

a) __l2cap_wait_ack() gets stuck in an infinite loop whilst waiting for
I frame acknowledgements (ack is S or I frame from the remote device).
The infinite loop occurs because the waiting loop's condition never becomes
false. __l2cap_wait_ack() periodically sleeps for 200ms.

b) The L2CAP threads for processing received frames and error recovery
timers become blocked on a mutex held by l2cap_sock_shutdown() which
prevents updating of the loop condition in__l2cap_wait_ack().

Therefore a deadlock occurs.

This deadlock was observed on kernels 3.8 upto bluetooth-next on x86 and ARM
based platforms including the Raspberry Pi kernel 3.18.14+

The testcase is simple to reproduce the hung tasks, please wait upto 10 minutes
to observe the hung tasks:

On a laptop or other suitable device use the command:
l2test -i hci0 -X ertm -d
This command listens for connections with optional L2CAP ERTM supported and
dumps out the received data.

On the unit under test use the command:
l2test -s -b 1024 -B /dev/urandom -X ertm -P 4113 <lap_BT_ADDR>
This command sends 1024 bytes to the laptop via PSM 4113 used by l2test over
an ERTM channel. The command line shows:

l2test -s -b 1024 -B /dev/urandom -X ertm -P 4113 5C:F3:70:68:BA:DD
l2test[10366]: Connected to 5C:F3:70:68:BA:DD (bredr, psm 4113, scid 64)
l2test[10366]: Local device 5C:F3:70:68:BA:DA (bredr, psm 4113, scid 64)
l2test[10366]: Options [imtu 672, omtu 672, flush_to 65535, mode 3, handle 11, class 0x000000, priority 0, rcvbuf 163840]
l2test[10366]: Sending ...
l2test[10366]: Closing channel ...

Note that the channel fails to close because l2cap_sock_shutdown() is prevented
from returning due to __l2cap_wait_ack being stuck in an infinite loop.

Example hung tasks are:

[ 601.116119] INFO: task kworker/0:1:19 blocked for more than 120 seconds.
[ 601.116156] Not tainted 3.18.11+ #781
[ 601.116169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 601.116183] kworker/0:1 D c054cbcc 0 19 2 0x00000000
[ 601.116537] Workqueue: events l2cap_retrans_timeout [bluetooth]
[ 601.116685] [<c054cbcc>] (__schedule) from [<c054cfa4>] (schedule+0x40/0x8c)
[ 601.116727] [<c054cfa4>] (schedule) from [<c054d3c8>] (schedule_preempt_disabled+0x30/0x40)
[ 601.116812] [<c054d3c8>] (schedule_preempt_disabled) from [<c054ee14>] (__mutex_lock_slowpath+0xb8/0x18c)
[ 601.116856] [<c054ee14>] (__mutex_lock_slowpath) from [<c054ef20>] (mutex_lock+0x38/0x3c)
[ 601.117085] [<c054ef20>] (mutex_lock) from [<bf05173c>] (l2cap_retrans_timeout+0x24/0x5c [bluetooth])
[ 601.117371] [<bf05173c>] (l2cap_retrans_timeout [bluetooth]) from [<c003aa68>] (process_one_work+0x13c/0x444)
[ 601.117410] [<c003aa68>] (process_one_work) from [<c003b954>] (worker_thread+0x190/0x580)
[ 601.117445] [<c003b954>] (worker_thread) from [<c003fdb4>] (kthread+0xd8/0xf4)
[ 601.117488] [<c003fdb4>] (kthread) from [<c000e8c8>] (ret_from_fork+0x14/0x20)
[ 601.117569] INFO: task kworker/u3:2:327 blocked for more than 120 seconds.
[ 601.117588] Not tainted 3.18.11+ #781
[ 601.117601] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 601.117614] kworker/u3:2 D c054cbcc 0 327 2 0x00000000
[ 601.117806] Workqueue: hci0 hci_rx_work [bluetooth]
[ 601.117941] [<c054cbcc>] (__schedule) from [<c054cfa4>] (schedule+0x40/0x8c)
[ 601.117981] [<c054cfa4>] (schedule) from [<c054d3c8>] (schedule_preempt_disabled+0x30/0x40)
[ 601.118056] [<c054d3c8>] (schedule_preempt_disabled) from [<c054ee14>] (__mutex_lock_slowpath+0xb8/0x18c)
[ 601.118092] [<c054ee14>] (__mutex_lock_slowpath) from [<c054ef20>] (mutex_lock+0x38/0x3c)
[ 601.118307] [<c054ef20>] (mutex_lock) from [<bf04d2c8>] (l2cap_get_chan_by_scid+0x28/0x98 [bluetooth])
[ 601.118696] [<bf04d2c8>] (l2cap_get_chan_by_scid [bluetooth]) from [<bf054590>] (l2cap_recv_frame+0xa4/0x2cec [bluetooth])
[ 601.119146] [<bf054590>] (l2cap_recv_frame [bluetooth]) from [<bf058194>] (l2cap_recv_acldata+0x214/0x22c [bluetooth])
[ 601.119559] [<bf058194>] (l2cap_recv_acldata [bluetooth]) from [<bf02bf88>] (hci_rx_work+0x180/0x288 [bluetooth])
[ 601.119801] [<bf02bf88>] (hci_rx_work [bluetooth]) from [<c003aa68>] (process_one_work+0x13c/0x444)
[ 601.119839] [<c003aa68>] (process_one_work) from [<c003b954>] (worker_thread+0x190/0x580)
[ 601.119883] [<c003b954>] (worker_thread) from [<c003fdb4>] (kthread+0xd8/0xf4)
[ 601.119919] [<c003fdb4>] (kthread) from [<c000e8c8>] (ret_from_fork+0x14/0x20)

This is not an exhaustive list, there are other hung task flavours.

Solution
========

The solution was to reorganise l2cap_sock_shutdown() to reveal the purpose of
the function. The function shuts down the L2CAP channel but in the case when the
channel type is ERTM, __l2cap_wait_ack() is called to wait for acknowledgements
before the channel is shutdown.

There is some concurrency protection by using a lock on sk to prevent
multiple instances of l2cap_sock_shutdown() from running.

SHUTDOWN_MASK is used to inhibit new I frames and other PDUs from being
queued. Setting this flag was moved to before calling __l2cap_wait_ack() so that
the loop would not be waiting indefinitely for ACKs. There is some risk here
of not sending all pending data, however, an arbitrary cut-off point is needed
otherwise the link could remain connected. This may need improving to inhibit
only new I frames.

Extra protection to the sk and chan structures was added to prevent those
structures being freed whilst the locks were not held. This is defensive and
perhaps unnecessary.

__l2cap_wait_ack() was modified to be only called when necessary and the
while loop was replaced by a do while loop because there must be at least 1
waiting period. References to chan->conn were replaced by chan->state so that
the mutex locks could be removed. Also waiting for ACKs is only valid whilst the
connection is in the BT_CONNECTED state.

The scope of chan locking was reduced so that processing of received frames
and error handling routines could operate whilst waiting for ACKs.

Importantly, for __l2cap_wait_ack() a 10 second maximum wait limit was
implemented to prevent __l2cap_wait_ack() getting stuck in an infinite loop.
If no expected ACKs arrive then the channel is disconnected and the error
code -ENOLINK is returned to indicate that not all data was ACK'ed.

Testing
=======

This is being tested in kernel 3.8 on an ARM based board. Results so far
show that the hung tasks no longer occur.

Note that the failure scenario was Mediaplayer media browsing. Waiting for
ACK's is unimportant in this scenario because it does not matter whether
a media browsing request such as "get item" was successfully transferred at the
point of shutting down the channel.

Feedback
========

Please review and provide comments on this solution. Thanks.

Patchset
========

Patches are based on bluetooth-next:

Dean Jenkins (7):
Bluetooth: L2CAP ERTM shutdown protect sk and chan
Bluetooth: Make __l2cap_wait_ack more efficient
Bluetooth: Unwind l2cap_sock_shutdown()
Bluetooth: l2cap_sock_shutdown() remove mutex_lock calls
Bluetooth: l2cap_sock_shutdown() reduce scope of chan locking
Bluetooth: Add BT_DBG to l2cap_sock_shutdown()
Bluetooth: __l2cap_wait_ack() limit max waiting time

net/bluetooth/l2cap_sock.c | 69 +++++++++++++++++++++++++++++-----------------
1 file changed, 44 insertions(+), 25 deletions(-)

--
1.8.5.6


2015-06-10 11:18:27

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [PATCH v1 7/7] Bluetooth: __l2cap_wait_ack() limit max waiting time

Hi Dean,

>>> Add a limiter counter to prevent the do while loop
>>> running in an infinite loop. This ensures that the
>>> channel will be instructed to close within 10 seconds
>>> so prevents l2cap_sock_shutdown() getting stuck forever.
>>>
>>> Returns -ENOLINK when the limit is reached as the channel
>>> will be subequently closed and not all data was ACK'ed.
>>>
>>> Signed-off-by: Dean Jenkins <[email protected]>
>>> ---
>>> net/bluetooth/l2cap_sock.c | 11 ++++++++++-
>>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
>>> index 369ad0e..ee6531e 100644
>>> --- a/net/bluetooth/l2cap_sock.c
>>> +++ b/net/bluetooth/l2cap_sock.c
>>> @@ -1059,11 +1059,13 @@ static int __l2cap_wait_ack(struct sock *sk, struct l2cap_chan *chan)
>>> DECLARE_WAITQUEUE(wait, current);
>>> int err = 0;
>>> int timeo = HZ/5;
>>> + int limiter = 10 * 5; /* 10 seconds limit */
>> while reading this, should timeo not be using msecs_to_jiffies() in the first place.
>>
>> And with that, can we have a little bit better logic on how you get to 10 seconds. I had to scratch my head a bit to realise that this is 50 * 200 msec. It seems a bit error prone in case anyone ever changes something.
> Thanks for your comments. I will redo this by using a #define and use 2 separate commits; one to add limiter, the other to use msecs_to_jiffies()
>>
>>> add_wait_queue(sk_sleep(sk), &wait);
>>> set_current_state(TASK_INTERRUPTIBLE);
>>> do {
>>> - BT_DBG("Waiting for %d ACKs", chan->unacked_frames);
>>> + BT_DBG("Waiting for %d ACKs, limiter %d",
>>> + chan->unacked_frames, limiter);
>>>
>>> if (!timeo)
>>> timeo = HZ/5;
>> And with that, I have no idea why we are doing this check here. Seems rather pointless unless I misses something.
> Testing shows that schedule_timeout() can return before the timeo time period has expired. I do not know why schedule_timeout() is returning early, no signal is caught by the signal_pending() statement within the loop. This means that the patch is broken because limiter can decrement too fast so the elapsed time is less than 10 seconds. The fix is to move limiter-- to this location so that timeo is zero before doing limiter--.
>
> The limiter counter will not give an accurate period of 10 seconds and will be longer than 10 seconds in most cases. However, the 10 seconds is arbitrary so the accuracy is unimportant.
>
> Would you prefer an overall 10 second jiffies counter instead of using the limiter loop counter ? Such as
> keep looping until jiffies > start_jiffies + 10*HZ
> although jiffies overflow needs to be taken into account, right ?

but we have time_after helpers for that right? If so, then try that one.

Regards

Marcel


2015-06-10 10:31:41

by Dean Jenkins

[permalink] [raw]
Subject: Re: [PATCH v1 7/7] Bluetooth: __l2cap_wait_ack() limit max waiting time

Hi Marcel,

This patch is broken. Needs a redo. Please see inline comments.

On 06/06/15 05:21, Marcel Holtmann wrote:
> Hi Dean,
>
>> Add a limiter counter to prevent the do while loop
>> running in an infinite loop. This ensures that the
>> channel will be instructed to close within 10 seconds
>> so prevents l2cap_sock_shutdown() getting stuck forever.
>>
>> Returns -ENOLINK when the limit is reached as the channel
>> will be subequently closed and not all data was ACK'ed.
>>
>> Signed-off-by: Dean Jenkins <[email protected]>
>> ---
>> net/bluetooth/l2cap_sock.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
>> index 369ad0e..ee6531e 100644
>> --- a/net/bluetooth/l2cap_sock.c
>> +++ b/net/bluetooth/l2cap_sock.c
>> @@ -1059,11 +1059,13 @@ static int __l2cap_wait_ack(struct sock *sk, struct l2cap_chan *chan)
>> DECLARE_WAITQUEUE(wait, current);
>> int err = 0;
>> int timeo = HZ/5;
>> + int limiter = 10 * 5; /* 10 seconds limit */
> while reading this, should timeo not be using msecs_to_jiffies() in the first place.
>
> And with that, can we have a little bit better logic on how you get to 10 seconds. I had to scratch my head a bit to realise that this is 50 * 200 msec. It seems a bit error prone in case anyone ever changes something.
Thanks for your comments. I will redo this by using a #define and use 2
separate commits; one to add limiter, the other to use msecs_to_jiffies()
>
>> add_wait_queue(sk_sleep(sk), &wait);
>> set_current_state(TASK_INTERRUPTIBLE);
>> do {
>> - BT_DBG("Waiting for %d ACKs", chan->unacked_frames);
>> + BT_DBG("Waiting for %d ACKs, limiter %d",
>> + chan->unacked_frames, limiter);
>>
>> if (!timeo)
>> timeo = HZ/5;
> And with that, I have no idea why we are doing this check here. Seems rather pointless unless I misses something.
Testing shows that schedule_timeout() can return before the timeo time
period has expired. I do not know why schedule_timeout() is returning
early, no signal is caught by the signal_pending() statement within the
loop. This means that the patch is broken because limiter can decrement
too fast so the elapsed time is less than 10 seconds. The fix is to move
limiter-- to this location so that timeo is zero before doing limiter--.

The limiter counter will not give an accurate period of 10 seconds and
will be longer than 10 seconds in most cases. However, the 10 seconds is
arbitrary so the accuracy is unimportant.

Would you prefer an overall 10 second jiffies counter instead of using
the limiter loop counter ? Such as
keep looping until jiffies > start_jiffies + 10*HZ
although jiffies overflow needs to be taken into account, right ?


>
> I know these are not your bugs, but while we are at it, it might be better to really clean this out.
>
>> @@ -1081,6 +1083,13 @@ static int __l2cap_wait_ack(struct sock *sk, struct l2cap_chan *chan)
>> err = sock_error(sk);
>> if (err)
>> break;
>> +
>> + limiter--;
>> + if (!limiter) {
>> + err = -ENOLINK;
>> + break;
>> + }
>> +
>> } while (chan->unacked_frames > 0 &&
>> chan->state == BT_CONNECTED);
> Regards
>
> Marcel
>
Regards,
Dean

--
Dean Jenkins
Embedded Software Engineer
Linux Transportation Solutions
Mentor Embedded Software Division
Mentor Graphics (UK) Ltd.

2015-06-09 10:13:32

by Dean Jenkins

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] Bluetooth: Unwind l2cap_sock_shutdown()

Hi Szymon,

On 09/06/15 10:55, Szymon Janc wrote:
> Hi Dean,
>
> On Saturday 06 of June 2015 00:11:12 Dean Jenkins wrote:
>> l2cap_sock_shutdown() is designed to only action shutdown
>> of the channel when shutdown is not already in progress.
>> Therefore, reorganise the code flow by adding a goto
>> to jump to the end of function handling when shutdown is
>> already being actioned. This removes one level of code
>> indentation and make the code more readable.
>>
>> Signed-off-by: Dean Jenkins <[email protected]>
>> ---
>> net/bluetooth/l2cap_sock.c | 44
>> ++++++++++++++++++++++++-------------------- 1 file changed, 24
>> insertions(+), 20 deletions(-)
>>
>> diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
>> index c909f45..8550886 100644
>> --- a/net/bluetooth/l2cap_sock.c
>> +++ b/net/bluetooth/l2cap_sock.c
>> @@ -1099,6 +1099,13 @@ static int l2cap_sock_shutdown(struct socket *sock,
>> int how) if (!sk)
>> return 0;
>>
>> + lock_sock(sk);
>> +
>> + if (sk->sk_shutdown)
>> + goto shutdown_already;
>> +
>> + sk->sk_shutdown = SHUTDOWN_MASK;
>> +
>> /* prevent sk structure from being freed whilst unlocked */
>> sock_hold(sk);
>>
>> @@ -1113,30 +1120,21 @@ static int l2cap_sock_shutdown(struct socket *sock,
>> int how) mutex_lock(&conn->chan_lock);
>>
>> l2cap_chan_lock(chan);
>> - lock_sock(sk);
>>
>> - if (!sk->sk_shutdown) {
>> - if (chan->mode == L2CAP_MODE_ERTM &&
>> - chan->unacked_frames > 0 &&
>> - chan->state == BT_CONNECTED)
>> - err = __l2cap_wait_ack(sk, chan);
>> + if (chan->mode == L2CAP_MODE_ERTM &&
>> + chan->unacked_frames > 0 &&
>> + chan->state == BT_CONNECTED)
>> + err = __l2cap_wait_ack(sk, chan);
>>
>> - sk->sk_shutdown = SHUTDOWN_MASK;
>> -
>> - release_sock(sk);
>> - l2cap_chan_close(chan, 0);
>> - lock_sock(sk);
>> -
>> - if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
>> - !(current->flags & PF_EXITING))
>> - err = bt_sock_wait_state(sk, BT_CLOSED,
>> - sk->sk_lingertime);
>> - }
>> + release_sock(sk);
>> + l2cap_chan_close(chan, 0);
>> + lock_sock(sk);
>>
>> - if (!err && sk->sk_err)
>> - err = -sk->sk_err;
>> + if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
>> + !(current->flags & PF_EXITING))
>> + err = bt_sock_wait_state(sk, BT_CLOSED,
>> + sk->sk_lingertime);
>>
>> - release_sock(sk);
>> l2cap_chan_unlock(chan);
>>
>> if (conn)
>> @@ -1146,6 +1144,12 @@ static int l2cap_sock_shutdown(struct socket *sock,
>> int how) l2cap_chan_put(chan);
>> sock_put(sk);
> Isn't it possible that sk will be freed here? (at least this is what comment
> from patch 1 suggests).
At this point in the code, lock_sock(sk) is held so it should prevent
the freeing of sk.
I think my comment is misleading. I will rewrite my comment.

Thanks.
>
>> +shutdown_already:
>> + if (!err && sk->sk_err)
>> + err = -sk->sk_err;
>> +
>> + release_sock(sk);
>> +
>> return err;
>> }


--
Dean Jenkins
Embedded Software Engineer
Linux Transportation Solutions
Mentor Embedded Software Division
Mentor Graphics (UK) Ltd.


2015-06-09 09:55:40

by Szymon Janc

[permalink] [raw]
Subject: Re: [PATCH v1 3/7] Bluetooth: Unwind l2cap_sock_shutdown()

Hi Dean,

On Saturday 06 of June 2015 00:11:12 Dean Jenkins wrote:
> l2cap_sock_shutdown() is designed to only action shutdown
> of the channel when shutdown is not already in progress.
> Therefore, reorganise the code flow by adding a goto
> to jump to the end of function handling when shutdown is
> already being actioned. This removes one level of code
> indentation and make the code more readable.
>
> Signed-off-by: Dean Jenkins <[email protected]>
> ---
> net/bluetooth/l2cap_sock.c | 44
> ++++++++++++++++++++++++-------------------- 1 file changed, 24
> insertions(+), 20 deletions(-)
>
> diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
> index c909f45..8550886 100644
> --- a/net/bluetooth/l2cap_sock.c
> +++ b/net/bluetooth/l2cap_sock.c
> @@ -1099,6 +1099,13 @@ static int l2cap_sock_shutdown(struct socket *sock,
> int how) if (!sk)
> return 0;
>
> + lock_sock(sk);
> +
> + if (sk->sk_shutdown)
> + goto shutdown_already;
> +
> + sk->sk_shutdown = SHUTDOWN_MASK;
> +
> /* prevent sk structure from being freed whilst unlocked */
> sock_hold(sk);
>
> @@ -1113,30 +1120,21 @@ static int l2cap_sock_shutdown(struct socket *sock,
> int how) mutex_lock(&conn->chan_lock);
>
> l2cap_chan_lock(chan);
> - lock_sock(sk);
>
> - if (!sk->sk_shutdown) {
> - if (chan->mode == L2CAP_MODE_ERTM &&
> - chan->unacked_frames > 0 &&
> - chan->state == BT_CONNECTED)
> - err = __l2cap_wait_ack(sk, chan);
> + if (chan->mode == L2CAP_MODE_ERTM &&
> + chan->unacked_frames > 0 &&
> + chan->state == BT_CONNECTED)
> + err = __l2cap_wait_ack(sk, chan);
>
> - sk->sk_shutdown = SHUTDOWN_MASK;
> -
> - release_sock(sk);
> - l2cap_chan_close(chan, 0);
> - lock_sock(sk);
> -
> - if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
> - !(current->flags & PF_EXITING))
> - err = bt_sock_wait_state(sk, BT_CLOSED,
> - sk->sk_lingertime);
> - }
> + release_sock(sk);
> + l2cap_chan_close(chan, 0);
> + lock_sock(sk);
>
> - if (!err && sk->sk_err)
> - err = -sk->sk_err;
> + if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
> + !(current->flags & PF_EXITING))
> + err = bt_sock_wait_state(sk, BT_CLOSED,
> + sk->sk_lingertime);
>
> - release_sock(sk);
> l2cap_chan_unlock(chan);
>
> if (conn)
> @@ -1146,6 +1144,12 @@ static int l2cap_sock_shutdown(struct socket *sock,
> int how) l2cap_chan_put(chan);
> sock_put(sk);

Isn't it possible that sk will be freed here? (at least this is what comment
from patch 1 suggests).

>
> +shutdown_already:
> + if (!err && sk->sk_err)
> + err = -sk->sk_err;
> +
> + release_sock(sk);
> +
> return err;
> }

--
BR
Szymon Janc

2015-06-06 04:21:28

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [PATCH v1 7/7] Bluetooth: __l2cap_wait_ack() limit max waiting time

Hi Dean,

> Add a limiter counter to prevent the do while loop
> running in an infinite loop. This ensures that the
> channel will be instructed to close within 10 seconds
> so prevents l2cap_sock_shutdown() getting stuck forever.
>
> Returns -ENOLINK when the limit is reached as the channel
> will be subequently closed and not all data was ACK'ed.
>
> Signed-off-by: Dean Jenkins <[email protected]>
> ---
> net/bluetooth/l2cap_sock.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
> index 369ad0e..ee6531e 100644
> --- a/net/bluetooth/l2cap_sock.c
> +++ b/net/bluetooth/l2cap_sock.c
> @@ -1059,11 +1059,13 @@ static int __l2cap_wait_ack(struct sock *sk, struct l2cap_chan *chan)
> DECLARE_WAITQUEUE(wait, current);
> int err = 0;
> int timeo = HZ/5;
> + int limiter = 10 * 5; /* 10 seconds limit */

while reading this, should timeo not be using msecs_to_jiffies() in the first place.

And with that, can we have a little bit better logic on how you get to 10 seconds. I had to scratch my head a bit to realise that this is 50 * 200 msec. It seems a bit error prone in case anyone ever changes something.

>
> add_wait_queue(sk_sleep(sk), &wait);
> set_current_state(TASK_INTERRUPTIBLE);
> do {
> - BT_DBG("Waiting for %d ACKs", chan->unacked_frames);
> + BT_DBG("Waiting for %d ACKs, limiter %d",
> + chan->unacked_frames, limiter);
>
> if (!timeo)
> timeo = HZ/5;

And with that, I have no idea why we are doing this check here. Seems rather pointless unless I misses something.

I know these are not your bugs, but while we are at it, it might be better to really clean this out.

> @@ -1081,6 +1083,13 @@ static int __l2cap_wait_ack(struct sock *sk, struct l2cap_chan *chan)
> err = sock_error(sk);
> if (err)
> break;
> +
> + limiter--;
> + if (!limiter) {
> + err = -ENOLINK;
> + break;
> + }
> +
> } while (chan->unacked_frames > 0 &&
> chan->state == BT_CONNECTED);

Regards

Marcel


2015-06-06 04:13:35

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [PATCH v1 0/7] Avoid L2CAP ERTM shutdown hung tasks

Hi Dean,

> Details of the issue are described in the thread
> [RFC] Bluetooth ERTM L2CAP deadlocks (hung tasks) due to l2cap_sock_shutdown()
> and kernel.org Bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=99301
>
> Summary
> =======
>
> a) __l2cap_wait_ack() gets stuck in an infinite loop whilst waiting for
> I frame acknowledgements (ack is S or I frame from the remote device).
> The infinite loop occurs because the waiting loop's condition never becomes
> false. __l2cap_wait_ack() periodically sleeps for 200ms.
>
> b) The L2CAP threads for processing received frames and error recovery
> timers become blocked on a mutex held by l2cap_sock_shutdown() which
> prevents updating of the loop condition in__l2cap_wait_ack().
>
> Therefore a deadlock occurs.
>
> This deadlock was observed on kernels 3.8 upto bluetooth-next on x86 and ARM
> based platforms including the Raspberry Pi kernel 3.18.14+
>
> The testcase is simple to reproduce the hung tasks, please wait upto 10 minutes
> to observe the hung tasks:
>
> On a laptop or other suitable device use the command:
> l2test -i hci0 -X ertm -d
> This command listens for connections with optional L2CAP ERTM supported and
> dumps out the received data.
>
> On the unit under test use the command:
> l2test -s -b 1024 -B /dev/urandom -X ertm -P 4113 <lap_BT_ADDR>
> This command sends 1024 bytes to the laptop via PSM 4113 used by l2test over
> an ERTM channel. The command line shows:
>
> l2test -s -b 1024 -B /dev/urandom -X ertm -P 4113 5C:F3:70:68:BA:DD
> l2test[10366]: Connected to 5C:F3:70:68:BA:DD (bredr, psm 4113, scid 64)
> l2test[10366]: Local device 5C:F3:70:68:BA:DA (bredr, psm 4113, scid 64)
> l2test[10366]: Options [imtu 672, omtu 672, flush_to 65535, mode 3, handle 11, class 0x000000, priority 0, rcvbuf 163840]
> l2test[10366]: Sending ...
> l2test[10366]: Closing channel ...
>
> Note that the channel fails to close because l2cap_sock_shutdown() is prevented
> from returning due to __l2cap_wait_ack being stuck in an infinite loop.
>
> Example hung tasks are:
>
> [ 601.116119] INFO: task kworker/0:1:19 blocked for more than 120 seconds.
> [ 601.116156] Not tainted 3.18.11+ #781
> [ 601.116169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 601.116183] kworker/0:1 D c054cbcc 0 19 2 0x00000000
> [ 601.116537] Workqueue: events l2cap_retrans_timeout [bluetooth]
> [ 601.116685] [<c054cbcc>] (__schedule) from [<c054cfa4>] (schedule+0x40/0x8c)
> [ 601.116727] [<c054cfa4>] (schedule) from [<c054d3c8>] (schedule_preempt_disabled+0x30/0x40)
> [ 601.116812] [<c054d3c8>] (schedule_preempt_disabled) from [<c054ee14>] (__mutex_lock_slowpath+0xb8/0x18c)
> [ 601.116856] [<c054ee14>] (__mutex_lock_slowpath) from [<c054ef20>] (mutex_lock+0x38/0x3c)
> [ 601.117085] [<c054ef20>] (mutex_lock) from [<bf05173c>] (l2cap_retrans_timeout+0x24/0x5c [bluetooth])
> [ 601.117371] [<bf05173c>] (l2cap_retrans_timeout [bluetooth]) from [<c003aa68>] (process_one_work+0x13c/0x444)
> [ 601.117410] [<c003aa68>] (process_one_work) from [<c003b954>] (worker_thread+0x190/0x580)
> [ 601.117445] [<c003b954>] (worker_thread) from [<c003fdb4>] (kthread+0xd8/0xf4)
> [ 601.117488] [<c003fdb4>] (kthread) from [<c000e8c8>] (ret_from_fork+0x14/0x20)
> [ 601.117569] INFO: task kworker/u3:2:327 blocked for more than 120 seconds.
> [ 601.117588] Not tainted 3.18.11+ #781
> [ 601.117601] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 601.117614] kworker/u3:2 D c054cbcc 0 327 2 0x00000000
> [ 601.117806] Workqueue: hci0 hci_rx_work [bluetooth]
> [ 601.117941] [<c054cbcc>] (__schedule) from [<c054cfa4>] (schedule+0x40/0x8c)
> [ 601.117981] [<c054cfa4>] (schedule) from [<c054d3c8>] (schedule_preempt_disabled+0x30/0x40)
> [ 601.118056] [<c054d3c8>] (schedule_preempt_disabled) from [<c054ee14>] (__mutex_lock_slowpath+0xb8/0x18c)
> [ 601.118092] [<c054ee14>] (__mutex_lock_slowpath) from [<c054ef20>] (mutex_lock+0x38/0x3c)
> [ 601.118307] [<c054ef20>] (mutex_lock) from [<bf04d2c8>] (l2cap_get_chan_by_scid+0x28/0x98 [bluetooth])
> [ 601.118696] [<bf04d2c8>] (l2cap_get_chan_by_scid [bluetooth]) from [<bf054590>] (l2cap_recv_frame+0xa4/0x2cec [bluetooth])
> [ 601.119146] [<bf054590>] (l2cap_recv_frame [bluetooth]) from [<bf058194>] (l2cap_recv_acldata+0x214/0x22c [bluetooth])
> [ 601.119559] [<bf058194>] (l2cap_recv_acldata [bluetooth]) from [<bf02bf88>] (hci_rx_work+0x180/0x288 [bluetooth])
> [ 601.119801] [<bf02bf88>] (hci_rx_work [bluetooth]) from [<c003aa68>] (process_one_work+0x13c/0x444)
> [ 601.119839] [<c003aa68>] (process_one_work) from [<c003b954>] (worker_thread+0x190/0x580)
> [ 601.119883] [<c003b954>] (worker_thread) from [<c003fdb4>] (kthread+0xd8/0xf4)
> [ 601.119919] [<c003fdb4>] (kthread) from [<c000e8c8>] (ret_from_fork+0x14/0x20)
>
> This is not an exhaustive list, there are other hung task flavours.
>
> Solution
> ========
>
> The solution was to reorganise l2cap_sock_shutdown() to reveal the purpose of
> the function. The function shuts down the L2CAP channel but in the case when the
> channel type is ERTM, __l2cap_wait_ack() is called to wait for acknowledgements
> before the channel is shutdown.
>
> There is some concurrency protection by using a lock on sk to prevent
> multiple instances of l2cap_sock_shutdown() from running.
>
> SHUTDOWN_MASK is used to inhibit new I frames and other PDUs from being
> queued. Setting this flag was moved to before calling __l2cap_wait_ack() so that
> the loop would not be waiting indefinitely for ACKs. There is some risk here
> of not sending all pending data, however, an arbitrary cut-off point is needed
> otherwise the link could remain connected. This may need improving to inhibit
> only new I frames.
>
> Extra protection to the sk and chan structures was added to prevent those
> structures being freed whilst the locks were not held. This is defensive and
> perhaps unnecessary.
>
> __l2cap_wait_ack() was modified to be only called when necessary and the
> while loop was replaced by a do while loop because there must be at least 1
> waiting period. References to chan->conn were replaced by chan->state so that
> the mutex locks could be removed. Also waiting for ACKs is only valid whilst the
> connection is in the BT_CONNECTED state.
>
> The scope of chan locking was reduced so that processing of received frames
> and error handling routines could operate whilst waiting for ACKs.
>
> Importantly, for __l2cap_wait_ack() a 10 second maximum wait limit was
> implemented to prevent __l2cap_wait_ack() getting stuck in an infinite loop.
> If no expected ACKs arrive then the channel is disconnected and the error
> code -ENOLINK is returned to indicate that not all data was ACK'ed.
>
> Testing
> =======
>
> This is being tested in kernel 3.8 on an ARM based board. Results so far
> show that the hung tasks no longer occur.
>
> Note that the failure scenario was Mediaplayer media browsing. Waiting for
> ACK's is unimportant in this scenario because it does not matter whether
> a media browsing request such as "get item" was successfully transferred at the
> point of shutting down the channel.

any chance you have run PTS against it to ensure that we are not breaking qualification?

You could use android/pts-l2cap.txt as documentation for the commands for each test case. I would feel a lot more comfortable if we know that qualification is still in place.

Otherwise, this looks pretty good. Thanks for tracking this down.

Regards

Marcel


2015-06-05 23:11:16

by Dean Jenkins

[permalink] [raw]
Subject: [PATCH v1 7/7] Bluetooth: __l2cap_wait_ack() limit max waiting time

Add a limiter counter to prevent the do while loop
running in an infinite loop. This ensures that the
channel will be instructed to close within 10 seconds
so prevents l2cap_sock_shutdown() getting stuck forever.

Returns -ENOLINK when the limit is reached as the channel
will be subequently closed and not all data was ACK'ed.

Signed-off-by: Dean Jenkins <[email protected]>
---
net/bluetooth/l2cap_sock.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index 369ad0e..ee6531e 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1059,11 +1059,13 @@ static int __l2cap_wait_ack(struct sock *sk, struct l2cap_chan *chan)
DECLARE_WAITQUEUE(wait, current);
int err = 0;
int timeo = HZ/5;
+ int limiter = 10 * 5; /* 10 seconds limit */

add_wait_queue(sk_sleep(sk), &wait);
set_current_state(TASK_INTERRUPTIBLE);
do {
- BT_DBG("Waiting for %d ACKs", chan->unacked_frames);
+ BT_DBG("Waiting for %d ACKs, limiter %d",
+ chan->unacked_frames, limiter);

if (!timeo)
timeo = HZ/5;
@@ -1081,6 +1083,13 @@ static int __l2cap_wait_ack(struct sock *sk, struct l2cap_chan *chan)
err = sock_error(sk);
if (err)
break;
+
+ limiter--;
+ if (!limiter) {
+ err = -ENOLINK;
+ break;
+ }
+
} while (chan->unacked_frames > 0 &&
chan->state == BT_CONNECTED);

--
1.8.5.6

2015-06-05 23:11:15

by Dean Jenkins

[permalink] [raw]
Subject: [PATCH v1 6/7] Bluetooth: Add BT_DBG to l2cap_sock_shutdown()

Add helpful BT_DBG debug to l2cap_sock_shutdown()
and __l2cap_wait_ack() so that the code flow can
be analysed.

Signed-off-by: Dean Jenkins <[email protected]>
---
net/bluetooth/l2cap_sock.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index 0d8ae73..369ad0e 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1063,6 +1063,8 @@ static int __l2cap_wait_ack(struct sock *sk, struct l2cap_chan *chan)
add_wait_queue(sk_sleep(sk), &wait);
set_current_state(TASK_INTERRUPTIBLE);
do {
+ BT_DBG("Waiting for %d ACKs", chan->unacked_frames);
+
if (!timeo)
timeo = HZ/5;

@@ -1140,6 +1142,8 @@ shutdown_already:

release_sock(sk);

+ BT_DBG("err: %d", err);
+
return err;
}

--
1.8.5.6


2015-06-05 23:11:14

by Dean Jenkins

[permalink] [raw]
Subject: [PATCH v1 5/7] Bluetooth: l2cap_sock_shutdown() reduce scope of chan locking

The only place where chan locking is needed is the call to
l2cap_chan_close(chan, 0) which if necessary closes the channel.
Therefore, move the l2cap_chan_lock(chan) and
l2cap_chan_lock(chan) locking calls to around
l2cap_chan_close(chan, 0).

This allows __l2cap_wait_ack(sk, chan) to be called with no
chan locks being held so L2CAP messaging over the ACL link
can be done unimpaired.

Signed-off-by: Dean Jenkins <[email protected]>
---
net/bluetooth/l2cap_sock.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index 273bfa1..0d8ae73 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1114,24 +1114,22 @@ static int l2cap_sock_shutdown(struct socket *sock, int how)

BT_DBG("chan %p state %s", chan, state_to_string(chan->state));

- l2cap_chan_lock(chan);
-
if (chan->mode == L2CAP_MODE_ERTM &&
chan->unacked_frames > 0 &&
chan->state == BT_CONNECTED)
err = __l2cap_wait_ack(sk, chan);

+ l2cap_chan_lock(chan);
release_sock(sk);
l2cap_chan_close(chan, 0);
lock_sock(sk);
+ l2cap_chan_unlock(chan);

if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
!(current->flags & PF_EXITING))
err = bt_sock_wait_state(sk, BT_CLOSED,
sk->sk_lingertime);

- l2cap_chan_unlock(chan);
-
/* allow chan and sk structures to be freed */
l2cap_chan_put(chan);
sock_put(sk);
--
1.8.5.6

2015-06-05 23:11:13

by Dean Jenkins

[permalink] [raw]
Subject: [PATCH v1 4/7] Bluetooth: l2cap_sock_shutdown() remove mutex_lock calls

Remove the calls
mutex_lock(&conn->chan_lock);
mutex_unlock(&conn->chan_lock);

because the conn structure is no longer being referenced in
l2cap_sock_shutdown() as instead the chan->state is being used
to determince the channel connection state.

Signed-off-by: Dean Jenkins <[email protected]>
---
net/bluetooth/l2cap_sock.c | 8 --------
1 file changed, 8 deletions(-)

diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index 8550886..273bfa1 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1091,7 +1091,6 @@ static int l2cap_sock_shutdown(struct socket *sock, int how)
{
struct sock *sk = sock->sk;
struct l2cap_chan *chan;
- struct l2cap_conn *conn;
int err = 0;

BT_DBG("sock %p, sk %p", sock, sk);
@@ -1112,13 +1111,9 @@ static int l2cap_sock_shutdown(struct socket *sock, int how)
chan = l2cap_pi(sk)->chan;
/* prevent chan structure from being freed whilst unlocked */
l2cap_chan_hold(chan);
- conn = chan->conn;

BT_DBG("chan %p state %s", chan, state_to_string(chan->state));

- if (conn)
- mutex_lock(&conn->chan_lock);
-
l2cap_chan_lock(chan);

if (chan->mode == L2CAP_MODE_ERTM &&
@@ -1137,9 +1132,6 @@ static int l2cap_sock_shutdown(struct socket *sock, int how)

l2cap_chan_unlock(chan);

- if (conn)
- mutex_unlock(&conn->chan_lock);
-
/* allow chan and sk structures to be freed */
l2cap_chan_put(chan);
sock_put(sk);
--
1.8.5.6

2015-06-05 23:11:12

by Dean Jenkins

[permalink] [raw]
Subject: [PATCH v1 3/7] Bluetooth: Unwind l2cap_sock_shutdown()

l2cap_sock_shutdown() is designed to only action shutdown
of the channel when shutdown is not already in progress.
Therefore, reorganise the code flow by adding a goto
to jump to the end of function handling when shutdown is
already being actioned. This removes one level of code
indentation and make the code more readable.

Signed-off-by: Dean Jenkins <[email protected]>
---
net/bluetooth/l2cap_sock.c | 44 ++++++++++++++++++++++++--------------------
1 file changed, 24 insertions(+), 20 deletions(-)

diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index c909f45..8550886 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1099,6 +1099,13 @@ static int l2cap_sock_shutdown(struct socket *sock, int how)
if (!sk)
return 0;

+ lock_sock(sk);
+
+ if (sk->sk_shutdown)
+ goto shutdown_already;
+
+ sk->sk_shutdown = SHUTDOWN_MASK;
+
/* prevent sk structure from being freed whilst unlocked */
sock_hold(sk);

@@ -1113,30 +1120,21 @@ static int l2cap_sock_shutdown(struct socket *sock, int how)
mutex_lock(&conn->chan_lock);

l2cap_chan_lock(chan);
- lock_sock(sk);

- if (!sk->sk_shutdown) {
- if (chan->mode == L2CAP_MODE_ERTM &&
- chan->unacked_frames > 0 &&
- chan->state == BT_CONNECTED)
- err = __l2cap_wait_ack(sk, chan);
+ if (chan->mode == L2CAP_MODE_ERTM &&
+ chan->unacked_frames > 0 &&
+ chan->state == BT_CONNECTED)
+ err = __l2cap_wait_ack(sk, chan);

- sk->sk_shutdown = SHUTDOWN_MASK;
-
- release_sock(sk);
- l2cap_chan_close(chan, 0);
- lock_sock(sk);
-
- if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
- !(current->flags & PF_EXITING))
- err = bt_sock_wait_state(sk, BT_CLOSED,
- sk->sk_lingertime);
- }
+ release_sock(sk);
+ l2cap_chan_close(chan, 0);
+ lock_sock(sk);

- if (!err && sk->sk_err)
- err = -sk->sk_err;
+ if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
+ !(current->flags & PF_EXITING))
+ err = bt_sock_wait_state(sk, BT_CLOSED,
+ sk->sk_lingertime);

- release_sock(sk);
l2cap_chan_unlock(chan);

if (conn)
@@ -1146,6 +1144,12 @@ static int l2cap_sock_shutdown(struct socket *sock, int how)
l2cap_chan_put(chan);
sock_put(sk);

+shutdown_already:
+ if (!err && sk->sk_err)
+ err = -sk->sk_err;
+
+ release_sock(sk);
+
return err;
}

--
1.8.5.6

2015-06-05 23:11:11

by Dean Jenkins

[permalink] [raw]
Subject: [PATCH v1 2/7] Bluetooth: Make __l2cap_wait_ack more efficient

Use chan->state instead of chan->conn because waiting
for ACK's is only possible in the BT_CONNECTED state.
Also avoids reference to the conn structure so makes
locking easier.

Only call __l2cap_wait_ack() when the needed condition
of chan->unacked_frames > 0 && chan->state == BT_CONNECTED
is true and convert the while loop to a do while loop.

__l2cap_wait_ack() change the function prototype to
pass in the chan variable as chan is already available
in the calling function l2cap_sock_shutdown(). Avoids
locking issues.

Signed-off-by: Dean Jenkins <[email protected]>
---
net/bluetooth/l2cap_sock.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index 8321685..c909f45 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1054,16 +1054,15 @@ static void l2cap_sock_kill(struct sock *sk)
sock_put(sk);
}

-static int __l2cap_wait_ack(struct sock *sk)
+static int __l2cap_wait_ack(struct sock *sk, struct l2cap_chan *chan)
{
- struct l2cap_chan *chan = l2cap_pi(sk)->chan;
DECLARE_WAITQUEUE(wait, current);
int err = 0;
int timeo = HZ/5;

add_wait_queue(sk_sleep(sk), &wait);
set_current_state(TASK_INTERRUPTIBLE);
- while (chan->unacked_frames > 0 && chan->conn) {
+ do {
if (!timeo)
timeo = HZ/5;

@@ -1080,7 +1079,9 @@ static int __l2cap_wait_ack(struct sock *sk)
err = sock_error(sk);
if (err)
break;
- }
+ } while (chan->unacked_frames > 0 &&
+ chan->state == BT_CONNECTED);
+
set_current_state(TASK_RUNNING);
remove_wait_queue(sk_sleep(sk), &wait);
return err;
@@ -1115,8 +1116,10 @@ static int l2cap_sock_shutdown(struct socket *sock, int how)
lock_sock(sk);

if (!sk->sk_shutdown) {
- if (chan->mode == L2CAP_MODE_ERTM)
- err = __l2cap_wait_ack(sk);
+ if (chan->mode == L2CAP_MODE_ERTM &&
+ chan->unacked_frames > 0 &&
+ chan->state == BT_CONNECTED)
+ err = __l2cap_wait_ack(sk, chan);

sk->sk_shutdown = SHUTDOWN_MASK;

--
1.8.5.6


2015-06-05 23:11:10

by Dean Jenkins

[permalink] [raw]
Subject: [PATCH v1 1/7] Bluetooth: L2CAP ERTM shutdown protect sk and chan

During execution of l2cap_sock_shutdown() which might
sleep, the sk and chan structures can be in an unlocked
condition which potentially allows the structures to be
freed by other running threads. Therefore, there is a
possibility of a malfunction or memory reuse after being
freed.

Keep the sk and chan structures alive during the
exection of l2cap_sock_shutdown() by using their
respective hold and put functions. This allows the structures
to be freeable at the end of l2cap_sock_shutdown().

Signed-off-by: Kautuk Consul <[email protected]>
Signed-off-by: Dean Jenkins <[email protected]>
---
net/bluetooth/l2cap_sock.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index 2442877..8321685 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1098,7 +1098,12 @@ static int l2cap_sock_shutdown(struct socket *sock, int how)
if (!sk)
return 0;

+ /* prevent sk structure from being freed whilst unlocked */
+ sock_hold(sk);
+
chan = l2cap_pi(sk)->chan;
+ /* prevent chan structure from being freed whilst unlocked */
+ l2cap_chan_hold(chan);
conn = chan->conn;

BT_DBG("chan %p state %s", chan, state_to_string(chan->state));
@@ -1134,6 +1139,10 @@ static int l2cap_sock_shutdown(struct socket *sock, int how)
if (conn)
mutex_unlock(&conn->chan_lock);

+ /* allow chan and sk structures to be freed */
+ l2cap_chan_put(chan);
+ sock_put(sk);
+
return err;
}

--
1.8.5.6

2015-10-08 08:59:55

by Luiz Augusto von Dentz

[permalink] [raw]
Subject: Re: [PATCH v1 0/7] Avoid L2CAP ERTM shutdown hung tasks

Hi Dean,

On Wed, Oct 7, 2015 at 5:43 PM, Dean Jenkins <[email protected]> wrot=
e:
> Hi Luiz,
>
> On 07/10/15 15:09, Luiz Augusto von Dentz wrote:
>>
>> Hi Dean,
>>
>> On Sat, Jun 6, 2015 at 7:13 AM, Marcel Holtmann <[email protected]>
>> wrote:
>>>
>>> Hi Dean,
>>>
>>>> Details of the issue are described in the thread
>>>> [RFC] Bluetooth ERTM L2CAP deadlocks (hung tasks) due to
>>>> l2cap_sock_shutdown()
>>>> and kernel.org Bugzilla
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=3D99301
>>
>> Are you still working on this? I can still lock bluetoothd on close,
>> btw it seems you are not aware of the following:
>>
>> SO_LINGER
>> When enabled, a close(2) or shutdown(2) will not return
>> until all queued messages for the socket have been successfully sent
>> or the linger timeout has been reached. Otherwise, the call
>> returns immediately and the closing is done in the background.
>>
>> So according to the text above we should never call __l2cap_wait_ack
>> on l2cap_sock_shutdown since it should return immediately, only when
>> SO_LINGER is used it is allowed to block. Also there doesn't seems to
>> be anything prevent us to send a L2CAP Disconnect request while there
>> are unacked packets, at least I could find anything on the core about
>> waiting for unacked nor the test spec has any test for it.
>>
> Thanks for the information about SO_LINGER time. So it sounds like furthe=
r
> changes to the logic of the code are needed.
>
> Yes, we are still working on this issue. We have devised a fix for our
> commercial ARM kernel 3.14. and so far we have no reported failures.
>
> In our fix, we had to ensure "L2CAP Disconnect request" took precedence o=
ver
> shutdown.

Alright but then __l2cap_wait_ack should probably not be needed
anymore since the remote should discard any data not processed yet, or
if did process send the pending acks before the disconnect response,
which makes me question why we have this code in the first place.

> Note that ACKS are only used for L2CAP ERTM and so only some BT services
> risk getting hung such as audio media browsing over BT.

Actually for AVRCP browsing it does not matter since it won't transfer
persistent data, GOEP 2.0 would perhaps be a problem since it means
data transfers, which could be files in case of FTP, may not be
complete but OBEX don't have segmentation so you always need to wait
the final of GET or PUT command even in Single Response Mode which
means if the we disconnect before the data completes we should discard
the data so there is no point in waiting for unacked packets in GOEP
2.0.

> We are currently preparing a patchset for bluetooth-next of our 2nd attem=
pt
> at resolving the locking issue.

I wonder if you tried removing __l2cap_wait_ack and related code?

Btw, Ive trace down the original commit introduce this logic to:

commit 6161c0382bbab883a634d284f7367a88bbe88534
Author: Gustavo F. Padovan <[email protected]>
Date: Sat May 1 16:15:44 2010 -0300

Bluetooth: Add wait_queue to wait ack of all sent packets

To guarantee that all packets we sent were received we need to wait for
theirs ack before shutdown the socket.

Signed-off-by: Gustavo F. Padovan <[email protected]>
Reviewed-by: Jo=C3=A3o Paulo Rechi Vita <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>

It looks to me it was trying to prevent that the scenario where you
send a bunch of data and disconnect, probably l2test would do that,
which means there is no protocol on top, which Im now suspecting
doesn't even work with BlueZ as a receiving part because we only ack
if the window is 3/4ths full.

--=20
Luiz Augusto von Dentz

2015-10-07 14:43:44

by Dean Jenkins

[permalink] [raw]
Subject: Re: [PATCH v1 0/7] Avoid L2CAP ERTM shutdown hung tasks

Hi Luiz,

On 07/10/15 15:09, Luiz Augusto von Dentz wrote:
> Hi Dean,
>
> On Sat, Jun 6, 2015 at 7:13 AM, Marcel Holtmann <[email protected]> wrote:
>> Hi Dean,
>>
>>> Details of the issue are described in the thread
>>> [RFC] Bluetooth ERTM L2CAP deadlocks (hung tasks) due to l2cap_sock_shutdown()
>>> and kernel.org Bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=99301
> Are you still working on this? I can still lock bluetoothd on close,
> btw it seems you are not aware of the following:
>
> SO_LINGER
> When enabled, a close(2) or shutdown(2) will not return
> until all queued messages for the socket have been successfully sent
> or the linger timeout has been reached. Otherwise, the call
> returns immediately and the closing is done in the background.
>
> So according to the text above we should never call __l2cap_wait_ack
> on l2cap_sock_shutdown since it should return immediately, only when
> SO_LINGER is used it is allowed to block. Also there doesn't seems to
> be anything prevent us to send a L2CAP Disconnect request while there
> are unacked packets, at least I could find anything on the core about
> waiting for unacked nor the test spec has any test for it.
>
Thanks for the information about SO_LINGER time. So it sounds like
further changes to the logic of the code are needed.

Yes, we are still working on this issue. We have devised a fix for our
commercial ARM kernel 3.14. and so far we have no reported failures.

In our fix, we had to ensure "L2CAP Disconnect request" took precedence
over shutdown.

Note that ACKS are only used for L2CAP ERTM and so only some BT services
risk getting hung such as audio media browsing over BT.

We are currently preparing a patchset for bluetooth-next of our 2nd
attempt at resolving the locking issue.

Thanks,

Regards,
Dean

--
Dean Jenkins
Embedded Software Engineer
Linux Transportation Solutions
Mentor Embedded Software Division
Mentor Graphics (UK) Ltd.

2015-10-07 14:09:57

by Luiz Augusto von Dentz

[permalink] [raw]
Subject: Re: [PATCH v1 0/7] Avoid L2CAP ERTM shutdown hung tasks

Hi Dean,

On Sat, Jun 6, 2015 at 7:13 AM, Marcel Holtmann <[email protected]> wrote:
> Hi Dean,
>
>> Details of the issue are described in the thread
>> [RFC] Bluetooth ERTM L2CAP deadlocks (hung tasks) due to l2cap_sock_shutdown()
>> and kernel.org Bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=99301

Are you still working on this? I can still lock bluetoothd on close,
btw it seems you are not aware of the following:

SO_LINGER
When enabled, a close(2) or shutdown(2) will not return
until all queued messages for the socket have been successfully sent
or the linger timeout has been reached. Otherwise, the call
returns immediately and the closing is done in the background.

So according to the text above we should never call __l2cap_wait_ack
on l2cap_sock_shutdown since it should return immediately, only when
SO_LINGER is used it is allowed to block. Also there doesn't seems to
be anything prevent us to send a L2CAP Disconnect request while there
are unacked packets, at least I could find anything on the core about
waiting for unacked nor the test spec has any test for it.

--
Luiz Augusto von Dentz