LinuxLists.cc - [PATCH 1/2] usb: host: xhci: Fix possible wild pointer when handling abort command

[permalink] [raw]

Subject: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

If a command event is found on the event ring during an interrupt,
we need to stop the command timer with del_timer(). Since del_timer()
can fail if the timer is running and waiting on the xHCI lock, then
it maybe get the wrong timeout command in xhci_handle_command_timeout()
if host fetched a new command and updated the xhci->current_cmd in
handle_cmd_completion(). For this situation, we need a way to signal
to the command timer that everything is fine and it should exit.

We should introduce a counter (xhci->current_cmd_pending) for the number
of pending commands. If we need to cancel the command timer and del_timer()
succeeds, we decrement the number of pending commands. If del_timer() fails,
we leave the number of pending commands alone.

For handling timeout command, in xhci_handle_command_timeout() we will check
the counter after decrementing it, if the counter (xhci->current_cmd_pending)
is 0, which means xhci->current_cmd is the right timeout command. If the
counter (xhci->current_cmd_pending) is greater than 0, which means current
timeout command has been handled by host and host has fetched new command as
xhci->current_cmd, then just return and wait for new current command.

Signed-off-by: Baolin Wang <[email protected]>
---
drivers/usb/host/xhci-ring.c | 29 ++++++++++++++++++++++++++++-
drivers/usb/host/xhci.h | 1 +
2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 9965a4c..edc9ac2 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1253,6 +1253,7 @@ static void xhci_handle_stopped_cmd_ring(struct xhci_hcd *xhci,
if ((xhci->cmd_ring->dequeue != xhci->cmd_ring->enqueue) &&
!(xhci->xhc_state & XHCI_STATE_DYING)) {
xhci->current_cmd = cur_cmd;
+ xhci->current_cmd_pending++;
mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
xhci_ring_cmd_db(xhci);
}
@@ -1269,11 +1270,27 @@ void xhci_handle_command_timeout(unsigned long data)
xhci = (struct xhci_hcd *) data;

spin_lock_irqsave(&xhci->lock, flags);
+ xhci->current_cmd_pending--;
+
if (!xhci->current_cmd) {
spin_unlock_irqrestore(&xhci->lock, flags);
return;
}

+ /*
+ * If the current_cmd_pending is 0, which means current command is
+ * timeout.
+ *
+ * If the current_cmd_pending is greater than 0, which means current
+ * timeout command has been handled by host and host has fetched new
+ * command as xhci->current_cmd, then just return and wait for new
+ * current command.
+ */
+ if (xhci->current_cmd_pending > 0) {
+ spin_unlock_irqrestore(&xhci->lock, flags);
+ return;
+ }
+
if (xhci->current_cmd->status == COMP_CMD_ABORT)
second_timeout = true;
xhci->current_cmd->status = COMP_CMD_ABORT;
@@ -1282,6 +1299,8 @@ void xhci_handle_command_timeout(unsigned long data)
hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
(hw_ring_state & CMD_RING_RUNNING)) {
+ /* Will add command timer again to wait for abort event */
+ xhci->current_cmd_pending++;
spin_unlock_irqrestore(&xhci->lock, flags);
xhci_dbg(xhci, "Command timeout\n");
ret = xhci_abort_cmd_ring(xhci);
@@ -1336,7 +1355,13 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,

cmd = list_entry(xhci->cmd_list.next, struct xhci_command, cmd_list);

- del_timer(&xhci->cmd_timer);
+ /*
+ * If the command timer is running on another CPU, we don't decrement
+ * current_cmd_pending, since we didn't successfully stop the command
+ * timer.
+ */
+ if (del_timer(&xhci->cmd_timer))
+ xhci->current_cmd_pending--;

trace_xhci_cmd_completion(cmd_trb, (struct xhci_generic_trb *) event);

@@ -1427,6 +1452,7 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
if (cmd->cmd_list.next != &xhci->cmd_list) {
xhci->current_cmd = list_entry(cmd->cmd_list.next,
struct xhci_command, cmd_list);
+ xhci->current_cmd_pending++;
mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
} else if (xhci->current_cmd == cmd) {
xhci->current_cmd = NULL;
@@ -3927,6 +3953,7 @@ static int queue_command(struct xhci_hcd *xhci, struct xhci_command *cmd,
if (xhci->cmd_list.next == &cmd->cmd_list &&
!timer_pending(&xhci->cmd_timer)) {
xhci->current_cmd = cmd;
+ xhci->current_cmd_pending++;
mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
}

diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 9dbaacf..5d81257 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1567,6 +1567,7 @@ struct xhci_hcd {
unsigned int cmd_ring_reserved_trbs;
struct timer_list cmd_timer;
struct xhci_command *current_cmd;
+ u32 current_cmd_pending;
struct xhci_ring *event_ring;
struct xhci_erst erst;
/* Scratchpad */
--
1.7.9.5

2016-12-05 14:07:35

[permalink] [raw]

Subject: Re: [PATCH 1/2] usb: host: xhci: Fix possible wild pointer when handling abort command

On 05.12.2016 09:51, Baolin Wang wrote:
> When current command was supposed to be aborted, host will free the command
> in handle_cmd_completion() function. But it might be still referenced by
> xhci->current_cmd, which need to set NULL.
>
> Signed-off-by: Baolin Wang <[email protected]>
> ---
> This patch is based on Lu Baolu's new fix patch:
> usb: xhci: fix possible wild pointer
> https://www.spinics.net/lists/linux-usb/msg150219.html
> ---
> drivers/usb/host/xhci-ring.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index 62dd1c7..9965a4c 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -1362,8 +1362,11 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
> */
> if (cmd_comp_code == COMP_CMD_ABORT) {
> xhci->cmd_ring_state = CMD_RING_STATE_STOPPED;
> - if (cmd->status == COMP_CMD_ABORT)
> + if (cmd->status == COMP_CMD_ABORT) {
> + if (xhci->current_cmd == cmd)
> + xhci->current_cmd = NULL;
> goto event_handled;
> + }
> }
>
> cmd_type = TRB_FIELD_TO_TYPE(le32_to_cpu(cmd_trb->generic.field[3]));
>

True, thanks

-Mathias

2016-12-12 15:51:33

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

On 05.12.2016 09:51, Baolin Wang wrote:
> If a command event is found on the event ring during an interrupt,
> we need to stop the command timer with del_timer(). Since del_timer()
> can fail if the timer is running and waiting on the xHCI lock, then
> it maybe get the wrong timeout command in xhci_handle_command_timeout()
> if host fetched a new command and updated the xhci->current_cmd in
> handle_cmd_completion(). For this situation, we need a way to signal
> to the command timer that everything is fine and it should exit.

Ah, right, this could actually happen.

>
> We should introduce a counter (xhci->current_cmd_pending) for the number
> of pending commands. If we need to cancel the command timer and del_timer()
> succeeds, we decrement the number of pending commands. If del_timer() fails,
> we leave the number of pending commands alone.
>
> For handling timeout command, in xhci_handle_command_timeout() we will check
> the counter after decrementing it, if the counter (xhci->current_cmd_pending)
> is 0, which means xhci->current_cmd is the right timeout command. If the
> counter (xhci->current_cmd_pending) is greater than 0, which means current
> timeout command has been handled by host and host has fetched new command as
> xhci->current_cmd, then just return and wait for new current command.

A counter like this could work.

Writing the abort bit can generate either ABORT+STOP, or just STOP
event, this seems to cover both.

quick check, case 1: timeout and cmd completion at the same time.

cpu1 cpu2

queue_command(first), p++ (=1)
queue_command(more),
--completion irq fires-- -- timer times out at same time--
handle_cmd_completion() handle_cmd_timeout(),)
lock(xhci_lock ) spin_on(xhci_lock)
del_timer() fail, p (=1, nochange)
cur_cmd = list_next(), p++ (=2)
unlock(xhci_lock)
lock(xhci_lock)
p-- (=1)
if (p > 0), exit
OK works

case 2: normal timeout case with ABORT+STOP, no race.

cpu1 cpu2

queue_command(first), p++ (=1)
queue_command(more),
handle_cmd_timeout()
p-- (P=0), don't exit
mod_timer(), p++ (P=1)
write_abort_bit()
handle_cmd_comletion(ABORT)
del_timer(), ok, p-- (p = 0)
handle_cmd_completion(STOP)
del_timer(), fail, (P=0)
handle_stopped_cmd_ring()
cur_cmd = list_next(), p++ (=1)
mod_timer()

OK, works, and same for just STOP case, with the only difference that
during handle_cmd_completion(STOP) p is decremented (p--)

So unless there is a way to find out if cur_cmd is valid in command timeout
in command timeout with the help of existing flags and lists this would be a working
solution.

-Mathias

2016-12-13 03:21:57

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi Mathias,

On 12 December 2016 at 23:52, Mathias Nyman
<[email protected]> wrote:
> On 05.12.2016 09:51, Baolin Wang wrote:
>>
>> If a command event is found on the event ring during an interrupt,
>> we need to stop the command timer with del_timer(). Since del_timer()
>> can fail if the timer is running and waiting on the xHCI lock, then
>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>> if host fetched a new command and updated the xhci->current_cmd in
>> handle_cmd_completion(). For this situation, we need a way to signal
>> to the command timer that everything is fine and it should exit.
>
>
> Ah, right, this could actually happen.
>
>>
>>
>> We should introduce a counter (xhci->current_cmd_pending) for the number
>> of pending commands. If we need to cancel the command timer and
>> del_timer()
>> succeeds, we decrement the number of pending commands. If del_timer()
>> fails,
>> we leave the number of pending commands alone.
>>
>> For handling timeout command, in xhci_handle_command_timeout() we will
>> check
>> the counter after decrementing it, if the counter
>> (xhci->current_cmd_pending)
>> is 0, which means xhci->current_cmd is the right timeout command. If the
>> counter (xhci->current_cmd_pending) is greater than 0, which means current
>> timeout command has been handled by host and host has fetched new command
>> as
>> xhci->current_cmd, then just return and wait for new current command.
>
>
> A counter like this could work.
>
> Writing the abort bit can generate either ABORT+STOP, or just STOP
> event, this seems to cover both.
>
> quick check, case 1: timeout and cmd completion at the same time.
>
> cpu1 cpu2
>
> queue_command(first), p++ (=1)
> queue_command(more),
> --completion irq fires-- -- timer times out at same time--
> handle_cmd_completion() handle_cmd_timeout(),)
> lock(xhci_lock ) spin_on(xhci_lock)
> del_timer() fail, p (=1, nochange)
> cur_cmd = list_next(), p++ (=2)
> unlock(xhci_lock)
> lock(xhci_lock)
> p-- (=1)
> if (p > 0), exit
> OK works
>
> case 2: normal timeout case with ABORT+STOP, no race.
>
> cpu1 cpu2
>
> queue_command(first), p++ (=1)
> queue_command(more),
> handle_cmd_timeout()
> p-- (P=0), don't exit
> mod_timer(), p++ (P=1)
> write_abort_bit()
> handle_cmd_comletion(ABORT)
> del_timer(), ok, p-- (p = 0)
> handle_cmd_completion(STOP)
> del_timer(), fail, (P=0)
> handle_stopped_cmd_ring()
> cur_cmd = list_next(), p++ (=1)
> mod_timer()
>
> OK, works, and same for just STOP case, with the only difference that
> during handle_cmd_completion(STOP) p is decremented (p--)

Yes, that's the cases what I want to handle, thanks for your explicit
explanation.

>
> So unless there is a way to find out if cur_cmd is valid in command timeout
> in command timeout with the help of existing flags and lists this would be a
> working
> solution.
>
> -Mathias
>

--
Baolin.wang
Best Regards

2016-12-19 10:33:11

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

On 13.12.2016 05:21, Baolin Wang wrote:
> Hi Mathias,
>
> On 12 December 2016 at 23:52, Mathias Nyman
> <[email protected]> wrote:
>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>
>>> If a command event is found on the event ring during an interrupt,
>>> we need to stop the command timer with del_timer(). Since del_timer()
>>> can fail if the timer is running and waiting on the xHCI lock, then
>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>> if host fetched a new command and updated the xhci->current_cmd in
>>> handle_cmd_completion(). For this situation, we need a way to signal
>>> to the command timer that everything is fine and it should exit.
>>
>>
>> Ah, right, this could actually happen.
>>
>>>
>>>
>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>> of pending commands. If we need to cancel the command timer and
>>> del_timer()
>>> succeeds, we decrement the number of pending commands. If del_timer()
>>> fails,
>>> we leave the number of pending commands alone.
>>>
>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>> check
>>> the counter after decrementing it, if the counter
>>> (xhci->current_cmd_pending)
>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>> counter (xhci->current_cmd_pending) is greater than 0, which means current
>>> timeout command has been handled by host and host has fetched new command
>>> as
>>> xhci->current_cmd, then just return and wait for new current command.
>>
>>
>> A counter like this could work.
>>
>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>> event, this seems to cover both.
>>
>> quick check, case 1: timeout and cmd completion at the same time.
>>
>> cpu1 cpu2
>>
>> queue_command(first), p++ (=1)
>> queue_command(more),
>> --completion irq fires-- -- timer times out at same time--
>> handle_cmd_completion() handle_cmd_timeout(),)
>> lock(xhci_lock ) spin_on(xhci_lock)
>> del_timer() fail, p (=1, nochange)
>> cur_cmd = list_next(), p++ (=2)
>> unlock(xhci_lock)
>> lock(xhci_lock)
>> p-- (=1)
>> if (p > 0), exit
>> OK works
>>
>> case 2: normal timeout case with ABORT+STOP, no race.
>>
>> cpu1 cpu2
>>
>> queue_command(first), p++ (=1)
>> queue_command(more),
>> handle_cmd_timeout()
>> p-- (P=0), don't exit
>> mod_timer(), p++ (P=1)
>> write_abort_bit()
>> handle_cmd_comletion(ABORT)
>> del_timer(), ok, p-- (p = 0)
>> handle_cmd_completion(STOP)
>> del_timer(), fail, (P=0)
>> handle_stopped_cmd_ring()
>> cur_cmd = list_next(), p++ (=1)
>> mod_timer()
>>
>> OK, works, and same for just STOP case, with the only difference that
>> during handle_cmd_completion(STOP) p is decremented (p--)
>
> Yes, that's the cases what I want to handle, thanks for your explicit
> explanation.
>

Gave this some more thought over the weekend, and this implementation
doesn't solve the case when the last command times out and races with the
completion handler:

cpu1 cpu2

queue_command(first), p++ (=1)
--completion irq fires-- -- timer times out at same time--
handle_cmd_completion() handle_cmd_timeout(),)
lock(xhci_lock ) spin_on(xhci_lock)
del_timer() fail, p (=1, nochange)
no more commands, P (=1, nochange)
unlock(xhci_lock)
lock(xhci_lock)
p-- (=0)
p == 0, continue, even if we should not.

For this we still need to rely on checking cur_cmd == NULL in the timeout function.
(Baolus patch sets it to NULL if there are no more commands pending)

And then we could replace the whole counter with a simple check if the timeout timer
is pending in the timeout function:

xhci_handle_command_timeout()
lock()
if (!cur_cmd || timer_pending(timeout_timer)) {
unlock();
return;
}

-Mathias

2016-12-19 11:34:31

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi Mathias,

On 19 December 2016 at 18:33, Mathias Nyman
<[email protected]> wrote:
> On 13.12.2016 05:21, Baolin Wang wrote:
>>
>> Hi Mathias,
>>
>> On 12 December 2016 at 23:52, Mathias Nyman
>> <[email protected]> wrote:
>>>
>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>
>>>>
>>>> If a command event is found on the event ring during an interrupt,
>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>> to the command timer that everything is fine and it should exit.
>>>
>>>
>>>
>>> Ah, right, this could actually happen.
>>>
>>>>
>>>>
>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>> of pending commands. If we need to cancel the command timer and
>>>> del_timer()
>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>> fails,
>>>> we leave the number of pending commands alone.
>>>>
>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>> check
>>>> the counter after decrementing it, if the counter
>>>> (xhci->current_cmd_pending)
>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>> current
>>>> timeout command has been handled by host and host has fetched new
>>>> command
>>>> as
>>>> xhci->current_cmd, then just return and wait for new current command.
>>>
>>>
>>>
>>> A counter like this could work.
>>>
>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>> event, this seems to cover both.
>>>
>>> quick check, case 1: timeout and cmd completion at the same time.
>>>
>>> cpu1 cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> queue_command(more),
>>> --completion irq fires-- -- timer times out at same time--
>>> handle_cmd_completion() handle_cmd_timeout(),)
>>> lock(xhci_lock ) spin_on(xhci_lock)
>>> del_timer() fail, p (=1, nochange)
>>> cur_cmd = list_next(), p++ (=2)
>>> unlock(xhci_lock)
>>> lock(xhci_lock)
>>> p-- (=1)
>>> if (p > 0), exit
>>> OK works
>>>
>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>
>>> cpu1 cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> queue_command(more),
>>> handle_cmd_timeout()
>>> p-- (P=0), don't exit
>>> mod_timer(), p++ (P=1)
>>> write_abort_bit()
>>> handle_cmd_comletion(ABORT)
>>> del_timer(), ok, p-- (p = 0)
>>> handle_cmd_completion(STOP)
>>> del_timer(), fail, (P=0)
>>> handle_stopped_cmd_ring()
>>> cur_cmd = list_next(), p++ (=1)
>>> mod_timer()
>>>
>>> OK, works, and same for just STOP case, with the only difference that
>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>
>>
>> Yes, that's the cases what I want to handle, thanks for your explicit
>> explanation.
>>
>
> Gave this some more thought over the weekend, and this implementation
> doesn't solve the case when the last command times out and races with the
> completion handler:
>
> cpu1 cpu2
>
> queue_command(first), p++ (=1)
> --completion irq fires-- -- timer times out at same time--
> handle_cmd_completion() handle_cmd_timeout(),)
> lock(xhci_lock ) spin_on(xhci_lock)
> del_timer() fail, p (=1, nochange)
> no more commands, P (=1, nochange)
> unlock(xhci_lock)
> lock(xhci_lock)
> p-- (=0)
> p == 0, continue, even if we should
> not.
> For this we still need to rely on
> checking cur_cmd == NULL in the timeout function.
> (Baolus patch sets it to NULL if there are no more commands pending)

As I pointed out in patch 1 of this patchset, this patchset is based
on Lu Baolu's new fix patch:
usb: xhci: fix possible wild pointer
https://www.spinics.net/lists/linux-usb/msg150219.html

After applying Baolu's patch, after decrement the counter, we will
check the xhci->cur_command if is NULL. So in this situation:
cpu1 cpu2

queue_command(first), p++ (=1)
--completion irq fires-- -- timer times out at same time--
handle_cmd_completion() handle_cmd_timeout(),)
lock(xhci_lock ) spin_on(xhci_lock)
del_timer() fail, p (=1, nochange)
no more commands, P (=1, nochange)
unlock(xhci_lock)
lock(xhci_lock)
p-- (=0)
no current command, return
if (!xhci->current_cmd) {
unlock(xhci_lock);
return;
}

It can work.

>
> And then we could replace the whole counter with a simple check if the
> timeout timer
> is pending in the timeout function:
>
> xhci_handle_command_timeout()
> lock()
> if (!cur_cmd || timer_pending(timeout_timer)) {
> unlock();
> return;
> }
>
> -Mathias
>

--
Baolin.wang
Best Regards

2016-12-19 12:12:30

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

On 19.12.2016 13:34, Baolin Wang wrote:
> Hi Mathias,
>
> On 19 December 2016 at 18:33, Mathias Nyman
> <[email protected]> wrote:
>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>
>>> Hi Mathias,
>>>
>>> On 12 December 2016 at 23:52, Mathias Nyman
>>> <[email protected]> wrote:
>>>>
>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> If a command event is found on the event ring during an interrupt,
>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>> to the command timer that everything is fine and it should exit.
>>>>
>>>>
>>>>
>>>> Ah, right, this could actually happen.
>>>>
>>>>>
>>>>>
>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>> of pending commands. If we need to cancel the command timer and
>>>>> del_timer()
>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>> fails,
>>>>> we leave the number of pending commands alone.
>>>>>
>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>> check
>>>>> the counter after decrementing it, if the counter
>>>>> (xhci->current_cmd_pending)
>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>> current
>>>>> timeout command has been handled by host and host has fetched new
>>>>> command
>>>>> as
>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>
>>>>
>>>>
>>>> A counter like this could work.
>>>>
>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>> event, this seems to cover both.
>>>>
>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>
>>>> cpu1 cpu2
>>>>
>>>> queue_command(first), p++ (=1)
>>>> queue_command(more),
>>>> --completion irq fires-- -- timer times out at same time--
>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>> del_timer() fail, p (=1, nochange)
>>>> cur_cmd = list_next(), p++ (=2)
>>>> unlock(xhci_lock)
>>>> lock(xhci_lock)
>>>> p-- (=1)
>>>> if (p > 0), exit
>>>> OK works
>>>>
>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>
>>>> cpu1 cpu2
>>>>
>>>> queue_command(first), p++ (=1)
>>>> queue_command(more),
>>>> handle_cmd_timeout()
>>>> p-- (P=0), don't exit
>>>> mod_timer(), p++ (P=1)
>>>> write_abort_bit()
>>>> handle_cmd_comletion(ABORT)
>>>> del_timer(), ok, p-- (p = 0)
>>>> handle_cmd_completion(STOP)
>>>> del_timer(), fail, (P=0)
>>>> handle_stopped_cmd_ring()
>>>> cur_cmd = list_next(), p++ (=1)
>>>> mod_timer()
>>>>
>>>> OK, works, and same for just STOP case, with the only difference that
>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>
>>>
>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>> explanation.
>>>
>>
>> Gave this some more thought over the weekend, and this implementation
>> doesn't solve the case when the last command times out and races with the
>> completion handler:
>>
>> cpu1 cpu2
>>
>> queue_command(first), p++ (=1)
>> --completion irq fires-- -- timer times out at same time--
>> handle_cmd_completion() handle_cmd_timeout(),)
>> lock(xhci_lock ) spin_on(xhci_lock)
>> del_timer() fail, p (=1, nochange)
>> no more commands, P (=1, nochange)
>> unlock(xhci_lock)
>> lock(xhci_lock)
>> p-- (=0)
>> p == 0, continue, even if we should
>> not.
>> For this we still need to rely on
>> checking cur_cmd == NULL in the timeout function.
>> (Baolus patch sets it to NULL if there are no more commands pending)
>
> As I pointed out in patch 1 of this patchset, this patchset is based
> on Lu Baolu's new fix patch:
> usb: xhci: fix possible wild pointer
> https://www.spinics.net/lists/linux-usb/msg150219.html
>
> After applying Baolu's patch, after decrement the counter, we will
> check the xhci->cur_command if is NULL. So in this situation:
> cpu1 cpu2
>
> queue_command(first), p++ (=1)
> --completion irq fires-- -- timer times out at same time--
> handle_cmd_completion() handle_cmd_timeout(),)
> lock(xhci_lock ) spin_on(xhci_lock)
> del_timer() fail, p (=1, nochange)
> no more commands, P (=1, nochange)
> unlock(xhci_lock)
> lock(xhci_lock)
> p-- (=0)
> no current command, return
> if (!xhci->current_cmd) {
> unlock(xhci_lock);
> return;
> }
>
> It can work.

Yes,

What I wanted to say is that as it relies on Baolus patch for that one case
it seems that patch 2/2 can be replaced by a single line change:

if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))

Right?

-Mathias

2016-12-20 03:23:32

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

On 19 December 2016 at 20:13, Mathias Nyman <[email protected]> wrote:
> On 19.12.2016 13:34, Baolin Wang wrote:
>>
>> Hi Mathias,
>>
>> On 19 December 2016 at 18:33, Mathias Nyman
>> <[email protected]> wrote:
>>>
>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>
>>>>
>>>> Hi Mathias,
>>>>
>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>> <[email protected]> wrote:
>>>>>
>>>>>
>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>> it maybe get the wrong timeout command in
>>>>>> xhci_handle_command_timeout()
>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>> to the command timer that everything is fine and it should exit.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Ah, right, this could actually happen.
>>>>>
>>>>>>
>>>>>>
>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the
>>>>>> number
>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>> del_timer()
>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>> fails,
>>>>>> we leave the number of pending commands alone.
>>>>>>
>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>> check
>>>>>> the counter after decrementing it, if the counter
>>>>>> (xhci->current_cmd_pending)
>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If
>>>>>> the
>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>> current
>>>>>> timeout command has been handled by host and host has fetched new
>>>>>> command
>>>>>> as
>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> A counter like this could work.
>>>>>
>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>> event, this seems to cover both.
>>>>>
>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>
>>>>> cpu1 cpu2
>>>>>
>>>>> queue_command(first), p++ (=1)
>>>>> queue_command(more),
>>>>> --completion irq fires-- -- timer times out at same
>>>>> time--
>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>> del_timer() fail, p (=1, nochange)
>>>>> cur_cmd = list_next(), p++ (=2)
>>>>> unlock(xhci_lock)
>>>>> lock(xhci_lock)
>>>>> p-- (=1)
>>>>> if (p > 0), exit
>>>>> OK works
>>>>>
>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>
>>>>> cpu1 cpu2
>>>>>
>>>>> queue_command(first), p++ (=1)
>>>>> queue_command(more),
>>>>> handle_cmd_timeout()
>>>>> p-- (P=0), don't exit
>>>>> mod_timer(), p++ (P=1)
>>>>> write_abort_bit()
>>>>> handle_cmd_comletion(ABORT)
>>>>> del_timer(), ok, p-- (p = 0)
>>>>> handle_cmd_completion(STOP)
>>>>> del_timer(), fail, (P=0)
>>>>> handle_stopped_cmd_ring()
>>>>> cur_cmd = list_next(), p++ (=1)
>>>>> mod_timer()
>>>>>
>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>
>>>>
>>>>
>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>> explanation.
>>>>
>>>
>>> Gave this some more thought over the weekend, and this implementation
>>> doesn't solve the case when the last command times out and races with the
>>> completion handler:
>>>
>>> cpu1 cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> --completion irq fires-- -- timer times out at same time--
>>> handle_cmd_completion() handle_cmd_timeout(),)
>>> lock(xhci_lock ) spin_on(xhci_lock)
>>> del_timer() fail, p (=1, nochange)
>>> no more commands, P (=1, nochange)
>>> unlock(xhci_lock)
>>> lock(xhci_lock)
>>> p-- (=0)
>>> p == 0, continue, even if we
>>> should
>>> not.
>>> For this we still need to rely
>>> on
>>> checking cur_cmd == NULL in the timeout function.
>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>
>>
>> As I pointed out in patch 1 of this patchset, this patchset is based
>> on Lu Baolu's new fix patch:
>> usb: xhci: fix possible wild pointer
>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>
>> After applying Baolu's patch, after decrement the counter, we will
>> check the xhci->cur_command if is NULL. So in this situation:
>> cpu1 cpu2
>>
>> queue_command(first), p++ (=1)
>> --completion irq fires-- -- timer times out at same
>> time--
>> handle_cmd_completion() handle_cmd_timeout(),)
>> lock(xhci_lock ) spin_on(xhci_lock)
>> del_timer() fail, p (=1, nochange)
>> no more commands, P (=1, nochange)
>> unlock(xhci_lock)
>> lock(xhci_lock)
>> p-- (=0)
>> no current command, return
>> if (!xhci->current_cmd) {
>> unlock(xhci_lock);
>> return;
>> }
>>
>> It can work.
>
>
> Yes,
>
> What I wanted to say is that as it relies on Baolus patch for that one case
> it seems that patch 2/2 can be replaced by a single line change:
>
> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>
> Right?

After checking the code again, I am agree with you. I will resend the
patch with fixing this issue. Thanks.

--
Baolin.wang
Best Regards

2016-12-20 04:29:41

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi Mathias,

On 12/19/2016 08:13 PM, Mathias Nyman wrote:
> On 19.12.2016 13:34, Baolin Wang wrote:
>> Hi Mathias,
>>
>> On 19 December 2016 at 18:33, Mathias Nyman
>> <[email protected]> wrote:
>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>
>>>> Hi Mathias,
>>>>
>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>> <[email protected]> wrote:
>>>>>
>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>> to the command timer that everything is fine and it should exit.
>>>>>
>>>>>
>>>>>
>>>>> Ah, right, this could actually happen.
>>>>>
>>>>>>
>>>>>>
>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>> del_timer()
>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>> fails,
>>>>>> we leave the number of pending commands alone.
>>>>>>
>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>> check
>>>>>> the counter after decrementing it, if the counter
>>>>>> (xhci->current_cmd_pending)
>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>> current
>>>>>> timeout command has been handled by host and host has fetched new
>>>>>> command
>>>>>> as
>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>
>>>>>
>>>>>
>>>>> A counter like this could work.
>>>>>
>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>> event, this seems to cover both.
>>>>>
>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>
>>>>> cpu1 cpu2
>>>>>
>>>>> queue_command(first), p++ (=1)
>>>>> queue_command(more),
>>>>> --completion irq fires-- -- timer times out at same time--
>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>> del_timer() fail, p (=1, nochange)
>>>>> cur_cmd = list_next(), p++ (=2)
>>>>> unlock(xhci_lock)
>>>>> lock(xhci_lock)
>>>>> p-- (=1)
>>>>> if (p > 0), exit
>>>>> OK works
>>>>>
>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>
>>>>> cpu1 cpu2
>>>>>
>>>>> queue_command(first), p++ (=1)
>>>>> queue_command(more),
>>>>> handle_cmd_timeout()
>>>>> p-- (P=0), don't exit
>>>>> mod_timer(), p++ (P=1)
>>>>> write_abort_bit()
>>>>> handle_cmd_comletion(ABORT)
>>>>> del_timer(), ok, p-- (p = 0)
>>>>> handle_cmd_completion(STOP)
>>>>> del_timer(), fail, (P=0)
>>>>> handle_stopped_cmd_ring()
>>>>> cur_cmd = list_next(), p++ (=1)
>>>>> mod_timer()
>>>>>
>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>
>>>>
>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>> explanation.
>>>>
>>>
>>> Gave this some more thought over the weekend, and this implementation
>>> doesn't solve the case when the last command times out and races with the
>>> completion handler:
>>>
>>> cpu1 cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> --completion irq fires-- -- timer times out at same time--
>>> handle_cmd_completion() handle_cmd_timeout(),)
>>> lock(xhci_lock ) spin_on(xhci_lock)
>>> del_timer() fail, p (=1, nochange)
>>> no more commands, P (=1, nochange)
>>> unlock(xhci_lock)
>>> lock(xhci_lock)
>>> p-- (=0)
>>> p == 0, continue, even if we should
>>> not.
>>> For this we still need to rely on
>>> checking cur_cmd == NULL in the timeout function.
>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>
>> As I pointed out in patch 1 of this patchset, this patchset is based
>> on Lu Baolu's new fix patch:
>> usb: xhci: fix possible wild pointer
>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>
>> After applying Baolu's patch, after decrement the counter, we will
>> check the xhci->cur_command if is NULL. So in this situation:
>> cpu1 cpu2
>>
>> queue_command(first), p++ (=1)
>> --completion irq fires-- -- timer times out at same time--
>> handle_cmd_completion() handle_cmd_timeout(),)
>> lock(xhci_lock ) spin_on(xhci_lock)
>> del_timer() fail, p (=1, nochange)
>> no more commands, P (=1, nochange)
>> unlock(xhci_lock)
>> lock(xhci_lock)
>> p-- (=0)
>> no current command, return
>> if (!xhci->current_cmd) {
>> unlock(xhci_lock);
>> return;
>> }
>>
>> It can work.
>
> Yes,
>
> What I wanted to say is that as it relies on Baolus patch for that one case
> it seems that patch 2/2 can be replaced by a single line change:
>
> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>
> Right?
>
> -Mathias
>

It seems that the watch dog algorithm for command queue becomes
more and more complicated and hard for maintain. I am also seeing
another case where a command may lose the chance to be tracked by
the watch dog timer.

Say,

queue_command(the only command in queue)
- completion irq fires-- - timer times out at same time-- - another command enqueue--
- lock(xhci_lock ) - spin_on(xhci_lock) - spin_on(xhci_lock)
- del_timer() fail
- free the command and
set current_cmd to NULL
- unlock(xhci_lock)
- lock(xhci_lock)
- queue_command()(timer will
not rescheduled since the timer
is pending)
- lock(xhci_lock)
- no current command
- return

As the result, the later command isn't under track of the watch dog.
If hardware fails to response to this command, kernel will hang in
the thread which is waiting for the completion of the command.

I can write a patch to fix this and cc stable kernel as well. For long
term, in order to make it simple and easy to maintain, how about
allocating a watch dog timer for each command? It could be part
of the command structure and be managed just like the life cycle
of a command structure.

I can write a patch for review and discussion, if you think this
change is possible.

Best regards,
Lu Baolu

2016-12-20 06:06:47

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi,

On 20 December 2016 at 12:29, Lu Baolu <[email protected]> wrote:
> Hi Mathias,
>
> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>> On 19.12.2016 13:34, Baolin Wang wrote:
>>> Hi Mathias,
>>>
>>> On 19 December 2016 at 18:33, Mathias Nyman
>>> <[email protected]> wrote:
>>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>>
>>>>> Hi Mathias,
>>>>>
>>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>>> to the command timer that everything is fine and it should exit.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Ah, right, this could actually happen.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>>> del_timer()
>>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>>> fails,
>>>>>>> we leave the number of pending commands alone.
>>>>>>>
>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>>> check
>>>>>>> the counter after decrementing it, if the counter
>>>>>>> (xhci->current_cmd_pending)
>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>>> current
>>>>>>> timeout command has been handled by host and host has fetched new
>>>>>>> command
>>>>>>> as
>>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>>
>>>>>>
>>>>>>
>>>>>> A counter like this could work.
>>>>>>
>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>>> event, this seems to cover both.
>>>>>>
>>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>>
>>>>>> cpu1 cpu2
>>>>>>
>>>>>> queue_command(first), p++ (=1)
>>>>>> queue_command(more),
>>>>>> --completion irq fires-- -- timer times out at same time--
>>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>>> del_timer() fail, p (=1, nochange)
>>>>>> cur_cmd = list_next(), p++ (=2)
>>>>>> unlock(xhci_lock)
>>>>>> lock(xhci_lock)
>>>>>> p-- (=1)
>>>>>> if (p > 0), exit
>>>>>> OK works
>>>>>>
>>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>>
>>>>>> cpu1 cpu2
>>>>>>
>>>>>> queue_command(first), p++ (=1)
>>>>>> queue_command(more),
>>>>>> handle_cmd_timeout()
>>>>>> p-- (P=0), don't exit
>>>>>> mod_timer(), p++ (P=1)
>>>>>> write_abort_bit()
>>>>>> handle_cmd_comletion(ABORT)
>>>>>> del_timer(), ok, p-- (p = 0)
>>>>>> handle_cmd_completion(STOP)
>>>>>> del_timer(), fail, (P=0)
>>>>>> handle_stopped_cmd_ring()
>>>>>> cur_cmd = list_next(), p++ (=1)
>>>>>> mod_timer()
>>>>>>
>>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>>
>>>>>
>>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>>> explanation.
>>>>>
>>>>
>>>> Gave this some more thought over the weekend, and this implementation
>>>> doesn't solve the case when the last command times out and races with the
>>>> completion handler:
>>>>
>>>> cpu1 cpu2
>>>>
>>>> queue_command(first), p++ (=1)
>>>> --completion irq fires-- -- timer times out at same time--
>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>> del_timer() fail, p (=1, nochange)
>>>> no more commands, P (=1, nochange)
>>>> unlock(xhci_lock)
>>>> lock(xhci_lock)
>>>> p-- (=0)
>>>> p == 0, continue, even if we should
>>>> not.
>>>> For this we still need to rely on
>>>> checking cur_cmd == NULL in the timeout function.
>>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>>
>>> As I pointed out in patch 1 of this patchset, this patchset is based
>>> on Lu Baolu's new fix patch:
>>> usb: xhci: fix possible wild pointer
>>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>>
>>> After applying Baolu's patch, after decrement the counter, we will
>>> check the xhci->cur_command if is NULL. So in this situation:
>>> cpu1 cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> --completion irq fires-- -- timer times out at same time--
>>> handle_cmd_completion() handle_cmd_timeout(),)
>>> lock(xhci_lock ) spin_on(xhci_lock)
>>> del_timer() fail, p (=1, nochange)
>>> no more commands, P (=1, nochange)
>>> unlock(xhci_lock)
>>> lock(xhci_lock)
>>> p-- (=0)
>>> no current command, return
>>> if (!xhci->current_cmd) {
>>> unlock(xhci_lock);
>>> return;
>>> }
>>>
>>> It can work.
>>
>> Yes,
>>
>> What I wanted to say is that as it relies on Baolus patch for that one case
>> it seems that patch 2/2 can be replaced by a single line change:
>>
>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>>
>> Right?
>>
>> -Mathias
>>
>
> It seems that the watch dog algorithm for command queue becomes
> more and more complicated and hard for maintain. I am also seeing
> another case where a command may lose the chance to be tracked by
> the watch dog timer.
>
> Say,
>
> queue_command(the only command in queue)
> - completion irq fires-- - timer times out at same time-- - another command enqueue--
> - lock(xhci_lock ) - spin_on(xhci_lock) - spin_on(xhci_lock)
> - del_timer() fail
> - free the command and
> set current_cmd to NULL
> - unlock(xhci_lock)
> - lock(xhci_lock)
> - queue_command()(timer will
> not rescheduled since the timer
> is pending)

In this case, since the command timer was fired and you did not re-add
the command timer, why here timer is pending? Maybe I missed
something? Thanks.

> - lock(xhci_lock)
> - no current command
> - return
>
> As the result, the later command isn't under track of the watch dog.
> If hardware fails to response to this command, kernel will hang in
> the thread which is waiting for the completion of the command.
>
> I can write a patch to fix this and cc stable kernel as well. For long
> term, in order to make it simple and easy to maintain, how about
> allocating a watch dog timer for each command? It could be part
> of the command structure and be managed just like the life cycle
> of a command structure.
>
> I can write a patch for review and discussion, if you think this
> change is possible.
>
> Best regards,
> Lu Baolu

--
Baolin.wang
Best Regards

2016-12-20 06:39:50

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi,

On 12/20/2016 02:06 PM, Baolin Wang wrote:
> Hi,
>
> On 20 December 2016 at 12:29, Lu Baolu <[email protected]> wrote:
>> Hi Mathias,
>>
>> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>>> On 19.12.2016 13:34, Baolin Wang wrote:
>>>> Hi Mathias,
>>>>
>>>> On 19 December 2016 at 18:33, Mathias Nyman
>>>> <[email protected]> wrote:
>>>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>>> Hi Mathias,
>>>>>>
>>>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>>>> <[email protected]> wrote:
>>>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>>>
>>>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>>>> to the command timer that everything is fine and it should exit.
>>>>>>>
>>>>>>>
>>>>>>> Ah, right, this could actually happen.
>>>>>>>
>>>>>>>>
>>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>>>> del_timer()
>>>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>>>> fails,
>>>>>>>> we leave the number of pending commands alone.
>>>>>>>>
>>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>>>> check
>>>>>>>> the counter after decrementing it, if the counter
>>>>>>>> (xhci->current_cmd_pending)
>>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>>>> current
>>>>>>>> timeout command has been handled by host and host has fetched new
>>>>>>>> command
>>>>>>>> as
>>>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>>>
>>>>>>>
>>>>>>> A counter like this could work.
>>>>>>>
>>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>>>> event, this seems to cover both.
>>>>>>>
>>>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>>>
>>>>>>> cpu1 cpu2
>>>>>>>
>>>>>>> queue_command(first), p++ (=1)
>>>>>>> queue_command(more),
>>>>>>> --completion irq fires-- -- timer times out at same time--
>>>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>> cur_cmd = list_next(), p++ (=2)
>>>>>>> unlock(xhci_lock)
>>>>>>> lock(xhci_lock)
>>>>>>> p-- (=1)
>>>>>>> if (p > 0), exit
>>>>>>> OK works
>>>>>>>
>>>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>>>
>>>>>>> cpu1 cpu2
>>>>>>>
>>>>>>> queue_command(first), p++ (=1)
>>>>>>> queue_command(more),
>>>>>>> handle_cmd_timeout()
>>>>>>> p-- (P=0), don't exit
>>>>>>> mod_timer(), p++ (P=1)
>>>>>>> write_abort_bit()
>>>>>>> handle_cmd_comletion(ABORT)
>>>>>>> del_timer(), ok, p-- (p = 0)
>>>>>>> handle_cmd_completion(STOP)
>>>>>>> del_timer(), fail, (P=0)
>>>>>>> handle_stopped_cmd_ring()
>>>>>>> cur_cmd = list_next(), p++ (=1)
>>>>>>> mod_timer()
>>>>>>>
>>>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>>>
>>>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>>>> explanation.
>>>>>>
>>>>> Gave this some more thought over the weekend, and this implementation
>>>>> doesn't solve the case when the last command times out and races with the
>>>>> completion handler:
>>>>>
>>>>> cpu1 cpu2
>>>>>
>>>>> queue_command(first), p++ (=1)
>>>>> --completion irq fires-- -- timer times out at same time--
>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>> del_timer() fail, p (=1, nochange)
>>>>> no more commands, P (=1, nochange)
>>>>> unlock(xhci_lock)
>>>>> lock(xhci_lock)
>>>>> p-- (=0)
>>>>> p == 0, continue, even if we should
>>>>> not.
>>>>> For this we still need to rely on
>>>>> checking cur_cmd == NULL in the timeout function.
>>>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>>> As I pointed out in patch 1 of this patchset, this patchset is based
>>>> on Lu Baolu's new fix patch:
>>>> usb: xhci: fix possible wild pointer
>>>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>>>
>>>> After applying Baolu's patch, after decrement the counter, we will
>>>> check the xhci->cur_command if is NULL. So in this situation:
>>>> cpu1 cpu2
>>>>
>>>> queue_command(first), p++ (=1)
>>>> --completion irq fires-- -- timer times out at same time--
>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>> del_timer() fail, p (=1, nochange)
>>>> no more commands, P (=1, nochange)
>>>> unlock(xhci_lock)
>>>> lock(xhci_lock)
>>>> p-- (=0)
>>>> no current command, return
>>>> if (!xhci->current_cmd) {
>>>> unlock(xhci_lock);
>>>> return;
>>>> }
>>>>
>>>> It can work.
>>> Yes,
>>>
>>> What I wanted to say is that as it relies on Baolus patch for that one case
>>> it seems that patch 2/2 can be replaced by a single line change:
>>>
>>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>>>
>>> Right?
>>>
>>> -Mathias
>>>
>> It seems that the watch dog algorithm for command queue becomes
>> more and more complicated and hard for maintain. I am also seeing
>> another case where a command may lose the chance to be tracked by
>> the watch dog timer.
>>
>> Say,
>>
>> queue_command(the only command in queue)
>> - completion irq fires-- - timer times out at same time-- - another command enqueue--
>> - lock(xhci_lock ) - spin_on(xhci_lock) - spin_on(xhci_lock)
>> - del_timer() fail
>> - free the command and
>> set current_cmd to NULL
>> - unlock(xhci_lock)
>> - lock(xhci_lock)
>> - queue_command()(timer will
>> not rescheduled since the timer
>> is pending)
> In this case, since the command timer was fired and you did not re-add
> the command timer, why here timer is pending? Maybe I missed
> something? Thanks.

In queue_command(),

/* if there are no other commands queued we start the timeout timer */
if (list_is_singular(&xhci->cmd_list) &&
!timer_pending(&xhci->cmd_timer)) {
xhci->current_cmd = cmd;
mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
}

timer_pending() will return true if the timer is fired, but the function is still
running on another CPU. Do I understand it right?

Best regards,
Lu Baolu

>> - lock(xhci_lock)
>> - no current command
>> - return
>>
>> As the result, the later command isn't under track of the watch dog.
>> If hardware fails to response to this command, kernel will hang in
>> the thread which is waiting for the completion of the command.
>>
>> I can write a patch to fix this and cc stable kernel as well. For long
>> term, in order to make it simple and easy to maintain, how about
>> allocating a watch dog timer for each command? It could be part
>> of the command structure and be managed just like the life cycle
>> of a command structure.
>>
>> I can write a patch for review and discussion, if you think this
>> change is possible.
>>
>> Best regards,
>> Lu Baolu
>
>

2016-12-20 06:46:47

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

On 20 December 2016 at 14:39, Lu Baolu <[email protected]> wrote:
> Hi,
>
> On 12/20/2016 02:06 PM, Baolin Wang wrote:
>> Hi,
>>
>> On 20 December 2016 at 12:29, Lu Baolu <[email protected]> wrote:
>>> Hi Mathias,
>>>
>>> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>>>> On 19.12.2016 13:34, Baolin Wang wrote:
>>>>> Hi Mathias,
>>>>>
>>>>> On 19 December 2016 at 18:33, Mathias Nyman
>>>>> <[email protected]> wrote:
>>>>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>>>> Hi Mathias,
>>>>>>>
>>>>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>>>>> <[email protected]> wrote:
>>>>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>>>>
>>>>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>>>>> to the command timer that everything is fine and it should exit.
>>>>>>>>
>>>>>>>>
>>>>>>>> Ah, right, this could actually happen.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>>>>> del_timer()
>>>>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>>>>> fails,
>>>>>>>>> we leave the number of pending commands alone.
>>>>>>>>>
>>>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>>>>> check
>>>>>>>>> the counter after decrementing it, if the counter
>>>>>>>>> (xhci->current_cmd_pending)
>>>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>>>>> current
>>>>>>>>> timeout command has been handled by host and host has fetched new
>>>>>>>>> command
>>>>>>>>> as
>>>>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>>>>
>>>>>>>>
>>>>>>>> A counter like this could work.
>>>>>>>>
>>>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>>>>> event, this seems to cover both.
>>>>>>>>
>>>>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>>>>
>>>>>>>> cpu1 cpu2
>>>>>>>>
>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>> queue_command(more),
>>>>>>>> --completion irq fires-- -- timer times out at same time--
>>>>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>>> cur_cmd = list_next(), p++ (=2)
>>>>>>>> unlock(xhci_lock)
>>>>>>>> lock(xhci_lock)
>>>>>>>> p-- (=1)
>>>>>>>> if (p > 0), exit
>>>>>>>> OK works
>>>>>>>>
>>>>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>>>>
>>>>>>>> cpu1 cpu2
>>>>>>>>
>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>> queue_command(more),
>>>>>>>> handle_cmd_timeout()
>>>>>>>> p-- (P=0), don't exit
>>>>>>>> mod_timer(), p++ (P=1)
>>>>>>>> write_abort_bit()
>>>>>>>> handle_cmd_comletion(ABORT)
>>>>>>>> del_timer(), ok, p-- (p = 0)
>>>>>>>> handle_cmd_completion(STOP)
>>>>>>>> del_timer(), fail, (P=0)
>>>>>>>> handle_stopped_cmd_ring()
>>>>>>>> cur_cmd = list_next(), p++ (=1)
>>>>>>>> mod_timer()
>>>>>>>>
>>>>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>>>>
>>>>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>>>>> explanation.
>>>>>>>
>>>>>> Gave this some more thought over the weekend, and this implementation
>>>>>> doesn't solve the case when the last command times out and races with the
>>>>>> completion handler:
>>>>>>
>>>>>> cpu1 cpu2
>>>>>>
>>>>>> queue_command(first), p++ (=1)
>>>>>> --completion irq fires-- -- timer times out at same time--
>>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>>> del_timer() fail, p (=1, nochange)
>>>>>> no more commands, P (=1, nochange)
>>>>>> unlock(xhci_lock)
>>>>>> lock(xhci_lock)
>>>>>> p-- (=0)
>>>>>> p == 0, continue, even if we should
>>>>>> not.
>>>>>> For this we still need to rely on
>>>>>> checking cur_cmd == NULL in the timeout function.
>>>>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>>>> As I pointed out in patch 1 of this patchset, this patchset is based
>>>>> on Lu Baolu's new fix patch:
>>>>> usb: xhci: fix possible wild pointer
>>>>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>>>>
>>>>> After applying Baolu's patch, after decrement the counter, we will
>>>>> check the xhci->cur_command if is NULL. So in this situation:
>>>>> cpu1 cpu2
>>>>>
>>>>> queue_command(first), p++ (=1)
>>>>> --completion irq fires-- -- timer times out at same time--
>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>> del_timer() fail, p (=1, nochange)
>>>>> no more commands, P (=1, nochange)
>>>>> unlock(xhci_lock)
>>>>> lock(xhci_lock)
>>>>> p-- (=0)
>>>>> no current command, return
>>>>> if (!xhci->current_cmd) {
>>>>> unlock(xhci_lock);
>>>>> return;
>>>>> }
>>>>>
>>>>> It can work.
>>>> Yes,
>>>>
>>>> What I wanted to say is that as it relies on Baolus patch for that one case
>>>> it seems that patch 2/2 can be replaced by a single line change:
>>>>
>>>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>>>>
>>>> Right?
>>>>
>>>> -Mathias
>>>>
>>> It seems that the watch dog algorithm for command queue becomes
>>> more and more complicated and hard for maintain. I am also seeing
>>> another case where a command may lose the chance to be tracked by
>>> the watch dog timer.
>>>
>>> Say,
>>>
>>> queue_command(the only command in queue)
>>> - completion irq fires-- - timer times out at same time-- - another command enqueue--
>>> - lock(xhci_lock ) - spin_on(xhci_lock) - spin_on(xhci_lock)
>>> - del_timer() fail
>>> - free the command and
>>> set current_cmd to NULL
>>> - unlock(xhci_lock)
>>> - lock(xhci_lock)
>>> - queue_command()(timer will
>>> not rescheduled since the timer
>>> is pending)
>> In this case, since the command timer was fired and you did not re-add
>> the command timer, why here timer is pending? Maybe I missed
>> something? Thanks.
>
> In queue_command(),
>
> /* if there are no other commands queued we start the timeout timer */
> if (list_is_singular(&xhci->cmd_list) &&
> !timer_pending(&xhci->cmd_timer)) {
> xhci->current_cmd = cmd;
> mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
> }
>
> timer_pending() will return true if the timer is fired, but the function is still
> running on another CPU. Do I understand it right?

>From my understanding, if the timer was fired, no matter the timeout
function is running or finished, timer_pending() will return false.
Please correct me if I made mistakes. Thanks.

>
> Best regards,
> Lu Baolu
>
>>> - lock(xhci_lock)
>>> - no current command
>>> - return
>>>
>>> As the result, the later command isn't under track of the watch dog.
>>> If hardware fails to response to this command, kernel will hang in
>>> the thread which is waiting for the completion of the command.
>>>
>>> I can write a patch to fix this and cc stable kernel as well. For long
>>> term, in order to make it simple and easy to maintain, how about
>>> allocating a watch dog timer for each command? It could be part
>>> of the command structure and be managed just like the life cycle
>>> of a command structure.
>>>
>>> I can write a patch for review and discussion, if you think this
>>> change is possible.
>>>
>>> Best regards,
>>> Lu Baolu
>>
>>
>

--
Baolin.wang
Best Regards

2016-12-20 07:19:12

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi,

On 12/20/2016 02:46 PM, Baolin Wang wrote:
> On 20 December 2016 at 14:39, Lu Baolu <[email protected]> wrote:
>> Hi,
>>
>> On 12/20/2016 02:06 PM, Baolin Wang wrote:
>>> Hi,
>>>
>>> On 20 December 2016 at 12:29, Lu Baolu <[email protected]> wrote:
>>>> Hi Mathias,
>>>>
>>>> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>>>>> On 19.12.2016 13:34, Baolin Wang wrote:
>>>>>> Hi Mathias,
>>>>>>
>>>>>> On 19 December 2016 at 18:33, Mathias Nyman
>>>>>> <[email protected]> wrote:
>>>>>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>>>>> Hi Mathias,
>>>>>>>>
>>>>>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>>>>>> to the command timer that everything is fine and it should exit.
>>>>>>>>>
>>>>>>>>> Ah, right, this could actually happen.
>>>>>>>>>
>>>>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>>>>>> del_timer()
>>>>>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>>>>>> fails,
>>>>>>>>>> we leave the number of pending commands alone.
>>>>>>>>>>
>>>>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>>>>>> check
>>>>>>>>>> the counter after decrementing it, if the counter
>>>>>>>>>> (xhci->current_cmd_pending)
>>>>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>>>>>> current
>>>>>>>>>> timeout command has been handled by host and host has fetched new
>>>>>>>>>> command
>>>>>>>>>> as
>>>>>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>>>>>
>>>>>>>>> A counter like this could work.
>>>>>>>>>
>>>>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>>>>>> event, this seems to cover both.
>>>>>>>>>
>>>>>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>>>>>
>>>>>>>>> cpu1 cpu2
>>>>>>>>>
>>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>>> queue_command(more),
>>>>>>>>> --completion irq fires-- -- timer times out at same time--
>>>>>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>>>> cur_cmd = list_next(), p++ (=2)
>>>>>>>>> unlock(xhci_lock)
>>>>>>>>> lock(xhci_lock)
>>>>>>>>> p-- (=1)
>>>>>>>>> if (p > 0), exit
>>>>>>>>> OK works
>>>>>>>>>
>>>>>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>>>>>
>>>>>>>>> cpu1 cpu2
>>>>>>>>>
>>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>>> queue_command(more),
>>>>>>>>> handle_cmd_timeout()
>>>>>>>>> p-- (P=0), don't exit
>>>>>>>>> mod_timer(), p++ (P=1)
>>>>>>>>> write_abort_bit()
>>>>>>>>> handle_cmd_comletion(ABORT)
>>>>>>>>> del_timer(), ok, p-- (p = 0)
>>>>>>>>> handle_cmd_completion(STOP)
>>>>>>>>> del_timer(), fail, (P=0)
>>>>>>>>> handle_stopped_cmd_ring()
>>>>>>>>> cur_cmd = list_next(), p++ (=1)
>>>>>>>>> mod_timer()
>>>>>>>>>
>>>>>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>>>>>> explanation.
>>>>>>>>
>>>>>>> Gave this some more thought over the weekend, and this implementation
>>>>>>> doesn't solve the case when the last command times out and races with the
>>>>>>> completion handler:
>>>>>>>
>>>>>>> cpu1 cpu2
>>>>>>>
>>>>>>> queue_command(first), p++ (=1)
>>>>>>> --completion irq fires-- -- timer times out at same time--
>>>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>> no more commands, P (=1, nochange)
>>>>>>> unlock(xhci_lock)
>>>>>>> lock(xhci_lock)
>>>>>>> p-- (=0)
>>>>>>> p == 0, continue, even if we should
>>>>>>> not.
>>>>>>> For this we still need to rely on
>>>>>>> checking cur_cmd == NULL in the timeout function.
>>>>>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>>>>> As I pointed out in patch 1 of this patchset, this patchset is based
>>>>>> on Lu Baolu's new fix patch:
>>>>>> usb: xhci: fix possible wild pointer
>>>>>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>>>>>
>>>>>> After applying Baolu's patch, after decrement the counter, we will
>>>>>> check the xhci->cur_command if is NULL. So in this situation:
>>>>>> cpu1 cpu2
>>>>>>
>>>>>> queue_command(first), p++ (=1)
>>>>>> --completion irq fires-- -- timer times out at same time--
>>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>>> del_timer() fail, p (=1, nochange)
>>>>>> no more commands, P (=1, nochange)
>>>>>> unlock(xhci_lock)
>>>>>> lock(xhci_lock)
>>>>>> p-- (=0)
>>>>>> no current command, return
>>>>>> if (!xhci->current_cmd) {
>>>>>> unlock(xhci_lock);
>>>>>> return;
>>>>>> }
>>>>>>
>>>>>> It can work.
>>>>> Yes,
>>>>>
>>>>> What I wanted to say is that as it relies on Baolus patch for that one case
>>>>> it seems that patch 2/2 can be replaced by a single line change:
>>>>>
>>>>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>>>>>
>>>>> Right?
>>>>>
>>>>> -Mathias
>>>>>
>>>> It seems that the watch dog algorithm for command queue becomes
>>>> more and more complicated and hard for maintain. I am also seeing
>>>> another case where a command may lose the chance to be tracked by
>>>> the watch dog timer.
>>>>
>>>> Say,
>>>>
>>>> queue_command(the only command in queue)
>>>> - completion irq fires-- - timer times out at same time-- - another command enqueue--
>>>> - lock(xhci_lock ) - spin_on(xhci_lock) - spin_on(xhci_lock)
>>>> - del_timer() fail
>>>> - free the command and
>>>> set current_cmd to NULL
>>>> - unlock(xhci_lock)
>>>> - lock(xhci_lock)
>>>> - queue_command()(timer will
>>>> not rescheduled since the timer
>>>> is pending)
>>> In this case, since the command timer was fired and you did not re-add
>>> the command timer, why here timer is pending? Maybe I missed
>>> something? Thanks.
>> In queue_command(),
>>
>> /* if there are no other commands queued we start the timeout timer */
>> if (list_is_singular(&xhci->cmd_list) &&
>> !timer_pending(&xhci->cmd_timer)) {
>> xhci->current_cmd = cmd;
>> mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
>> }
>>
>> timer_pending() will return true if the timer is fired, but the function is still
>> running on another CPU. Do I understand it right?
> >From my understanding, if the timer was fired, no matter the timeout
> function is running or finished, timer_pending() will return false.
> Please correct me if I made mistakes. Thanks.

Just looked into kernel/time/timer.c. You are right.

The pending is cleared in expire_timers() before call the timer function.

Base on this fact, we don't need to check timer_pending() at all in below code.

/* if there are no other commands queued we start the timeout timer */
if (xhci->cmd_list.next == &cmd->cmd_list &&
!timer_pending(&xhci->cmd_timer)) {
xhci->current_cmd = cmd;
mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
}

Best regards,
Lu Baolu

>
>> Best regards,
>> Lu Baolu
>>
>>>> - lock(xhci_lock)
>>>> - no current command
>>>> - return
>>>>
>>>> As the result, the later command isn't under track of the watch dog.
>>>> If hardware fails to response to this command, kernel will hang in
>>>> the thread which is waiting for the completion of the command.
>>>>
>>>> I can write a patch to fix this and cc stable kernel as well. For long
>>>> term, in order to make it simple and easy to maintain, how about
>>>> allocating a watch dog timer for each command? It could be part
>>>> of the command structure and be managed just like the life cycle
>>>> of a command structure.
>>>>
>>>> I can write a patch for review and discussion, if you think this
>>>> change is possible.
>>>>
>>>> Best regards,
>>>> Lu Baolu
>>>
>
>

2016-12-20 07:31:03

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi,

On 20 December 2016 at 15:18, Lu Baolu <[email protected]> wrote:
> Hi,
>
> On 12/20/2016 02:46 PM, Baolin Wang wrote:
>> On 20 December 2016 at 14:39, Lu Baolu <[email protected]> wrote:
>>> Hi,
>>>
>>> On 12/20/2016 02:06 PM, Baolin Wang wrote:
>>>> Hi,
>>>>
>>>> On 20 December 2016 at 12:29, Lu Baolu <[email protected]> wrote:
>>>>> Hi Mathias,
>>>>>
>>>>> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>>>>>> On 19.12.2016 13:34, Baolin Wang wrote:
>>>>>>> Hi Mathias,
>>>>>>>
>>>>>>> On 19 December 2016 at 18:33, Mathias Nyman
>>>>>>> <[email protected]> wrote:
>>>>>>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>>>>>> Hi Mathias,
>>>>>>>>>
>>>>>>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>>>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>>>>>>> to the command timer that everything is fine and it should exit.
>>>>>>>>>>
>>>>>>>>>> Ah, right, this could actually happen.
>>>>>>>>>>
>>>>>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>>>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>>>>>>> del_timer()
>>>>>>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>>>>>>> fails,
>>>>>>>>>>> we leave the number of pending commands alone.
>>>>>>>>>>>
>>>>>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>>>>>>> check
>>>>>>>>>>> the counter after decrementing it, if the counter
>>>>>>>>>>> (xhci->current_cmd_pending)
>>>>>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>>>>>>> current
>>>>>>>>>>> timeout command has been handled by host and host has fetched new
>>>>>>>>>>> command
>>>>>>>>>>> as
>>>>>>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>>>>>>
>>>>>>>>>> A counter like this could work.
>>>>>>>>>>
>>>>>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>>>>>>> event, this seems to cover both.
>>>>>>>>>>
>>>>>>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>>>>>>
>>>>>>>>>> cpu1 cpu2
>>>>>>>>>>
>>>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>>>> queue_command(more),
>>>>>>>>>> --completion irq fires-- -- timer times out at same time--
>>>>>>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>>>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>>>>> cur_cmd = list_next(), p++ (=2)
>>>>>>>>>> unlock(xhci_lock)
>>>>>>>>>> lock(xhci_lock)
>>>>>>>>>> p-- (=1)
>>>>>>>>>> if (p > 0), exit
>>>>>>>>>> OK works
>>>>>>>>>>
>>>>>>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>>>>>>
>>>>>>>>>> cpu1 cpu2
>>>>>>>>>>
>>>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>>>> queue_command(more),
>>>>>>>>>> handle_cmd_timeout()
>>>>>>>>>> p-- (P=0), don't exit
>>>>>>>>>> mod_timer(), p++ (P=1)
>>>>>>>>>> write_abort_bit()
>>>>>>>>>> handle_cmd_comletion(ABORT)
>>>>>>>>>> del_timer(), ok, p-- (p = 0)
>>>>>>>>>> handle_cmd_completion(STOP)
>>>>>>>>>> del_timer(), fail, (P=0)
>>>>>>>>>> handle_stopped_cmd_ring()
>>>>>>>>>> cur_cmd = list_next(), p++ (=1)
>>>>>>>>>> mod_timer()
>>>>>>>>>>
>>>>>>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>>>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>>>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>>>>>>> explanation.
>>>>>>>>>
>>>>>>>> Gave this some more thought over the weekend, and this implementation
>>>>>>>> doesn't solve the case when the last command times out and races with the
>>>>>>>> completion handler:
>>>>>>>>
>>>>>>>> cpu1 cpu2
>>>>>>>>
>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>> --completion irq fires-- -- timer times out at same time--
>>>>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>>> no more commands, P (=1, nochange)
>>>>>>>> unlock(xhci_lock)
>>>>>>>> lock(xhci_lock)
>>>>>>>> p-- (=0)
>>>>>>>> p == 0, continue, even if we should
>>>>>>>> not.
>>>>>>>> For this we still need to rely on
>>>>>>>> checking cur_cmd == NULL in the timeout function.
>>>>>>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>>>>>> As I pointed out in patch 1 of this patchset, this patchset is based
>>>>>>> on Lu Baolu's new fix patch:
>>>>>>> usb: xhci: fix possible wild pointer
>>>>>>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>>>>>>
>>>>>>> After applying Baolu's patch, after decrement the counter, we will
>>>>>>> check the xhci->cur_command if is NULL. So in this situation:
>>>>>>> cpu1 cpu2
>>>>>>>
>>>>>>> queue_command(first), p++ (=1)
>>>>>>> --completion irq fires-- -- timer times out at same time--
>>>>>>> handle_cmd_completion() handle_cmd_timeout(),)
>>>>>>> lock(xhci_lock ) spin_on(xhci_lock)
>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>> no more commands, P (=1, nochange)
>>>>>>> unlock(xhci_lock)
>>>>>>> lock(xhci_lock)
>>>>>>> p-- (=0)
>>>>>>> no current command, return
>>>>>>> if (!xhci->current_cmd) {
>>>>>>> unlock(xhci_lock);
>>>>>>> return;
>>>>>>> }
>>>>>>>
>>>>>>> It can work.
>>>>>> Yes,
>>>>>>
>>>>>> What I wanted to say is that as it relies on Baolus patch for that one case
>>>>>> it seems that patch 2/2 can be replaced by a single line change:
>>>>>>
>>>>>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>>>>>>
>>>>>> Right?
>>>>>>
>>>>>> -Mathias
>>>>>>
>>>>> It seems that the watch dog algorithm for command queue becomes
>>>>> more and more complicated and hard for maintain. I am also seeing
>>>>> another case where a command may lose the chance to be tracked by
>>>>> the watch dog timer.
>>>>>
>>>>> Say,
>>>>>
>>>>> queue_command(the only command in queue)
>>>>> - completion irq fires-- - timer times out at same time-- - another command enqueue--
>>>>> - lock(xhci_lock ) - spin_on(xhci_lock) - spin_on(xhci_lock)
>>>>> - del_timer() fail
>>>>> - free the command and
>>>>> set current_cmd to NULL
>>>>> - unlock(xhci_lock)
>>>>> - lock(xhci_lock)
>>>>> - queue_command()(timer will
>>>>> not rescheduled since the timer
>>>>> is pending)
>>>> In this case, since the command timer was fired and you did not re-add
>>>> the command timer, why here timer is pending? Maybe I missed
>>>> something? Thanks.
>>> In queue_command(),
>>>
>>> /* if there are no other commands queued we start the timeout timer */
>>> if (list_is_singular(&xhci->cmd_list) &&
>>> !timer_pending(&xhci->cmd_timer)) {
>>> xhci->current_cmd = cmd;
>>> mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
>>> }
>>>
>>> timer_pending() will return true if the timer is fired, but the function is still
>>> running on another CPU. Do I understand it right?
>> >From my understanding, if the timer was fired, no matter the timeout
>> function is running or finished, timer_pending() will return false.
>> Please correct me if I made mistakes. Thanks.
>
> Just looked into kernel/time/timer.c. You are right.
>
> The pending is cleared in expire_timers() before call the timer function.
>
> Base on this fact, we don't need to check timer_pending() at all in below code.

Indeed, I think so too.

>
> /* if there are no other commands queued we start the timeout timer */
> if (xhci->cmd_list.next == &cmd->cmd_list &&
> !timer_pending(&xhci->cmd_timer)) {
> xhci->current_cmd = cmd;
> mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
> }
>
>
> Best regards,
> Lu Baolu
>
>>
>>> Best regards,
>>> Lu Baolu
>>>
>>>>> - lock(xhci_lock)
>>>>> - no current command
>>>>> - return
>>>>>
>>>>> As the result, the later command isn't under track of the watch dog.
>>>>> If hardware fails to response to this command, kernel will hang in
>>>>> the thread which is waiting for the completion of the command.
>>>>>
>>>>> I can write a patch to fix this and cc stable kernel as well. For long
>>>>> term, in order to make it simple and easy to maintain, how about
>>>>> allocating a watch dog timer for each command? It could be part
>>>>> of the command structure and be managed just like the life cycle
>>>>> of a command structure.
>>>>>
>>>>> I can write a patch for review and discussion, if you think this
>>>>> change is possible.
>>>>>
>>>>> Best regards,
>>>>> Lu Baolu
>>>>
>>
>>
>

--
Baolin.wang
Best Regards

2016-12-20 15:12:35

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

On 20.12.2016 09:30, Baolin Wang wrote:
...

Alright, I gathered all current work related to xhci races and timeouts
and put them into a branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes

Its based on 4.9
It includes a few other patches just to avoid conflicts and make my life easier

Interesting patches are:

ee4eb91 xhci: remove unnecessary check for pending timer
0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
4f2535f xhci: Handle command completion and timeout race
b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
529a5a0 usb: xhci: fix possible wild pointer
4766555 xhci: Fix race related to abort operation
de834a3 xhci: Use delayed_work instead of timer for command timeout
69973b8 Linux 4.9

The fixes for command queue races will go to usb-linus and stable, the
reworks for stop ep watchdog timer will go to usb-next.

Still completely untested, (well it compiles)

Felipe gave instructions how to modify dwc3 driver to timeout on address
devicecommands to test these, I'll try to set that up.

All additional testing is welcome, especially if you can trigger timeouts
and races

-Mathias

2016-12-21 02:22:42

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi Mathias,

On 20 December 2016 at 23:13, Mathias Nyman
<[email protected]> wrote:
> On 20.12.2016 09:30, Baolin Wang wrote:
> ...
>
> Alright, I gathered all current work related to xhci races and timeouts
> and put them into a branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
> timeout_race_fixes
>
> Its based on 4.9
> It includes a few other patches just to avoid conflicts and make my life
> easier
>
> Interesting patches are:
>
> ee4eb91 xhci: remove unnecessary check for pending timer
> 0cba67d xhci: detect stop endpoint race using pending timer instead of
> counter.
> 4f2535f xhci: Handle command completion and timeout race
> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort
> command
> 529a5a0 usb: xhci: fix possible wild pointer
> 4766555 xhci: Fix race related to abort operation
> de834a3 xhci: Use delayed_work instead of timer for command timeout
> 69973b8 Linux 4.9
>
> The fixes for command queue races will go to usb-linus and stable, the
> reworks for stop ep watchdog timer will go to usb-next.

How about applying below patch in your 'timeout_race_fixes' branch?
[PATCH] usb: host: xhci: Clean up commands when stop endpoint command is timeout
https://lkml.org/lkml/2016/12/13/94

>
> Still completely untested, (well it compiles)
>
> Felipe gave instructions how to modify dwc3 driver to timeout on address
> devicecommands to test these, I'll try to set that up.
>
> All additional testing is welcome, especially if you can trigger timeouts
> and races

I will try to test these patches.

--
Baolin.wang
Best Regards

2016-12-21 06:17:17

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi Mathias,

I have some comments for the implementation of xhci_abort_cmd_ring() below.

On 12/20/2016 11:13 PM, Mathias Nyman wrote:
> On 20.12.2016 09:30, Baolin Wang wrote:
> ...
>
> Alright, I gathered all current work related to xhci races and timeouts
> and put them into a branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
>
> Its based on 4.9
> It includes a few other patches just to avoid conflicts and make my life easier
>
> Interesting patches are:
>
> ee4eb91 xhci: remove unnecessary check for pending timer
> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
> 4f2535f xhci: Handle command completion and timeout race
> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
> 529a5a0 usb: xhci: fix possible wild pointer
> 4766555 xhci: Fix race related to abort operation
> de834a3 xhci: Use delayed_work instead of timer for command timeout
> 69973b8 Linux 4.9
>
> The fixes for command queue races will go to usb-linus and stable, the
> reworks for stop ep watchdog timer will go to usb-next.
>
> Still completely untested, (well it compiles)
>
> Felipe gave instructions how to modify dwc3 driver to timeout on address
> devicecommands to test these, I'll try to set that up.
>
> All additional testing is welcome, especially if you can trigger timeouts
> and races
>
> -Mathias
>
>

Below is the latest code. I put my comments in line.

322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
323 {
324 u64 temp_64;
325 int ret;
326
327 xhci_dbg(xhci, "Abort command ring\n");
328
329 reinit_completion(&xhci->cmd_ring_stop_completion);
330
331 temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
333 &xhci->op_regs->cmd_ring);

We should hold xhci->lock when we are modifying xhci registers
at runtime.

334
335 /* Section 4.6.1.2 of xHCI 1.0 spec says software should
336 * time the completion od all xHCI commands, including

s/od/of/g

337 * the Command Abort operation. If software doesn't see
338 * CRR negated in a timely manner (e.g. longer than 5
339 * seconds), then it should assume that the there are
340 * larger problems with the xHC and assert HCRST.
341 */
342 ret = xhci_handshake(&xhci->op_regs->cmd_ring,
343 CMD_RING_RUNNING, 0, 5 * 1000 * 1000);
344 if (ret < 0) {
345 /* we are about to kill xhci, give it one more chance */

The retry of setting CMD_RING_ABORT is not necessary according to
previous discussion. We have cleaned code for second try in
xhci_handle_command_timeout(). Need to clean up here as well.

346 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
347 &xhci->op_regs->cmd_ring);
348 udelay(1000);
349 ret = xhci_handshake(&xhci->op_regs->cmd_ring,
350 CMD_RING_RUNNING, 0, 3 * 1000 * 1000);
351 if (ret < 0) {
352 xhci_err(xhci, "Stopped the command ring failed, "
353 "maybe the host is dead\n");
354 xhci->xhc_state |= XHCI_STATE_DYING;
355 xhci_halt(xhci);
356 return -ESHUTDOWN;
357 }
358 }
359 /*
360 * Writing the CMD_RING_ABORT bit should cause a cmd completion event,
361 * however on some host hw the CMD_RING_RUNNING bit is correctly cleared
362 * but the completion event in never sent. Wait 2 secs (arbitrary

s/in never sent/is never sent/g

363 * number) to handle those cases after negation of CMD_RING_RUNNING.
364 */

This should be implemented with a quirk bit. It's not common for all host controllers.

365 if (!wait_for_completion_timeout(&xhci->cmd_ring_stop_completion,
366 2 * HZ)) {
367 xhci_dbg(xhci, "No stop event for abort, ring start fail?\n");
368 xhci_cleanup_command_queue(xhci);
369 } else {
370 unsigned long flags;
371
372 spin_lock_irqsave(&xhci->lock, flags);
373 xhci_handle_stopped_cmd_ring(xhci, xhci_next_queued_cmd(xhci));
374 spin_unlock_irqrestore(&xhci->lock, flags);
375 }
376 return 0;
377 }

Best regards,
Lu Baolu

2016-12-21 06:57:40

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi Mathias,

I have some comments for the implementation of
xhci_handle_command_timeout() as well.

On 12/20/2016 11:13 PM, Mathias Nyman wrote:
> On 20.12.2016 09:30, Baolin Wang wrote:
> ...
>
> Alright, I gathered all current work related to xhci races and timeouts
> and put them into a branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
>
> Its based on 4.9
> It includes a few other patches just to avoid conflicts and make my life easier
>
> Interesting patches are:
>
> ee4eb91 xhci: remove unnecessary check for pending timer
> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
> 4f2535f xhci: Handle command completion and timeout race
> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
> 529a5a0 usb: xhci: fix possible wild pointer
> 4766555 xhci: Fix race related to abort operation
> de834a3 xhci: Use delayed_work instead of timer for command timeout
> 69973b8 Linux 4.9
>
> The fixes for command queue races will go to usb-linus and stable, the
> reworks for stop ep watchdog timer will go to usb-next.
>
> Still completely untested, (well it compiles)
>
> Felipe gave instructions how to modify dwc3 driver to timeout on address
> devicecommands to test these, I'll try to set that up.
>
> All additional testing is welcome, especially if you can trigger timeouts
> and races
>
> -Mathias
>
>

I post the code below and add my comments in line.

1276 void xhci_handle_command_timeout(struct work_struct *work)
1277 {
1278 struct xhci_hcd *xhci;
1279 int ret;
1280 unsigned long flags;
1281 u64 hw_ring_state;
1282
1283 xhci = container_of(to_delayed_work(work), struct xhci_hcd, cmd_timer);
1284
1285 spin_lock_irqsave(&xhci->lock, flags);
1286
1287 /*
1288 * If timeout work is pending, or current_cmd is NULL, it means we
1289 * raced with command completion. Command is handled so just return.
1290 */
1291 if (!xhci->current_cmd || delayed_work_pending(&xhci->cmd_timer)) {
1292 spin_unlock_irqrestore(&xhci->lock, flags);
1293 return;
1294 }
1295 /* mark this command to be cancelled */
1296 xhci->current_cmd->status = COMP_CMD_ABORT;
1297
1298 /* Make sure command ring is running before aborting it */
1299 hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
1300 if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
1301 (hw_ring_state & CMD_RING_RUNNING)) {
1302 /* Prevent new doorbell, and start command abort */
1303 xhci->cmd_ring_state = CMD_RING_STATE_ABORTED;
1304 spin_unlock_irqrestore(&xhci->lock, flags);
1305 xhci_dbg(xhci, "Command timeout\n");
1306 ret = xhci_abort_cmd_ring(xhci);
1307 if (unlikely(ret == -ESHUTDOWN)) {
1308 xhci_err(xhci, "Abort command ring failed\n");
1309 xhci_cleanup_command_queue(xhci);
1310 usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
1311 xhci_dbg(xhci, "xHCI host controller is dead.\n");
1312 }
1313 return;
1314 }
1315
1316 /* host removed. Bail out */
1317 if (xhci->xhc_state & XHCI_STATE_REMOVING) {
1318 spin_unlock_irqrestore(&xhci->lock, flags);
1319 xhci_dbg(xhci, "host removed, ring start fail?\n");
1320 xhci_cleanup_command_queue(xhci);
1321 return;
1322 }

I think this part of code should be moved up to line 1295.

1323
1324 /* command timeout on stopped ring, ring can't be aborted */
1325 xhci_dbg(xhci, "Command timeout on stopped ring\n");
1326 xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
1327 spin_unlock_irqrestore(&xhci->lock, flags);

This part of code is tricky. I have no idea about in which case should this
code be executed? Anyway, we shouldn't call xhci_handle_stopped_cmd_ring()
here, right?

1328 return;
1329 }

Best regards,
Lu Baolu

2016-12-21 12:47:14

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

On 21.12.2016 08:17, Lu Baolu wrote:
> Hi Mathias,
>
> I have some comments for the implementation of xhci_abort_cmd_ring() below.
>
> On 12/20/2016 11:13 PM, Mathias Nyman wrote:
>> On 20.12.2016 09:30, Baolin Wang wrote:
>> ...
>>
>> Alright, I gathered all current work related to xhci races and timeouts
>> and put them into a branch:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
>>
>> Its based on 4.9
>> It includes a few other patches just to avoid conflicts and make my life easier
>>
>> Interesting patches are:
>>
>> ee4eb91 xhci: remove unnecessary check for pending timer
>> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
>> 4f2535f xhci: Handle command completion and timeout race
>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
>> 529a5a0 usb: xhci: fix possible wild pointer
>> 4766555 xhci: Fix race related to abort operation
>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>> 69973b8 Linux 4.9
>>
>> The fixes for command queue races will go to usb-linus and stable, the
>> reworks for stop ep watchdog timer will go to usb-next.
>>
>> Still completely untested, (well it compiles)
>>
>> Felipe gave instructions how to modify dwc3 driver to timeout on address
>> devicecommands to test these, I'll try to set that up.
>>
>> All additional testing is welcome, especially if you can trigger timeouts
>> and races
>>
>> -Mathias
>>
>>
>
> Below is the latest code. I put my comments in line.
>
> 322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
> 323 {
> 324 u64 temp_64;
> 325 int ret;
> 326
> 327 xhci_dbg(xhci, "Abort command ring\n");
> 328
> 329 reinit_completion(&xhci->cmd_ring_stop_completion);
> 330
> 331 temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
> 332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
> 333 &xhci->op_regs->cmd_ring);
>
> We should hold xhci->lock when we are modifying xhci registers
> at runtime.
>

Makes sense, but we need to unlock it before sleeping or waiting for completion.
I need to look into that in more detail.

But this was an issue already before these changes.

> The retry of setting CMD_RING_ABORT is not necessary according to
> previous discussion. We have cleaned code for second try in
> xhci_handle_command_timeout(). Need to clean up here as well.
>

Yes it can be cleaned up as well, but the two cases are a bit different.
The cleaned up one was about command ring not starting again after it was stopped.

This second try is a workaround for what we thought was the command ring failing
to stop in the first place, but is most likely due to the race that OGAWA Hirofumi
fixed. It races if the stop command ring interrupt happens between writing the abort
bit and polling for the ring stopped bit. The interrupt hander may start the command
ring again, and we would believe we failed to stop it in the first place.

This race could probably be fixed by just extending the lock (and preventing
interrupts) to cover both writing the abort bit and polling for the command ring
running bit, as you pointed out here previously.

But then again I really like OGAWA Hiroumi's solution that separates the
command ring stopping from aborting commands and restarting the ring.

The current way of always restarting the command ring as a response to
a stop command ring command really limits its usage.

So, with this in mind most reasonable would be to
1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
3. remove unnecessary second abort try as a separate patch, send to usb-next
4. remove polling for the Command ring running (CRR), waiting for completion
is enough, if completion times out then we can check CRR. for usb-next

I'll fix the typos these patches would introduce. Fixing old typos can be done as separate
patches later.

-Mathias

2016-12-21 12:56:11

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

On 21.12.2016 08:57, Lu Baolu wrote:
> Hi Mathias,
>
> I have some comments for the implementation of
> xhci_handle_command_timeout() as well.
>
> On 12/20/2016 11:13 PM, Mathias Nyman wrote:
>> On 20.12.2016 09:30, Baolin Wang wrote:
>> ...
>>
>> Alright, I gathered all current work related to xhci races and timeouts
>> and put them into a branch:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
>>
>> Its based on 4.9
>> It includes a few other patches just to avoid conflicts and make my life easier
>>
>> Interesting patches are:
>>
>> ee4eb91 xhci: remove unnecessary check for pending timer
>> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
>> 4f2535f xhci: Handle command completion and timeout race
>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
>> 529a5a0 usb: xhci: fix possible wild pointer
>> 4766555 xhci: Fix race related to abort operation
>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>> 69973b8 Linux 4.9
>>
>> The fixes for command queue races will go to usb-linus and stable, the
>> reworks for stop ep watchdog timer will go to usb-next.
>>
>> Still completely untested, (well it compiles)
>>
>> Felipe gave instructions how to modify dwc3 driver to timeout on address
>> devicecommands to test these, I'll try to set that up.
>>
>> All additional testing is welcome, especially if you can trigger timeouts
>> and races
>>
>> -Mathias
>>
>>
>
> I post the code below and add my comments in line.
>
> 1276 void xhci_handle_command_timeout(struct work_struct *work)
> 1277 {
> 1278 struct xhci_hcd *xhci;
> 1279 int ret;
> 1280 unsigned long flags;
> 1281 u64 hw_ring_state;
> 1282
> 1283 xhci = container_of(to_delayed_work(work), struct xhci_hcd, cmd_timer);
> 1284
> 1285 spin_lock_irqsave(&xhci->lock, flags);
> 1286
> 1287 /*
> 1288 * If timeout work is pending, or current_cmd is NULL, it means we
> 1289 * raced with command completion. Command is handled so just return.
> 1290 */
> 1291 if (!xhci->current_cmd || delayed_work_pending(&xhci->cmd_timer)) {
> 1292 spin_unlock_irqrestore(&xhci->lock, flags);
> 1293 return;
> 1294 }
> 1295 /* mark this command to be cancelled */
> 1296 xhci->current_cmd->status = COMP_CMD_ABORT;
> 1297
> 1298 /* Make sure command ring is running before aborting it */
> 1299 hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
> 1300 if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
> 1301 (hw_ring_state & CMD_RING_RUNNING)) {
> 1302 /* Prevent new doorbell, and start command abort */
> 1303 xhci->cmd_ring_state = CMD_RING_STATE_ABORTED;
> 1304 spin_unlock_irqrestore(&xhci->lock, flags);
> 1305 xhci_dbg(xhci, "Command timeout\n");
> 1306 ret = xhci_abort_cmd_ring(xhci);
> 1307 if (unlikely(ret == -ESHUTDOWN)) {
> 1308 xhci_err(xhci, "Abort command ring failed\n");
> 1309 xhci_cleanup_command_queue(xhci);
> 1310 usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
> 1311 xhci_dbg(xhci, "xHCI host controller is dead.\n");
> 1312 }
> 1313 return;
> 1314 }
> 1315
> 1316 /* host removed. Bail out */
> 1317 if (xhci->xhc_state & XHCI_STATE_REMOVING) {
> 1318 spin_unlock_irqrestore(&xhci->lock, flags);
> 1319 xhci_dbg(xhci, "host removed, ring start fail?\n");
> 1320 xhci_cleanup_command_queue(xhci);
> 1321 return;
> 1322 }
>
> I think this part of code should be moved up to line 1295.

The XHCI_STATE_REMOVING and XHCI_STATE_DYING needs a rework,
I'm working on that.

Basically we want XHCI_STATE_REMOVING to mean that all devices are going,
away and driver will be removed. Don't bother with re-calculating available
bandwidths after every device removal, but do use xhci hardware to disable
devices cleanly etc.

XHCI_STATE_DYING should mean hardware is not working/responding. Don't
bother writing any registers or queuing anything. Just return all
pending and cancelled URBs, notify core we died, and free all allocated memory.

>
> 1323
> 1324 /* command timeout on stopped ring, ring can't be aborted */
> 1325 xhci_dbg(xhci, "Command timeout on stopped ring\n");
> 1326 xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> 1327 spin_unlock_irqrestore(&xhci->lock, flags);
>
> This part of code is tricky. I have no idea about in which case should this
> code be executed? Anyway, we shouldn't call xhci_handle_stopped_cmd_ring()
> here, right?
>

This isn't changed it these patches.

It will remove the aborted commands and restart the ring. It's useful if we
want to abort a command but command ring was not running. (if for some
unkown reason it was stopped, or forgot to restart.

-Mathias

2016-12-21 12:59:58

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

On 21.12.2016 04:22, Baolin Wang wrote:
> Hi Mathias,
>
> On 20 December 2016 at 23:13, Mathias Nyman
> <[email protected]> wrote:
>> On 20.12.2016 09:30, Baolin Wang wrote:
>> ...
>>
>> Alright, I gathered all current work related to xhci races and timeouts
>> and put them into a branch:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
>> timeout_race_fixes
>>
>> Its based on 4.9
>> It includes a few other patches just to avoid conflicts and make my life
>> easier
>>
>> Interesting patches are:
>>
>> ee4eb91 xhci: remove unnecessary check for pending timer
>> 0cba67d xhci: detect stop endpoint race using pending timer instead of
>> counter.
>> 4f2535f xhci: Handle command completion and timeout race
>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort
>> command
>> 529a5a0 usb: xhci: fix possible wild pointer
>> 4766555 xhci: Fix race related to abort operation
>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>> 69973b8 Linux 4.9
>>
>> The fixes for command queue races will go to usb-linus and stable, the
>> reworks for stop ep watchdog timer will go to usb-next.
>
> How about applying below patch in your 'timeout_race_fixes' branch?
> [PATCH] usb: host: xhci: Clean up commands when stop endpoint command is timeout
> https://lkml.org/lkml/2016/12/13/94
>

usb_hc_died() should eventyally end up calling xhci_mem_cleanup()
which will cleanup the command queue. But I need to look into that

-Mathias

2016-12-21 14:36:52

by OGAWA Hirofumi

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Mathias Nyman <[email protected]> writes:

>> Below is the latest code. I put my comments in line.
>>
>> 322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
>> 323 {
>> 324 u64 temp_64;
>> 325 int ret;
>> 326
>> 327 xhci_dbg(xhci, "Abort command ring\n");
>> 328
>> 329 reinit_completion(&xhci->cmd_ring_stop_completion);
>> 330
>> 331 temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
>> 332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
>> 333 &xhci->op_regs->cmd_ring);
>>
>> We should hold xhci->lock when we are modifying xhci registers
>> at runtime.
>>
>
> Makes sense, but we need to unlock it before sleeping or waiting for
> completion. I need to look into that in more detail.
>
> But this was an issue already before these changes.

We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
is for taking lock for register though, I guess it should be enough just
lock around of read=>write of ->cmd_ring if need lock.

[Rather ->cmd_ring user should check CMD_RING_STATE_ABORTED state.]

> But then again I really like OGAWA Hiroumi's solution that separates the
> command ring stopping from aborting commands and restarting the ring.
>
> The current way of always restarting the command ring as a response to
> a stop command ring command really limits its usage.
>
> So, with this in mind most reasonable would be to
> 1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
> 2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
> 3. remove unnecessary second abort try as a separate patch, send to usb-next
> 4. remove polling for the Command ring running (CRR), waiting for completion
> is enough, if completion times out then we can check CRR. for usb-next

I think we should check both of CRR and even of stop completion. Because
CRR and stop completion was not same time (both can be first).

Thanks.
--
OGAWA Hirofumi <[email protected]>

2016-12-21 15:03:52

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

On 21.12.2016 16:10, OGAWA Hirofumi wrote:
> Mathias Nyman <[email protected]> writes:
>
>>> Below is the latest code. I put my comments in line.
>>>
>>> 322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
>>> 323 {
>>> 324 u64 temp_64;
>>> 325 int ret;
>>> 326
>>> 327 xhci_dbg(xhci, "Abort command ring\n");
>>> 328
>>> 329 reinit_completion(&xhci->cmd_ring_stop_completion);
>>> 330
>>> 331 temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
>>> 332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
>>> 333 &xhci->op_regs->cmd_ring);
>>>
>>> We should hold xhci->lock when we are modifying xhci registers
>>> at runtime.
>>>
>>
>> Makes sense, but we need to unlock it before sleeping or waiting for
>> completion. I need to look into that in more detail.
>>
>> But this was an issue already before these changes.
>
> We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
> is for taking lock for register though, I guess it should be enough just
> lock around of read=>write of ->cmd_ring if need lock.

After your patch it should be enough to have the lock only while
reading and writing the cmd_ring register.

If we want a locking fix that applies more easily to older stable releases before
your change then the lock needs to cover set CMD_RING_STATE_ABORT, read cmd_reg,
write cmd_reg and busiloop checking CRR bit. Otherwise the stop cmd ring interrupt
handler may restart the ring just before we start checing CRR. The stop cmd ring
interrupt will set the CMD_RING_STATE_ABORTED to CMD_RING_STATE_RUNNING so
ring will really restart in the interrupt handler.

>
> [Rather ->cmd_ring user should check CMD_RING_STATE_ABORTED state.]
>
>> But then again I really like OGAWA Hiroumi's solution that separates the
>> command ring stopping from aborting commands and restarting the ring.
>>
>> The current way of always restarting the command ring as a response to
>> a stop command ring command really limits its usage.
>>
>> So, with this in mind most reasonable would be to
>> 1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
>> 2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
>> 3. remove unnecessary second abort try as a separate patch, send to usb-next
>> 4. remove polling for the Command ring running (CRR), waiting for completion
>> is enough, if completion times out then we can check CRR. for usb-next
>
> I think we should check both of CRR and even of stop completion. Because
> CRR and stop completion was not same time (both can be first).

We can keep both, maybe change the order and do the busylooping-checking after
waiting for completion, but thats a optimization for usb-next sometimes later.

Thanks

-Mathias

2016-12-21 15:18:22

by OGAWA Hirofumi

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Mathias Nyman <[email protected]> writes:

>> We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
>> is for taking lock for register though, I guess it should be enough just
>> lock around of read=>write of ->cmd_ring if need lock.
>
> After your patch it should be enough to have the lock only while
> reading and writing the cmd_ring register.
>
> If we want a locking fix that applies more easily to older stable
> releases before your change then the lock needs to cover set
> CMD_RING_STATE_ABORT, read cmd_reg, write cmd_reg and busiloop
> checking CRR bit. Otherwise the stop cmd ring interrupt handler may
> restart the ring just before we start checing CRR. The stop cmd ring
> interrupt will set the CMD_RING_STATE_ABORTED to
> CMD_RING_STATE_RUNNING so ring will really restart in the interrupt
> handler.

Just for record (no chance to make patch I myself for now, sorry), while
checking locking slightly, I noticed unrelated missing locking.

xhci_cleanup_command_queue()

We are calling it without locking, but we need to lock for accessing list.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2016-12-22 01:39:56

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi,

On 12/21/2016 08:57 PM, Mathias Nyman wrote:
> On 21.12.2016 08:57, Lu Baolu wrote:
>> Hi Mathias,
>>
>> I have some comments for the implementation of
>> xhci_handle_command_timeout() as well.
>>
>> On 12/20/2016 11:13 PM, Mathias Nyman wrote:
>>> On 20.12.2016 09:30, Baolin Wang wrote:
>>> ...
>>>
>>> Alright, I gathered all current work related to xhci races and timeouts
>>> and put them into a branch:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
>>>
>>> Its based on 4.9
>>> It includes a few other patches just to avoid conflicts and make my life easier
>>>
>>> Interesting patches are:
>>>
>>> ee4eb91 xhci: remove unnecessary check for pending timer
>>> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
>>> 4f2535f xhci: Handle command completion and timeout race
>>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
>>> 529a5a0 usb: xhci: fix possible wild pointer
>>> 4766555 xhci: Fix race related to abort operation
>>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>>> 69973b8 Linux 4.9
>>>
>>> The fixes for command queue races will go to usb-linus and stable, the
>>> reworks for stop ep watchdog timer will go to usb-next.
>>>
>>> Still completely untested, (well it compiles)
>>>
>>> Felipe gave instructions how to modify dwc3 driver to timeout on address
>>> devicecommands to test these, I'll try to set that up.
>>>
>>> All additional testing is welcome, especially if you can trigger timeouts
>>> and races
>>>
>>> -Mathias
>>>
>>>
>>
>> I post the code below and add my comments in line.
>>
>> 1276 void xhci_handle_command_timeout(struct work_struct *work)
>> 1277 {
>> 1278 struct xhci_hcd *xhci;
>> 1279 int ret;
>> 1280 unsigned long flags;
>> 1281 u64 hw_ring_state;
>> 1282
>> 1283 xhci = container_of(to_delayed_work(work), struct xhci_hcd, cmd_timer);
>> 1284
>> 1285 spin_lock_irqsave(&xhci->lock, flags);
>> 1286
>> 1287 /*
>> 1288 * If timeout work is pending, or current_cmd is NULL, it means we
>> 1289 * raced with command completion. Command is handled so just return.
>> 1290 */
>> 1291 if (!xhci->current_cmd || delayed_work_pending(&xhci->cmd_timer)) {
>> 1292 spin_unlock_irqrestore(&xhci->lock, flags);
>> 1293 return;
>> 1294 }
>> 1295 /* mark this command to be cancelled */
>> 1296 xhci->current_cmd->status = COMP_CMD_ABORT;
>> 1297
>> 1298 /* Make sure command ring is running before aborting it */
>> 1299 hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
>> 1300 if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
>> 1301 (hw_ring_state & CMD_RING_RUNNING)) {
>> 1302 /* Prevent new doorbell, and start command abort */
>> 1303 xhci->cmd_ring_state = CMD_RING_STATE_ABORTED;
>> 1304 spin_unlock_irqrestore(&xhci->lock, flags);
>> 1305 xhci_dbg(xhci, "Command timeout\n");
>> 1306 ret = xhci_abort_cmd_ring(xhci);
>> 1307 if (unlikely(ret == -ESHUTDOWN)) {
>> 1308 xhci_err(xhci, "Abort command ring failed\n");
>> 1309 xhci_cleanup_command_queue(xhci);
>> 1310 usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
>> 1311 xhci_dbg(xhci, "xHCI host controller is dead.\n");
>> 1312 }
>> 1313 return;
>> 1314 }
>> 1315
>> 1316 /* host removed. Bail out */
>> 1317 if (xhci->xhc_state & XHCI_STATE_REMOVING) {
>> 1318 spin_unlock_irqrestore(&xhci->lock, flags);
>> 1319 xhci_dbg(xhci, "host removed, ring start fail?\n");
>> 1320 xhci_cleanup_command_queue(xhci);
>> 1321 return;
>> 1322 }
>>
>> I think this part of code should be moved up to line 1295.
>
> The XHCI_STATE_REMOVING and XHCI_STATE_DYING needs a rework,
> I'm working on that.
>
> Basically we want XHCI_STATE_REMOVING to mean that all devices are going,
> away and driver will be removed. Don't bother with re-calculating available
> bandwidths after every device removal, but do use xhci hardware to disable
> devices cleanly etc.
>
> XHCI_STATE_DYING should mean hardware is not working/responding. Don't
> bother writing any registers or queuing anything. Just return all
> pending and cancelled URBs, notify core we died, and free all allocated memory.

Okay, thanks for the information.

>
>>
>> 1323
>> 1324 /* command timeout on stopped ring, ring can't be aborted */
>> 1325 xhci_dbg(xhci, "Command timeout on stopped ring\n");
>> 1326 xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
>> 1327 spin_unlock_irqrestore(&xhci->lock, flags);
>>
>> This part of code is tricky. I have no idea about in which case should this
>> code be executed? Anyway, we shouldn't call xhci_handle_stopped_cmd_ring()
>> here, right?
>>
>
> This isn't changed it these patches.
>
> It will remove the aborted commands and restart the ring. It's useful if we
> want to abort a command but command ring was not running. (if for some
> unkown reason it was stopped, or forgot to restart.

Make sense.

So how about put a warning (instead of a debug message which will normally
be ignored) here?

Best regards,
Lu Baolu

2016-12-22 01:44:01

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi,

On 12/21/2016 08:48 PM, Mathias Nyman wrote:
> On 21.12.2016 08:17, Lu Baolu wrote:
>> Hi Mathias,
>>
>> I have some comments for the implementation of xhci_abort_cmd_ring() below.
>>
>> On 12/20/2016 11:13 PM, Mathias Nyman wrote:
>>> On 20.12.2016 09:30, Baolin Wang wrote:
>>> ...
>>>
>>> Alright, I gathered all current work related to xhci races and timeouts
>>> and put them into a branch:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
>>>
>>> Its based on 4.9
>>> It includes a few other patches just to avoid conflicts and make my life easier
>>>
>>> Interesting patches are:
>>>
>>> ee4eb91 xhci: remove unnecessary check for pending timer
>>> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
>>> 4f2535f xhci: Handle command completion and timeout race
>>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
>>> 529a5a0 usb: xhci: fix possible wild pointer
>>> 4766555 xhci: Fix race related to abort operation
>>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>>> 69973b8 Linux 4.9
>>>
>>> The fixes for command queue races will go to usb-linus and stable, the
>>> reworks for stop ep watchdog timer will go to usb-next.
>>>
>>> Still completely untested, (well it compiles)
>>>
>>> Felipe gave instructions how to modify dwc3 driver to timeout on address
>>> devicecommands to test these, I'll try to set that up.
>>>
>>> All additional testing is welcome, especially if you can trigger timeouts
>>> and races
>>>
>>> -Mathias
>>>
>>>
>>
>> Below is the latest code. I put my comments in line.
>>
>> 322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
>> 323 {
>> 324 u64 temp_64;
>> 325 int ret;
>> 326
>> 327 xhci_dbg(xhci, "Abort command ring\n");
>> 328
>> 329 reinit_completion(&xhci->cmd_ring_stop_completion);
>> 330
>> 331 temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
>> 332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
>> 333 &xhci->op_regs->cmd_ring);
>>
>> We should hold xhci->lock when we are modifying xhci registers
>> at runtime.
>>
>
> Makes sense, but we need to unlock it before sleeping or waiting for completion.
> I need to look into that in more detail.
>
> But this was an issue already before these changes.
>
>> The retry of setting CMD_RING_ABORT is not necessary according to
>> previous discussion. We have cleaned code for second try in
>> xhci_handle_command_timeout(). Need to clean up here as well.
>>
>
> Yes it can be cleaned up as well, but the two cases are a bit different.
> The cleaned up one was about command ring not starting again after it was stopped.
>
> This second try is a workaround for what we thought was the command ring failing
> to stop in the first place, but is most likely due to the race that OGAWA Hirofumi
> fixed. It races if the stop command ring interrupt happens between writing the abort
> bit and polling for the ring stopped bit. The interrupt hander may start the command
> ring again, and we would believe we failed to stop it in the first place.
>
> This race could probably be fixed by just extending the lock (and preventing
> interrupts) to cover both writing the abort bit and polling for the command ring
> running bit, as you pointed out here previously.
>
> But then again I really like OGAWA Hiroumi's solution that separates the
> command ring stopping from aborting commands and restarting the ring.
>
> The current way of always restarting the command ring as a response to
> a stop command ring command really limits its usage.
>
> So, with this in mind most reasonable would be to
> 1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
> 2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
> 3. remove unnecessary second abort try as a separate patch, send to usb-next
> 4. remove polling for the Command ring running (CRR), waiting for completion
> is enough, if completion times out then we can check CRR. for usb-next
> I'll fix the typos these patches would introduce. Fixing old typos can be done as separate
> patches later.

This is exactly the same as what I am thinking of. I will submit the patches later.

Best regards,
Lu Baolu

2016-12-22 01:46:11

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

Hi,

On 12/21/2016 11:18 PM, OGAWA Hirofumi wrote:
> Mathias Nyman <[email protected]> writes:
>
>>> We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
>>> is for taking lock for register though, I guess it should be enough just
>>> lock around of read=>write of ->cmd_ring if need lock.
>> After your patch it should be enough to have the lock only while
>> reading and writing the cmd_ring register.
>>
>> If we want a locking fix that applies more easily to older stable
>> releases before your change then the lock needs to cover set
>> CMD_RING_STATE_ABORT, read cmd_reg, write cmd_reg and busiloop
>> checking CRR bit. Otherwise the stop cmd ring interrupt handler may
>> restart the ring just before we start checing CRR. The stop cmd ring
>> interrupt will set the CMD_RING_STATE_ABORTED to
>> CMD_RING_STATE_RUNNING so ring will really restart in the interrupt
>> handler.
> Just for record (no chance to make patch I myself for now, sorry), while
> checking locking slightly, I noticed unrelated missing locking.
>
> xhci_cleanup_command_queue()
>
> We are calling it without locking, but we need to lock for accessing list.

Yeah. I can make the patch.

Best regards,
Lu Baolu

2016-12-23 12:56:01

[permalink] [raw]

Subject: Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

On 22.12.2016 03:46, Lu Baolu wrote:
> Hi,
>
> On 12/21/2016 11:18 PM, OGAWA Hirofumi wrote:
>> Mathias Nyman <[email protected]> writes:
>>
>>>> We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
>>>> is for taking lock for register though, I guess it should be enough just
>>>> lock around of read=>write of ->cmd_ring if need lock.
>>> After your patch it should be enough to have the lock only while
>>> reading and writing the cmd_ring register.
>>>
>>> If we want a locking fix that applies more easily to older stable
>>> releases before your change then the lock needs to cover set
>>> CMD_RING_STATE_ABORT, read cmd_reg, write cmd_reg and busiloop
>>> checking CRR bit. Otherwise the stop cmd ring interrupt handler may
>>> restart the ring just before we start checing CRR. The stop cmd ring
>>> interrupt will set the CMD_RING_STATE_ABORTED to
>>> CMD_RING_STATE_RUNNING so ring will really restart in the interrupt
>>> handler.
>> Just for record (no chance to make patch I myself for now, sorry), while
>> checking locking slightly, I noticed unrelated missing locking.
>>
>> xhci_cleanup_command_queue()
>>
>> We are calling it without locking, but we need to lock for accessing list.
>
> Yeah. I can make the patch.
>

Force updated timeout_race_fixes branch.

I'll be out for a few days during xmas, continue testing after that

-Mathias

2016-12-27 03:07:30