2024-04-15 05:42:23

by Aleksander Morgado

[permalink] [raw]
Subject: Re: [PATCH] usb: cdc-wdm: close race between read and workqueue

Hey Oliver & Bjørn,

On 3/14/24 11:50, Oliver Neukum wrote:
> wdm_read() cannot race with itself. However, in
> service_outstanding_interrupt() it can race with the
> workqueue, which can be triggered by error handling.
>
> Hence we need to make sure that the WDM_RESPONDING
> flag is not just only set but tested.
>
> Fixes: afba937e540c9 ("USB: CDC WDM driver")
> Signed-off-by: Oliver Neukum <[email protected]>

We are not aware of all the details involved in this patch, but we had
to revert it in all the different ChromeOS kernel versions where we had
it cherry-picked, because it broke the MBIM communication with the Intel
XMM based Fibocomm L850 modem. Other modems shipped in Chromebooks like
the QC based Fibocomm FM101 don't seem to be affected.

Attached is an example output of mbimcli talking directly to the cdc-wdm
port (i.e. without ModemManager or the mbim-proxy). In the example, we
are receiving a bunch of different messages from previous mbimcli runs.
Looking at the timestamps, it looks as if we only receive a message
right after we have sent one, e.g. after each "open request" we end up
receiving responses for requests sent in earlier runs; or something
along those lines.

Is this bad behavior of this specific modem chipset, and if so, how can
we workaround it? If you need any additional information or help to test
new patches, let us know.

Cheers!

--
Aleksander


Attachments:
cdc-wdm-errors.txt (14.90 kB)
OpenPGP_0xAECE0239C6606AD5.asc (3.12 kB)
OpenPGP public key
OpenPGP_signature.asc (855.00 B)
OpenPGP digital signature
Download all attachments

2024-04-15 06:58:23

by Bjørn Mork

[permalink] [raw]
Subject: Re: [PATCH] usb: cdc-wdm: close race between read and workqueue

Aleksander Morgado <[email protected]> writes:
> On 3/14/24 11:50, Oliver Neukum wrote:
>> wdm_read() cannot race with itself. However, in
>> service_outstanding_interrupt() it can race with the
>> workqueue, which can be triggered by error handling.
>> Hence we need to make sure that the WDM_RESPONDING
>> flag is not just only set but tested.
>> Fixes: afba937e540c9 ("USB: CDC WDM driver")
>> Signed-off-by: Oliver Neukum <[email protected]>
>
> We are not aware of all the details involved in this patch, but we had
> to revert it in all the different ChromeOS kernel versions where we
> had it cherry-picked, because it broke the MBIM communication with the
> Intel XMM based Fibocomm L850 modem. Other modems shipped in
> Chromebooks like the QC based Fibocomm FM101 don't seem to be
> affected.
>
> Attached is an example output of mbimcli talking directly to the
> cdc-wdm port (i.e. without ModemManager or the mbim-proxy). In the
> example, we are receiving a bunch of different messages from previous
> mbimcli runs. Looking at the timestamps, it looks as if we only
> receive a message right after we have sent one, e.g. after each "open
> request" we end up receiving responses for requests sent in earlier
> runs; or something along those lines.
>
> Is this bad behavior of this specific modem chipset, and if so, how
> can we workaround it? If you need any additional information or help
> to test new patches, let us know.

I'm not sure I understand what problem that patch is supposed to fix.
Which means that everything I write could be completely wrong...

But to me it looks like the described issue is exactly what you should
expect if that change ever triggers. I believe we must resubmit the
urb from service_outstanding_interrupt(). That's why it was added. See
the explanation Robert wrote when introducing it:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/usb/class/cdc-wdm.c?id=c1da59dad0ebd3f9bd238f3fff82b1f7ffda7829

As for the XMM behaviour: it's been a long time since I tried any of
those, but AFAIR one the major differences compared to Qualcomm was the
strict queue handling in the firmware. This caused a number of problems
where the cdc-wdm driver wanted to skip a message for some reason. So
I'm not surprised that a bug like this is triggered by one of those
modems. That's probably the only thing they are good for :-)



Bjørn

2024-04-15 08:43:08

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH] usb: cdc-wdm: close race between read and workqueue



On 15.04.24 07:42, Aleksander Morgado wrote:

Hi,

> We are not aware of all the details involved in this patch,

I had gotten bug reports about resubmitting an active URB.

> but we had to revert it in all the different ChromeOS kernel versions where we had it cherry-picked, because it broke the MBIM communication with the Intel XMM based Fibocomm L850 modem. > Other modems shipped in Chromebooks like the QC based Fibocomm FM101 don't seem to be affected.

That is odd.

> Attached is an example output of mbimcli talking directly to the cdc-wdm port (i.e. without ModemManager or the mbim-proxy).

Could you provide a working example, that is with another chipset? And, most important, dmesg for both cases with
the log level set to maximum?

> In the example, we are receiving a bunch of different messages from previous mbimcli runs. Looking at the timestamps, it looks as if we only receive a message right after we have sent one, e.g. after each "open request" we end up receiving responses for requests sent in earlier runs; or something along those lines.

It looks like you are hitting the race later than my bug reporters, which means
that the submission works and we do not overwrite the buffer.

> Is this bad behavior of this specific modem chipset, and if so, how can we workaround it? If you need any additional information or help to test new patches, let us know.

Generally losing data is bad, so I cannot readily tell.
Please provide data for the working case.

Regards
Oliver


2024-04-15 09:06:49

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH] usb: cdc-wdm: close race between read and workqueue

On 15.04.24 08:47, Bjørn Mork wrote:

> I'm not sure I understand what problem that patch is supposed to fix.
> Which means that everything I write could be completely wrong...

wdm_in_callback() can schedule service_outs_intr(), which can call
service_outstanding_interrupt(), which sets WDM_RESPONDING and submits
desc->response.
That is not problematic in itself, but wdm_read() also calls
service_outstanding_interrupt(), which can lead to teh same URB
being submitted twice (which caused me to write the patch)
or, apparently, in this case, it leads to discarding a buffer
by resubmitting and completing an URB.

> But to me it looks like the described issue is exactly what you should
> expect if that change ever triggers. I believe we must resubmit the

Yes, it does.

> urb from service_outstanding_interrupt(). That's why it was added. See
> the explanation Robert wrote when introducing it:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/usb/class/cdc-wdm.c?id=c1da59dad0ebd3f9bd238f3fff82b1f7ffda7829

Well, the explanation is correct in that we must read
data available. However, if the RESPONDING flag is set
and the URB submitted, we are already doing so.

> As for the XMM behaviour: it's been a long time since I tried any of
> those, but AFAIR one the major differences compared to Qualcomm was the
> strict queue handling in the firmware. This caused a number of problems
> where the cdc-wdm driver wanted to skip a message for some reason. So
> I'm not surprised that a bug like this is triggered by one of those
> modems. That's probably the only thing they are good for :-)

I am not sure where exactly the issue lies here. Suggestions for
debugging?

Regards
Oliver


2024-04-15 09:27:31

by Bjørn Mork

[permalink] [raw]
Subject: Re: [PATCH] usb: cdc-wdm: close race between read and workqueue

Oliver Neukum <[email protected]> writes:
> On 15.04.24 08:47, Bjørn Mork wrote:
>
>> urb from service_outstanding_interrupt(). That's why it was added. See
>> the explanation Robert wrote when introducing it:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/usb/class/cdc-wdm.c?id=c1da59dad0ebd3f9bd238f3fff82b1f7ffda7829
>
> Well, the explanation is correct in that we must read
> data available. However, if the RESPONDING flag is set
> and the URB submitted, we are already doing so.

Sounds reasonable. Except that the bug proves we didn't.

If you are right that service_outstanding_interrupt can race againts
itself (and I don't doubt that), then I guess this could also happen
between failure to submit the URB and clearing the flag?

>> As for the XMM behaviour: it's been a long time since I tried any of
>> those, but AFAIR one the major differences compared to Qualcomm was the
>> strict queue handling in the firmware. This caused a number of problems
>> where the cdc-wdm driver wanted to skip a message for some reason. So
>> I'm not surprised that a bug like this is triggered by one of those
>> modems. That's probably the only thing they are good for :-)
>
> I am not sure where exactly the issue lies here. Suggestions for
> debugging?

Nothing more than the obvious; Get one of those modems and do some
usbmon snooping.


Bjørn

2024-04-15 10:15:07

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH] usb: cdc-wdm: close race between read and workqueue

On 15.04.24 11:26, Bjørn Mork wrote:
> Oliver Neukum <[email protected]> writes:
>> On 15.04.24 08:47, Bjørn Mork wrote:
>>
>>> urb from service_outstanding_interrupt(). That's why it was added. See
>>> the explanation Robert wrote when introducing it:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/usb/class/cdc-wdm.c?id=c1da59dad0ebd3f9bd238f3fff82b1f7ffda7829
>>
>> Well, the explanation is correct in that we must read
>> data available. However, if the RESPONDING flag is set
>> and the URB submitted, we are already doing so.
>
> Sounds reasonable. Except that the bug proves we didn't.

Why? I am afraid I do not get that part.

> If you are right that service_outstanding_interrupt can race againts
> itself (and I don't doubt that), then I guess this could also happen
> between failure to submit the URB and clearing the flag?

Yes, it can. In fact in this case the behavior should not change.
I am afraid we have a misunderstanding. It seems to me that in the
unchanged driver the result of service_outstanding_interrupt()
is undefined.
Please explain.

Regards
Oliver



2024-04-15 10:33:17

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH] usb: cdc-wdm: close race between read and workqueue

On Mon, Apr 15, 2024 at 11:26:00AM +0200, Bj?rn Mork wrote:
> Oliver Neukum <[email protected]> writes:
> > On 15.04.24 08:47, Bj?rn Mork wrote:
> >
> >> urb from service_outstanding_interrupt(). That's why it was added. See
> >> the explanation Robert wrote when introducing it:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/usb/class/cdc-wdm.c?id=c1da59dad0ebd3f9bd238f3fff82b1f7ffda7829
> >
> > Well, the explanation is correct in that we must read
> > data available. However, if the RESPONDING flag is set
> > and the URB submitted, we are already doing so.
>
> Sounds reasonable. Except that the bug proves we didn't.
>
> If you are right that service_outstanding_interrupt can race againts
> itself (and I don't doubt that), then I guess this could also happen
> between failure to submit the URB and clearing the flag?
>
> >> As for the XMM behaviour: it's been a long time since I tried any of
> >> those, but AFAIR one the major differences compared to Qualcomm was the
> >> strict queue handling in the firmware. This caused a number of problems
> >> where the cdc-wdm driver wanted to skip a message for some reason. So
> >> I'm not surprised that a bug like this is triggered by one of those
> >> modems. That's probably the only thing they are good for :-)
> >
> > I am not sure where exactly the issue lies here. Suggestions for
> > debugging?
>
> Nothing more than the obvious; Get one of those modems and do some
> usbmon snooping.

So for now, should we just revert this problem commit?

thanks,

greg k-h

2024-04-15 11:12:23

by Bjørn Mork

[permalink] [raw]
Subject: Re: [PATCH] usb: cdc-wdm: close race between read and workqueue

Oliver Neukum <[email protected]> writes:
> On 15.04.24 11:26, Bjørn Mork wrote:
>> Oliver Neukum <[email protected]> writes:
>>> On 15.04.24 08:47, Bjørn Mork wrote:
>>>
>>>> urb from service_outstanding_interrupt(). That's why it was added. See
>>>> the explanation Robert wrote when introducing it:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/usb/class/cdc-wdm.c?id=c1da59dad0ebd3f9bd238f3fff82b1f7ffda7829
>>>
>>> Well, the explanation is correct in that we must read
>>> data available. However, if the RESPONDING flag is set
>>> and the URB submitted, we are already doing so.
>> Sounds reasonable. Except that the bug proves we didn't.
>
> Why? I am afraid I do not get that part.

I don't get how it happens either. But that's the only thing changed by
the patch.

> > If you are right that service_outstanding_interrupt can race
> againts
>> itself (and I don't doubt that), then I guess this could also happen
>> between failure to submit the URB and clearing the flag?
>
> Yes, it can. In fact in this case the behavior should not change.
> I am afraid we have a misunderstanding. It seems to me that in the
> unchanged driver the result of service_outstanding_interrupt()
> is undefined.
> Please explain.

Sorry, I am so lost here that I am probably only confusing things. I do
not understand why we unlock &desc->iuspin around the usb_submit_urb
call. And git tells me I wrote that.


Bjørn

2024-04-15 12:09:41

by Aleksander Morgado

[permalink] [raw]
Subject: Re: [PATCH] usb: cdc-wdm: close race between read and workqueue

Hey Oliver,

>> Attached is an example output of mbimcli talking directly to the
>> cdc-wdm port (i.e. without ModemManager or the mbim-proxy).
>
> Could you provide a working example, that is with another chipset? And,
> most important, dmesg for both cases with
> the log level set to maximum?
>

Find attached a bunch of different logs. There are 3 cases tested:
* FM101 (dmesg and mbimcli outputs), this shows the communication
working in a QC based modem.
* L850 (usbmon, dmesg and mbimcli outputs), this shows the break in
the communication.
* L850 with patch reverted (usbmon, dmesg and mbimcli outputs), this
shows the communication working fine again.


Let me know if you need anything else.
Cheers!

--
Aleksander


Attachments:
dmesg-fm101.txt (1.42 kB)
dmesg-l850.txt (1.90 kB)
dmesg-l850-patch-reverted.txt (1.02 kB)
mbimcli-fm101.txt (17.82 kB)
mbimcli-l850.txt (9.45 kB)
mbimcli-l850-patch-reverted.txt (14.01 kB)
usbmon-l850.txt (21.88 kB)
usbmon-l850-reverted.txt (4.67 kB)
Download all attachments