2024-02-28 10:18:59

by Wolfram Sang

Subject: [PATCH RFT] mmc: tmio: avoid concurrent runs of mmc_request_done()

With the to-be-fixed commit, the reset_work handler cleared 'host->mrq'
outside of the spinlock protected critical section. That leaves a small
race window during execution of 'tmio_mmc_reset()' where the done_work
handler could grab a pointer to the now invalid 'host->mrq'. Both would
use it to call mmc_request_done() causing problems (see Link).

However, 'host->mrq' cannot simply be cleared earlier inside the
critical section. That would allow new mrqs to come in asynchronously
while the actual reset of the controller still needs to be done. So,
like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from
coming in but still avoiding concurrency between work handlers.

Reported-by: Dirk Behme <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Wolfram Sang <[email protected]>
Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race")
---

Dirk: could you get this tested on your affected setups? I am somewhat
optimistic that this is already enough. For sure, it is a needed first
step.

drivers/mmc/host/tmio_mmc_core.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/mmc/host/tmio_mmc_core.c b/drivers/mmc/host/tmio_mmc_core.c
index be7f18fd4836..c253d176db69 100644
--- a/drivers/mmc/host/tmio_mmc_core.c
+++ b/drivers/mmc/host/tmio_mmc_core.c
@@ -259,6 +259,8 @@ static void tmio_mmc_reset_work(struct work_struct *work)
 	else
 		mrq->cmd->error = -ETIMEDOUT;
 
+	/* No new calls yet, but disallow concurrent tmio_mmc_done_work() */
+	host->mrq = ERR_PTR(-EBUSY);
 	host->cmd = NULL;
 	host->data = NULL;

--
2.43.0



2024-02-29 06:22:17

by Dirk Behme

Subject: Re: [PATCH RFT] mmc: tmio: avoid concurrent runs of mmc_request_done()

Hi Wolfram,

On 28.02.2024 11:03, Wolfram Sang wrote:
> With the to-be-fixed commit, the reset_work handler cleared 'host->mrq'
> outside of the spinlock protected critical section. That leaves a small
> race window during execution of 'tmio_mmc_reset()' where the done_work
> handler could grab a pointer to the now invalid 'host->mrq'. Both would
> use it to call mmc_request_done() causing problems (see Link).
>
> However, 'host->mrq' cannot simply be cleared earlier inside the
> critical section. That would allow new mrqs to come in asynchronously
> while the actual reset of the controller still needs to be done. So,
> like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from
> coming in but still avoiding concurrency between work handlers.
>
> Reported-by: Dirk Behme <[email protected]>
> Closes: https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Wolfram Sang <[email protected]>
> Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race")

Tested-by: Dirk Behme <[email protected]>
Reviewed-by: Dirk Behme <[email protected]>

> ---
>
> Dirk: could you get this tested on your affected setups? I am somewhat
> optimistic that this is already enough. For sure, it is a needed first
> step.

Testing looks good :) Many thanks!

At least the issues we observed before are no longer seen. As we are
not exactly sure about the root cause, this is of course not 100% proof.
But as the change looks good, does not look like it will break anything,
and the system behaves well with it, I would say we are good to go.

I think we could add something like

Cc: [email protected] # 3.0+

?

> drivers/mmc/host/tmio_mmc_core.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/mmc/host/tmio_mmc_core.c b/drivers/mmc/host/tmio_mmc_core.c
> index be7f18fd4836..c253d176db69 100644
> --- a/drivers/mmc/host/tmio_mmc_core.c
> +++ b/drivers/mmc/host/tmio_mmc_core.c
> @@ -259,6 +259,8 @@ static void tmio_mmc_reset_work(struct work_struct *work)
> 	else
> 		mrq->cmd->error = -ETIMEDOUT;
>
> +	/* No new calls yet, but disallow concurrent tmio_mmc_done_work() */
> +	host->mrq = ERR_PTR(-EBUSY);
> 	host->cmd = NULL;
> 	host->data = NULL;

Thanks again!

Dirk

2024-02-29 07:33:37

by Wolfram Sang

Subject: Re: [PATCH RFT] mmc: tmio: avoid concurrent runs of mmc_request_done()

Hi Dirk,

> > With the to-be-fixed commit, the reset_work handler cleared 'host->mrq'
> > outside of the spinlock protected critical section. That leaves a small
> > race window during execution of 'tmio_mmc_reset()' where the done_work
> > handler could grab a pointer to the now invalid 'host->mrq'. Both would
> > use it to call mmc_request_done() causing problems (see Link).
> >
> > However, 'host->mrq' cannot simply be cleared earlier inside the
> > critical section. That would allow new mrqs to come in asynchronously
> > while the actual reset of the controller still needs to be done. So,
> > like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from
> > coming in but still avoiding concurrency between work handlers.
> >
> > Reported-by: Dirk Behme <[email protected]>
> > Closes: https://lore.kernel.org/all/[email protected]/
> > Signed-off-by: Wolfram Sang <[email protected]>
> > Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race")
>
> Tested-by: Dirk Behme <[email protected]>
> Reviewed-by: Dirk Behme <[email protected]>

Awesome! Thanks for the super-fast tags!

> At least the issues we observed before are no longer seen. As we are not
> exactly sure about the root cause, this is of course not 100% proof. But as
> the change looks good, does not look like it will break anything, and the
> system behaves well with it, I would say we are good to go.

I agree. We don't know if it is all you need. But there definitely was a
race window and closing it removes some observed anomalies. Let's hope
all of them :) I looked many times at the code and, to the best of my
knowledge, don't see side effects. 'host->mrq' stays non-NULL, so, as
before, new mrqs won't be accepted during the reset. Changing it to an
ERR_PTR only affects the check in the done_work handler, which is what
we want. But, of course, more eyes are always welcome.
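
To make the three states of 'host->mrq' easier to review (NULL = idle,
valid pointer = request in flight, ERR_PTR = reset in progress), here is
a rough userspace sketch. It is only a model under assumptions, not the
driver code: 'struct host_model' and the model_*() helpers are
hypothetical names, ERR_PTR()/IS_ERR_OR_NULL() are re-implemented as
stand-ins for the kernel helpers, and the spinlock and the work queues
are left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR_OR_NULL(). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline bool IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Simplified, hypothetical types; the real structs carry far more state. */
struct mmc_request { int error; };
struct host_model { struct mmc_request *mrq; };

/* Submission path: any non-NULL value, sentinel included, means "busy". */
static bool model_submit(struct host_model *host, struct mmc_request *mrq)
{
	if (host->mrq)
		return false;
	host->mrq = mrq;
	return true;
}

/* Completion path: only a real request pointer may be completed. */
static struct mmc_request *model_take_for_done(struct host_model *host)
{
	struct mmc_request *mrq = host->mrq;

	if (IS_ERR_OR_NULL(mrq))
		return NULL;
	host->mrq = NULL;
	return mrq;
}

/* Reset path, first half: claim the request, leave the sentinel behind. */
static struct mmc_request *model_reset_claim(struct host_model *host)
{
	struct mmc_request *mrq = host->mrq;

	if (IS_ERR_OR_NULL(mrq))
		return NULL;
	mrq->error = -ETIMEDOUT;
	host->mrq = ERR_PTR(-EBUSY);
	return mrq;
}

/* Reset path, second half: the reset is done, accept new requests again. */
static void model_reset_finish(struct host_model *host)
{
	assert(host->mrq == ERR_PTR(-EBUSY));
	host->mrq = NULL;
}
```

Between model_reset_claim() and model_reset_finish(), model_submit()
still refuses new requests because the field is non-NULL, and
model_take_for_done() returns NULL because the field is not a valid
request. So only the reset path holds the request it is about to
complete.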

> I think we could add something like
>
> Cc: [email protected] # 3.0+

Yes, we should definitely have that. I would have added it once your
testing got good results. This affects every Renesas SDHI or Uniphier SD
instance since 3.0 (12 years). Wow! So, thanks a ton for your report and
assistance in debugging it. Very much appreciated! And, phew, I am happy
that this solution does not make the locking more complex \o/

All the best,

Wolfram

