2020-05-06 14:38:48

by Veerabhadrarao Badiganti

Subject: [PATCH V1 2/2] mmc: core: Fix recursive locking issue in CQE recovery path

From: Sarthak Garg <[email protected]>

Consider the following stack trace

-001|raw_spin_lock_irqsave
-002|mmc_blk_cqe_complete_rq
-003|__blk_mq_complete_request(inline)
-003|blk_mq_complete_request(rq)
-004|mmc_cqe_timed_out(inline)
-004|mmc_mq_timed_out

mmc_mq_timed_out() acquires the queue_lock first. mmc_blk_cqe_complete_rq(),
invoked further down the same path, then tries to acquire the same queue
lock, resulting in recursive locking: the task spins on a lock it has
already acquired, leading to a watchdog bark.

Fix this issue by holding the lock only for the required critical section.

Cc: <[email protected]> # v4.19+
Suggested-by: Sahitya Tummala <[email protected]>
Signed-off-by: Sarthak Garg <[email protected]>
---
drivers/mmc/core/queue.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 25bee3d..72bef39 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -107,7 +107,7 @@ static enum blk_eh_timer_return mmc_cqe_timed_out(struct request *req)
 	case MMC_ISSUE_DCMD:
 		if (host->cqe_ops->cqe_timeout(host, mrq, &recovery_needed)) {
 			if (recovery_needed)
-				__mmc_cqe_recovery_notifier(mq);
+				mmc_cqe_recovery_notifier(mrq);
 			return BLK_EH_RESET_TIMER;
 		}
 		/* No timeout (XXX: huh? comment doesn't make much sense) */
@@ -131,12 +131,13 @@ static enum blk_eh_timer_return mmc_mq_timed_out(struct request *req,
 
 	spin_lock_irqsave(&mq->lock, flags);
 
-	if (mq->recovery_needed || !mq->use_cqe || host->hsq_enabled)
+	if (mq->recovery_needed || !mq->use_cqe || host->hsq_enabled) {
 		ret = BLK_EH_RESET_TIMER;
-	else
+		spin_unlock_irqrestore(&mq->lock, flags);
+	} else {
+		spin_unlock_irqrestore(&mq->lock, flags);
 		ret = mmc_cqe_timed_out(req);
-
-	spin_unlock_irqrestore(&mq->lock, flags);
+	}
 
 	return ret;
 }
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc., is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
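The deadlock described in the commit message above is easy to see outside the
kernel. The following is a minimal userspace C analogue, not kernel code; the
lock and function names are invented for illustration only. A "timeout
handler" takes a non-recursive spinlock and then calls a "completion" helper
that needs the same lock; trylock is used in the helper so the self-deadlock
is reported instead of spinning forever.

/*
 * Minimal userspace analogue of the recursive locking described above.
 * NOT kernel code: all names here are invented for illustration.
 * Build with: gcc demo.c -o demo -lpthread
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t queue_lock;

/* Stands in for a completion path that also needs the queue lock. */
static void complete_request(void)
{
	if (pthread_spin_trylock(&queue_lock) == EBUSY) {
		printf("recursive lock: completion needs a lock its caller already holds\n");
		return;
	}
	printf("completion ran with the lock\n");
	pthread_spin_unlock(&queue_lock);
}

/* Before the fix: the lock is held across the call into the completion path. */
static void timed_out_broken(void)
{
	pthread_spin_lock(&queue_lock);
	complete_request();	/* with a real spin_lock() this would spin forever */
	pthread_spin_unlock(&queue_lock);
}

/* After the fix: only the state check is done under the lock. */
static void timed_out_fixed(void)
{
	pthread_spin_lock(&queue_lock);
	/* ...check queue state under the lock... */
	pthread_spin_unlock(&queue_lock);
	complete_request();
}

int main(void)
{
	pthread_spin_init(&queue_lock, PTHREAD_PROCESS_PRIVATE);
	timed_out_broken();
	timed_out_fixed();
	pthread_spin_destroy(&queue_lock);
	return 0;
}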


2020-05-07 11:52:15

by Adrian Hunter

Subject: Re: [PATCH V1 2/2] mmc: core: Fix recursive locking issue in CQE recovery path

On 6/05/20 5:34 pm, Veerabhadrarao Badiganti wrote:
> From: Sarthak Garg <[email protected]>
>
> Consider the following stack trace
>
> -001|raw_spin_lock_irqsave
> -002|mmc_blk_cqe_complete_rq
> -003|__blk_mq_complete_request(inline)
> -003|blk_mq_complete_request(rq)
> -004|mmc_cqe_timed_out(inline)
> -004|mmc_mq_timed_out
>
> mmc_mq_timed_out() acquires the queue_lock first. mmc_blk_cqe_complete_rq(),
> invoked further down the same path, then tries to acquire the same queue
> lock, resulting in recursive locking: the task spins on a lock it has
> already acquired, leading to a watchdog bark.
>
> Fix this issue by holding the lock only for the required critical section.
>
> Cc: <[email protected]> # v4.19+
> Suggested-by: Sahitya Tummala <[email protected]>
> Signed-off-by: Sarthak Garg <[email protected]>
> ---
> drivers/mmc/core/queue.c | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
> index 25bee3d..72bef39 100644
> --- a/drivers/mmc/core/queue.c
> +++ b/drivers/mmc/core/queue.c
> @@ -107,7 +107,7 @@ static enum blk_eh_timer_return mmc_cqe_timed_out(struct request *req)
> case MMC_ISSUE_DCMD:
> if (host->cqe_ops->cqe_timeout(host, mrq, &recovery_needed)) {
> if (recovery_needed)
> - __mmc_cqe_recovery_notifier(mq);
> + mmc_cqe_recovery_notifier(mrq);
> return BLK_EH_RESET_TIMER;
> }
> /* No timeout (XXX: huh? comment doesn't make much sense) */
> @@ -131,12 +131,13 @@ static enum blk_eh_timer_return mmc_mq_timed_out(struct request *req,
>
> spin_lock_irqsave(&mq->lock, flags);
>
> - if (mq->recovery_needed || !mq->use_cqe || host->hsq_enabled)
> + if (mq->recovery_needed || !mq->use_cqe || host->hsq_enabled) {
> ret = BLK_EH_RESET_TIMER;
> - else
> + spin_unlock_irqrestore(&mq->lock, flags);
> + } else {
> + spin_unlock_irqrestore(&mq->lock, flags);
> ret = mmc_cqe_timed_out(req);
> -
> - spin_unlock_irqrestore(&mq->lock, flags);
> + }

This looks good, but I think there needs to be another change as well. I will
send a patch for that, but in the meantime maybe you could straighten up the
code flow through the spinlock, e.g.:

spin_lock_irqsave(&mq->lock, flags);
ignore = mq->recovery_needed || !mq->use_cqe || host->hsq_enabled;
spin_unlock_irqrestore(&mq->lock, flags);

return ignore ? BLK_EH_RESET_TIMER : mmc_cqe_timed_out(req);

And add a fixes tag.

>
> return ret;
> }
>
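For reference, with the flow Adrian suggests above applied, the whole of
mmc_mq_timed_out() would read roughly as follows. This is only a sketch
assembled from the hunks in this thread: the second parameter and the local
declarations above the diff context are assumed from the kernel code of that
era, and the V2 patch further down implements essentially this.

static enum blk_eh_timer_return mmc_mq_timed_out(struct request *req,
						 bool reserved)
{
	struct request_queue *q = req->q;
	struct mmc_queue *mq = q->queuedata;
	struct mmc_card *card = mq->card;
	struct mmc_host *host = card->host;
	unsigned long flags;
	bool ignore;

	/* Take the lock only to sample the flags; drop it before any completion work. */
	spin_lock_irqsave(&mq->lock, flags);
	ignore = mq->recovery_needed || !mq->use_cqe || host->hsq_enabled;
	spin_unlock_irqrestore(&mq->lock, flags);

	return ignore ? BLK_EH_RESET_TIMER : mmc_cqe_timed_out(req);
}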

2020-05-07 14:09:01

by Adrian Hunter

Subject: [PATCH] mmc: block: Fix request completion in the CQE timeout path

First, it should be noted that the CQE timeout (60 seconds) is substantial,
so a CQE request that times out is really stuck, and the race between
timeout and completion is extremely unlikely. Nevertheless, this patch
fixes an issue with it.

Commit ad73d6feadbd7b ("mmc: complete requests from ->timeout")
preserved the existing functionality of completing the request.
However, that had only been necessary because the block layer
timeout handler had been marking the request to prevent it from being
completed normally. That restriction was removed at the same time, the
result being that a request that has already gone will have been
completed anyway. That is, the completion in the timeout handler became
unnecessary.

At the time, the unnecessary completion was harmless because the block
layer would ignore it, although that changed in kernel v5.0.

Note for stable: this patch will not apply cleanly without the patch "mmc:
core: Fix recursive locking issue in CQE recovery path".

Signed-off-by: Adrian Hunter <[email protected]>
Fixes: ad73d6feadbd7b ("mmc: complete requests from ->timeout")
Cc: [email protected]
---


This is the patch I alluded to when replying to "mmc: core: Fix recursive
locking issue in CQE recovery path".


drivers/mmc/core/queue.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 72bef39d7011..10ea67892b5f 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -110,8 +110,7 @@ static enum blk_eh_timer_return mmc_cqe_timed_out(struct
request *req)
mmc_cqe_recovery_notifier(mrq);
return BLK_EH_RESET_TIMER;
}
- /* No timeout (XXX: huh? comment doesn't make much sense) */
- blk_mq_complete_request(req);
+ /* The request has gone already */
return BLK_EH_DONE;
default:
/* Timeout is handled by mmc core */
--
2.17.1

2020-05-07 16:18:32

by Veerabhadrarao Badiganti

Subject: [PATCH V2] mmc: core: Fix recursive locking issue in CQE recovery path

From: Sarthak Garg <[email protected]>

Consider the following stack trace

-001|raw_spin_lock_irqsave
-002|mmc_blk_cqe_complete_rq
-003|__blk_mq_complete_request(inline)
-003|blk_mq_complete_request(rq)
-004|mmc_cqe_timed_out(inline)
-004|mmc_mq_timed_out

mmc_mq_timed_out() acquires the queue_lock first. mmc_blk_cqe_complete_rq(),
invoked further down the same path, then tries to acquire the same queue
lock, resulting in recursive locking: the task spins on a lock it has
already acquired, leading to a watchdog bark.

Fix this issue by holding the lock only for the required critical section.

Cc: <[email protected]>
Fixes: 1e8e55b67030 ("mmc: block: Add CQE support")
Suggested-by: Sahitya Tummala <[email protected]>
Signed-off-by: Sarthak Garg <[email protected]>
---
drivers/mmc/core/queue.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 25bee3d..b5fd3bc 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -107,7 +107,7 @@ static enum blk_eh_timer_return mmc_cqe_timed_out(struct request *req)
 	case MMC_ISSUE_DCMD:
 		if (host->cqe_ops->cqe_timeout(host, mrq, &recovery_needed)) {
 			if (recovery_needed)
-				__mmc_cqe_recovery_notifier(mq);
+				mmc_cqe_recovery_notifier(mrq);
 			return BLK_EH_RESET_TIMER;
 		}
 		/* No timeout (XXX: huh? comment doesn't make much sense) */
@@ -127,18 +127,13 @@ static enum blk_eh_timer_return mmc_mq_timed_out(struct request *req,
 	struct mmc_card *card = mq->card;
 	struct mmc_host *host = card->host;
 	unsigned long flags;
-	int ret;
+	bool ignore_tout;
 
 	spin_lock_irqsave(&mq->lock, flags);
-
-	if (mq->recovery_needed || !mq->use_cqe || host->hsq_enabled)
-		ret = BLK_EH_RESET_TIMER;
-	else
-		ret = mmc_cqe_timed_out(req);
-
+	ignore_tout = mq->recovery_needed || !mq->use_cqe || host->hsq_enabled;
 	spin_unlock_irqrestore(&mq->lock, flags);
 
-	return ret;
+	return ignore_tout ? BLK_EH_RESET_TIMER : mmc_cqe_timed_out(req);
 }
 
 static void mmc_mq_recovery_handler(struct work_struct *work)
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc., is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

2020-05-07 17:23:28

by Adrian Hunter

Subject: Re: [PATCH V2] mmc: core: Fix recursive locking issue in CQE recovery path

On 7/05/20 7:15 pm, Veerabhadrarao Badiganti wrote:
> From: Sarthak Garg <[email protected]>
>
> Consider the following stack trace
>
> -001|raw_spin_lock_irqsave
> -002|mmc_blk_cqe_complete_rq
> -003|__blk_mq_complete_request(inline)
> -003|blk_mq_complete_request(rq)
> -004|mmc_cqe_timed_out(inline)
> -004|mmc_mq_timed_out
>
> mmc_mq_timed_out() acquires the queue_lock first. mmc_blk_cqe_complete_rq(),
> invoked further down the same path, then tries to acquire the same queue
> lock, resulting in recursive locking: the task spins on a lock it has
> already acquired, leading to a watchdog bark.
>
> Fix this issue by holding the lock only for the required critical section.
>
> Cc: <[email protected]>
> Fixes: 1e8e55b67030 ("mmc: block: Add CQE support")
> Suggested-by: Sahitya Tummala <[email protected]>
> Signed-off-by: Sarthak Garg <[email protected]>

Acked-by: Adrian Hunter <[email protected]>

> ---
> drivers/mmc/core/queue.c | 13 ++++---------
> 1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
> index 25bee3d..b5fd3bc 100644
> --- a/drivers/mmc/core/queue.c
> +++ b/drivers/mmc/core/queue.c
> @@ -107,7 +107,7 @@ static enum blk_eh_timer_return mmc_cqe_timed_out(struct request *req)
> case MMC_ISSUE_DCMD:
> if (host->cqe_ops->cqe_timeout(host, mrq, &recovery_needed)) {
> if (recovery_needed)
> - __mmc_cqe_recovery_notifier(mq);
> + mmc_cqe_recovery_notifier(mrq);
> return BLK_EH_RESET_TIMER;
> }
> /* No timeout (XXX: huh? comment doesn't make much sense) */
> @@ -127,18 +127,13 @@ static enum blk_eh_timer_return mmc_mq_timed_out(struct request *req,
> struct mmc_card *card = mq->card;
> struct mmc_host *host = card->host;
> unsigned long flags;
> - int ret;
> + bool ignore_tout;
>
> spin_lock_irqsave(&mq->lock, flags);
> -
> - if (mq->recovery_needed || !mq->use_cqe || host->hsq_enabled)
> - ret = BLK_EH_RESET_TIMER;
> - else
> - ret = mmc_cqe_timed_out(req);
> -
> + ignore_tout = mq->recovery_needed || !mq->use_cqe || host->hsq_enabled;
> spin_unlock_irqrestore(&mq->lock, flags);
>
> - return ret;
> + return ignore_tout ? BLK_EH_RESET_TIMER : mmc_cqe_timed_out(req);
> }
>
> static void mmc_mq_recovery_handler(struct work_struct *work)
>

2020-05-08 05:29:32

by Ulf Hansson

Subject: Re: [PATCH] mmc: block: Fix request completion in the CQE timeout path

On Thu, 7 May 2020 at 16:06, Adrian Hunter <[email protected]> wrote:
>
> First, it should be noted that the CQE timeout (60 seconds) is substantial,
> so a CQE request that times out is really stuck, and the race between
> timeout and completion is extremely unlikely. Nevertheless, this patch
> fixes an issue with it.
>
> Commit ad73d6feadbd7b ("mmc: complete requests from ->timeout")
> preserved the existing functionality of completing the request.
> However, that had only been necessary because the block layer
> timeout handler had been marking the request to prevent it from being
> completed normally. That restriction was removed at the same time, the
> result being that a request that has already gone will have been
> completed anyway. That is, the completion in the timeout handler became
> unnecessary.
>
> At the time, the unnecessary completion was harmless because the block
> layer would ignore it, although that changed in kernel v5.0.
>
> Note for stable: this patch will not apply cleanly without the patch "mmc:
> core: Fix recursive locking issue in CQE recovery path".
>
> Signed-off-by: Adrian Hunter <[email protected]>
> Fixes: ad73d6feadbd7b ("mmc: complete requests from ->timeout")
> Cc: [email protected]
> ---
>
>
> This is the patch I alluded to when replying to "mmc: core: Fix recursive
> locking issue in CQE recovery path"

Looks like the patch got corrupted. I was trying to fix it, but just
couldn't figure it out.

Can you please re-format and do a repost?

Kind regards
Uffe

>
>
> drivers/mmc/core/queue.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
> index 72bef39d7011..10ea67892b5f 100644
> --- a/drivers/mmc/core/queue.c
> +++ b/drivers/mmc/core/queue.c
> @@ -110,8 +110,7 @@ static enum blk_eh_timer_return mmc_cqe_timed_out(struct
> request *req)
> mmc_cqe_recovery_notifier(mrq);
> return BLK_EH_RESET_TIMER;
> }
> - /* No timeout (XXX: huh? comment doesn't make much sense) */
> - blk_mq_complete_request(req);
> + /* The request has gone already */
> return BLK_EH_DONE;
> default:
> /* Timeout is handled by mmc core */
> --
> 2.17.1
>

2020-05-08 06:24:17

by Adrian Hunter

Subject: [PATCH RESEND] mmc: block: Fix request completion in the CQE timeout path

First, it should be noted that the CQE timeout (60 seconds) is substantial,
so a CQE request that times out is really stuck, and the race between
timeout and completion is extremely unlikely. Nevertheless, this patch
fixes an issue with it.

Commit ad73d6feadbd7b ("mmc: complete requests from ->timeout")
preserved the existing functionality of completing the request.
However, that had only been necessary because the block layer
timeout handler had been marking the request to prevent it from being
completed normally. That restriction was removed at the same time, the
result being that a request that has already gone will have been
completed anyway. That is, the completion was unnecessary.

At the time, the unnecessary completion was harmless because the block
layer would ignore it, although that changed in kernel v5.0.

Note for stable: this patch will not apply cleanly without the patch "mmc:
core: Fix recursive locking issue in CQE recovery path".

Signed-off-by: Adrian Hunter <[email protected]>
Fixes: ad73d6feadbd7b ("mmc: complete requests from ->timeout")
Cc: [email protected]
---
drivers/mmc/core/queue.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 72bef39d7011..10ea67892b5f 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -110,8 +110,7 @@ static enum blk_eh_timer_return mmc_cqe_timed_out(struct request *req)
 				mmc_cqe_recovery_notifier(mrq);
 			return BLK_EH_RESET_TIMER;
 		}
-		/* No timeout (XXX: huh? comment doesn't make much sense) */
-		blk_mq_complete_request(req);
+		/* The request has gone already */
 		return BLK_EH_DONE;
 	default:
 		/* Timeout is handled by mmc core */
--
2.17.1
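Taken together with the CQE recovery locking fix above, the DCMD branch of
mmc_cqe_timed_out() ends up reading roughly as below once both patches are
applied. This is only a sketch reconstructed from the hunks in this thread;
the final return in the default branch is assumed from the surrounding code.

	case MMC_ISSUE_DCMD:
		if (host->cqe_ops->cqe_timeout(host, mrq, &recovery_needed)) {
			/*
			 * A genuine timeout: the request is still stuck, so
			 * notify recovery if asked to and rearm the timer.
			 */
			if (recovery_needed)
				mmc_cqe_recovery_notifier(mrq);
			return BLK_EH_RESET_TIMER;
		}
		/*
		 * The request has gone already, i.e. it completed normally,
		 * so it must not be completed again from the timeout handler.
		 */
		return BLK_EH_DONE;
	default:
		/* Timeout is handled by mmc core */
		return BLK_EH_RESET_TIMER;	/* assumed from surrounding code */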

2020-05-08 08:15:25

by Ulf Hansson

Subject: Re: [PATCH V2] mmc: core: Fix recursive locking issue in CQE recovery path

On Thu, 7 May 2020 at 18:15, Veerabhadrarao Badiganti
<[email protected]> wrote:
>
> From: Sarthak Garg <[email protected]>
>
> Consider the following stack trace
>
> -001|raw_spin_lock_irqsave
> -002|mmc_blk_cqe_complete_rq
> -003|__blk_mq_complete_request(inline)
> -003|blk_mq_complete_request(rq)
> -004|mmc_cqe_timed_out(inline)
> -004|mmc_mq_timed_out
>
> mmc_mq_timed_out() acquires the queue_lock first. mmc_blk_cqe_complete_rq(),
> invoked further down the same path, then tries to acquire the same queue
> lock, resulting in recursive locking: the task spins on a lock it has
> already acquired, leading to a watchdog bark.
>
> Fix this issue by holding the lock only for the required critical section.
>
> Cc: <[email protected]>
> Fixes: 1e8e55b67030 ("mmc: block: Add CQE support")
> Suggested-by: Sahitya Tummala <[email protected]>
> Signed-off-by: Sarthak Garg <[email protected]>

Applied for fixes, thanks!

Kind regards
Uffe


> ---
> drivers/mmc/core/queue.c | 13 ++++---------
> 1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
> index 25bee3d..b5fd3bc 100644
> --- a/drivers/mmc/core/queue.c
> +++ b/drivers/mmc/core/queue.c
> @@ -107,7 +107,7 @@ static enum blk_eh_timer_return mmc_cqe_timed_out(struct request *req)
> case MMC_ISSUE_DCMD:
> if (host->cqe_ops->cqe_timeout(host, mrq, &recovery_needed)) {
> if (recovery_needed)
> - __mmc_cqe_recovery_notifier(mq);
> + mmc_cqe_recovery_notifier(mrq);
> return BLK_EH_RESET_TIMER;
> }
> /* No timeout (XXX: huh? comment doesn't make much sense) */
> @@ -127,18 +127,13 @@ static enum blk_eh_timer_return mmc_mq_timed_out(struct request *req,
> struct mmc_card *card = mq->card;
> struct mmc_host *host = card->host;
> unsigned long flags;
> - int ret;
> + bool ignore_tout;
>
> spin_lock_irqsave(&mq->lock, flags);
> -
> - if (mq->recovery_needed || !mq->use_cqe || host->hsq_enabled)
> - ret = BLK_EH_RESET_TIMER;
> - else
> - ret = mmc_cqe_timed_out(req);
> -
> + ignore_tout = mq->recovery_needed || !mq->use_cqe || host->hsq_enabled;
> spin_unlock_irqrestore(&mq->lock, flags);
>
> - return ret;
> + return ignore_tout ? BLK_EH_RESET_TIMER : mmc_cqe_timed_out(req);
> }
>
> static void mmc_mq_recovery_handler(struct work_struct *work)
> --
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc., is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

2020-05-08 08:20:26

by Ulf Hansson

Subject: Re: [PATCH RESEND] mmc: block: Fix request completion in the CQE timeout path

On Fri, 8 May 2020 at 08:22, Adrian Hunter <[email protected]> wrote:
>
> First, it should be noted that the CQE timeout (60 seconds) is substantial,
> so a CQE request that times out is really stuck, and the race between
> timeout and completion is extremely unlikely. Nevertheless, this patch
> fixes an issue with it.
>
> Commit ad73d6feadbd7b ("mmc: complete requests from ->timeout")
> preserved the existing functionality of completing the request.
> However, that had only been necessary because the block layer
> timeout handler had been marking the request to prevent it from being
> completed normally. That restriction was removed at the same time, the
> result being that a request that has already gone will have been
> completed anyway. That is, the completion was unnecessary.
>
> At the time, the unnecessary completion was harmless because the block
> layer would ignore it, although that changed in kernel v5.0.
>
> Note for stable: this patch will not apply cleanly without the patch "mmc:
> core: Fix recursive locking issue in CQE recovery path".
>
> Signed-off-by: Adrian Hunter <[email protected]>
> Fixes: ad73d6feadbd7b ("mmc: complete requests from ->timeout")
> Cc: [email protected]

Applied for fixes, thanks!

Kind regards
Uffe

> ---
> drivers/mmc/core/queue.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
> index 72bef39d7011..10ea67892b5f 100644
> --- a/drivers/mmc/core/queue.c
> +++ b/drivers/mmc/core/queue.c
> @@ -110,8 +110,7 @@ static enum blk_eh_timer_return mmc_cqe_timed_out(struct request *req)
> mmc_cqe_recovery_notifier(mrq);
> return BLK_EH_RESET_TIMER;
> }
> - /* No timeout (XXX: huh? comment doesn't make much sense) */
> - blk_mq_complete_request(req);
> + /* The request has gone already */
> return BLK_EH_DONE;
> default:
> /* Timeout is handled by mmc core */
> --
> 2.17.1
>