2013-06-20 23:48:13

by Doug Anderson

[permalink] [raw]
Subject: [PATCH] mmc: dw_mmc: don't queue up a card detect at slot startup

The MMC subsystem handles looking for a card at probe time. Queuing
up our own can race with the rest of the MMC subsystem and cause
problems if we get unlucky with timing. Just remove our own
detection.

Specifically, I found that with just the right set of printouts in my
system that one of the three SD/MMC devices in my system was having
trouble probing. It would get an err -123 (-ENOMEDIUM) during probe.
I found that the source of the error was in
dw_mci_work_routine_card(). Adding more printouts to the code
sometimes made this error go away, so it's a little touchy.

You can see a snippet with my printouts in it (printouts were in
set_ios() on an exynos5420 board with some of our local patches):

[ 4.216595] dwmmc_exynos 12210000.dwmmc1: Using internal DMA controller.
[ 4.395935] dwmmc_exynos 12210000.dwmmc1: Version ID is 250a
[ 4.401948] dwmmc_exynos 12210000.dwmmc1: DW MMC controller at irq 108, 64 bit host data width, 64 deep fifo
[ 4.424430] dwmmc_exynos 12210000.dwmmc1: sdr0 mode (irq=108, width=0)
[ 4.453975] dwmmc_exynos 12210000.dwmmc1: sdr0 mode (irq=108, width=0)
[ 4.459592] mmc_host mmc1: Bus speed (slot 0) = 100000000Hz (slot req 400000Hz, actual 400000HZ div = 125)
[ 4.484258] dwmmc_exynos 12210000.dwmmc1: 1 slots initialized
[ 4.485406] dwmmc_exynos 12210000.dwmmc1: sdr0 mode (irq=108, width=0)
[ 4.487606] dwmmc_exynos 12210000.dwmmc1: sdr0 mode (irq=108, width=0)
[ 4.489794] dwmmc_exynos 12210000.dwmmc1: sdr0 mode (irq=108, width=0)
[ 4.509757] mmc1: error -123 whilst initialising SDIO card

While digging I found that doing our own card detection at init time
didn't appear to be necessary. If I remove this code then cards that
are in the system at bootup are still detected just fine.

Signed-off-by: Doug Anderson <[email protected]>
---
drivers/mmc/host/dw_mmc.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index bc3a1bc..c3a6fe9 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -2016,12 +2016,6 @@ static int dw_mci_init_slot(struct dw_mci *host, unsigned int id)
/* Card initially undetected */
slot->last_detect_state = 0;

- /*
- * Card may have been plugged in prior to boot so we
- * need to run the detect tasklet
- */
- queue_work(host->card_workqueue, &host->card_work);
-
return 0;

err_setup_bus:
--
1.8.3


2013-06-24 08:04:36

by Seungwon Jeon

[permalink] [raw]
Subject: RE: [PATCH] mmc: dw_mmc: don't queue up a card detect at slot startup

Hi,

On Fri, June 21 2013, Doug Anderson wrote:
> The MMC subsystem handles looking for a card at probe time. Queuing
> up our own can race with the rest of the MMC subsystem and cause
> problems if we get unlucky with timing. Just remove our own
> detection.
This patch looks good to me. I agree above.
Card detection procedure of mmc subsystem will be started by mmc_start_host during probe time.
There is no need to do same in host driver.
Could you describe the race point of this problem and why the duplication makes the problem?
What is described below is not clear.
If a actual detection of card is triggered during probe, similar problem may be occurred in spite of this patch.

Thanks,
Seungwon Jeon
>
> Specifically, I found that with just the right set of printouts in my
> system that one of the three SD/MMC devices in my system was having
> trouble probing. It would get an err -123 (-ENOMEDIUM) during probe.
> I found that the source of the error was in
> dw_mci_work_routine_card(). Adding more printouts to the code
> sometimes made this error go away, so it's a little touchy.
>
> You can see a snippet with my printouts in it (printouts were in
> set_ios() on an exynos5420 board with some of our local patches):
>
> [ 4.216595] dwmmc_exynos 12210000.dwmmc1: Using internal DMA controller.
> [ 4.395935] dwmmc_exynos 12210000.dwmmc1: Version ID is 250a
> [ 4.401948] dwmmc_exynos 12210000.dwmmc1: DW MMC controller at irq 108, 64 bit host data width, 64
> deep fifo
> [ 4.424430] dwmmc_exynos 12210000.dwmmc1: sdr0 mode (irq=108, width=0)
> [ 4.453975] dwmmc_exynos 12210000.dwmmc1: sdr0 mode (irq=108, width=0)
> [ 4.459592] mmc_host mmc1: Bus speed (slot 0) = 100000000Hz (slot req 400000Hz, actual 400000HZ div
> = 125)
> [ 4.484258] dwmmc_exynos 12210000.dwmmc1: 1 slots initialized
> [ 4.485406] dwmmc_exynos 12210000.dwmmc1: sdr0 mode (irq=108, width=0)
> [ 4.487606] dwmmc_exynos 12210000.dwmmc1: sdr0 mode (irq=108, width=0)
> [ 4.489794] dwmmc_exynos 12210000.dwmmc1: sdr0 mode (irq=108, width=0)
> [ 4.509757] mmc1: error -123 whilst initialising SDIO card
>
> While digging I found that doing our own card detection at init time
> didn't appear to be necessary. If I remove this code then cards that
> are in the system at bootup are still detected just fine.
>
> Signed-off-by: Doug Anderson <[email protected]>
> ---
> drivers/mmc/host/dw_mmc.c | 6 ------
> 1 file changed, 6 deletions(-)
>
> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> index bc3a1bc..c3a6fe9 100644
> --- a/drivers/mmc/host/dw_mmc.c
> +++ b/drivers/mmc/host/dw_mmc.c
> @@ -2016,12 +2016,6 @@ static int dw_mci_init_slot(struct dw_mci *host, unsigned int id)
> /* Card initially undetected */
> slot->last_detect_state = 0;
>
> - /*
> - * Card may have been plugged in prior to boot so we
> - * need to run the detect tasklet
> - */
> - queue_work(host->card_workqueue, &host->card_work);
> -
> return 0;
>
> err_setup_bus:
> --
> 1.8.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-06-28 16:56:16

by Doug Anderson

[permalink] [raw]
Subject: Re: [PATCH] mmc: dw_mmc: don't queue up a card detect at slot startup

Seungwon,

On Mon, Jun 24, 2013 at 1:04 AM, Seungwon Jeon <[email protected]> wrote:
> This patch looks good to me. I agree above.
> Card detection procedure of mmc subsystem will be started by mmc_start_host during probe time.
> There is no need to do same in host driver.
> Could you describe the race point of this problem and why the duplication makes the problem?
> What is described below is not clear.
> If a actual detection of card is triggered during probe, similar problem may be occurred in spite of this patch.

OK, so I think the race is between the "mmc_rescan" thread and the
"dw_mci_work_routine_card" thread:

The "mmc_rescan" thread sets "host->state = STATE_SENDING_CMD" in
dw_mci_queue_request() with a stack crawl that looks like this (from
dump_stack):

[<803c5d10>] (dw_mci_request+0xb0/0x100) from [<803b1864>]
(__mmc_start_req+0x14c/0x164)
[<803b1864>] (__mmc_start_req+0x14c/0x164) from [<803b189c>]
(mmc_wait_for_req+0x20/0x30)
[<803b189c>] (mmc_wait_for_req+0x20/0x30) from [<803b1934>]
(mmc_wait_for_cmd+0x88/0xb4)
[<803b1934>] (mmc_wait_for_cmd+0x88/0xb4) from [<803bae34>]
(mmc_io_rw_direct_host+0xd0/0x160)
[<803bae34>] (mmc_io_rw_direct_host+0xd0/0x160) from [<803bb324>]
(sdio_reset+0x44/0x9c)
[<803bb324>] (sdio_reset+0x44/0x9c) from [<803b35d8>] (mmc_rescan+0x230/0x2c8)
[<803b35d8>] (mmc_rescan+0x230/0x2c8) from [<8004619c>]
(process_one_work+0x25c/0x418)
[<8004619c>] (process_one_work+0x25c/0x418) from [<8004680c>]
(worker_thread+0x280/0x3c4)
[<8004680c>] (worker_thread+0x280/0x3c4) from [<8004b4c0>] (kthread+0xc8/0xdc)
[<8004b4c0>] (kthread+0xc8/0xdc) from [<8000e758>] (ret_from_fork+0x14/0x20)


It's got the host->lock when it does that. ...but then it releases
the lock at the end of dw_mci_request(). That allows
dw_mci_work_routine_card() to jump in. You can see its stack crawl
(from kgdb):

#1 0x803c8008 in dw_mci_work_routine_card (work=0xee6f8ba0) at
/.../drivers/mmc/host/dw_mmc.c:1720
#2 0x8004619c in process_one_work (worker=worker@entry=0xeea74d00,
work=0xee6f8ba0) at /.../kernel/workqueue.c:2263
#3 0x8004680c in worker_thread (__worker=__worker@entry=0xeea74d00)
at /.../kernel/workqueue.c:2383
#4 0x8004b4c0 in kthread (_create=0xef209e38) at /.../kernel/kthread.c:168


I can reproduce the problem reliably by adding an "mdelay(100);" in
dw_mci_queue_request() right after "host->state = STATE_SENDING_CMD;".

When I do that and add a kgdb_breakpoint() in
dw_mci_work_routine_card() for the -ENOMEDIUM case, I can even see the
state of the "mmc_rescan" thread with kgdb:

#0 mmc_wait_for_req (host=host@entry=0xee702140,
mrq=mrq@entry=0xef1efdbc) at /.../drivers/mmc/core/core.c:474
#1 0x803b1934 in mmc_wait_for_cmd (host=host@entry=0xee702140,
cmd=cmd@entry=0xef1efe14, retries=0) at
/.../drivers/mmc/core/core.c:567
#2 0x803bae34 in mmc_io_rw_direct_host (host=host@entry=0xee702140,
write=write@entry=0, fn=fn@entry=0, addr=addr@entry=6, in=0 '\000',
out=out@entry=0xef1efe83 "\356\027\345\032'")
at /.../drivers/mmc/core/sdio_ops.c:89
#3 0x803bb324 in sdio_reset (host=host@entry=0xee702140) at
/.../drivers/mmc/core/sdio_ops.c:214
#4 0x803b35d8 in mmc_rescan_try_freq (freq=<optimized out>,
host=0xee702140) at /.../drivers/mmc/core/core.c:2084
#5 mmc_rescan (work=0xee7023ac) at /.../drivers/mmc/core/core.c:2210
#6 0x8004619c in process_one_work (worker=worker@entry=0xef0af900,
work=0xee7023ac) at /.../kernel/workqueue.c:2263
#7 0x8004680c in worker_thread (__worker=__worker@entry=0xef0af900)
at /.../kernel/workqueue.c:2383
#8 0x8004b4c0 in kthread (_create=0xef18de40) at /.../kernel/kthread.c:168
#9 0x8000e758 in ret_from_fork () at /.../arch/arm/kernel/entry-common.S:92
#10 0x8000e758 in ret_from_fork () at /.../arch/arm/kernel/entry-common.S:92
Backtrace stopped: previous frame identical to this frame (corrupt stack?)


I'm not sure I have time to track down the whole race at the moment,
though I may be able to come back to it later. However, we now have a
set of steps to reproduce (I think) and a full description of the
race. Perhaps someone else who knows the code better would be able to
have a whack at it?

In any case, it seems reasonable to still merge my CL, since it makes
the race much less likely (and impossible in the case of non-removable
cards) removes some pointless code. Would you be interested in acking
it?

-Doug

2013-07-01 10:38:05

by Seungwon Jeon

[permalink] [raw]
Subject: RE: [PATCH] mmc: dw_mmc: don't queue up a card detect at slot startup

On Sat, June 29, 2013, Doug Anderson wrote:
> Seungwon,
>
> On Mon, Jun 24, 2013 at 1:04 AM, Seungwon Jeon <[email protected]> wrote:
> > This patch looks good to me. I agree above.
> > Card detection procedure of mmc subsystem will be started by mmc_start_host during probe time.
> > There is no need to do same in host driver.
> > Could you describe the race point of this problem and why the duplication makes the problem?
> > What is described below is not clear.
> > If a actual detection of card is triggered during probe, similar problem may be occurred in spite of
> this patch.
>
> OK, so I think the race is between the "mmc_rescan" thread and the
> "dw_mci_work_routine_card" thread:
>
> The "mmc_rescan" thread sets "host->state = STATE_SENDING_CMD" in
> dw_mci_queue_request() with a stack crawl that looks like this (from
> dump_stack):
>
> [<803c5d10>] (dw_mci_request+0xb0/0x100) from [<803b1864>]
> (__mmc_start_req+0x14c/0x164)
> [<803b1864>] (__mmc_start_req+0x14c/0x164) from [<803b189c>]
> (mmc_wait_for_req+0x20/0x30)
> [<803b189c>] (mmc_wait_for_req+0x20/0x30) from [<803b1934>]
> (mmc_wait_for_cmd+0x88/0xb4)
> [<803b1934>] (mmc_wait_for_cmd+0x88/0xb4) from [<803bae34>]
> (mmc_io_rw_direct_host+0xd0/0x160)
> [<803bae34>] (mmc_io_rw_direct_host+0xd0/0x160) from [<803bb324>]
> (sdio_reset+0x44/0x9c)
> [<803bb324>] (sdio_reset+0x44/0x9c) from [<803b35d8>] (mmc_rescan+0x230/0x2c8)
> [<803b35d8>] (mmc_rescan+0x230/0x2c8) from [<8004619c>]
> (process_one_work+0x25c/0x418)
> [<8004619c>] (process_one_work+0x25c/0x418) from [<8004680c>]
> (worker_thread+0x280/0x3c4)
> [<8004680c>] (worker_thread+0x280/0x3c4) from [<8004b4c0>] (kthread+0xc8/0xdc)
> [<8004b4c0>] (kthread+0xc8/0xdc) from [<8000e758>] (ret_from_fork+0x14/0x20)
>
>
> It's got the host->lock when it does that. ...but then it releases
> the lock at the end of dw_mci_request(). That allows
> dw_mci_work_routine_card() to jump in. You can see its stack crawl
> (from kgdb):
>
> #1 0x803c8008 in dw_mci_work_routine_card (work=0xee6f8ba0) at
> /.../drivers/mmc/host/dw_mmc.c:1720
> #2 0x8004619c in process_one_work (worker=worker@entry=0xeea74d00,
> work=0xee6f8ba0) at /.../kernel/workqueue.c:2263
> #3 0x8004680c in worker_thread (__worker=__worker@entry=0xeea74d00)
> at /.../kernel/workqueue.c:2383
> #4 0x8004b4c0 in kthread (_create=0xef209e38) at /.../kernel/kthread.c:168
>
>
> I can reproduce the problem reliably by adding an "mdelay(100);" in
> dw_mci_queue_request() right after "host->state = STATE_SENDING_CMD;".
>
> When I do that and add a kgdb_breakpoint() in
> dw_mci_work_routine_card() for the -ENOMEDIUM case, I can even see the
> state of the "mmc_rescan" thread with kgdb:
>
> #0 mmc_wait_for_req (host=host@entry=0xee702140,
> mrq=mrq@entry=0xef1efdbc) at /.../drivers/mmc/core/core.c:474
> #1 0x803b1934 in mmc_wait_for_cmd (host=host@entry=0xee702140,
> cmd=cmd@entry=0xef1efe14, retries=0) at
> /.../drivers/mmc/core/core.c:567
> #2 0x803bae34 in mmc_io_rw_direct_host (host=host@entry=0xee702140,
> write=write@entry=0, fn=fn@entry=0, addr=addr@entry=6, in=0 '\000',
> out=out@entry=0xef1efe83 "\356\027\345\032'")
> at /.../drivers/mmc/core/sdio_ops.c:89
> #3 0x803bb324 in sdio_reset (host=host@entry=0xee702140) at
> /.../drivers/mmc/core/sdio_ops.c:214
> #4 0x803b35d8 in mmc_rescan_try_freq (freq=<optimized out>,
> host=0xee702140) at /.../drivers/mmc/core/core.c:2084
> #5 mmc_rescan (work=0xee7023ac) at /.../drivers/mmc/core/core.c:2210
> #6 0x8004619c in process_one_work (worker=worker@entry=0xef0af900,
> work=0xee7023ac) at /.../kernel/workqueue.c:2263
> #7 0x8004680c in worker_thread (__worker=__worker@entry=0xef0af900)
> at /.../kernel/workqueue.c:2383
> #8 0x8004b4c0 in kthread (_create=0xef18de40) at /.../kernel/kthread.c:168
> #9 0x8000e758 in ret_from_fork () at /.../arch/arm/kernel/entry-common.S:92
> #10 0x8000e758 in ret_from_fork () at /.../arch/arm/kernel/entry-common.S:92
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>
>
> I'm not sure I have time to track down the whole race at the moment,
> though I may be able to come back to it later. However, we now have a
> set of steps to reproduce (I think) and a full description of the
> race. Perhaps someone else who knows the code better would be able to
> have a whack at it?
>
> In any case, it seems reasonable to still merge my CL, since it makes
> the race much less likely (and impossible in the case of non-removable
> cards) removes some pointless code. Would you be interested in acking
> it?

Thanks you for detail.
Certainly, host own triggering for 'dw_mci_work_routine_card' makes problem during probing.
While progressing the request from 'mmc_rescan', if 'dw_mci_work_routine_card' routine is activated,
it will cancel the current request.
The problem case is that 'mmc_rescan' is prior to 'dw_mci_work_routine_card' from host own.

Acked-by: Seungwon Jeon <[email protected]>

Thanks,
Seungwon Jeon
>
> -Doug
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html