In the 3.7-rc3 kernel, there is an infinite loop in the
mmc_blk_err_check() function in drivers/mmc/card/block.c that can be
caused bad hardware. This loop has moved around a little, but appears to
have been around in the kernel since v2.6.12. The code will loop
forever on write if the card isn't ready for data or if it's in program
mode. I did some searching and saw that it was reported to
[email protected]
http://permalink.gmane.org/gmane.linux.kernel.mmc/2021 back in May but
didn't see a response. Should there be a maximum retry count or a
timeout to prevent an infinite loop?
1030 /*
1031 * Everything else is either success, or a data error of some
1032 * kind. If it was a write, we may have transitioned to
1033 * program mode, which we have to wait for it to complete.
1034 */
1035 if (!mmc_host_is_spi(card->host) && rq_data_dir(req) != READ) {
1036 u32 status;
1037 do {
1038 int err = get_card_status(card, &status, 5);
1039 if (err) {
1040
pr_err("%s: error %d requesting status\n",
1041 req->rq_disk->disk_name, err);
1042 return MMC_BLK_CMD_ERR;
1043 }
1044 /*
1045 * Some cards mishandle the status bits,
1046 * so make sure to check both the busy
1047 * indication and the card state.
1048 */
1049 } while (!(status & R1_READY_FOR_DATA) ||
1050 (R1_CURRENT_STATE(status) == R1_STATE_PRG));
1051 }
Hi,
On Wed, Oct 31 2012, Trey Ramsay wrote:
> In the 3.7-rc3 kernel, there is an infinite loop in the
> mmc_blk_err_check() function in drivers/mmc/card/block.c that can be
> caused bad hardware. This loop has moved around a little, but appears
> to have been around in the kernel since v2.6.12. The code will loop
> forever on write if the card isn't ready for data or if it's in
> program mode. I did some searching and saw that it was reported to
> [email protected]
> http://permalink.gmane.org/gmane.linux.kernel.mmc/2021 back in May but
> didn't see a response. Should there be a maximum retry count or a
> timeout to prevent an infinite loop?
Yes, a (long) timeout is a good idea, and there are three different
places where we use this type of loop waiting for R1_STATE_PRG to drop.
I'll add this to my TODO list, but feel free to fix it if you want to.
Thanks,
- Chris.
--
Chris Ball <[email protected]> <http://printf.net/>
One Laptop Per Child
Hi Trey,
On Tue, Nov 13 2012, Trey Ramsay wrote:
> Thanks Chris. Any idea how long the timeout should be? It could
> cause problems if we timeout to early. This what I have so far with a
> 10 minute timeout. The code is based off of v3.7-rc3
Ten minutes sounds excessive, which is actually fine for our purpose; I
don't think we need to make the value tunable. Thanks! Please can you:
* resend in plain text instead of HTML, with no line-wrapping corrupting
the patch
* include a commit message and "Signed-off-by" (see SubmittingPatches
in the Documentation/ directory)
* change the pr_err()s to include the MMC hostname, and give an English
description of the error instead of the line number -- most users
don't have their kernel source available to find line numbers in.
Something like "Card stuck in programming state!" sounds good.
After that, I'll merge the patch for testing.
Thanks,
- Chris.
--
Chris Ball <[email protected]> <http://printf.net/>
One Laptop Per Child
There are infinite loops in the mmc code that can be caused by bad hardware.
The code will loop forever if the device never comes back from program mode,
R1_STATE_PRG, and it is not ready for data, R1_READY_FOR_DATA.
A long timeout will be added to prevent the code from looping forever.
The timeout will occur if the device never comes back from program
state or the device never becomes ready for data.
Signed-off-by: Trey Ramsay <[email protected]>
---
drivers/mmc/card/block.c | 15 +++++++++++++++
drivers/mmc/core/core.c | 18 +++++++++++++++++-
drivers/mmc/core/mmc_ops.c | 11 +++++++++++
3 files changed, 43 insertions(+), 1 deletions(-)
diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index 172a768..8e60508 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -57,6 +57,7 @@ MODULE_ALIAS("mmc:block");
#define INAND_CMD38_ARG_SECERASE 0x80
#define INAND_CMD38_ARG_SECTRIM1 0x81
#define INAND_CMD38_ARG_SECTRIM2 0x88
+#define MMC_BLK_TIMEOUT_MS (10 * 60 * 1000) /* 10 minute timeout */
static DEFINE_MUTEX(block_mutex);
@@ -1034,6 +1035,9 @@ static int mmc_blk_err_check(struct mmc_card *card,
*/
if (!mmc_host_is_spi(card->host) && rq_data_dir(req) != READ) {
u32 status;
+ unsigned long timeout;
+
+ timeout = jiffies + msecs_to_jiffies(MMC_BLK_TIMEOUT_MS);
do {
int err = get_card_status(card, &status, 5);
if (err) {
@@ -1041,6 +1045,17 @@ static int mmc_blk_err_check(struct mmc_card *card,
req->rq_disk->disk_name, err);
return MMC_BLK_CMD_ERR;
}
+
+ /* Timeout if the device never becomes ready for data
+ * and never leaves the program state.
+ */
+ if (time_after(jiffies, timeout)) {
+ pr_err("%s: Card stuck in programming state!"\
+ " %s %s\n", mmc_hostname(card->host),
+ req->rq_disk->disk_name, __func__);
+
+ return MMC_BLK_CMD_ERR;
+ }
/*
* Some cards mishandle the status bits,
* so make sure to check both the busy
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 06c42cf..bccfd18 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -42,6 +42,9 @@
#include "sd_ops.h"
#include "sdio_ops.h"
+/* If the device is not responding */
+#define MMC_CORE_TIMEOUT_MS (10 * 60 * 1000) /* 10 minute timeout */
+
/*
* Background operations can take a long time, depending on the housekeeping
* operations the card has to perform.
@@ -1631,6 +1634,7 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from,
{
struct mmc_command cmd = {0};
unsigned int qty = 0;
+ unsigned long timeout;
int err;
/*
@@ -1708,6 +1712,7 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from,
if (mmc_host_is_spi(card->host))
goto out;
+ timeout = jiffies + msecs_to_jiffies(MMC_CORE_TIMEOUT_MS);
do {
memset(&cmd, 0, sizeof(struct mmc_command));
cmd.opcode = MMC_SEND_STATUS;
@@ -1721,8 +1726,19 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from,
err = -EIO;
goto out;
}
+
+ /* Timeout if the device never becomes ready for data and
+ * never leaves the program state.
+ */
+ if (time_after(jiffies, timeout)) {
+ pr_err("%s: Card stuck in programming state! %s\n",
+ mmc_hostname(card->host), __func__);
+ err = -EIO;
+ goto out;
+ }
+
} while (!(cmd.resp[0] & R1_READY_FOR_DATA) ||
- R1_CURRENT_STATE(cmd.resp[0]) == R1_STATE_PRG);
+ (R1_CURRENT_STATE(cmd.resp[0]) == R1_STATE_PRG));
out:
return err;
}
diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index a0e1720..6d8f701 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -21,6 +21,8 @@
#include "core.h"
#include "mmc_ops.h"
+#define MMC_OPS_TIMEOUT_MS (10 * 60 * 1000) /* 10 minute timeout */
+
static int _mmc_select_card(struct mmc_host *host, struct mmc_card *card)
{
int err;
@@ -409,6 +411,7 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
{
int err;
struct mmc_command cmd = {0};
+ unsigned long timeout;
u32 status;
BUG_ON(!card);
@@ -437,6 +440,7 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
return 0;
/* Must check status to be sure of no errors */
+ timeout = jiffies + msecs_to_jiffies(MMC_OPS_TIMEOUT_MS);
do {
err = mmc_send_status(card, &status);
if (err)
@@ -445,6 +449,13 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
break;
if (mmc_host_is_spi(card->host))
break;
+
+ /* Timeout if the device never leaves the program state. */
+ if (time_after(jiffies, timeout)) {
+ pr_err("%s: Card stuck in programming state! %s\n",
+ mmc_hostname(card->host), __func__);
+ return -ETIMEDOUT;
+ }
} while (R1_CURRENT_STATE(status) == R1_STATE_PRG);
if (mmc_host_is_spi(card->host)) {
--
1.7.1
Hi Trey,
On Fri, Nov 16 2012, Trey Ramsay wrote:
> There are infinite loops in the mmc code that can be caused by bad hardware.
> The code will loop forever if the device never comes back from program mode,
> R1_STATE_PRG, and it is not ready for data, R1_READY_FOR_DATA.
>
> A long timeout will be added to prevent the code from looping forever.
> The timeout will occur if the device never comes back from program
> state or the device never becomes ready for data.
>
> Signed-off-by: Trey Ramsay <[email protected]>
Thanks, looks good!
Have you thought about what's going to happen after this path is hit?
Are we just going to start a new request that begins a new ten-minute
hang, or do we notice the bad card state somewhere and refuse to start
new I/O?
- Chris.
--
Chris Ball <[email protected]> <http://printf.net/>
One Laptop Per Child
On 11/16/2012 09:37 AM, Chris Ball wrote:
> Hi Trey,
>
> On Fri, Nov 16 2012, Trey Ramsay wrote:
>> There are infinite loops in the mmc code that can be caused by bad hardware.
>> The code will loop forever if the device never comes back from program mode,
>> R1_STATE_PRG, and it is not ready for data, R1_READY_FOR_DATA.
>>
>> A long timeout will be added to prevent the code from looping forever.
>> The timeout will occur if the device never comes back from program
>> state or the device never becomes ready for data.
>>
>> Signed-off-by: Trey Ramsay <[email protected]>
>
> Thanks, looks good!
>
> Have you thought about what's going to happen after this path is hit?
> Are we just going to start a new request that begins a new ten-minute
> hang, or do we notice the bad card state somewhere and refuse to start
> new I/O?
>
> - Chris.
>
You're welcome, and thanks for the help!
Good question. In regards to the original problem were it was hung in
mmc_blk_err_check, the new code path will timeout after 10 minutes, log
an error, issue a hardware reset and abort the request. Is the hardware
reset enough or will that even work when the device isn't coming out of
program state? Should we try to refuse all new I/O?
Below are just some notes I took about the code path for
mmc_blk_err_check and mmc_do_erase.
=============================
mmc_blk_err_check
int err = get_card_status(card, &status, 5);
if (err) {
pr_err("%s: error %d requesting status\n",
req->rq_disk->disk_name, err);
return MMC_BLK_CMD_ERR;
}
+
+ /* Timeout if the device never becomes ready for
data
+ * and never leaves the program state.
+ */
+ if (time_after(jiffies, timeout)) {
+ pr_err("%s: Card stuck in programming
state!"\
+ " %s %s\n",
mmc_hostname(card->host),
+ req->rq_disk->disk_name, __func__);
+
+ return MMC_BLK_CMD_ERR;
+ }
Stack
-----------------
mmc_blk_err_check
mmc_start_req
mmc_blk_issue_rw_rq
mmc_blk_issue_rq
mmc_queue_thread
We return MMC_BLK_CMD_ERR the same way as we return MMC_BLK_CMD_ERR if
the get_card_status failed. The code which returns to mmc_start_req
which sets the error status and returns to mmc_blk_issue_rw_rq with a
status of MMC_BLK_CMD_ERR where it calls mmc_blk_reset which calls
mmc_hw_reset. The code then return so mmc_queue_thread which does not
check the return code.
=============================
mmc_do_erase
+
+ /* Timeout if the device never becomes ready for data and
+ * never leaves the program state.
+ */
+ if (time_after(jiffies, timeout)) {
+ pr_err("%s: Card stuck in programming state! %s\n",
+ mmc_hostname(card->host), __func__);
+ err = -EIO;
+ goto out;
+ }
+
err = mmc_erase(card, from, nr, arg);
out:
if (err == -EIO && !mmc_blk_reset(md, card->host, type))
goto retry;
The mmc_do_erase function is called by mmc_erase which is called from 3
locations in the block.c code.
1) mmc_blk_issue_discard_rq--->mmc_erase--->mmc_do_erase line 848
2) mmc_blk_issue_secdiscard_rq--->mmc_erase--->mmc_do_erase line 884
3) mmc_blk_issue_secdiscard_rq--->mmc_erase--->mmc_do_erase line 904
This applies to 1, 2 and 3 above.
Since we return -EIO, mmc_blk_issue_discard_rq or
mmc_blk_issue_secdiscard will do a mmc_blk_reset which will call
mmc_hw_reset. If hardware reset is successfull, it will retry the
command. If the hardware reset is unsuccessfull, it will call
blk_end_request and will return 0 if there is
an err and will return to the mmc_queue_thread function which does
not check the return code.
Hi Trey, thanks for the analysis,
On Fri, Nov 16 2012, Trey Ramsay wrote:
> Good question. In regards to the original problem were it was hung in
> mmc_blk_err_check, the new code path will timeout after 10 minutes, log
> an error, issue a hardware reset and abort the request. Is the hardware
> reset enough or will that even work when the device isn't coming out of
> program state? Should we try to refuse all new I/O?
mmc_hw_reset() only works for eMMC devices with a hooked up reset GPIO
-- not SD cards -- and at the moment there's only one system (Intel
Medfield) that supplies a GPIO, so that's not a general solution.
Maybe we should just merge your patch for now; we'll definitely get at
least a pr_err() explaining what's going on, which is an improvement.
Next time someone hits this (if anyone has an SD card that exhibits
this problem, it'd be very valuable for testing) we can look at going
farther, such as immediately setting host->flags |= SDHCI_DEVICE_DEAD.
What do you think?
- Chris.
--
Chris Ball <[email protected]> <http://printf.net/>
One Laptop Per Child
On 11/16/2012 09:37 AM, Chris Ball wrote:
> Hi Trey,
>
> On Fri, Nov 16 2012, Trey Ramsay wrote:
>> There are infinite loops in the mmc code that can be caused by bad hardware.
>> The code will loop forever if the device never comes back from program mode,
>> R1_STATE_PRG, and it is not ready for data, R1_READY_FOR_DATA.
>>
>> A long timeout will be added to prevent the code from looping forever.
>> The timeout will occur if the device never comes back from program
>> state or the device never becomes ready for data.
>>
>> Signed-off-by: Trey Ramsay <[email protected]>
>
> Thanks, looks good!
>
> Have you thought about what's going to happen after this path is hit?
> Are we just going to start a new request that begins a new ten-minute
> hang, or do we notice the bad card state somewhere and refuse to start
> new I/O?
>
> - Chris.
>
You're welcome, and thanks for the help!
Good question. In regards to the original problem were it was hung in
mmc_blk_err_check, the new code path will timeout after 10 minutes, log
an error, issue a hardware reset and abort the request. Is the hardware
reset enough or will that even work when the device isn't coming out of
program state? Should we try to refuse all new I/O?
Below are just some notes I took about the code path for
mmc_blk_err_check and mmc_do_erase.
=============================
mmc_blk_err_check
int err = get_card_status(card, &status, 5);
if (err) {
pr_err("%s: error %d requesting status\n",
req->rq_disk->disk_name, err);
return MMC_BLK_CMD_ERR;
}
+
+ /* Timeout if the device never becomes ready for
data
+ * and never leaves the program state.
+ */
+ if (time_after(jiffies, timeout)) {
+ pr_err("%s: Card stuck in programming
state!"\
+ " %s %s\n",
mmc_hostname(card->host),
+ req->rq_disk->disk_name, __func__);
+
+ return MMC_BLK_CMD_ERR;
+ }
Stack
-----------------
mmc_blk_err_check
mmc_start_req
mmc_blk_issue_rw_rq
mmc_blk_issue_rq
mmc_queue_thread
We return MMC_BLK_CMD_ERR the same way as we return MMC_BLK_CMD_ERR if
the get_card_status failed. The code which returns to mmc_start_req
which sets the error status and returns to mmc_blk_issue_rw_rq with a
status of MMC_BLK_CMD_ERR where it calls mmc_blk_reset which calls
mmc_hw_reset. The code then return so mmc_queue_thread which does not
check the return code.
=============================
mmc_do_erase
+
+ /* Timeout if the device never becomes ready for data and
+ * never leaves the program state.
+ */
+ if (time_after(jiffies, timeout)) {
+ pr_err("%s: Card stuck in programming state! %s\n",
+ mmc_hostname(card->host), __func__);
+ err = -EIO;
+ goto out;
+ }
+
err = mmc_erase(card, from, nr, arg);
out:
if (err == -EIO && !mmc_blk_reset(md, card->host, type))
goto retry;
The mmc_do_erase function is called by mmc_erase which is called from 3
locations in the block.c code.
1) mmc_blk_issue_discard_rq--->mmc_erase--->mmc_do_erase line 848
2) mmc_blk_issue_secdiscard_rq--->mmc_erase--->mmc_do_erase line 884
3) mmc_blk_issue_secdiscard_rq--->mmc_erase--->mmc_do_erase line 904
This applies to 1, 2 and 3 above.
Since we return -EIO, mmc_blk_issue_discard_rq or
mmc_blk_issue_secdiscard will do a mmc_blk_reset which will call
mmc_hw_reset. If hardware reset is successfull, it will retry the
command. If the hardware reset is unsuccessfull, it will call
blk_end_request and will return 0 if there is
an err and will return to the mmc_queue_thread function which does
not check the return code.
On 11/16/2012 06:37 PM, Chris Ball wrote:
> Hi Trey, thanks for the analysis,
>
> On Fri, Nov 16 2012, Trey Ramsay wrote:
>> Good question. In regards to the original problem were it was hung in
>> mmc_blk_err_check, the new code path will timeout after 10 minutes, log
>> an error, issue a hardware reset and abort the request. Is the hardware
>> reset enough or will that even work when the device isn't coming out of
>> program state? Should we try to refuse all new I/O?
>
> mmc_hw_reset() only works for eMMC devices with a hooked up reset GPIO
> -- not SD cards -- and at the moment there's only one system (Intel
> Medfield) that supplies a GPIO, so that's not a general solution.
>
> Maybe we should just merge your patch for now; we'll definitely get at
> least a pr_err() explaining what's going on, which is an improvement.
> Next time someone hits this (if anyone has an SD card that exhibits
> this problem, it'd be very valuable for testing) we can look at going
> farther, such as immediately setting host->flags |= SDHCI_DEVICE_DEAD.
> What do you think?
>
> - Chris.
>
Hi Chris,
Sounds good. Thanks for the explanation. Setting host->flags |=
SDHCI_DEVICE_DEAD is a great idea. I'll check with my team to see if we
have any hardware that exhibits this problem. If we do, I can do some
testing on the code you suggested.
Thanks,
Trey
Hi,
On Fri, Nov 16 2012, Trey Ramsay wrote:
> There are infinite loops in the mmc code that can be caused by bad hardware.
> The code will loop forever if the device never comes back from program mode,
> R1_STATE_PRG, and it is not ready for data, R1_READY_FOR_DATA.
>
> A long timeout will be added to prevent the code from looping forever.
> The timeout will occur if the device never comes back from program
> state or the device never becomes ready for data.
>
> Signed-off-by: Trey Ramsay <[email protected]>
Thanks for doing this; pushed to mmc-next for 3.8.
- Chris.
--
Chris Ball <[email protected]> <http://printf.net/>
One Laptop Per Child