On some of our boards containing Micron eMMC chips compatible with
the eMMC 5.0 specification we starting seeing boot failures due to
timeouts:
mmc1: error -110 whilst initialising MMC card
It turns out that switching the cache on after a power loss event can
take quite long. In some simple testing thusfar we've seen values up to
700ms, which is far longer then the GENERIC_CMD6_TIME of the chip
(250ms).
Looking at both the eMMC 4.51 and 5.0 specification there doesn't seem
to be a defined upper bound for the CACHE_CTRL ON command. For both
CACHE_CTRL OFF and FLUSH_CACHE it is documented that they can take
essentially unbounded time, but CACHE_CTRL ON i get the impression that
it's assumed to be "fast". Unfortunately this is not true in reality.
To resolve this, simply drop the timeout from CACHE_CTRL ON and assume
it might take an unbounded time similar to the FLUSH_CACHE command.
Signed-off-by: Sjoerd Simons <[email protected]>
---
drivers/mmc/core/mmc.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index bc1bd2c25613..ac70b508a939 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -1794,8 +1794,7 @@ static int mmc_init_card(struct mmc_host *host, u32 ocr,
if (!mmc_card_broken_hpi(card) &&
card->ext_csd.cache_size > 0) {
err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
- EXT_CSD_CACHE_CTRL, 1,
- card->ext_csd.generic_cmd6_time);
+ EXT_CSD_CACHE_CTRL, 1, 0);
if (err && err != -EBADMSG)
goto free_card;
--
2.19.1
Hi Sjoerd,
On Tuesday 06 November 2018 07:00 PM, Sjoerd Simons wrote:
> On some of our boards containing Micron eMMC chips compatible with
> the eMMC 5.0 specification we starting seeing boot failures due to
> timeouts:
> mmc1: error -110 whilst initialising MMC card
>
> It turns out that switching the cache on after a power loss event can
> take quite long. In some simple testing thusfar we've seen values up to
> 700ms, which is far longer then the GENERIC_CMD6_TIME of the chip
> (250ms).
>
> Looking at both the eMMC 4.51 and 5.0 specification there doesn't seem
> to be a defined upper bound for the CACHE_CTRL ON command. For both
> CACHE_CTRL OFF and FLUSH_CACHE it is documented that they can take
> essentially unbounded time, but CACHE_CTRL ON i get the impression that
> it's assumed to be "fast". Unfortunately this is not true in reality.
>
> To resolve this, simply drop the timeout from CACHE_CTRL ON and assume
> it might take an unbounded time similar to the FLUSH_CACHE command.
>
> Signed-off-by: Sjoerd Simons <[email protected]>
>
I have the exact same problem with micron cards. Specifically with CID:
R1J56L
The correct way to solve this would be to add a quirk for the specific
card. Does this patch solve your issue?
---8<---
From 9d470be59acd82859beb4c965a620c9a32858bb4 Mon Sep 17 00:00:00 2001
From: Faiz Abbas <[email protected]>
Date: Wed, 20 Jun 2018 20:36:38 +0530
Subject: [PATCH] mmc: core: Add QUIRK for eMMC CMD6 timeout for some Micron
cards
It seems that on some Micron eMMC cards present on TI's AM654 EVM
the CACHE_CTRL_ENABLE function in CMD6 takes longer (~750 ms) than
the specified timeout in the GENERIC_CMD6_TIMEOUT byte of the
Extended CSD (~250 ms).
Therefore, add a quirk to detect this card and use a value of 1 sec for
timeout.
Reported-by: Andreas Dannenberg <[email protected]>
Signed-off-by: Faiz Abbas <[email protected]>
Signed-off-by: Sekhar Nori <[email protected]>
---
drivers/mmc/core/card.h | 7 +++++++
drivers/mmc/core/mmc.c | 14 +++++++++++++-
drivers/mmc/core/quirks.h | 8 ++++++++
include/linux/mmc/card.h | 1 +
4 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/drivers/mmc/core/card.h b/drivers/mmc/core/card.h
index 9c821eedd156..6b5e3c6253f5 100644
--- a/drivers/mmc/core/card.h
+++ b/drivers/mmc/core/card.h
@@ -84,6 +84,8 @@ struct mmc_fixup {
#define CID_MANFID_HYNIX 0x90
#define CID_MANFID_NUMONYX 0xFE
+#define CID_NAME_R1J56L "R1J56L"
+
#define END_FIXUP { NULL }
#define _FIXUP_EXT(_name, _manfid, _oemid, _rev_start, _rev_end, \
@@ -221,4 +223,9 @@ static inline int mmc_card_broken_hpi(const struct mmc_card *c)
return c->quirks & MMC_QUIRK_BROKEN_HPI;
}
+static inline int mmc_card_long_cache_ctrl(const struct mmc_card *c)
+{
+ return c->quirks & MMC_QUIRK_LONG_CACHE_ENABLE_TIME;
+}
+
#endif
diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index bad5c1bf4ed9..667caeb640a3 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -1522,6 +1522,7 @@ static int mmc_init_card(struct mmc_host *host, u32 ocr,
struct mmc_card *oldcard)
{
struct mmc_card *card;
+ unsigned int cache_ctrl_timeout;
int err;
u32 cid[4];
u32 rocr;
@@ -1766,9 +1767,20 @@ static int mmc_init_card(struct mmc_host *host, u32 ocr,
*/
if (!mmc_card_broken_hpi(card) &&
card->ext_csd.cache_size > 0) {
+
+ /*
+ * Some cards require a longer timeout than given in CSD.
+ * Use a one second timeout here which can be increased
+ * further if more cards needing larger timeouts are found
+ */
+ if (mmc_card_long_cache_ctrl(card))
+ cache_ctrl_timeout = 1000;
+ else
+ cache_ctrl_timeout = card->ext_csd.generic_cmd6_time;
+
err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
EXT_CSD_CACHE_CTRL, 1,
- card->ext_csd.generic_cmd6_time);
+ cache_ctrl_timeout);
if (err && err != -EBADMSG)
goto free_card;
diff --git a/drivers/mmc/core/quirks.h b/drivers/mmc/core/quirks.h
index 5153577754f0..934ca72ef2b1 100644
--- a/drivers/mmc/core/quirks.h
+++ b/drivers/mmc/core/quirks.h
@@ -116,6 +116,14 @@ static const struct mmc_fixup mmc_ext_csd_fixups[] = {
MMC_FIXUP_EXT_CSD_REV(CID_NAME_ANY, CID_MANFID_NUMONYX,
0x014e, add_quirk, MMC_QUIRK_BROKEN_HPI, 6),
+ /*
+ * Certain Micron eMMC cards need a longer CMD6:CACHE_CTRL timeout
+ * than indicated in CSD
+ */
+ MMC_FIXUP_EXT_CSD_REV(CID_NAME_R1J56L, CID_MANFID_MICRON,
+ 0x14e, add_quirk,
+ MMC_QUIRK_LONG_CACHE_ENABLE_TIME, 7),
+
END_FIXUP
};
diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
index 279b39008a33..5a8ce0bb222e 100644
--- a/include/linux/mmc/card.h
+++ b/include/linux/mmc/card.h
@@ -268,6 +268,7 @@ struct mmc_card {
#define MMC_QUIRK_BROKEN_IRQ_POLLING (1<<11) /* Polling SDIO_CCCR_INTx could create a fake interrupt */
#define MMC_QUIRK_TRIM_BROKEN (1<<12) /* Skip trim */
#define MMC_QUIRK_BROKEN_HPI (1<<13) /* Disable broken HPI support */
+#define MMC_QUIRK_LONG_CACHE_ENABLE_TIME (1 << 14) /* CACHE_CTRL enable time > CSD says */
bool reenable_cmdq; /* Re-enable Command Queue */
--
2.18.0
Thanks,
Faiz
On Tue, 2018-11-06 at 19:34 +0530, Faiz Abbas wrote:
> Hi Sjoerd,
>
> On Tuesday 06 November 2018 07:00 PM, Sjoerd Simons wrote:
> > On some of our boards containing Micron eMMC chips compatible with
> > the eMMC 5.0 specification we starting seeing boot failures due to
> > timeouts:
> > mmc1: error -110 whilst initialising MMC card
> >
> > It turns out that switching the cache on after a power loss event
> > can
> > take quite long. In some simple testing thusfar we've seen values
> > up to
> > 700ms, which is far longer then the GENERIC_CMD6_TIME of the chip
> > (250ms).
> >
> > Looking at both the eMMC 4.51 and 5.0 specification there doesn't
> > seem
> > to be a defined upper bound for the CACHE_CTRL ON command. For both
> > CACHE_CTRL OFF and FLUSH_CACHE it is documented that they can take
> > essentially unbounded time, but CACHE_CTRL ON i get the impression
> > that
> > it's assumed to be "fast". Unfortunately this is not true in
> > reality.
> >
> > To resolve this, simply drop the timeout from CACHE_CTRL ON and
> > assume
> > it might take an unbounded time similar to the FLUSH_CACHE command.
> >
> > Signed-off-by: Sjoerd Simons <[email protected]>
> >
>
> I have the exact same problem with micron cards. Specifically with
> CID:
> R1J56L
>
> The correct way to solve this would be to add a quirk for the
> specific
> card. Does this patch solve your issue?
It would yes (unfortunately my google-foo failed to find your patch);
That also happens to be one of the cards we deploy; However i did
wonder about adding a quirk but decided against it as it was not clear
to me from the specification that CACHE ON really is meant to complete
within GENERIC_CMD6_TIMEOUT. That and i fret about ending up in hit-a-
mole games as the failure is really quite tedious (boot failure).
Our approved parts lists has more Micron eMMC 5.0 parts, I'm trying to
find out whether we actually had batches build with those and then we
check if they have the same behaviour or not.
Sjoerd
> ---8<---
> From 9d470be59acd82859beb4c965a620c9a32858bb4 Mon Sep 17 00:00:00
> 2001
> From: Faiz Abbas <[email protected]>
> Date: Wed, 20 Jun 2018 20:36:38 +0530
> Subject: [PATCH] mmc: core: Add QUIRK for eMMC CMD6 timeout for some
> Micron
> cards
>
> It seems that on some Micron eMMC cards present on TI's AM654 EVM
> the CACHE_CTRL_ENABLE function in CMD6 takes longer (~750 ms) than
> the specified timeout in the GENERIC_CMD6_TIMEOUT byte of the
> Extended CSD (~250 ms).
>
> Therefore, add a quirk to detect this card and use a value of 1 sec
> for
> timeout.
>
> Reported-by: Andreas Dannenberg <[email protected]>
> Signed-off-by: Faiz Abbas <[email protected]>
> Signed-off-by: Sekhar Nori <[email protected]>
> ---
> drivers/mmc/core/card.h | 7 +++++++
> drivers/mmc/core/mmc.c | 14 +++++++++++++-
> drivers/mmc/core/quirks.h | 8 ++++++++
> include/linux/mmc/card.h | 1 +
> 4 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/mmc/core/card.h b/drivers/mmc/core/card.h
> index 9c821eedd156..6b5e3c6253f5 100644
> --- a/drivers/mmc/core/card.h
> +++ b/drivers/mmc/core/card.h
> @@ -84,6 +84,8 @@ struct mmc_fixup {
> #define CID_MANFID_HYNIX 0x90
> #define CID_MANFID_NUMONYX 0xFE
>
> +#define CID_NAME_R1J56L "R1J56L"
> +
> #define END_FIXUP { NULL }
>
> #define _FIXUP_EXT(_name, _manfid, _oemid, _rev_start, _rev_end,
> \
> @@ -221,4 +223,9 @@ static inline int mmc_card_broken_hpi(const
> struct mmc_card *c)
> return c->quirks & MMC_QUIRK_BROKEN_HPI;
> }
>
> +static inline int mmc_card_long_cache_ctrl(const struct mmc_card *c)
> +{
> + return c->quirks & MMC_QUIRK_LONG_CACHE_ENABLE_TIME;
> +}
> +
> #endif
> diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
> index bad5c1bf4ed9..667caeb640a3 100644
> --- a/drivers/mmc/core/mmc.c
> +++ b/drivers/mmc/core/mmc.c
> @@ -1522,6 +1522,7 @@ static int mmc_init_card(struct mmc_host *host,
> u32 ocr,
> struct mmc_card *oldcard)
> {
> struct mmc_card *card;
> + unsigned int cache_ctrl_timeout;
> int err;
> u32 cid[4];
> u32 rocr;
> @@ -1766,9 +1767,20 @@ static int mmc_init_card(struct mmc_host
> *host, u32 ocr,
> */
> if (!mmc_card_broken_hpi(card) &&
> card->ext_csd.cache_size > 0) {
> +
> + /*
> + * Some cards require a longer timeout than given in
> CSD.
> + * Use a one second timeout here which can be increased
> + * further if more cards needing larger timeouts are
> found
> + */
> + if (mmc_card_long_cache_ctrl(card))
> + cache_ctrl_timeout = 1000;
> + else
> + cache_ctrl_timeout = card-
> >ext_csd.generic_cmd6_time;
> +
> err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
> EXT_CSD_CACHE_CTRL, 1,
> - card->ext_csd.generic_cmd6_time);
> + cache_ctrl_timeout);
> if (err && err != -EBADMSG)
> goto free_card;
>
> diff --git a/drivers/mmc/core/quirks.h b/drivers/mmc/core/quirks.h
> index 5153577754f0..934ca72ef2b1 100644
> --- a/drivers/mmc/core/quirks.h
> +++ b/drivers/mmc/core/quirks.h
> @@ -116,6 +116,14 @@ static const struct mmc_fixup
> mmc_ext_csd_fixups[] = {
> MMC_FIXUP_EXT_CSD_REV(CID_NAME_ANY, CID_MANFID_NUMONYX,
> 0x014e, add_quirk, MMC_QUIRK_BROKEN_HPI,
> 6),
>
> + /*
> + * Certain Micron eMMC cards need a longer CMD6:CACHE_CTRL
> timeout
> + * than indicated in CSD
> + */
> + MMC_FIXUP_EXT_CSD_REV(CID_NAME_R1J56L, CID_MANFID_MICRON,
> + 0x14e, add_quirk,
> + MMC_QUIRK_LONG_CACHE_ENABLE_TIME, 7),
> +
> END_FIXUP
> };
>
> diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
> index 279b39008a33..5a8ce0bb222e 100644
> --- a/include/linux/mmc/card.h
> +++ b/include/linux/mmc/card.h
> @@ -268,6 +268,7 @@ struct mmc_card {
> #define MMC_QUIRK_BROKEN_IRQ_POLLING (1<<11) /* Polling
> SDIO_CCCR_INTx could create a fake interrupt */
> #define MMC_QUIRK_TRIM_BROKEN (1<<12) /* Skip trim
> */
> #define MMC_QUIRK_BROKEN_HPI (1<<13) /* Disable broken HPI
> support */
> +#define MMC_QUIRK_LONG_CACHE_ENABLE_TIME (1 << 14) /* CACHE_CTRL
> enable time > CSD says */
>
> bool reenable_cmdq; /* Re-enable Command
> Queue */
>
--
Sjoerd Simons
Collabora Ltd.
> That also happens to be one of the cards we deploy; However i did
> wonder about adding a quirk but decided against it as it was not clear
> to me from the specification that CACHE ON really is meant to complete
> within GENERIC_CMD6_TIMEOUT. That and i fret about ending up in hit-a-
> mole games as the failure is really quite tedious (boot failure).
I agree that we should use the more defensive variant as a default. I
mean there should be no performance regression since most cards will
respond just faster, or? The only downside I could see is that we might
miss a real timeout with no bounds set and might get stuck? Maybe it is
worth contacting eMMC spec people to at least know what is the expected
behaviour?
On 7 November 2018 at 09:47, Wolfram Sang <[email protected]> wrote:
>
>> That also happens to be one of the cards we deploy; However i did
>> wonder about adding a quirk but decided against it as it was not clear
>> to me from the specification that CACHE ON really is meant to complete
>> within GENERIC_CMD6_TIMEOUT. That and i fret about ending up in hit-a-
>> mole games as the failure is really quite tedious (boot failure).
>
> I agree that we should use the more defensive variant as a default. I
> mean there should be no performance regression since most cards will
> respond just faster, or? The only downside I could see is that we might
> miss a real timeout with no bounds set and might get stuck?
Well, you have a point, but still it's kind of nice to know which
cards are behaving well and which ones that doesn't. Hence I think I
prefer to stick using a quirk, unless you have a strong opinion.
Note that, in this case we can use CMD13 to poll for busy, which then
means it also works for those hosts that doesn't support HW busy
detection, without getting additional delays. If this hasn't been the
case, we must be using a quirk, but now we are more free to choose.
> Maybe it is
> worth contacting eMMC spec people to at least know what is the expected
> behaviour?
According to the spec, the GENERIC_CMD6_TIMEOUT should be sufficient.
So this card is not conforming to the spec, I think it's as simple as
that.
Kind regards
Uffe
Hi Uffe,
On 20/11/18 2:54 PM, Ulf Hansson wrote:
> On 7 November 2018 at 09:47, Wolfram Sang <[email protected]> wrote:
>>
>>> That also happens to be one of the cards we deploy; However i did
>>> wonder about adding a quirk but decided against it as it was not clear
>>> to me from the specification that CACHE ON really is meant to complete
>>> within GENERIC_CMD6_TIMEOUT. That and i fret about ending up in hit-a-
>>> mole games as the failure is really quite tedious (boot failure).
>>
>> I agree that we should use the more defensive variant as a default. I
>> mean there should be no performance regression since most cards will
>> respond just faster, or? The only downside I could see is that we might
>> miss a real timeout with no bounds set and might get stuck?
>
> Well, you have a point, but still it's kind of nice to know which
> cards are behaving well and which ones that doesn't. Hence I think I
> prefer to stick using a quirk, unless you have a strong opinion.
>
> Note that, in this case we can use CMD13 to poll for busy, which then
> means it also works for those hosts that doesn't support HW busy
> detection, without getting additional delays. If this hasn't been the
> case, we must be using a quirk, but now we are more free to choose.
>
>> Maybe it is
>> worth contacting eMMC spec people to at least know what is the expected
>> behaviour?
>
> According to the spec, the GENERIC_CMD6_TIMEOUT should be sufficient.
> So this card is not conforming to the spec, I think it's as simple as
> that.
>
Is the QUIRK patch acceptable as it is or do you require some sort of
errata from the card manufacturers?
Thanks,
Faiz
> >> That also happens to be one of the cards we deploy; However i did
> >> wonder about adding a quirk but decided against it as it was not clear
> >> to me from the specification that CACHE ON really is meant to complete
> >> within GENERIC_CMD6_TIMEOUT. That and i fret about ending up in hit-a-
> >> mole games as the failure is really quite tedious (boot failure).
> >
> > I agree that we should use the more defensive variant as a default. I
> > mean there should be no performance regression since most cards will
> > respond just faster, or? The only downside I could see is that we might
> > miss a real timeout with no bounds set and might get stuck?
>
> Well, you have a point, but still it's kind of nice to know which
> cards are behaving well and which ones that doesn't. Hence I think I
> prefer to stick using a quirk, unless you have a strong opinion.
No strong opinion. Especially not if you say it is in the spec (although
"must be sufficient" would be better than "should be" ;)). Also, I
assume this failure is reproducible and should turn up during
development? Compared to "happens once in a while randomly"?
Yet, if we add a quirk for that, then we should probably mention it in
an error message when we hit -ETIMEDOUT for cache on ("does your card
need this quirk?")? It can be pretty time consuming to track this down
otherwise, I'd think.
Hi,
On 20/11/18 3:53 PM, Wolfram Sang wrote:
>
>>>> That also happens to be one of the cards we deploy; However i did
>>>> wonder about adding a quirk but decided against it as it was not clear
>>>> to me from the specification that CACHE ON really is meant to complete
>>>> within GENERIC_CMD6_TIMEOUT. That and i fret about ending up in hit-a-
>>>> mole games as the failure is really quite tedious (boot failure).
>>>
>>> I agree that we should use the more defensive variant as a default. I
>>> mean there should be no performance regression since most cards will
>>> respond just faster, or? The only downside I could see is that we might
>>> miss a real timeout with no bounds set and might get stuck?
>>
>> Well, you have a point, but still it's kind of nice to know which
>> cards are behaving well and which ones that doesn't. Hence I think I
>> prefer to stick using a quirk, unless you have a strong opinion.
>
> No strong opinion. Especially not if you say it is in the spec (although
> "must be sufficient" would be better than "should be" ;)). Also, I
> assume this failure is reproducible and should turn up during
> development? Compared to "happens once in a while randomly"?
At least for me, the failure happens only on some units but is
consistent for a given unit.
>
> Yet, if we add a quirk for that, then we should probably mention it in
> an error message when we hit -ETIMEDOUT for cache on ("does your card
> need this quirk?")? It can be pretty time consuming to track this down
> otherwise, I'd think.
>
The QUIRK needs to be card specific. The software should automatically
detect the card (from the CID) and apply the quirk. Please see patch in
my original reply.
Thanks,
Faiz
> > No strong opinion. Especially not if you say it is in the spec (although
> > "must be sufficient" would be better than "should be" ;)). Also, I
> > assume this failure is reproducible and should turn up during
> > development? Compared to "happens once in a while randomly"?
>
> At least for me, the failure happens only on some units but is
> consistent for a given unit.
I hoped for that, thanks!
> The QUIRK needs to be card specific. The software should automatically
> detect the card (from the CID) and apply the quirk. Please see patch in
> my original reply.
I understand that. What I meant was the case when you discover your card
times out on 'cache on' but it doesn't have a quirk entry yet. So, you
are the first do discover the flag is needed. I would like a pointer
there ("timeout on activating caches. Maybe your card needs $quirk?").
Because for someone just enabling the HW and not deep into MMC stuff,
the path from some -ETIMEOUT to this specific quirk may take some time
otherwise.
On 20 November 2018 at 15:00, Sjoerd Simons
<[email protected]> wrote:
> On Tue, 2018-11-20 at 14:08 +0100, Ulf Hansson wrote:
>> + Hal Emmerich
>>
>> On 20 November 2018 at 12:38, Sjoerd Simons
>> <[email protected]> wrote:
>> > On Tue, 2018-11-20 at 11:23 +0100, Wolfram Sang wrote:
>> > > > > >
>> > So if you know the pattern, or just happen to hit it often in e.g.
>> > automated testing, it does show up during development. Otherwise it
>> > can
>> > appear to "happen once in a while randomly".
>>
>> I don't quite follow. As far as I understand, the extended timeout is
>> needed when turning the cache on.
>>
>> The above seems more related to flushing the cache, no? Flushing have
>> no timeout (also reported to be an issue [1]), which happens either
>> at
>> _mmc_hw_reset() or at _mmc_suspend().
>>
>> What is the relation here?
>
> Yes it's the kinda of behaviour you would expect on a flush indeed! I
> don't know what the card actaully does when turning the cache on,
> whether it's actually flush of something persistent when turning the
> cache on after a hard poweroff or doing some other validation.
>
> All i can share is what our testing seems to indicate, which is that
> there is a wide spread in the time the card needs *and* there seems to
> be strong correlation to the I/O activity before the hard power off and
> the time taken by "cache on".
So the hard power off, means that you are cutting the power to the
platform and not doing a graceful power off? Just so I understand
correctly.
>
>> > Unfortunately for me, it was really a case of getting reports of
>> > some
>> > boards started failing at some point which took a while to track
>> > back.
>> > Especially since it's a battery powered device (thus hard poweroffs
>> > are
>> > rather rare) and we allow the board manufactorer to select from
>> > various
>> > different eMMCs depending on price/available at build time...
>> >
>> > > Yet, if we add a quirk for that, then we should probably mention
>> > > it
>> > > in
>> > > an error message when we hit -ETIMEDOUT for cache on ("does your
>> > > card
>> > > need this quirk?")? It can be pretty time consuming to track this
>> > > down
>> > > otherwise, I'd think.
>> >
>> > Yes please. It would be nice if someone happens to have the right
>> > contacts with Micron to see if it's a known issue for their cards
>> > in
>> > general or just this one.
>> >
>> > Also would be good to have a timeout higher then 1 seconds (or for
>> > these cards not have one?); On our testing thusfar we've seen
>> > timeouts
>> > up to 850ms, but it's impossible to ensure that that's the true
>> > upper
>> > bound.
>>
>> Using no limit of the timeout, would mean we may hang for ~10 minutes
>> (MMC_OPS_TIMEOUT_MS) instead, no thanks.
>
> Probably a silly question, but would this actually cause e.g. boot to
> hang while waiting for the card (assuming rootfs is somewhere else)?
Nope.
[...]
Kind regards
Uffe
On Tue, 2018-11-20 at 11:23 +0100, Wolfram Sang wrote:
> > > > That also happens to be one of the cards we deploy; However i
> > > > did
> > > > wonder about adding a quirk but decided against it as it was
> > > > not clear
> > > > to me from the specification that CACHE ON really is meant to
> > > > complete
> > > > within GENERIC_CMD6_TIMEOUT. That and i fret about ending up in
> > > > hit-a-
> > > > mole games as the failure is really quite tedious (boot
> > > > failure).
> > >
> > > I agree that we should use the more defensive variant as a
> > > default. I
> > > mean there should be no performance regression since most cards
> > > will
> > > respond just faster, or? The only downside I could see is that we
> > > might
> > > miss a real timeout with no bounds set and might get stuck?
> >
> > Well, you have a point, but still it's kind of nice to know which
> > cards are behaving well and which ones that doesn't. Hence I think
> > I
> > prefer to stick using a quirk, unless you have a strong opinion.
Not an incredibly strong opinion either; I just wonder if it's the
right trade-off.
If the quirk/work-around is not there while it should be, the impact is
that you get an unusable card (which for eMMC is likely to mean a
failure to boot the system). Which is somewhat unfortunate.
If the work-around is there while it's not needed then there doesn't
seem to be much of an impact at all; Apart from it not being reported
to the user/developer/kernel community?
In which case it might make more to put in a warning iff the card takes
too long with a list of cards for which this is known?
> No strong opinion. Especially not if you say it is in the spec
> (although
> "must be sufficient" would be better than "should be" ;)). Also, I
> assume this failure is reproducible and should turn up during
> development? Compared to "happens once in a while randomly"?
For the card in question it happens only on hard power off; The time it
takes seems correlated to the state of the cache at hard power off (It
takes substantially longer if there was a lot of I/O activity at
the time of hard power off). With light I/O activity the current
timeout is sometimes enough.
So if you know the pattern, or just happen to hit it often in e.g.
automated testing, it does show up during development. Otherwise it can
appear to "happen once in a while randomly".
Unfortunately for me, it was really a case of getting reports of some
boards started failing at some point which took a while to track back.
Especially since it's a battery powered device (thus hard poweroffs are
rather rare) and we allow the board manufactorer to select from various
different eMMCs depending on price/available at build time...
> Yet, if we add a quirk for that, then we should probably mention it
> in
> an error message when we hit -ETIMEDOUT for cache on ("does your card
> need this quirk?")? It can be pretty time consuming to track this
> down
> otherwise, I'd think.
Yes please. It would be nice if someone happens to have the right
contacts with Micron to see if it's a known issue for their cards in
general or just this one.
Also would be good to have a timeout higher then 1 seconds (or for
these cards not have one?); On our testing thusfar we've seen timeouts
up to 850ms, but it's impossible to ensure that that's the true upper
bound.
--
Sjoerd Simons
Collabora Ltd.
+ Hal Emmerich
On 20 November 2018 at 12:38, Sjoerd Simons
<[email protected]> wrote:
> On Tue, 2018-11-20 at 11:23 +0100, Wolfram Sang wrote:
>> > > > That also happens to be one of the cards we deploy; However i
>> > > > did
>> > > > wonder about adding a quirk but decided against it as it was
>> > > > not clear
>> > > > to me from the specification that CACHE ON really is meant to
>> > > > complete
>> > > > within GENERIC_CMD6_TIMEOUT. That and i fret about ending up in
>> > > > hit-a-
>> > > > mole games as the failure is really quite tedious (boot
>> > > > failure).
>> > >
>> > > I agree that we should use the more defensive variant as a
>> > > default. I
>> > > mean there should be no performance regression since most cards
>> > > will
>> > > respond just faster, or? The only downside I could see is that we
>> > > might
>> > > miss a real timeout with no bounds set and might get stuck?
>> >
>> > Well, you have a point, but still it's kind of nice to know which
>> > cards are behaving well and which ones that doesn't. Hence I think
>> > I
>> > prefer to stick using a quirk, unless you have a strong opinion.
>
> Not an incredibly strong opinion either; I just wonder if it's the
> right trade-off.
>
> If the quirk/work-around is not there while it should be, the impact is
> that you get an unusable card (which for eMMC is likely to mean a
> failure to boot the system). Which is somewhat unfortunate.
>
> If the work-around is there while it's not needed then there doesn't
> seem to be much of an impact at all; Apart from it not being reported
> to the user/developer/kernel community?
>
> In which case it might make more to put in a warning iff the card takes
> too long with a list of cards for which this is known?
>
>> No strong opinion. Especially not if you say it is in the spec
>> (although
>> "must be sufficient" would be better than "should be" ;)). Also, I
>> assume this failure is reproducible and should turn up during
>> development? Compared to "happens once in a while randomly"?
>
> For the card in question it happens only on hard power off; The time it
> takes seems correlated to the state of the cache at hard power off (It
> takes substantially longer if there was a lot of I/O activity at
> the time of hard power off). With light I/O activity the current
> timeout is sometimes enough.
>
> So if you know the pattern, or just happen to hit it often in e.g.
> automated testing, it does show up during development. Otherwise it can
> appear to "happen once in a while randomly".
I don't quite follow. As far as I understand, the extended timeout is
needed when turning the cache on.
The above seems more related to flushing the cache, no? Flushing have
no timeout (also reported to be an issue [1]), which happens either at
_mmc_hw_reset() or at _mmc_suspend().
What is the relation here?
>
> Unfortunately for me, it was really a case of getting reports of some
> boards started failing at some point which took a while to track back.
> Especially since it's a battery powered device (thus hard poweroffs are
> rather rare) and we allow the board manufactorer to select from various
> different eMMCs depending on price/available at build time...
>
>> Yet, if we add a quirk for that, then we should probably mention it
>> in
>> an error message when we hit -ETIMEDOUT for cache on ("does your card
>> need this quirk?")? It can be pretty time consuming to track this
>> down
>> otherwise, I'd think.
>
> Yes please. It would be nice if someone happens to have the right
> contacts with Micron to see if it's a known issue for their cards in
> general or just this one.
>
> Also would be good to have a timeout higher then 1 seconds (or for
> these cards not have one?); On our testing thusfar we've seen timeouts
> up to 850ms, but it's impossible to ensure that that's the true upper
> bound.
Using no limit of the timeout, would mean we may hang for ~10 minutes
(MMC_OPS_TIMEOUT_MS) instead, no thanks.
I am fine with let's say double of 850ms (1700ms), to have some room.
Anyway, the point is, the timeouts in the spec is there for reason.
Unfortunate I think the spec is "lazy" in some other regards and don't
specify timeouts, which complicates things.
Kind regards
Uffe
[1]
https://www.spinics.net/lists/linux-mmc/msg51815.html
On Tue, 2018-11-20 at 14:08 +0100, Ulf Hansson wrote:
> + Hal Emmerich
>
> On 20 November 2018 at 12:38, Sjoerd Simons
> <[email protected]> wrote:
> > On Tue, 2018-11-20 at 11:23 +0100, Wolfram Sang wrote:
> > > > > >
> > So if you know the pattern, or just happen to hit it often in e.g.
> > automated testing, it does show up during development. Otherwise it
> > can
> > appear to "happen once in a while randomly".
>
> I don't quite follow. As far as I understand, the extended timeout is
> needed when turning the cache on.
>
> The above seems more related to flushing the cache, no? Flushing have
> no timeout (also reported to be an issue [1]), which happens either
> at
> _mmc_hw_reset() or at _mmc_suspend().
>
> What is the relation here?
Yes it's the kinda of behaviour you would expect on a flush indeed! I
don't know what the card actaully does when turning the cache on,
whether it's actually flush of something persistent when turning the
cache on after a hard poweroff or doing some other validation.
All i can share is what our testing seems to indicate, which is that
there is a wide spread in the time the card needs *and* there seems to
be strong correlation to the I/O activity before the hard power off and
the time taken by "cache on".
> > Unfortunately for me, it was really a case of getting reports of
> > some
> > boards started failing at some point which took a while to track
> > back.
> > Especially since it's a battery powered device (thus hard poweroffs
> > are
> > rather rare) and we allow the board manufactorer to select from
> > various
> > different eMMCs depending on price/available at build time...
> >
> > > Yet, if we add a quirk for that, then we should probably mention
> > > it
> > > in
> > > an error message when we hit -ETIMEDOUT for cache on ("does your
> > > card
> > > need this quirk?")? It can be pretty time consuming to track this
> > > down
> > > otherwise, I'd think.
> >
> > Yes please. It would be nice if someone happens to have the right
> > contacts with Micron to see if it's a known issue for their cards
> > in
> > general or just this one.
> >
> > Also would be good to have a timeout higher then 1 seconds (or for
> > these cards not have one?); On our testing thusfar we've seen
> > timeouts
> > up to 850ms, but it's impossible to ensure that that's the true
> > upper
> > bound.
>
> Using no limit of the timeout, would mean we may hang for ~10 minutes
> (MMC_OPS_TIMEOUT_MS) instead, no thanks.
Probably a silly question, but would this actually cause e.g. boot to
hang while waiting for the card (assuming rootfs is somewhere else)?
> I am fine with let's say double of 850ms (1700ms), to have some room.
>
> Anyway, the point is, the timeouts in the spec is there for reason.
> Unfortunate I think the spec is "lazy" in some other regards and
> don't
> specify timeouts, which complicates things.
--
Sjoerd Simons
Collabora Ltd.
On Tue, 2018-11-20 at 15:24 +0100, Ulf Hansson wrote:
> On 20 November 2018 at 15:00, Sjoerd Simons
> <[email protected]> wrote:
> > On Tue, 2018-11-20 at 14:08 +0100, Ulf Hansson wrote:
> > > + Hal Emmerich
> > >
> > > On 20 November 2018 at 12:38, Sjoerd Simons
> > > <[email protected]> wrote:
> > > > On Tue, 2018-11-20 at 11:23 +0100, Wolfram Sang wrote:
> > > > So if you know the pattern, or just happen to hit it often in
> > > > e.g.
> > > > automated testing, it does show up during development.
> > > > Otherwise it
> > > > can
> > > > appear to "happen once in a while randomly".
> > >
> > > I don't quite follow. As far as I understand, the extended
> > > timeout is
> > > needed when turning the cache on.
> > >
> > > The above seems more related to flushing the cache, no? Flushing
> > > have
> > > no timeout (also reported to be an issue [1]), which happens
> > > either
> > > at
> > > _mmc_hw_reset() or at _mmc_suspend().
> > >
> > > What is the relation here?
> >
> > Yes it's the kinda of behaviour you would expect on a flush indeed!
> > I
> > don't know what the card actaully does when turning the cache on,
> > whether it's actually flush of something persistent when turning
> > the
> > cache on after a hard poweroff or doing some other validation.
> >
> > All i can share is what our testing seems to indicate, which is
> > that
> > there is a wide spread in the time the card needs *and* there seems
> > to
> > be strong correlation to the I/O activity before the hard power off
> > and
> > the time taken by "cache on".
>
> So the hard power off, means that you are cutting the power to the
> platform and not doing a graceful power off? Just so I understand
> correctly.
Correct. With hard power off i mean cutting the power to the board in
some way. When doing a gracefull power off, the issue does not occur on
my boards (cache on is fast). Presumably as the eMMC got its cache
flushed as part of the shutdown procedure..
> > > Using no limit of the timeout, would mean we may hang for ~10
> > > minutes
> > > (MMC_OPS_TIMEOUT_MS) instead, no thanks.
> >
> > Probably a silly question, but would this actually cause e.g. boot
> > to
> > hang while waiting for the card (assuming rootfs is somewhere
> > else)?
>
> Nope.
>
> [...]
Then i'm probably just being dense, but i'm unsure what the issue is
with waiting for 10 minuts in this case. Apart from having to wait for
10 minutes before the user gets an indication in the kernel that the
card is not usable?
--
Sjoerd Simons
Collabora Ltd.
On 20 November 2018 at 15:55, Sjoerd Simons
<[email protected]> wrote:
> On Tue, 2018-11-20 at 15:24 +0100, Ulf Hansson wrote:
>> On 20 November 2018 at 15:00, Sjoerd Simons
>> <[email protected]> wrote:
>> > On Tue, 2018-11-20 at 14:08 +0100, Ulf Hansson wrote:
>> > > + Hal Emmerich
>> > >
>> > > On 20 November 2018 at 12:38, Sjoerd Simons
>> > > <[email protected]> wrote:
>> > > > On Tue, 2018-11-20 at 11:23 +0100, Wolfram Sang wrote:
>> > > > So if you know the pattern, or just happen to hit it often in
>> > > > e.g.
>> > > > automated testing, it does show up during development.
>> > > > Otherwise it
>> > > > can
>> > > > appear to "happen once in a while randomly".
>> > >
>> > > I don't quite follow. As far as I understand, the extended
>> > > timeout is
>> > > needed when turning the cache on.
>> > >
>> > > The above seems more related to flushing the cache, no? Flushing
>> > > have
>> > > no timeout (also reported to be an issue [1]), which happens
>> > > either
>> > > at
>> > > _mmc_hw_reset() or at _mmc_suspend().
>> > >
>> > > What is the relation here?
>> >
>> > Yes it's the kinda of behaviour you would expect on a flush indeed!
>> > I
>> > don't know what the card actaully does when turning the cache on,
>> > whether it's actually flush of something persistent when turning
>> > the
>> > cache on after a hard poweroff or doing some other validation.
>> >
>> > All i can share is what our testing seems to indicate, which is
>> > that
>> > there is a wide spread in the time the card needs *and* there seems
>> > to
>> > be strong correlation to the I/O activity before the hard power off
>> > and
>> > the time taken by "cache on".
>>
>> So the hard power off, means that you are cutting the power to the
>> platform and not doing a graceful power off? Just so I understand
>> correctly.
>
> Correct. With hard power off i mean cutting the power to the board in
> some way. When doing a gracefull power off, the issue does not occur on
> my boards (cache on is fast). Presumably as the eMMC got its cache
> flushed as part of the shutdown procedure..
Okay, got it.
>
>> > > Using no limit of the timeout, would mean we may hang for ~10
>> > > minutes
>> > > (MMC_OPS_TIMEOUT_MS) instead, no thanks.
>> >
>> > Probably a silly question, but would this actually cause e.g. boot
>> > to
>> > hang while waiting for the card (assuming rootfs is somewhere
>> > else)?
>>
>> Nope.
>>
>> [...]
>
> Then i'm probably just being dense, but i'm unsure what the issue is
> with waiting for 10 minuts in this case. Apart from having to wait for
> 10 minutes before the user gets an indication in the kernel that the
> card is not usable?
Exactly, to long timeouts is never good. Then it's better to timeout
in a reasonable time and log some error message to the let user know
of what went wrong.
To sum up, if someone wants to send a patch changing the timeout for
enabling the cache to let's say 1500 ms (and by adding a comment why),
I am fine to take it. In other words, lets drop the quirk-approach as
people seems more in favor of general change of the default behavior.
Kind regards
Uffe