LinuxLists.cc - [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB

2023-11-29 16:06:11

Subject: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On the eMMC SanDisk iNAND 7250 configured with HS200, requesting a
re-tune before switching to the RPMB partition would randomly cause
subsequent RPMB requests to fail with EILSEQ:
* data error -84, tigggered in __mmc_blk_ioctl_cmd()

This commit skips the retune when switching to RPMB.
Tested over several days with per minute RPMB reads.

Signed-off-by: Jorge Ramirez-Ortiz <[email protected]>
---
Fixes v1: kernel test robot identified typo causing build failure
CIF_MANFID_SANDISK_SD --> CID_MANFID_SANDISK_SD

drivers/mmc/core/block.c | 6 +++++-
drivers/mmc/core/card.h | 7 +++++++
drivers/mmc/core/quirks.h | 7 +++++++
include/linux/mmc/card.h | 1 +
4 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 152dfe593c43..9b7ba6562a3b 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -860,6 +860,11 @@ static int mmc_blk_part_switch_pre(struct mmc_card *card,
return ret;
}
mmc_retune_pause(card->host);
+
+ /* Do not force retune before RPMB switch */
+ if (mmc_can_retune(card->host) &&
+ mmc_card_broken_rpmb_retune(card))
+ card->host->need_retune = 0;
}

return ret;
@@ -3143,4 +3148,3 @@ module_exit(mmc_blk_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Multimedia Card (MMC) block device driver");
-
diff --git a/drivers/mmc/core/card.h b/drivers/mmc/core/card.h
index b7754a1b8d97..1e1555a15de9 100644
--- a/drivers/mmc/core/card.h
+++ b/drivers/mmc/core/card.h
@@ -85,6 +85,7 @@ struct mmc_fixup {
#define CID_MANFID_MICRON 0x13
#define CID_MANFID_SAMSUNG 0x15
#define CID_MANFID_APACER 0x27
+#define CID_MANFID_SANDISK2 0x45
#define CID_MANFID_KINGSTON 0x70
#define CID_MANFID_HYNIX 0x90
#define CID_MANFID_KINGSTON_SD 0x9F
@@ -284,4 +285,10 @@ static inline int mmc_card_broken_cache_flush(const struct mmc_card *c)
{
return c->quirks & MMC_QUIRK_BROKEN_CACHE_FLUSH;
}
+
+static inline int mmc_card_broken_rpmb_retune(const struct mmc_card *c)
+{
+ return c->quirks & MMC_QUIRK_BROKEN_RPMB_RETUNE;
+}
+
#endif
diff --git a/drivers/mmc/core/quirks.h b/drivers/mmc/core/quirks.h
index cca71867bc4a..56c79b6b3537 100644
--- a/drivers/mmc/core/quirks.h
+++ b/drivers/mmc/core/quirks.h
@@ -130,6 +130,13 @@ static const struct mmc_fixup __maybe_unused mmc_blk_fixups[] = {
MMC_FIXUP(CID_NAME_ANY, CID_MANFID_SANDISK_SD, 0x5344, add_quirk_sd,
MMC_QUIRK_BROKEN_SD_DISCARD),

+ /*
+ * SanDisk iNAND 7250 DG4064, this quirk shall disable the retune
+ * operation enforced by default when switching to RPMB.
+ */
+ MMC_FIXUP("DG4064", CID_MANFID_SANDISK2, 0x100, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_RPMB_RETUNE),
+
END_FIXUP
};

diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
index 7b12eebc5586..bd6986189e8b 100644
--- a/include/linux/mmc/card.h
+++ b/include/linux/mmc/card.h
@@ -296,6 +296,7 @@ struct mmc_card {
#define MMC_QUIRK_BROKEN_SD_DISCARD (1<<14) /* Disable broken SD discard support */
#define MMC_QUIRK_BROKEN_SD_CACHE (1<<15) /* Disable broken SD cache support */
#define MMC_QUIRK_BROKEN_CACHE_FLUSH (1<<16) /* Don't flush cache until the write has occurred */
+#define MMC_QUIRK_BROKEN_RPMB_RETUNE (1<<17) /* Don't force a retune before switching to RPMB */

bool written_flag; /* Indicates eMMC has been written since power on */
bool reenable_cmdq; /* Re-enable Command Queue */
--
2.34.1

2023-11-30 10:35:24

by Ulf Hansson

[permalink] [raw]

Subject: Re: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On Wed, 29 Nov 2023 at 17:05, Jorge Ramirez-Ortiz <[email protected]> wrote:
>
> On the eMMC SanDisk iNAND 7250 configured with HS200, requesting a
> re-tune before switching to the RPMB partition would randomly cause
> subsequent RPMB requests to fail with EILSEQ:
> * data error -84, tigggered in __mmc_blk_ioctl_cmd()
>
> This commit skips the retune when switching to RPMB.
> Tested over several days with per minute RPMB reads.

This sounds weird to me and needs more testing/debugging in my
opinion, especially at the host driver level. Perhaps add some new
tests in mmc_test, that does a partition switch to/from any partition
and then run regular I/O again to see if the problem is easier to
reproduce?

The point is, I wonder what is so special with RPMB here? Note that,
it has been quite common that host drivers/controllers have had issues
with their tuning support, so I would not be surprised if that is the
case here too. Certainly I would be surprised if the problem is at
the eMMC card side, but I may be wrong.

Kind regards
Uffe

>
> Signed-off-by: Jorge Ramirez-Ortiz <[email protected]>
> ---
> Fixes v1: kernel test robot identified typo causing build failure
> CIF_MANFID_SANDISK_SD --> CID_MANFID_SANDISK_SD
>
> drivers/mmc/core/block.c | 6 +++++-
> drivers/mmc/core/card.h | 7 +++++++
> drivers/mmc/core/quirks.h | 7 +++++++
> include/linux/mmc/card.h | 1 +
> 4 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> index 152dfe593c43..9b7ba6562a3b 100644
> --- a/drivers/mmc/core/block.c
> +++ b/drivers/mmc/core/block.c
> @@ -860,6 +860,11 @@ static int mmc_blk_part_switch_pre(struct mmc_card *card,
> return ret;
> }
> mmc_retune_pause(card->host);
> +
> + /* Do not force retune before RPMB switch */
> + if (mmc_can_retune(card->host) &&
> + mmc_card_broken_rpmb_retune(card))
> + card->host->need_retune = 0;
> }
>
> return ret;
> @@ -3143,4 +3148,3 @@ module_exit(mmc_blk_exit);
>
> MODULE_LICENSE("GPL");
> MODULE_DESCRIPTION("Multimedia Card (MMC) block device driver");
> -
> diff --git a/drivers/mmc/core/card.h b/drivers/mmc/core/card.h
> index b7754a1b8d97..1e1555a15de9 100644
> --- a/drivers/mmc/core/card.h
> +++ b/drivers/mmc/core/card.h
> @@ -85,6 +85,7 @@ struct mmc_fixup {
> #define CID_MANFID_MICRON 0x13
> #define CID_MANFID_SAMSUNG 0x15
> #define CID_MANFID_APACER 0x27
> +#define CID_MANFID_SANDISK2 0x45
> #define CID_MANFID_KINGSTON 0x70
> #define CID_MANFID_HYNIX 0x90
> #define CID_MANFID_KINGSTON_SD 0x9F
> @@ -284,4 +285,10 @@ static inline int mmc_card_broken_cache_flush(const struct mmc_card *c)
> {
> return c->quirks & MMC_QUIRK_BROKEN_CACHE_FLUSH;
> }
> +
> +static inline int mmc_card_broken_rpmb_retune(const struct mmc_card *c)
> +{
> + return c->quirks & MMC_QUIRK_BROKEN_RPMB_RETUNE;
> +}
> +
> #endif
> diff --git a/drivers/mmc/core/quirks.h b/drivers/mmc/core/quirks.h
> index cca71867bc4a..56c79b6b3537 100644
> --- a/drivers/mmc/core/quirks.h
> +++ b/drivers/mmc/core/quirks.h
> @@ -130,6 +130,13 @@ static const struct mmc_fixup __maybe_unused mmc_blk_fixups[] = {
> MMC_FIXUP(CID_NAME_ANY, CID_MANFID_SANDISK_SD, 0x5344, add_quirk_sd,
> MMC_QUIRK_BROKEN_SD_DISCARD),
>
> + /*
> + * SanDisk iNAND 7250 DG4064, this quirk shall disable the retune
> + * operation enforced by default when switching to RPMB.
> + */
> + MMC_FIXUP("DG4064", CID_MANFID_SANDISK2, 0x100, add_quirk_mmc,
> + MMC_QUIRK_BROKEN_RPMB_RETUNE),
> +
> END_FIXUP
> };
>
> diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
> index 7b12eebc5586..bd6986189e8b 100644
> --- a/include/linux/mmc/card.h
> +++ b/include/linux/mmc/card.h
> @@ -296,6 +296,7 @@ struct mmc_card {
> #define MMC_QUIRK_BROKEN_SD_DISCARD (1<<14) /* Disable broken SD discard support */
> #define MMC_QUIRK_BROKEN_SD_CACHE (1<<15) /* Disable broken SD cache support */
> #define MMC_QUIRK_BROKEN_CACHE_FLUSH (1<<16) /* Don't flush cache until the write has occurred */
> +#define MMC_QUIRK_BROKEN_RPMB_RETUNE (1<<17) /* Don't force a retune before switching to RPMB */
>
> bool written_flag; /* Indicates eMMC has been written since power on */
> bool reenable_cmdq; /* Re-enable Command Queue */
> --
> 2.34.1

2023-11-30 13:24:52

by Jorge Ramirez-Ortiz, Foundries

[permalink] [raw]

Subject: Re: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On 30/11/23 11:34:18, Ulf Hansson wrote:
> On Wed, 29 Nov 2023 at 17:05, Jorge Ramirez-Ortiz <[email protected]> wrote:
> >
> > On the eMMC SanDisk iNAND 7250 configured with HS200, requesting a
> > re-tune before switching to the RPMB partition would randomly cause
> > subsequent RPMB requests to fail with EILSEQ:
> > * data error -84, tigggered in __mmc_blk_ioctl_cmd()
> >
> > This commit skips the retune when switching to RPMB.
> > Tested over several days with per minute RPMB reads.
>
> This sounds weird to me and needs more testing/debugging in my
> opinion, especially at the host driver level. Perhaps add some new
> tests in mmc_test, that does a partition switch to/from any partition
> and then run regular I/O again to see if the problem is easier to
> reproduce?

hi Uffe

ok I'll have a look - I have never used this driver before, so if you
have anything in the works I'll be glad to integrated and adapt.

>
> The point is, I wonder what is so special with RPMB here? Note that,
> it has been quite common that host drivers/controllers have had issues
> with their tuning support, so I would not be surprised if that is the
> case here too.

Right, it is just that the tuning function for of-arasan is the generic
__sdhci_execute_tuning() - only wrapped around arasan DLL reset
calls. Hence why I aimed for the card: __sdhci_execute_tuning and ZynqMP
are not recent functions or architectures.

> Certainly I would be surprised if the problem is at
> the eMMC card side, but I may be wrong.

How do maintainers test the tuning methods? is there anything else for
me to do other than forcing a retune with different partitions?

>
> Kind regards
> Uffe

For completeness this is the error message - notice that we have a
trusted application (fiovb) going through OP-TEE and back to the TEE
supplicant issuing an rpmb read of a variable (pretty normal these days,
we use it on many different platforms - ST, NXP, AMD/Xilinx, TI..).

The issue on this Zynqmp platform is scarily simple to reproduce; you
can ignore the OP-TEE trace, it is just the TEE way of reporting that
the RPMB read failed.

root@uz3cg-dwg-sec:/var/rootdirs/home/fio# fiovb_printenv m4hash
[ 461.775084] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
E/TC:? 0
E/TC:? 0 TA panicked with code 0xffff0000
E/LD: Status of TA 22250a54-0bf1-48fe-8002-7b20f1c9c9b1
E/LD: arch: aarch64
E/LD: region 0: va 0xc0004000 pa 0x7e200000 size 0x002000 flags rw-s (ldelf)
E/LD: region 1: va 0xc0006000 pa 0x7e202000 size 0x008000 flags r-xs (ldelf)
E/LD: region 2: va 0xc000e000 pa 0x7e20a000 size 0x001000 flags rw-s (ldelf)
E/LD: region 3: va 0xc000f000 pa 0x7e20b000 size 0x004000 flags rw-s (ldelf)
E/LD: region 4: va 0xc0013000 pa 0x7e20f000 size 0x001000 flags r--s
E/LD: region 5: va 0xc0014000 pa 0x7e22c000 size 0x005000 flags rw-s (stack)
E/LD: region 6: va 0xc0019000 pa 0x816b31fc8 size 0x001000 flags rw-- (param)
E/LD: region 7: va 0xc001a000 pa 0x816aa1fc8 size 0x002000 flags rw-- (param)
E/LD: region 8: va 0xc006b000 pa 0x00001000 size 0x014000 flags r-xs [0]
E/LD: region 9: va 0xc007f000 pa 0x00015000 size 0x008000 flags rw-s [0]
E/LD: [0] 22250a54-0bf1-48fe-8002-7b20f1c9c9b1 @ 0xc006b000
E/LD: Call stack:
E/LD: 0xc006de58
E/LD: 0xc006b388
E/LD: 0xc006ed40
E/LD: 0xc006b624
Read persistent value for m4hash failed: Exec format error

Also I instrumented sdhci-of-arasan.c to confirm that tuning wasn't failing.

diff --git a/drivers/mmc/host/sdhci-of-arasan.c b/drivers/mmc/host/sdhci-of-arasan.c
index 681ac4cab8ab..54cde79d2719 100644
--- a/drivers/mmc/host/sdhci-of-arasan.c
+++ b/drivers/mmc/host/sdhci-of-arasan.c
@@ -1123,7 +1123,10 @@ static int arasan_zynqmp_execute_tuning(struct mmc_host *mmc, u32 opcode)

err = sdhci_execute_tuning(mmc, opcode);
if (err)
- return err;
+ WARN_ON(1);
+
+ if (host->tuning_err)
+ WARN_ON(1);

arasan_zynqmp_dll_reset(host, device_id);

Incidentally - not sure if it is intentional or not - I noticed that the
function arasan_zynqmp_execute_tuning(..) can not fail which seems wrong
(IMO it should also check host->tuning_err and not only err which will
always be 0).

Do you think this needs fixing even though not related to this problem?

TIA
Jorge

>
> >
> > Signed-off-by: Jorge Ramirez-Ortiz <[email protected]>
> > ---
> > Fixes v1: kernel test robot identified typo causing build failure
> > CIF_MANFID_SANDISK_SD --> CID_MANFID_SANDISK_SD
> >
> > drivers/mmc/core/block.c | 6 +++++-
> > drivers/mmc/core/card.h | 7 +++++++
> > drivers/mmc/core/quirks.h | 7 +++++++
> > include/linux/mmc/card.h | 1 +
> > 4 files changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> > index 152dfe593c43..9b7ba6562a3b 100644
> > --- a/drivers/mmc/core/block.c
> > +++ b/drivers/mmc/core/block.c
> > @@ -860,6 +860,11 @@ static int mmc_blk_part_switch_pre(struct mmc_card *card,
> > return ret;
> > }
> > mmc_retune_pause(card->host);
> > +
> > + /* Do not force retune before RPMB switch */
> > + if (mmc_can_retune(card->host) &&
> > + mmc_card_broken_rpmb_retune(card))
> > + card->host->need_retune = 0;
> > }
> >
> > return ret;
> > @@ -3143,4 +3148,3 @@ module_exit(mmc_blk_exit);
> >
> > MODULE_LICENSE("GPL");
> > MODULE_DESCRIPTION("Multimedia Card (MMC) block device driver");
> > -
> > diff --git a/drivers/mmc/core/card.h b/drivers/mmc/core/card.h
> > index b7754a1b8d97..1e1555a15de9 100644
> > --- a/drivers/mmc/core/card.h
> > +++ b/drivers/mmc/core/card.h
> > @@ -85,6 +85,7 @@ struct mmc_fixup {
> > #define CID_MANFID_MICRON 0x13
> > #define CID_MANFID_SAMSUNG 0x15
> > #define CID_MANFID_APACER 0x27
> > +#define CID_MANFID_SANDISK2 0x45
> > #define CID_MANFID_KINGSTON 0x70
> > #define CID_MANFID_HYNIX 0x90
> > #define CID_MANFID_KINGSTON_SD 0x9F
> > @@ -284,4 +285,10 @@ static inline int mmc_card_broken_cache_flush(const struct mmc_card *c)
> > {
> > return c->quirks & MMC_QUIRK_BROKEN_CACHE_FLUSH;
> > }
> > +
> > +static inline int mmc_card_broken_rpmb_retune(const struct mmc_card *c)
> > +{
> > + return c->quirks & MMC_QUIRK_BROKEN_RPMB_RETUNE;
> > +}
> > +
> > #endif
> > diff --git a/drivers/mmc/core/quirks.h b/drivers/mmc/core/quirks.h
> > index cca71867bc4a..56c79b6b3537 100644
> > --- a/drivers/mmc/core/quirks.h
> > +++ b/drivers/mmc/core/quirks.h
> > @@ -130,6 +130,13 @@ static const struct mmc_fixup __maybe_unused mmc_blk_fixups[] = {
> > MMC_FIXUP(CID_NAME_ANY, CID_MANFID_SANDISK_SD, 0x5344, add_quirk_sd,
> > MMC_QUIRK_BROKEN_SD_DISCARD),
> >
> > + /*
> > + * SanDisk iNAND 7250 DG4064, this quirk shall disable the retune
> > + * operation enforced by default when switching to RPMB.
> > + */
> > + MMC_FIXUP("DG4064", CID_MANFID_SANDISK2, 0x100, add_quirk_mmc,
> > + MMC_QUIRK_BROKEN_RPMB_RETUNE),
> > +
> > END_FIXUP
> > };
> >
> > diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
> > index 7b12eebc5586..bd6986189e8b 100644
> > --- a/include/linux/mmc/card.h
> > +++ b/include/linux/mmc/card.h
> > @@ -296,6 +296,7 @@ struct mmc_card {
> > #define MMC_QUIRK_BROKEN_SD_DISCARD (1<<14) /* Disable broken SD discard support */
> > #define MMC_QUIRK_BROKEN_SD_CACHE (1<<15) /* Disable broken SD cache support */
> > #define MMC_QUIRK_BROKEN_CACHE_FLUSH (1<<16) /* Don't flush cache until the write has occurred */
> > +#define MMC_QUIRK_BROKEN_RPMB_RETUNE (1<<17) /* Don't force a retune before switching to RPMB */
> >
> > bool written_flag; /* Indicates eMMC has been written since power on */
> > bool reenable_cmdq; /* Re-enable Command Queue */
> > --
> > 2.34.1

2023-11-30 17:36:07

by Ulf Hansson

[permalink] [raw]

Subject: Re: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On Thu, 30 Nov 2023 at 14:24, Jorge Ramirez-Ortiz, Foundries
<[email protected]> wrote:
>
> On 30/11/23 11:34:18, Ulf Hansson wrote:
> > On Wed, 29 Nov 2023 at 17:05, Jorge Ramirez-Ortiz <[email protected]> wrote:
> > >
> > > On the eMMC SanDisk iNAND 7250 configured with HS200, requesting a
> > > re-tune before switching to the RPMB partition would randomly cause
> > > subsequent RPMB requests to fail with EILSEQ:
> > > * data error -84, tigggered in __mmc_blk_ioctl_cmd()
> > >
> > > This commit skips the retune when switching to RPMB.
> > > Tested over several days with per minute RPMB reads.
> >
> > This sounds weird to me and needs more testing/debugging in my
> > opinion, especially at the host driver level. Perhaps add some new
> > tests in mmc_test, that does a partition switch to/from any partition
> > and then run regular I/O again to see if the problem is easier to
> > reproduce?
>
> hi Uffe
>
> ok I'll have a look - I have never used this driver before, so if you
> have anything in the works I'll be glad to integrated and adapt.
>
> >
> > The point is, I wonder what is so special with RPMB here? Note that,
> > it has been quite common that host drivers/controllers have had issues
> > with their tuning support, so I would not be surprised if that is the
> > case here too.
>
> Right, it is just that the tuning function for of-arasan is the generic
> __sdhci_execute_tuning() - only wrapped around arasan DLL reset
> calls. Hence why I aimed for the card: __sdhci_execute_tuning and ZynqMP
> are not recent functions or architectures.

Well, I suggest we try to get some help from the driver authors then.

>
>
> > Certainly I would be surprised if the problem is at
> > the eMMC card side, but I may be wrong.
>
> How do maintainers test the tuning methods? is there anything else for
> me to do other than forcing a retune with different partitions?

Good question. I think we should have an mmc_test for that, but I
don't think we have.

Unfortunately, I don't have the bandwidth to help with this currently,
so unless some other can step in and help you are on your own.

>
> >
> > Kind regards
> > Uffe
>
> For completeness this is the error message - notice that we have a
> trusted application (fiovb) going through OP-TEE and back to the TEE
> supplicant issuing an rpmb read of a variable (pretty normal these days,
> we use it on many different platforms - ST, NXP, AMD/Xilinx, TI..).
>
> The issue on this Zynqmp platform is scarily simple to reproduce; you
> can ignore the OP-TEE trace, it is just the TEE way of reporting that
> the RPMB read failed.
>
> root@uz3cg-dwg-sec:/var/rootdirs/home/fio# fiovb_printenv m4hash
> [ 461.775084] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
> E/TC:? 0
> E/TC:? 0 TA panicked with code 0xffff0000
> E/LD: Status of TA 22250a54-0bf1-48fe-8002-7b20f1c9c9b1
> E/LD: arch: aarch64
> E/LD: region 0: va 0xc0004000 pa 0x7e200000 size 0x002000 flags rw-s (ldelf)
> E/LD: region 1: va 0xc0006000 pa 0x7e202000 size 0x008000 flags r-xs (ldelf)
> E/LD: region 2: va 0xc000e000 pa 0x7e20a000 size 0x001000 flags rw-s (ldelf)
> E/LD: region 3: va 0xc000f000 pa 0x7e20b000 size 0x004000 flags rw-s (ldelf)
> E/LD: region 4: va 0xc0013000 pa 0x7e20f000 size 0x001000 flags r--s
> E/LD: region 5: va 0xc0014000 pa 0x7e22c000 size 0x005000 flags rw-s (stack)
> E/LD: region 6: va 0xc0019000 pa 0x816b31fc8 size 0x001000 flags rw-- (param)
> E/LD: region 7: va 0xc001a000 pa 0x816aa1fc8 size 0x002000 flags rw-- (param)
> E/LD: region 8: va 0xc006b000 pa 0x00001000 size 0x014000 flags r-xs [0]
> E/LD: region 9: va 0xc007f000 pa 0x00015000 size 0x008000 flags rw-s [0]
> E/LD: [0] 22250a54-0bf1-48fe-8002-7b20f1c9c9b1 @ 0xc006b000
> E/LD: Call stack:
> E/LD: 0xc006de58
> E/LD: 0xc006b388
> E/LD: 0xc006ed40
> E/LD: 0xc006b624
> Read persistent value for m4hash failed: Exec format error
>
> Also I instrumented sdhci-of-arasan.c to confirm that tuning wasn't failing.
>
> diff --git a/drivers/mmc/host/sdhci-of-arasan.c b/drivers/mmc/host/sdhci-of-arasan.c
> index 681ac4cab8ab..54cde79d2719 100644
> --- a/drivers/mmc/host/sdhci-of-arasan.c
> +++ b/drivers/mmc/host/sdhci-of-arasan.c
> @@ -1123,7 +1123,10 @@ static int arasan_zynqmp_execute_tuning(struct mmc_host *mmc, u32 opcode)
>
> err = sdhci_execute_tuning(mmc, opcode);
> if (err)
> - return err;
> + WARN_ON(1);
> +
> + if (host->tuning_err)
> + WARN_ON(1);
>
> arasan_zynqmp_dll_reset(host, device_id);
>
>
> Incidentally - not sure if it is intentional or not - I noticed that the
> function arasan_zynqmp_execute_tuning(..) can not fail which seems wrong
> (IMO it should also check host->tuning_err and not only err which will
> always be 0).
>
> Do you think this needs fixing even though not related to this problem?

I usually defer to driver authors and Adrian as the SDHCI maintainer
for questions like these, so again, sorry for not being able to help
more than this.

[...]

Kind regards
Uffe

2023-11-30 19:13:14

by Adrian Hunter

[permalink] [raw]

Subject: Re: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On 30/11/23 15:24, Jorge Ramirez-Ortiz, Foundries wrote:
> On 30/11/23 11:34:18, Ulf Hansson wrote:
>> On Wed, 29 Nov 2023 at 17:05, Jorge Ramirez-Ortiz <[email protected]> wrote:
>>>
>>> On the eMMC SanDisk iNAND 7250 configured with HS200, requesting a
>>> re-tune before switching to the RPMB partition would randomly cause
>>> subsequent RPMB requests to fail with EILSEQ:
>>> * data error -84, tigggered in __mmc_blk_ioctl_cmd()
>>>
>>> This commit skips the retune when switching to RPMB.
>>> Tested over several days with per minute RPMB reads.
>>
>> This sounds weird to me and needs more testing/debugging in my
>> opinion, especially at the host driver level. Perhaps add some new
>> tests in mmc_test, that does a partition switch to/from any partition
>> and then run regular I/O again to see if the problem is easier to
>> reproduce?
>
> hi Uffe
>
> ok I'll have a look - I have never used this driver before, so if you
> have anything in the works I'll be glad to integrated and adapt.
>
>>
>> The point is, I wonder what is so special with RPMB here? Note that,
>> it has been quite common that host drivers/controllers have had issues
>> with their tuning support, so I would not be surprised if that is the
>> case here too.
>
> Right, it is just that the tuning function for of-arasan is the generic
> __sdhci_execute_tuning() - only wrapped around arasan DLL reset
> calls. Hence why I aimed for the card: __sdhci_execute_tuning and ZynqMP
> are not recent functions or architectures.
>
>
>> Certainly I would be surprised if the problem is at
>> the eMMC card side, but I may be wrong.
>
> How do maintainers test the tuning methods? is there anything else for
> me to do other than forcing a retune with different partitions?
>
>>
>> Kind regards
>> Uffe
>
> For completeness this is the error message - notice that we have a
> trusted application (fiovb) going through OP-TEE and back to the TEE
> supplicant issuing an rpmb read of a variable (pretty normal these days,
> we use it on many different platforms - ST, NXP, AMD/Xilinx, TI..).
>
> The issue on this Zynqmp platform is scarily simple to reproduce; you
> can ignore the OP-TEE trace, it is just the TEE way of reporting that
> the RPMB read failed.
>
> root@uz3cg-dwg-sec:/var/rootdirs/home/fio# fiovb_printenv m4hash
> [ 461.775084] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
> E/TC:? 0
> E/TC:? 0 TA panicked with code 0xffff0000
> E/LD: Status of TA 22250a54-0bf1-48fe-8002-7b20f1c9c9b1
> E/LD: arch: aarch64
> E/LD: region 0: va 0xc0004000 pa 0x7e200000 size 0x002000 flags rw-s (ldelf)
> E/LD: region 1: va 0xc0006000 pa 0x7e202000 size 0x008000 flags r-xs (ldelf)
> E/LD: region 2: va 0xc000e000 pa 0x7e20a000 size 0x001000 flags rw-s (ldelf)
> E/LD: region 3: va 0xc000f000 pa 0x7e20b000 size 0x004000 flags rw-s (ldelf)
> E/LD: region 4: va 0xc0013000 pa 0x7e20f000 size 0x001000 flags r--s
> E/LD: region 5: va 0xc0014000 pa 0x7e22c000 size 0x005000 flags rw-s (stack)
> E/LD: region 6: va 0xc0019000 pa 0x816b31fc8 size 0x001000 flags rw-- (param)
> E/LD: region 7: va 0xc001a000 pa 0x816aa1fc8 size 0x002000 flags rw-- (param)
> E/LD: region 8: va 0xc006b000 pa 0x00001000 size 0x014000 flags r-xs [0]
> E/LD: region 9: va 0xc007f000 pa 0x00015000 size 0x008000 flags rw-s [0]
> E/LD: [0] 22250a54-0bf1-48fe-8002-7b20f1c9c9b1 @ 0xc006b000
> E/LD: Call stack:
> E/LD: 0xc006de58
> E/LD: 0xc006b388
> E/LD: 0xc006ed40
> E/LD: 0xc006b624
> Read persistent value for m4hash failed: Exec format error

Have you tried dynamic debug for mmc

Kernel must be configured:

CONFIG_DYNAMIC_DEBUG=y

To enable mmc debug via sysfs:

echo 'file drivers/mmc/core/* +p' > /sys/kernel/debug/dynamic_debug/control
echo 'file drivers/mmc/host/* +p' > /sys/kernel/debug/dynamic_debug/control

2023-11-30 22:02:51

by Jorge Ramirez-Ortiz, Foundries

[permalink] [raw]

Subject: Re: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On 30/11/23 21:12:28, Adrian Hunter wrote:
> On 30/11/23 15:24, Jorge Ramirez-Ortiz, Foundries wrote:
> > On 30/11/23 11:34:18, Ulf Hansson wrote:
> >> On Wed, 29 Nov 2023 at 17:05, Jorge Ramirez-Ortiz <[email protected]> wrote:
> >>>
> >>> On the eMMC SanDisk iNAND 7250 configured with HS200, requesting a
> >>> re-tune before switching to the RPMB partition would randomly cause
> >>> subsequent RPMB requests to fail with EILSEQ:
> >>> * data error -84, tigggered in __mmc_blk_ioctl_cmd()
> >>>
> >>> This commit skips the retune when switching to RPMB.
> >>> Tested over several days with per minute RPMB reads.
> >>
> >> This sounds weird to me and needs more testing/debugging in my
> >> opinion, especially at the host driver level. Perhaps add some new
> >> tests in mmc_test, that does a partition switch to/from any partition
> >> and then run regular I/O again to see if the problem is easier to
> >> reproduce?
> >
> > hi Uffe
> >
> > ok I'll have a look - I have never used this driver before, so if you
> > have anything in the works I'll be glad to integrated and adapt.
> >
> >>
> >> The point is, I wonder what is so special with RPMB here? Note that,
> >> it has been quite common that host drivers/controllers have had issues
> >> with their tuning support, so I would not be surprised if that is the
> >> case here too.
> >
> > Right, it is just that the tuning function for of-arasan is the generic
> > __sdhci_execute_tuning() - only wrapped around arasan DLL reset
> > calls. Hence why I aimed for the card: __sdhci_execute_tuning and ZynqMP
> > are not recent functions or architectures.
> >
> >
> >> Certainly I would be surprised if the problem is at
> >> the eMMC card side, but I may be wrong.
> >
> > How do maintainers test the tuning methods? is there anything else for
> > me to do other than forcing a retune with different partitions?
> >
> >>
> >> Kind regards
> >> Uffe
> >
> > For completeness this is the error message - notice that we have a
> > trusted application (fiovb) going through OP-TEE and back to the TEE
> > supplicant issuing an rpmb read of a variable (pretty normal these days,
> > we use it on many different platforms - ST, NXP, AMD/Xilinx, TI..).
> >
> > The issue on this Zynqmp platform is scarily simple to reproduce; you
> > can ignore the OP-TEE trace, it is just the TEE way of reporting that
> > the RPMB read failed.
> >
> > root@uz3cg-dwg-sec:/var/rootdirs/home/fio# fiovb_printenv m4hash
> > [ 461.775084] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
> > E/TC:? 0
> > E/TC:? 0 TA panicked with code 0xffff0000
> > E/LD: Status of TA 22250a54-0bf1-48fe-8002-7b20f1c9c9b1
> > E/LD: arch: aarch64
> > E/LD: region 0: va 0xc0004000 pa 0x7e200000 size 0x002000 flags rw-s (ldelf)
> > E/LD: region 1: va 0xc0006000 pa 0x7e202000 size 0x008000 flags r-xs (ldelf)
> > E/LD: region 2: va 0xc000e000 pa 0x7e20a000 size 0x001000 flags rw-s (ldelf)
> > E/LD: region 3: va 0xc000f000 pa 0x7e20b000 size 0x004000 flags rw-s (ldelf)
> > E/LD: region 4: va 0xc0013000 pa 0x7e20f000 size 0x001000 flags r--s
> > E/LD: region 5: va 0xc0014000 pa 0x7e22c000 size 0x005000 flags rw-s (stack)
> > E/LD: region 6: va 0xc0019000 pa 0x816b31fc8 size 0x001000 flags rw-- (param)
> > E/LD: region 7: va 0xc001a000 pa 0x816aa1fc8 size 0x002000 flags rw-- (param)
> > E/LD: region 8: va 0xc006b000 pa 0x00001000 size 0x014000 flags r-xs [0]
> > E/LD: region 9: va 0xc007f000 pa 0x00015000 size 0x008000 flags rw-s [0]
> > E/LD: [0] 22250a54-0bf1-48fe-8002-7b20f1c9c9b1 @ 0xc006b000
> > E/LD: Call stack:
> > E/LD: 0xc006de58
> > E/LD: 0xc006b388
> > E/LD: 0xc006ed40
> > E/LD: 0xc006b624
> > Read persistent value for m4hash failed: Exec format error
>
> Have you tried dynamic debug for mmc
>
> Kernel must be configured:
>
> CONFIG_DYNAMIC_DEBUG=y
>
> To enable mmc debug via sysfs:
>
> echo 'file drivers/mmc/core/* +p' > /sys/kernel/debug/dynamic_debug/control
> echo 'file drivers/mmc/host/* +p' > /sys/kernel/debug/dynamic_debug/control
>
>

hi Adrian

Sure, this is the output of the trace:

[ 422.018756] mmc0: sdhci: IRQ status 0x00000020
[ 422.018789] mmc0: sdhci: IRQ status 0x00000020
[ 422.018817] mmc0: sdhci: IRQ status 0x00000020
[ 422.018848] mmc0: sdhci: IRQ status 0x00000020
[ 422.018875] mmc0: sdhci: IRQ status 0x00000020
[ 422.018902] mmc0: sdhci: IRQ status 0x00000020
[ 422.018932] mmc0: sdhci: IRQ status 0x00000020
[ 422.020013] mmc0: sdhci: IRQ status 0x00000001
[ 422.020027] mmc0: sdhci: IRQ status 0x00000002
[ 422.020034] mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
[ 422.020054] mmc0: starting CMD13 arg 00010000 flags 00000195
[ 422.020068] mmc0: sdhci: IRQ status 0x00000001
[ 422.020076] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
[ 422.020092] <mmc0: starting CMD23 arg 00000001 flags 00000015>
[ 422.020101] mmc0: starting CMD25 arg 00000000 flags 00000035
[ 422.020108] mmc0: blksz 512 blocks 1 flags 00000100 tsac 400 ms nsac 0
[ 422.020124] mmc0: sdhci: IRQ status 0x00000001
[ 422.021671] mmc0: sdhci: IRQ status 0x00000002
[ 422.021691] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
[ 422.021700] mmc0: req done (CMD25): 0: 00000900 00000000 00000000 00000000
[ 422.021708] mmc0: 512 bytes transferred: 0
[ 422.021728] mmc0: starting CMD13 arg 00010000 flags 00000195
[ 422.021743] mmc0: sdhci: IRQ status 0x00000001
[ 422.021752] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
[ 422.021771] <mmc0: starting CMD23 arg 00000001 flags 00000015>
[ 422.021779] mmc0: starting CMD18 arg 00000000 flags 00000035
[ 422.021785] mmc0: blksz 512 blocks 1 flags 00000200 tsac 100 ms nsac 0
[ 422.021804] mmc0: sdhci: IRQ status 0x00000001
[ 422.022566] mmc0: sdhci: IRQ status 0x00208000
[ 422.022629] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
[ 422.022639] mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
[ 422.022647] mmc0: 0 bytes transferred: -84
[ 422.022669] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
[ 422.029619] mmc0: starting CMD6 arg 03b30001 flags 0000049d
[ 422.029636] mmc0: sdhci: IRQ status 0x00000001
[ 422.029652] mmc0: sdhci: IRQ status 0x00000002
[ 422.029660] mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
[ 422.029680] mmc0: starting CMD13 arg 00010000 flags 00000195
[ 422.029693] mmc0: sdhci: IRQ status 0x00000001
[ 422.029702] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
[ 422.196996] <mmc0: starting CMD23 arg 00000400 flags 00000015>
[ 422.197051] mmc0: starting CMD25 arg 058160e0 flags 000000b5
[ 422.197079] mmc0: blksz 512 blocks 1024 flags 00000100 tsac 400 ms nsac 0
[ 422.197110] mmc0: CMD12 arg 00000000 flags 0000049d
[ 422.199455] mmc0: sdhci: IRQ status 0x00000020
[ 422.199526] mmc0: sdhci: IRQ status 0x00000020
[ 422.199585] mmc0: sdhci: IRQ status 0x00000020
[ 422.199641] mmc0: sdhci: IRQ status 0x00000020
[ 422.199695] mmc0: sdhci: IRQ status 0x00000020
[ 422.199753] mmc0: sdhci: IRQ status 0x00000020
[ 422.199811] mmc0: sdhci: IRQ status 0x00000020
[ 422.199865] mmc0: sdhci: IRQ status 0x00000020
[ 422.199919] mmc0: sdhci: IRQ status 0x00000020
[ 422.199972] mmc0: sdhci: IRQ status 0x00000020
[ 422.200026] mmc0: sdhci: IRQ status 0x00000020

does this help?

thanks
Jorge

2023-11-30 22:20:21

by Jorge Ramirez-Ortiz, Foundries

[permalink] [raw]

Subject: Re: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On 30/11/23 23:02:15, Jorge Ramirez-Ortiz, Foundries wrote:
> On 30/11/23 21:12:28, Adrian Hunter wrote:
> > On 30/11/23 15:24, Jorge Ramirez-Ortiz, Foundries wrote:
> > > On 30/11/23 11:34:18, Ulf Hansson wrote:
> > >> On Wed, 29 Nov 2023 at 17:05, Jorge Ramirez-Ortiz <[email protected]> wrote:
> > >>>
> > >>> On the eMMC SanDisk iNAND 7250 configured with HS200, requesting a
> > >>> re-tune before switching to the RPMB partition would randomly cause
> > >>> subsequent RPMB requests to fail with EILSEQ:
> > >>> * data error -84, tigggered in __mmc_blk_ioctl_cmd()
> > >>>
> > >>> This commit skips the retune when switching to RPMB.
> > >>> Tested over several days with per minute RPMB reads.
> > >>
> > >> This sounds weird to me and needs more testing/debugging in my
> > >> opinion, especially at the host driver level. Perhaps add some new
> > >> tests in mmc_test, that does a partition switch to/from any partition
> > >> and then run regular I/O again to see if the problem is easier to
> > >> reproduce?
> > >
> > > hi Uffe
> > >
> > > ok I'll have a look - I have never used this driver before, so if you
> > > have anything in the works I'll be glad to integrated and adapt.
> > >
> > >>
> > >> The point is, I wonder what is so special with RPMB here? Note that,
> > >> it has been quite common that host drivers/controllers have had issues
> > >> with their tuning support, so I would not be surprised if that is the
> > >> case here too.
> > >
> > > Right, it is just that the tuning function for of-arasan is the generic
> > > __sdhci_execute_tuning() - only wrapped around arasan DLL reset
> > > calls. Hence why I aimed for the card: __sdhci_execute_tuning and ZynqMP
> > > are not recent functions or architectures.
> > >
> > >
> > >> Certainly I would be surprised if the problem is at
> > >> the eMMC card side, but I may be wrong.
> > >
> > > How do maintainers test the tuning methods? is there anything else for
> > > me to do other than forcing a retune with different partitions?
> > >
> > >>
> > >> Kind regards
> > >> Uffe
> > >
> > > For completeness this is the error message - notice that we have a
> > > trusted application (fiovb) going through OP-TEE and back to the TEE
> > > supplicant issuing an rpmb read of a variable (pretty normal these days,
> > > we use it on many different platforms - ST, NXP, AMD/Xilinx, TI..).
> > >
> > > The issue on this Zynqmp platform is scarily simple to reproduce; you
> > > can ignore the OP-TEE trace, it is just the TEE way of reporting that
> > > the RPMB read failed.
> > >
> > > root@uz3cg-dwg-sec:/var/rootdirs/home/fio# fiovb_printenv m4hash
> > > [ 461.775084] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
> > > E/TC:? 0
> > > E/TC:? 0 TA panicked with code 0xffff0000
> > > E/LD: Status of TA 22250a54-0bf1-48fe-8002-7b20f1c9c9b1
> > > E/LD: arch: aarch64
> > > E/LD: region 0: va 0xc0004000 pa 0x7e200000 size 0x002000 flags rw-s (ldelf)
> > > E/LD: region 1: va 0xc0006000 pa 0x7e202000 size 0x008000 flags r-xs (ldelf)
> > > E/LD: region 2: va 0xc000e000 pa 0x7e20a000 size 0x001000 flags rw-s (ldelf)
> > > E/LD: region 3: va 0xc000f000 pa 0x7e20b000 size 0x004000 flags rw-s (ldelf)
> > > E/LD: region 4: va 0xc0013000 pa 0x7e20f000 size 0x001000 flags r--s
> > > E/LD: region 5: va 0xc0014000 pa 0x7e22c000 size 0x005000 flags rw-s (stack)
> > > E/LD: region 6: va 0xc0019000 pa 0x816b31fc8 size 0x001000 flags rw-- (param)
> > > E/LD: region 7: va 0xc001a000 pa 0x816aa1fc8 size 0x002000 flags rw-- (param)
> > > E/LD: region 8: va 0xc006b000 pa 0x00001000 size 0x014000 flags r-xs [0]
> > > E/LD: region 9: va 0xc007f000 pa 0x00015000 size 0x008000 flags rw-s [0]
> > > E/LD: [0] 22250a54-0bf1-48fe-8002-7b20f1c9c9b1 @ 0xc006b000
> > > E/LD: Call stack:
> > > E/LD: 0xc006de58
> > > E/LD: 0xc006b388
> > > E/LD: 0xc006ed40
> > > E/LD: 0xc006b624
> > > Read persistent value for m4hash failed: Exec format error
> >
> > Have you tried dynamic debug for mmc
> >
> > Kernel must be configured:
> >
> > CONFIG_DYNAMIC_DEBUG=y
> >
> > To enable mmc debug via sysfs:
> >
> > echo 'file drivers/mmc/core/* +p' > /sys/kernel/debug/dynamic_debug/control
> > echo 'file drivers/mmc/host/* +p' > /sys/kernel/debug/dynamic_debug/control
> >
> >
>
> hi Adrian
>
> Sure, this is the output of the trace:
>
> [ 422.018756] mmc0: sdhci: IRQ status 0x00000020
> [ 422.018789] mmc0: sdhci: IRQ status 0x00000020
> [ 422.018817] mmc0: sdhci: IRQ status 0x00000020
> [ 422.018848] mmc0: sdhci: IRQ status 0x00000020
> [ 422.018875] mmc0: sdhci: IRQ status 0x00000020
> [ 422.018902] mmc0: sdhci: IRQ status 0x00000020
> [ 422.018932] mmc0: sdhci: IRQ status 0x00000020
> [ 422.020013] mmc0: sdhci: IRQ status 0x00000001
> [ 422.020027] mmc0: sdhci: IRQ status 0x00000002
> [ 422.020034] mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
> [ 422.020054] mmc0: starting CMD13 arg 00010000 flags 00000195
> [ 422.020068] mmc0: sdhci: IRQ status 0x00000001
> [ 422.020076] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
> [ 422.020092] <mmc0: starting CMD23 arg 00000001 flags 00000015>
> [ 422.020101] mmc0: starting CMD25 arg 00000000 flags 00000035
> [ 422.020108] mmc0: blksz 512 blocks 1 flags 00000100 tsac 400 ms nsac 0
> [ 422.020124] mmc0: sdhci: IRQ status 0x00000001
> [ 422.021671] mmc0: sdhci: IRQ status 0x00000002
> [ 422.021691] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
> [ 422.021700] mmc0: req done (CMD25): 0: 00000900 00000000 00000000 00000000
> [ 422.021708] mmc0: 512 bytes transferred: 0
> [ 422.021728] mmc0: starting CMD13 arg 00010000 flags 00000195
> [ 422.021743] mmc0: sdhci: IRQ status 0x00000001
> [ 422.021752] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
> [ 422.021771] <mmc0: starting CMD23 arg 00000001 flags 00000015>
> [ 422.021779] mmc0: starting CMD18 arg 00000000 flags 00000035
> [ 422.021785] mmc0: blksz 512 blocks 1 flags 00000200 tsac 100 ms nsac 0
> [ 422.021804] mmc0: sdhci: IRQ status 0x00000001
> [ 422.022566] mmc0: sdhci: IRQ status 0x00208000 <---------------------------------- this doesnt seem right
> [ 422.022629] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
> [ 422.022639] mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
> [ 422.022647] mmc0: 0 bytes transferred: -84 < --------------------------------- it should have transfered 4096 bytes
> [ 422.022669] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
> [ 422.029619] mmc0: starting CMD6 arg 03b30001 flags 0000049d
> [ 422.029636] mmc0: sdhci: IRQ status 0x00000001
> [ 422.029652] mmc0: sdhci: IRQ status 0x00000002
> [ 422.029660] mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
> [ 422.029680] mmc0: starting CMD13 arg 00010000 flags 00000195
> [ 422.029693] mmc0: sdhci: IRQ status 0x00000001
> [ 422.029702] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
> [ 422.196996] <mmc0: starting CMD23 arg 00000400 flags 00000015>
> [ 422.197051] mmc0: starting CMD25 arg 058160e0 flags 000000b5
> [ 422.197079] mmc0: blksz 512 blocks 1024 flags 00000100 tsac 400 ms nsac 0
> [ 422.197110] mmc0: CMD12 arg 00000000 flags 0000049d
> [ 422.199455] mmc0: sdhci: IRQ status 0x00000020
> [ 422.199526] mmc0: sdhci: IRQ status 0x00000020
> [ 422.199585] mmc0: sdhci: IRQ status 0x00000020
> [ 422.199641] mmc0: sdhci: IRQ status 0x00000020
> [ 422.199695] mmc0: sdhci: IRQ status 0x00000020
> [ 422.199753] mmc0: sdhci: IRQ status 0x00000020
> [ 422.199811] mmc0: sdhci: IRQ status 0x00000020
> [ 422.199865] mmc0: sdhci: IRQ status 0x00000020
> [ 422.199919] mmc0: sdhci: IRQ status 0x00000020
> [ 422.199972] mmc0: sdhci: IRQ status 0x00000020
> [ 422.200026] mmc0: sdhci: IRQ status 0x00000020
>
>
> does this help?
>
> thanks
> Jorge

2023-12-01 07:41:06

by Jorge Ramirez-Ortiz, Foundries

[permalink] [raw]

Subject: Re: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On 30/11/23 23:19:45, Jorge Ramirez-Ortiz, Foundries wrote:
> On 30/11/23 23:02:15, Jorge Ramirez-Ortiz, Foundries wrote:
> > On 30/11/23 21:12:28, Adrian Hunter wrote:
> > > On 30/11/23 15:24, Jorge Ramirez-Ortiz, Foundries wrote:
> > > > On 30/11/23 11:34:18, Ulf Hansson wrote:
> > > >> On Wed, 29 Nov 2023 at 17:05, Jorge Ramirez-Ortiz <[email protected]> wrote:
> > > >>>
> > > >>> On the eMMC SanDisk iNAND 7250 configured with HS200, requesting a
> > > >>> re-tune before switching to the RPMB partition would randomly cause
> > > >>> subsequent RPMB requests to fail with EILSEQ:
> > > >>> * data error -84, tigggered in __mmc_blk_ioctl_cmd()
> > > >>>
> > > >>> This commit skips the retune when switching to RPMB.
> > > >>> Tested over several days with per minute RPMB reads.
> > > >>
> > > >> This sounds weird to me and needs more testing/debugging in my
> > > >> opinion, especially at the host driver level. Perhaps add some new
> > > >> tests in mmc_test, that does a partition switch to/from any partition
> > > >> and then run regular I/O again to see if the problem is easier to
> > > >> reproduce?
> > > >
> > > > hi Uffe
> > > >
> > > > ok I'll have a look - I have never used this driver before, so if you
> > > > have anything in the works I'll be glad to integrated and adapt.
> > > >
> > > >>
> > > >> The point is, I wonder what is so special with RPMB here? Note that,
> > > >> it has been quite common that host drivers/controllers have had issues
> > > >> with their tuning support, so I would not be surprised if that is the
> > > >> case here too.
> > > >
> > > > Right, it is just that the tuning function for of-arasan is the generic
> > > > __sdhci_execute_tuning() - only wrapped around arasan DLL reset
> > > > calls. Hence why I aimed for the card: __sdhci_execute_tuning and ZynqMP
> > > > are not recent functions or architectures.
> > > >
> > > >
> > > >> Certainly I would be surprised if the problem is at
> > > >> the eMMC card side, but I may be wrong.
> > > >
> > > > How do maintainers test the tuning methods? is there anything else for
> > > > me to do other than forcing a retune with different partitions?
> > > >
> > > >>
> > > >> Kind regards
> > > >> Uffe
> > > >
> > > > For completeness this is the error message - notice that we have a
> > > > trusted application (fiovb) going through OP-TEE and back to the TEE
> > > > supplicant issuing an rpmb read of a variable (pretty normal these days,
> > > > we use it on many different platforms - ST, NXP, AMD/Xilinx, TI..).
> > > >
> > > > The issue on this Zynqmp platform is scarily simple to reproduce; you
> > > > can ignore the OP-TEE trace, it is just the TEE way of reporting that
> > > > the RPMB read failed.
> > > >
> > > > root@uz3cg-dwg-sec:/var/rootdirs/home/fio# fiovb_printenv m4hash
> > > > [ 461.775084] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
> > > > E/TC:? 0
> > > > E/TC:? 0 TA panicked with code 0xffff0000
> > > > E/LD: Status of TA 22250a54-0bf1-48fe-8002-7b20f1c9c9b1
> > > > E/LD: arch: aarch64
> > > > E/LD: region 0: va 0xc0004000 pa 0x7e200000 size 0x002000 flags rw-s (ldelf)
> > > > E/LD: region 1: va 0xc0006000 pa 0x7e202000 size 0x008000 flags r-xs (ldelf)
> > > > E/LD: region 2: va 0xc000e000 pa 0x7e20a000 size 0x001000 flags rw-s (ldelf)
> > > > E/LD: region 3: va 0xc000f000 pa 0x7e20b000 size 0x004000 flags rw-s (ldelf)
> > > > E/LD: region 4: va 0xc0013000 pa 0x7e20f000 size 0x001000 flags r--s
> > > > E/LD: region 5: va 0xc0014000 pa 0x7e22c000 size 0x005000 flags rw-s (stack)
> > > > E/LD: region 6: va 0xc0019000 pa 0x816b31fc8 size 0x001000 flags rw-- (param)
> > > > E/LD: region 7: va 0xc001a000 pa 0x816aa1fc8 size 0x002000 flags rw-- (param)
> > > > E/LD: region 8: va 0xc006b000 pa 0x00001000 size 0x014000 flags r-xs [0]
> > > > E/LD: region 9: va 0xc007f000 pa 0x00015000 size 0x008000 flags rw-s [0]
> > > > E/LD: [0] 22250a54-0bf1-48fe-8002-7b20f1c9c9b1 @ 0xc006b000
> > > > E/LD: Call stack:
> > > > E/LD: 0xc006de58
> > > > E/LD: 0xc006b388
> > > > E/LD: 0xc006ed40
> > > > E/LD: 0xc006b624
> > > > Read persistent value for m4hash failed: Exec format error
> > >
> > > Have you tried dynamic debug for mmc
> > >
> > > Kernel must be configured:
> > >
> > > CONFIG_DYNAMIC_DEBUG=y
> > >
> > > To enable mmc debug via sysfs:
> > >
> > > echo 'file drivers/mmc/core/* +p' > /sys/kernel/debug/dynamic_debug/control
> > > echo 'file drivers/mmc/host/* +p' > /sys/kernel/debug/dynamic_debug/control
> > >
> > >
> >
> > hi Adrian
> >
> > Sure, this is the output of the trace:
> >
> > [ 422.018756] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.018789] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.018817] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.018848] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.018875] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.018902] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.018932] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.020013] mmc0: sdhci: IRQ status 0x00000001
> > [ 422.020027] mmc0: sdhci: IRQ status 0x00000002
> > [ 422.020034] mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
> > [ 422.020054] mmc0: starting CMD13 arg 00010000 flags 00000195
> > [ 422.020068] mmc0: sdhci: IRQ status 0x00000001
> > [ 422.020076] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
> > [ 422.020092] <mmc0: starting CMD23 arg 00000001 flags 00000015>
> > [ 422.020101] mmc0: starting CMD25 arg 00000000 flags 00000035
> > [ 422.020108] mmc0: blksz 512 blocks 1 flags 00000100 tsac 400 ms nsac 0
> > [ 422.020124] mmc0: sdhci: IRQ status 0x00000001
> > [ 422.021671] mmc0: sdhci: IRQ status 0x00000002
> > [ 422.021691] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
> > [ 422.021700] mmc0: req done (CMD25): 0: 00000900 00000000 00000000 00000000
> > [ 422.021708] mmc0: 512 bytes transferred: 0
> > [ 422.021728] mmc0: starting CMD13 arg 00010000 flags 00000195
> > [ 422.021743] mmc0: sdhci: IRQ status 0x00000001
> > [ 422.021752] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
> > [ 422.021771] <mmc0: starting CMD23 arg 00000001 flags 00000015>
> > [ 422.021779] mmc0: starting CMD18 arg 00000000 flags 00000035
> > [ 422.021785] mmc0: blksz 512 blocks 1 flags 00000200 tsac 100 ms nsac 0
> > [ 422.021804] mmc0: sdhci: IRQ status 0x00000001
> > [ 422.022566] mmc0: sdhci: IRQ status 0x00208000 <---------------------------------- this doesnt seem right
> > [ 422.022629] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
> > [ 422.022639] mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
> > [ 422.022647] mmc0: 0 bytes transferred: -84 < --------------------------------- it should have transfered 4096 bytes
> > [ 422.022669] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
> > [ 422.029619] mmc0: starting CMD6 arg 03b30001 flags 0000049d
> > [ 422.029636] mmc0: sdhci: IRQ status 0x00000001
> > [ 422.029652] mmc0: sdhci: IRQ status 0x00000002
> > [ 422.029660] mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
> > [ 422.029680] mmc0: starting CMD13 arg 00010000 flags 00000195
> > [ 422.029693] mmc0: sdhci: IRQ status 0x00000001
> > [ 422.029702] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
> > [ 422.196996] <mmc0: starting CMD23 arg 00000400 flags 00000015>
> > [ 422.197051] mmc0: starting CMD25 arg 058160e0 flags 000000b5
> > [ 422.197079] mmc0: blksz 512 blocks 1024 flags 00000100 tsac 400 ms nsac 0
> > [ 422.197110] mmc0: CMD12 arg 00000000 flags 0000049d
> > [ 422.199455] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.199526] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.199585] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.199641] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.199695] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.199753] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.199811] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.199865] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.199919] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.199972] mmc0: sdhci: IRQ status 0x00000020
> > [ 422.200026] mmc0: sdhci: IRQ status 0x00000020
> >
> >
> > does this help?

Just asking because it doesn't mean much to me other than the obvious CRC
problem.

Being this issue so easy to trigger - and to fix - indicates a problem
on the card more than on the algorithm (otherwise faults would be all
over the place). But I am not an expert on this area.

any additional suggestions welcome.

> >
> > thanks
> > Jorge

2023-12-01 11:46:53

by Adrian Hunter

[permalink] [raw]

Subject: Re: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On 1/12/23 09:40, Jorge Ramirez-Ortiz, Foundries wrote:
> On 30/11/23 23:19:45, Jorge Ramirez-Ortiz, Foundries wrote:
>> On 30/11/23 23:02:15, Jorge Ramirez-Ortiz, Foundries wrote:
>>> On 30/11/23 21:12:28, Adrian Hunter wrote:
>>>> On 30/11/23 15:24, Jorge Ramirez-Ortiz, Foundries wrote:
>>>>> On 30/11/23 11:34:18, Ulf Hansson wrote:
>>>>>> On Wed, 29 Nov 2023 at 17:05, Jorge Ramirez-Ortiz <[email protected]> wrote:
>>>>>>>
>>>>>>> On the eMMC SanDisk iNAND 7250 configured with HS200, requesting a
>>>>>>> re-tune before switching to the RPMB partition would randomly cause
>>>>>>> subsequent RPMB requests to fail with EILSEQ:
>>>>>>> * data error -84, tigggered in __mmc_blk_ioctl_cmd()
>>>>>>>
>>>>>>> This commit skips the retune when switching to RPMB.
>>>>>>> Tested over several days with per minute RPMB reads.
>>>>>>
>>>>>> This sounds weird to me and needs more testing/debugging in my
>>>>>> opinion, especially at the host driver level. Perhaps add some new
>>>>>> tests in mmc_test, that does a partition switch to/from any partition
>>>>>> and then run regular I/O again to see if the problem is easier to
>>>>>> reproduce?
>>>>>
>>>>> hi Uffe
>>>>>
>>>>> ok I'll have a look - I have never used this driver before, so if you
>>>>> have anything in the works I'll be glad to integrated and adapt.
>>>>>
>>>>>>
>>>>>> The point is, I wonder what is so special with RPMB here? Note that,
>>>>>> it has been quite common that host drivers/controllers have had issues
>>>>>> with their tuning support, so I would not be surprised if that is the
>>>>>> case here too.
>>>>>
>>>>> Right, it is just that the tuning function for of-arasan is the generic
>>>>> __sdhci_execute_tuning() - only wrapped around arasan DLL reset
>>>>> calls. Hence why I aimed for the card: __sdhci_execute_tuning and ZynqMP
>>>>> are not recent functions or architectures.
>>>>>
>>>>>
>>>>>> Certainly I would be surprised if the problem is at
>>>>>> the eMMC card side, but I may be wrong.
>>>>>
>>>>> How do maintainers test the tuning methods? is there anything else for
>>>>> me to do other than forcing a retune with different partitions?
>>>>>
>>>>>>
>>>>>> Kind regards
>>>>>> Uffe
>>>>>
>>>>> For completeness this is the error message - notice that we have a
>>>>> trusted application (fiovb) going through OP-TEE and back to the TEE
>>>>> supplicant issuing an rpmb read of a variable (pretty normal these days,
>>>>> we use it on many different platforms - ST, NXP, AMD/Xilinx, TI..).
>>>>>
>>>>> The issue on this Zynqmp platform is scarily simple to reproduce; you
>>>>> can ignore the OP-TEE trace, it is just the TEE way of reporting that
>>>>> the RPMB read failed.
>>>>>
>>>>> root@uz3cg-dwg-sec:/var/rootdirs/home/fio# fiovb_printenv m4hash
>>>>> [ 461.775084] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
>>>>> E/TC:? 0
>>>>> E/TC:? 0 TA panicked with code 0xffff0000
>>>>> E/LD: Status of TA 22250a54-0bf1-48fe-8002-7b20f1c9c9b1
>>>>> E/LD: arch: aarch64
>>>>> E/LD: region 0: va 0xc0004000 pa 0x7e200000 size 0x002000 flags rw-s (ldelf)
>>>>> E/LD: region 1: va 0xc0006000 pa 0x7e202000 size 0x008000 flags r-xs (ldelf)
>>>>> E/LD: region 2: va 0xc000e000 pa 0x7e20a000 size 0x001000 flags rw-s (ldelf)
>>>>> E/LD: region 3: va 0xc000f000 pa 0x7e20b000 size 0x004000 flags rw-s (ldelf)
>>>>> E/LD: region 4: va 0xc0013000 pa 0x7e20f000 size 0x001000 flags r--s
>>>>> E/LD: region 5: va 0xc0014000 pa 0x7e22c000 size 0x005000 flags rw-s (stack)
>>>>> E/LD: region 6: va 0xc0019000 pa 0x816b31fc8 size 0x001000 flags rw-- (param)
>>>>> E/LD: region 7: va 0xc001a000 pa 0x816aa1fc8 size 0x002000 flags rw-- (param)
>>>>> E/LD: region 8: va 0xc006b000 pa 0x00001000 size 0x014000 flags r-xs [0]
>>>>> E/LD: region 9: va 0xc007f000 pa 0x00015000 size 0x008000 flags rw-s [0]
>>>>> E/LD: [0] 22250a54-0bf1-48fe-8002-7b20f1c9c9b1 @ 0xc006b000
>>>>> E/LD: Call stack:
>>>>> E/LD: 0xc006de58
>>>>> E/LD: 0xc006b388
>>>>> E/LD: 0xc006ed40
>>>>> E/LD: 0xc006b624
>>>>> Read persistent value for m4hash failed: Exec format error
>>>>
>>>> Have you tried dynamic debug for mmc
>>>>
>>>> Kernel must be configured:
>>>>
>>>> CONFIG_DYNAMIC_DEBUG=y
>>>>
>>>> To enable mmc debug via sysfs:
>>>>
>>>> echo 'file drivers/mmc/core/* +p' > /sys/kernel/debug/dynamic_debug/control
>>>> echo 'file drivers/mmc/host/* +p' > /sys/kernel/debug/dynamic_debug/control
>>>>
>>>>
>>>
>>> hi Adrian
>>>
>>> Sure, this is the output of the trace:
>>>
>>> [ 422.018756] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.018789] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.018817] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.018848] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.018875] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.018902] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.018932] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.020013] mmc0: sdhci: IRQ status 0x00000001
>>> [ 422.020027] mmc0: sdhci: IRQ status 0x00000002
>>> [ 422.020034] mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
>>> [ 422.020054] mmc0: starting CMD13 arg 00010000 flags 00000195
>>> [ 422.020068] mmc0: sdhci: IRQ status 0x00000001
>>> [ 422.020076] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
>>> [ 422.020092] <mmc0: starting CMD23 arg 00000001 flags 00000015>
>>> [ 422.020101] mmc0: starting CMD25 arg 00000000 flags 00000035
>>> [ 422.020108] mmc0: blksz 512 blocks 1 flags 00000100 tsac 400 ms nsac 0
>>> [ 422.020124] mmc0: sdhci: IRQ status 0x00000001
>>> [ 422.021671] mmc0: sdhci: IRQ status 0x00000002
>>> [ 422.021691] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
>>> [ 422.021700] mmc0: req done (CMD25): 0: 00000900 00000000 00000000 00000000
>>> [ 422.021708] mmc0: 512 bytes transferred: 0
>>> [ 422.021728] mmc0: starting CMD13 arg 00010000 flags 00000195
>>> [ 422.021743] mmc0: sdhci: IRQ status 0x00000001
>>> [ 422.021752] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
>>> [ 422.021771] <mmc0: starting CMD23 arg 00000001 flags 00000015>
>>> [ 422.021779] mmc0: starting CMD18 arg 00000000 flags 00000035
>>> [ 422.021785] mmc0: blksz 512 blocks 1 flags 00000200 tsac 100 ms nsac 0
>>> [ 422.021804] mmc0: sdhci: IRQ status 0x00000001
>>> [ 422.022566] mmc0: sdhci: IRQ status 0x00208000 <---------------------------------- this doesnt seem right
>>> [ 422.022629] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
>>> [ 422.022639] mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
>>> [ 422.022647] mmc0: 0 bytes transferred: -84 < --------------------------------- it should have transfered 4096 bytes
>>> [ 422.022669] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
>>> [ 422.029619] mmc0: starting CMD6 arg 03b30001 flags 0000049d
>>> [ 422.029636] mmc0: sdhci: IRQ status 0x00000001
>>> [ 422.029652] mmc0: sdhci: IRQ status 0x00000002
>>> [ 422.029660] mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
>>> [ 422.029680] mmc0: starting CMD13 arg 00010000 flags 00000195
>>> [ 422.029693] mmc0: sdhci: IRQ status 0x00000001
>>> [ 422.029702] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
>>> [ 422.196996] <mmc0: starting CMD23 arg 00000400 flags 00000015>
>>> [ 422.197051] mmc0: starting CMD25 arg 058160e0 flags 000000b5
>>> [ 422.197079] mmc0: blksz 512 blocks 1024 flags 00000100 tsac 400 ms nsac 0
>>> [ 422.197110] mmc0: CMD12 arg 00000000 flags 0000049d
>>> [ 422.199455] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.199526] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.199585] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.199641] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.199695] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.199753] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.199811] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.199865] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.199919] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.199972] mmc0: sdhci: IRQ status 0x00000020
>>> [ 422.200026] mmc0: sdhci: IRQ status 0x00000020
>>>
>>>
>>> does this help?
>
> Just asking because it doesn't mean much to me other than the obvious CRC
> problem.
>
> Being this issue so easy to trigger - and to fix - indicates a problem
> on the card more than on the algorithm (otherwise faults would be all
> over the place). But I am not an expert on this area.
>
> any additional suggestions welcome.

My guess is that sometimes tuning produces a "bad" result. Perhaps
the margins are very tight and the difference is only 1 tap. When
a "bad" result happens in non-RPMB, a CRC error results in re-tuning
and retry, so no errors are seen. When it happens in RPMB, that is
not possible, so the error is obvious. Not re-tuning before RPMB
switch helps because the CRC-error->re-tuning to a "good" result has
probably already happened.

However, based on that theory, it is not necessary the eMMC that is
at fault.

It may be worth considering a stronger eMMC driver strength setting.

sdhci supports err_stats in debugfs - that may show how many CRC
errors there are when not accessing RPMB.

I don't object to skipping re-tuning before RPMB switch, but I am
not sure about tying it to a specific eMMC.

2023-12-01 15:54:37

by Jorge Ramirez-Ortiz, Foundries

[permalink] [raw]

Subject: Re: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On 01/12/23 13:46:25, Adrian Hunter wrote:
> On 1/12/23 09:40, Jorge Ramirez-Ortiz, Foundries wrote:
> > On 30/11/23 23:19:45, Jorge Ramirez-Ortiz, Foundries wrote:
> >> On 30/11/23 23:02:15, Jorge Ramirez-Ortiz, Foundries wrote:
> >>> On 30/11/23 21:12:28, Adrian Hunter wrote:
> >>>> On 30/11/23 15:24, Jorge Ramirez-Ortiz, Foundries wrote:
> >>>>> On 30/11/23 11:34:18, Ulf Hansson wrote:
> >>>>>> On Wed, 29 Nov 2023 at 17:05, Jorge Ramirez-Ortiz <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On the eMMC SanDisk iNAND 7250 configured with HS200, requesting a
> >>>>>>> re-tune before switching to the RPMB partition would randomly cause
> >>>>>>> subsequent RPMB requests to fail with EILSEQ:
> >>>>>>> * data error -84, tigggered in __mmc_blk_ioctl_cmd()
> >>>>>>>
> >>>>>>> This commit skips the retune when switching to RPMB.
> >>>>>>> Tested over several days with per minute RPMB reads.
> >>>>>>
> >>>>>> This sounds weird to me and needs more testing/debugging in my
> >>>>>> opinion, especially at the host driver level. Perhaps add some new
> >>>>>> tests in mmc_test, that does a partition switch to/from any partition
> >>>>>> and then run regular I/O again to see if the problem is easier to
> >>>>>> reproduce?
> >>>>>
> >>>>> hi Uffe
> >>>>>
> >>>>> ok I'll have a look - I have never used this driver before, so if you
> >>>>> have anything in the works I'll be glad to integrated and adapt.
> >>>>>
> >>>>>>
> >>>>>> The point is, I wonder what is so special with RPMB here? Note that,
> >>>>>> it has been quite common that host drivers/controllers have had issues
> >>>>>> with their tuning support, so I would not be surprised if that is the
> >>>>>> case here too.
> >>>>>
> >>>>> Right, it is just that the tuning function for of-arasan is the generic
> >>>>> __sdhci_execute_tuning() - only wrapped around arasan DLL reset
> >>>>> calls. Hence why I aimed for the card: __sdhci_execute_tuning and ZynqMP
> >>>>> are not recent functions or architectures.
> >>>>>
> >>>>>
> >>>>>> Certainly I would be surprised if the problem is at
> >>>>>> the eMMC card side, but I may be wrong.
> >>>>>
> >>>>> How do maintainers test the tuning methods? is there anything else for
> >>>>> me to do other than forcing a retune with different partitions?
> >>>>>
> >>>>>>
> >>>>>> Kind regards
> >>>>>> Uffe
> >>>>>
> >>>>> For completeness this is the error message - notice that we have a
> >>>>> trusted application (fiovb) going through OP-TEE and back to the TEE
> >>>>> supplicant issuing an rpmb read of a variable (pretty normal these days,
> >>>>> we use it on many different platforms - ST, NXP, AMD/Xilinx, TI..).
> >>>>>
> >>>>> The issue on this Zynqmp platform is scarily simple to reproduce; you
> >>>>> can ignore the OP-TEE trace, it is just the TEE way of reporting that
> >>>>> the RPMB read failed.
> >>>>>
> >>>>> root@uz3cg-dwg-sec:/var/rootdirs/home/fio# fiovb_printenv m4hash
> >>>>> [ 461.775084] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
> >>>>> E/TC:? 0
> >>>>> E/TC:? 0 TA panicked with code 0xffff0000
> >>>>> E/LD: Status of TA 22250a54-0bf1-48fe-8002-7b20f1c9c9b1
> >>>>> E/LD: arch: aarch64
> >>>>> E/LD: region 0: va 0xc0004000 pa 0x7e200000 size 0x002000 flags rw-s (ldelf)
> >>>>> E/LD: region 1: va 0xc0006000 pa 0x7e202000 size 0x008000 flags r-xs (ldelf)
> >>>>> E/LD: region 2: va 0xc000e000 pa 0x7e20a000 size 0x001000 flags rw-s (ldelf)
> >>>>> E/LD: region 3: va 0xc000f000 pa 0x7e20b000 size 0x004000 flags rw-s (ldelf)
> >>>>> E/LD: region 4: va 0xc0013000 pa 0x7e20f000 size 0x001000 flags r--s
> >>>>> E/LD: region 5: va 0xc0014000 pa 0x7e22c000 size 0x005000 flags rw-s (stack)
> >>>>> E/LD: region 6: va 0xc0019000 pa 0x816b31fc8 size 0x001000 flags rw-- (param)
> >>>>> E/LD: region 7: va 0xc001a000 pa 0x816aa1fc8 size 0x002000 flags rw-- (param)
> >>>>> E/LD: region 8: va 0xc006b000 pa 0x00001000 size 0x014000 flags r-xs [0]
> >>>>> E/LD: region 9: va 0xc007f000 pa 0x00015000 size 0x008000 flags rw-s [0]
> >>>>> E/LD: [0] 22250a54-0bf1-48fe-8002-7b20f1c9c9b1 @ 0xc006b000
> >>>>> E/LD: Call stack:
> >>>>> E/LD: 0xc006de58
> >>>>> E/LD: 0xc006b388
> >>>>> E/LD: 0xc006ed40
> >>>>> E/LD: 0xc006b624
> >>>>> Read persistent value for m4hash failed: Exec format error
> >>>>
> >>>> Have you tried dynamic debug for mmc
> >>>>
> >>>> Kernel must be configured:
> >>>>
> >>>> CONFIG_DYNAMIC_DEBUG=y
> >>>>
> >>>> To enable mmc debug via sysfs:
> >>>>
> >>>> echo 'file drivers/mmc/core/* +p' > /sys/kernel/debug/dynamic_debug/control
> >>>> echo 'file drivers/mmc/host/* +p' > /sys/kernel/debug/dynamic_debug/control
> >>>>
> >>>>
> >>>
> >>> hi Adrian
> >>>
> >>> Sure, this is the output of the trace:
> >>>
> >>> [ 422.018756] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.018789] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.018817] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.018848] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.018875] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.018902] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.018932] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.020013] mmc0: sdhci: IRQ status 0x00000001
> >>> [ 422.020027] mmc0: sdhci: IRQ status 0x00000002
> >>> [ 422.020034] mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
> >>> [ 422.020054] mmc0: starting CMD13 arg 00010000 flags 00000195
> >>> [ 422.020068] mmc0: sdhci: IRQ status 0x00000001
> >>> [ 422.020076] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
> >>> [ 422.020092] <mmc0: starting CMD23 arg 00000001 flags 00000015>
> >>> [ 422.020101] mmc0: starting CMD25 arg 00000000 flags 00000035
> >>> [ 422.020108] mmc0: blksz 512 blocks 1 flags 00000100 tsac 400 ms nsac 0
> >>> [ 422.020124] mmc0: sdhci: IRQ status 0x00000001
> >>> [ 422.021671] mmc0: sdhci: IRQ status 0x00000002
> >>> [ 422.021691] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
> >>> [ 422.021700] mmc0: req done (CMD25): 0: 00000900 00000000 00000000 00000000
> >>> [ 422.021708] mmc0: 512 bytes transferred: 0
> >>> [ 422.021728] mmc0: starting CMD13 arg 00010000 flags 00000195
> >>> [ 422.021743] mmc0: sdhci: IRQ status 0x00000001
> >>> [ 422.021752] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
> >>> [ 422.021771] <mmc0: starting CMD23 arg 00000001 flags 00000015>
> >>> [ 422.021779] mmc0: starting CMD18 arg 00000000 flags 00000035
> >>> [ 422.021785] mmc0: blksz 512 blocks 1 flags 00000200 tsac 100 ms nsac 0
> >>> [ 422.021804] mmc0: sdhci: IRQ status 0x00000001
> >>> [ 422.022566] mmc0: sdhci: IRQ status 0x00208000 <---------------------------------- this doesnt seem right
> >>> [ 422.022629] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
> >>> [ 422.022639] mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
> >>> [ 422.022647] mmc0: 0 bytes transferred: -84 < --------------------------------- it should have transfered 4096 bytes
> >>> [ 422.022669] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
> >>> [ 422.029619] mmc0: starting CMD6 arg 03b30001 flags 0000049d
> >>> [ 422.029636] mmc0: sdhci: IRQ status 0x00000001
> >>> [ 422.029652] mmc0: sdhci: IRQ status 0x00000002
> >>> [ 422.029660] mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
> >>> [ 422.029680] mmc0: starting CMD13 arg 00010000 flags 00000195
> >>> [ 422.029693] mmc0: sdhci: IRQ status 0x00000001
> >>> [ 422.029702] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
> >>> [ 422.196996] <mmc0: starting CMD23 arg 00000400 flags 00000015>
> >>> [ 422.197051] mmc0: starting CMD25 arg 058160e0 flags 000000b5
> >>> [ 422.197079] mmc0: blksz 512 blocks 1024 flags 00000100 tsac 400 ms nsac 0
> >>> [ 422.197110] mmc0: CMD12 arg 00000000 flags 0000049d
> >>> [ 422.199455] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.199526] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.199585] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.199641] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.199695] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.199753] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.199811] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.199865] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.199919] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.199972] mmc0: sdhci: IRQ status 0x00000020
> >>> [ 422.200026] mmc0: sdhci: IRQ status 0x00000020
> >>>
> >>>
> >>> does this help?
> >
> > Just asking because it doesn't mean much to me other than the obvious CRC
> > problem.
> >
> > Being this issue so easy to trigger - and to fix - indicates a problem
> > on the card more than on the algorithm (otherwise faults would be all
> > over the place). But I am not an expert on this area.
> >
> > any additional suggestions welcome.
>
> My guess is that sometimes tuning produces a "bad" result. Perhaps
> the margins are very tight and the difference is only 1 tap. When
> a "bad" result happens in non-RPMB, a CRC error results in re-tuning
> and retry, so no errors are seen. When it happens in RPMB, that is
> not possible, so the error is obvious. Not re-tuning before RPMB
> switch helps because the CRC-error->re-tuning to a "good" result has
> probably already happened.
>
> However, based on that theory, it is not necessary the eMMC that is
> at fault.
>
> It may be worth considering a stronger eMMC driver strength setting.

sure I can tune the value (just building now). however I am not sure
about the implications - is there any negative consequence of increasing
this value that I could monitor (if tests pass)?
>
> sdhci supports err_stats in debugfs - that may show how many CRC
> errors there are when not accessing RPMB.

ok

>
> I don't object to skipping re-tuning before RPMB switch, but I am
> not sure about tying it to a specific eMMC.

thanks. will follow up after further testing.

>

2023-12-01 17:09:45

by Jorge Ramirez-Ortiz, Foundries

[permalink] [raw]

Subject: Re: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On 01/12/23 16:54:18, Jorge Ramirez-Ortiz, Foundries wrote:
> On 01/12/23 13:46:25, Adrian Hunter wrote:
> > On 1/12/23 09:40, Jorge Ramirez-Ortiz, Foundries wrote:
> > > On 30/11/23 23:19:45, Jorge Ramirez-Ortiz, Foundries wrote:
> > >> On 30/11/23 23:02:15, Jorge Ramirez-Ortiz, Foundries wrote:
> > >>> On 30/11/23 21:12:28, Adrian Hunter wrote:
> > >>>> On 30/11/23 15:24, Jorge Ramirez-Ortiz, Foundries wrote:
> > >>>>> On 30/11/23 11:34:18, Ulf Hansson wrote:
> > >>>>>> On Wed, 29 Nov 2023 at 17:05, Jorge Ramirez-Ortiz <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>> On the eMMC SanDisk iNAND 7250 configured with HS200, requesting a
> > >>>>>>> re-tune before switching to the RPMB partition would randomly cause
> > >>>>>>> subsequent RPMB requests to fail with EILSEQ:
> > >>>>>>> * data error -84, tigggered in __mmc_blk_ioctl_cmd()
> > >>>>>>>
> > >>>>>>> This commit skips the retune when switching to RPMB.
> > >>>>>>> Tested over several days with per minute RPMB reads.
> > >>>>>>
> > >>>>>> This sounds weird to me and needs more testing/debugging in my
> > >>>>>> opinion, especially at the host driver level. Perhaps add some new
> > >>>>>> tests in mmc_test, that does a partition switch to/from any partition
> > >>>>>> and then run regular I/O again to see if the problem is easier to
> > >>>>>> reproduce?
> > >>>>>
> > >>>>> hi Uffe
> > >>>>>
> > >>>>> ok I'll have a look - I have never used this driver before, so if you
> > >>>>> have anything in the works I'll be glad to integrated and adapt.
> > >>>>>
> > >>>>>>
> > >>>>>> The point is, I wonder what is so special with RPMB here? Note that,
> > >>>>>> it has been quite common that host drivers/controllers have had issues
> > >>>>>> with their tuning support, so I would not be surprised if that is the
> > >>>>>> case here too.
> > >>>>>
> > >>>>> Right, it is just that the tuning function for of-arasan is the generic
> > >>>>> __sdhci_execute_tuning() - only wrapped around arasan DLL reset
> > >>>>> calls. Hence why I aimed for the card: __sdhci_execute_tuning and ZynqMP
> > >>>>> are not recent functions or architectures.
> > >>>>>
> > >>>>>
> > >>>>>> Certainly I would be surprised if the problem is at
> > >>>>>> the eMMC card side, but I may be wrong.
> > >>>>>
> > >>>>> How do maintainers test the tuning methods? is there anything else for
> > >>>>> me to do other than forcing a retune with different partitions?
> > >>>>>
> > >>>>>>
> > >>>>>> Kind regards
> > >>>>>> Uffe
> > >>>>>
> > >>>>> For completeness this is the error message - notice that we have a
> > >>>>> trusted application (fiovb) going through OP-TEE and back to the TEE
> > >>>>> supplicant issuing an rpmb read of a variable (pretty normal these days,
> > >>>>> we use it on many different platforms - ST, NXP, AMD/Xilinx, TI..).
> > >>>>>
> > >>>>> The issue on this Zynqmp platform is scarily simple to reproduce; you
> > >>>>> can ignore the OP-TEE trace, it is just the TEE way of reporting that
> > >>>>> the RPMB read failed.
> > >>>>>
> > >>>>> root@uz3cg-dwg-sec:/var/rootdirs/home/fio# fiovb_printenv m4hash
> > >>>>> [ 461.775084] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
> > >>>>> E/TC:? 0
> > >>>>> E/TC:? 0 TA panicked with code 0xffff0000
> > >>>>> E/LD: Status of TA 22250a54-0bf1-48fe-8002-7b20f1c9c9b1
> > >>>>> E/LD: arch: aarch64
> > >>>>> E/LD: region 0: va 0xc0004000 pa 0x7e200000 size 0x002000 flags rw-s (ldelf)
> > >>>>> E/LD: region 1: va 0xc0006000 pa 0x7e202000 size 0x008000 flags r-xs (ldelf)
> > >>>>> E/LD: region 2: va 0xc000e000 pa 0x7e20a000 size 0x001000 flags rw-s (ldelf)
> > >>>>> E/LD: region 3: va 0xc000f000 pa 0x7e20b000 size 0x004000 flags rw-s (ldelf)
> > >>>>> E/LD: region 4: va 0xc0013000 pa 0x7e20f000 size 0x001000 flags r--s
> > >>>>> E/LD: region 5: va 0xc0014000 pa 0x7e22c000 size 0x005000 flags rw-s (stack)
> > >>>>> E/LD: region 6: va 0xc0019000 pa 0x816b31fc8 size 0x001000 flags rw-- (param)
> > >>>>> E/LD: region 7: va 0xc001a000 pa 0x816aa1fc8 size 0x002000 flags rw-- (param)
> > >>>>> E/LD: region 8: va 0xc006b000 pa 0x00001000 size 0x014000 flags r-xs [0]
> > >>>>> E/LD: region 9: va 0xc007f000 pa 0x00015000 size 0x008000 flags rw-s [0]
> > >>>>> E/LD: [0] 22250a54-0bf1-48fe-8002-7b20f1c9c9b1 @ 0xc006b000
> > >>>>> E/LD: Call stack:
> > >>>>> E/LD: 0xc006de58
> > >>>>> E/LD: 0xc006b388
> > >>>>> E/LD: 0xc006ed40
> > >>>>> E/LD: 0xc006b624
> > >>>>> Read persistent value for m4hash failed: Exec format error
> > >>>>
> > >>>> Have you tried dynamic debug for mmc
> > >>>>
> > >>>> Kernel must be configured:
> > >>>>
> > >>>> CONFIG_DYNAMIC_DEBUG=y
> > >>>>
> > >>>> To enable mmc debug via sysfs:
> > >>>>
> > >>>> echo 'file drivers/mmc/core/* +p' > /sys/kernel/debug/dynamic_debug/control
> > >>>> echo 'file drivers/mmc/host/* +p' > /sys/kernel/debug/dynamic_debug/control
> > >>>>
> > >>>>
> > >>>
> > >>> hi Adrian
> > >>>
> > >>> Sure, this is the output of the trace:
> > >>>
> > >>> [ 422.018756] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.018789] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.018817] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.018848] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.018875] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.018902] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.018932] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.020013] mmc0: sdhci: IRQ status 0x00000001
> > >>> [ 422.020027] mmc0: sdhci: IRQ status 0x00000002
> > >>> [ 422.020034] mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
> > >>> [ 422.020054] mmc0: starting CMD13 arg 00010000 flags 00000195
> > >>> [ 422.020068] mmc0: sdhci: IRQ status 0x00000001
> > >>> [ 422.020076] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
> > >>> [ 422.020092] <mmc0: starting CMD23 arg 00000001 flags 00000015>
> > >>> [ 422.020101] mmc0: starting CMD25 arg 00000000 flags 00000035
> > >>> [ 422.020108] mmc0: blksz 512 blocks 1 flags 00000100 tsac 400 ms nsac 0
> > >>> [ 422.020124] mmc0: sdhci: IRQ status 0x00000001
> > >>> [ 422.021671] mmc0: sdhci: IRQ status 0x00000002
> > >>> [ 422.021691] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
> > >>> [ 422.021700] mmc0: req done (CMD25): 0: 00000900 00000000 00000000 00000000
> > >>> [ 422.021708] mmc0: 512 bytes transferred: 0
> > >>> [ 422.021728] mmc0: starting CMD13 arg 00010000 flags 00000195
> > >>> [ 422.021743] mmc0: sdhci: IRQ status 0x00000001
> > >>> [ 422.021752] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
> > >>> [ 422.021771] <mmc0: starting CMD23 arg 00000001 flags 00000015>
> > >>> [ 422.021779] mmc0: starting CMD18 arg 00000000 flags 00000035
> > >>> [ 422.021785] mmc0: blksz 512 blocks 1 flags 00000200 tsac 100 ms nsac 0
> > >>> [ 422.021804] mmc0: sdhci: IRQ status 0x00000001
> > >>> [ 422.022566] mmc0: sdhci: IRQ status 0x00208000 <---------------------------------- this doesnt seem right
> > >>> [ 422.022629] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
> > >>> [ 422.022639] mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
> > >>> [ 422.022647] mmc0: 0 bytes transferred: -84 < --------------------------------- it should have transfered 4096 bytes
> > >>> [ 422.022669] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
> > >>> [ 422.029619] mmc0: starting CMD6 arg 03b30001 flags 0000049d
> > >>> [ 422.029636] mmc0: sdhci: IRQ status 0x00000001
> > >>> [ 422.029652] mmc0: sdhci: IRQ status 0x00000002
> > >>> [ 422.029660] mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
> > >>> [ 422.029680] mmc0: starting CMD13 arg 00010000 flags 00000195
> > >>> [ 422.029693] mmc0: sdhci: IRQ status 0x00000001
> > >>> [ 422.029702] mmc0: req done (CMD13): 0: 00000900 00000000 00000000 00000000
> > >>> [ 422.196996] <mmc0: starting CMD23 arg 00000400 flags 00000015>
> > >>> [ 422.197051] mmc0: starting CMD25 arg 058160e0 flags 000000b5
> > >>> [ 422.197079] mmc0: blksz 512 blocks 1024 flags 00000100 tsac 400 ms nsac 0
> > >>> [ 422.197110] mmc0: CMD12 arg 00000000 flags 0000049d
> > >>> [ 422.199455] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.199526] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.199585] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.199641] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.199695] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.199753] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.199811] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.199865] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.199919] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.199972] mmc0: sdhci: IRQ status 0x00000020
> > >>> [ 422.200026] mmc0: sdhci: IRQ status 0x00000020
> > >>>
> > >>>
> > >>> does this help?
> > >
> > > Just asking because it doesn't mean much to me other than the obvious CRC
> > > problem.
> > >
> > > Being this issue so easy to trigger - and to fix - indicates a problem
> > > on the card more than on the algorithm (otherwise faults would be all
> > > over the place). But I am not an expert on this area.
> > >
> > > any additional suggestions welcome.
> >
> > My guess is that sometimes tuning produces a "bad" result. Perhaps
> > the margins are very tight and the difference is only 1 tap. When
> > a "bad" result happens in non-RPMB, a CRC error results in re-tuning
> > and retry, so no errors are seen. When it happens in RPMB, that is
> > not possible, so the error is obvious. Not re-tuning before RPMB
> > switch helps because the CRC-error->re-tuning to a "good" result has
> > probably already happened.
> >
> > However, based on that theory, it is not necessary the eMMC that is
> > at fault.
> >
> > It may be worth considering a stronger eMMC driver strength setting.
>
> sure I can tune the value (just building now). however I am not sure
> about the implications - is there any negative consequence of increasing
> this value that I could monitor (if tests pass)?

ZynqMP does not set the property "fixed-emmc-driver-type" and since the
sdhci-of-arasan driver does not implement select_drive_strength() the
drive_strength setting is zero.

So AFAICS things are working accordingly - it is hard for me to say if
things should have been coded any differently.

> >
> > sdhci supports err_stats in debugfs - that may show how many CRC
> > errors there are when not accessing RPMB.
>
> ok
>
> >
> > I don't object to skipping re-tuning before RPMB switch, but I am
> > not sure about tying it to a specific eMMC.
>
> thanks. will follow up after further testing.

should I just repost the patch now skiping the retune for all cards
before switching to the RPMB partition? instead of using a quirk?

On this particular card it has now run for a couple of days so I am
confident that it addresses at the very least the symptom of the issue.

>
> >

2023-12-02 16:47:50

by Avri Altman

[permalink] [raw]

Subject: RE: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

Hi,
Sorry for joining so late - This thread was routed to my junk mail somehow.
We were observing this issue recently with one of our clients using a Broadcom platform.
Similarly like in this case, the tuning process didn't use cmd21, so sending only cmd6 is perfectly ok.
We couldn't find any issue with the device at the time.
During our investigation, it turned out that the client had a kernel hack of its own,
and once it was removed the issue wasn't reproducing anymore.

> > > >>> hi Adrian
> > > >>>
> > > >>> Sure, this is the output of the trace:
> > > >>>
> > > >>> [ 422.018756] mmc0: sdhci: IRQ status 0x00000020 [ 422.018789]
> > > >>> mmc0: sdhci: IRQ status 0x00000020 [ 422.018817] mmc0: sdhci:
> > > >>> IRQ status 0x00000020 [ 422.018848] mmc0: sdhci: IRQ status
> > > >>> 0x00000020 [ 422.018875] mmc0: sdhci: IRQ status 0x00000020 [
> > > >>> 422.018902] mmc0: sdhci: IRQ status 0x00000020 [ 422.018932]
> > > >>> mmc0: sdhci: IRQ status 0x00000020 [ 422.020013] mmc0: sdhci:
> > > >>> IRQ status 0x00000001 [ 422.020027] mmc0: sdhci: IRQ status
> > > >>> 0x00000002 [ 422.020034] mmc0: req done (CMD6): 0: 00000800
> > > >>> 00000000 00000000 00000000 [ 422.020054] mmc0: starting CMD13
> > > >>> arg 00010000 flags 00000195 [ 422.020068] mmc0: sdhci: IRQ
> > > >>> status 0x00000001 [ 422.020076] mmc0: req done (CMD13): 0:
> > > >>> 00000900 00000000 00000000 00000000 [ 422.020092] <mmc0:
> > > >>> starting CMD23 arg 00000001 flags 00000015> [ 422.020101] mmc0:
> > > >>> starting CMD25 arg 00000000 flags 00000035
> > > >>> [ 422.020108] mmc0: blksz 512 blocks 1 flags 00000100 tsac 400 ms
> nsac 0
> > > >>> [ 422.020124] mmc0: sdhci: IRQ status 0x00000001 [ 422.021671]
> > > >>> mmc0: sdhci: IRQ status 0x00000002 [ 422.021691] mmc0: req done
> > > >>> <CMD23>: 0: 00000000 00000000 00000000 00000000 [ 422.021700]
> > > >>> mmc0: req done (CMD25): 0: 00000900 00000000 00000000
> 00000000
> > > >>> [ 422.021708] mmc0: 512 bytes transferred: 0
> > > >>> [ 422.021728] mmc0: starting CMD13 arg 00010000 flags 00000195
> > > >>> [ 422.021743] mmc0: sdhci: IRQ status 0x00000001 [ 422.021752]
> > > >>> mmc0: req done (CMD13): 0: 00000900 00000000 00000000
> 00000000 [
> > > >>> 422.021771] <mmc0: starting CMD23 arg 00000001 flags 00000015> [
> > > >>> 422.021779] mmc0: starting CMD18 arg 00000000 flags 00000035
> > > >>> [ 422.021785] mmc0: blksz 512 blocks 1 flags 00000200 tsac 100 ms
> nsac 0
> > > >>> [ 422.021804] mmc0: sdhci: IRQ status 0x00000001 [ 422.022566]
> > > >>> mmc0: sdhci: IRQ status 0x00208000
> > > >>> <---------------------------------- this doesnt seem right [
Why not?
Its cmd25-cmd25-cmd18 which implies rpmb write?

> > > >>> 422.022629] mmc0: req done <CMD23>: 0: 00000000 00000000
> 00000000 00000000 [ 422.022639] mmc0: req done (CMD18): 0: 00000900
> 00000000 00000000 00000000
> > > >>> [ 422.022647] mmc0: 0 bytes transferred: -84 < -------------------------
> -------- it should have transfered 4096 bytes
> > > >>> [ 422.022669] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd:
> > > >>> data error -84 [ 422.029619] mmc0: starting CMD6 arg 03b30001
> > > >>> flags 0000049d [ 422.029636] mmc0: sdhci: IRQ status 0x00000001
> > > >>> [ 422.029652] mmc0: sdhci: IRQ status 0x00000002 [ 422.029660]
> > > >>> mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
> [
> > > >>> 422.029680] mmc0: starting CMD13 arg 00010000 flags 00000195 [
> > > >>> 422.029693] mmc0: sdhci: IRQ status 0x00000001 [ 422.029702]
> > > >>> mmc0: req done (CMD13): 0: 00000900 00000000 00000000
> 00000000 [
> > > >>> 422.196996] <mmc0: starting CMD23 arg 00000400 flags 00000015> [
> > > >>> 422.197051] mmc0: starting CMD25 arg 058160e0 flags 000000b5
> > > >>> [ 422.197079] mmc0: blksz 512 blocks 1024 flags 00000100 tsac 400
> ms nsac 0
> > > >>> [ 422.197110] mmc0: CMD12 arg 00000000 flags 0000049d
> > > >>> [ 422.199455] mmc0: sdhci: IRQ status 0x00000020 [ 422.199526]
> > > >>> mmc0: sdhci: IRQ status 0x00000020 [ 422.199585] mmc0: sdhci:
> > > >>> IRQ status 0x00000020 [ 422.199641] mmc0: sdhci: IRQ status
> > > >>> 0x00000020 [ 422.199695] mmc0: sdhci: IRQ status 0x00000020 [
> > > >>> 422.199753] mmc0: sdhci: IRQ status 0x00000020 [ 422.199811]
> > > >>> mmc0: sdhci: IRQ status 0x00000020 [ 422.199865] mmc0: sdhci:
> > > >>> IRQ status 0x00000020 [ 422.199919] mmc0: sdhci: IRQ status
> > > >>> 0x00000020 [ 422.199972] mmc0: sdhci: IRQ status 0x00000020 [
> > > >>> 422.200026] mmc0: sdhci: IRQ status 0x00000020
> > > >>>
> > > >>>
> > > >>> does this help?
> > > >
> > > > Just asking because it doesn't mean much to me other than the
> > > > obvious CRC problem.
> > > >
> > > > Being this issue so easy to trigger - and to fix - indicates a
> > > > problem on the card more than on the algorithm (otherwise faults
> > > > would be all over the place). But I am not an expert on this area.
> > > >
> > > > any additional suggestions welcome.
> > >
> > > My guess is that sometimes tuning produces a "bad" result. Perhaps
> > > the margins are very tight and the difference is only 1 tap. When a
> > > "bad" result happens in non-RPMB, a CRC error results in re-tuning
> > > and retry, so no errors are seen. When it happens in RPMB, that is
> > > not possible, so the error is obvious. Not re-tuning before RPMB
> > > switch helps because the CRC-error->re-tuning to a "good" result has
> > > probably already happened.
> > >
> > > However, based on that theory, it is not necessary the eMMC that is
> > > at fault.
> > >
> > > It may be worth considering a stronger eMMC driver strength setting.
> >
> > sure I can tune the value (just building now). however I am not sure
> > about the implications - is there any negative consequence of
> > increasing this value that I could monitor (if tests pass)?
>
> ZynqMP does not set the property "fixed-emmc-driver-type" and since the
> sdhci-of-arasan driver does not implement select_drive_strength() the
> drive_strength setting is zero.
>
> So AFAICS things are working accordingly - it is hard for me to say if things
> should have been coded any differently.
>
> > >
> > > sdhci supports err_stats in debugfs - that may show how many CRC
> > > errors there are when not accessing RPMB.
> >
> > ok
> >
> > >
> > > I don't object to skipping re-tuning before RPMB switch, but I am
> > > not sure about tying it to a specific eMMC.
> >
> > thanks. will follow up after further testing.
>
> should I just repost the patch now skiping the retune for all cards before
> switching to the RPMB partition? instead of using a quirk?
>
> On this particular card it has now run for a couple of days so I am confident
> that it addresses at the very least the symptom of the issue.
As aforesaid, we observed a similar issue on a different platform as well.
If it's not realistic to further pursue Adrian's theory, *And* this doesn't reproduce on other cards,
we have no objection setting the quirk for Sandisk.
(if you're having trouble testing other cards ping me privately I can help you with that).

Thanks a lot for fixing this,
Avri

(btw - yes - our manufacturer id is 0x45 - it is set differently in the mmc driver for historic reasons -
Thank you for adding this.)

>
>
> >
> > >
>

2023-12-03 16:27:39

by Jorge Ramirez-Ortiz, Foundries

[permalink] [raw]

Subject: Re: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On 02/12/23 16:47:23, Avri Altman wrote:
> Hi,
> Sorry for joining so late - This thread was routed to my junk mail somehow.
> We were observing this issue recently with one of our clients using a Broadcom platform.
> Similarly like in this case, the tuning process didn't use cmd21, so sending only cmd6 is perfectly ok.
> We couldn't find any issue with the device at the time.
> During our investigation, it turned out that the client had a kernel hack of its own,
> and once it was removed the issue wasn't reproducing anymore.

um, do you know what driver or setting could have be caused the issue?

This product is using the Xilinx kernel. 5.15.64
https://github.com/Xilinx/linux-xlnx

>
> > > > >>> hi Adrian
> > > > >>>
> > > > >>> Sure, this is the output of the trace:
> > > > >>>
> > > > >>> [ 422.018756] mmc0: sdhci: IRQ status 0x00000020 [ 422.018789]
> > > > >>> mmc0: sdhci: IRQ status 0x00000020 [ 422.018817] mmc0: sdhci:
> > > > >>> IRQ status 0x00000020 [ 422.018848] mmc0: sdhci: IRQ status
> > > > >>> 0x00000020 [ 422.018875] mmc0: sdhci: IRQ status 0x00000020 [
> > > > >>> 422.018902] mmc0: sdhci: IRQ status 0x00000020 [ 422.018932]
> > > > >>> mmc0: sdhci: IRQ status 0x00000020 [ 422.020013] mmc0: sdhci:
> > > > >>> IRQ status 0x00000001 [ 422.020027] mmc0: sdhci: IRQ status
> > > > >>> 0x00000002 [ 422.020034] mmc0: req done (CMD6): 0: 00000800
> > > > >>> 00000000 00000000 00000000 [ 422.020054] mmc0: starting CMD13
> > > > >>> arg 00010000 flags 00000195 [ 422.020068] mmc0: sdhci: IRQ
> > > > >>> status 0x00000001 [ 422.020076] mmc0: req done (CMD13): 0:
> > > > >>> 00000900 00000000 00000000 00000000 [ 422.020092] <mmc0:
> > > > >>> starting CMD23 arg 00000001 flags 00000015> [ 422.020101] mmc0:
> > > > >>> starting CMD25 arg 00000000 flags 00000035
> > > > >>> [ 422.020108] mmc0: blksz 512 blocks 1 flags 00000100 tsac 400 ms
> > nsac 0
> > > > >>> [ 422.020124] mmc0: sdhci: IRQ status 0x00000001 [ 422.021671]
> > > > >>> mmc0: sdhci: IRQ status 0x00000002 [ 422.021691] mmc0: req done
> > > > >>> <CMD23>: 0: 00000000 00000000 00000000 00000000 [ 422.021700]
> > > > >>> mmc0: req done (CMD25): 0: 00000900 00000000 00000000
> > 00000000
> > > > >>> [ 422.021708] mmc0: 512 bytes transferred: 0
> > > > >>> [ 422.021728] mmc0: starting CMD13 arg 00010000 flags 00000195
> > > > >>> [ 422.021743] mmc0: sdhci: IRQ status 0x00000001 [ 422.021752]
> > > > >>> mmc0: req done (CMD13): 0: 00000900 00000000 00000000
> > 00000000 [
> > > > >>> 422.021771] <mmc0: starting CMD23 arg 00000001 flags 00000015> [
> > > > >>> 422.021779] mmc0: starting CMD18 arg 00000000 flags 00000035
> > > > >>> [ 422.021785] mmc0: blksz 512 blocks 1 flags 00000200 tsac 100 ms
> > nsac 0
> > > > >>> [ 422.021804] mmc0: sdhci: IRQ status 0x00000001 [ 422.022566]
> > > > >>> mmc0: sdhci: IRQ status 0x00208000
> > > > >>> <---------------------------------- this doesnt seem right [
> Why not?
> Its cmd25-cmd25-cmd18 which implies rpmb write?

sorry I am referring to the IRQ status 0x00208000 (CRC errors) - not
the sequence.

>
> > > > >>> 422.022629] mmc0: req done <CMD23>: 0: 00000000 00000000
> > 00000000 00000000 [ 422.022639] mmc0: req done (CMD18): 0: 00000900
> > 00000000 00000000 00000000
> > > > >>> [ 422.022647] mmc0: 0 bytes transferred: -84 < -------------------------
> > -------- it should have transfered 4096 bytes
> > > > >>> [ 422.022669] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd:
> > > > >>> data error -84 [ 422.029619] mmc0: starting CMD6 arg 03b30001
> > > > >>> flags 0000049d [ 422.029636] mmc0: sdhci: IRQ status 0x00000001
> > > > >>> [ 422.029652] mmc0: sdhci: IRQ status 0x00000002 [ 422.029660]
> > > > >>> mmc0: req done (CMD6): 0: 00000800 00000000 00000000 00000000
> > [
> > > > >>> 422.029680] mmc0: starting CMD13 arg 00010000 flags 00000195 [
> > > > >>> 422.029693] mmc0: sdhci: IRQ status 0x00000001 [ 422.029702]
> > > > >>> mmc0: req done (CMD13): 0: 00000900 00000000 00000000
> > 00000000 [
> > > > >>> 422.196996] <mmc0: starting CMD23 arg 00000400 flags 00000015> [
> > > > >>> 422.197051] mmc0: starting CMD25 arg 058160e0 flags 000000b5
> > > > >>> [ 422.197079] mmc0: blksz 512 blocks 1024 flags 00000100 tsac 400
> > ms nsac 0
> > > > >>> [ 422.197110] mmc0: CMD12 arg 00000000 flags 0000049d
> > > > >>> [ 422.199455] mmc0: sdhci: IRQ status 0x00000020 [ 422.199526]
> > > > >>> mmc0: sdhci: IRQ status 0x00000020 [ 422.199585] mmc0: sdhci:
> > > > >>> IRQ status 0x00000020 [ 422.199641] mmc0: sdhci: IRQ status
> > > > >>> 0x00000020 [ 422.199695] mmc0: sdhci: IRQ status 0x00000020 [
> > > > >>> 422.199753] mmc0: sdhci: IRQ status 0x00000020 [ 422.199811]
> > > > >>> mmc0: sdhci: IRQ status 0x00000020 [ 422.199865] mmc0: sdhci:
> > > > >>> IRQ status 0x00000020 [ 422.199919] mmc0: sdhci: IRQ status
> > > > >>> 0x00000020 [ 422.199972] mmc0: sdhci: IRQ status 0x00000020 [
> > > > >>> 422.200026] mmc0: sdhci: IRQ status 0x00000020
> > > > >>>
> > > > >>>
> > > > >>> does this help?
> > > > >
> > > > > Just asking because it doesn't mean much to me other than the
> > > > > obvious CRC problem.
> > > > >
> > > > > Being this issue so easy to trigger - and to fix - indicates a
> > > > > problem on the card more than on the algorithm (otherwise faults
> > > > > would be all over the place). But I am not an expert on this area.
> > > > >
> > > > > any additional suggestions welcome.
> > > >
> > > > My guess is that sometimes tuning produces a "bad" result. Perhaps
> > > > the margins are very tight and the difference is only 1 tap. When a
> > > > "bad" result happens in non-RPMB, a CRC error results in re-tuning
> > > > and retry, so no errors are seen. When it happens in RPMB, that is
> > > > not possible, so the error is obvious. Not re-tuning before RPMB
> > > > switch helps because the CRC-error->re-tuning to a "good" result has
> > > > probably already happened.
> > > >
> > > > However, based on that theory, it is not necessary the eMMC that is
> > > > at fault.
> > > >
> > > > It may be worth considering a stronger eMMC driver strength setting.
> > >
> > > sure I can tune the value (just building now). however I am not sure
> > > about the implications - is there any negative consequence of
> > > increasing this value that I could monitor (if tests pass)?
> >
> > ZynqMP does not set the property "fixed-emmc-driver-type" and since the
> > sdhci-of-arasan driver does not implement select_drive_strength() the
> > drive_strength setting is zero.
> >
> > So AFAICS things are working accordingly - it is hard for me to say if things
> > should have been coded any differently.
> >
> > > >
> > > > sdhci supports err_stats in debugfs - that may show how many CRC
> > > > errors there are when not accessing RPMB.
> > >
> > > ok
> > >
> > > >
> > > > I don't object to skipping re-tuning before RPMB switch, but I am
> > > > not sure about tying it to a specific eMMC.
> > >
> > > thanks. will follow up after further testing.
> >
> > should I just repost the patch now skiping the retune for all cards before
> > switching to the RPMB partition? instead of using a quirk?
> >
> > On this particular card it has now run for a couple of days so I am confident
> > that it addresses at the very least the symptom of the issue.
> As aforesaid, we observed a similar issue on a different platform as well.
> If it's not realistic to further pursue Adrian's theory, *And* this doesn't reproduce on other cards,
> we have no objection setting the quirk for Sandisk.
> (if you're having trouble testing other cards ping me privately I can help you with that).

I have some extra eMMC cards which I used to validate RPMB on the OP-TEE
port I did for AMD/Xilinx Versal ACAP some time ago and which I maintain upstream:
https://optee.readthedocs.io/en/latest/building/devices/versal.html?highlight=versal

However I cant use them on this hardware - there is not a uSD slot, just
USB. And from what I can see, RPMB doesnt get mapped when the device is
mounted as a mass storage device (unless there is a way that I dont
know?). Other than that I am not sure what to propose. Suggestions
welcome.

Versal uses the same sdhci-of-arasan driver but with some diffences:
https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18842090/SD+controller?responseToken=4bd005c7902a3dbd9ecb032f02e52ccb
This particular issue can not be reproduced on that platform.

It also didn't ever trigger in any of the other platforms I have
worked with and supported during the last four years (STM32MP1, NXP
(iMX8, iMX7, iMX6), etc). And we have heard of no customers complaining
about upgrade issues.

Being RPMB critical for our security story - device firmware
verification and upgrade - we would have noticed. So this one is the
first time we have had troubles accessing RPMB - incidentally blocking a
product launch and causing a bit of pain.
https://docs.foundries.io/latest/reference-manual/security/secure-machines.html?highlight=rpmb#accessing-secure-storage

We could carry the patch internally (it seems harmless after all the
testing done) but I'd much rather land it upstream if possible.

>
> Thanks a lot for fixing this,
> Avri

thanks everyone for the support.

>
> (btw - yes - our manufacturer id is 0x45 - it is set differently in the mmc driver for historic reasons -
> Thank you for adding this.)
>
> >
> >
> > >
> > > >
> >
>

2023-12-04 13:58:28

by Jorge Ramirez-Ortiz, Foundries

[permalink] [raw]

Subject: Re: [PATCHv2] mmc: rpmb: add quirk MMC_QUIRK_BROKEN_RPMB_RETUNE

On 04/12/23 12:59:07, Avri Altman wrote:
> > > We could carry the patch internally (it seems harmless after all the
> > > testing
> > > done) but I'd much rather land it upstream if possible.
> > Agreed.
> Also, I am totally fine, and maybe it's even better, with adding this as a default behavior for all vendors.
> I see no point in those tunings while accessing rpmb anyway.
>

I'll repost then without the quirk.

In fact just this morning - I was about to update - I was able to test
on an early board revision with a different eMMC module (at the time
RPMB wasnt validated) which upon reflashing also exhibited the issue

see below:

# cat /sys/class/mmc_host/mmc0/mmc0\:0001/csd
d02700328f5903ffffffffef86400000

# cat /sys/class/mmc_host/mmc0/mmc0\:0001/name
064GB2

[ 218.759036] sdhci-arasan ff160000.mmc: __mmc_blk_ioctl_cmd: data error -84
E/TC:? 0
E/TC:? 0 TA panicked with code 0xffff0000
E/LD: Status of TA 22250a54-0bf1-48fe-8002-7b20f1c9c9b1
E/LD: arch: aarch64
E/LD: region 0: va 0xc0004000 pa 0x7e200000 size 0x002000 flags rw-s (ldelf)
E/LD: region 1: va 0xc0006000 pa 0x7e202000 size 0x008000 flags r-xs (ldelf)
E/LD: region 2: va 0xc000e000 pa 0x7e20a000 size 0x001000 flags rw-s (ldelf)
E/LD: region 3: va 0xc000f000 pa 0x7e20b000 size 0x004000 flags rw-s (ldelf)
E/LD: region 4: va 0xc0013000 pa 0x7e20f000 size 0x001000 flags r--s
E/LD: region 5: va 0xc0014000 pa 0x7e22c000 size 0x005000 flags rw-s (stack)
E/LD: region 6: va 0xc0019000 pa 0x80dc4e298 size 0x002000 flags rw-- (param)
E/LD: region 7: va 0xc001b000 pa 0x80dd1c298 size 0x001000 flags rw-- (param)
E/LD: region 8: va 0xc0091000 pa 0x00001000 size 0x014000 flags r-xs [0]
E/LD: region 9: va 0xc00a5000 pa 0x00015000 size 0x008000 flags rw-s [0]
E/LD: [0] 22250a54-0bf1-48fe-8002-7b20f1c9c9b1 @ 0xc0091000
E/LD: Call stack:
E/LD: 0xc0093a14
E/LD: 0xc009131c
E/LD: 0xc0094d40
E/LD: 0xc0091624
Read persistent value for bootupgrade_available failed: Exec format error
Cant print the environment
Error: fiovb_printenv command failed. Exiting.

> Thanks,
> Avri
>
> >
> > Thanks,
> > Avri
> > >
> > > >
> > > > Thanks a lot for fixing this,
> > > > Avri
> > >
> > > thanks everyone for the support.
> > >
> > > >
> > > > (btw - yes - our manufacturer id is 0x45 - it is set differently in
> > > > the mmc driver for historic reasons - Thank you for adding this.)
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > >
> > > >