From: Claudiu Beznea <[email protected]>
On recent kernel revisions it has been noticed (on an RZ/G3S system) that
when booting Linux with the root file system on eMMC, at some point in
the boot process, when the systemd applications are started, the
"mmc0: tuning execution failed: -5" message is displayed on the console.
On kernel v6.7-rc5 this is reproducible in 90% of the boots. It was not
seen on the same system with kernel v6.5.0-rc1. It was also noticed on
kernel revisions v6.6-rcX on an RZ/G2UL based system but not on the kernel
this fix is based on (v6.7-rc5).
Investigating it on RZ/G3S led to the conclusion that every time the issue
is reproduced all the probed TAPs are OK. According to the datasheet, when
this happens the change point of data needs to be considered for tuning.
Previous code based its change point decision on whether the content
of the SMPCMP register is zero. According to the RZ/V2M hardware manual,
chapter "Change Point of the Input Data" (as this is the clearest
description that I've found about the change point of the input data and all
RZ hardware manuals are similar on this chapter), at the time of tuning,
data is captured by the previous and next TAPs and the result is stored in
the SMPCMP register (previous TAP in bits 22..16, next TAP in bits 7..0).
If there is a mismatch between the previous and the next TAPs, it indicates
that there is a change point of the input data.
To comply with this, the patch checks if this mismatch is present and
updates the priv->smpcmp mask only if it is not. Previous code checked if
the value of SMPCMP register was zero. However, on RZ/G3S, this leads to
failures, as the following may happen, e.g.:
CMPNGU=0x0e, CMPNGD=0x0e, SMPCMP=0x000e000e.
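To make the difference between the two checks concrete, here is a minimal
userspace sketch (not driver code). The field positions follow the
GENMASK(23, 16) and GENMASK(7, 0) masks added by this patch; the helper
names smpcmp_cmpngu(), smpcmp_cmpngd(), old_check() and v3_check() are
hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field positions mirror the masks in the patch (assumed, userspace copy). */
#define SMPCMP_CMPNGU_SHIFT 16
#define SMPCMP_CMPNGD_SHIFT 0
#define SMPCMP_DATA_MASK    0xffu

/* Extract the CMPNGU data field (bits 23..16). */
uint32_t smpcmp_cmpngu(uint32_t val)
{
	return (val >> SMPCMP_CMPNGU_SHIFT) & SMPCMP_DATA_MASK;
}

/* Extract the CMPNGD data field (bits 7..0). */
uint32_t smpcmp_cmpngd(uint32_t val)
{
	return (val >> SMPCMP_CMPNGD_SHIFT) & SMPCMP_DATA_MASK;
}

/* Previous criterion: the whole register must read zero. */
bool old_check(uint32_t val)
{
	return val == 0;
}

/* Criterion used by this patch: both data fields must match. */
bool v3_check(uint32_t val)
{
	return smpcmp_cmpngu(val) == smpcmp_cmpngd(val);
}
```

For SMPCMP=0x000e000e both fields decode to 0x0e, so old_check() rejects
the value while v3_check() accepts it.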
Along with it, as mmc_send_tuning() may return with an error even before the
MMC command reaches the controller (and because at that point cmd_error = 0),
the update of the priv->smpcmp mask is done only if the return value of
mmc_send_tuning(mmc, opcode, &cmd_error) is 0 (success).
This change has been checked on the devices with the following DTSes by
doing 100 consecutive boots and checking for the tuning failure message:
- r9a08g045s33-smarc.dts
- r8a7742-iwg21d-q7.dts
- r8a7743-iwg20d-q7.dts
- r8a7744-iwg20d-q7.dts
- r8a7745-iwg22d-sodimm.dts
- r8a77470-iwg23s-sbc.dts
- r8a774a1-hihope-rzg2m-ex.dts
- r8a774b1-hihope-rzg2n-ex.dts
- r8a774c0-ek874.dts
- r8a774e1-hihope-rzg2h-ex.dts
- r9a07g043u11-smarc-rzg2ul.dts
- r9a07g044c2-smarc-rzg2lc.dts
- r9a07g044l2-smarc-rzg2l.dts
- r9a07g054l2-smarc-rzv2l.dts
Fixes: 5fb6bf51f6d1 ("mmc: renesas_sdhi: improve TAP selection if all TAPs are good")
Signed-off-by: Claudiu Beznea <[email protected]>
---
Changes in v3:
- set priv->smpcmp if cmpngu_data == cmpngd_data and return code of
mmc_send_tuning() is zero
- removed workaround introduced previously in
renesas_sdhi_select_tuning() as it is not needed with the code from v3
- update patch description
Changes in v2:
- read the SH_MOBILE_SDHI_SCC_SMPCMP register only on success path of
mmc_send_tuning()
drivers/mmc/host/renesas_sdhi_core.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/drivers/mmc/host/renesas_sdhi_core.c b/drivers/mmc/host/renesas_sdhi_core.c
index c675dec587ef..8871521e1274 100644
--- a/drivers/mmc/host/renesas_sdhi_core.c
+++ b/drivers/mmc/host/renesas_sdhi_core.c
@@ -18,6 +18,7 @@
*
*/
+#include <linux/bitfield.h>
#include <linux/clk.h>
#include <linux/delay.h>
#include <linux/iopoll.h>
@@ -312,6 +313,8 @@ static int renesas_sdhi_start_signal_voltage_switch(struct mmc_host *mmc,
#define SH_MOBILE_SDHI_SCC_SMPCMP_CMD_REQDOWN BIT(8)
#define SH_MOBILE_SDHI_SCC_SMPCMP_CMD_REQUP BIT(24)
#define SH_MOBILE_SDHI_SCC_SMPCMP_CMD_ERR (BIT(8) | BIT(24))
+#define SH_MOBILE_SDHI_SCC_SMPCMP_CMPNGU_DATA GENMASK(23, 16)
+#define SH_MOBILE_SDHI_SCC_SMPCMP_CMPNGD_DATA GENMASK(7, 0)
#define SH_MOBILE_SDHI_SCC_TMPPORT2_HS400OSEL BIT(4)
#define SH_MOBILE_SDHI_SCC_TMPPORT2_HS400EN BIT(31)
@@ -703,11 +706,18 @@ static int renesas_sdhi_execute_tuning(struct mmc_host *mmc, u32 opcode)
/* Set sampling clock position */
sd_scc_write32(host, priv, SH_MOBILE_SDHI_SCC_TAPSET, i % priv->tap_num);
- if (mmc_send_tuning(mmc, opcode, &cmd_error) == 0)
- set_bit(i, priv->taps);
+ if (mmc_send_tuning(mmc, opcode, &cmd_error) == 0) {
+ u32 val, cmpngu_data, cmpngd_data;
+
+ val = sd_scc_read32(host, priv, SH_MOBILE_SDHI_SCC_SMPCMP);
+ cmpngu_data = FIELD_GET(SH_MOBILE_SDHI_SCC_SMPCMP_CMPNGU_DATA, val);
+ cmpngd_data = FIELD_GET(SH_MOBILE_SDHI_SCC_SMPCMP_CMPNGD_DATA, val);
+
+ if (cmpngu_data == cmpngd_data)
+ set_bit(i, priv->smpcmp);
- if (sd_scc_read32(host, priv, SH_MOBILE_SDHI_SCC_SMPCMP) == 0)
- set_bit(i, priv->smpcmp);
+ set_bit(i, priv->taps);
+ }
if (cmd_error)
mmc_send_abort_tuning(mmc, opcode);
--
2.39.2
Hi Claudiu,
thanks for the updated version!
> To comply with this, the patch checks if this mismatch is present and
> updates the priv->smpcmp mask only if it is not. Previous code checked if
> the value of SMPCMP register was zero. However, on RZ/G3S, this leads to
> failures as it may happen, e.g., the following:
> CMPNGU=0x0e, CMPNGD=0x0e, SMPCMP=0x000e000e.
Can you add the current TAP number (variable 'i') to this printout?
According to my understanding, we should only mark this TAP good if it
is in the range 5-7. I need to double check with Renesas, though.
> Along with it, as mmc_send_tuning() may return with error even before the
> MMC command reach the controller (and because at that point cmd_error = 0),
> the update of priv->smpcmp mask has been done only if the return value of
> mmc_send_tuning(mmc, opcode, &cmd_error) is 0 (success).
This is a needed change, for sure.
> This change has been checked on the devices with the following DTSes by
> doing 100 consecutive boots and checking for the tuning failure message:
Boot failure is one test. Read/write tests should be another, I think.
Because if we select a bad TAP, bad things might happen later. To reduce
the amount of testing, read/write testing could only be triggered if the
new code path was executed?
Happy hacking,
Wolfram
Hi, Wolfram,
On 05.02.2024 15:07, Wolfram Sang wrote:
> Hi Claudiu,
>
> thanks for the updated version!
>
>> To comply with this, the patch checks if this mismatch is present and
>> updates the priv->smpcmp mask only if it is not. Previous code checked if
>> the value of SMPCMP register was zero. However, on RZ/G3S, this leads to
>> failures as it may happen, e.g., the following:
>> CMPNGU=0x0e, CMPNGD=0x0e, SMPCMP=0x000e000e.
>
> Can you add the current TAP number (variable 'i') to this printout?
This is a snapshot I have saved from my previous debugging session (but I
tried here to check the values of cmpngd, cmpngu):
i=0, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=1, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=2, cmpngu=0000000e, cmpngd=0000000e, smpcmp=000e000e
i=3, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=4, cmpngu=00000000, cmpngd=00000002, smpcmp=00000002
i=5, cmpngu=00000000, cmpngd=000000ff, smpcmp=000001ff
i=6, cmpngu=000000ff, cmpngd=00000000, smpcmp=01ff0000
i=7, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=8, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=9, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=10, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=11, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=12, cmpngu=00000000, cmpngd=00000002, smpcmp=00000002
i=13, cmpngu=00000000, cmpngd=000000ff, smpcmp=000001ff
i=14, cmpngu=000000ff, cmpngd=00000000, smpcmp=01ff0000
i=15, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
This is printed in this for loop:
https://elixir.bootlin.com/linux/latest/source/drivers/mmc/host/renesas_sdhi_core.c#L700
> According to my understanding, we should only mark this TAP good if it
> is in the range 5-7. I need to double check with Renesas, though.
OK, my understanding is that it should be in the middle (beginning being
the tap that triggered change point of the input data, end being the next
tap with the same ID). This is what I understand from this: "As the width
of the input data is 1 (UI), select TAP6 or TAP7 which is
*the median* of next TAP3 from TAP3."
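As a small worked example of how I read that sentence (an assumption, not
something the manual spells out as a formula): with tap_num TAPs per clock
period, the median between a change point TAP and its next occurrence is
half a period away. The function name median_tap() is hypothetical.

```c
/*
 * Return the TAP half a period away from the change point TAP.
 * Assumes tap_num is non-zero and TAPs are numbered 0..tap_num-1.
 */
unsigned int median_tap(unsigned int change_tap, unsigned int tap_num)
{
	return (change_tap + tap_num / 2) % tap_num;
}
```

With a change point at TAP3 and 8 TAPs this gives TAP7, which is one of
the two TAPs (TAP6 or TAP7) named in the quoted sentence.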
>
>> Along with it, as mmc_send_tuning() may return with error even before the
>> MMC command reach the controller (and because at that point cmd_error = 0),
>> the update of priv->smpcmp mask has been done only if the return value of
>> mmc_send_tuning(mmc, opcode, &cmd_error) is 0 (success).
>
> This is a needed change, for sure.
>
>> This change has been checked on the devices with the following DTSes by
>> doing 100 consecutive boots and checking for the tuning failure message:
>
> Boot failure is one test. Read/write tests should be another, I think.
OK, I'll try also read/write. Do you have in mind something particular?
> Because if we select a bad TAP, bad things might happen later. To reduce
> the amount of testing, read/write testing could only be triggered if the
> new code path was executed?
I'm not sure how to trigger that (or maybe I haven't understood your
statement...)
Thank you,
Claudiu Beznea
>
> Happy hacking,
>
> Wolfram
>
> > According to my understanding, we should only mark this TAP good if it
> > is in the range 5-7. I need to double check with Renesas, though.
>
> OK, my understanding is that it should be in the middle (beginning being
> the tap that triggered change point of the input data, end being the next
> tap with the same ID). This is what I understand from this: "As the width
> of the input data is 1 (UI), select TAP6 or TAP7 which is
>
> *the median* of next TAP3 from TAP3."
Yes, I agree. With 0x0e, that means TAP1+2+3 are changing points and we
should be far away from them, like 5-7.
But: I am still waiting for Renesas to answer my questions regarding
SMPCMP. I'd like to get that first, so we have clear facts then.
> > Boot failure is one test. Read/write tests should be another, I think.
>
> OK, I'll try also read/write. Do you have in mind something particular?
Nope. Just consistency checks.
> > Because if we select a bad TAP, bad things might happen later. To reduce
> > the amount of testing, read/write testing could only be triggered if the
> > new code path was executed?
>
> I'm not sure how to trigger that (or maybe I haven't understood your
> statement...)
I thought something along the lines of:
- print out when you needed SMPCMP to select a TAP
- check the log for that printout
- if (printout) do read_write_tests
Dunno if that makes sense with your test setup.
Hi, Wolfram,
On 05.02.2024 16:51, Wolfram Sang wrote:
>
>>> According to my understanding, we should only mark this TAP good if it
>>> is in the range 5-7. I need to double check with Renesas, though.
>>
>> OK, my understanding is that it should be in the middle (beginning being
>> the tap that triggered change point of the input data, end being the next
>> tap with the same ID). This is what I understand from this: "As the width
>> of the input data is 1 (UI), select TAP6 or TAP7 which is
>>
>> *the median* of next TAP3 from TAP3."
>
> Yes, I agree. With 0x0e, that means TAP1+2+3 are changing points and we
> should be far away from them, like 5-7.
As of my understanding the TAP where cmpngu = 0x0e and cmpngd=0x0e is not
considered change point of the input data. For that to happen it would mean
that cmpngu != cmpngd.
From this snapshot, datasheet and our discussions:
i=0, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=1, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=2, cmpngu=0000000e, cmpngd=0000000e, smpcmp=000e000e
i=3, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
*i=4, cmpngu=00000000, cmpngd=00000002, smpcmp=00000002*
*i=5, cmpngu=00000000, cmpngd=000000ff, smpcmp=000001ff*
*i=6, cmpngu=000000ff, cmpngd=00000000, smpcmp=01ff0000*
i=7, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=8, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=9, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=10, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
i=11, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
*i=12, cmpngu=00000000, cmpngd=00000002, smpcmp=00000002*
*i=13, cmpngu=00000000, cmpngd=000000ff, smpcmp=000001ff*
*i=14, cmpngu=000000ff, cmpngd=00000000, smpcmp=01ff0000*
i=15, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
I understand that TAP4,5,6 are change point of the input data and
TAP8,0,1,2,3 are candidates for being selected, TAP 1,2 being the best
(please correct me if I'm wrong).
>
> But: I am still waiting for Renesas to answer my questions regarding
> SMPCMP. I'd like to get that first, so we have clear facts then.
>
>>> Boot failure is one test. Read/write tests should be another, I think.
>>
>> OK, I'll try also read/write. Do you have in mind something particular?
>
> Nope. Just consistency checks.
>
>>> Because if we select a bad TAP, bad things might happen later. To reduce
>>> the amount of testing, read/write testing could only be triggered if the
>>> new code path was executed?
>>
>> I'm not sure how to trigger that (or maybe I haven't understood your
>> statement...)
>
> I thought something along the lines of:
>
> - print out when you needed SMPCMP to select a TAP
On my device (RZ/G3S), which initially triggered "mmc0: tuning execution
failed" at probe, with this patch applied (and while doing read/write tests)
there are many moments when cmpngu == cmpngd, and thus the smpcmp bitmask
is populated.
With RZ/G3S+rootfs on eMMC and this patch I did the following read/write test:
root@smarc-rzg3s:~# dd if=/dev/random of=out bs=1024 count=1048576
1048576+0 records in
1048576+0 records out
root@smarc-rzg3s:~#
root@smarc-rzg3s:~# dd if=out of=test bs=1024 count=1048576
1048576+0 records in
1048576+0 records out
root@smarc-rzg3s:~#
root@smarc-rzg3s:~#
root@smarc-rzg3s:~#
root@smarc-rzg3s:~# md5sum out test
b053723af63801e665959d48cb7bd8e6 out
b053723af63801e665959d48cb7bd8e6 test
Do you consider this enough?
Thank you,
Claudiu Beznea
> - check the log for that printout
> - if (printout) do read_write_tests
>
> Dunno if that makes sense with your test setup.
>
Hi Claudiu,
I got more information about SMPCMP now. I had a misunderstanding there.
According to your patch description, you might have the same
misunderstanding? Let me quote again:
===
RZ hardware manual are similar on this chapter), at the time of tuning,
data is captured by the previous and next TAPs and the result is stored in
the SMPCMP register (previous TAP in bits 22..16, next TAP in bits 7..0).
===
It is not the previous and next TAP but the previous and next clock
cycle using the *same* TAP. And the bits in the register describe if
there was a mismatch in the data bits across these clock cycles.
So, we really want SMPCMP to be 0 because the data should be stable
across all three clock cycles of the same TAP.
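To illustrate the "SMPCMP must be 0" criterion, here is a minimal
userspace sketch that applies it to the 16 raw SMPCMP values from the
snapshot posted earlier in this thread. The array name snapshot_smpcmp
and the helper stable_mask() are hypothetical; only the values come from
the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Raw SMPCMP values for i = 0..15, copied from the posted snapshot. */
const uint32_t snapshot_smpcmp[16] = {
	0x00000000, 0x00000000, 0x000e000e, 0x00000000,
	0x00000002, 0x000001ff, 0x01ff0000, 0x00000000,
	0x00000000, 0x00000000, 0x00000000, 0x00000000,
	0x00000002, 0x000001ff, 0x01ff0000, 0x00000000,
};

/*
 * Build a bitmap with bit i set when iteration i read SMPCMP == 0,
 * i.e. no mismatch was reported for that TAP in that round.
 * At most 32 iterations fit into the returned mask.
 */
uint32_t stable_mask(const uint32_t *smpcmp, size_t n)
{
	uint32_t mask = 0;

	for (size_t i = 0; i < n && i < 32; i++) {
		if (smpcmp[i] == 0)
			mask |= UINT32_C(1) << i;
	}
	return mask;
}
```

For the snapshot this marks i = 0, 1, 3, 7, 8, 9, 10, 11 and 15 as
stable (mask 0x8f8b), so i == 2 is rejected in the first round only.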
> As of my understanding the TAP where cmpngu = 0x0e and cmpngd=0x0e is not
> considered change point of the input data. For that to happen it would mean
> that cmpngu != cmpngd.
I am not sure you can assume that cmpngu != cmpngd is always true for a
change point. I'd think it is likely often the case. But always? I am
not convinced. But I am convinced that if SMPCMP is 0, this is a good
TAP because it was stable over these clock cycles.
> From this snapshot, datasheet and our discussions:
>
> i=0, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
> i=1, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
> i=2, cmpngu=0000000e, cmpngd=0000000e, smpcmp=000e000e
> i=3, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
> *i=4, cmpngu=00000000, cmpngd=00000002, smpcmp=00000002*
> *i=5, cmpngu=00000000, cmpngd=000000ff, smpcmp=000001ff*
> *i=6, cmpngu=000000ff, cmpngd=00000000, smpcmp=01ff0000*
> i=7, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
> i=8, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
> i=9, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
> i=10, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
> i=11, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
> *i=12, cmpngu=00000000, cmpngd=00000002, smpcmp=00000002*
> *i=13, cmpngu=00000000, cmpngd=000000ff, smpcmp=000001ff*
> *i=14, cmpngu=000000ff, cmpngd=00000000, smpcmp=01ff0000*
> i=15, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
>
> I understand that TAP4,5,6 are change point of the input data and
> TAP8,0,1,2,3 are candidates for being selected, TAP 1,2 being the best
> (please correct me if I'm wrong).
I agree that TAP4-6 are the change point. TAP2 could be a candidate. I
dunno why SMPCMP is non-zero at i == 2, maybe some glitch due to noise
on the board?
I do really wonder why probing failed, though? TAP1 sounds like a good
choice as well. I mean we consider SMPCMP only if all TAPs are good. So,
if probing fails, that means that SMPCMP was non-zero all the time?
That being said, our code to select the best TAP from SMPCMP is really
not considering the change point :( It just picks the first one where
SMPCMP is 0. We are not checking where the change point is and try to be
as far away as possible.
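A rough sketch of what "as far away as possible" could look like, written
as standalone userspace C and not as a proposal for the driver: given a
bitmap of TAPs with SMPCMP == 0, pick the one whose circular distance to
the nearest non-zero TAP is largest. The names circ_dist() and
pick_farthest_tap() are hypothetical, and tap_num is assumed to be at
most 32.

```c
#include <stdint.h>

/* Distance between two TAPs on a ring of n TAPs. */
static unsigned int circ_dist(unsigned int a, unsigned int b, unsigned int n)
{
	unsigned int d = (a > b) ? a - b : b - a;

	return (d < n - d) ? d : n - d;
}

/*
 * Return the stable TAP farthest from any unstable TAP, or -1 if no TAP
 * is stable. Ties go to the lowest TAP number.
 */
int pick_farthest_tap(uint32_t stable, unsigned int tap_num)
{
	int best = -1;
	unsigned int best_dist = 0;

	for (unsigned int t = 0; t < tap_num; t++) {
		unsigned int nearest = tap_num; /* above any ring distance */

		if (!(stable & (UINT32_C(1) << t)))
			continue;

		for (unsigned int u = 0; u < tap_num; u++) {
			unsigned int d;

			if (stable & (UINT32_C(1) << u))
				continue;
			d = circ_dist(t, u, tap_num);
			if (d < nearest)
				nearest = d;
		}

		if (best < 0 || nearest > best_dist) {
			best = (int)t;
			best_dist = nearest;
		}
	}
	return best;
}
```

With 8 TAPs and only TAP4-6 unstable (stable mask 0x8f) this returns
TAP1, which sits three TAPs away from both TAP4 and TAP6.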
> root@smarc-rzg3s:~# md5sum out test
> b053723af63801e665959d48cb7bd8e6 out
> b053723af63801e665959d48cb7bd8e6 test
>
> Do you consider this enough?
Yes, if done 100 times ;)
I hope this mail was helpful?
Thanks and happy hacking,
Wolfram
Hi, Wolfram,
On 08.02.2024 02:56, Wolfram Sang wrote:
> Hi Claudiu,
>
> I got more information about SMPCMP now. I had a misunderstanding there.
> According to your patch description, you might have the same
> misunderstanding? Let me quote again:
>
> ===
> RZ hardware manual are similar on this chapter), at the time of tuning,
> data is captured by the previous and next TAPs and the result is stored in
> the SMPCMP register (previous TAP in bits 22..16, next TAP in bits 7..0).
> ===
>
> It is not the previous and next TAP but the previous and next clock
> cycle using the *same* TAP. And the bits in the register describe if
> there was a mismatch in the data bits across these clock cycles.
That's something new for me, it's not described in HW manual (or at least I
haven't found it).
>
> So, we really want SMPCMP to be 0 because the data should be stable
> across all three clock cycles of the same TAP.
So, it means issues should be somewhere else on my setup.
>
>> As of my understanding the TAP where cmpngu = 0x0e and cmpngd=0x0e is not
>> considered change point of the input data. For that to happen it would mean
>> that cmpngu != cmpngd.
>
> I am not sure you can assume that cmpngu != cmpngd is always true for a
> change point. I'd think it is likely often the case. But always? I am
> not convinced.
That was my understanding from the HW manual, and since it fixed my issue I
considered it valid at the point I wrote this statement. Maybe we need to
understand this?
> But I am convinced that if SMPCMP is 0, this is a good
> TAP because it was stable over these clock cycles.
>
>> From this snapshot, datasheet and our discussions:
>>
>> i=0, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
>> i=1, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
>> i=2, cmpngu=0000000e, cmpngd=0000000e, smpcmp=000e000e
>> i=3, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
>> *i=4, cmpngu=00000000, cmpngd=00000002, smpcmp=00000002*
>> *i=5, cmpngu=00000000, cmpngd=000000ff, smpcmp=000001ff*
>> *i=6, cmpngu=000000ff, cmpngd=00000000, smpcmp=01ff0000*
>> i=7, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
>> i=8, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
>> i=9, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
>> i=10, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
>> i=11, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
>> *i=12, cmpngu=00000000, cmpngd=00000002, smpcmp=00000002*
>> *i=13, cmpngu=00000000, cmpngd=000000ff, smpcmp=000001ff*
>> *i=14, cmpngu=000000ff, cmpngd=00000000, smpcmp=01ff0000*
>> i=15, cmpngu=00000000, cmpngd=00000000, smpcmp=00000000
>>
>> I understand that TAP4,5,6 are change point of the input data and
>> TAP8,0,1,2,3 are candidates for being selected, TAP 1,2 being the best
>> (please correct me if I'm wrong).
>
> I agree that TAP4-6 are the change point. TAP2 could be a candidate. I
> dunno why SMPCMP is non-zero at i == 2, maybe some glitch due to noise
> on the board?
Hm... it's worth considering...
>
> I do really wonder why probing failed, though? TAP1 sounds like a good
> choice as well. I mean we consider SMPCMP only if all TAPs are good. So,
> if probing fails, that means that SMPCMP was non-zero all the time?
Yes, that was my finding as well on my setup which leads to this patch.
Taking as an example the snapshot I posted in a previous email, and not
considering this patch, the code at [1] should clear the bit for TAP2 in
the smpcmp mask because in the 1st round SMPCMP was not zero (but
0x000e000e) and in the 2nd round it was zero.
[1]
https://elixir.bootlin.com/linux/latest/source/drivers/mmc/host/renesas_sdhi_core.c#L629
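For readers without the source at hand, this is a userspace sketch of how
I understand that merging step (an approximation, not a copy of the
driver code): each TAP is probed twice, at i and at i + tap_num, and a
failure in either round clears the bit for both. The function name
fold_rounds() is hypothetical, and tap_num is assumed to be at most 16 so
that both rounds fit into 32 bits.

```c
#include <stdint.h>

/*
 * For every iteration i in 0..2*tap_num-1 whose bit is clear, also clear
 * the bit of the same TAP in the other round.
 */
uint32_t fold_rounds(uint32_t mask, unsigned int tap_num)
{
	for (unsigned int i = 0; i < tap_num * 2; i++) {
		unsigned int other = (i < tap_num) ? i + tap_num : i - tap_num;

		if (!(mask & (UINT32_C(1) << i)))
			mask &= ~(UINT32_C(1) << other);
	}
	return mask;
}
```

Applied to the snapshot, where the SMPCMP == 0 iterations give the mask
0x8f8b, this yields 0x8b8b: bit 10 (TAP2, 2nd round) is cleared because
bit 2 (TAP2, 1st round) was not set.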
>
> That being said, our code to select the best TAP from SMPCMP is really
> not considering the change point :( It just picks the first one where
> SMPCMP is 0.
Hm... I thought code at [2] selects the TAP in the middle (in the snapshot
I pointed, TAP1).
[2]
https://elixir.bootlin.com/linux/latest/source/drivers/mmc/host/renesas_sdhi_core.c#L656
> We are not checking where the change point is and try to be
> as far away as possible.
>
>> root@smarc-rzg3s:~# md5sum out test
>> b053723af63801e665959d48cb7bd8e6 out
>> b053723af63801e665959d48cb7bd8e6 test
>>
>> Do you consider this enough?
>
> Yes, if done 100 times ;)
This may take a while...
>
> I hope this mail was helpful?
The tuning procedure is better understood now. But I'm not sure in which
direction I should dig further... :)
Thank you for details and patience,
Claudiu Beznea
>
> Thanks and happy hacking,
>
> Wolfram
>