LinuxLists.cc - [PATCH 1/2] mmc: bcm2835: reset host on timeout

2018-02-14 14:40:53

Subject: [PATCH 1/2] mmc: bcm2835: reset host on timeout

The bcm2835 mmc host tends to lock up for an unknown reason, so reset it
on timeout. The upper mmc block layer tries retransmitting with single
blocks, which tends to work out after a long wait.

This is better than giving up and leaving the machine broken for no
obvious reason.

Signed-off-by: Michal Suchanek <[email protected]>
---
drivers/mmc/host/bcm2835.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/mmc/host/bcm2835.c b/drivers/mmc/host/bcm2835.c
index 229dc18f0581..ce05fe72f865 100644
--- a/drivers/mmc/host/bcm2835.c
+++ b/drivers/mmc/host/bcm2835.c
@@ -286,6 +286,7 @@ static void bcm2835_reset(struct mmc_host *mmc)

 	if (host->dma_chan)
 		dmaengine_terminate_sync(host->dma_chan);
+	host->dma_chan = NULL;
 	bcm2835_reset_internal(host);
 }

@@ -837,6 +838,8 @@ static void bcm2835_timeout(struct work_struct *work)
 		dev_err(dev, "timeout waiting for hardware interrupt.\n");
 		bcm2835_dumpregs(host);

+		bcm2835_reset(host->mmc);
+
 		if (host->data) {
 			host->data->error = -ETIMEDOUT;
 			bcm2835_finish_data(host);
--
2.13.6

2018-02-14 14:40:12

by Michal Suchánek

Subject: [PATCH 2/2] mmc: bcm2835: print some informational messages during reset

The previous patch resets the host on a hardware error, so make the
reset progress more visible.

Signed-off-by: Michal Suchanek <[email protected]>
---
drivers/mmc/host/bcm2835.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/bcm2835.c b/drivers/mmc/host/bcm2835.c
index ce05fe72f865..4dde8b2b62a9 100644
--- a/drivers/mmc/host/bcm2835.c
+++ b/drivers/mmc/host/bcm2835.c
@@ -283,10 +283,14 @@ static void bcm2835_reset_internal(struct bcm2835_host *host)
 static void bcm2835_reset(struct mmc_host *mmc)
 {
 	struct bcm2835_host *host = mmc_priv(mmc);
+	struct device *dev = &host->pdev->dev;

-	if (host->dma_chan)
+	if (host->dma_chan) {
+		dev_info(dev, "tearing down dma");
 		dmaengine_terminate_sync(host->dma_chan);
+	}
 	host->dma_chan = NULL;
+	dev_info(dev, "resetting");
 	bcm2835_reset_internal(host);
 }

--
2.13.6

2018-02-14 15:01:25

by Stefan Wahren

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

Hi Michal,

On 14.02.2018 at 15:38, Michal Suchanek wrote:
> The bcm2835 mmc host tends to lock up for unknown reason so reset it on
> timeout. The upper mmc block layer tries retransimitting with single
> blocks which tends to work out after a long wait.
>
> This is better than giving up and leaving the machine broken for no
> obvious reason.

Could you please provide more information about this issue (affected
hardware, kernel config, version, dmesg, reproducible scenario)?

2018-02-14 16:00:35

by Michal Suchánek

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

On Wed, 14 Feb 2018 16:36:49 +0100
Michal Suchánek <[email protected]> wrote:

> On Wed, 14 Feb 2018 15:58:31 +0100
> Stefan Wahren <[email protected]> wrote:
>
> > Hi Michal,
> >
> > Am 14.02.2018 um 15:38 schrieb Michal Suchanek:
> > > The bcm2835 mmc host tends to lock up for unknown reason so reset
> > > it on timeout. The upper mmc block layer tries retransimitting
> > > with single blocks which tends to work out after a long wait.
> > >
> > > This is better than giving up and leaving the machine broken for
> > > no obvious reason.
> >
> > could you please provide more information about this issue (affected
> > hardware, kernel config, version, dmesg, reproducible scenario)?
> >
>

It tends to reproduce when upgrading a few packages with zypper and
otherwise at random during system operation. It seems that for my card
it worsens with age to some degree so perhaps it depends on the
fragmentation of the internal card flash.

Attaching dmesg and kernel config.

Thanks

Michal

Attachments:

(No filename) (1.02 kB)
config.txt (216.72 kB)
dmesg.txt (29.47 kB)

2018-02-14 16:51:37

by Stefan Wahren

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

Hi Michal,

[add Phil]

On 14.02.2018 at 17:13, Michal Suchánek wrote:
> On Wed, 14 Feb 2018 16:36:49 +0100
> Michal Suchánek <[email protected]> wrote:
>
>> On Wed, 14 Feb 2018 15:58:31 +0100
>> Stefan Wahren <[email protected]> wrote:
>>
>>> Hi Michal,
>>>
>>> Am 14.02.2018 um 15:38 schrieb Michal Suchanek:
>>>> The bcm2835 mmc host tends to lock up for unknown reason so reset
>>>> it on timeout. The upper mmc block layer tries retransimitting
>>>> with single blocks which tends to work out after a long wait.
>>>>
>>>> This is better than giving up and leaving the machine broken for
>>>> no obvious reason.
>>> could you please provide more information about this issue (affected
>>> hardware, kernel config, version, dmesg, reproducible scenario)?
>>>
> It tends to reproduce when upgrading a few packages with zypper and
> otherwise at random during system operation. It seems that for my card
> it worsens with age to some degree so perhaps it depends on the
> fragmentation of the internal card flash.
>
> Attaching dmesg and kernel config.

Did you notice this issue before 4.15-rc4?

Could you please test with 4.15 final again?

What kind of SD card (name) triggers the issue?

Thanks Stefan

>
> Thanks
>
> Michal

2018-02-14 16:51:44

by Florian Fainelli

Subject: Re: [PATCH 2/2] mmc: bcm2835: print some informational messages during reset

On February 14, 2018 6:38:58 AM PST, Michal Suchanek <[email protected]> wrote:
>The previous patch does reset during hardware error so make the reset
>progress more visible.

Based on your previous email it looks like this can happen quite frequently, so we might be spamming the kernel log with such reset messages. Turning this into a debug print would not be great either. How about a custom sysfs attribute counting the number of times a reset was done?

We should ideally root cause this but I am sure we can.

--
Florian
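
A minimal sketch of the counter idea above, written as a userspace C model rather than as driver code. The names `host_model`, `host_model_note_reset` and `host_model_reset_count_show` are hypothetical and do not appear in the patch. In the driver the value would presumably be exposed through a read-only device attribute, and the increment would need whatever locking the reset path already uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-host driver state. */
struct host_model {
	unsigned int reset_count;	/* resets performed since probe */
};

/* Would be called from the reset path, next to the actual reset. */
static void host_model_note_reset(struct host_model *host)
{
	assert(host != NULL);
	host->reset_count++;
}

/*
 * Models an attribute "show" callback: format the counter into buf
 * and return the number of characters written.
 */
static int host_model_reset_count_show(const struct host_model *host,
				       char *buf, size_t len)
{
	assert(host != NULL && buf != NULL);
	return snprintf(buf, len, "%u\n", host->reset_count);
}
```

A counter like this stays quiet in the kernel log while still letting a user see after the fact how often the host had to be reset.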

2018-02-14 19:27:58

by Michal Suchánek

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

On Wed, 14 Feb 2018 17:49:31 +0100
Stefan Wahren <[email protected]> wrote:

> Hi Michal,
>
> [add Phil]
>
> Am 14.02.2018 um 17:13 schrieb Michal Suchánek:
> > On Wed, 14 Feb 2018 16:36:49 +0100
> > Michal Suchánek <[email protected]> wrote:
> >
> >> On Wed, 14 Feb 2018 15:58:31 +0100
> >> Stefan Wahren <[email protected]> wrote:
> >>
> >>> Hi Michal,
> >>>
> >>> Am 14.02.2018 um 15:38 schrieb Michal Suchanek:
> >>>> The bcm2835 mmc host tends to lock up for unknown reason so reset
> >>>> it on timeout. The upper mmc block layer tries retransimitting
> >>>> with single blocks which tends to work out after a long wait.
> >>>>
> >>>> This is better than giving up and leaving the machine broken for
> >>>> no obvious reason.
> >>> could you please provide more information about this issue
> >>> (affected hardware, kernel config, version, dmesg, reproducible
> >>> scenario)?
> > It tends to reproduce when upgrading a few packages with zypper and
> > otherwise at random during system operation. It seems that for my
> > card it worsens with age to some degree so perhaps it depends on the
> > fragmentation of the internal card flash.
> >
> > Attaching dmesg and kernel config.
>
> do you noticed this issue before 4.15-rc4?

I initially noticed it with a 4.4 kernel with some backports to make it
bootable on RPi.
>
> Could you please test with 4.15 final again?

Right, I can apply the patches on something more recent.

>
> What kind of SD card (name) triggers the issue?

Samsung EVO MB-MP16D

Also see https://elinux.org/RPi_SD_cards#Which_SD_card.3F

Thanks

Michal

2018-02-14 19:48:37

by Michal Suchánek

Subject: Re: [PATCH 2/2] mmc: bcm2835: print some informational messages during reset

On Wed, 14 Feb 2018 08:50:16 -0800
Florian Fainelli <[email protected]> wrote:

> On February 14, 2018 6:38:58 AM PST, Michal Suchanek
> <[email protected]> wrote:
> >The previous patch does reset during hardware error so make the reset
> >progress more visible.
>
> Based on your previous email it looks like this can happen quite
> frequently so we might be spamming the kernel log with such reset
> messages. Turning this into a debug print would not be great either,
> how about a custom sysfs attribute counting the number of times a
> reset was done?

Since every such message happens when the system stalls for about half a
minute, I don't think there will be that many before somebody notices
something is amiss. It might also be helpful in diagnosing whether other
cards lock up in a different way. For me the DMA shutdown is short, so I
guess it's the mmc host that is locked up and the DMA engine is fine.
It might look different on different systems, though.

I understand that adding messages is somewhat controversial, so I added
them in a separate patch.

Thanks

Michal

2018-02-14 19:59:00

by Michal Suchánek

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

On Wed, 14 Feb 2018 15:58:31 +0100
Stefan Wahren <[email protected]> wrote:

> Hi Michal,
>
> Am 14.02.2018 um 15:38 schrieb Michal Suchanek:
> > The bcm2835 mmc host tends to lock up for unknown reason so reset
> > it on timeout. The upper mmc block layer tries retransimitting with
> > single blocks which tends to work out after a long wait.
> >
> > This is better than giving up and leaving the machine broken for no
> > obvious reason.
>
> could you please provide more information about this issue (affected
> hardware, kernel config, version, dmesg, reproducible scenario)?
>

The RPi3 is known to not work with some SD cards. You can find some
wiki pages with large tables of known-working and known-broken cards. I
have a couple of RPi3 boards, a card that works and a card that does
not. I tried debugging the issue but did not find anything I can do
about it. AFAICT the issue happens somewhere inside the MMC controller
IP.

I have no inside knowledge of the controller in question but during
testing I tried to reset the controller whenever the issue happens so I
can continue running the test system for a longer time until it gets
unusable. While I did not find any solution to the problem the
workaround with resetting the controller works quite reliably for me.

So I am posting it in the hope that people with the wrong combination
of RPi3 and SD card will not get a blank screen but rather a system
that boots but tends to lock up for half a minute occasionally.

Thanks

Michal
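
The workaround described above can be sketched as a small userspace C model. Everything in it is an assumption made for illustration: the `ctrl_model` type, the rule that only multi-block transfers wedge the controller, and the helper names. It only shows the control flow of a timeout that triggers a reset, followed by the upper layer retrying block by block.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a controller that can wedge mid-transfer. */
struct ctrl_model {
	bool locked_up;		/* true while the controller is wedged */
	int lockups_left;	/* how many more transfers will wedge it */
	unsigned int resets;	/* resets performed by the timeout path */
};

/* Stand-in for the reset that the patch adds to the timeout handler. */
static void ctrl_reset(struct ctrl_model *c)
{
	c->locked_up = false;
	c->resets++;
}

/* One transfer attempt; a wedged controller reports -ETIMEDOUT. */
static int ctrl_transfer(struct ctrl_model *c, unsigned int blocks)
{
	assert(c != NULL && blocks > 0);
	if (blocks > 1 && c->lockups_left > 0) {
		c->lockups_left--;
		c->locked_up = true;
	}
	if (c->locked_up) {
		ctrl_reset(c);	/* the timeout work resets the host */
		return -ETIMEDOUT;
	}
	return 0;
}

/*
 * Models the upper layer: try the whole request once and, if that
 * fails, fall back to transferring one block at a time.
 */
static int read_blocks(struct ctrl_model *c, unsigned int blocks)
{
	unsigned int i;
	int err = ctrl_transfer(c, blocks);

	if (!err)
		return 0;
	for (i = 0; i < blocks; i++) {
		err = ctrl_transfer(c, 1);
		if (err)
			return err;
	}
	return 0;
}
```

Without the reset in the timeout path, `locked_up` would stay set in this model and every retry would fail too, which corresponds to the machine being left broken.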

2018-02-14 20:32:43

by Stefan Wahren

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

Hi Michal,

> Michal Suchánek <[email protected]> wrote on 14 February 2018 at 20:24:
>
>
> On Wed, 14 Feb 2018 17:49:31 +0100
> Stefan Wahren <[email protected]> wrote:
>
> > Hi Michal,
> >
> > [add Phil]
> >
> > Am 14.02.2018 um 17:13 schrieb Michal Suchánek:
> > > On Wed, 14 Feb 2018 16:36:49 +0100
> > > Michal Suchánek <[email protected]> wrote:
> > >
> > >> On Wed, 14 Feb 2018 15:58:31 +0100
> > >> Stefan Wahren <[email protected]> wrote:
> > >>
> > >>> Hi Michal,
> > >>>
> > >>> Am 14.02.2018 um 15:38 schrieb Michal Suchanek:
> > >>>> The bcm2835 mmc host tends to lock up for unknown reason so reset
> > >>>> it on timeout. The upper mmc block layer tries retransimitting
> > >>>> with single blocks which tends to work out after a long wait.
> > >>>>
> > >>>> This is better than giving up and leaving the machine broken for
> > >>>> no obvious reason.
> > >>> could you please provide more information about this issue
> > >>> (affected hardware, kernel config, version, dmesg, reproducible
> > >>> scenario)?
> > > It tends to reproduce when upgrading a few packages with zypper and
> > > otherwise at random during system operation. It seems that for my
> > > card it worsens with age to some degree so perhaps it depends on the
> > > fragmentation of the internal card flash.
> > >
> > > Attaching dmesg and kernel config.
> >
> > do you noticed this issue before 4.15-rc4?
>
> I initially noticed it with 4.4 kernel with some backports to make it
> bootable on RPi.

This confuses me. Gerd and I ported this driver from downstream, and it was finally merged in 4.12.

So do you mean that you backported the mainline version to 4.4, or the downstream version of 4.4?

At a quick glance they seem identical, but they aren't.

> >
> > Could you please test with 4.15 final again?
>
> Right, I can apply the patches on something more recent.
>
> >
> > What kind of SD card (name) triggers the issue?
>
> Samsung EVO MB-MP16D

Thanks

>
> Also see https://elinux.org/RPi_SD_cards#Which_SD_card.3F

I'm very sceptical about this list. The card above is listed as both OK and NOK. The issues people experienced need not be directly related to the card (improperly unmounted, bad driver, ...).

Stefan

>
> Thanks
>
> Michal

2018-02-16 13:11:12

by Stefan Wahren

Subject: Re: [PATCH 2/2] mmc: bcm2835: print some informational messages during reset

Hi Michal,

> Michal Suchánek <[email protected]> wrote on 14 February 2018 at 20:47:
>
>
> On Wed, 14 Feb 2018 08:50:16 -0800
> Florian Fainelli <[email protected]> wrote:
>
> > On February 14, 2018 6:38:58 AM PST, Michal Suchanek
> > <[email protected]> wrote:
> > >The previous patch does reset during hardware error so make the reset
> > >progress more visible.
> >
> > Based on your previous email it looks like this can happen quite
> > frequently so we might be spamming the kernel log with such reset
> > messages. Turning this into a debug print would not be great either,
> > how about a custom sysfs attribute counting the number of times a
> > reset was done?
>
> Since every such message happens when the system stalls for like half a
> minute I don't think there will be that many until somebody notices
> something is amiss. It might be also helpful in diagnosing if other
> cards lock up in different way - for me the DMA shutdown is short so I
> guess it's the mmc host that is locked up and the DMA engine is fine.
> It might look differently on different systems, though.

FWIW according to your dmesg your RPi doesn't use the DMA engine:

[ 5.004609] sdhost-bcm2835 3f202000.sdhost: unable to initialise DMA channel. Falling back to PIO
[ 5.154518] sdhost-bcm2835 3f202000.sdhost: loaded - DMA disabled

For me it's a chicken and egg problem if the DMA driver is built as a kernel module.

Stefan

>
> I understand that adding messages it somewhat controversial so I added
> them in separate patch.
>
> Thanks
>
> Michal
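
The fallback visible in the dmesg lines above can be illustrated with a userspace C model. This is a sketch under stated assumptions, not the driver's code: `sdhost_model`, `request_chan` and `sdhost_probe` are hypothetical names, and the `provider_loaded` flag stands in for whether the DMA driver is already available when the SD host is probed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct dma_chan_model { int id; };

/* Hypothetical host state; only the fields needed for the sketch. */
struct sdhost_model {
	struct dma_chan_model *dma_chan;
	bool use_dma;
};

/*
 * Stand-in for requesting a DMA channel. It returns NULL while the DMA
 * provider is not loaded, for example because it is a module stored on
 * the very SD card that this host has to read first.
 */
static struct dma_chan_model *request_chan(bool provider_loaded)
{
	static struct dma_chan_model chan = { .id = 0 };

	return provider_loaded ? &chan : NULL;
}

/* Probe-time decision: use DMA if a channel exists, otherwise PIO. */
static void sdhost_probe(struct sdhost_model *host, bool provider_loaded)
{
	assert(host != NULL);
	host->dma_chan = request_chan(provider_loaded);
	host->use_dma = host->dma_chan != NULL;
	if (!host->use_dma)
		fprintf(stderr,
			"unable to initialise DMA channel. Falling back to PIO\n");
}
```

In this model the decision is made once at probe time, so a provider that loads later from the root filesystem does not change `use_dma`.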

2018-03-03 14:00:35

by Stefan Wahren

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

Hi Michal,

[add Stefan to CC]

> Michal Suchánek <[email protected]> wrote on 14 February 2018 at 20:24:
>
>
> On Wed, 14 Feb 2018 17:49:31 +0100
> Stefan Wahren <[email protected]> wrote:
>
> > Hi Michal,
> >
> > [add Phil]
> >
> > Am 14.02.2018 um 17:13 schrieb Michal Suchánek:
> > > On Wed, 14 Feb 2018 16:36:49 +0100
> > > Michal Suchánek <[email protected]> wrote:
> > >
> > >> On Wed, 14 Feb 2018 15:58:31 +0100
> > >> Stefan Wahren <[email protected]> wrote:
> > >>
> > >>> Hi Michal,
> > >>>
> > >>> Am 14.02.2018 um 15:38 schrieb Michal Suchanek:
> > >>>> The bcm2835 mmc host tends to lock up for unknown reason so reset
> > >>>> it on timeout. The upper mmc block layer tries retransimitting
> > >>>> with single blocks which tends to work out after a long wait.
> > >>>>
> > >>>> This is better than giving up and leaving the machine broken for
> > >>>> no obvious reason.
> > >>> could you please provide more information about this issue
> > >>> (affected hardware, kernel config, version, dmesg, reproducible
> > >>> scenario)?
> > > It tends to reproduce when upgrading a few packages with zypper and
> > > otherwise at random during system operation. It seems that for my
> > > card it worsens with age to some degree so perhaps it depends on the
> > > fragmentation of the internal card flash.
> > >
> > > Attaching dmesg and kernel config.
> >
> > do you noticed this issue before 4.15-rc4?
>
> I initially noticed it with 4.4 kernel with some backports to make it
> bootable on RPi.
> >
> > Could you please test with 4.15 final again?
>
> Right, I can apply the patches on something more recent.
>
> >
> > What kind of SD card (name) triggers the issue?
>
> Samsung EVO MB-MP16D
>
> Also see https://elinux.org/RPi_SD_cards#Which_SD_card.3F
>
> Thanks
>
> Michal
>

Yesterday I finished my stress tests with a Raspberry Pi 3.

Scenario:
- copy Tumbleweed to the SD card (openSUSE-Tumbleweed-ARM-JeOS-raspberrypi3.aarch64-2018.02.02-Build1.2.raw, Linux 4.14.15)
- setup locales with yast
- run zypper update
- reboot
- install and remove java 1.8 in a loop for at least 1 hour

Results of the different SD cards:
Toshiba uSDHC Class 10 UHS-1 32 GB: PASS
BASETech uSDHC Class 10 16 GB: PASS
Samsung uSDHC EVO+ UHS-1 16 GB: PASS
Samsung uSDHC Class 6 32 GB: PASS
SanDisk Edge Class 4 16 GB: PASS
Kingston uSDHC Class 10 UHS-1 32 GB: PASS
QUMOX uSDHC Class 10 UHS-1 16 GB: FAIL (zypper segfaulted permanently)
Transcend uSDHC Class 10 UHS-1 32 GB: PASS

I was never able to reproduce this timeout. So I still need feedback about 4.15 and a reliable test scenario.

In a GitHub issue, I read that badblocks is more likely to reproduce the issue.

Regards
Stefan
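
The last step of the scenario above, running an install/remove cycle in a loop for a fixed time, could be driven by a small helper like the following. This is a hedged sketch: `stress_loop` is a hypothetical name, and the command passed to it is up to the tester.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/*
 * Run cmd through the shell repeatedly until `seconds` have elapsed or
 * the command fails. Stores the number of attempts in *cycles and
 * returns 0 on success, or the non-zero status reported by system().
 */
static int stress_loop(const char *cmd, double seconds, unsigned int *cycles)
{
	time_t start = time(NULL);

	assert(cmd != NULL && cycles != NULL);
	*cycles = 0;
	while (difftime(time(NULL), start) < seconds) {
		int status;

		(*cycles)++;
		status = system(cmd);
		if (status != 0)
			return status;
	}
	return 0;
}
```

A call such as `stress_loop("zypper -n install java-1_8_0-openjdk && zypper -n remove java-1_8_0-openjdk", 3600.0, &n)` would approximate the one-hour loop; the exact package name is an assumption and depends on the distribution. The kernel log can then be checked for "timeout waiting for hardware interrupt".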

2018-03-04 16:06:27

by Michal Suchánek

Subject: Re: [PATCH 2/2] mmc: bcm2835: print some informational messages during reset

On Thu, 15 Feb 2018 19:22:00 +0100 (CET)
Stefan Wahren <[email protected]> wrote:

> Hi Michal,
>
> > Michal Suchánek <[email protected]> hat am 14. Februar 2018 um
> > 20:47 geschrieben:
> >
> >
> > On Wed, 14 Feb 2018 08:50:16 -0800
> > Florian Fainelli <[email protected]> wrote:
> >
> > > On February 14, 2018 6:38:58 AM PST, Michal Suchanek
> > > <[email protected]> wrote:
> > > >The previous patch does reset during hardware error so make the
> > > >reset progress more visible.
> > >
> > > Based on your previous email it looks like this can happen quite
> > > frequently so we might be spamming the kernel log with such reset
> > > messages. Turning this into a debug print would not be great
> > > either, how about a custom sysfs attribute counting the number of
> > > times a reset was done?
> >
> > Since every such message happens when the system stalls for like
> > half a minute I don't think there will be that many until somebody
> > notices something is amiss. It might be also helpful in diagnosing
> > if other cards lock up in different way - for me the DMA shutdown
> > is short so I guess it's the mmc host that is locked up and the DMA
> > engine is fine. It might look differently on different systems,
> > though.
>
> FWIW according to your dmesg your RPi doesn't use the DMA engine:
>
> [ 5.004609] sdhost-bcm2835 3f202000.sdhost: unable to initialise DMA channel. Falling back to PIO
> [ 5.154518] sdhost-bcm2835 3f202000.sdhost: loaded - DMA disabled
>
> For me it's a chicken and egg problem if the DMA driver is build as a
> kernel module.

It can be included in the ramdisk but somebody would have to add it to
the list of required modules because the dependency is non-obvious.

Thanks

Michal

2018-03-04 17:13:52

by Michal Suchánek

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

On Wed, 14 Feb 2018 21:30:16 +0100 (CET)
Stefan Wahren <[email protected]> wrote:

> Hi Michal,
>
> > Michal Suchánek <[email protected]> hat am 14. Februar 2018 um
> > 20:24 geschrieben:
> >
> >
> > On Wed, 14 Feb 2018 17:49:31 +0100
> > Stefan Wahren <[email protected]> wrote:
> >
> > > Hi Michal,
> > >
> > > [add Phil]
> > >
> > > Am 14.02.2018 um 17:13 schrieb Michal Suchánek:
> > > > On Wed, 14 Feb 2018 16:36:49 +0100
> > > > Michal Suchánek <[email protected]> wrote:
> > > >
> > > >> On Wed, 14 Feb 2018 15:58:31 +0100
> > > >> Stefan Wahren <[email protected]> wrote:
> > > >>
> > > >>> Hi Michal,
> > > >>>
> > > >>> Am 14.02.2018 um 15:38 schrieb Michal Suchanek:
> > > >>>> The bcm2835 mmc host tends to lock up for unknown reason so
> > > >>>> reset it on timeout. The upper mmc block layer tries
> > > >>>> retransimitting with single blocks which tends to work out
> > > >>>> after a long wait.
> > > >>>>
> > > >>>> This is better than giving up and leaving the machine broken
> > > >>>> for no obvious reason.
> > > >>> could you please provide more information about this issue
> > > >>> (affected hardware, kernel config, version, dmesg,
> > > >>> reproducible scenario)?
> > > > It tends to reproduce when upgrading a few packages with zypper
> > > > and otherwise at random during system operation. It seems that
> > > > for my card it worsens with age to some degree so perhaps it
> > > > depends on the fragmentation of the internal card flash.
> > > >
> > > > Attaching dmesg and kernel config.
> > >
> > > do you noticed this issue before 4.15-rc4?
> >
> > I initially noticed it with 4.4 kernel with some backports to make
> > it bootable on RPi.
>
> this confuses me. Gerd and i ported this driver from downstream and
> finally it's got merged in 4.12.
>
> So do you mean that you backported the mainline version to 4.4 or the
> downstream version of 4.4?

I did not backport it, but looking at the changelog it is a backport of
the 4.12 driver. It does not look like the 4.15 driver, though. Looks
like there was some reorganization of the bcm mmc since then.

>
> On a quick look they seems identical, but they aren't.
>
> > >
> > > Could you please test with 4.15 final again?
> >

I tried upgrading to the current master (4.16-rc3+) and the issue is
still reproducible, although less frequently. I did a full upgrade from
the install image, which installs over 300 packages, and the issue
triggered somewhere around the 200th, whereas before, installing half a
dozen packages would usually trigger it.

> > Right, I can apply the patches on something more recent.
> >
> > >
> > > What kind of SD card (name) triggers the issue?
> >
> > Samsung EVO MB-MP16D
>
> Thanks
>
> >
> > Also see https://elinux.org/RPi_SD_cards#Which_SD_card.3F
>
> I'm very sceptical about this list. The card above is listed as OK
> and NOK. The experienced issues doesn't need to be direct related to
> the card (unproperly umounted, bad driver, ...).

Right, it just shows that this is not an isolated problem. Not all test
results are reliable, of course. Some include interesting details,
though.

Thanks

Michal

Attachments:

(No filename) (3.21 kB)
dmesg.4.16.txt (24.63 kB)

2018-03-04 17:36:51

by Michal Suchánek

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

On Sat, 3 Mar 2018 14:58:45 +0100 (CET)
Stefan Wahren <[email protected]> wrote:

> In a github issue, i've read that badblocks could reproduce the issue
> more likely.

How many iterations of badblocks? It does not reproduce the issue for
me.

Thanks

Michal

2018-03-04 17:36:51

by Michal Suchánek

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

Hello,

On Sat, 3 Mar 2018 14:58:45 +0100 (CET)
Stefan Wahren <[email protected]> wrote:

> Hi Michal,

> yesterday i finished my stress tests with Raspberry Pi 3.
>
> Scenario:
> - copy Tumbleweed on SD card
> (openSUSE-Tumbleweed-ARM-JeOS-raspberrypi3.aarch64-2018.02.02-Build1.2.raw,
> Linux 4.14.15)
> - setup locales with yast
> - run zypper update
> - reboot
> - install and remove java 1.8 in a loop for at least 1 hour

How many cycles does that perform?

>
> Results of the different SD cards:
> Toshiba uSDHC Class 10 UHS-1 32 GB: PASS
> BASETech uSDHC Class 10 16 GB: PASS
> Samsung uSDHC EVO+ UHS-1 16 GB: PASS
> Samsung uSDHC Class 6 32 GB: PASS
> SanDisk Edge Class 4 16 GB: PASS
> Kingston uSDHC Class 10 UHS-1 32 GB: PASS
> QUMOX uSDHC Class 10 UHS-1 16 GB: FAIL (zypper segfaulted
> permantently)

It may very well segfault because the card disconnected due to the
timeout and it is missing some pages of its code. It did for me as well
without the workaround.

> Transcend uSDHC Class 10 UHS-1 32 GB: PASS
>
> I was never able to reproduce this timeout. So i still need the
> feedback about the 4.15 and i a reliable test scenario.

You may very well never get any. AFAICT this is some kind of heisenbug
in the hardware.

Thanks

Michal

2018-03-04 18:32:38

by Stefan Wahren

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

Hi Michal,

> Michal Suchánek <[email protected]> wrote on 4 March 2018 at 16:57:
>
>
> On Wed, 14 Feb 2018 21:30:16 +0100 (CET)
> Stefan Wahren <[email protected]> wrote:
>
> > Hi Michal,
> >
> > > Michal Suchánek <[email protected]> hat am 14. Februar 2018 um
> > > 20:24 geschrieben:
> > >
> > >
> > > On Wed, 14 Feb 2018 17:49:31 +0100
> > > Stefan Wahren <[email protected]> wrote:
> > >
> > > > Hi Michal,
> > > >
> > > > [add Phil]
> > > >
> > > > Am 14.02.2018 um 17:13 schrieb Michal Suchánek:
> > > > > On Wed, 14 Feb 2018 16:36:49 +0100
> > > > > Michal Suchánek <[email protected]> wrote:
> > > > >
> > > > >> On Wed, 14 Feb 2018 15:58:31 +0100
> > > > >> Stefan Wahren <[email protected]> wrote:
> > > > >>
> > > > >>> Hi Michal,
> > > > >>>
> > > > >>> Am 14.02.2018 um 15:38 schrieb Michal Suchanek:
> > > > >>>> The bcm2835 mmc host tends to lock up for unknown reason so
> > > > >>>> reset it on timeout. The upper mmc block layer tries
> > > > >>>> retransimitting with single blocks which tends to work out
> > > > >>>> after a long wait.
> > > > >>>>
> > > > >>>> This is better than giving up and leaving the machine broken
> > > > >>>> for no obvious reason.
> > > > >>> could you please provide more information about this issue
> > > > >>> (affected hardware, kernel config, version, dmesg,
> > > > >>> reproducible scenario)?
> > > > > It tends to reproduce when upgrading a few packages with zypper
> > > > > and otherwise at random during system operation. It seems that
> > > > > for my card it worsens with age to some degree so perhaps it
> > > > > depends on the fragmentation of the internal card flash.
> > > > >
> > > > > Attaching dmesg and kernel config.
> > > >
> > > > do you noticed this issue before 4.15-rc4?
> > >
> > > I initially noticed it with 4.4 kernel with some backports to make
> > > it bootable on RPi.
> >
> > this confuses me. Gerd and i ported this driver from downstream and
> > finally it's got merged in 4.12.
> >
> > So do you mean that you backported the mainline version to 4.4 or the
> > downstream version of 4.4?
>
> I did not backport it but looking at the changelog it is backport of
> the 4.12 driver. It does not look as the 4.15 driver though. Looks like
> there was some reorganization of the bcm mmc since then.
>
> >
> > On a quick look they seems identical, but they aren't.
> >
> > > >
> > > > Could you please test with 4.15 final again?
> > >
>
> I tried upgrading to the current master (4.16-rc3+) and the issue is
> still reproducible although less frequent. I did full upgrade from the
> install image which installs over 300 packages and the issue triggered
> somewhere around 200th while before installing a half dozen packages
> would usually trigger it.
>

This is the same as what I did during my stress tests. That step installed 475 packages. The timeout never occurred.

Stefan

2018-03-04 18:32:42

by Stefan Wahren

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

Hi Michal,

> Michal Suchánek <[email protected]> wrote on 4 March 2018 at 17:36:
>
>
> On Sat, 3 Mar 2018 14:58:45 +0100 (CET)
> Stefan Wahren <[email protected]> wrote:
>
> > In a github issue, i've read that badblocks could reproduce the issue
> > more likely.
>
> How many iterations of badblocks? It doe snot reproduce the issue for
> me.

I asked for the parameters here:

https://github.com/raspberrypi/linux/issues/2392

Stefan

>
> Thanks
>
> Michal

2018-03-04 19:41:10

by Michal Suchánek

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

On Sun, 4 Mar 2018 19:11:49 +0100 (CET)
Stefan Wahren <[email protected]> wrote:

> Hi Michal,
>
> > Michal Suchánek <[email protected]> hat am 4. März 2018 um 16:57
> > geschrieben:
> >
> >
> > On Wed, 14 Feb 2018 21:30:16 +0100 (CET)
> > Stefan Wahren <[email protected]> wrote:
> >
> > > Hi Michal,
> > >
> > > > Michal Suchánek <[email protected]> hat am 14. Februar 2018 um
> > > > 20:24 geschrieben:
> > > >
> > > >
> > > > On Wed, 14 Feb 2018 17:49:31 +0100
> > > > Stefan Wahren <[email protected]> wrote:
> > > >
> > > > > Hi Michal,
> > > > >
> > > > > [add Phil]
> > > > >
> > > > > Am 14.02.2018 um 17:13 schrieb Michal Suchánek:
> > > > > > On Wed, 14 Feb 2018 16:36:49 +0100
> > > > > > Michal Suchánek <[email protected]> wrote:
> > > > > >
> > > > > >> On Wed, 14 Feb 2018 15:58:31 +0100
> > > > > >> Stefan Wahren <[email protected]> wrote:
> > > > > >>
> > > > > >>> Hi Michal,
> > > > > >>>
> > > > > >>> Am 14.02.2018 um 15:38 schrieb Michal Suchanek:
> > > > > >>>> The bcm2835 mmc host tends to lock up for unknown reason
> > > > > >>>> so reset it on timeout. The upper mmc block layer tries
> > > > > >>>> retransimitting with single blocks which tends to work
> > > > > >>>> out after a long wait.
> > > > > >>>>
> > > > > >>>> This is better than giving up and leaving the machine
> > > > > >>>> broken for no obvious reason.
> > > > > >>> could you please provide more information about this issue
> > > > > >>> (affected hardware, kernel config, version, dmesg,
> > > > > >>> reproducible scenario)?
> > > > > > It tends to reproduce when upgrading a few packages with
> > > > > > zypper and otherwise at random during system operation. It
> > > > > > seems that for my card it worsens with age to some degree
> > > > > > so perhaps it depends on the fragmentation of the internal
> > > > > > card flash.
> > > > > >
> > > > > > Attaching dmesg and kernel config.
> > > > >
> > > > > do you noticed this issue before 4.15-rc4?
> > > >
> > > > I initially noticed it with 4.4 kernel with some backports to
> > > > make it bootable on RPi.
> > >
> > > this confuses me. Gerd and i ported this driver from downstream
> > > and finally it's got merged in 4.12.
> > >
> > > So do you mean that you backported the mainline version to 4.4 or
> > > the downstream version of 4.4?
> >
> > I did not backport it but looking at the changelog it is backport of
> > the 4.12 driver. It does not look as the 4.15 driver though. Looks
> > like there was some reorganization of the bcm mmc since then.
> >
> > >
> > > On a quick look they seems identical, but they aren't.
> > >
> > > > >
> > > > > Could you please test with 4.15 final again?
> > > >
> >
> > I tried upgrading to the current master (4.16-rc3+) and the issue is
> > still reproducible although less frequent. I did full upgrade from
> > the install image which installs over 300 packages and the issue
> > triggered somewhere around 200th while before installing a half
> > dozen packages would usually trigger it.
> >
>
> this is the same what i did during my stress tests. The step
> installed 475 packages. The timeout never occured.

First off, you did your testing with the Tumbleweed image, which probably
uses btrfs for /, while I use the Leap 42.3 image, which uses ext4.

I was not able to reproduce the issue with installing packages so far -
installed GNOME which is over 700 packages and the issue did not
trigger. However, the upgrades also unlink the old files as does
removing packages - removing GNOME removed over 200 packages and the
issue triggered. However, re-installing and removing GNOME again did
not trigger the issue. So nothing so far reproduces the issue reliably.
With the new kernel the issue reproduces even less frequently than with
the 4.4 and 4.15 kernels. Probably some I/O scheduler change affects the
disk I/O patterns.

Thanks

Michal

2018-03-06 19:22:22

by Michal Suchánek

Subject: Re: [PATCH 1/2] mmc: bcm2835: reset host on timeout

Hello,

On Sat, 3 Mar 2018 14:58:45 +0100 (CET)
Stefan Wahren <[email protected]> wrote:

> Hi Michal,
>

> yesterday i finished my stress tests with Raspberry Pi 3.
>
> Scenario:
> - copy Tumbleweed on SD card
> (openSUSE-Tumbleweed-ARM-JeOS-raspberrypi3.aarch64-2018.02.02-Build1.2.raw,
> Linux 4.14.15)
> - setup locales with yast
> - run zypper update
> - reboot
> - install and remove java 1.8 in a loop for at least 1 hour
>
> Results of the different SD cards:
> Toshiba uSDHC Class 10 UHS-1 32 GB: PASS
> BASETech uSDHC Class 10 16 GB: PASS
> Samsung uSDHC EVO+ UHS-1 16 GB: PASS
> Samsung uSDHC Class 6 32 GB: PASS
> SanDisk Edge Class 4 16 GB: PASS
> Kingston uSDHC Class 10 UHS-1 32 GB: PASS
> QUMOX uSDHC Class 10 UHS-1 16 GB: FAIL (zypper segfaulted
> permantently)
> Transcend uSDHC Class 10 UHS-1 32 GB: PASS
>
> I was never able to reproduce this timeout.

For the record, I was not able to reproduce the issue by installing and
removing openjdk-1.7.0 in a loop for an hour on a Leap 42.3 system that
is affected by the issue. This was a loop of repeatedly installing and
removing four packages, and it was executed 68 times. I enabled zypper
package caching to avoid spending time on downloads.

Thanks

Michal