On Sat, Nov 25, 2023 at 09:09:01AM +0000, Greg Kroah-Hartman wrote:
> On Sat, Nov 25, 2023 at 09:50:38AM +0100, Oleksij Rempel wrote:

> > It prevents HW damage. In a typical automotive under-voltage labor it is
> > usually possible to reproduce X amount of bricked eMMCs or NANDs on Y
> > amount of under-voltage cycles (I do not have exact numbers right now).
> > Even if the numbers not so high in the labor tests (sometimes something
> > like one bricked device in a month of tests), the field returns are
> > significant enough to care about software solution for this problem.

> So hardware is attempting to rely on software in order to prevent the
> destruction of that same hardware? Surely hardware designers aren't
> that crazy, right? (rhetorical question, I know...)

Surely software people aren't going to make no effort to integrate with
the notification features that the hardware engineers have so helpfully
provided us with?

> > Same problem was seen not only in automotive devices, but also in
> > industrial or agricultural. With other words, it is important enough to bring
> > some kind of solution mainline.

> But you are not providing a real solution here, only a "I am going to
> attempt to shut down a specific type of device before the others, there
> are no time or ordering guarantees here, so good luck!" solution.

I'm not sure there are great solutions here, the system integrators are
constrained by the what the application appropriate silicon that's on
the market is capable of, the siicon is constrained by the area costs of
dealing with corner cases for system robustness and how much of the
market cares about fixing these issues and software is constrained by
what hardware ends up being built. Everyone's just got to try their
best with the reality they're confronted with, hopefully what's possible
will improve with time.

> And again, how are you going to prevent the in-fighting of all device
> types to be "first" in the list?

It doesn't seem like the most complex integration challenge we've ever
had to deal with TBH.

Attachments:

(No filename) (2.06 kB)
signature.asc (499.00 B)
Download all attachments

2023-11-25 14:35:55

by Greg Kroah-Hartman

[permalink] [raw]

Subject: Re: [PATCH v1 0/3] introduce priority-based shutdown support

On Sat, Nov 25, 2023 at 10:30:42AM +0000, Mark Brown wrote:
> On Sat, Nov 25, 2023 at 09:09:01AM +0000, Greg Kroah-Hartman wrote:
> > On Sat, Nov 25, 2023 at 09:50:38AM +0100, Oleksij Rempel wrote:
>
> > > It prevents HW damage. In a typical automotive under-voltage labor it is
> > > usually possible to reproduce X amount of bricked eMMCs or NANDs on Y
> > > amount of under-voltage cycles (I do not have exact numbers right now).
> > > Even if the numbers not so high in the labor tests (sometimes something
> > > like one bricked device in a month of tests), the field returns are
> > > significant enough to care about software solution for this problem.
>
> > So hardware is attempting to rely on software in order to prevent the
> > destruction of that same hardware? Surely hardware designers aren't
> > that crazy, right? (rhetorical question, I know...)
>
> Surely software people aren't going to make no effort to integrate with
> the notification features that the hardware engineers have so helpfully
> provided us with?

That would be great, but I don't see that here, do you? All I see is
the shutdown sequence changing because someone wants it to go "faster"
with the threat of hardware breaking if we don't meet that "faster"
number, yet no knowledge or guarantee that this number can ever be known
or happen.

> > > Same problem was seen not only in automotive devices, but also in
> > > industrial or agricultural. With other words, it is important enough to bring
> > > some kind of solution mainline.
>
> > But you are not providing a real solution here, only a "I am going to
> > attempt to shut down a specific type of device before the others, there
> > are no time or ordering guarantees here, so good luck!" solution.
>
> I'm not sure there are great solutions here, the system integrators are
> constrained by the what the application appropriate silicon that's on
> the market is capable of, the siicon is constrained by the area costs of
> dealing with corner cases for system robustness and how much of the
> market cares about fixing these issues and software is constrained by
> what hardware ends up being built. Everyone's just got to try their
> best with the reality they're confronted with, hopefully what's possible
> will improve with time.

Agreed, but I don't think this patch is going to actually work properly
over time as there is no time values involved :)

> > And again, how are you going to prevent the in-fighting of all device
> > types to be "first" in the list?
>
> It doesn't seem like the most complex integration challenge we've ever
> had to deal with TBH.

True, but we all know how this grows and thinking about how to handle it
now is key for this to be acceptable.

thanks,

greg k-h

2023-11-25 15:43:19

[permalink] [raw]

Subject: Re: [PATCH v1 0/3] introduce priority-based shutdown support

On Sun, Nov 26, 2023 at 08:42:02PM +0100, Ferry Toth wrote:

> Funny discussion. As a hardware engineer (with no experience in automotive,
> but actual experience in industrial applications and debugging issues
> arising from bad shutdowns) let me add my 5ct at the end.

I suspect there's also a space here beyond systems that were designed
with these failure modes in mind where people run into issues once they
have the hardware and are trying to improve what they can after the fact.

> Now, we do need to keep in mind that storing J in a supercap, executing a
> CPU at GHz, storing GB data do not come free. So, after making sure things
> shutdown in time, it often pays off to shorten that deadline, and indeed
> make it faster.

Indeed.

Attachments:

(No filename) (760.00 B)
signature.asc (499.00 B)
Download all attachments

2023-11-27 14:25:27

On Mon, Nov 27, 2023 at 04:49:49PM +0200, Matti Vaittinen wrote:
> On 11/27/23 15:08, Greg Kroah-Hartman wrote:

> > Yes, using device tree would be good, but now you have created something
> > that is device-tree-specific and not all the world is device tree :(

> True. However, my understanding is that the regulator subsystem is largely
> written to work with DT-based systems. Hence supporting the DT-based
> solution would probably fit to this specific use-case as source of problem
> notifications is the regulator subsystem.

Yes, ACPI has a strong model that things like regulators and clocks are
not visible to the OS.

Attachments:

(No filename) (642.00 B)
signature.asc (499.00 B)
Download all attachments

2023-11-30 09:58:00

by Ulf Hansson

[permalink] [raw]

Subject: Re: [PATCH v1 0/3] introduce priority-based shutdown support

On Fri, 24 Nov 2023 at 15:53, Oleksij Rempel <[email protected]> wrote:
>
> Hi,
>
> This patch series introduces support for prioritized device shutdown.
> The main goal is to enable prioritization for shutting down specific
> devices, particularly crucial in scenarios like power loss where
> hardware damage can occur if not handled properly.
>
> Oleksij Rempel (3):
> driver core: move core part of device_shutdown() to a separate
> function
> driver core: introduce prioritized device shutdown sequence
> mmc: core: increase shutdown priority for MMC devices
>
> drivers/base/core.c | 157 +++++++++++++++++++++++++++--------------
> drivers/mmc/core/bus.c | 2 +
> include/linux/device.h | 51 ++++++++++++-
> kernel/reboot.c | 4 +-
> 4 files changed, 157 insertions(+), 57 deletions(-)
>

Sorry for joining the discussions a bit late! Besides the valuable
feedback that you already received from others (which indicates that
we have quite some work to do in the commit messages to better explain
and justify these changes), I wanted to share my overall thoughts
around this.

So, I fully understand the reason behind the $subject series, as we
unfortunately can't rely on flash-based (NAND/NOR) storage devices
being 100% tolerant to sudden-power failures. Besides for the reasons
already discussed in the thread, the robustness simply depends on the
"quality" of the FTL (flash translation layer) and the NAND/NOR/etc
device it runs.

For example, back in the days when Android showed up, we were testing
YAFFS and UBIFS on rawNAND, which failed miserably after just a few
thousands of power-cycles. It was even worse with ext3/4 on the early
variants of eMMC devices, as those survived only a few hundreds of
power-cycles. Now, I assume this has improved a lot over the years,
but I haven't really verified this myself.

That said, for eMMC and other flash-based storage devices, industrial
or not, I think it would make sense to try to notify the device about
the power-failure, if possible. This would add another level of
mitigation, I think.

From an implementation point of view, it looks to me that the approach
in the $subject series has some potential. Although, rather than
diving into the details, I will defer to review the next version.

Kind regards
Uffe

2023-11-30 21:59:46

by Francesco Dolcini

[permalink] [raw]

Subject: Re: [PATCH v1 0/3] introduce priority-based shutdown support

Hello all,

On Mon, Nov 27, 2023 at 12:36:11PM +0100, Oleksij Rempel wrote:
> On Mon, Nov 27, 2023 at 10:13:49AM +0000, Christian Loehle wrote:
> > > Same problem was seen not only in automotive devices, but also in
> > > industrial or agricultural. With other words, it is important enough to bring
> > > some kind of solution mainline.
> > >
> >
> > IMO that is a serious problem with the used storage / eMMC in that case and it
> > is not suitable for industrial/automotive uses?
> > Any industrial/automotive-suitable storage device should detect under-voltage and
> > just treat it as a power-down/loss, and while that isn't nice for the storage device,
> > it really shouldn't be able to brick a device (within <1M cycles anyway).
> > What does the storage module vendor say about this?
>
> Good question. I do not have insights ATM. I'll forward it.

From personal experience I can tell that bricked eMMC devices happen because of
eMMC controller firmware bugs. You might find some recently committed quirk
on this regard.

Waiting for any additional details you might be able to find, given my past
experience I would agree with what Christian wrote.

Francesco