2020-12-07 15:13:13

by Michael Walle

Subject: discard feature, mkfs.ext4 and mmc default fallback to normal erase op

Hi,

The problem I'm having is that I'm trying to install debian on
an embedded system onto an sdcard. During installation it will
format the target filesystem, but the "mkfs.ext4 -F /dev/mmcblk0p2"
takes ages.

What I've found out so far:
- mkfs.ext4 tries to discard all blocks on the target device
- with my target device being an sdcard it seems to fall back
to normal erase [1], with erase_arg being set to what the card
is capable of [2]

Now I'm trying to figure out if this behavior is intended. I guess
one can reduce it to "blkdiscard /dev/mmcblk0p2". Should this
actually fall back to normal erasing or should it return -EOPNOTSUPP?

-michael

[1]
https://elixir.bootlin.com/linux/v5.9.12/source/drivers/mmc/core/block.c#L1063
[2]
https://elixir.bootlin.com/linux/v5.9.12/source/drivers/mmc/core/mmc.c#L1751


2020-12-07 18:38:36

by Theodore Ts'o

Subject: Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op

On Mon, Dec 07, 2020 at 04:10:27PM +0100, Michael Walle wrote:
> Hi,
>
> The problem I'm having is that I'm trying to install debian on
> an embedded system onto an sdcard. During installation it will
> format the target filesystem, but the "mkfs.ext4 -F /dev/mmcblk0p2"
> takes ages.
>
> What I've found out so far:
> - mkfs.ext4 tries to discard all blocks on the target device
> - with my target device being an sdcard it seems to fall back
> to normal erase [1], with erase_arg being set to what the card
> is capable of [2]
>
> Now I'm trying to figure out if this behavior is intended. I guess
> one can reduce it to "blkdiscard /dev/mmcblk0p2". Should this
> actually fall back to normal erasing or should it return -EOPNOTSUPP?

There are three different MMC commands which are defined:

1) DISCARD
2) ERASE
3) SECURE ERASE

The first two are expected to be fast, since it only involves clearing
some metadata fields in the Flash Translation Layer (FTL), so that the
LBA's in the specified range are no longer mapped to a flash page.

The difference between "discard" and "erase" is that "discard" is a
hint, so the device is allowed to ignore it whenever it wants (in
practice, if it's busy doing a GC, or if it's busy writing back blocks
in its writeback cache). "Erase" is guaranteed to work, in that after
an erase, a read from a specified sector MUST return all zeros, but
that can easily be done by redirecting a point in the FTL metadata.

"Secure Erase" is the one which can be slow, since it requires
physically zeroing all of the flash pages (although if the device is
self-encrypting, this in theory could also be fast if you're doing a
secure erase at the granularity of the device's encryption keys, so
all it needs to do is to regenerate the crypto key).

It sounds like your SD card is implementing the "erase" command in a
particularly non-optimal way. If it's common, perhaps we need some
kind of blacklist for drivers with badly implemented erase commands.
As a workaround, you can run mke2fs with the command-line option "-E
nodiscard".

Cheers,

- Ted

P.S. If your SD card got "erase" wrong, I'd be a little worried about
what else the FTL implementation may have screwed up. So you may want to
consider simply getting a different SD card --- especially if this is
something that you plan to distribute as a product to downstream
customers. In general, low-end flash needs to be very carefully
qualified to make sure it is competently implemented if you plan to
deploy in large quantities. An example of what happens if this
qualification process is not done:

https://insideevs.com/news/376037/tesla-mcu-emmc-memory-issue/

Tesla is currently under investigation by the National Highway Traffic
Safety Administration due to cheaping out on their eMMC flash
(probably just a few pennies per unit). Given that customers are
having to pay $1500 to replace their engine controller out of warranty
(and the NHTSA is considering whether or not to force Tesla to eat the
costs, as opposed to forcing their customers to pay $$$), that's an
example of false economy....

2020-12-07 20:41:58

by Michael Walle

Subject: Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op

Hi Ted,

Am 2020-12-07 19:35, schrieb Theodore Y. Ts'o:
> On Mon, Dec 07, 2020 at 04:10:27PM +0100, Michael Walle wrote:
>> Hi,
>>
>> The problem I'm having is that I'm trying to install debian on
>> an embedded system onto an sdcard. During installation it will
>> format the target filesystem, but the "mkfs.ext4 -F /dev/mmcblk0p2"
>> takes ages.
>>
>> What I've found out so far:
>> - mkfs.ext4 tries to discard all blocks on the target device
>> - with my target device being an sdcard it seems to fall back
>> to normal erase [1], with erase_arg being set to what the card
>> is capable of [2]
>>
>> Now I'm trying to figure out if this behavior is intended. I guess
>> one can reduce it to "blkdiscard /dev/mmcblk0p2". Should this
>> actually fall back to normal erasing or should it return -EOPNOTSUPP?
>
> There are three different MMC commands which are defined:
>
> 1) DISCARD
> 2) ERASE
> 3) SECURE ERASE
>
> The first two are expected to be fast, since it only involves clearing
> some metadata fields in the Flash Translation Layer (FTL), so that the
> LBA's in the specified range are no longer mapped to a flash page.

Mh, where is it specified that the erase command is fast? According
to the Physical Layer Simplified Specification Version 8.00:

The actual erase time may be quite long, and the host may issue CMD7
to deselect the card or perform card disconnection, as described in
the Block Write section, above.

Honest question. Also, reading "4.14 Erase Timeout Calculation", it doesn't
sound like erase is fast.

Also there is this comment:
https://elixir.bootlin.com/linux/v5.9.12/source/drivers/mmc/core/core.c#L1495

> The difference between "discard" and "erase" is that "discard" is a
> hint, so the device is allowed to ignore it whenever it wants (in
> practice, if it's busy doing a GC, or if it's busy writing back blocks
> in its writeback cache). "Erase" is guaranteed to work, in that after
> an erase, a read from a specified sector MUST return all zeros, but
> that can easily be done by redirecting a point in the FTL metadata.
>
> "Secure Erase" is the one which can be slow, since it requires
> physically zeroing all of the flash pages (although if the device is
> self-encrypting, this in theory could also be fast if you're doing a
> secure erase at the granularity of the device's encryption keys, so
> all it needs to do is to regenerate the crypto key).
>
> It sounds like your SD card is implementing the "erase" command in a
> particularly non-optimal way. If it's common, perhaps we need some
> kind of blacklist for drivers with badly implemented erase commands.
> As a workaround, you can run mke2fs with the command-line option "-E
> nodiscard".

I've already tested that "mkfs.ext4 -E nodiscard" is fast (or works in
the same way as before the pre-discard feature).

But I wouldn't say it is a cheapo card (Toshiba Exceria). Although I
cannot rule out that it is a Chinese clone, it looks authentic
;)


> P.S. If your SD card got "erase" wrong, I'd be a little worried about
> what else the FTL implementation may have screwed up. So you may want to
> consider simply getting a different SD card --- especially if this is
> something that you plan to distribute as a product to downstream
> customers. In general, low-end flash needs to be very carefully
> qualified to make sure it is competently implemented if you plan to
> deploy in large quantities. An example of what happens if this
> qualification process is not done:
>
> https://insideevs.com/news/376037/tesla-mcu-emmc-memory-issue/
>
> Tesla is currently under investigation by the National Highway Traffic
> Safety Administration due to cheaping out on their eMMC flash
> (probably just a few pennies per unit). Given that customers are
> having to pay $1500 to replace their engine controller out of warranty
> (and the NHTSA is considering whether or not to force Tesla to eat the
> costs, as opposed to forcing their customers to pay $$$), that's an
> example of false economy....

Yeah, I'm aware of the Tesla eMMC wear-out problem. But I'm looking at
this mainly from a user's point of view. Take our product: the user
can freely choose an SD card, only to notice that installing a
distribution is painfully slow. So I'm interested in understanding
the implications: can the erase command really be assumed to be fast?

-michael

2020-12-08 02:42:55

by Theodore Ts'o

Subject: Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op

On Mon, Dec 07, 2020 at 09:39:32PM +0100, Michael Walle wrote:
> > There are three different MMC commands which are defined:
> >
> > 1) DISCARD
> > 2) ERASE
> > 3) SECURE ERASE
> >
> > The first two are expected to be fast, since it only involves clearing
> > some metadata fields in the Flash Translation Layer (FTL), so that the
> > LBA's in the specified range are no longer mapped to a flash page.
>
> Mh, where is it specified that the erase command is fast? According
> to the Physical Layer Simplified Specification Version 8.00:
>
> The actual erase time may be quite long, and the host may issue CMD7
> to deselect the card or perform card disconnection, as described in
> the Block Write section, above.

I looked at the eMMC specification from JEDEC (JESD84-A44) and there,
both the "erase" and "trim" are specified that the work is to be
queued to be done at a time which is convenient to the controller
(read: FTL). This is in contrast to the "secure erase" and "secure
trim" commands, where the erasing has to be done NOW NOW NOW for "high
security applications".

The only difference between "erase" and "trim" seems to be that erase
has to be done in units of the "erase groups" which is typically
larger than the "write pages" which is the granularity required by the
trim command. There is also a comment that when you are erasing the
entire partition, "erase" is preferred over "trim". (Presumably
because it is more convenient? The spec is not clear.)

Unfortunately, the SD Card spec and the eMMC spec both read like they
were written by a standards committee stacked by hardware engineers.
It doesn't look like they had file system engineers in the room,
because the distinctions between "erase" and "trim" are pretty silly,
and not well defined. Aside from what I wrote, the spec is remarkably
silent about what the host OS can depend upon.

From the fs perspective, what we care about is whether or not the
command is a hint or a reliable way to zero a range of sectors. A
command could be a hint if the device is allowed to ignore it, or if
the values of the sector are indeterminate, or if the sectors are
zero'ed or not could change after a power cycle. (I've seen an
implementation where discard would result in the LBA's being read as
zero --- but after a power cycle, reading from the same LBA would
return the old data again. This is standards compliant, but it's not
terribly useful.)

Assuming that the command is reliable, the next question is whether
the erase operation is logical or physical --- which is to say, if an
attacker has physical access to the die, with the ability to bypass
the FTL and directly read the flash cells, could the attacker retrieve
the data, even if it required a destructive, physical attack on the
hardware? A logical erase would not require that the data be erased
or otherwise made inaccessible against an attacker who bypasses the
FTL; a physical erase would provide the security guarantee that even if
your phone has been handed over to a state-sponsored attacker, nothing
could be extracted after a physical erase.

So if I were king, those would be the three levels of discard: "hint",
"reliable logical", and "reliable physical", as those map to real use
cases that are of actual use to a Host. The challenge is mapping what
we *actually* are given by different specs, which were written by
hardware engineers and make distinctions that are not well defined so
that multiple implementations can be "standard compliant", but have
completely different performance profiles, thus making life easy for
the marketing types, and hard for the file system engineers. :-)
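
As a rough sketch of that three-level model, the guarantees could be
written down as an ordered enum, with the three MMC operations classified
the way they are characterized in this thread (not by any spec). All names
below (discard_level, mmc_op, level_of, satisfies) are hypothetical and
exist only for illustration; none of them are kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* The three guarantee levels proposed above, ordered weakest to strongest. */
enum discard_level {
    DISCARD_HINT,              /* device may ignore the request */
    DISCARD_RELIABLE_LOGICAL,  /* reads return zeros; old data may remain in flash */
    DISCARD_RELIABLE_PHYSICAL, /* data is gone even for someone bypassing the FTL */
};

/* The MMC operations named earlier in the thread. */
enum mmc_op { MMC_OP_DISCARD, MMC_OP_ERASE, MMC_OP_SECURE_ERASE };

/* Classification as described in this thread; an assumption, not spec text. */
static enum discard_level level_of(enum mmc_op op)
{
    switch (op) {
    case MMC_OP_DISCARD:      return DISCARD_HINT;
    case MMC_OP_ERASE:        return DISCARD_RELIABLE_LOGICAL;
    case MMC_OP_SECURE_ERASE: return DISCARD_RELIABLE_PHYSICAL;
    }
    assert(!"unknown mmc_op");
    return DISCARD_HINT;
}

/* True if the operation offers at least the guarantee the caller needs. */
static bool satisfies(enum mmc_op op, enum discard_level needed)
{
    return level_of(op) >= needed;
}
```

With this ordering, satisfies(MMC_OP_ERASE, DISCARD_HINT) holds, while
satisfies(MMC_OP_ERASE, DISCARD_RELIABLE_PHYSICAL) does not. The sketch
says nothing about how fast each operation is, which is exactly the part
the specs leave open.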

All I can tell you is that I know a bunch of Android system team
members at $WORK, and the current assumptions seem to work just fine
for the sorts of devices that are used on mobile handsets --- even
really cheap ones that are sold in India. At least, there are a bunch
of "cost optimized" (as well as high end) Android devices running
ext4, and no one has complained to me about mke2fs taking a long time.

I definitely agree with you that the SD Card spec seems to imply that
other standards-compliant implementations could have the erase command
taking minutes, and this seems to be allowable by the spec. I would
consider this to be a flaw in the spec, myself. But I don't sit on
the standards committees, and I don't write the specs. I (and
everyone else) just have to live with them. Sigh....

- Ted

2020-12-08 09:53:11

by Ulf Hansson

Subject: Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op

Hi Ted, Michael,

On Tue, 8 Dec 2020 at 03:41, Theodore Y. Ts'o <[email protected]> wrote:
>
> On Mon, Dec 07, 2020 at 09:39:32PM +0100, Michael Walle wrote:
> > > There are three different MMC commands which are defined:
> > >
> > > 1) DISCARD
> > > 2) ERASE
> > > 3) SECURE ERASE
> > >
> > > The first two are expected to be fast, since it only involves clearing
> > > some metadata fields in the Flash Translation Layer (FTL), so that the
> > > LBA's in the specified range are no longer mapped to a flash page.
> >
> > Mh, where is it specified that the erase command is fast? According
> > to the Physical Layer Simplified Specification Version 8.00:
> >
> > The actual erase time may be quite long, and the host may issue CMD7
> > to deselect the card or perform card disconnection, as described in
> > the Block Write section, above.

Before I go into some more detail, of course I fully agree that
dealing with erase/discard from the eMMC/SD specifications (and other
types of devices) point of view isn't entirely easy. :-)

But I also think we can do better than currently, at least for eMMC/SD.

>
> I looked at the eMMC specification from JEDEC (JESD84-A44) and there,
> both the "erase" and "trim" are specified that the work is to be
> queued to be done at a time which is convenient to the controller
> (read: FTL). This is in contrast to the "secure erase" and "secure
> trim" commands, where the erasing has to be done NOW NOW NOW for "high
> security applications".
>
> The only difference between "erase" and "trim" seems to be that erase
> has to be done in units of the "erase groups" which is typically
> larger than the "write pages" which is the granularity required by the
> trim command. There is also a comment that when you are erasing the
> entire partition, "erase" is preferred over "trim". (Presumably
> because it is more convenient? The spec is not clear.)
>
> Unfortunately, the SD Card spec and the eMMC spec both read like they
> were written by a standards committee stacked by hardware engineers.
> It doesn't look like they had file system engineers in the room,
> because the distinctions between "erase" and "trim" are pretty silly,
> and not well defined. Aside from what I wrote, the spec is remarkably
> silent about what the host OS can depend upon.

Moreover, the specs have evolved over the years. Somehow, we need to
map a REQ_OP_DISCARD and REQ_OP_SECURE_ERASE to the best matching
operation that the currently inserted eMMC/SD card supports...

A long time ago, both the SD and eMMC specs introduced support for
real discard commands, as hints to the card without any
guarantees of what will happen to the data from a logical or a
physical point of view. If the card supports that, we should use it as
the first option for REQ_OP_DISCARD. However, what should we pick as
the second-best option when the card doesn't support discard? That's
where it becomes more tricky. And the same applies to
REQ_OP_SECURE_ERASE, of course.

If you have any suggestions for how we can improve in the above
decisions, feel free to suggest something.

Another issue that most likely is causing poor performance for
REQ_OP_DISCARD/REQ_OP_SECURE_ERASE for eMMC/SD, is that in
mmc_queue_setup_discard() we set up the maximum discard sectors
allowed per request and the discard granularity.

To find performance bottlenecks, I would start looking at what actual
eMMC/SD commands/args we end up mapping towards the
REQ_OP_DISCARD/REQ_OP_SECURE_ERASE requests. Then definitely, I would
also look at the values we end up picking as max discard sectors and
the discard granularity.
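
For the second step, the limits the block layer ends up with can be read
from user space. The following is a minimal sketch, assuming the usual
sysfs queue attributes (discard_granularity and discard_max_bytes under
/sys/block/<dev>/queue/) are present; the helper names queue_attr_path
and read_u64_file are made up for this example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Build "/sys/block/<dev>/queue/<attr>".
 * Returns 0 on success, -1 if the buffer is too small. */
static int queue_attr_path(char *buf, size_t len, const char *dev,
                           const char *attr)
{
    assert(buf && dev && attr);
    int n = snprintf(buf, len, "/sys/block/%s/queue/%s", dev, attr);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read one unsigned decimal value from a sysfs-style file.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
static int read_u64_file(const char *path, uint64_t *out)
{
    assert(path && out);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = fscanf(f, "%" SCNu64, out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

A caller would build the path for, say, "mmcblk0" and
"discard_max_bytes", then pass it to read_u64_file(). A value of 0 in
discard_max_bytes generally means the device does not advertise discard
support at all.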

>
> From the fs perspective, what we care about is whether or not the
> command is a hint or a reliable way to zero a range of sectors. A
> command could be a hint if the device is allowed to ignore it, or if
> the values of the sector are indeterminate, or if the sectors are
> zero'ed or not could change after a power cycle. (I've seen an
> implementation where discard would result in the LBA's being read as
> zero --- but after a power cycle, reading from the same LBA would
> return the old data again. This is standards compliant, but it's not
> terribly useful.)

:-)

>
> Assuming that the command is reliable, the next question is whether
> the erase operation is logical or physical --- which is to say, if an
> attacker has physical access to the die, with the ability to bypass
> the FTL and directly read the flash cells, could the attacker retrieve
> the data, even if it required a destructive, physical attack on the
> hardware? A logical erase would not require that the data be erased
> or otherwise made inaccessible against an attacker who bypasses the
> FTL; a physical erase would provide the security guarantee that even if
> your phone has been handed over to a state-sponsored attacker, nothing
> could be extracted after a physical erase.
>
> So if I were king, those would be the three levels of discard: "hint",
> "reliable logical", and "reliable physical", as those map to real use
> cases that are of actual use to a Host. The challenge is mapping what
> we *actually* are given by different specs, which were written by
> hardware engineers and make distinctions that are not well defined so
> that multiple implementations can be "standard compliant", but have
> completely different performance profiles, thus making life easy for
> the marketing types, and hard for the file system engineers. :-)

I agree, these are the three levels that make sense to support.

Honestly I haven't been paying enough attention to discussions for the
generic block layer around discards. However, considering what you
just stated above, we seem to be missing one request operation, don't
we?

[...]

Kind regards
Uffe

2020-12-08 11:29:36

by Michael Walle

Subject: Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op

Hi Ulf, Hi Ted,

Am 2020-12-08 10:49, schrieb Ulf Hansson:
> On Tue, 8 Dec 2020 at 03:41, Theodore Y. Ts'o <[email protected]> wrote:
>> On Mon, Dec 07, 2020 at 09:39:32PM +0100, Michael Walle wrote:
>> > > There are three different MMC commands which are defined:
>> > >
>> > > 1) DISCARD
>> > > 2) ERASE
>> > > 3) SECURE ERASE
>> > >
>> > > The first two are expected to be fast, since it only involves clearing
>> > > some metadata fields in the Flash Translation Layer (FTL), so that the
>> > > LBA's in the specified range are no longer mapped to a flash page.
>> >
>> > Mh, where is it specified that the erase command is fast? According
>> > to the Physical Layer Simplified Specification Version 8.00:
>> >
>> > The actual erase time may be quite long, and the host may issue CMD7
>> > to deselect the card or perform card disconnection, as described in
>> > the Block Write section, above.
>
> Before I go into some more detail, of course I fully agree that
> dealing with erase/discard from the eMMC/SD specifications (and other
> types of devices) point of view isn't entirely easy. :-)
>
> But I also think we can do better than currently, at least for eMMC/SD.
>
>>
>> I looked at the eMMC specification from JEDEC (JESD84-A44) and there,
>> both the "erase" and "trim" are specified that the work is to be
>> queued to be done at a time which is convenient to the controller
>> (read: FTL). This is in contrast to the "secure erase" and "secure
>> trim" commands, where the erasing has to be done NOW NOW NOW for "high
>> security applications".

Oh this might also be because I've cited from the wrong place, namely
the mmc_init_card() function. But what I really meant was the SD card
equivalent, which should be mmc_read_ssr(). Sorry.

discard_support = UNSTUFF_BITS(resp, 313 - 288, 1);
card->erase_arg = (card->scr.sda_specx && discard_support) ?
SD_DISCARD_ARG : SD_ERASE_ARG;
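
To make that selection easier to follow outside the kernel, here is a
standalone sketch of the same logic. It assumes the 128-bit window is
stored most-significant word first, as the kernel's UNSTUFF_BITS() macro
expects, and that SD_ERASE_ARG and SD_DISCARD_ARG are 0 and 1. The
function names unstuff_bits and sd_pick_erase_arg are mine, not kernel
symbols.

```c
#include <assert.h>
#include <stdint.h>

#define SD_ERASE_ARG   0x00000000u  /* assumed value, see lead-in */
#define SD_DISCARD_ARG 0x00000001u  /* assumed value, see lead-in */

/* Extract `size` bits starting at bit `start` from a 128-bit value stored
 * most-significant word first (resp[0] holds bits 127..96). */
static uint32_t unstuff_bits(const uint32_t resp[4], unsigned start,
                             unsigned size)
{
    assert(size >= 1 && size <= 32 && start + size <= 128);
    uint32_t mask = size < 32 ? (1u << size) - 1u : 0xffffffffu;
    unsigned off = 3 - start / 32;   /* word holding the lowest bit */
    unsigned shift = start % 32;
    uint32_t res = resp[off] >> shift;
    if (size + shift > 32)           /* field spills into the next word */
        res |= resp[off - 1] << (32 - shift);
    return res & mask;
}

/* Pick the erase argument: discard only if the card reports a nonzero
 * sda_specx and sets the discard-support bit (SD Status bit 313; the
 * window passed in starts at bit 288). */
static uint32_t sd_pick_erase_arg(unsigned sda_specx, const uint32_t resp[4])
{
    uint32_t discard_support = unstuff_bits(resp, 313 - 288, 1);
    return (sda_specx && discard_support) ? SD_DISCARD_ARG : SD_ERASE_ARG;
}
```

A card that leaves the discard-support bit clear, like the one discussed
here, always ends up with SD_ERASE_ARG, so every discard request from the
block layer turns into a normal erase.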

>> The only difference between "erase" and "trim" seems to be that erase
>> has to be done in units of the "erase groups" which is typically
>> larger than the "write pages" which is the granularity required by the
>> trim command. There is also a comment that when you are erasing the
>> entire partition, "erase" is preferred over "trim". (Presumably
>> because it is more convenient? The spec is not clear.)
>>
>> Unfortunately, the SD Card spec and the eMMC spec both read like they
>> were written by a standards committee stacked by hardware engineers.
>> It doesn't look like they had file system engineers in the room,
>> because the distinctions between "erase" and "trim" are pretty silly,
>> and not well defined. Aside from what I wrote, the spec is remarkably
>> silent about what the host OS can depend upon.
>
> Moreover, the specs have evolved over the years. Somehow, we need to
> map a REQ_OP_DISCARD and REQ_OP_SECURE_ERASE to the best matching
> operation that the currently inserted eMMC/SD card supports...

Do we really need to map these functions? What if we don't have an
actual discard, but just a slow erase (I'm now assuming that erase
will likely be slow on sdcards)? Can't we just tell the user space
there is no discard? Like on a normal HDD? I really don't know the
implications, seems like mmc_erase() is just there for the linux
discard feature.
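
The alternative I have in mind could look roughly like the sketch below:
use a real discard when the card has one, and only fall back to erase if
a policy switch allows it, otherwise report -EOPNOTSUPP. This is purely
hypothetical; struct card_caps, the allow_erase_fallback flag and
pick_discard_arg() do not exist in the mmc core.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum sd_cmd_arg { ARG_ERASE = 0, ARG_DISCARD = 1 };

/* Hypothetical summary of what a card advertises. */
struct card_caps {
    bool has_discard;  /* real discard (hint) supported */
    bool has_erase;    /* normal erase supported */
};

/* Choose the argument for a discard request.
 * Returns 0 and sets *arg, or -EOPNOTSUPP if nothing suitable exists. */
static int pick_discard_arg(const struct card_caps *caps,
                            bool allow_erase_fallback,
                            enum sd_cmd_arg *arg)
{
    assert(caps && arg);
    if (caps->has_discard) {
        *arg = ARG_DISCARD;
        return 0;
    }
    if (caps->has_erase && allow_erase_fallback) {
        *arg = ARG_ERASE;  /* possibly slow, as seen with mkfs.ext4 */
        return 0;
    }
    return -EOPNOTSUPP;
}
```

With allow_erase_fallback set to true this matches what I observe today;
with false, user space would see the card the same way it sees a disk
without discard support.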

Coming from the user space side: does mkfs.ext4 assume its pre-discard
is fast? I'd think so, right? I'd presume it was intended to tell the
FTL of the block device, "hey these blocks are unused, you can do some
wear leveling with them".

> A long time ago, both the SD and eMMC specs introduced support for
> real discard commands, as hints to the card without any
> guarantees of what will happen to the data from a logical or a
> physical point of view. If the card supports that, we should use it as
> the first option for REQ_OP_DISCARD. However, what should we pick as
> the second-best option when the card doesn't support discard? That's
> where it becomes more tricky. And the same applies to
> REQ_OP_SECURE_ERASE, of course.
>
> If you have any suggestions for how we can improve in the above
> decisions, feel free to suggest something.
>
> Another issue that most likely is causing poor performance for
> REQ_OP_DISCARD/REQ_OP_SECURE_ERASE for eMMC/SD, is that in
> mmc_queue_setup_discard() we set up the maximum discard sectors
> allowed per request and the discard granularity.
>
> To find performance bottlenecks, I would start looking at what actual
> eMMC/SD commands/args we end up mapping towards the
> REQ_OP_DISCARD/REQ_OP_SECURE_ERASE requests. Then definitely, I would
> also look at the values we end up picking as max discard sectors and
> the discard granularity.

I'm about to gather some SD cards and look at how they behave
timing-wise and what they report they support (i.e. erase or discard). Looks
like other cards are doing better. But I'd have to find out if they
support the discard (mine doesn't) and if they are slow too if I force
them to use the normal erase.

>> From the fs perspective, what we care about is whether or not the
>> command is a hint or a reliable way to zero a range of sectors. A
>> command could be a hint if the device is allowed to ignore it, or if
>> the values of the sector are indeterminate, or if the sectors are
>> zero'ed or not could change after a power cycle. (I've seen an
>> implementation where discard would result in the LBA's being read as
>> zero --- but after a power cycle, reading from the same LBA would
>> return the old data again. This is standards compliant, but it's not
>> terribly useful.)
>
> :-)
>
>>
>> Assuming that the command is reliable, the next question is whether
>> the erase operation is logical or physical --- which is to say, if an
>> attacker has physical access to the die, with the ability to bypass
>> the FTL and directly read the flash cells, could the attacker retrieve
>> the data, even if it required a destructive, physical attack on the
>> hardware? A logical erase would not require that the data be erased
>> or otherwise made inaccessible against an attacker who bypasses the
>> FTL; a physical erase would provide the security guarantee that even if
>> your phone has been handed over to a state-sponsored attacker, nothing
>> could be extracted after a physical erase.
>>
>> So if I were king, those would be the three levels of discard: "hint",
>> "reliable logical", and "reliable physical", as those map to real use
>> cases that are of actual use to a Host. The challenge is mapping what
>> we *actually* are given by different specs, which were written by
>> hardware engineers and make distinctions that are not well defined so
>> that multiple implementations can be "standard compliant", but have
>> completely different performance profiles, thus making life easy for
>> the marketing types, and hard for the file system engineers. :-)
>
> I agree, these are the three levels that make sense to support.
>
> Honestly I haven't been paying enough attention to discussions for the
> generic block layer around discards. However, considering what you
> just stated above, we seem to be missing one request operation, don't
> we?

-michael

2020-12-08 16:21:01

by Ulf Hansson

Subject: Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op

On Tue, 8 Dec 2020 at 12:26, Michael Walle <[email protected]> wrote:
>
> Hi Ulf, Hi Ted,
>
> Am 2020-12-08 10:49, schrieb Ulf Hansson:
> > On Tue, 8 Dec 2020 at 03:41, Theodore Y. Ts'o <[email protected]> wrote:
> >> On Mon, Dec 07, 2020 at 09:39:32PM +0100, Michael Walle wrote:
> >> > > There are three different MMC commands which are defined:
> >> > >
> >> > > 1) DISCARD
> >> > > 2) ERASE
> >> > > 3) SECURE ERASE
> >> > >
> >> > > The first two are expected to be fast, since it only involves clearing
> >> > > some metadata fields in the Flash Translation Layer (FTL), so that the
> >> > > LBA's in the specified range are no longer mapped to a flash page.
> >> >
> >> > Mh, where is it specified that the erase command is fast? According
> >> > to the Physical Layer Simplified Specification Version 8.00:
> >> >
> >> > The actual erase time may be quite long, and the host may issue CMD7
> >> > to deselect the card or perform card disconnection, as described in
> >> > the Block Write section, above.
> >
> > Before I go into some more detail, of course I fully agree that
> > dealing with erase/discard from the eMMC/SD specifications (and other
> > types of devices) point of view isn't entirely easy. :-)
> >
> > But I also think we can do better than currently, at least for eMMC/SD.
> >
> >>
> >> I looked at the eMMC specification from JEDEC (JESD84-A44) and there,
> >> both the "erase" and "trim" are specified that the work is to be
> >> queued to be done at a time which is convenient to the controller
> >> (read: FTL). This is in contrast to the "secure erase" and "secure
> >> trim" commands, where the erasing has to be done NOW NOW NOW for "high
> >> security applications".
>
> Oh this might also be because I've cited from the wrong place, namely
> the mmc_init_card() function. But what I really meant was the SD card
> equivalent, which should be mmc_read_ssr(). Sorry.
>
> discard_support = UNSTUFF_BITS(resp, 313 - 288, 1);
> card->erase_arg = (card->scr.sda_specx && discard_support) ?
> SD_DISCARD_ARG : SD_ERASE_ARG;

I assumed you were referring to this, but good that you pointed this
out, for clarity.

>
> >> The only difference between "erase" and "trim" seems to be that erase
> >> has to be done in units of the "erase groups" which is typically
> >> larger than the "write pages" which is the granularity required by the
> >> trim command. There is also a comment that when you are erasing the
> >> entire partition, "erase" is preferred over "trim". (Presumably
> >> because it is more convenient? The spec is not clear.)
> >>
> >> Unfortunately, the SD Card spec and the eMMC spec both read like they
> >> were written by a standards committee stacked by hardware engineers.
> >> It doesn't look like they had file system engineers in the room,
> >> because the distinctions between "erase" and "trim" are pretty silly,
> >> and not well defined. Aside from what I wrote, the spec is remarkably
> >> silent about what the host OS can depend upon.
> >
> > Moreover, the specs have evolved over the years. Somehow, we need to
> > > map a REQ_OP_DISCARD and REQ_OP_SECURE_ERASE to the best matching
> > operation that the currently inserted eMMC/SD card supports...
>
> Do we really need to map these functions? What if we don't have an
> actual discard, but just a slow erase (I'm now assuming that erase
> will likely be slow on sdcards)? Can't we just tell the user space
> there is no discard? Like on a normal HDD?

I have considered that, but not sure what would be the best option.

> I really don't know the
> implications, seems like mmc_erase() is just there for the linux
> discard feature.

mmc_erase() is used for both REQ_OP_DISCARD and REQ_OP_SECURE_ERASE,
but that's an implementation detail that we can change, of course.

Honestly, the whole erase/discard support in the mmc core deserves a
cleanup and I am looking at that (occasionally).

>
> Coming from the user space side: does mkfs.ext4 assume its pre-discard
> is fast? I'd think so, right? I'd presume it was intended to tell the
> FTL of the block device, "hey these blocks are unused, you can do some
> wear leveling with them".

I would assume that too.

On the other hand, I guess there are situations when user space could
live with slow formatting times, in particular if the goal is to let
the card clean up its internal garbage, as a way to improve "performance"
for later I/O writes.

>
> > A long time ago, both the SD and eMMC specs introduced support for
> > real discard commands, as hints to the card without any
> > guarantees of what will happen to the data from a logical or a
> > physical point of view. If the card supports that, we should use it as
> > the first option for REQ_OP_DISCARD. However, what should we pick as
> > the second-best option when the card doesn't support discard? That's
> > where it becomes more tricky. And the same applies to
> > REQ_OP_SECURE_ERASE, of course.
> >
> > If you have any suggestions for how we can improve in the above
> > decisions, feel free to suggest something.
> >
> > Another issue that most likely is causing poor performance for
> > REQ_OP_DISCARD/REQ_OP_SECURE_ERASE for eMMC/SD, is that in
> > mmc_queue_setup_discard() we set up the maximum discard sectors
> > allowed per request and the discard granularity.
> >
> > To find performance bottlenecks, I would start looking at what actual
> > eMMC/SD commands/args we end up mapping towards the
> > REQ_OP_DISCARD/REQ_OP_SECURE_ERASE requests. Then definitely, I would
> > also look at the values we end up picking as max discard sectors and
> > the discard granularity.
>
> I'm just about finding some SD cards and looking how they behave timing
> wise and what they report they support (ie. erase or discard). Looks
> like other cards are doing better. But I'd have to find out if they
> support the discard (mine doesn't) and if they are slow too if I force
> them to use the normal erase.

Sounds great, looking forward to hearing more about your findings.

[...]

Kind regards
Uffe

2020-12-08 16:53:50

by Theodore Ts'o

Subject: Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op

On Tue, Dec 08, 2020 at 12:26:22PM +0100, Michael Walle wrote:
> Do we really need to map these functions? What if we don't have an
> actual discard, but just a slow erase (I'm now assuming that erase
> will likely be slow on sdcards)? Can't we just tell the user space
> there is no discard? Like on a normal HDD? I really don't know the
> implications, seems like mmc_erase() is just there for the linux
> discard feature.

So the potential gotcha here is that "discard" is important for
reducing write amplification, and thus improving the lifespan of
devices. (See my reference to the Tesla engine controller story
earlier.) So if a device doesn't have "discard" but has "erase", and
"erase" is fast, then skipping the discard could end up significantly
reducing the lifespan of your product, and we're back to the NHTSA
investigating whether they should stick Tesla for the $1500 engine
controller replacement when cards die early.

I guess the JEDEC spec does specify a way to query the card for how
long an erase takes, but I don't have the knowledge about how the
actual real-world implementations of these specs (and their many
variants over the years) actually behave. Can the erase times that
they advertise actually be trusted to be accurate? How many of them
actually supply erase times at all, no matter what the spec says?

> Coming from the user space side. Does mkfs.ext4 assume its pre-discard
> is fast? I'd think so, right? I'd presume it was intended to tell the
> FTL of the block device, "hey these blocks are unused, you can do some
> wear leveling with them".

Yes, the assumption is that discard is fast. Exactly how fast seems
to vary; this is one of the reasons why there are three different ways
to do discards on a file system after files are deleted. One way is
to do them once the deletion definitely won't be unwound (i.e., after
the ext4 journal commit). But on some devices, the discard command,
while fast, is slow enough that this will compete with the I/O
completion times of other read commands, thus degrading system
performance. So you can also execute the trim commands out of cron,
using the fstrim command, which will run the discards in the
background, and the system administrator can schedule fstrim for
times when performance isn't critical (e.g., when the phone is on a
charger in the middle of the night, or at 4am local time). Finally,
you can configure e2fsck to run the discards
after the file system consistency check is done.

The reason why we have to leave this up to the system administrators
is that we have essentially no guidance from the device how slow the
discard command might be, how it interferes with other device
operations, and whether the discard might be more likely ignored if
the device is busy. So it might be that the discard will more likely
improve write endurance when it is done when the device is idle. All
of the specs (SCSI, SATA, UFS, eMMC, SD) are remarkably unhelpful
because performance considerations are generally considered "out of
scope" by standards committees. They want to leave that up to market
forces, which is why big companies (handset vendors, hyperscale
cloud providers, systems integrators, etc.) have to spend so much
money doing certification testing before deciding which products to
buy; think of it as a full-employment act for storage engineers. :-)

But yes, mke2fs assumes that discard is sufficiently fast that
doing it at file system format time is extremely reasonable. The
bigger concern is that we can't necessarily count on discard zero'ing
the inode table, and there are robustness reasons (especially
before we had metadata checksums) why file system repairs are
much more robust if the inode table is zero'ed ahead of time.

> I'm just about finding some SD cards and looking how they behave timing
> wise and what they report they support (ie. erase or discard). Looks
> like other cards are doing better. But I'd have to find out if they
> support the discard (mine doesn't) and if they are slow too if I force
> them to use the normal erase.

The challenge is that this sort of thing gets rapidly out of date, and
it's not just SD cards but also eMMC devices which are built into
various embedded devices, high-end SDHC cards, etc., etc. So doing
this gets very expensive.

That being said, both ext4 and f2fs do pre-discards as part of the
format step, since improving write endurance is important; customers
get cranky when their $1000 smart phones die an early death. So an SD
card that behaves the way yours does would probably get disqualified
very early in the certification step if it were ever intended to be
used in an Android handset, since pretty much all Android devices, or
embedded devices for that matter, use either f2fs or ext4. That's one
of the reasons why I was a bit surprised that your device had such an
"interesting" performance profile. Maybe it was intended for use in
digital cameras, and digital cameras don't issue discards? I don't
know....

> > I agree, these are the three levels that make sense to support.
> >
> > Honestly I haven't been paying enough attention to discussions for the
> > generic block layer around discards. However, considering what you
> > just stated above, we seem to be missing one request operation, don't
> > we?

Yes, that's true. We only have "discard" and "secure discard". Part
of that is because those are the only levels which are available for
SSD's, for which I have the same general complaint vis-a-vis standards
committees and the general lack of usefulness for file system
engineers.

For example, pretty much everyone in the enterprise and hyperscale
cloud world assume that low-numbered LBA's have better performance
profiles, and are located physically at the outer diameter of HDD's,
compared to high-numbered LBA's. But that's nothing which is
specified by the standards committees, because "performance
considerations are out of scope". Yet we still have to engineer
storage systems which assume this to be true, even though nothing in
the formal specs guarantees this. We just have to trust that anyone
who tries to sell a HDD for which this isn't true, even if it is
"standards compliant", is going to have a bad time, and trust that
this is enough. (Perhaps this is why when a certain HDD manufacturer
tried to sell HDD's containing drive-managed SMR for the NAS market,
without disclosing this fact to consumers, this generated a massive
backlash.... Simply being standards compliant is not enough.)

- Ted

2020-12-08 21:11:44

by Michael Walle

Subject: Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op

Hi Ulf, Ted,

On 2020-12-08 17:17, Ulf Hansson wrote:
> On Tue, 8 Dec 2020 at 12:26, Michael Walle wrote:
>> > To find performance bottlenecks, I would start looking at what actual
>> > eMMC/SD commands/args we end up mapping towards the
>> > REQ_OP_DISCARD/REQ_OP_SECURE_ERASE requests. Then definitely, I would
>> > also look at the values we end up picking as max discard sectors and
>> > the discard granularity.
>>
>> I'm just about finding some SD cards and looking how they behave
>> timing
>> wise and what they report they support (ie. erase or discard). Looks
>> like other cards are doing better. But I'd have to find out if they
>> support the discard (mine doesn't) and if they are slow too if I force
>> them to use the normal erase.
>
> Sounds great, looking forward to hearing more about your findings.

Ok so sample size is 3 *g*. Two of these cards are actually "fast",
meaning that a discard of any size will take less than a second and
one is the slow card.

I've added tracing to dump the cards' parameters (see patch at the end
of this mail). No card supports discard, they just use the normal erase
method. That wasn't what I was expecting ;)
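
The maximum discard size and discard granularity that Ulf mentions are
also exported by the block layer through sysfs, so they can be checked
without patching the kernel. A minimal sketch, assuming the standard
queue attributes under /sys/block/<disk>/queue; the helper name
read_discard_limits and the sysfs_root parameter are made up for
illustration and are not part of any existing tool:

```python
from pathlib import Path

# Queue attributes that describe how the block layer splits discards.
DISCARD_ATTRS = ("discard_granularity", "discard_max_bytes", "discard_max_hw_bytes")


def read_discard_limits(disk, sysfs_root="/sys/block"):
    """Return the discard-related queue limits of a disk as a dict of ints.

    Attributes that are missing (unknown disk, older kernel) map to None.
    sysfs_root is a hypothetical parameter so the function can be pointed
    at a test directory instead of the real sysfs.
    """
    queue = Path(sysfs_root) / disk / "queue"
    limits = {}
    for name in DISCARD_ATTRS:
        attr = queue / name
        limits[name] = int(attr.read_text().strip()) if attr.is_file() else None
    return limits
```

Calling read_discard_limits("mmcblk1") on the target returns the values
in bytes; a discard_max_bytes of 0 means the device advertises no
discard support to user space at all.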

(1) Fast card (Kingston CANVAS Select Plus, 16GB)
# time blkdiscard -l 536870912 /dev/mmcblk1
real 0m 0.34s
user 0m 0.00s
sys 0m 0.00s

kworker/0:2-81 [000] .... 123.285801: mmc_sd_setup_card:
card->erase_arg=0, au=9 es=512 et=12 eo=3
kworker/1:3H-2368 [001] .... 133.570762: mmc_do_erase: from=0x0
to=0x1fff
kworker/1:3H-2368 [001] .... 133.585204: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.585284: mmc_do_erase:
from=0x2000 to=0x3fff
kworker/1:3H-2368 [001] .... 133.589201: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.589217: mmc_do_erase:
from=0x4000 to=0x5fff
kworker/1:3H-2368 [001] .... 133.591315: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.591330: mmc_do_erase:
from=0x6000 to=0x7fff
kworker/1:3H-2368 [001] .... 133.593202: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.593217: mmc_do_erase:
from=0x8000 to=0x9fff
kworker/1:3H-2368 [001] .... 133.595338: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.595353: mmc_do_erase:
from=0xa000 to=0xbfff
kworker/1:3H-2368 [001] .... 133.597473: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.597488: mmc_do_erase:
from=0xc000 to=0xdfff
kworker/1:3H-2368 [001] .... 133.599605: mmc_do_erase:
mmc_poll_for_busy() done
[..]
kworker/1:3H-2368 [001] .... 133.891681: mmc_do_erase:
from=0xf0000 to=0xf1fff
kworker/1:3H-2368 [001] .... 133.893919: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.893947: mmc_do_erase:
from=0xf2000 to=0xf3fff
kworker/1:3H-2368 [001] .... 133.896186: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.896213: mmc_do_erase:
from=0xf4000 to=0xf5fff
kworker/1:3H-2368 [001] .... 133.898452: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.898481: mmc_do_erase:
from=0xf6000 to=0xf7fff
kworker/1:3H-2368 [001] .... 133.900713: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.900741: mmc_do_erase:
from=0xf8000 to=0xf9fff
kworker/1:3H-2368 [001] .... 133.902979: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.903008: mmc_do_erase:
from=0xfa000 to=0xfbfff
kworker/1:3H-2368 [001] .... 133.905246: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.905274: mmc_do_erase:
from=0xfc000 to=0xfdfff
kworker/1:3H-2368 [001] .... 133.909589: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 133.909620: mmc_do_erase:
from=0xfe000 to=0xfffff
kworker/1:3H-2368 [001] .... 133.911870: mmc_do_erase:
mmc_poll_for_busy() done


(2) Fast card (Panasonic, Unknown model, 8GB)
kworker/0:2-81 [000] .... 492.192453: mmc_sd_setup_card:
card->erase_arg=0, au=9 es=8 et=1 eo=3

I didn't discard the blocks again, so no logs, but it didn't take
long in the first run.


(3) Slow card (Toshiba Exceria, 16GB)
# time blkdiscard -l 536870912 /dev/mmcblk1
real 0m 39.78s
user 0m 0.00s
sys 0m 0.00s

kworker/0:2-81 [000] .... 207.271171: mmc_sd_setup_card:
card->erase_arg=0, au=9 es=512 et=12 eo=3
kworker/1:3H-2368 [001] .... 212.096265: mmc_do_erase: from=0x0
to=0x1fff
kworker/1:3H-2368 [001] .... 212.100282: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 212.100328: mmc_do_erase:
from=0x2000 to=0x3fff
kworker/1:3H-2368 [001] .... 212.102207: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 212.102215: mmc_do_erase:
from=0x4000 to=0x5fff
kworker/1:3H-2368 [001] .... 212.104260: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 212.104267: mmc_do_erase:
from=0x6000 to=0x7fff
kworker/1:3H-2368 [001] .... 213.086808: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 213.086842: mmc_do_erase:
from=0x8000 to=0x9fff
kworker/1:3H-2368 [001] .... 213.149232: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 213.149263: mmc_do_erase:
from=0xa000 to=0xbfff
kworker/1:3H-2368 [001] .... 213.215185: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 213.215216: mmc_do_erase:
from=0xc000 to=0xdfff
kworker/1:3H-2368 [001] .... 213.346672: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 213.346702: mmc_do_erase:
from=0xe000 to=0xffff
kworker/1:3H-2368 [001] .... 213.412594: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 213.412623: mmc_do_erase:
from=0x10000 to=0x11fff
kworker/1:3H-2368 [001] .... 213.478507: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 213.478541: mmc_do_erase:
from=0x12000 to=0x13fff
kworker/1:3H-2368 [001] .... 213.598798: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 213.598829: mmc_do_erase:
from=0x14000 to=0x15fff
kworker/1:3H-2368 [001] .... 213.664721: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 213.664750: mmc_do_erase:
from=0x16000 to=0x17fff
kworker/1:3H-2368 [001] .... 213.730632: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 213.730661: mmc_do_erase:
from=0x18000 to=0x19fff
kworker/1:3H-2368 [001] .... 213.862108: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 213.862138: mmc_do_erase:
from=0x1a000 to=0x1bfff
kworker/1:3H-2368 [001] .... 213.928017: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 213.928046: mmc_do_erase:
from=0x1c000 to=0x1dfff
kworker/1:3H-2368 [001] .... 213.993925: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 213.993954: mmc_do_erase:
from=0x1e000 to=0x1ffff
kworker/1:3H-2368 [001] .... 214.110795: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 214.110827: mmc_do_erase:
from=0x20000 to=0x21fff
kworker/1:3H-2368 [001] .... 214.173232: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 214.173263: mmc_do_erase:
from=0x22000 to=0x23fff
kworker/1:3H-2368 [001] .... 214.239191: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 214.239221: mmc_do_erase:
from=0x24000 to=0x25fff
kworker/1:3H-2368 [001] .... 215.069222: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 215.069253: mmc_do_erase:
from=0x26000 to=0x27fff
kworker/1:3H-2368 [001] .... 215.135138: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 215.135168: mmc_do_erase:
from=0x28000 to=0x29fff
kworker/1:3H-2368 [001] .... 215.197232: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 215.197264: mmc_do_erase:
from=0x2a000 to=0x2bfff
kworker/1:3H-2368 [001] .... 216.040197: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 216.040229: mmc_do_erase:
from=0x2c000 to=0x2dfff
kworker/1:3H-2368 [001] .... 216.158794: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 216.158824: mmc_do_erase:
from=0x2e000 to=0x2ffff
kworker/1:3H-2368 [001] .... 216.221232: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 216.221263: mmc_do_erase:
from=0x30000 to=0x31fff
kworker/1:3H-2368 [001] .... 217.064195: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 217.064226: mmc_do_erase:
from=0x32000 to=0x33fff
kworker/1:3H-2368 [001] .... 217.182794: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 217.182824: mmc_do_erase:
from=0x34000 to=0x35fff
kworker/1:3H-2368 [001] .... 217.245231: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 217.245263: mmc_do_erase:
from=0x36000 to=0x37fff
kworker/1:3H-2368 [001] .... 218.083500: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 218.083532: mmc_do_erase:
from=0x38000 to=0x39fff
kworker/1:3H-2368 [001] .... 218.141223: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 218.141253: mmc_do_erase:
from=0x3a000 to=0x3bfff
kworker/1:3H-2368 [001] .... 218.207130: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 218.207160: mmc_do_erase:
from=0x3c000 to=0x3dfff
kworker/1:3H-2368 [001] .... 219.046630: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 219.046663: mmc_do_erase:
from=0x3e000 to=0x3ffff
kworker/1:3H-2368 [001] .... 219.112564: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 219.112595: mmc_do_erase:
from=0x40000 to=0x41fff
kworker/1:3H-2368 [001] .... 219.230811: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 219.230842: mmc_do_erase:
from=0x42000 to=0x43fff
kworker/1:3H-2368 [001] .... 220.070631: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 220.070665: mmc_do_erase:
from=0x44000 to=0x45fff
kworker/1:3H-2368 [001] .... 220.136551: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 220.136580: mmc_do_erase:
from=0x46000 to=0x47fff
kworker/1:3H-2368 [001] .... 220.254794: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 220.254824: mmc_do_erase:
from=0x48000 to=0x49fff
kworker/1:3H-2368 [001] .... 221.094626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 221.094658: mmc_do_erase:
from=0x4a000 to=0x4bfff
kworker/1:3H-2368 [001] .... 221.160559: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 221.160588: mmc_do_erase:
from=0x4c000 to=0x4dfff
kworker/1:3H-2368 [001] .... 221.278793: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 221.278823: mmc_do_erase:
from=0x4e000 to=0x4ffff
kworker/1:3H-2368 [001] .... 222.118626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 222.118658: mmc_do_erase:
from=0x50000 to=0x51fff
kworker/1:3H-2368 [001] .... 222.184557: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 222.184586: mmc_do_erase:
from=0x52000 to=0x53fff
kworker/1:3H-2368 [001] .... 222.302797: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 222.302829: mmc_do_erase:
from=0x54000 to=0x55fff
kworker/1:3H-2368 [001] .... 223.142627: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 223.142659: mmc_do_erase:
from=0x56000 to=0x57fff
kworker/1:3H-2368 [001] .... 223.208558: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 223.208587: mmc_do_erase:
from=0x58000 to=0x59fff
kworker/1:3H-2368 [001] .... 223.326793: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 223.326823: mmc_do_erase:
from=0x5a000 to=0x5bfff
kworker/1:3H-2368 [001] .... 224.166631: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 224.166663: mmc_do_erase:
from=0x5c000 to=0x5dfff
kworker/1:3H-2368 [001] .... 224.232553: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 224.232582: mmc_do_erase:
from=0x5e000 to=0x5ffff
kworker/1:3H-2368 [001] .... 224.350792: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 224.350822: mmc_do_erase:
from=0x60000 to=0x61fff
kworker/1:3H-2368 [001] .... 225.190627: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 225.190658: mmc_do_erase:
from=0x62000 to=0x63fff
kworker/1:3H-2368 [001] .... 225.256542: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 225.256571: mmc_do_erase:
from=0x64000 to=0x65fff
kworker/1:3H-2368 [001] .... 225.374796: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 225.374827: mmc_do_erase:
from=0x66000 to=0x67fff
kworker/1:3H-2368 [001] .... 226.214627: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 226.214658: mmc_do_erase:
from=0x68000 to=0x69fff
kworker/1:3H-2368 [001] .... 226.333222: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 226.333255: mmc_do_erase:
from=0x6a000 to=0x6bfff
kworker/1:3H-2368 [001] .... 226.399137: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 226.399168: mmc_do_erase:
from=0x6c000 to=0x6dfff
kworker/1:3H-2368 [001] .... 227.238625: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 227.238657: mmc_do_erase:
from=0x6e000 to=0x6ffff
kworker/1:3H-2368 [001] .... 227.304560: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 227.304589: mmc_do_erase:
from=0x70000 to=0x71fff
kworker/1:3H-2368 [001] .... 227.422792: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 227.422822: mmc_do_erase:
from=0x72000 to=0x73fff
kworker/1:3H-2368 [001] .... 228.262629: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 228.262661: mmc_do_erase:
from=0x74000 to=0x75fff
kworker/1:3H-2368 [001] .... 228.328546: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 228.328575: mmc_do_erase:
from=0x76000 to=0x77fff
kworker/1:3H-2368 [001] .... 228.446796: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 228.446827: mmc_do_erase:
from=0x78000 to=0x79fff
kworker/1:3H-2368 [001] .... 229.286630: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 229.286662: mmc_do_erase:
from=0x7a000 to=0x7bfff
kworker/1:3H-2368 [001] .... 229.352545: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 229.352573: mmc_do_erase:
from=0x7c000 to=0x7dfff
kworker/1:3H-2368 [001] .... 229.470792: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 229.470822: mmc_do_erase:
from=0x7e000 to=0x7ffff
kworker/1:3H-2368 [001] .... 230.310627: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 230.310659: mmc_do_erase:
from=0x80000 to=0x81fff
kworker/1:3H-2368 [001] .... 230.376544: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 230.376574: mmc_do_erase:
from=0x82000 to=0x83fff
kworker/1:3H-2368 [001] .... 230.494792: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 230.494822: mmc_do_erase:
from=0x84000 to=0x85fff
kworker/1:3H-2368 [001] .... 231.334626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 231.334658: mmc_do_erase:
from=0x86000 to=0x87fff
kworker/1:3H-2368 [001] .... 231.400542: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 231.400571: mmc_do_erase:
from=0x88000 to=0x89fff
kworker/1:3H-2368 [001] .... 231.518795: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 231.518827: mmc_do_erase:
from=0x8a000 to=0x8bfff
kworker/1:3H-2368 [001] .... 232.358627: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 232.358659: mmc_do_erase:
from=0x8c000 to=0x8dfff
kworker/1:3H-2368 [001] .... 232.477222: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 232.477255: mmc_do_erase:
from=0x8e000 to=0x8ffff
kworker/1:3H-2368 [001] .... 232.543130: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 232.543160: mmc_do_erase:
from=0x90000 to=0x91fff
kworker/1:3H-2368 [001] .... 233.382626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 233.382658: mmc_do_erase:
from=0x92000 to=0x93fff
kworker/1:3H-2368 [001] .... 233.448558: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 233.448587: mmc_do_erase:
from=0x94000 to=0x95fff
kworker/1:3H-2368 [001] .... 233.566793: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 233.566823: mmc_do_erase:
from=0x96000 to=0x97fff
kworker/1:3H-2368 [001] .... 234.406628: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 234.406659: mmc_do_erase:
from=0x98000 to=0x99fff
kworker/1:3H-2368 [001] .... 234.472545: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 234.472574: mmc_do_erase:
from=0x9a000 to=0x9bfff
kworker/1:3H-2368 [001] .... 234.590796: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 234.590827: mmc_do_erase:
from=0x9c000 to=0x9dfff
kworker/1:3H-2368 [001] .... 235.430625: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 235.430656: mmc_do_erase:
from=0x9e000 to=0x9ffff
kworker/1:3H-2368 [001] .... 235.496536: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 235.496566: mmc_do_erase:
from=0xa0000 to=0xa1fff
kworker/1:3H-2368 [001] .... 235.614792: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 235.614822: mmc_do_erase:
from=0xa2000 to=0xa3fff
kworker/1:3H-2368 [001] .... 236.454627: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 236.454657: mmc_do_erase:
from=0xa4000 to=0xa5fff
kworker/1:3H-2368 [001] .... 236.520546: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 236.520575: mmc_do_erase:
from=0xa6000 to=0xa7fff
kworker/1:3H-2368 [001] .... 236.638793: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 236.638824: mmc_do_erase:
from=0xa8000 to=0xa9fff
kworker/1:3H-2368 [001] .... 237.478625: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 237.478656: mmc_do_erase:
from=0xaa000 to=0xabfff
kworker/1:3H-2368 [001] .... 237.544554: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 237.544583: mmc_do_erase:
from=0xac000 to=0xadfff
kworker/1:3H-2368 [001] .... 237.662796: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 237.662827: mmc_do_erase:
from=0xae000 to=0xaffff
kworker/1:3H-2368 [001] .... 238.502625: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 238.502656: mmc_do_erase:
from=0xb0000 to=0xb1fff
kworker/1:3H-2368 [001] .... 238.621222: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 238.621255: mmc_do_erase:
from=0xb2000 to=0xb3fff
kworker/1:3H-2368 [001] .... 238.687131: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 238.687161: mmc_do_erase:
from=0xb4000 to=0xb5fff
kworker/1:3H-2368 [001] .... 239.526626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 239.526657: mmc_do_erase:
from=0xb6000 to=0xb7fff
kworker/1:3H-2368 [001] .... 239.592540: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 239.592569: mmc_do_erase:
from=0xb8000 to=0xb9fff
kworker/1:3H-2368 [001] .... 239.710792: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 239.710822: mmc_do_erase:
from=0xba000 to=0xbbfff
kworker/1:3H-2368 [001] .... 240.550626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 240.550656: mmc_do_erase:
from=0xbc000 to=0xbdfff
kworker/1:3H-2368 [001] .... 240.616539: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 240.616567: mmc_do_erase:
from=0xbe000 to=0xbffff
kworker/1:3H-2368 [001] .... 240.734796: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 240.734828: mmc_do_erase:
from=0xc0000 to=0xc1fff
kworker/1:3H-2368 [001] .... 241.574624: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 241.574655: mmc_do_erase:
from=0xc2000 to=0xc3fff
kworker/1:3H-2368 [001] .... 241.640552: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 241.640581: mmc_do_erase:
from=0xc4000 to=0xc5fff
kworker/1:3H-2368 [001] .... 241.758792: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 241.758823: mmc_do_erase:
from=0xc6000 to=0xc7fff
kworker/1:3H-2368 [001] .... 242.598626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 242.598658: mmc_do_erase:
from=0xc8000 to=0xc9fff
kworker/1:3H-2368 [001] .... 242.664543: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 242.664571: mmc_do_erase:
from=0xca000 to=0xcbfff
kworker/1:3H-2368 [001] .... 242.782792: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 242.782822: mmc_do_erase:
from=0xcc000 to=0xcdfff
kworker/1:3H-2368 [001] .... 243.622626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 243.622658: mmc_do_erase:
from=0xce000 to=0xcffff
kworker/1:3H-2368 [001] .... 243.688545: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 243.688574: mmc_do_erase:
from=0xd0000 to=0xd1fff
kworker/1:3H-2368 [001] .... 243.806795: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 243.806827: mmc_do_erase:
from=0xd2000 to=0xd3fff
kworker/1:3H-2368 [001] .... 244.646625: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 244.646656: mmc_do_erase:
from=0xd4000 to=0xd5fff
kworker/1:3H-2368 [001] .... 244.765222: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 244.765255: mmc_do_erase:
from=0xd6000 to=0xd7fff
kworker/1:3H-2368 [001] .... 244.831131: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 244.831160: mmc_do_erase:
from=0xd8000 to=0xd9fff
kworker/1:3H-2368 [001] .... 245.670626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 245.670658: mmc_do_erase:
from=0xda000 to=0xdbfff
kworker/1:3H-2368 [001] .... 245.736537: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 245.736566: mmc_do_erase:
from=0xdc000 to=0xddfff
kworker/1:3H-2368 [001] .... 245.854792: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 245.854823: mmc_do_erase:
from=0xde000 to=0xdffff
kworker/1:3H-2368 [001] .... 246.694624: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 246.694655: mmc_do_erase:
from=0xe0000 to=0xe1fff
kworker/1:3H-2368 [001] .... 246.760553: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 246.760582: mmc_do_erase:
from=0xe2000 to=0xe3fff
kworker/1:3H-2368 [001] .... 246.878795: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 246.878827: mmc_do_erase:
from=0xe4000 to=0xe5fff
kworker/1:3H-2368 [001] .... 247.718624: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 247.718656: mmc_do_erase:
from=0xe6000 to=0xe7fff
kworker/1:3H-2368 [001] .... 247.784540: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 247.784570: mmc_do_erase:
from=0xe8000 to=0xe9fff
kworker/1:3H-2368 [001] .... 247.902791: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 247.902821: mmc_do_erase:
from=0xea000 to=0xebfff
kworker/1:3H-2368 [001] .... 248.742625: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 248.742657: mmc_do_erase:
from=0xec000 to=0xedfff
kworker/1:3H-2368 [001] .... 248.808554: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 248.808584: mmc_do_erase:
from=0xee000 to=0xeffff
kworker/1:3H-2368 [001] .... 248.926792: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 248.926822: mmc_do_erase:
from=0xf0000 to=0xf1fff
kworker/1:3H-2368 [001] .... 249.766626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 249.766657: mmc_do_erase:
from=0xf2000 to=0xf3fff
kworker/1:3H-2368 [001] .... 249.832544: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 249.832574: mmc_do_erase:
from=0xf4000 to=0xf5fff
kworker/1:3H-2368 [001] .... 249.950797: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 249.950828: mmc_do_erase:
from=0xf6000 to=0xf7fff
kworker/1:3H-2368 [001] .... 250.790626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 250.790658: mmc_do_erase:
from=0xf8000 to=0xf9fff
kworker/1:3H-2368 [001] .... 250.909223: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 250.909256: mmc_do_erase:
from=0xfa000 to=0xfbfff
kworker/1:3H-2368 [001] .... 250.975143: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 250.975173: mmc_do_erase:
from=0xfc000 to=0xfdfff
kworker/1:3H-2368 [001] .... 251.814626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/1:3H-2368 [001] .... 251.814656: mmc_do_erase:
from=0xfe000 to=0xfffff
kworker/1:3H-2368 [001] .... 251.880556: mmc_do_erase:
mmc_poll_for_busy() done

As you can see, some erase operations are fast and some take significantly
longer, whereas on the fast card all of them complete almost
instantaneously.
It looks like the slow card does some kind of background work between
erase cycles.
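
To put numbers on that, the per-command busy time can be read off the
trace by subtracting the "from=... to=..." timestamp from the following
"mmc_poll_for_busy() done" timestamp. A minimal sketch in C, using the
first six erase commands from the slow-card trace above; the struct and
function names are made up for illustration and are not part of the
kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One erase command, with both trace timestamps in microseconds. */
struct erase_sample {
    uint64_t start_us; /* timestamp of the "from=... to=..." line */
    uint64_t done_us;  /* timestamp of "mmc_poll_for_busy() done" */
};

/* First six erase commands copied from the slow-card trace. */
static const struct erase_sample slow_card[] = {
    { 247718656, 247784540 }, /* from=0xe6000 */
    { 247784570, 247902791 }, /* from=0xe8000 */
    { 247902821, 248742625 }, /* from=0xea000 */
    { 248742657, 248808554 }, /* from=0xec000 */
    { 248808584, 248926792 }, /* from=0xee000 */
    { 248926822, 249766626 }, /* from=0xf0000 */
};

/* Busy time of a single erase command in microseconds. */
static uint64_t erase_duration_us(const struct erase_sample *s)
{
    assert(s->done_us >= s->start_us);
    return s->done_us - s->start_us;
}

/* Longest busy time among the first n samples. */
static uint64_t slowest_erase_us(const struct erase_sample *s, size_t n)
{
    uint64_t worst = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t d = erase_duration_us(&s[i]);

        if (d > worst)
            worst = d;
    }
    return worst;
}
```

For these samples the durations come out at roughly 66 ms, 118 ms and
840 ms, and then the same three values repeat, so every third 4 MiB
erase is the slow one.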

The reported parameters of the slow card sound reasonable, e.g. 15s
for 2GiB. Because of this I've changed perf_erase to its max value
for this card (i.e. au * 512):

# time blkdiscard /dev/mmcblk1
real 0m 1.72s
user 0m 0.00s
sys 0m 0.00s

kworker/0:3H-2375 [000] .... 528.308617: mmc_do_erase:
from=0x0 to=0x3fdfff
kworker/0:3H-2375 [000] .... 528.435991: mmc_do_erase:
mmc_poll_for_busy() done
kworker/0:3H-2375 [000] .... 528.436047: mmc_do_erase:
from=0x3fe000 to=0x7fbfff
kworker/0:3H-2375 [000] .... 528.605276: mmc_do_erase:
mmc_poll_for_busy() done
kworker/0:3H-2375 [000] .... 528.605311: mmc_do_erase:
from=0x7fc000 to=0x7ffffe
kworker/0:3H-2375 [000] .... 528.736726: mmc_do_erase:
mmc_poll_for_busy() done
kworker/0:3H-2375 [000] .... 528.736757: mmc_do_erase:
from=0x7fffff to=0xbfdffe
kworker/0:3H-2375 [000] .... 528.926908: mmc_do_erase:
mmc_poll_for_busy() done
kworker/0:3H-2375 [000] .... 528.926940: mmc_do_erase:
from=0xbfdfff to=0xffbffe
kworker/0:3H-2375 [000] .... 529.189489: mmc_do_erase:
mmc_poll_for_busy() done
kworker/0:3H-2375 [000] .... 529.189520: mmc_do_erase:
from=0xffbfff to=0xfffffd
kworker/0:3H-2375 [000] .... 529.386494: mmc_do_erase:
mmc_poll_for_busy() done
kworker/0:3H-2375 [000] .... 529.386524: mmc_do_erase:
from=0xfffffe to=0x13fdffd
kworker/0:3H-2375 [000] .... 529.629276: mmc_do_erase:
mmc_poll_for_busy() done
kworker/0:3H-2375 [000] .... 529.629307: mmc_do_erase:
from=0x13fdffe to=0x17fbffd
kworker/0:3H-2375 [000] .... 529.760731: mmc_do_erase:
mmc_poll_for_busy() done
kworker/0:3H-2375 [000] .... 529.760762: mmc_do_erase:
from=0x17fbffe to=0x17ffffc
kworker/0:3H-2375 [000] .... 529.892180: mmc_do_erase:
mmc_poll_for_busy() done
kworker/0:3H-2375 [000] .... 529.892211: mmc_do_erase:
from=0x17ffffd to=0x1bfdffc
kworker/0:3H-2375 [000] .... 530.023626: mmc_do_erase:
mmc_poll_for_busy() done
kworker/0:3H-2375 [000] .... 530.023656: mmc_do_erase:
from=0x1bfdffd to=0x1cd9fff
kworker/0:3H-2375 [000] .... 530.032057: mmc_do_erase:
mmc_poll_for_busy() done

Now there is a comment about the "perf_erase" that states it should
be small to allow other I/O. But maybe we could also take the erase
time into account and allow larger erase sizes.
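
One possible way to take the erase time into account is to size each
request from the timing the card advertises in its SD Status register.
The sketch below assumes the usual SSR relation, i.e. the timeout for X
AUs is ERASE_TIMEOUT / ERASE_SIZE * X + ERASE_OFFSET. The struct, the
function names and the idea of a per-request time budget are
hypothetical and only meant to illustrate the arithmetic; this is not
kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Erase timing fields as read from the SD Status register (SSR). */
struct ssr_erase_info {
    uint32_t es; /* ERASE_SIZE: number of AUs covered by et */
    uint32_t et; /* ERASE_TIMEOUT: seconds to erase es AUs */
    uint32_t eo; /* ERASE_OFFSET: fixed seconds added per command */
};

/* Worst-case busy time in ms the card advertises for erasing qty AUs. */
static uint64_t advertised_erase_ms(const struct ssr_erase_info *i,
                                    uint32_t qty)
{
    assert(i->es != 0);
    return (uint64_t)i->et * 1000 * qty / i->es + (uint64_t)i->eo * 1000;
}

/*
 * Largest AU count (capped at limit) whose advertised erase time still
 * fits into budget_ms. Returns 0 if not even one AU fits.
 */
static uint32_t max_aus_within_budget(const struct ssr_erase_info *i,
                                      uint64_t budget_ms, uint32_t limit)
{
    uint32_t qty = 0;

    while (qty < limit && advertised_erase_ms(i, qty + 1) <= budget_ms)
        qty++;
    return qty;
}
```

With the assumed values es=512, et=15 and eo=0 (my reading of "15s for
2GiB" with a 4 MiB AU), a 1 s budget would allow 34 AUs per request,
i.e. 136 MiB instead of a single 4 MiB AU. Whether the advertised
numbers can be trusted is a separate question.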

-michael



diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 19f1ee57fb34..e126a01414be 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -1675,6 +1675,8 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from,
to <<= 9;
}

+ trace_printk("from=0x%x to=0x%x\n", from, to);
+
if (mmc_card_sd(card))
cmd.opcode = SD_ERASE_WR_BLK_START;
else
@@ -1747,6 +1749,7 @@ static int mmc_do_erase(struct mmc_card *card, unsigned int from,
/* Let's poll to find out when the erase operation completes. */
err = mmc_poll_for_busy(card, busy_timeout, MMC_BUSY_ERASE);

+ trace_printk("mmc_poll_for_busy() done\n");
out:
mmc_retune_release(card->host);
return err;
diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c
index 6f054c449d46..5e48a2cd4ad3 100644
--- a/drivers/mmc/core/sd.c
+++ b/drivers/mmc/core/sd.c
@@ -291,6 +291,8 @@ static int mmc_read_ssr(struct mmc_card *card)
card->erase_arg = (card->scr.sda_specx && discard_support) ?
SD_DISCARD_ARG : SD_ERASE_ARG;

+ trace_printk("card->erase_arg=%d, au=%d es=%d et=%d eo=%d\n", card->erase_arg, au, es, et, eo);
+
return 0;
}

2020-12-09 15:25:46

by Ulf Hansson

Subject: Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op

On Tue, 8 Dec 2020 at 17:52, Theodore Y. Ts'o <[email protected]> wrote:
>
> On Tue, Dec 08, 2020 at 12:26:22PM +0100, Michael Walle wrote:
> > Do we really need to map these functions? What if we don't have an
> > actual discard, but just a slow erase (I'm now assuming that erase
> > will likely be slow on sdcards)? Can't we just tell the user space
> > there is no discard? Like on a normal HDD? I really don't know the
> > implications, seems like mmc_erase() is just there for the linux
> > discard feature.
>
> So the potential gotcha here is that "discard" is important for
> reducing write amplification, and thus improving the lifespan of
> devices. (See my reference to the Tesla engine controller story
> earlier.) So if a device doesn't have "discard" but has "erase", and
> "erase" is fast, then skipping the discard could end up significantly
> reducing the lifespan of your product, and we're back to the NHTSA
> investigating whether they should stick Tesla for the $1500 engine
> controller replacement when cards die early.

Yes, exactly. The points about wear leveling and the lifespan of the
device are critical.

That said, we should continue to map discard requests to legacy erase
commands for SD cards, unless the card supports the new discard, of
course.

One thing I realized though, is that we should probably announce and
implement support for secure erase (QUEUE_FLAG_SECERASE) for SD cards,
as that seems to map well to the erase command.

The SD spec specifies that after an erase the data is either "0" or
"1", which I guess is what is expected from a
REQ_OP_SECURE_ERASE operation?

>
> I guess the JEDEC spec does specify a way to query the card for how
> long an erase takes, but I don't have the knowledge about how the
> actual real-world implementations of these specs (and their many
> variants over the years) actually behave. Can the erase times that
> they advertise actually be trusted to be accurate? How many of them
> actually supply erase times at all, no matter what the spec says?

For eMMC, discard commands are fast, and trim commands probably are too.
For erase, I don't know.

As for the corresponding "erase times" that are specified in
the eMMC registers, I guess those always refer to the worst-case
scenario. I don't know how useful they really are in the end.

In any case, we may end up with poor erase/discard performance,
because of internal FW implementations.

That said, what I think we may be able to improve, for both eMMC and
SD, is to allow more blocks per discard/erase operation.
But honestly, I don't know how big a problem this is, even if just
staring at the code gives me some ideas.

>
> > Coming from the user space side. Does mkfs.ext4 assume its pre-discard
> > is fast? I'd think so, right? I'd presume it was intended to tell the
> > FTL of the block device, "hey these blocks are unused, you can do some
> > wear leveling with them".
>
> Yes, the assumption is that discard is fast. Exactly how fast seems
> to vary; this is one of the reasons why there are three different ways
> to do discards on a file system after files are deleted. One way is
> to do them after the deletion definitely won't be unwound (i.e., after
> the ext4 journal commit). But on some devices, the discard command,
> while fast, is slow enough that this will compete with the I/O
> completion times of other read commands, thus degrading system
> performance. So you can also execute the trim commands out of cron,
> using the fstrim command, which will run the discards in the
> background, and the system administrator can adjust when fstrim is
> executed during times when performance isn't critical. (e.g., when
> the phone is on a charger in the middle of the night, or at 4am local
> time, etc.) Finally, you can configure e2fsck to run the discards
> after the file system consistency check is done.
>
> The reason why we have to leave this up to the system administrators
> is that we have essentially no guidance from the device how slow the
> discard command might be, how it interferes with other device
> operations, and whether the discard might be more likely ignored if
> the device is busy. So it might be that the discard will more likely
> improve write endurance when it is done when the device is idle. All
> of the specs (SCSI, SATA, UFS, eMMC, SD) are remarkably unhelpful
> because performance considerations are generally considered "out of
> scope" by standards committees. They want to leave that up to market
> forces, which is why big companies (handset vendors, hyperscale
> cloud providers, systems integrators, etc.) have to spend as much
> money doing certification testing before deciding which products to
> buy; think of it as a full-employment act for storage engineers. :-)
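
For reference, the "run the discards in the background" option quoted
above boils down to the FITRIM ioctl that fstrim issues against a
mounted file system. A minimal sketch in C, assuming a Linux system
with <linux/fs.h>; the helper name trim_free_space() is made up for
illustration, and error handling is reduced to returning -errno:

```c
#include <assert.h>
#include <errno.h>
#include <linux/fs.h>  /* FITRIM, struct fstrim_range */
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Ask the file system behind fd (an open descriptor for the mount
 * point) to discard all of its free space in extents of at least
 * minlen bytes. On success returns 0 and stores the number of bytes
 * the file system reported as trimmed; on failure returns -errno.
 */
static int trim_free_space(int fd, uint64_t minlen, uint64_t *trimmed)
{
    struct fstrim_range range = {
        .start = 0,
        .len = UINT64_MAX, /* cover the whole file system */
        .minlen = minlen,
    };

    assert(trimmed != NULL);

    if (ioctl(fd, FITRIM, &range) < 0)
        return -errno;

    /* The kernel writes the trimmed byte count back into range.len. */
    *trimmed = range.len;
    return 0;
}
```

A caller would open the mount point with open(path, O_RDONLY), call
trim_free_space(fd, 0, &bytes), and treat -EOPNOTSUPP as "no discard
support here". When to call it (cron, charger plugged in, 4am) is
exactly the policy decision that is left to the system administrator.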

A few comments related to the above.

Even if the discarded blocks are flushed at some wisely selected
point, when the device is idle, that doesn't guarantee that the
internal garbage collection runs inside the device. In the end that
depends on the FW implementation of the card - and I assume it's
likely triggered based on some internal idle time and the amount of
"garbage" there is to deal with.

For both eMMC and SD cards, the specs define commands for how to
manually control the background operations inside the cards. In
principle, this allows us to tell the card when it's a good time to
run garbage collection (and when not to).

Both for eMMC and SD, we are not using this, yet. However, I have been
playing with a couple of different ideas to explore this:

*) Use the runtime PM framework to detect an idle period and then
trigger background operations. The problem is, that we don't really
know how long we will be idle, meaning that we don't know if it's
really a wise decision to trigger the background operations in the
end.

**) Invent a new type of generic block request, as to let userspace
trigger this.

Of course, another option is also to leave this as is, thus relying on
the internal FW of the card to act the best it can.

Do you have any thoughts around this?

[...]

Kind regards
Uffe

2020-12-09 16:39:46

by Theodore Ts'o

Subject: Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op

On Wed, Dec 09, 2020 at 03:51:24PM +0100, Ulf Hansson wrote:
>
> Even if the discarded blocks are flushed at some wisely selected
> point, when the device is idle, that doesn't guarantee that the
> internal garbage collection runs inside the device. In the end that
> depends on the FW implementation of the card - and I assume it's
> likely triggered based on some internal idle time and the amount of
> "garbage" there is to deal with.

At least from a file system perspective, I don't care when the
internal garbage collection actually runs inside the device. What I
do care about is that (a) a read of a discarded sector returns zeros
after it has been discarded (or the storage device needs to tell me I
can't count on that), and (b) that, for write endurance reasons,
the garbage collection will *eventually* happen.

If the list of erase blocks or flash pages that are not in use are
tracked in such a way that they are actually garbage collected before
the device actually needs free blocks, it really doesn't matter if it
happens right away, or hours later. (If the device is 90% free,
because it was just formatted and we did a pre-discard at format time,
then it could happen hours or days later.)

But if the device's FTL is too incompetent such that it loses track of
which erase blocks / flash pages do need to be GC'ed, such that it
impacts device lifetime... well then, that's sad, and it would be nice
to find out about this without having to do an expensive,
time-consuming certification process. (OTOH, all the big companies
are doing hardware certifications anyway, because you can't fully
trust the storage vendors, and how many storage vendors are really
going to admit, or make it easy to determine, "the FTL is so
cost-optimized that it's cr*p"? :-)

Having a way to tell the storage device that it would be better to
suspend GC, or to accelerate GC, because we know the device is about
to become much less likely to perform writes, would certainly be a
good and useful thing to do, although I see that as mostly being
useful for improving I/O performance, especially for low-end flash ---
I suspect that for high-end SSD's, which are designed so that they can
handle continuous write streams without much performance degradation,
they have enough oomph in their internal CPU that they can do GC's in
real-time while the device is under a continuous random write workload
with only minimal performance impacts.

> *) Use the runtime PM framework to detect an idle period and then
> trigger background operations. The problem is, that we don't really
> know how long we will be idle, meaning that we don't know if it's
> really a wise decision to trigger the background operations in the
> end.
>
> **) Invent a new type of generic block request, as to let userspace
> trigger this.

I think you really want to give userspace the ability to trigger this.
Whether it's via a generic block request, or an ioctl, I'll leave that
to the people who maintain the driver and/or block layer. That's because
userspace will have knowledge of things like "the screen is off", or
"the phone is on the wireless charger", and/or the user has said "OK,
Google, goodnight" to trigger the night-time home automation commands.

We can of course try to make some automatic determinations based on
the runtime PM framework, but that doesn't necessarily tell us the
likelihood that the system will become busy in the future; OTOH, maybe
that doesn't matter, if the storage needs only a very tiny amount of
time after it's told, "stop GC", to finish up what it's doing so it
can respond to I/O requests at full speed?

- Ted