2019-02-17 20:36:16

by Ric Wheeler

Subject: [LSF/MM TOPIC] More async operations for file systems - async discard?

One proposal for btrfs was that we should look at getting discard out of the
synchronous path in order to minimize the slowdown associated with enabling
discard at mount time. Seems like an obvious win for "hint" like operations like
discard.

I do wonder where we stand now with the cost of the various discard commands -
how painful is it for modern SSD's? Do we have a good sense of how discard
performance scales as the request size increases? Do most devices "no op" a
discard operation when issued against an already discarded region?

Would this be an interesting topic to discuss in a shared block/file system session?

Regards,

Ric




2019-02-17 21:09:54

by Dave Chinner

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

On Sun, Feb 17, 2019 at 03:36:10PM -0500, Ric Wheeler wrote:
> One proposal for btrfs was that we should look at getting discard
> out of the synchronous path in order to minimize the slowdown
> associated with enabling discard at mount time. Seems like an
> obvious win for "hint" like operations like discard.

We already have support for that. blkdev_issue_discard() is
synchronous, yes, but __blkdev_issue_discard() will only build the
discard bio chain - it is up to the caller to submit and wait for it.

Some callers (XFS, dm-thinp, nvmet, etc) use a bio completion to
handle the discard IO completion, hence allowing async dispatch and
processing of the discard chain without blocking the caller. Others
(like ext4) simply call submit_bio_wait() to wait synchronously for
completion of the discard bio chain.
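
i.e. something like this - made-up function names, untested, just to
show the shape of the async path:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Runs (in irq context) once the whole discard bio chain completes.
 * A real caller would look at bio->bi_status and do its bookkeeping
 * here before dropping its bio reference.
 */
static void foo_discard_end_io(struct bio *bio)
{
        bio_put(bio);
}

/*
 * Build the discard chain for the range and fire it off without
 * blocking - completion is only reported via the callback above.
 */
static int foo_discard_extent_async(struct block_device *bdev,
                                    sector_t sector, sector_t nr_sects)
{
        struct bio *bio = NULL;
        int error;

        error = __blkdev_issue_discard(bdev, sector, nr_sects,
                                       GFP_NOFS, 0, &bio);
        if (!error && bio) {
                bio->bi_end_io = foo_discard_end_io;
                submit_bio(bio);        /* does not wait */
        }
        return error;
}

The synchronous variant just calls submit_bio_wait() on that same
chain, which is all the blkdev_issue_discard() wrapper does for you.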

> I do wonder where we stand now with the cost of the various discard
> commands - how painful is it for modern SSD's?

AIUI, it still depends on the SSD implementation, unfortunately.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-02-17 23:43:06

by Ric Wheeler

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

On 2/17/19 4:09 PM, Dave Chinner wrote:
> On Sun, Feb 17, 2019 at 03:36:10PM -0500, Ric Wheeler wrote:
>> One proposal for btrfs was that we should look at getting discard
>> out of the synchronous path in order to minimize the slowdown
>> associated with enabling discard at mount time. Seems like an
>> obvious win for "hint" like operations like discard.
> We already have support for that. blkdev_issue_discard() is
> synchronous, yes, but __blkdev_issue_discard() will only build the
> discard bio chain - it is up to the caller to submit and wait for it.
>
> Some callers (XFS, dm-thinp, nvmet, etc) use a bio completion to
> handle the discard IO completion, hence allowing async dispatch and
> processing of the discard chain without blocking the caller. Others
> (like ext4) simply call submit_bio_wait() to wait synchronously for
> completion of the discard bio chain.
>
>> I do wonder where we stand now with the cost of the various discard
>> commands - how painful is it for modern SSD's?
> AIUI, it still depends on the SSD implementation, unfortunately.

I think the variability makes life really miserable for layers above it.

Might be worth constructing some tooling that we can use to validate or shame
vendors over - testing things like a full device discard, discard of fs block
size and big chunks, discard against already discarded, etc.
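
Even something trivial that just times BLKDISCARD over a range would be
a start. A rough, untested sketch of what I have in mind (no size
sweeps or repeat runs, error handling trimmed):

/* discard-timer.c - time one discard of [offset, length) on a device.
 * Usage: discard-timer /dev/sdX <offset-bytes> <length-bytes>
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* BLKDISCARD */

int main(int argc, char **argv)
{
        uint64_t range[2];      /* { offset, length } in bytes */
        struct timespec t0, t1;
        int fd;

        if (argc != 4)
                return 1;
        fd = open(argv[1], O_WRONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        range[0] = strtoull(argv[2], NULL, 0);
        range[1] = strtoull(argv[3], NULL, 0);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (ioctl(fd, BLKDISCARD, range)) {
                perror("BLKDISCARD");
                return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%llu bytes discarded in %.3f ms\n",
               (unsigned long long)range[1],
               (t1.tv_sec - t0.tv_sec) * 1000.0 +
               (t1.tv_nsec - t0.tv_nsec) / 1e6);
        close(fd);
        return 0;
}

Run it over the same range twice and you also get the "discard against
already discarded" number for free.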

Regards,

Ric




2019-02-18 02:22:50

by Dave Chinner

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

On Sun, Feb 17, 2019 at 06:42:59PM -0500, Ric Wheeler wrote:
> On 2/17/19 4:09 PM, Dave Chinner wrote:
> >On Sun, Feb 17, 2019 at 03:36:10PM -0500, Ric Wheeler wrote:
> >>One proposal for btrfs was that we should look at getting discard
> >>out of the synchronous path in order to minimize the slowdown
> >>associated with enabling discard at mount time. Seems like an
> >>obvious win for "hint" like operations like discard.
> >We already have support for that. blkdev_issue_discard() is
> >synchronous, yes, but __blkdev_issue_discard() will only build the
> >discard bio chain - it is up to the caller to submit and wait for it.
> >
> >Some callers (XFS, dm-thinp, nvmet, etc) use a bio completion to
> >handle the discard IO completion, hence allowing async dispatch and
> >processing of the discard chain without blocking the caller. Others
> >(like ext4) simply call submit_bio_wait() to wait synchronously for
> >completion of the discard bio chain.
> >
> >>I do wonder where we stand now with the cost of the various discard
> >>commands - how painful is it for modern SSD's?
> >AIUI, it still depends on the SSD implementation, unfortunately.
>
> I think the variability makes life really miserable for layers above it.

Yup, that it does.

> Might be worth constructing some tooling that we can use to validate
> or shame vendors over

That doesn't seem to work.

> - testing things like a full device discard,
> discard of fs block size and big chunks, discard against already
> discarded, etc.

We did that many years ago because discard on SSDs sucked:

https://people.redhat.com/lczerner/discard/test_discard.html
https://sourceforge.net/projects/test-discard/files/

And, really, that didn't change a thing - discard still sucks...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-02-18 22:30:55

by Ric Wheeler

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

On 2/17/19 9:22 PM, Dave Chinner wrote:
> On Sun, Feb 17, 2019 at 06:42:59PM -0500, Ric Wheeler wrote:
>> On 2/17/19 4:09 PM, Dave Chinner wrote:
>>> On Sun, Feb 17, 2019 at 03:36:10PM -0500, Ric Wheeler wrote:
>>>> One proposal for btrfs was that we should look at getting discard
>>>> out of the synchronous path in order to minimize the slowdown
>>>> associated with enabling discard at mount time. Seems like an
>>>> obvious win for "hint" like operations like discard.
>>> We already have support for that. blkdev_issue_discard() is
>>> synchronous, yes, but __blkdev_issue_discard() will only build the
>>> discard bio chain - it is up to the caller to submit and wait for it.
>>>
>>> Some callers (XFS, dm-thinp, nvmet, etc) use a bio completion to
>>> handle the discard IO completion, hence allowing async dispatch and
>>> processing of the discard chain without blocking the caller. Others
>>> (like ext4) simply call submit_bio_wait() to wait synchronously for
>>> completion of the discard bio chain.
>>>
>>>> I do wonder where we stand now with the cost of the various discard
>>>> commands - how painful is it for modern SSD's?
>>> AIUI, it still depends on the SSD implementation, unfortunately.
>> I think the variability makes life really miserable for layers above it.
> Yup, that it does.
>
>> Might be worth constructing some tooling that we can use to validate
>> or shame vendors over
> That doesn't seem to work.
>
>> - testing things like a full device discard,
>> discard of fs block size and big chunks, discard against already
>> discarded, etc.
> We did that many years ago because discard on SSDs sucked:
>
> https://people.redhat.com/lczerner/discard/test_discard.html
> https://sourceforge.net/projects/test-discard/files/
>
> And, really, that didn't change a thing - discard still sucks...
>
> Cheers,
>
> Dave.

Totally forgot about that work - maybe it is time to try again and poke some
vendors.

Ric



2019-02-20 23:47:28

by Keith Busch

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

On Sun, Feb 17, 2019 at 06:42:59PM -0500, Ric Wheeler wrote:
> I think the variability makes life really miserable for layers above it.
>
> Might be worth constructing some tooling that we can use to validate or
> shame vendors over - testing things like a full device discard, discard of
> fs block size and big chunks, discard against already discarded, etc.

With respect to fs block sizes, one thing making discards suck is that
many high capacity SSDs' physical page sizes are larger than the fs block
size, and a sub-page discard is worse than doing nothing.

We've discussed previously about supporting block size larger than
the system's page size, but it doesn't look like that's gone anywhere.
Maybe it's worth revisiting since it's really inefficient if you write
or discard at the smaller granularity.


2019-02-21 20:08:08

by Dave Chinner

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

On Wed, Feb 20, 2019 at 04:47:24PM -0700, Keith Busch wrote:
> On Sun, Feb 17, 2019 at 06:42:59PM -0500, Ric Wheeler wrote:
> > I think the variability makes life really miserable for layers above it.
> >
> > Might be worth constructing some tooling that we can use to validate or
> > shame vendors over - testing things like a full device discard, discard of
> > fs block size and big chunks, discard against already discarded, etc.
>
> With respect to fs block sizes, one thing making discards suck is that
> many high capacity SSDs' physical page sizes are larger than the fs block
> size, and a sub-page discard is worse than doing nothing.
>
> We've discussed previously about supporting block size larger than
> the system's page size, but it doesn't look like that's gone anywhere.

You mean in filesystems? Work for XFS is in progress:

https://lwn.net/Articles/770975/

But it's still only a maximum of 64k block size. Essentially, that's
a hard limit baked into the on-disk format (similar to the max sector
size limit of 32k).

> Maybe it's worth revisiting since it's really inefficient if you write
> or discard at the smaller granularity.

Filesystems discard extents these days, not individual blocks. If
you free a 1MB file, then you are likely to get a 1MB discard. Or if
you use fstrim, then it's free space extent sizes (on XFS these can be
hundreds of GBs) and small free spaces can be ignored. So the
filesystem block size is often not an issue at all...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-02-21 23:55:11

by Jeff Mahoney

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

On 2/20/19 6:47 PM, Keith Busch wrote:
> On Sun, Feb 17, 2019 at 06:42:59PM -0500, Ric Wheeler wrote:
>> I think the variability makes life really miserable for layers above it.
>>
>> Might be worth constructing some tooling that we can use to validate or
>> shame vendors over - testing things like a full device discard, discard of
>> fs block size and big chunks, discard against already discarded, etc.
>
> With respect to fs block sizes, one thing making discards suck is that
> many high capacity SSDs' physical page sizes are larger than the fs block
> size, and a sub-page discard is worse than doing nothing.
>
> We've discussed previously about supporting block size larger than
> the system's page size, but it doesn't look like that's gone anywhere.
> Maybe it's worth revisiting since it's really inefficient if you write
> or discard at the smaller granularity.

Isn't this addressing the problem at the wrong layer? There are other
efficiencies to be gained by larger block sizes, but better discard
behavior is a side effect. As Dave said, the major file systems already
assemble contiguous extents as large as we can before sending them to
discard. The lower bound for that is the larger of the minimum lengths
passed by the user or provided by the block layer. We've always been
told "don't worry about what the internal block size is, that only
matters to the FTL." That's obviously not true, but when devices only
report a 512 byte granularity, we believe them and will issue discard
for the smallest size that makes sense for the file system regardless of
whether it makes sense (internally) for the SSD. That means 4k for
pretty much anything except btrfs metadata nodes, which are 16k.

So, I don't think changing the file system block size is the right
approach. It *may* bring benefits, but I think many of the same
benefits can be gained by using the minimum-size option for fstrim and
allowing the discard mount options to accept a minimum size as well.
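
The knob already exists at the ioctl level; roughly what fstrim -m does
under the hood (untested sketch):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FITRIM, struct fstrim_range */

/* Trim free space of the filesystem mounted at 'path', skipping any
 * free extent smaller than 'minlen' bytes - e.g. the SSD's internal
 * page size, if we knew it. */
static int trim_with_minlen(const char *path, unsigned long long minlen)
{
        struct fstrim_range range;
        int fd, ret;

        fd = open(path, O_RDONLY);
        if (fd < 0)
                return -1;

        memset(&range, 0, sizeof(range));
        range.len = ULLONG_MAX;         /* whole filesystem */
        range.minlen = minlen;

        ret = ioctl(fd, FITRIM, &range);
        if (!ret)
                printf("%s: %llu bytes trimmed\n", path,
                       (unsigned long long)range.len);
        close(fd);
        return ret;
}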

-Jeff

--
Jeff Mahoney
SUSE Labs

2019-02-22 02:51:26

by Martin K. Petersen

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?


Keith,

> With respect to fs block sizes, one thing making discards suck is that
> many high capacity SSDs' physical page sizes are larger than the fs
> block size, and a sub-page discard is worse than doing nothing.

That ties into the whole zeroing as a side-effect thing.

The devices really need to distinguish between discard-as-a-hint where
it is free to ignore anything that's not a whole multiple of whatever
the internal granularity is, and the WRITE ZEROES use case where the end
result needs to be deterministic.

--
Martin K. Petersen Oracle Linux Engineering

2019-02-22 03:01:45

by Martin K. Petersen

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?


Jeff,

> We've always been told "don't worry about what the internal block size
> is, that only matters to the FTL." That's obviously not true, but
> when devices only report a 512 byte granularity, we believe them and
> will issue discard for the smallest size that makes sense for the file
> system regardless of whether it makes sense (internally) for the SSD.
> That means 4k for pretty much anything except btrfs metadata nodes,
> which are 16k.

The devices are free to report a bigger discard granularity. We already
support and honor that (for SCSI, anyway). It's completely orthogonal to
the reported logical block size, although it obviously needs to be a
multiple.
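
(What a device reports is visible in sysfs, e.g.
/sys/block/<dev>/queue/discard_granularity - a quick, untested way to
check:)

#include <stdio.h>

/* Print the discard granularity the kernel exposes for a device,
 * e.g. check_granularity("sda"); 0 means discard isn't supported. */
static void check_granularity(const char *dev)
{
        char path[256];
        unsigned long long gran;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/block/%s/queue/discard_granularity", dev);
        f = fopen(path, "r");
        if (!f)
                return;
        if (fscanf(f, "%llu", &gran) == 1)
                printf("%s: discard_granularity = %llu bytes\n",
                       dev, gran);
        fclose(f);
}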

The real problem is that vendors have zero interest in optimizing for
discard. They are so confident in their FTL and overprovisioning that
they don't view it as an important feature. At all.

Consequently, many of the modern devices that claim to support discard
to make us software folks happy (or to satisfy purchase order
requirements) complete the commands without doing anything at all.
We're simply wasting queue slots.

Personally, I think discard is dead on anything but the cheapest
devices. And on those it is probably going to be
performance-prohibitive to use it in any other way than a weekly fstrim.

--
Martin K. Petersen Oracle Linux Engineering

2019-02-22 06:20:47

by Roman Mamedov

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

On Thu, 21 Feb 2019 22:01:24 -0500
"Martin K. Petersen" <[email protected]> wrote:

> Consequently, many of the modern devices that claim to support discard
> to make us software folks happy (or to satisfy purchase order
> requirements) complete the commands without doing anything at all.
> We're simply wasting queue slots.

Any example of such devices? Let alone "many"? Where you would issue a
full-device blkdiscard, but then just read back old data.

I know of only one model (PLEXTOR PX-512M6M) out of dozens tested with a
peculiarity: it ignores trim specifically for the 1st sector of the entire
disk. But implying there are "many" which no-op it entirely seems like
imagining the world already works the way you would assume it to.

--
With respect,
Roman

2019-02-22 14:13:31

by Martin K. Petersen

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?


Roman,

>> Consequently, many of the modern devices that claim to support
>> discard to make us software folks happy (or to satisfy purchase
>> order requirements) complete the commands without doing anything at
>> all. We're simply wasting queue slots.
>
> Any example of such devices? Let alone "many"? Where you would issue a
> full-device blkdiscard, but then just read back old data.

I obviously can't mention names or go into implementation details. But
there are many drives out there that return old data. And that's
perfectly within spec.

At least some of the pain in the industry in this department can be
attributed to us Linux folks and RAID device vendors. We all wanted
deterministic zeroes on completion of DSM TRIM, UNMAP, or DEALLOCATE.
The device vendors weren't happy about that and we ended up with weasel
language in the specs. This led to the current libata whitelist mess
for SATA SSDs and ongoing vendor implementation confusion in SCSI and
NVMe devices.

On the Linux side the problem was that we originally used discard for
two distinct purposes: Clearing block ranges and deallocating block
ranges. We cleaned that up a while back and now have BLKZEROOUT and
BLKDISCARD. Those operations get translated to different operations
depending on the device. We also cleaned up several of the
inconsistencies in the SCSI and NVMe specs to facilitate making this
distinction possible in the kernel.
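
From userspace the two are separate ioctls taking the same byte range;
an untested sketch of the distinction:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKDISCARD, BLKZEROOUT */

/* Deallocate: a hint, reads afterwards may legally return old data
 * unless the device guarantees deterministic zeroes. */
static int deallocate_range(int fd, uint64_t offset, uint64_t len)
{
        uint64_t range[2] = { offset, len };

        return ioctl(fd, BLKDISCARD, range);
}

/* Clear: the result is guaranteed zeroes; the kernel picks WRITE
 * ZEROES, UNMAP/DEALLOCATE, or a plain write of zeroes depending on
 * what the device can do. */
static int zero_range(int fd, uint64_t offset, uint64_t len)
{
        uint64_t range[2] = { offset, len };

        return ioctl(fd, BLKZEROOUT, range);
}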

In the meantime the SSD vendors made great strides in refining their
flash management. To the point where pretty much all enterprise device
vendors will ask you not to issue discards. The benefits simply do not
outweigh the costs.

If you have special workloads where write amplification is a major
concern it may still be advantageous to do the discards and reduce WA
and prolong drive life. However, these workloads are increasingly moving
away from the classic LBA read/write model. Open Channel originally
targeted this space. Right now work is underway on Zoned Namespaces and
Key-Value command sets in NVMe.

These curated application workload protocols are fundamental departures
from the traditional way of accessing storage. And my postulate is that
where tail latency and drive lifetime management are important, those new
command sets offer much better bang for the buck. And they make the
notion of discard completely moot. That's why I don't think it's going
to be terribly important in the long term.

This leaves consumer devices and enterprise devices using the
traditional LBA I/O model.

For consumer devices I still think fstrim is a good compromise. Lack of
queuing for DSM hurt us for a long time. And when it was finally added
to the ATA command set, many device vendors got their implementations
wrong. So it sucked for a lot longer than it should have. And of course
FTL implementations differ.

For enterprise devices we're still in the situation where vendors
generally prefer for us not to use discard. I would love for the
DEALLOCATE/WRITE ZEROES mess to be sorted out in their FTLs, but I have
fairly low confidence that it's going to happen. Case in point: Despite
a lot of leverage and purchasing power, the cloud industry has not been
terribly successful in compelling the drive manufacturers to make
DEALLOCATE perform well for typical application workloads. So I'm not
holding my breath...

--
Martin K. Petersen Oracle Linux Engineering

2019-02-22 16:45:05

by Keith Busch

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

On Thu, Feb 21, 2019 at 09:51:12PM -0500, Martin K. Petersen wrote:
>
> Keith,
>
> > With respect to fs block sizes, one thing making discards suck is that
> > many high capacity SSDs' physical page sizes are larger than the fs
> > block size, and a sub-page discard is worse than doing nothing.
>
> That ties into the whole zeroing as a side-effect thing.
>
> The devices really need to distinguish between discard-as-a-hint where
> it is free to ignore anything that's not a whole multiple of whatever
> the internal granularity is, and the WRITE ZEROES use case where the end
> result needs to be deterministic.

Exactly, yes, considering the deterministic zeroing behavior. For devices
supporting that, sub-page discards turn into a read-modify-write instead
of invalidating the page. That increases WAF instead of improving it
as intended, and large page SSDs are most likely to have relatively poor
write endurance in the first place.

We have NVMe spec changes in the pipeline so devices can report this
granularity. But my real concern isn't with discard per se, but more
with the writes since we don't support "sector" sizes greater than the
system's page size. This is a bit of a different topic from where this
thread started, though.

2019-02-27 11:40:14

by Ric Wheeler

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

On 2/22/19 11:45 AM, Keith Busch wrote:
> On Thu, Feb 21, 2019 at 09:51:12PM -0500, Martin K. Petersen wrote:
>> Keith,
>>
>>> With respect to fs block sizes, one thing making discards suck is that
>>> many high capacity SSDs' physical page sizes are larger than the fs
>>> block size, and a sub-page discard is worse than doing nothing.
>> That ties into the whole zeroing as a side-effect thing.
>>
>> The devices really need to distinguish between discard-as-a-hint where
>> it is free to ignore anything that's not a whole multiple of whatever
>> the internal granularity is, and the WRITE ZEROES use case where the end
>> result needs to be deterministic.
> Exactly, yes, considering the deterministic zeroing behavior. For devices
> supporting that, sub-page discards turn into a read-modify-write instead
> of invalidating the page. That increases WAF instead of improving it
> as intended, and large page SSDs are most likely to have relatively poor
> write endurance in the first place.
>
> We have NVMe spec changes in the pipeline so devices can report this
> granularity. But my real concern isn't with discard per se, but more
> with the writes since we don't support "sector" sizes greater than the
> system's page size. This is a bit of a different topic from where this
> thread started, though.

All of this behavior I think could be helped if we can get some discard testing
tooling that large customers could use to validate/quantify performance issues.
Most vendors are moderately good at jumping through hoops held up by large customers
when the path through that hoop leads to a big deal :)

Ric




2019-02-27 13:24:56

by Matthew Wilcox

Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

On Fri, Feb 22, 2019 at 09:45:05AM -0700, Keith Busch wrote:
> On Thu, Feb 21, 2019 at 09:51:12PM -0500, Martin K. Petersen wrote:
> >
> > Keith,
> >
> > > With respect to fs block sizes, one thing making discards suck is that
> > > many high capacity SSDs' physical page sizes are larger than the fs
> > > block size, and a sub-page discard is worse than doing nothing.
> >
> > That ties into the whole zeroing as a side-effect thing.
> >
> > The devices really need to distinguish between discard-as-a-hint where
> > it is free to ignore anything that's not a whole multiple of whatever
> > the internal granularity is, and the WRITE ZEROES use case where the end
> > result needs to be deterministic.
>
> Exactly, yes, considering the deterministic zeroing behavior. For devices
> supporting that, sub-page discards turn into a read-modify-write instead
> of invalidating the page. That increases WAF instead of improving it
> as intended, and large page SSDs are most likely to have relatively poor
> write endurance in the first place.
>
> We have NVMe spec changes in the pipeline so devices can report this
> granularity. But my real concern isn't with discard per se, but more
> with the writes since we don't support "sector" sizes greater than the
> system's page size. This is a bit of a different topic from where this
> thread started, though.

I don't understand how reporting a larger discard granularity helps.
Sure, if the file was written block-by-block in that large granularity
to begin with, then the drive can invalidate an entire page. But if
even one page of that, say, 256kB block was rewritten, then discarding
the 256kB block will need to discard 252kB from one erase block and 4kB
from another erase block.

So it looks like you really just want to report a larger "optimal IO
size", which I thought we already had.