2011-04-28 20:59:50

by Mike Snitzer

[permalink] [raw]
Subject: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

[cc'ing linux-ext4]

On Thu, Apr 28 2011 at 3:53am -0400,
Christoph Hellwig <[email protected]> wrote:

> On Wed, Apr 27, 2011 at 08:19:13PM -0400, Mike Snitzer wrote:
> > Discards pose a problem for the snapshot-origin target because they are
> > treated as writes. Treating a discard as a write would trigger a
> > copyout to the snapshot. Such copyout can prove too costly in the face
> > of otherwise benign scenarios (e.g. create a snapshot and then mkfs.ext4
> > the origin -- mkfs.ext4 discards the entire volume by default, which
> > would copyout the entire origin volume to the snapshot).
>
> You also need to make sure that we don't claim discard_zeroes_data for
> the origin volume in this case. Especially as ext4 started to rely
> on this actually working (very bad idea IMHO, but that's another story)

Eric Sandeen helped me see that having the DM snapshot-origin target
return success but actually ignore discards is just bad form.

Especially when you consider that this exercise was motivated by the
fact that ext4 will disable discards on the first discard failure, see:
http://www.redhat.com/archives/dm-devel/2011-April/msg00070.html

Eric and I think it is best to revert this commit:
a30eec2 ext4: stop issuing discards if not supported by device

(though ideally ext4 would still WARN_ONCE per superblock with something
like: "discard failed, please consider disabling discard support")

1) The user asked for discards (with '-o discard' mount option)
- what is the real harm in coninuing to issue them even if it _seems_
they aren't supported?
2) assuming the entire block device uniformly supports discards can
be flawed (a DM device's discard support can vary based on logical
offset).

Thoughts?


2011-04-28 21:28:33

by Eric Sandeen

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On 4/28/11 3:59 PM, Mike Snitzer wrote:
> [cc'ing linux-ext4]
>
> On Thu, Apr 28 2011 at 3:53am -0400,
> Christoph Hellwig <[email protected]> wrote:
>
>> On Wed, Apr 27, 2011 at 08:19:13PM -0400, Mike Snitzer wrote:
>>> Discards pose a problem for the snapshot-origin target because they are
>>> treated as writes. Treating a discard as a write would trigger a
>>> copyout to the snapshot. Such copyout can prove too costly in the face
>>> of otherwise benign scenarios (e.g. create a snapshot and then mkfs.ext4
>>> the origin -- mkfs.ext4 discards the entire volume by default, which
>>> would copyout the entire origin volume to the snapshot).
>>
>> You also need to make sure that we don't claim discard_zeroes_data for
>> the origin volume in this case. Especially as ext4 started to rely
>> on this actually working (very bad idea IMHO, but that's another story)
>
> Eric Sandeen helped me see that having the DM snapshot-origin target
> return success but actually ignore discards is just bad form.
>
> Especially when you consider that this exercise was motivated by the
> fact that ext4 will disable discards on the first discard failure, see:
> http://www.redhat.com/archives/dm-devel/2011-April/msg00070.html
>
> Eric and I think it is best to revert this commit:
> a30eec2 ext4: stop issuing discards if not supported by device
>
> (though ideally ext4 would still WARN_ONCE per superblock with something
> like: "discard failed, please consider disabling discard support")
>
> 1) The user asked for discards (with '-o discard' mount option)
> - what is the real harm in coninuing to issue them even if it _seems_
> they aren't supported?

TBH I sent a30eec2 on a whim. Seemed reasonable at the time, but if
discard-ability changes over time, it may not be the best plan.

> 2) assuming the entire block device uniformly supports discards can
> be flawed (a DM device's discard support can vary based on logical
> offset).

I still think that concats of floppies, usb disks, and ssds should be rare, so I'm less concerned about that ;)

I think Mike is right though, that if you do not do anything with a discard, you should return -EOPNOTSUPP, and not pretend that you honored it.

We should, IMHO, deal with the truth of the matter at the filesystem caller.

-Eric

> Thoughts?


2011-04-28 23:00:05

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [linux-lvm] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Thu, Apr 28, 2011 at 04:28:17PM -0500, Eric Sandeen wrote:
> I still think that concats of floppies, usb disks, and ssds should be rare, so I'm less concerned about that ;)

It's the 'undefined' cases that cause us the trouble though: we do have
to return something and I prefer to work with defined and documented
behaviour, rather than pretending something is 'undefined' or
'unimportant' and later finding people relied on its actual behaviour.

Alasdair


2011-04-28 23:01:30

by Eric Sandeen

[permalink] [raw]
Subject: Re: [linux-lvm] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On 4/28/11 5:59 PM, Alasdair G Kergon wrote:
> On Thu, Apr 28, 2011 at 04:28:17PM -0500, Eric Sandeen wrote:
>> I still think that concats of floppies, usb disks, and ssds should be rare, so I'm less concerned about that ;)
>
> It's the 'undefined' cases that cause us the trouble though: we do have
> to return something and I prefer to work with defined and documented
> behaviour, rather than pretending something is 'undefined' or
> 'unimportant' and later finding people relied on its actual behaviour.

If you are saying that a target should return EOPNOTSUPP when it does not support the operation, then we are in violent agreement. :)

-Eric

> Alasdair


2011-04-28 23:11:22

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [linux-lvm] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Thu, Apr 28, 2011 at 06:01:18PM -0500, Eric Sandeen wrote:
> If you are saying that a target should return EOPNOTSUPP when it does not support the operation, then we are in violent agreement. :)

Naturally, with the definition you used in the first post.
(I.e. if the target does *anything* with it, you do not get that error, even
if it did not process it completely.)

Alasdair


2011-04-29 01:12:56

by Andreas Dilger

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Apr 28, 2011, at 14:59, Mike Snitzer wrote:
> On Thu, Apr 28 2011 at 3:53am -0400,
> Christoph Hellwig <[email protected]> wrote:
>> On Wed, Apr 27, 2011 at 08:19:13PM -0400, Mike Snitzer wrote:
>>> Discards pose a problem for the snapshot-origin target because they are
>>> treated as writes. Treating a discard as a write would trigger a
>>> copyout to the snapshot. Such copyout can prove too costly in the face
>>> of otherwise benign scenarios (e.g. create a snapshot and then mkfs.ext4
>>> the origin -- mkfs.ext4 discards the entire volume by default, which
>>> would copyout the entire origin volume to the snapshot).
>>
>> You also need to make sure that we don't claim discard_zeroes_data for
>> the origin volume in this case. Especially as ext4 started to rely
>> on this actually working (very bad idea IMHO, but that's another story)
>
> Eric Sandeen helped me see that having the DM snapshot-origin target
> return success but actually ignore discards is just bad form.
>
> Especially when you consider that this exercise was motivated by the
> fact that ext4 will disable discards on the first discard failure, see:
> http://www.redhat.com/archives/dm-devel/2011-April/msg00070.html
>
> Eric and I think it is best to revert this commit:
> a30eec2 ext4: stop issuing discards if not supported by device
>
> (though ideally ext4 would still WARN_ONCE per superblock with something
> like: "discard failed, please consider disabling discard support")
>
> 1) The user asked for discards (with '-o discard' mount option)
> - what is the real harm in coninuing to issue them even if it _seems_
> they aren't supported?

Aren't discard operations considered barriers? If there are needless discard operations being sent to the disk then it seems like this would have a non-zero performance penalty, just by virtue of blocking IO submissions at the block device layer.

If my understanding of the mechanics is flawed, then please ignore.

> 2) assuming the entire block device uniformly supports discards can
> be flawed (a DM device's discard support can vary based on logical
> offset).

Hybrid storage is potentially an increasingly popular configuration. We've been looking at DM-mapped SSD RAID-1 + HDD RAID-6 arrays for ext4 to allow storing only the metadata on SSD. That said, it may make sense to have the DM translation layer check the component devices of an LV to see if discard is supported on any of them, and only return -EOPNOTSUPP if none of them do.

While not a perfect solution for the case where an LV is migrated onto a device that does support discard, I don't think we have to handle every corner case perfectly to still handle reasonable use cases in a sensible manner.

Cheers, Andreas






2011-04-29 09:31:05

by Lukas Czerner

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Thu, 28 Apr 2011, Mike Snitzer wrote:

> [cc'ing linux-ext4]
>
> On Thu, Apr 28 2011 at 3:53am -0400,
> Christoph Hellwig <[email protected]> wrote:
>
> > On Wed, Apr 27, 2011 at 08:19:13PM -0400, Mike Snitzer wrote:
> > > Discards pose a problem for the snapshot-origin target because they are
> > > treated as writes. Treating a discard as a write would trigger a
> > > copyout to the snapshot. Such copyout can prove too costly in the face
> > > of otherwise benign scenarios (e.g. create a snapshot and then mkfs.ext4
> > > the origin -- mkfs.ext4 discards the entire volume by default, which
> > > would copyout the entire origin volume to the snapshot).
> >
> > You also need to make sure that we don't claim discard_zeroes_data for
> > the origin volume in this case. Especially as ext4 started to rely
> > on this actually working (very bad idea IMHO, but that's another story)

I do not think that it is bad idea. It is supposed to work and we do not
want to "optimize" for broken devices (or broken cheap crap, as someone
concisely described before).

>
> Eric Sandeen helped me see that having the DM snapshot-origin target
> return success but actually ignore discards is just bad form.
>
> Especially when you consider that this exercise was motivated by the
> fact that ext4 will disable discards on the first discard failure, see:
> http://www.redhat.com/archives/dm-devel/2011-April/msg00070.html
>
> Eric and I think it is best to revert this commit:
> a30eec2 ext4: stop issuing discards if not supported by device
>
> (though ideally ext4 would still WARN_ONCE per superblock with something
> like: "discard failed, please consider disabling discard support")

I think that we do not need to revert it, we just need to do the "right
thing" in the underlying layers. That said:

1. We need to honor all the "discard limits" so the discard bios does
not actually fail.
2. If the device is composed of various other devices, we should return
-EOPNOTSUPP is none of the devices support discard.
3. We should succeed, if at least one of the devices support discard and
it does not fail for any reason.
4. We should not advertise discard_zeroes_data if any of the devices
does not zero data or does not support discard.

I am not sure how "hard" is to assure those conditions in DM. If those
conditions are met, we can rely on consistent information in the layers
above.

Thanks!
-Lukas

>
> 1) The user asked for discards (with '-o discard' mount option)
> - what is the real harm in coninuing to issue them even if it _seems_
> they aren't supported?
> 2) assuming the entire block device uniformly supports discards can
> be flawed (a DM device's discard support can vary based on logical
> offset).
>
> Thoughts?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2011-04-29 12:25:06

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Fri, Apr 29, 2011 at 11:30:47AM +0200, Lukas Czerner wrote:
> 1. We need to honor all the "discard limits" so the discard bios does
> not actually fail.
> 2. If the device is composed of various other devices, we should return
> -EOPNOTSUPP is none of the devices support discard.
> 3. We should succeed, if at least one of the devices support discard and
> it does not fail for any reason.
> 4. We should not advertise discard_zeroes_data if any of the devices
> does not zero data or does not support discard.
> I am not sure how "hard" is to assure those conditions in DM. If those
> conditions are met, we can rely on consistent information in the layers
> above.

Remember that -EOPNOTSUPP return applies only to that one *specific*
discard. It does not tell you for sure whether or not another future
discard to the device will succeed. (It's a property of offset - if
there are several devices underneath - and of time - if the device or
one below it gets reconfigured.)

The core issue here is whether a filesytem should decide that the
receipt of a single -EOPNOTSUPP is a reason never to send any more,
whether a more sophisticated algorithm should be used (considering
the proportion/offsets of them over given periods of time and retrying
later), or whether more comprehensive information about the discard
capabilities of the device should be presented - and whether this should
be handled automatically or whether it should be under userspace control
(i.e. the sysadmin can instruct the filesystem what to do).

dm's role is simply to handle a discard if it can, and report back
if it couldn't. Additionally it could provide more comprehensive
information about the discard capabilities of the device, but my sense
is that most consider this unnecessary as normally dm devices will have
a coherent behaviour throughout.

Alasdair


2011-04-29 12:29:08

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

FYI, that disable discard on failure seems to be an ext4 special and no
one else picked up that stupid idea. If ext4 wants to misbehave for
that just let them..


2011-04-29 14:29:05

by Eric Sandeen

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On 4/29/11 7:29 AM, Christoph Hellwig wrote:
> FYI, that disable discard on failure seems to be an ext4 special and no
> one else picked up that stupid idea. If ext4 wants to misbehave for
> that just let them..

It was my "stupid idea," and I'm ok with reverting it ;)

-Eric

2011-04-29 16:17:09

by Ray Morris

[permalink] [raw]
Subject: Re: [linux-lvm] [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

> On 4/29/11 7:29 AM, Christoph Hellwig wrote:
> > no one else picked up that stupid idea. If ext4 wants to
> > misbehave for that just let them..
>
> It was my "stupid idea," and I'm ok with reverting it ;)
>
> -Eric

http://www.google.com/search?q=how+to+remove+foot+from+mouth
--
Ray Morris
[email protected]

Strongbox - The next generation in site security:
http://www.bettercgi.com/strongbox/

Throttlebox - Intelligent Bandwidth Control
http://www.bettercgi.com/throttlebox/

Strongbox / Throttlebox affiliate program:
http://www.bettercgi.com/affiliates/user/register.php




On Fri, 29 Apr 2011 09:28:51 -0500
Eric Sandeen <[email protected]> wrote:

> On 4/29/11 7:29 AM, Christoph Hellwig wrote:
> > FYI, that disable discard on failure seems to be an ext4 special
> > and no one else picked up that stupid idea. If ext4 wants to
> > misbehave for that just let them..
>
> It was my "stupid idea," and I'm ok with reverting it ;)
>
> -Eric
> _______________________________________________
> linux-lvm mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
>


2011-05-02 07:16:42

by Lukas Czerner

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Fri, 29 Apr 2011, Alasdair G Kergon wrote:

> On Fri, Apr 29, 2011 at 11:30:47AM +0200, Lukas Czerner wrote:
> > 1. We need to honor all the "discard limits" so the discard bios does
> > not actually fail.
> > 2. If the device is composed of various other devices, we should return
> > -EOPNOTSUPP is none of the devices support discard.
> > 3. We should succeed, if at least one of the devices support discard and
> > it does not fail for any reason.
> > 4. We should not advertise discard_zeroes_data if any of the devices
> > does not zero data or does not support discard.
> > I am not sure how "hard" is to assure those conditions in DM. If those
> > conditions are met, we can rely on consistent information in the layers
> > above.
>
> Remember that -EOPNOTSUPP return applies only to that one *specific*
> discard. It does not tell you for sure whether or not another future
> discard to the device will succeed. (It's a property of offset - if
> there are several devices underneath - and of time - if the device or
> one below it gets reconfigured.)
>
> The core issue here is whether a filesytem should decide that the
> receipt of a single -EOPNOTSUPP is a reason never to send any more,
> whether a more sophisticated algorithm should be used (considering
> the proportion/offsets of them over given periods of time and retrying
> later), or whether more comprehensive information about the discard
> capabilities of the device should be presented - and whether this should
> be handled automatically or whether it should be under userspace control
> (i.e. the sysadmin can instruct the filesystem what to do).
>
> dm's role is simply to handle a discard if it can, and report back
> if it couldn't. Additionally it could provide more comprehensive
> information about the discard capabilities of the device, but my sense
> is that most consider this unnecessary as normally dm devices will have
> a coherent behaviour throughout.
>
> Alasdair
>
>

Hi Alasdair,

I like the last idea the best. Dm is the only one who has the
overall information about the storage, hence it should provide
some of that information for others to do the right thing.

Christoph mentioned that ext4 is the only one actually disabling
discard on -EOPNOTSUPP, but firstly that's not true (gfs2 and
nilfs does that as well) and secondly it is the right thing to
do! What else should we be doing when it tells us that "discard is
not supported"? continue trying ? that does not make sense at all!

However when we have dm device where part of the device supports
discard due to underlying hardware capability we just can not return
EOPNOTSUPP from blkdev_issue_discard, because it is just not true! It
does support discard, but it is not consistent through the whole device
and at some layer we should notice that and not let it through. How to
do this is just implementation detail. We can either do it at dm level,
while handling bios, or we can export "do not return eopnotsupp"
information and blkdev_issue_discard() should use that information to do
the right thing.

Thanks!
-Lukas

2011-05-02 08:13:23

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Mon, May 02, 2011 at 09:16:21AM +0200, Lukas Czerner wrote:
> However when we have dm device where part of the device supports
> discard due to underlying hardware capability we just can not return
> EOPNOTSUPP from blkdev_issue_discard, because it is just not true!

EOPNOTSUPP from dm means the operation was not supported on that *one* bio.
It does *not* tell you anything in general about the device, or whether
you'd get the same error from different bios in future.

Alasdair


2011-05-02 08:19:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Mon, May 02, 2011 at 09:13:08AM +0100, Alasdair G Kergon wrote:
> On Mon, May 02, 2011 at 09:16:21AM +0200, Lukas Czerner wrote:
> > However when we have dm device where part of the device supports
> > discard due to underlying hardware capability we just can not return
> > EOPNOTSUPP from blkdev_issue_discard, because it is just not true!
>
> EOPNOTSUPP from dm means the operation was not supported on that *one* bio.
> It does *not* tell you anything in general about the device, or whether
> you'd get the same error from different bios in future.

Exactly. We already have the information in the queue limits to tell
the filesystem if discard is supported at all or not.


2011-05-02 10:24:48

by Lukas Czerner

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Mon, 2 May 2011, Christoph Hellwig wrote:

> On Mon, May 02, 2011 at 09:13:08AM +0100, Alasdair G Kergon wrote:
> > On Mon, May 02, 2011 at 09:16:21AM +0200, Lukas Czerner wrote:
> > > However when we have dm device where part of the device supports
> > > discard due to underlying hardware capability we just can not return
> > > EOPNOTSUPP from blkdev_issue_discard, because it is just not true!
> >
> > EOPNOTSUPP from dm means the operation was not supported on that *one* bio.
> > It does *not* tell you anything in general about the device, or whether
> > you'd get the same error from different bios in future.
>
> Exactly. We already have the information in the queue limits to tell
> the filesystem if discard is supported at all or not.
>

So I gave it a try. First of all the device composed of SSD and spinning
disk does export discard_support information properly, however it also
advertise discard_zeroes_data which is wrong and possibly dangerous and
should be fixed!

And second of all, strictly speaking if EOPNOTSUPP from dm means that
operation was not supported on that *one* bio, blkdev_issue_discard()
should handle that and do not return EOPNOTSUPP further if queue limits
tells that device has discard support. Is this acceptable solution for
you guys ? I can make that change since I am changing blkdev_issue_discard()
anyway. Or, we can make that change in filesystem where we disable
discard on mount time, when we notice that discard mount option was
specified, but the device does not support it (we should probably do
this regardless on blkdev_issue_discard() change).

Thanks!
-Lukas

2011-05-02 12:48:19

by Mike Snitzer

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Mon, May 02 2011 at 6:24am -0400,
Lukas Czerner <[email protected]> wrote:

> On Mon, 2 May 2011, Christoph Hellwig wrote:
>
> > On Mon, May 02, 2011 at 09:13:08AM +0100, Alasdair G Kergon wrote:
> > > On Mon, May 02, 2011 at 09:16:21AM +0200, Lukas Czerner wrote:
> > > > However when we have dm device where part of the device supports
> > > > discard due to underlying hardware capability we just can not return
> > > > EOPNOTSUPP from blkdev_issue_discard, because it is just not true!
> > >
> > > EOPNOTSUPP from dm means the operation was not supported on that *one* bio.
> > > It does *not* tell you anything in general about the device, or whether
> > > you'd get the same error from different bios in future.
> >
> > Exactly. We already have the information in the queue limits to tell
> > the filesystem if discard is supported at all or not.
> >
>
> So I gave it a try. First of all the device composed of SSD and spinning
> disk does export discard_support information properly, however it also
> advertise discard_zeroes_data which is wrong and possibly dangerous and
> should be fixed!

You're effectively advocating that blk_stack_limits() needs to clear the
topmost device's discard_zeroes_data flag if any one bottom device does
not have discard_zeroes_data.

> And second of all, strictly speaking if EOPNOTSUPP from dm means that
> operation was not supported on that *one* bio, blkdev_issue_discard()
> should handle that and do not return EOPNOTSUPP further if queue limits
> tells that device has discard support. Is this acceptable solution for
> you guys ? I can make that change since I am changing blkdev_issue_discard()
> anyway. Or, we can make that change in filesystem where we disable
> discard on mount time, when we notice that discard mount option was
> specified, but the device does not support it (we should probably do
> this regardless on blkdev_issue_discard() change).

The blkdev_issue_discard() change you propose could be fine (mask
EOPNOTSUPP return if device advertises support for discards) -- though
Eric said we shouldn't ever say we did something when we didn't.

But that blkdev_issue_discard() change is really only safe if the
discard_zeroes_data flag is cleared by blk_stack_limits() if finds
inconsistent discard_zeroes_data support.

Mike

2011-05-02 13:05:46

by Lukas Czerner

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Mon, 2 May 2011, Mike Snitzer wrote:

> On Mon, May 02 2011 at 6:24am -0400,
> Lukas Czerner <[email protected]> wrote:
>
> > On Mon, 2 May 2011, Christoph Hellwig wrote:
> >
> > > On Mon, May 02, 2011 at 09:13:08AM +0100, Alasdair G Kergon wrote:
> > > > On Mon, May 02, 2011 at 09:16:21AM +0200, Lukas Czerner wrote:
> > > > > However when we have dm device where part of the device supports
> > > > > discard due to underlying hardware capability we just can not return
> > > > > EOPNOTSUPP from blkdev_issue_discard, because it is just not true!
> > > >
> > > > EOPNOTSUPP from dm means the operation was not supported on that *one* bio.
> > > > It does *not* tell you anything in general about the device, or whether
> > > > you'd get the same error from different bios in future.
> > >
> > > Exactly. We already have the information in the queue limits to tell
> > > the filesystem if discard is supported at all or not.
> > >
> >
> > So I gave it a try. First of all the device composed of SSD and spinning
> > disk does export discard_support information properly, however it also
> > advertise discard_zeroes_data which is wrong and possibly dangerous and
> > should be fixed!
>
> You're effectively advocating that blk_stack_limits() needs to clear the
> topmost device's discard_zeroes_data flag if any one bottom device does
> not have discard_zeroes_data.

Exactly.

>
> > And second of all, strictly speaking if EOPNOTSUPP from dm means that
> > operation was not supported on that *one* bio, blkdev_issue_discard()
> > should handle that and do not return EOPNOTSUPP further if queue limits
> > tells that device has discard support. Is this acceptable solution for
> > you guys ? I can make that change since I am changing blkdev_issue_discard()
> > anyway. Or, we can make that change in filesystem where we disable
> > discard on mount time, when we notice that discard mount option was
> > specified, but the device does not support it (we should probably do
> > this regardless on blkdev_issue_discard() change).
>
> The blkdev_issue_discard() change you propose could be fine (mask
> EOPNOTSUPP return if device advertises support for discards) -- though
> Eric said we shouldn't ever say we did something when we didn't.

Exactly, so we should not say that it is not supported when it is, but
we just hit the "wrong" part of the device:) I would just very much like
to keep the abstraction of having one consistent device underneath the
file system and not deal with several devices, or regions with different
behaviour in the file system itself (let the pixies underneath deal with
that, after all not all of us are btrfs:))

>
> But that blkdev_issue_discard() change is really only safe if the
> discard_zeroes_data flag is cleared by blk_stack_limits() if finds
> inconsistent discard_zeroes_data support.

Agreed

>
> Mike
>

Thanks!
-Lukas

2011-05-02 13:48:59

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

>>>>> "Lukas" == Lukas Czerner <[email protected]> writes:

Lukas> however it also advertise discard_zeroes_data which is wrong and
Lukas> possibly dangerous and should be fixed!

That shouldn't happen. I'll have a look...

--
Martin K. Petersen Oracle Linux Engineering

2011-05-02 14:20:54

by Martin K. Petersen

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

>>>>> "Lukas" == Lukas Czerner <[email protected]> writes:

Lukas> So I gave it a try. First of all the device composed of SSD and
Lukas> spinning disk does export discard_support information properly,
Lukas> however it also advertise discard_zeroes_data which is wrong and
Lukas> possibly dangerous and should be fixed!

I can't reproduce this here. If I mix discard and non-discard devices
things work correctly. discard_zeroes_data also gets cleared if I mix
discard-capable drives where one zeroes and one doesn't.

--
Martin K. Petersen Oracle Linux Engineering

2011-05-02 14:40:01

by Lukas Czerner

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Mon, 2 May 2011, Martin K. Petersen wrote:

> >>>>> "Lukas" == Lukas Czerner <[email protected]> writes:
>
> Lukas> So I gave it a try. First of all the device composed of SSD and
> Lukas> spinning disk does export discard_support information properly,
> Lukas> however it also advertise discard_zeroes_data which is wrong and
> Lukas> possibly dangerous and should be fixed!
>
> I can't reproduce this here. If I mix discard and non-discard devices
> things work correctly. discard_zeroes_data also gets cleared if I mix
> discard-capable drives where one zeroes and one doesn't.
>
>

[root@trim ~]# hdparm -I /dev/sdb | grep -i trim
[root@trim ~]# cat /sys/block/sdb/queue/discard_zeroes_data
0
[root@trim ~]# hdparm -I /dev/sdd | grep -i trim
* Data Set Management TRIM supported
* Deterministic read after TRIM
[root@trim ~]# cat /sys/block/sdd/queue/discard_zeroes_data
1
[root@trim ~]# pvcreate /dev/sdd1
Physical volume "/dev/sdd1" successfully created
[root@trim ~]# pvcreate /dev/sdb3
Physical volume "/dev/sdb3" successfully created
[root@trim ~]# vgcreate vg_test /dev/sdd1 /dev/sdb3
Volume group "vg_test" successfully created
[root@trim ~]# lvcreate -L 3500M vg_test
Logical volume "lvol0" created
[root@trim ~]# ls -lah /dev/mapper/vg_test-lvol0
lrwxrwxrwx. 1 root root 7 2.?kv? 10.30 /dev/mapper/vg_test-lvol0 -> ../dm-0
[root@trim ~]# cat /sys/block/dm-0/queue/discard_zeroes_data
1
[root@trim ~]# uname -r
2.6.39-rc5-blkdev+
[root@trim ~]# cat /proc/partitions
8 19 1951897 sdb3
8 49 1951866 sdd1
253 0 3584000 dm-0



So I assume it got fixes just recently ?. I'll give it a try with the
up-to-date kernel.

Thanks!
-Lukas

2011-05-02 14:47:26

by Eric Sandeen

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On 5/2/11 8:05 AM, Lukas Czerner wrote:
> On Mon, 2 May 2011, Mike Snitzer wrote:

...

>> The blkdev_issue_discard() change you propose could be fine (mask
>> EOPNOTSUPP return if device advertises support for discards) -- though
>> Eric said we shouldn't ever say we did something when we didn't.
>
> Exactly, so we should not say that it is not supported when it is, but
> we just hit the "wrong" part of the device:) I would just very much like
> to keep the abstraction of having one consistent device underneath the
> file system and not deal with several devices, or regions with different
> behaviour in the file system itself (let the pixies underneath deal with
> that, after all not all of us are btrfs:))

I still think we need to stick with the simple rule: "EOPNOTSUPP returned for a particular bio means that it is not supported for that particular bio" - I don't know what else we can do, without creating an ambiguity...

This does, however, suck for the layer calling in to a complex device.

What is the overhead for sending discard bios down to a device that does not support it?

-Eric

2011-05-02 14:48:24

by Christoph Hellwig

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Mon, May 02, 2011 at 09:47:16AM -0500, Eric Sandeen wrote:
> I still think we need to stick with the simple rule: "EOPNOTSUPP returned for a particular bio means that it is not supported for that particular bio" - I don't know what else we can do, without creating an ambiguity...
>
> This does, however, suck for the layer calling in to a complex device.
>
> What is the overhead for sending discard bios down to a device that does not support it?

For a DM-like device there is very little overhead. The bio gets
dispatched to make_request and failed in there almost immediately.


2011-05-02 14:50:16

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

>>>>> "Lukas" == Lukas Czerner <[email protected]> writes:

Lukas> So I assume it got fixes just recently ?. I'll give it a try with
Lukas> the up-to-date kernel.

Ok, turns out the SSD I used for the test predates advertising
TRIM. Beats me why this is broken for the zeroes/!zeroes case. I'll get
it fixed...

--
Martin K. Petersen Oracle Linux Engineering

2011-05-02 14:58:23

by Mike Snitzer

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Mon, May 02 2011 at 10:39am -0400,
Lukas Czerner <[email protected]> wrote:

> On Mon, 2 May 2011, Martin K. Petersen wrote:
>
> > >>>>> "Lukas" == Lukas Czerner <[email protected]> writes:
> >
> > Lukas> So I gave it a try. First of all the device composed of SSD and
> > Lukas> spinning disk does export discard_support information properly,
> > Lukas> however it also advertise discard_zeroes_data which is wrong and
> > Lukas> possibly dangerous and should be fixed!
> >
> > I can't reproduce this here. If I mix discard and non-discard devices
> > things work correctly. discard_zeroes_data also gets cleared if I mix
> > discard-capable drives where one zeroes and one doesn't.
> >
> >
>
> [root@trim ~]# hdparm -I /dev/sdb | grep -i trim
> [root@trim ~]# cat /sys/block/sdb/queue/discard_zeroes_data
> 0
> [root@trim ~]# hdparm -I /dev/sdd | grep -i trim
> * Data Set Management TRIM supported
> * Deterministic read after TRIM
> [root@trim ~]# cat /sys/block/sdd/queue/discard_zeroes_data
> 1
> [root@trim ~]# pvcreate /dev/sdd1
> Physical volume "/dev/sdd1" successfully created
> [root@trim ~]# pvcreate /dev/sdb3
> Physical volume "/dev/sdb3" successfully created
> [root@trim ~]# vgcreate vg_test /dev/sdd1 /dev/sdb3
> Volume group "vg_test" successfully created
> [root@trim ~]# lvcreate -L 3500M vg_test
> Logical volume "lvol0" created
> [root@trim ~]# ls -lah /dev/mapper/vg_test-lvol0
> lrwxrwxrwx. 1 root root 7 2. kvě 10.30 /dev/mapper/vg_test-lvol0 -> ../dm-0
> [root@trim ~]# cat /sys/block/dm-0/queue/discard_zeroes_data
> 1

Hmm, Strange considering DM just uses blk_stack_limits().

As Martin said, current blk_stack_limits() code is fine:
t->discard_zeroes_data &= b->discard_zeroes_data;

> So I assume it got fixes just recently ?. I'll give it a try with the
> up-to-date kernel.

Hasn't changed at all since it was introduced:
98262f2 v2.6.33-rc1 block: Allow devices to indicate whether discarded blocks are zeroed

2011-05-02 14:58:35

by Lukas Czerner

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Mon, 2 May 2011, Eric Sandeen wrote:

> On 5/2/11 8:05 AM, Lukas Czerner wrote:
> > On Mon, 2 May 2011, Mike Snitzer wrote:
>
> ...
>
> >> The blkdev_issue_discard() change you propose could be fine (mask
> >> EOPNOTSUPP return if device advertises support for discards) -- though
> >> Eric said we shouldn't ever say we did something when we didn't.
> >
> > Exactly, so we should not say that it is not supported when it is, but
> > we just hit the "wrong" part of the device:) I would just very much like
> > to keep the abstraction of having one consistent device underneath the
> > file system and not deal with several devices, or regions with different
> > behaviour in the file system itself (let the pixies underneath deal with
> > that, after all not all of us are btrfs:))
>
> I still think we need to stick with the simple rule: "EOPNOTSUPP returned for a particular bio means that it is not supported for that particular bio" - I don't know what else we can do, without creating an ambiguity...

I agree, but we do not care about bios in file system. We need to know
whether the device itself supports it or not. Also consider BLKDISCARD
ioctl(), now as it is it's completely broken on such dm devices, because
you are not able to use it due to -EOPNOTSUPP first time you hit the
device which does not support it, but it can very well be just a small
part of dm device.

Again, it does not make sense to return -EOPNOTSUPP from
blkdev_issue_discard() when the device actually supports it.

-Lukas

>
> This does, however, suck for the layer calling in to a complex device.
>
> What is the overhead for sending discard bios down to a device that does not support it?
>
> -Eric
>

--

2011-05-02 16:58:50

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

>>>>> "Lukas" == Lukas Czerner <[email protected]> writes:

Lukas,

Lukas> [root@trim ~]# lvcreate -L 3500M vg_test
Lukas> Logical volume "lvol0" created

Ok, so here's what I think is going on. You're creating a linear target
which happens to fit inside the first PV.

Here's two devices. One that supports discard_zeroes_data=1 (8:17) and
one that doesn't (8:49).

# dmsetup table
foo-bar: 0 1032192 striped 2 32 8:17 384 8:49 384
foo-baz: 0 106496 linear 8:17 516480

# grep . /sys/block/d*/queue/discard_z*
/sys/block/dm-0/queue/discard_zeroes_data:0
/sys/block/dm-1/queue/discard_zeroes_data:1

LV foo/bar (dm-0) is striped, straddling two devices with incompatible
values. Hence 0.

LV foo/baz (dm-1) is linear and fits inside the first PV. Thus it has
discard_zeroes_data=1.

--
Martin K. Petersen Oracle Linux Engineering

2011-05-03 08:57:36

by Lukas Czerner

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Mon, 2 May 2011, Martin K. Petersen wrote:

> >>>>> "Lukas" == Lukas Czerner <[email protected]> writes:
>
> Lukas,
>
> Lukas> [root@trim ~]# lvcreate -L 3500M vg_test
> Lukas> Logical volume "lvol0" created
>
> Ok, so here's what I think is going on. You're creating a linear target
> which happens to fit inside the first PV.
>
> Here's two devices. One that supports discard_zeroes_data=1 (8:17) and
> one that doesn't (8:49).
>
> # dmsetup table
> foo-bar: 0 1032192 striped 2 32 8:17 384 8:49 384
> foo-baz: 0 106496 linear 8:17 516480
>
> # grep . /sys/block/d*/queue/discard_z*
> /sys/block/dm-0/queue/discard_zeroes_data:0
> /sys/block/dm-1/queue/discard_zeroes_data:1
>
> LV foo/bar (dm-0) is striped, straddling two devices with incompatible
> values. Hence 0.
>
> LV foo/baz (dm-1) is linear and fits inside the first PV. Thus it has
> discard_zeroes_data=1.

Well that's why I have included /proc/partition, so you can see that
this is not happening, because both partitions are 1.86GB long and the LV I
am creating from those is 3.42GB long.

[root@trim ~]# dmsetup table
vg_test-lvol0: 0 3899392 linear 8:49 2048
vg_test-lvol0: 3899392 3268608 linear 8:19 2048


Nevertheless there is something weird going on, because even when I
create striped volume I get this:


[root@trim ~]# dmsetup table
vg_test-lvol0: 0 7176192 striped 2 16 8:49 2048 8:19 2048
[root@trim ~]# ls -lah /dev/mapper/vg_test-lvol0
lrwxrwxrwx. 1 root root 7 3.?kv? 04.42 /dev/mapper/vg_test-lvol0 ->
../dm-0
[root@trim ~]# grep . /sys/block/dm-0/queue/discard_*
/sys/block/dm-0/queue/discard_granularity:512
/sys/block/dm-0/queue/discard_max_bytes:4294966784
/sys/block/dm-0/queue/discard_zeroes_data:1


And of course (8:49) is a partition on the SSD which does zero data and
(8:19) is a partition on the spinning device which does not zero data,
nor does it support discard. And this is on yesterdays kernel.

[root@trim ~]# uname -r
2.6.39-rc5+

So assuming that you're trying recent kernel as well, someone is doing
something different :)

Thanks!
-Lukas

2011-05-04 15:10:51

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

>>>>> "Lukas" == Lukas Czerner <[email protected]> writes:

Lukas> Nevertheless there is something weird going on, because even when
Lukas> I create striped volume I get this:

Could you please try the following patch? It has a bunch of small tweaks
to the discard stack in it. I'll split it up before posting for real but
I'd like to know if it fixes your issue...


block/libata/scsi: Various logical block provisioning fixes

- Add sysfs documentation for the discard topology parameters

- Fix discard stacking problem

- Switch our libata SAT over to using the WRITE SAME limits

- UNMAP alignment needs to be converted to bytes

- Only report alignment and zeroes_data if the device supports discard

Reported-by: Lukas Czerner <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index 4873c75..c1eb41c 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -142,3 +142,67 @@ Description:
with the previous I/O request are enabled. When set to 2,
all merge tries are disabled. The default value is 0 -
which enables all types of merge tries.
+
+What: /sys/block/<disk>/discard_alignment
+Date: May 2011
+Contact: Martin K. Petersen <[email protected]>
+Description:
+ Devices that support discard functionality may
+ internally allocate space in units that are bigger than
+ the exported logical block size. The discard_alignment
+ parameter indicates how many bytes the beginning of the
+ device is offset from the internal allocation unit's
+ natural alignment.
+
+What: /sys/block/<disk>/<partition>/discard_alignment
+Date: May 2011
+Contact: Martin K. Petersen <[email protected]>
+Description:
+ Devices that support discard functionality may
+ internally allocate space in units that are bigger than
+ the exported logical block size. The discard_alignment
+ parameter indicates how many bytes the beginning of the
+ partition is offset from the internal allocation unit's
+ natural alignment.
+
+What: /sys/block/<disk>/queue/discard_granularity
+Date: May 2011
+Contact: Martin K. Petersen <[email protected]>
+Description:
+ Devices that support discard functionality may
+ internally allocate space using units that are bigger
+ than the logical block size. The discard_granularity
+ parameter indicates the size of the internal allocation
+ unit in bytes if reported by the device. Otherwise the
+ discard_granularity will be set to match the device's
+ physical block size. A discard_granularity of 0 means
+ that the device does not support discard functionality.
+
+What: /sys/block/<disk>/queue/discard_max_bytes
+Date: May 2011
+Contact: Martin K. Petersen <[email protected]>
+Description:
+ Devices that support discard functionality may have
+ internal limits on the number of bytes that can be
+ trimmed or unmapped in a single operation. Some storage
+ protocols also have inherent limits on the number of
+ blocks that can be described in a single command. The
+ discard_max_bytes parameter is set by the device driver
+ to the maximum number of bytes that can be discarded in
+ a single operation. Discard requests issued to the
+ device must not exceed this limit. A discard_max_bytes
+ value of 0 means that the device does not support
+ discard functionality.
+
+What: /sys/block/<disk>/queue/discard_zeroes_data
+Date: May 2011
+Contact: Martin K. Petersen <[email protected]>
+Description:
+ Devices that support discard functionality may return
+ stale or random data when a previously discarded block
+ is read back. This can cause problems if the filesystem
+ expects discarded blocks to be explicitly cleared. If a
+ device reports that it deterministically returns zeroes
+ when a discarded area is read the discard_zeroes_data
+ parameter will be set to one. Otherwise it will be 0 and
+ the result of reading a discarded area is undefined.
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 1fa7692..42d3bf5 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -120,7 +120,7 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->discard_granularity = 0;
lim->discard_alignment = 0;
lim->discard_misaligned = 0;
- lim->discard_zeroes_data = -1;
+ lim->discard_zeroes_data = 1;
lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;
lim->bounce_pfn = (unsigned long)(BLK_BOUNCE_ANY >> PAGE_SHIFT);
lim->alignment_offset = 0;
@@ -166,6 +166,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)

blk_set_default_limits(&q->limits);
blk_queue_max_hw_sectors(q, BLK_SAFE_MAX_SECTORS);
+ q->limits.discard_zeroes_data = 0;

/*
* by default assume old behaviour and bounce for any highmem page
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e2f57e9e..30ea95f 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2138,7 +2138,7 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
* with the unmap bit set.
*/
if (ata_id_has_trim(args->id)) {
- put_unaligned_be32(65535 * 512 / 8, &rbuf[20]);
+ put_unaligned_be64(65535 * 512 / 8, &rbuf[36]);
put_unaligned_be32(1, &rbuf[28]);
}

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bd0806e..cc081dd 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -490,7 +490,8 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
unsigned int max_blocks = 0;

q->limits.discard_zeroes_data = sdkp->lbprz;
- q->limits.discard_alignment = sdkp->unmap_alignment;
+ q->limits.discard_alignment = sdkp->unmap_alignment *
+ (logical_block_size >> 9);
q->limits.discard_granularity =
max(sdkp->physical_block_size,
sdkp->unmap_granularity * logical_block_size);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2ad95fa..1416e3c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -257,7 +257,7 @@ struct queue_limits {
unsigned char misaligned;
unsigned char discard_misaligned;
unsigned char cluster;
- signed char discard_zeroes_data;
+ unsigned char discard_zeroes_data;
};

struct request_queue
@@ -1066,13 +1066,16 @@ static inline int queue_limit_discard_alignment(struct queue_limits *lim, sector
{
unsigned int alignment = (sector << 9) & (lim->discard_granularity - 1);

+ if (!lim->max_discard_sectors)
+ return 0;
+
return (lim->discard_granularity + lim->discard_alignment - alignment)
& (lim->discard_granularity - 1);
}

static inline unsigned int queue_discard_zeroes_data(struct request_queue *q)
{
- if (q->limits.discard_zeroes_data == 1)
+ if (q->limits.max_discard_sectors && q->limits.discard_zeroes_data == 1)
return 1;

return 0;

2011-05-04 15:16:30

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

>>>>> "Lukas" == Lukas Czerner <[email protected]> writes:

I got tired of poking around in sysfs to find the discard topology.
Here's a patch against lsblk that adds a -D option to present this
information in a human-readable form:

# lsblk -D
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda 0 0B 0B 0
└─sda1 0 0B 0B 0
sdb 0 512B 2G 1
└─sdb1 0 512B 2G 1


Signed-off-by: Martin K. Petersen <[email protected]>

diff --git a/misc-utils/lsblk.8 b/misc-utils/lsblk.8
index 38ff48f..d7d7aa8 100644
--- a/misc-utils/lsblk.8
+++ b/misc-utils/lsblk.8
@@ -29,6 +29,8 @@ Print the SIZE column in bytes rather than in human-readable format.
.IP "\fB\-d, \-\-nodeps\fP"
Don't print device holders or slaves. For example "lsblk --nodeps /dev/sda" prints
information about the sda device only.
+.IP "\fB\-D, \-\-discard\fP"
+Print information about the discard (TRIM, UNMAP) capabilities for each device.
.IP "\fB\-e, \-\-exclude \fIlist\fP
Exclude the devices specified by a comma-separated \fIlist\fR of major device numbers.
Note that RAM disks (major=1) are excluded by default.
diff --git a/misc-utils/lsblk.c b/misc-utils/lsblk.c
index 38326d0..671e690 100644
--- a/misc-utils/lsblk.c
+++ b/misc-utils/lsblk.c
@@ -77,6 +77,10 @@ enum {
COL_ROTA,
COL_SCHED,
COL_TYPE,
+ COL_DALIGN,
+ COL_DGRAN,
+ COL_DMAX,
+ COL_DZERO,

__NCOLUMNS
};
@@ -112,8 +116,11 @@ static struct colinfo infos[__NCOLUMNS] = {
[COL_PHYSEC] = { "PHY-SEC", 7, TT_FL_RIGHT, N_("physical sector size") },
[COL_LOGSEC] = { "LOG-SEC", 7, TT_FL_RIGHT, N_("logical sector size") },
[COL_SCHED] = { "SCHED", 0.1, 0, N_("I/O scheduler name") },
- [COL_TYPE] = { "TYPE", 4, 0, N_("device type") }
-
+ [COL_TYPE] = { "TYPE", 4, 0, N_("device type") },
+ [COL_DALIGN] = { "DISC-ALN", 6, TT_FL_RIGHT, N_("discard alignment offset") },
+ [COL_DGRAN] = { "DISC-GRAN", 6, TT_FL_RIGHT, N_("discard granularity") },
+ [COL_DMAX] = { "DISC-MAX", 6, TT_FL_RIGHT, N_("discard max bytes") },
+ [COL_DZERO] = { "DISC-ZERO", 1, TT_FL_RIGHT, N_("discard zeroes data") },
};

struct lsblk {
@@ -702,6 +709,33 @@ static void set_tt_data(struct blkdev_cxt *cxt, int col, int id, struct tt_line
if (p)
tt_line_set_data(ln, col, p);
break;
+ case COL_DALIGN:
+ p = sysfs_strdup(cxt, "discard_alignment");
+ if (p)
+ tt_line_set_data(ln, col, p);
+ break;
+ case COL_DGRAN:
+ p = sysfs_strdup(cxt, "queue/discard_granularity");
+ if (!lsblk->bytes)
+ p = size_to_human_string(atoi(p));
+
+ if (p)
+ tt_line_set_data(ln, col, p);
+ break;
+ case COL_DMAX:
+ p = sysfs_strdup(cxt, "queue/discard_max_bytes");
+
+ if (!lsblk->bytes)
+ p = size_to_human_string(atoi(p));
+
+ if (p)
+ tt_line_set_data(ln, col, p);
+ break;
+ case COL_DZERO:
+ p = sysfs_strdup(cxt, "queue/discard_zeroes_data");
+ if (p)
+ tt_line_set_data(ln, col, p);
+ break;
};
}

@@ -930,6 +964,7 @@ static void __attribute__((__noreturn__)) help(FILE *out)
" -a, --all print all devices\n"
" -b, --bytes print SIZE in bytes rather than in human readable format\n"
" -d, --nodeps don't print slaves or holders\n"
+ " -D, --discard print discard capabilities\n"
" -e, --exclude <list> exclude devices by major number (default: RAM disks)\n"
" -f, --fs output info about filesystems\n"
" -h, --help usage information (this)\n"
@@ -967,6 +1002,7 @@ int main(int argc, char *argv[])
{ "all", 0, 0, 'a' },
{ "bytes", 0, 0, 'b' },
{ "nodeps", 0, 0, 'd' },
+ { "discard", 0, 0, 'D' },
{ "help", 0, 0, 'h' },
{ "output", 1, 0, 'o' },
{ "perms", 0, 0, 'm' },
@@ -987,7 +1023,7 @@ int main(int argc, char *argv[])
lsblk = &_ls;
memset(lsblk, 0, sizeof(*lsblk));

- while((c = getopt_long(argc, argv, "abde:fhlnmo:irt", longopts, NULL)) != -1) {
+ while((c = getopt_long(argc, argv, "abdDe:fhlnmo:irt", longopts, NULL)) != -1) {
switch(c) {
case 'a':
lsblk->all_devices = 1;
@@ -998,6 +1034,13 @@ int main(int argc, char *argv[])
case 'd':
lsblk->nodeps = 1;
break;
+ case 'D':
+ columns[ncolumns++] = COL_NAME;
+ columns[ncolumns++] = COL_DALIGN;
+ columns[ncolumns++] = COL_DGRAN;
+ columns[ncolumns++] = COL_DMAX;
+ columns[ncolumns++] = COL_DZERO;
+ break;
case 'e':
parse_excludes(optarg);
break;

2011-05-04 16:02:47

by Mike Snitzer

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Wed, May 04 2011 at 11:10am -0400,
Martin K. Petersen <[email protected]> wrote:

> >>>>> "Lukas" == Lukas Czerner <[email protected]> writes:
>
> Lukas> Nevertheless there is something weird going on, because even when
> Lukas> I create striped volume I get this:
>
> Could you please try the following patch? It has a bunch of small tweaks
> to the discard stack in it. I'll split it up before posting for real but
> I'd like to know if it fixes your issue...
>
>
> block/libata/scsi: Various logical block provisioning fixes
>
> - Add sysfs documentation for the discard topology parameters
>
> - Fix discard stacking problem
>
> - Switch our libata SAT over to using the WRITE SAME limits
>
> - UNMAP alignment needs to be converted to bytes
>
> - Only report alignment and zeroes_data if the device supports discard
>
> Reported-by: Lukas Czerner <[email protected]>
> Signed-off-by: Martin K. Petersen <[email protected]>
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 1fa7692..42d3bf5 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -120,7 +120,7 @@ void blk_set_default_limits(struct queue_limits *lim)
> lim->discard_granularity = 0;
> lim->discard_alignment = 0;
> lim->discard_misaligned = 0;
> - lim->discard_zeroes_data = -1;
> + lim->discard_zeroes_data = 1;
> lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;
> lim->bounce_pfn = (unsigned long)(BLK_BOUNCE_ANY >> PAGE_SHIFT);
> lim->alignment_offset = 0;

lim->discard_zeroes_data = -1; was suspect to me too.
But why default to 1 here?

> @@ -166,6 +166,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>
> blk_set_default_limits(&q->limits);
> blk_queue_max_hw_sectors(q, BLK_SAFE_MAX_SECTORS);
> + q->limits.discard_zeroes_data = 0;
>
> /*
> * by default assume old behaviour and bounce for any highmem page

Only to then reset to 0 here? Shouldn't we default to 0 and only set to
1 where applicable (e.g. sd_config_discard)?

2011-05-04 16:13:10

by Lukas Czerner

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Wed, 4 May 2011, Martin K. Petersen wrote:

> >>>>> "Lukas" == Lukas Czerner <[email protected]> writes:
>
> I got tired of poking around in sysfs to find the discard topology.
> Here's a patch against lsblk that adds a -D option to present this
> information in a human-readable form:
>
> # lsblk -D
> NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
> sda 0 0B 0B 0
> └─sda1 0 0B 0B 0
> sdb 0 512B 2G 1
> └─sdb1 0 512B 2G 1

Very useful indeed!

Thanks!
-Lukas

>
>
> Signed-off-by: Martin K. Petersen <[email protected]>
>
> diff --git a/misc-utils/lsblk.8 b/misc-utils/lsblk.8
> index 38ff48f..d7d7aa8 100644
> --- a/misc-utils/lsblk.8
> +++ b/misc-utils/lsblk.8
> @@ -29,6 +29,8 @@ Print the SIZE column in bytes rather than in human-readable format.
> .IP "\fB\-d, \-\-nodeps\fP"
> Don't print device holders or slaves. For example "lsblk --nodeps /dev/sda" prints
> information about the sda device only.
> +.IP "\fB\-D, \-\-discard\fP"
> +Print information about the discard (TRIM, UNMAP) capabilities for each device.
> .IP "\fB\-e, \-\-exclude \fIlist\fP
> Exclude the devices specified by a comma-separated \fIlist\fR of major device numbers.
> Note that RAM disks (major=1) are excluded by default.
> diff --git a/misc-utils/lsblk.c b/misc-utils/lsblk.c
> index 38326d0..671e690 100644
> --- a/misc-utils/lsblk.c
> +++ b/misc-utils/lsblk.c
> @@ -77,6 +77,10 @@ enum {
> COL_ROTA,
> COL_SCHED,
> COL_TYPE,
> + COL_DALIGN,
> + COL_DGRAN,
> + COL_DMAX,
> + COL_DZERO,
>
> __NCOLUMNS
> };
> @@ -112,8 +116,11 @@ static struct colinfo infos[__NCOLUMNS] = {
> [COL_PHYSEC] = { "PHY-SEC", 7, TT_FL_RIGHT, N_("physical sector size") },
> [COL_LOGSEC] = { "LOG-SEC", 7, TT_FL_RIGHT, N_("logical sector size") },
> [COL_SCHED] = { "SCHED", 0.1, 0, N_("I/O scheduler name") },
> - [COL_TYPE] = { "TYPE", 4, 0, N_("device type") }
> -
> + [COL_TYPE] = { "TYPE", 4, 0, N_("device type") },
> + [COL_DALIGN] = { "DISC-ALN", 6, TT_FL_RIGHT, N_("discard alignment offset") },
> + [COL_DGRAN] = { "DISC-GRAN", 6, TT_FL_RIGHT, N_("discard granularity") },
> + [COL_DMAX] = { "DISC-MAX", 6, TT_FL_RIGHT, N_("discard max bytes") },
> + [COL_DZERO] = { "DISC-ZERO", 1, TT_FL_RIGHT, N_("discard zeroes data") },
> };
>
> struct lsblk {
> @@ -702,6 +709,33 @@ static void set_tt_data(struct blkdev_cxt *cxt, int col, int id, struct tt_line
> if (p)
> tt_line_set_data(ln, col, p);
> break;
> + case COL_DALIGN:
> + p = sysfs_strdup(cxt, "discard_alignment");
> + if (p)
> + tt_line_set_data(ln, col, p);
> + break;
> + case COL_DGRAN:
> + p = sysfs_strdup(cxt, "queue/discard_granularity");
> + if (!lsblk->bytes)
> + p = size_to_human_string(atoi(p));
> +
> + if (p)
> + tt_line_set_data(ln, col, p);
> + break;
> + case COL_DMAX:
> + p = sysfs_strdup(cxt, "queue/discard_max_bytes");
> +
> + if (!lsblk->bytes)
> + p = size_to_human_string(atoi(p));
> +
> + if (p)
> + tt_line_set_data(ln, col, p);
> + break;
> + case COL_DZERO:
> + p = sysfs_strdup(cxt, "queue/discard_zeroes_data");
> + if (p)
> + tt_line_set_data(ln, col, p);
> + break;
> };
> }
>
> @@ -930,6 +964,7 @@ static void __attribute__((__noreturn__)) help(FILE *out)
> " -a, --all print all devices\n"
> " -b, --bytes print SIZE in bytes rather than in human readable format\n"
> " -d, --nodeps don't print slaves or holders\n"
> + " -D, --discard print discard capabilities\n"
> " -e, --exclude <list> exclude devices by major number (default: RAM disks)\n"
> " -f, --fs output info about filesystems\n"
> " -h, --help usage information (this)\n"
> @@ -967,6 +1002,7 @@ int main(int argc, char *argv[])
> { "all", 0, 0, 'a' },
> { "bytes", 0, 0, 'b' },
> { "nodeps", 0, 0, 'd' },
> + { "discard", 0, 0, 'D' },
> { "help", 0, 0, 'h' },
> { "output", 1, 0, 'o' },
> { "perms", 0, 0, 'm' },
> @@ -987,7 +1023,7 @@ int main(int argc, char *argv[])
> lsblk = &_ls;
> memset(lsblk, 0, sizeof(*lsblk));
>
> - while((c = getopt_long(argc, argv, "abde:fhlnmo:irt", longopts, NULL)) != -1) {
> + while((c = getopt_long(argc, argv, "abdDe:fhlnmo:irt", longopts, NULL)) != -1) {
> switch(c) {
> case 'a':
> lsblk->all_devices = 1;
> @@ -998,6 +1034,13 @@ int main(int argc, char *argv[])
> case 'd':
> lsblk->nodeps = 1;
> break;
> + case 'D':
> + columns[ncolumns++] = COL_NAME;
> + columns[ncolumns++] = COL_DALIGN;
> + columns[ncolumns++] = COL_DGRAN;
> + columns[ncolumns++] = COL_DMAX;
> + columns[ncolumns++] = COL_DZERO;
> + break;
> case 'e':
> parse_excludes(optarg);
> break;
>

--

2011-05-04 16:33:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Fri, Apr 29, 2011 at 09:28:51AM -0500, Eric Sandeen wrote:
> On 4/29/11 7:29 AM, Christoph Hellwig wrote:
> > FYI, that disable discard on failure seems to be an ext4 special and no
> > one else picked up that stupid idea. If ext4 wants to misbehave for
> > that just let them..
>
> It was my "stupid idea," and I'm ok with reverting it ;)

I think I forgot to send the patch out, but it's been reverted in the
ext4 master branch, commit id: d9f34504e695. It's more than a revert,
actually, since I also dropped error checking for FITRIM. Otherwise
an attempt to use FITRIM would stop after hitting the first dm region
that didn't support discards.

- Ted

2011-05-04 16:51:15

by Martin K. Petersen

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

>>>>> "Mike" == Mike Snitzer <[email protected]> writes:

Mike> lim->discard_zeroes_data = -1; was suspect to me too.
Mike> But why default to 1 here?

Because otherwise DM would default to having dzd to "unsupported",
meaning the feature would never be turned on regardless of the bottom
device capabilities.

The old approach used the -1 value to indicate "has not been set". That
was only really intended as a value for the stacking drivers, not for
the LLDs. It was a bit of a hack and I'd rather deal with dzd the same
way as we do with clustering.


>> @@ -166,6 +166,7 @@ void blk_queue_make_request(struct request_queue
>> *q, make_request_fn *mfn)
>>
>> blk_set_default_limits(&q->limits);
>> blk_queue_max_hw_sectors(q, BLK_SAFE_MAX_SECTORS);
>> + q->limits.discard_zeroes_data = 0;
>>
>> /*
>> * by default assume old behaviour and bounce for any highmem page

Mike> Only to then reset to 0 here? Shouldn't we default to 0 and only
Mike> set to 1 where applicable (e.g. sd_config_discard)?

My first approach was to set it in dm-table.c before stacking. But I
thought it was icky to have the stacking driver ask for defaults and
then have to tweak them for things to work correctly.

The other option is to have blk_set_default_stacking_limits(). Or we
could add a flag to blk_set_default_limits to indicate whether this is a
LLD or a stacking driver.

We already special-case BLK_SAFE_MAX_SECTORS when setting the request
function. And that's the only non-stacking user of the default limits
call. So that's why I disabled dzd there. Since this is a stable bugfix
I also wanted to keep it small and simple. But I'm totally open to
suggestions.

--
Martin K. Petersen Oracle Linux Engineering

2011-05-04 16:51:58

by Eric Sandeen

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On 5/4/11 11:33 AM, Ted Ts'o wrote:
> On Fri, Apr 29, 2011 at 09:28:51AM -0500, Eric Sandeen wrote:
>> On 4/29/11 7:29 AM, Christoph Hellwig wrote:
>>> FYI, that disable discard on failure seems to be an ext4 special and no
>>> one else picked up that stupid idea. If ext4 wants to misbehave for
>>> that just let them..
>>
>> It was my "stupid idea," and I'm ok with reverting it ;)
>
> I think I forgot to send the patch out, but it's been reverted in the
> ext4 master branch, commit id: d9f34504e695. It's more than a revert,
> actually, since I also dropped error checking for FITRIM. Otherwise
> an attempt to use FITRIM would stop after hitting the first dm region
> that didn't support discards.
>
> - Ted

The xfs must-be-posted-and-reviewed-before-commit policy would be helpful here, I think.

If it's understood and agreed upon, people will get more diligent about review, and it'll be easier to remember to send the patch out - because it's policy.

Thanks,
-Eric

2011-05-04 16:57:24

by Lukas Czerner

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Wed, 4 May 2011, Eric Sandeen wrote:

> On 5/4/11 11:33 AM, Ted Ts'o wrote:
> > On Fri, Apr 29, 2011 at 09:28:51AM -0500, Eric Sandeen wrote:
> >> On 4/29/11 7:29 AM, Christoph Hellwig wrote:
> >>> FYI, that disable discard on failure seems to be an ext4 special and no
> >>> one else picked up that stupid idea. If ext4 wants to misbehave for
> >>> that just let them..
> >>
> >> It was my "stupid idea," and I'm ok with reverting it ;)
> >
> > I think I forgot to send the patch out, but it's been reverted in the
> > ext4 master branch, commit id: d9f34504e695. It's more than a revert,
> > actually, since I also dropped error checking for FITRIM. Otherwise
> > an attempt to use FITRIM would stop after hitting the first dm region
> > that didn't support discards.
> >
> > - Ted
>
> The xfs must-be-posted-and-reviewed-before-commit policy would be helpful here, I think.
>
> If it's understood and agreed upon, people will get more diligent about review, and it'll be easier to remember to send the patch out - because it's policy.

I second that! Also we should do more reviews anyway :)

Thanks!
-Lukas

>
> Thanks,
> -Eric
>

2011-05-04 17:02:22

by Lukas Czerner

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Wed, 4 May 2011, Ted Ts'o wrote:

> On Fri, Apr 29, 2011 at 09:28:51AM -0500, Eric Sandeen wrote:
> > On 4/29/11 7:29 AM, Christoph Hellwig wrote:
> > > FYI, that disable discard on failure seems to be an ext4 special and no
> > > one else picked up that stupid idea. If ext4 wants to misbehave for
> > > that just let them..
> >
> > It was my "stupid idea," and I'm ok with reverting it ;)
>
> I think I forgot to send the patch out, but it's been reverted in the
> ext4 master branch, commit id: d9f34504e695. It's more than a revert,
> actually, since I also dropped error checking for FITRIM. Otherwise
> an attempt to use FITRIM would stop after hitting the first dm region
> that didn't support discards.
>
> - Ted
>

However I would very much like to change blkdev_issue_discard() not NOT
return EOPNOTSUPP if QUEUE_FLAG_DISCARD is set, this way we can rely on
EOPNOTSUPP saying really that it is NOT supported. Regardless on ext4
change.

I will send a patch for it, if you guy agree. Also consider BLKDISCARD
which is now broken and this change will fix that!

Thanks!
-Lukas

2011-05-04 17:10:29

by Lukas Czerner

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Wed, 4 May 2011, Martin K. Petersen wrote:

> >>>>> "Lukas" == Lukas Czerner <[email protected]> writes:
>
> Lukas> Nevertheless there is something weird going on, because even when
> Lukas> I create striped volume I get this:
>
> Could you please try the following patch? It has a bunch of small tweaks
> to the discard stack in it. I'll split it up before posting for real but
> I'd like to know if it fixes your issue...

It works thanks! This is before the patch applied:

NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sdb 0 0B 0B 0
??sdb1 4293918720 0B 0B 0
??sdb2 4293918720 0B 0B 0
??sdb3 4293918720 0B 0B 0
??vg_test-lvol0 (dm-0) 0 512B 16E 1
sdd 0 512B 16E 1
??sdd1 0 512B 16E 1
??vg_test-lvol0 (dm-0) 0 512B 16E 1

and this is with the patch:

NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sdb 0 0B 0B 0
??sdb1 0 0B 0B 0
??sdb2 0 0B 0B 0
??sdb3 0 0B 0B 0
??vg_test-lvol0 (dm-0) 0 512B 2G 0
sdd 0 512B 2G 1
??sdd1 0 512B 2G 1
??vg_test-lvol0 (dm-0) 0 512B 2G 0

It also works properly with two devices supporting discard.

NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sdd 0 512B 2G 1
??sdd1 0 512B 2G 1
? ??vg_test-lvol0 (dm-0) 0 512B 2G 1
??sdd2 0 512B 2G 1
??vg_test-lvol0 (dm-0) 0 512B 2G 1


Also I am not sure why this changed discard_max_bytes from 4294966784
to 2147450880 in my case, that does not seem right...lsblk in the first
place apparently overflowed.

Thanks for solving this!
-Lukas


>
>
> block/libata/scsi: Various logical block provisioning fixes
>
> - Add sysfs documentation for the discard topology parameters
>
> - Fix discard stacking problem
>
> - Switch our libata SAT over to using the WRITE SAME limits
>
> - UNMAP alignment needs to be converted to bytes
>
> - Only report alignment and zeroes_data if the device supports discard
>
> Reported-by: Lukas Czerner <[email protected]>
> Signed-off-by: Martin K. Petersen <[email protected]>
>
> diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
> index 4873c75..c1eb41c 100644
> --- a/Documentation/ABI/testing/sysfs-block
> +++ b/Documentation/ABI/testing/sysfs-block
> @@ -142,3 +142,67 @@ Description:
> with the previous I/O request are enabled. When set to 2,
> all merge tries are disabled. The default value is 0 -
> which enables all types of merge tries.
> +
> +What: /sys/block/<disk>/discard_alignment
> +Date: May 2011
> +Contact: Martin K. Petersen <[email protected]>
> +Description:
> + Devices that support discard functionality may
> + internally allocate space in units that are bigger than
> + the exported logical block size. The discard_alignment
> + parameter indicates how many bytes the beginning of the
> + device is offset from the internal allocation unit's
> + natural alignment.
> +
> +What: /sys/block/<disk>/<partition>/discard_alignment
> +Date: May 2011
> +Contact: Martin K. Petersen <[email protected]>
> +Description:
> + Devices that support discard functionality may
> + internally allocate space in units that are bigger than
> + the exported logical block size. The discard_alignment
> + parameter indicates how many bytes the beginning of the
> + partition is offset from the internal allocation unit's
> + natural alignment.
> +
> +What: /sys/block/<disk>/queue/discard_granularity
> +Date: May 2011
> +Contact: Martin K. Petersen <[email protected]>
> +Description:
> + Devices that support discard functionality may
> + internally allocate space using units that are bigger
> + than the logical block size. The discard_granularity
> + parameter indicates the size of the internal allocation
> + unit in bytes if reported by the device. Otherwise the
> + discard_granularity will be set to match the device's
> + physical block size. A discard_granularity of 0 means
> + that the device does not support discard functionality.
> +
> +What: /sys/block/<disk>/queue/discard_max_bytes
> +Date: May 2011
> +Contact: Martin K. Petersen <[email protected]>
> +Description:
> + Devices that support discard functionality may have
> + internal limits on the number of bytes that can be
> + trimmed or unmapped in a single operation. Some storage
> + protocols also have inherent limits on the number of
> + blocks that can be described in a single command. The
> + discard_max_bytes parameter is set by the device driver
> + to the maximum number of bytes that can be discarded in
> + a single operation. Discard requests issued to the
> + device must not exceed this limit. A discard_max_bytes
> + value of 0 means that the device does not support
> + discard functionality.
> +
> +What: /sys/block/<disk>/queue/discard_zeroes_data
> +Date: May 2011
> +Contact: Martin K. Petersen <[email protected]>
> +Description:
> + Devices that support discard functionality may return
> + stale or random data when a previously discarded block
> + is read back. This can cause problems if the filesystem
> + expects discarded blocks to be explicitly cleared. If a
> + device reports that it deterministically returns zeroes
> + when a discarded area is read the discard_zeroes_data
> + parameter will be set to one. Otherwise it will be 0 and
> + the result of reading a discarded area is undefined.
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 1fa7692..42d3bf5 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -120,7 +120,7 @@ void blk_set_default_limits(struct queue_limits *lim)
> lim->discard_granularity = 0;
> lim->discard_alignment = 0;
> lim->discard_misaligned = 0;
> - lim->discard_zeroes_data = -1;
> + lim->discard_zeroes_data = 1;
> lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;
> lim->bounce_pfn = (unsigned long)(BLK_BOUNCE_ANY >> PAGE_SHIFT);
> lim->alignment_offset = 0;
> @@ -166,6 +166,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>
> blk_set_default_limits(&q->limits);
> blk_queue_max_hw_sectors(q, BLK_SAFE_MAX_SECTORS);
> + q->limits.discard_zeroes_data = 0;
>
> /*
> * by default assume old behaviour and bounce for any highmem page
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index e2f57e9e..30ea95f 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -2138,7 +2138,7 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
> * with the unmap bit set.
> */
> if (ata_id_has_trim(args->id)) {
> - put_unaligned_be32(65535 * 512 / 8, &rbuf[20]);
> + put_unaligned_be64(65535 * 512 / 8, &rbuf[36]);
> put_unaligned_be32(1, &rbuf[28]);
> }
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index bd0806e..cc081dd 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -490,7 +490,8 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
> unsigned int max_blocks = 0;
>
> q->limits.discard_zeroes_data = sdkp->lbprz;
> - q->limits.discard_alignment = sdkp->unmap_alignment;
> + q->limits.discard_alignment = sdkp->unmap_alignment *
> + (logical_block_size >> 9);
> q->limits.discard_granularity =
> max(sdkp->physical_block_size,
> sdkp->unmap_granularity * logical_block_size);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2ad95fa..1416e3c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -257,7 +257,7 @@ struct queue_limits {
> unsigned char misaligned;
> unsigned char discard_misaligned;
> unsigned char cluster;
> - signed char discard_zeroes_data;
> + unsigned char discard_zeroes_data;
> };
>
> struct request_queue
> @@ -1066,13 +1066,16 @@ static inline int queue_limit_discard_alignment(struct queue_limits *lim, sector
> {
> unsigned int alignment = (sector << 9) & (lim->discard_granularity - 1);
>
> + if (!lim->max_discard_sectors)
> + return 0;
> +
> return (lim->discard_granularity + lim->discard_alignment - alignment)
> & (lim->discard_granularity - 1);
> }
>
> static inline unsigned int queue_discard_zeroes_data(struct request_queue *q)
> {
> - if (q->limits.discard_zeroes_data == 1)
> + if (q->limits.max_discard_sectors && q->limits.discard_zeroes_data == 1)
> return 1;
>
> return 0;
>

--

2011-05-04 17:32:31

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

>>>>> "Lukas" == Lukas Czerner <[email protected]> writes:

Lukas> Also I am not sure why this changed discard_max_bytes from
Lukas> 4294966784 to 2147450880 in my case, that does not seem
Lukas> right...lsblk in the first place apparently overflowed.

2G is correct. The relevant libata patch was missing from the series I
submitted during the merge window and I don't see it in the archives. It
was at the end of the stack so I guess I had an off-by-one in my commit
range...

--
Martin K. Petersen Oracle Linux Engineering

2011-05-04 17:35:17

by Lukas Czerner

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Wed, 4 May 2011, Martin K. Petersen wrote:

> >>>>> "Lukas" == Lukas Czerner <[email protected]> writes:
>
> Lukas> Also I am not sure why this changed discard_max_bytes from
> Lukas> 4294966784 to 2147450880 in my case, that does not seem
> Lukas> right...lsblk in the first place apparently overflowed.
>
> 2G is correct. The relevant libata patch was missing from the series I
> submitted during the merge window and I don't see it in the archives. It
> was at the end of the stack so I guess I had an off-by-one in my commit
> range...
>

Ok, thanks.

-Lukas

2011-05-04 18:03:58

by Mike Snitzer

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Wed, May 04 2011 at 12:50pm -0400,
Martin K. Petersen <[email protected]> wrote:

> >>>>> "Mike" == Mike Snitzer <[email protected]> writes:
>
> Mike> lim->discard_zeroes_data = -1; was suspect to me too.
> Mike> But why default to 1 here?
>
> Because otherwise DM would default to having dzd to "unsupported",
> meaning the feature would never be turned on regardless of the bottom
> device capabilities.
>
> The old approach used the -1 value to indicate "has not been set". That
> was only really intended as a value for the stacking drivers, not for
> the LLDs. It was a bit of a hack and I'd rather deal with dzd the same
> way as we do with clustering.
>
>
> >> @@ -166,6 +166,7 @@ void blk_queue_make_request(struct request_queue
> >> *q, make_request_fn *mfn)
> >>
> >> blk_set_default_limits(&q->limits);
> >> blk_queue_max_hw_sectors(q, BLK_SAFE_MAX_SECTORS);
> >> + q->limits.discard_zeroes_data = 0;
> >>
> >> /*
> >> * by default assume old behaviour and bounce for any highmem page
>
> Mike> Only to then reset to 0 here? Shouldn't we default to 0 and only
> Mike> set to 1 where applicable (e.g. sd_config_discard)?
>
> My first approach was to set it in dm-table.c before stacking. But I
> thought it was icky to have the stacking driver ask for defaults and
> then have to tweak them for things to work correctly.
>
> The other option is to have blk_set_default_stacking_limits(). Or we
> could add a flag to blk_set_default_limits to indicate whether this is a
> LLD or a stacking driver.
>
> We already special-case BLK_SAFE_MAX_SECTORS when setting the request
> function. And that's the only non-stacking user of the default limits
> call. So that's why I disabled dzd there. Since this is a stable bugfix
> I also wanted to keep it small and simple. But I'm totally open to
> suggestions.

Your current approach sounds good. Might be good to briefly speak to
the duality of the stacking vs non-stacking approach in the associated
patch header.

Thanks for clarifying.

Mike

2011-05-05 08:34:02

by Karel Zak

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Wed, May 04, 2011 at 11:16:05AM -0400, Martin K. Petersen wrote:
> >>>>> "Lukas" == Lukas Czerner <[email protected]> writes:
>
> I got tired of poking around in sysfs to find the discard topology.
> Here's a patch against lsblk that adds a -D option to present this
> information in a human-readable form:

Applied, thanks.

> # lsblk -D
> NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
> sda 0 0B 0B 0
> └─sda1 0 0B 0B 0
> sdb 0 512B 2G 1
> └─sdb1 0 512B 2G 1

I have a question, 2.6.35 on my ThinkPad, non-SSD disk:

NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda 0 0B 0B 0
├─sda1 4294935040 0B 0B 0
├─sda2 4188038656 0B 0B 0
├─sda3 1346205184 0B 0B 0
├─sda4 3231165440 0B 0B 0
├─sda5 4188006400 0B 0B 0
│ └─kzak-home (dm-0) 0 0B 0B 0
└─sda6 2035725312 0B 0B 0


Does is make sense? The DISC-ALN is non-zero but DISC-GRAN is zero.

Note that cat /sys/block/sda/sda*/discard_alignment returns the same
numbers.

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2011-05-05 10:48:59

by Lukas Czerner

[permalink] [raw]
Subject: Re: [dm-devel] do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Thu, 5 May 2011, Karel Zak wrote:

> On Wed, May 04, 2011 at 11:16:05AM -0400, Martin K. Petersen wrote:
> > >>>>> "Lukas" == Lukas Czerner <[email protected]> writes:
> >
> > I got tired of poking around in sysfs to find the discard topology.
> > Here's a patch against lsblk that adds a -D option to present this
> > information in a human-readable form:
>
> Applied, thanks.
>
> > # lsblk -D
> > NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
> > sda 0 0B 0B 0
> > └─sda1 0 0B 0B 0
> > sdb 0 512B 2G 1
> > └─sdb1 0 512B 2G 1
>
> I have a question, 2.6.35 on my ThinkPad, non-SSD disk:
>
> NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
> sda 0 0B 0B 0
> ├─sda1 4294935040 0B 0B 0
> ├─sda2 4188038656 0B 0B 0
> ├─sda3 1346205184 0B 0B 0
> ├─sda4 3231165440 0B 0B 0
> ├─sda5 4188006400 0B 0B 0
> │ └─kzak-home (dm-0) 0 0B 0B 0
> └─sda6 2035725312 0B 0B 0
>
>
> Does is make sense? The DISC-ALN is non-zero but DISC-GRAN is zero.
>
> Note that cat /sys/block/sda/sda*/discard_alignment returns the same
> numbers.
>
> Karel
>
>

That is fixed by the kernel patch posted above :)

-Lukas

2011-05-18 12:16:47

by Mike Snitzer

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

Hey Martin,

On Wed, May 04 2011 at 11:10am -0400,
Martin K. Petersen <[email protected]> wrote:

> >>>>> "Lukas" == Lukas Czerner <[email protected]> writes:
>
> Lukas> Nevertheless there is something weird going on, because even when
> Lukas> I create striped volume I get this:
>
> Could you please try the following patch? It has a bunch of small tweaks
> to the discard stack in it. I'll split it up before posting for real but
> I'd like to know if it fixes your issue...

Would be ideal to get these fixes staged for 2.6.40.

Do you intend to split these patches up and post for 2.6.40 inclussion?

Please advise, thanks!


> block/libata/scsi: Various logical block provisioning fixes
>
> - Add sysfs documentation for the discard topology parameters
>
> - Fix discard stacking problem
>
> - Switch our libata SAT over to using the WRITE SAME limits
>
> - UNMAP alignment needs to be converted to bytes
>
> - Only report alignment and zeroes_data if the device supports discard
>
> Reported-by: Lukas Czerner <[email protected]>
> Signed-off-by: Martin K. Petersen <[email protected]>
>
> diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
> index 4873c75..c1eb41c 100644
> --- a/Documentation/ABI/testing/sysfs-block
> +++ b/Documentation/ABI/testing/sysfs-block
> @@ -142,3 +142,67 @@ Description:
> with the previous I/O request are enabled. When set to 2,
> all merge tries are disabled. The default value is 0 -
> which enables all types of merge tries.
> +
> +What: /sys/block/<disk>/discard_alignment
> +Date: May 2011
> +Contact: Martin K. Petersen <[email protected]>
> +Description:
> + Devices that support discard functionality may
> + internally allocate space in units that are bigger than
> + the exported logical block size. The discard_alignment
> + parameter indicates how many bytes the beginning of the
> + device is offset from the internal allocation unit's
> + natural alignment.
> +
> +What: /sys/block/<disk>/<partition>/discard_alignment
> +Date: May 2011
> +Contact: Martin K. Petersen <[email protected]>
> +Description:
> + Devices that support discard functionality may
> + internally allocate space in units that are bigger than
> + the exported logical block size. The discard_alignment
> + parameter indicates how many bytes the beginning of the
> + partition is offset from the internal allocation unit's
> + natural alignment.
> +
> +What: /sys/block/<disk>/queue/discard_granularity
> +Date: May 2011
> +Contact: Martin K. Petersen <[email protected]>
> +Description:
> + Devices that support discard functionality may
> + internally allocate space using units that are bigger
> + than the logical block size. The discard_granularity
> + parameter indicates the size of the internal allocation
> + unit in bytes if reported by the device. Otherwise the
> + discard_granularity will be set to match the device's
> + physical block size. A discard_granularity of 0 means
> + that the device does not support discard functionality.
> +
> +What: /sys/block/<disk>/queue/discard_max_bytes
> +Date: May 2011
> +Contact: Martin K. Petersen <[email protected]>
> +Description:
> + Devices that support discard functionality may have
> + internal limits on the number of bytes that can be
> + trimmed or unmapped in a single operation. Some storage
> + protocols also have inherent limits on the number of
> + blocks that can be described in a single command. The
> + discard_max_bytes parameter is set by the device driver
> + to the maximum number of bytes that can be discarded in
> + a single operation. Discard requests issued to the
> + device must not exceed this limit. A discard_max_bytes
> + value of 0 means that the device does not support
> + discard functionality.
> +
> +What: /sys/block/<disk>/queue/discard_zeroes_data
> +Date: May 2011
> +Contact: Martin K. Petersen <[email protected]>
> +Description:
> + Devices that support discard functionality may return
> + stale or random data when a previously discarded block
> + is read back. This can cause problems if the filesystem
> + expects discarded blocks to be explicitly cleared. If a
> + device reports that it deterministically returns zeroes
> + when a discarded area is read the discard_zeroes_data
> + parameter will be set to one. Otherwise it will be 0 and
> + the result of reading a discarded area is undefined.
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 1fa7692..42d3bf5 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -120,7 +120,7 @@ void blk_set_default_limits(struct queue_limits *lim)
> lim->discard_granularity = 0;
> lim->discard_alignment = 0;
> lim->discard_misaligned = 0;
> - lim->discard_zeroes_data = -1;
> + lim->discard_zeroes_data = 1;
> lim->logical_block_size = lim->physical_block_size = lim->io_min = 512;
> lim->bounce_pfn = (unsigned long)(BLK_BOUNCE_ANY >> PAGE_SHIFT);
> lim->alignment_offset = 0;
> @@ -166,6 +166,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>
> blk_set_default_limits(&q->limits);
> blk_queue_max_hw_sectors(q, BLK_SAFE_MAX_SECTORS);
> + q->limits.discard_zeroes_data = 0;
>
> /*
> * by default assume old behaviour and bounce for any highmem page
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index e2f57e9e..30ea95f 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -2138,7 +2138,7 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
> * with the unmap bit set.
> */
> if (ata_id_has_trim(args->id)) {
> - put_unaligned_be32(65535 * 512 / 8, &rbuf[20]);
> + put_unaligned_be64(65535 * 512 / 8, &rbuf[36]);
> put_unaligned_be32(1, &rbuf[28]);
> }
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index bd0806e..cc081dd 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -490,7 +490,8 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
> unsigned int max_blocks = 0;
>
> q->limits.discard_zeroes_data = sdkp->lbprz;
> - q->limits.discard_alignment = sdkp->unmap_alignment;
> + q->limits.discard_alignment = sdkp->unmap_alignment *
> + (logical_block_size >> 9);
> q->limits.discard_granularity =
> max(sdkp->physical_block_size,
> sdkp->unmap_granularity * logical_block_size);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2ad95fa..1416e3c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -257,7 +257,7 @@ struct queue_limits {
> unsigned char misaligned;
> unsigned char discard_misaligned;
> unsigned char cluster;
> - signed char discard_zeroes_data;
> + unsigned char discard_zeroes_data;
> };
>
> struct request_queue
> @@ -1066,13 +1066,16 @@ static inline int queue_limit_discard_alignment(struct queue_limits *lim, sector
> {
> unsigned int alignment = (sector << 9) & (lim->discard_granularity - 1);
>
> + if (!lim->max_discard_sectors)
> + return 0;
> +
> return (lim->discard_granularity + lim->discard_alignment - alignment)
> & (lim->discard_granularity - 1);
> }
>
> static inline unsigned int queue_discard_zeroes_data(struct request_queue *q)
> {
> - if (q->limits.discard_zeroes_data == 1)
> + if (q->limits.max_discard_sectors && q->limits.discard_zeroes_data == 1)
> return 1;
>
> return 0;
> --
> dm-devel mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/dm-devel

2011-05-18 12:52:41

by Mike Snitzer

[permalink] [raw]
Subject: Re: do not disable ext4 discards on first discard failure? [was: Re: dm snapshot: ignore discards issued to the snapshot-origin target]

On Wed, May 18 2011 at 8:16am -0400,
Mike Snitzer <[email protected]> wrote:

> Hey Martin,
>
> On Wed, May 04 2011 at 11:10am -0400,
> Martin K. Petersen <[email protected]> wrote:
>
> > >>>>> "Lukas" == Lukas Czerner <[email protected]> writes:
> >
> > Lukas> Nevertheless there is something weird going on, because even when
> > Lukas> I create striped volume I get this:
> >
> > Could you please try the following patch? It has a bunch of small tweaks
> > to the discard stack in it. I'll split it up before posting for real but
> > I'd like to know if it fixes your issue...
>
> Would be ideal to get these fixes staged for 2.6.40.
>
> Do you intend to split these patches up and post for 2.6.40 inclussion?

Ah, I now see you posted them and Jens picked some up.. thanks

Mike