Hello all
I am doing some testing of dm-thin on kernel 3.4.2 and the latest lvm built
from source (the rest is Ubuntu Precise 12.04).
There are a few problems with ext4, and different ones with xfs.
I am doing this:
dd if=/dev/zero of=zeroes bs=1M count=1000 conv=fsync
lvs
rm zeroes #optional
dd if=/dev/zero of=zeroes bs=1M count=1000 conv=fsync #again
lvs
rm zeroes #optional
...
dd if=/dev/zero of=zeroes bs=1M count=1000 conv=fsync #again
lvs
rm zeroes
fstrim /mnt/mountpoint
lvs
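(For reference, the whole sequence can be scripted as a small loop; the
mountpoint and the iteration count here are just examples from my setup:)

cd /mnt/mountpoint
for i in 1 2 3; do
    dd if=/dev/zero of=zeroes bs=1M count=1000 conv=fsync  # rewrite the same file
    lvs                                                    # watch pool / thin LV usage
    rm zeroes                                              # optional
done
fstrim /mnt/mountpoint
lvs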
On ext4 the problem is that it keeps allocating blocks at different
places, so lvs shows space occupation in the pool and the thin LV growing
at every iteration of dd, again and again, until the whole thin device is
allocated (really 100% of it). This is true regardless of whether I run
rm between one dd and the next.
The other problem is that, by doing this, ext4 always gets the worst-case
performance from thinp, about 140MB/sec on my system, because it is
constantly triggering block allocation, instead of the ~350MB/sec my
system reaches when already-allocated regions are reused (see the xfs
comparison below). I am on an MD raid-5 of 5 hdds.
I would suggest adding a "thinp mode" mount option to ext4 that changes
the allocator so it prefers to reallocate recently used and freed areas
instead of constantly picking new ones. Note that mount -o discard does
work and prevents the allocation from growing, but the writes still hit
the worst-case thinp performance. Alternatively, thinp could be improved
so that block allocation is fast :-P (*)
The good news, however, is that fstrim works correctly on ext4 and is
able to drop all the space allocated by the dd runs. mount -o discard
also works.
On xfs there is a different problem.
Xfs apparently re-uses the same blocks correctly, so after the first
write at 140MB/sec, subsequent overwrites of the same file run at full
speed, around 350MB/sec (the same speed as non-thin lvm), and space
occupation does not go up at every iteration of dd, with or without rm
in between. [ok, actually, retrying it now it needed 3 rewrites before
the allocation stabilized... probably an AG count thing.]
The problem with XFS, however, is that discard doesn't appear to work.
Fstrim doesn't work, and neither does "mount -o discard ... + rm zeroes".
There is apparently no way to drop the allocated blocks, as seen from
lvs. This contradicts what is written at
http://xfs.org/index.php/FITRIM/discard which declares fstrim and mount
-o discard to be working.
Please note that since I am above MD raid5 (I believe this is the
reason), the passdown of discards does not work, as my dmesg says:
[160508.497879] device-mapper: thin: Discard unsupported by data device
(dm-1): Disabling discard passdown.
but AFAIU, unless there is a thinp bug, this should not affect the
unmapping of thin blocks by fstrimming xfs... and in fact ext4 is able
to do that.
(*) The strange thing is that write performance appears to be roughly the
same for the default thin chunksize and for a 1MB thin chunksize. I would
have expected thinp allocation to be faster with larger chunksizes, but
it is actually slower (note that there are no snapshots here and hence no
CoW). This is also true if I set the thin pool not to zero newly
allocated blocks: performance is about 240MB/sec then, but again it
doesn't increase with larger chunksizes; it actually decreases slightly
with very large chunksizes such as 16MB. Why is that?
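(For completeness, this is roughly how the pools were created; I'm on
lvm2 built from source, so the exact option spelling may differ in other
versions. The -Zn switch is the same one mentioned further down in this
thread:

lvcreate -L 10G -T vg1/pooltry1 -c 1m -Zn    # thin pool, 1MB chunks, no zeroing of new blocks
lvcreate -V 15G -T vg1/pooltry1 -n thinlv1   # thin volume on top of it
)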
Thanks for your help
S.
On Mon, Jun 18, 2012 at 11:33:50PM +0200, Spelic wrote:
> Hello all
> I am doing some testing of dm-thin on kernel 3.4.2 and latest lvm
> from source (the rest is Ubuntu Precise 12.04).
> There are a few problems with ext4 and (different ones with) xfs
>
> I am doing this:
> dd if=/dev/zero of=zeroes bs=1M count=1000 conv=fsync
> lvs
> rm zeroes #optional
> dd if=/dev/zero of=zeroes bs=1M count=1000 conv=fsync #again
> lvs
> rm zeroes #optional
> ...
> dd if=/dev/zero of=zeroes bs=1M count=1000 conv=fsync #again
> lvs
> rm zeroes
> fstrim /mnt/mountpoint
> lvs
[snip ext4 problems]
> On xfs there is a different problem.
> Xfs apparently correctly re-uses the same blocks so that after the
> first write at 140MB/sec, subsequent overwrites of the same file are
> at full speed such as 350MB/sec (same speed as with non-thin lvm),
> and also you don't see space occupation going up at every iteration
> of dd, either with or without rm in-between the dd's. [ok actually
> now retrying it needed 3 rewrites to stabilize allocation...
> probably an AG count thing.]
That's just a characteristic of the allocation algorithm. It's not
something that you see in day-to-day operation of the filesystem,
though, because you rarely remove and rewrite a file like this
repeatedly. So in the real world, performance will be more like ext4
when you are running workloads where you actually store data for
longer than a millisecond...
Expect the 140MB/s number to be the normal performance case, because as
soon as you take a snapshot, overwrites require new blocks to be
allocated in dm-thinp. You don't get thinp for nothing - it has an
associated performance cost, as you are now finding out....
> However the problem with XFS is that discard doesn't appear to work.
> Fstrim doesn't work, and neither does "mount -o discard ... + rm
> zeroes" . There is apparently no way to drop the allocated blocks,
> as seen from lvs. This is in contrast to what it is written here
> http://xfs.org/index.php/FITRIM/discard which declare fstrim and
> mount -o discard to be working.
I don't see why it wouldn't be if the underlying device supports it.
Have you looked at a block trace or an xfs event trace to see if
discards are being issued by XFS?
Are you getting messages like:
XFS: (dev) discard failed for extent [0x123,4096], error -5
in dmesg, or is fstrim seeing errors returned from the trim ioctl?
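(Something along these lines would answer both questions; the device
path and mountpoint are only examples:

# trace what actually reaches the thin device while the trim runs
blktrace -d /dev/mapper/vg1-thinlv1 -o - | blkparse -i -
# in another terminal: verbose fstrim reports errors from the FITRIM ioctl
fstrim -v /mnt/mountpoint
dmesg | tail
)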
> Please note that since I am above MD raid5 (I believe this is the
> reason), the passdown of discards does not work, as my dmesg says:
> [160508.497879] device-mapper: thin: Discard unsupported by data
> device (dm-1): Disabling discard passdown.
> but AFAIU, unless there is a thinp bug, this should not affect the
> unmapping of thin blocks by fstrimming xfs... and in fact ext4 is
> able to do that.
Does ext4 report that same error?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Jun 18 2012 at 9:57pm -0400,
Dave Chinner <[email protected]> wrote:
> On Mon, Jun 18, 2012 at 11:33:50PM +0200, Spelic wrote:
>
> > Please note that since I am above MD raid5 (I believe this is the
> > reason), the passdown of discards does not work, as my dmesg says:
> > [160508.497879] device-mapper: thin: Discard unsupported by data
> > device (dm-1): Disabling discard passdown.
> > but AFAIU, unless there is a thinp bug, this should not affect the
> > unmapping of thin blocks by fstrimming xfs... and in fact ext4 is
> > able to do that.
>
> Does ext4 report that same error?
That message says the underlying device doesn't support discards
(because it is an MD device). But the thinp device still has discards
enabled -- it just won't pass the discards down to the underlying data
device.
So yes, it'll happen with ext4 -- it is generated when the thin-pool
device is loaded (which happens independently of the filesystem that is
layered on top).
The discards still inform the thin-pool that the corresponding extents
are no longer allocated.
On Mon, 18 Jun 2012, Mike Snitzer wrote:
> Date: Mon, 18 Jun 2012 23:12:42 -0400
> From: Mike Snitzer <[email protected]>
> To: Dave Chinner <[email protected]>
> Cc: Spelic <[email protected]>,
> device-mapper development <[email protected]>,
> [email protected], [email protected]
> Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
>
> On Mon, Jun 18 2012 at 9:57pm -0400,
> Dave Chinner <[email protected]> wrote:
>
> > On Mon, Jun 18, 2012 at 11:33:50PM +0200, Spelic wrote:
> >
> > > Please note that since I am above MD raid5 (I believe this is the
> > > reason), the passdown of discards does not work, as my dmesg says:
> > > [160508.497879] device-mapper: thin: Discard unsupported by data
> > > device (dm-1): Disabling discard passdown.
> > > but AFAIU, unless there is a thinp bug, this should not affect the
> > > unmapping of thin blocks by fstrimming xfs... and in fact ext4 is
> > > able to do that.
> >
> > Does ext4 report that same error?
>
> That message says the underlying device doesn't support discards
> (because it is an MD device). But the thinp device still has discards
> enabled -- it just won't pass the discards down to the underlying data
> device.
>
> So yes, it'll happen with ext4 -- it is generated when the thin-pool
> device is loaded (which happens independent of the filesystem that is
> layered ontop).
>
> The discards still inform the thin-pool that the corresponding extents
> are no longer allocated.
So do I understand correctly that even though the discard came
through and thinp took advantage of it, it still returns EOPNOTSUPP?
This seems rather suboptimal. IIRC there was a discussion about adding
an option to enable/disable sending discards from the thinp target down
to the device.
So maybe it could be a bit smarter than that and actually
enable/disable discard pass-through depending on the underlying
support, so we do not blindly send discards down to a device that does
not support them.
So we'll have three options:
pass through - always send discard down
backstop - never send discard down to the device
auto - send discard only if the underlying device supports it
What do you think?
-Lukas
On 06/19/12 08:32, Lukáš Czerner wrote:
>
> So do I understand correctly that even though the discard came
> through and thinp took advantage of it it still returns EOPNOTSUPP ?
> This seems rather suboptimal. IIRC there was a discussion to add an
> option to enable/disable sending discard in thinp target down
> to the device.
I'll ask this too...
do I understand correctly that dm-thin returns EOPNOTSUPP to the
filesystem layer even though it is using the discard to unmap blocks,
and that at that point XFS stops sending discards down (while ext4
keeps sending them)?
This looks like a dm-thin bug to me. Discards are "supported" in such
a scenario.
Do you have a patch for dm-thin to prevent it from returning EOPNOTSUPP?
Thank you
S.
On Tue, 19 Jun 2012, Spelic wrote:
> Date: Tue, 19 Jun 2012 13:29:55 +0200
> From: Spelic <[email protected]>
> To: Lukáš Czerner <[email protected]>
> Cc: Mike Snitzer <[email protected]>, Dave Chinner <[email protected]>,
> Spelic <[email protected]>,
> device-mapper development <[email protected]>,
> [email protected], [email protected]
> Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
>
> On 06/19/12 08:32, Lukáš Czerner wrote:
> >
> > So do I understand correctly that even though the discard came
> > through and thinp took advantage of it it still returns EOPNOTSUPP ?
> > This seems rather suboptimal. IIRC there was a discussion to add an
> > option to enable/disable sending discard in thinp target down
> > to the device.
>
> I'll ask this too...
> do I understand correctly that dm-thin returns EOPNOTSUPP to the filesystem
> layer even though it is using the discard to unmap blocks, and at that point
> XFS stops sending discards down there (while ext4 keeps sending them)?
>
> This looks like a bug of dm-thin to me. Discards are "supported" in such a
> scenario.
>
> Do you have a patch for dm-thin so to prevent it sending EOPTNOTSUPP ?
Yes, this behaviour definitely needs to change in dm-thin. I do not
have a patch; it was merely a proposal for how things could be done.
Not sure what Mike and the rest of the dm folks think about this.
-Lukas
>
> Thank you
> S.
>
On Tue, Jun 19 2012 at 2:32am -0400,
Lukáš Czerner <[email protected]> wrote:
> On Mon, 18 Jun 2012, Mike Snitzer wrote:
>
> > Date: Mon, 18 Jun 2012 23:12:42 -0400
> > From: Mike Snitzer <[email protected]>
> > To: Dave Chinner <[email protected]>
> > Cc: Spelic <[email protected]>,
> > device-mapper development <[email protected]>,
> > [email protected], [email protected]
> > Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
> >
> > On Mon, Jun 18 2012 at 9:57pm -0400,
> > Dave Chinner <[email protected]> wrote:
> >
> > > On Mon, Jun 18, 2012 at 11:33:50PM +0200, Spelic wrote:
> > >
> > > > Please note that since I am above MD raid5 (I believe this is the
> > > > reason), the passdown of discards does not work, as my dmesg says:
> > > > [160508.497879] device-mapper: thin: Discard unsupported by data
> > > > device (dm-1): Disabling discard passdown.
> > > > but AFAIU, unless there is a thinp bug, this should not affect the
> > > > unmapping of thin blocks by fstrimming xfs... and in fact ext4 is
> > > > able to do that.
> > >
> > > Does ext4 report that same error?
> >
> > That message says the underlying device doesn't support discards
> > (because it is an MD device). But the thinp device still has discards
> > enabled -- it just won't pass the discards down to the underlying data
> > device.
> >
> > So yes, it'll happen with ext4 -- it is generated when the thin-pool
> > device is loaded (which happens independent of the filesystem that is
> > layered ontop).
> >
> > The discards still inform the thin-pool that the corresponding extents
> > are no longer allocated.
>
> So do I understand correctly that even though the discard came
> through and thinp took advantage of it it still returns EOPNOTSUPP ?
No, not correct. Why are you assuming this? I must be missing
something from this discussion that led you there.
> This seems rather suboptimal. IIRC there was a discussion to add an
> option to enable/disable sending discard in thinp target down
> to the device.
>
> So maybe it might be a bit smarter than that and actually
> enable/disable discard pass through depending on the underlying
> support, so we do not blindly send discard down to the device even
> though it does not support it.
Yes, that is what we did.
Discards are enabled by default (including discard passdown), but if the
underlying data device doesn't support discards then the discards will
not be passed down.
And here are the feature controls that can be provided when loading the
thin-pool's DM table:
ignore_discard: disable discard
no_discard_passdown: don't pass discards down to the data device
-EOPNOTSUPP is only ever returned if 'ignore_discard' is provided.
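(In raw dmsetup terms, reusing the thin-pool table format that lvm2
generates later in this thread -- the device numbers are just
illustrative -- the features are appended after the feature count, e.g.:

dmsetup create pool --table "0 20971520 thin-pool 252:1 252:2 2048 0 1 no_discard_passdown"
dmsetup create pool --table "0 20971520 thin-pool 252:1 252:2 2048 0 1 ignore_discard"

Only the second form makes the pool reject discards with -EOPNOTSUPP.)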
On Tue, 19 Jun 2012, Mike Snitzer wrote:
> Date: Tue, 19 Jun 2012 09:16:49 -0400
> From: Mike Snitzer <[email protected]>
> To: Lukáš Czerner <[email protected]>
> Cc: Dave Chinner <[email protected]>, Spelic <[email protected]>,
> device-mapper development <[email protected]>,
> [email protected], [email protected]
> Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
>
> On Tue, Jun 19 2012 at 2:32am -0400,
> Lukáš Czerner <[email protected]> wrote:
>
> > On Mon, 18 Jun 2012, Mike Snitzer wrote:
> >
> > > Date: Mon, 18 Jun 2012 23:12:42 -0400
> > > From: Mike Snitzer <[email protected]>
> > > To: Dave Chinner <[email protected]>
> > > Cc: Spelic <[email protected]>,
> > > device-mapper development <[email protected]>,
> > > [email protected], [email protected]
> > > Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
> > >
> > > On Mon, Jun 18 2012 at 9:57pm -0400,
> > > Dave Chinner <[email protected]> wrote:
> > >
> > > > On Mon, Jun 18, 2012 at 11:33:50PM +0200, Spelic wrote:
> > > >
> > > > > Please note that since I am above MD raid5 (I believe this is the
> > > > > reason), the passdown of discards does not work, as my dmesg says:
> > > > > [160508.497879] device-mapper: thin: Discard unsupported by data
> > > > > device (dm-1): Disabling discard passdown.
> > > > > but AFAIU, unless there is a thinp bug, this should not affect the
> > > > > unmapping of thin blocks by fstrimming xfs... and in fact ext4 is
> > > > > able to do that.
> > > >
> > > > Does ext4 report that same error?
> > >
> > > That message says the underlying device doesn't support discards
> > > (because it is an MD device). But the thinp device still has discards
> > > enabled -- it just won't pass the discards down to the underlying data
> > > device.
> > >
> > > So yes, it'll happen with ext4 -- it is generated when the thin-pool
> > > device is loaded (which happens independent of the filesystem that is
> > > layered ontop).
> > >
> > > The discards still inform the thin-pool that the corresponding extents
> > > are no longer allocated.
> >
> > So do I understand correctly that even though the discard came
> > through and thinp took advantage of it it still returns EOPNOTSUPP ?
>
> No, not correct. Why are you assuming this? I must be missing
> something from this discussion that led you there.
Those two paragraphs led me to that conclusion:
That message says the underlying device doesn't support discards
(because it is an MD device). But the thinp device still has discards
enabled -- it just won't pass the discards down to the underlying data
device.
The discards still inform the thin-pool that the corresponding extents
are no longer allocated.
so I am a bit confused now. Why did dm-thin return EOPNOTSUPP then?
Is that because it has been configured with ignore_discard, or does it
actually take advantage of the discard while the underlying device does
not support it (and no_discard_passdown is not set), and so it returns
EOPNOTSUPP?
>
> > This seems rather suboptimal. IIRC there was a discussion to add an
> > option to enable/disable sending discard in thinp target down
> > to the device.
> >
> > So maybe it might be a bit smarter than that and actually
> > enable/disable discard pass through depending on the underlying
> > support, so we do not blindly send discard down to the device even
> > though it does not support it.
>
> Yes, that is what we did.
>
> Discards are enabled my default (including discard passdown), but if the
> underlying data device doesn't support discards then the discards will
> not be passed down.
>
> And here are the feature controls that can be provided when loading the
> thin-pool's DM table:
>
> ignore_discard: disable discard
> no_discard_passdown: don't pass discards down to the data device
>
> -EOPNOTSUPP is only ever returned if 'ignore_discard' is provided.
Ok, so in this case 'ignore_discard' has been configured?
Thanks!
-Lukas
On Tue, Jun 19 2012 at 9:25am -0400,
Lukáš Czerner <[email protected]> wrote:
> On Tue, 19 Jun 2012, Mike Snitzer wrote:
>
> > Date: Tue, 19 Jun 2012 09:16:49 -0400
> > From: Mike Snitzer <[email protected]>
> > To: Lukáš Czerner <[email protected]>
> > Cc: Dave Chinner <[email protected]>, Spelic <[email protected]>,
> > device-mapper development <[email protected]>,
> > [email protected], [email protected]
> > Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
> >
> > On Tue, Jun 19 2012 at 2:32am -0400,
> > Lukáš Czerner <[email protected]> wrote:
> >
> > > So do I understand correctly that even though the discard came
> > > through and thinp took advantage of it it still returns EOPNOTSUPP ?
> >
> > No, not correct. Why are you assuming this? I must be missing
> > something from this discussion that led you there.
>
> Those two paragraphs led me to that conclusion:
>
> That message says the underlying device doesn't support discards
> (because it is an MD device). But the thinp device still has discards
> enabled -- it just won't pass the discards down to the underlying data
> device.
>
> The discards still inform the thin-pool that the corresponding extents
> are no longer allocated.
>
> so I am a bit confused now. Why the dm-thin returned EOPNOTSUPP then
> ? Is that because it has been configured to ignore_discard, or it
> actually takes advantage of the discard but underlying device does
> not support it (and no_discard_passdown is not set) so it return
> EOPNOTSUPP ?
>
> >
> > > This seems rather suboptimal. IIRC there was a discussion to add an
> > > option to enable/disable sending discard in thinp target down
> > > to the device.
> > >
> > > So maybe it might be a bit smarter than that and actually
> > > enable/disable discard pass through depending on the underlying
> > > support, so we do not blindly send discard down to the device even
> > > though it does not support it.
> >
> > Yes, that is what we did.
> >
> > Discards are enabled my default (including discard passdown), but if the
> > underlying data device doesn't support discards then the discards will
> > not be passed down.
> >
> > And here are the feature controls that can be provided when loading the
> > thin-pool's DM table:
> >
> > ignore_discard: disable discard
> > no_discard_passdown: don't pass discards down to the data device
> >
> > -EOPNOTSUPP is only ever returned if 'ignore_discard' is provided.
>
> Ok, so in this case 'ignore_discard' has been configured ?
I don't recall Spelic saying anything about EOPNOTSUPP. So what has
made you zero in on an -EOPNOTSUPP return (which should not be
happening)?
On Tue, Jun 19 2012 at 7:29am -0400,
Spelic <[email protected]> wrote:
> On 06/19/12 08:32, Lukáš Czerner wrote:
> >
> >So do I understand correctly that even though the discard came
> >through and thinp took advantage of it it still returns EOPNOTSUPP ?
> >This seems rather suboptimal. IIRC there was a discussion to add an
> >option to enable/disable sending discard in thinp target down
> >to the device.
>
> I'll ask this too...
> do I understand correctly that dm-thin returns EOPNOTSUPP to the
> filesystem layer even though it is using the discard to unmap
> blocks, and at that point XFS stops sending discards down there
> (while ext4 keeps sending them)?
Are you actually seeing that? Or are you just seizing on Lukas'
misunderstanding?
> This looks like a bug of dm-thin to me. Discards are "supported" in
> such a scenario.
>
> Do you have a patch for dm-thin so to prevent it sending EOPTNOTSUPP ?
thinp should _not_ be sending -EOPNOTSUPP unless 'ignore_discard' is
provided as a feature when loading thin-pool's DM table.
On 06/19/12 15:30, Mike Snitzer wrote:
> I don't recall Spelic saying anything about EOPNOTSUPP. So what has
> made you zero in on an -EOPNOTSUPP return (which should not be
> happening)?
Exactly: I do not know if EOPNOTSUPP is being returned or not.
If this helps, I have configured dm-thin via lvm2
LVM version: 2.02.95(2) (2012-03-06)
Library version: 1.02.74 (2012-03-06)
Driver version: 4.22.0
from dmsetup table I see only one option, "skip_block_zeroing", and only
if I configure it with -Zn. I do not see anything regarding
ignore_discard.
vg1-pooltry1-tpool: 0 20971520 thin-pool 252:1 252:2 2048 0 1
skip_block_zeroing
vg1-pooltry1_tdata: 0 20971520 linear 9:20 62922752
vg1-pooltry1_tmeta: 0 8192 linear 9:20 83894272
vg1-thinlv1: 0 31457280 thin 252:3 1
and in dmesg:
[ 33.685200] device-mapper: thin: Discard unsupported by data device
(dm-2): Disabling discard passdown.
[ 33.709586] device-mapper: thin: Discard unsupported by data device
(dm-6): Disabling discard passdown.
I do not know by what mechanism xfs fails to unmap blocks from dm-thin,
but it really can't.
Anyone who has dm-thin installed can try it. This is 100% reproducible
for me.
On 6/19/12 8:52 AM, Spelic wrote:
> On 06/19/12 15:30, Mike Snitzer wrote:
>> I don't recall Spelic saying anything about EOPNOTSUPP. So what has made you zero in on an -EOPNOTSUPP return (which should not be happening)?
>
> Exactly: I do not know if EOPNOTSUPP is being returned or not.
>
> If this helps, I have configured dm-thin via lvm2
> LVM version: 2.02.95(2) (2012-03-06)
> Library version: 1.02.74 (2012-03-06)
> Driver version: 4.22.0
>
> from dmsetup table I only see one option : "skip_block_zeroing", if and only if I configure it with -Zn . I do not see anything regarding ignore_discard
>
> vg1-pooltry1-tpool: 0 20971520 thin-pool 252:1 252:2 2048 0 1 skip_block_zeroing
> vg1-pooltry1_tdata: 0 20971520 linear 9:20 62922752
> vg1-pooltry1_tmeta: 0 8192 linear 9:20 83894272
> vg1-thinlv1: 0 31457280 thin 252:3 1
>
>
> and in dmesg:
> [ 33.685200] device-mapper: thin: Discard unsupported by data device (dm-2): Disabling discard passdown.
> [ 33.709586] device-mapper: thin: Discard unsupported by data device (dm-6): Disabling discard passdown.
>
>
> I do not know what is the mechanism for which xfs cannot unmap blocks from dm-thin, but it really can't.
> If anyone has dm-thin installed he can try. This is 100% reproducible for me.
Might be worth seeing if xfs is ever getting to its discard code? There is a tracepoint...
# mount -t debugfs none /sys/kernel/debug
# echo 1 > /sys/kernel/debug/tracing/tracing_enabled
# echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_discard_extent/enable
<run test>
# cat /sys/kernel/debug/tracing/trace
-Eric
On Mon, 18 Jun 2012, Spelic wrote:
> Date: Mon, 18 Jun 2012 23:33:50 +0200
> From: Spelic <[email protected]>
> To: [email protected], [email protected],
> device-mapper development <[email protected]>
> Subject: Ext4 and xfs problems in dm-thin on allocation and discard
>
> Hello all
> I am doing some testing of dm-thin on kernel 3.4.2 and latest lvm from source
> (the rest is Ubuntu Precise 12.04).
> There are a few problems with ext4 and (different ones with) xfs
>
> I am doing this:
> dd if=/dev/zero of=zeroes bs=1M count=1000 conv=fsync
> lvs
> rm zeroes #optional
> dd if=/dev/zero of=zeroes bs=1M count=1000 conv=fsync #again
> lvs
> rm zeroes #optional
> ...
> dd if=/dev/zero of=zeroes bs=1M count=1000 conv=fsync #again
> lvs
> rm zeroes
> fstrim /mnt/mountpoint
> lvs
>
> On ext4 the problem is that it always reallocates blocks at different places,
> so you can see from lvs that space occupation in the pool and thinlv increases
> at each iteration of dd, again and again, until it has allocated the whole
> thin device (really 100% of it). And this is true regardless of me doing rm or
> not between one dd and the other.
> The other problem is that by doing this, ext4 always gets the worst
> performance from thinp, about 140MB/sec on my system, because it is constantly
> allocating blocks, instead of 350MB/sec which should have been with my system
> if it used already allocated regions (see below compared to xfs). I am on an
> MD raid-5 of 5 hdds.
> I could suggest to add a "thinp mode" mount option to ext4 affecting the
> allocator, so that it tries to reallocate recently used and freed areas and
> not constantly new areas. Note that mount -o discard does work and prevents
> allocation bloating, but it still always gets the worst write performances
> from thinp. Alternatively thinp could be improved so that block allocation is
> fast :-P (*)
> However, good news is that fstrim works correctly on ext4, and is able to drop
> all space allocated by all dd's. Also mount -o discard works.
I am happy to hear that discard actually works with ext4. Regarding
the performance problem, part of it has already been explained by
Dave and I agree with him.
With thin provisioning you'll get a totally different file system
layout than on a fully provisioned disk as you push more and more
writes to your drive. This unfortunately has a big impact on
performance, since file systems have a lot of optimizations for where
to put data/metadata on the drive and how to read them back, and on
thinly provisioned storage those optimizations do not help. So yes,
you just have to expect lower performance from a file system on top of
dm-thin. It is not, and never will be, the ideal solution for workloads
where you expect the best performance.
However, optimizations have to be done on both the dm and fs sides, and
that work is currently in progress; now that we have a "cheap" thinp
solution, I expect progress in that regard to be quite a bit faster.
-Lukas
>
> On xfs there is a different problem.
> Xfs apparently correctly re-uses the same blocks so that after the first write
> at 140MB/sec, subsequent overwrites of the same file are at full speed such as
> 350MB/sec (same speed as with non-thin lvm), and also you don't see space
> occupation going up at every iteration of dd, either with or without rm
> in-between the dd's. [ok actually now retrying it needed 3 rewrites to
> stabilize allocation... probably an AG count thing.]
> However the problem with XFS is that discard doesn't appear to work. Fstrim
> doesn't work, and neither does "mount -o discard ... + rm zeroes" . There is
> apparently no way to drop the allocated blocks, as seen from lvs. This is in
> contrast to what it is written here http://xfs.org/index.php/FITRIM/discard
> which declare fstrim and mount -o discard to be working.
> Please note that since I am above MD raid5 (I believe this is the reason), the
> passdown of discards does not work, as my dmesg says:
> [160508.497879] device-mapper: thin: Discard unsupported by data device
> (dm-1): Disabling discard passdown.
> but AFAIU, unless there is a thinp bug, this should not affect the unmapping
> of thin blocks by fstrimming xfs... and in fact ext4 is able to do that.
>
> (*) Strange thing is that write performance appears to be roughly the same for
> default thin chunksize and for 1MB thin chunksize. I would have expected thinp
> allocation to be faster with larger thin chunksizes but instead it is actually
> slower (note that there are no snapshots here and hence no CoW). This is also
> true if I set the thinpool to not zero newly allocated blocks: performances
> are about 240 MB/sec then, but again they don't increase with larger
> chunksizes, they actually decrease slightly with very large chunksizes such as
> 16MB. Why is that?
>
> Thanks for your help
> S.
>
On Tue, Jun 19, 2012 at 04:09:48PM +0200, Lukáš Czerner wrote:
>
> With thin provisioning you'll get totally different file system
> layout than on fully provisioned disk as you push more and more
> writes to your drive. This unfortunately has great impact on
> performance since file systems usually have a lot of optimization on
> where to put data/metadata on the drive and how to read them.
> However in case of thinly provisioned storage those optimization
> would not help. And yes, you just have to expect lower performance
> with dm-thin from the file system on top of it. It is not and it
> will never be ideal solution for workloads where you expect the best
> performance.
One of the things which would be nice to be able to easily set up is a
configuration where we get the benefits of thin provisioning with
respect to snapshots, but where the underlying block device used by
the file system is contiguous. That is, it would be really useful to
*not* use thin provisioning for the underlying file system, but to use
thin provisioned snapshots. That way we only pay the thinp
performance penalty for the snapshots, and not for normal file system
operations. This is something that would be very useful both for ext4
and xfs.
I talked to Alasdair about this a few months ago at the Collab Summit,
and I think it's doable today, but it was somewhat complicated to set
up. I don't recall the details now, but perhaps someone who's more
familiar with device mapper could outline the details, and perhaps we
can either simplify it or abstract it away in a convenient front-end
script?
- Ted
On 6/19/12 9:19 AM, Ted Ts'o wrote:
> On Tue, Jun 19, 2012 at 04:09:48PM +0200, Lukáš Czerner wrote:
>>
>> With thin provisioning you'll get totally different file system
>> layout than on fully provisioned disk as you push more and more
>> writes to your drive. This unfortunately has great impact on
>> performance since file systems usually have a lot of optimization on
>> where to put data/metadata on the drive and how to read them.
>> However in case of thinly provisioned storage those optimization
>> would not help. And yes, you just have to expect lower performance
>> with dm-thin from the file system on top of it. It is not and it
>> will never be ideal solution for workloads where you expect the best
>> performance.
>
> One of the things which would be nice to be able to easily set up is a
> configuration where we get the benefits of thin provisioning with
> respect to snapshost, but where the underlying block device used by
> the file system is contiguous. That is, it would be really useful to
> *not* use thin provisioning for the underlying file system, but to use
> thin provisioned snapshots. That way we only pay the thinp
> performance penalty for the snapshots, and not for normal file system
> operations. This is something that would be very useful both for ext4
> and xfs.
I agree, and have asked for exactly the same thing... though I have no
idea how hard it is to disentangle allocation-aware snapshots from thin
provisioned storage.
-Eric
On Tue, 19 Jun 2012, Ted Ts'o wrote:
> Date: Tue, 19 Jun 2012 10:19:33 -0400
> From: Ted Ts'o <[email protected]>
> To: Lukáš Czerner <[email protected]>
> Cc: Spelic <[email protected]>, [email protected],
> [email protected],
> device-mapper development <[email protected]>
> Subject: Re: Ext4 and xfs problems in dm-thin on allocation and discard
>
> On Tue, Jun 19, 2012 at 04:09:48PM +0200, Lukáš Czerner wrote:
> >
> > With thin provisioning you'll get totally different file system
> > layout than on fully provisioned disk as you push more and more
> > writes to your drive. This unfortunately has great impact on
> > performance since file systems usually have a lot of optimization on
> > where to put data/metadata on the drive and how to read them.
> > However in case of thinly provisioned storage those optimization
> > would not help. And yes, you just have to expect lower performance
> > with dm-thin from the file system on top of it. It is not and it
> > will never be ideal solution for workloads where you expect the best
> > performance.
>
> One of the things which would be nice to be able to easily set up is a
> configuration where we get the benefits of thin provisioning with
> respect to snapshost, but where the underlying block device used by
> the file system is contiguous. That is, it would be really useful to
> *not* use thin provisioning for the underlying file system, but to use
> thin provisioned snapshots. That way we only pay the thinp
> performance penalty for the snapshots, and not for normal file system
> operations. This is something that would be very useful both for ext4
> and xfs.
>
> I talked to Alasdair about this a few months ago at the Collab Summit,
> and I think it's doable today, but it was somewhat complicaed to set
> up. I don't recall the details now, but perhaps someone who's more
> familiar device mapper could outline the details, and perhaps we can
> either simplify it or abstract it away in a convenient front-end
> script?
like ssm, for example? :)
Yes, this would definitely help, and I think there are actually more
possible optimizations like this.
If we "cripple" dm-thin so that only the snapshot feature is provided
and the actual thin provisioning is not used, it would definitely help
performance for those who are only interested in snapshots. You'll
still have your file system layout mixed up once you start using
snapshots, but it will definitely be better. Some kind of fs/dm
interface for optimizing the layout might be helpful as well.
The other thing which could be done is to still allow the thinp feature
to be used, but try to keep file systems on dm-thin relatively
separated and contiguous (although probably not over their entire
size). It would certainly work only up to some thin pool utilization
threshold, but it is something. Also, if we can add some fs-side
optimization that tries not to span the entire file system but rather
uses smaller parts first (alter the block allocator so it does not
allocate blocks from random groups across the whole fs, but keeps a
smaller block-group working set at the start), this could be even more
useful.
-Lukas
>
> - Ted
>
On Tue, Jun 19, 2012 at 10:19:33AM -0400, Ted Ts'o wrote:
> One of the things which would be nice to be able to easily set up is a
> configuration where we get the benefits of thin provisioning with
> respect to snapshost, but where the underlying block device used by
> the file system is contiguous.
We're tracking this requirement (for lvm2) here:
https://bugzilla.redhat.com/show_bug.cgi?id=814737
Alasdair
On Tue, Jun 19 2012 at 9:52am -0400,
Spelic <[email protected]> wrote:
> On 06/19/12 15:30, Mike Snitzer wrote:
> >I don't recall Spelic saying anything about EOPNOTSUPP. So what
> >has made you zero in on an -EOPNOTSUPP return (which should not be
> >happening)?
>
> Exactly: I do not know if EOPNOTSUPP is being returned or not.
>
> If this helps, I have configured dm-thin via lvm2
> LVM version: 2.02.95(2) (2012-03-06)
> Library version: 1.02.74 (2012-03-06)
> Driver version: 4.22.0
>
> from dmsetup table I only see one option : "skip_block_zeroing", if
> and only if I configure it with -Zn . I do not see anything
> regarding ignore_discard
>
> vg1-pooltry1-tpool: 0 20971520 thin-pool 252:1 252:2 2048 0 1
> skip_block_zeroing
> vg1-pooltry1_tdata: 0 20971520 linear 9:20 62922752
> vg1-pooltry1_tmeta: 0 8192 linear 9:20 83894272
> vg1-thinlv1: 0 31457280 thin 252:3 1
>
>
> and in dmesg:
> [ 33.685200] device-mapper: thin: Discard unsupported by data
> device (dm-2): Disabling discard passdown.
> [ 33.709586] device-mapper: thin: Discard unsupported by data
> device (dm-6): Disabling discard passdown.
>
>
> I do not know what is the mechanism for which xfs cannot unmap
> blocks from dm-thin, but it really can't.
> If anyone has dm-thin installed he can try. This is 100%
> reproducible for me.
I was initially surprised by this considering the thinp-test-suite does
test a compilebench workload against xfs and ext4 using online discard
(-o discard).
But I just modified that test to use a thin-pool with 'ignore_discard'
and the test still passed on both ext4 and xfs.
So there is more work needed in the thinp-test-suite to use blktrace
hooks to verify that discards are occurring when the
compilebench-generated files are removed.
I'll work through that and report back.
On Tue, Jun 19 2012 at 10:43am -0400,
Alasdair G Kergon <[email protected]> wrote:
> On Tue, Jun 19, 2012 at 10:19:33AM -0400, Ted Ts'o wrote:
> > One of the things which would be nice to be able to easily set up is a
> > configuration where we get the benefits of thin provisioning with
> > respect to snapshost, but where the underlying block device used by
> > the file system is contiguous.
>
> We're tracking this requirement (for lvm2) here:
> https://bugzilla.redhat.com/show_bug.cgi?id=814737
That is an lvm2 BZ but there is further kernel work needed.
It should be noted that the "external origin" feature was added to the
thinp target with this commit:
http://git.kernel.org/linus/2dd9c257fbc243aa76ee6d
It is a start, but the external origin is kept read-only and any writes
trigger allocation of new blocks within the thin-pool.
We've talked some about the desire to have a fully provisioned volume
that only starts to get fragmented once snapshots are taken. The idea
is to move the origin into the data volume, via mapping, rather than
copying:
Dec 14 10:37:08 <ejt> we then build a data dev that consists of a linear mapping to that origin
Dec 14 10:37:12 <ejt> plus some extra stuff
Dec 14 10:37:23 <ejt> (the additonal free space for snapshots)
Dec 14 10:37:49 <ejt> we then prepare thinp metadata with a mapping to that origin
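(As a sketch of that idea in dmsetup terms -- device names and sizes
are made up -- the pool's data device would be assembled as a linear
concatenation of the existing origin plus extra space for snapshots:

dmsetup create pool_data <<EOF
0 209715200 linear /dev/vg/origin 0
209715200 20971520 linear /dev/vg/extra 0
EOF

and the thinp metadata would then be prepared so that the thin device
maps 1:1 onto the origin part of that data device.)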
On Tue, Jun 19, 2012 at 11:28:56AM -0400, Mike Snitzer wrote:
> That is an lvm2 BZ but there is further kernel work needed.
In principle, userspace should already be able to handle the replumbing I
think. (But when we work through the details of an online import, perhaps
we'll want some further kernel change for atomicity/speed reasons? In
particular we need to be able to do the last part of the metadata merge
quickly.)
Roughly:
1. rejig the lvm metadata for the new configuration [lvm]
- appends the "whole LV" data to the pool's data
2. Generate metadata for the appended data and append this to the metadata area [dmpd]
3. suspend all the affected devices [lvm]
4. link the already-prepared metadata into the existing metadata [dmpd]
5. resume all the devices (now using the new extended pool)
Alasdair
On Tue, Jun 19 2012 at 10:44am -0400,
Mike Snitzer <[email protected]> wrote:
> On Tue, Jun 19 2012 at 9:52am -0400,
> Spelic <[email protected]> wrote:
>
> > I do not know what is the mechanism for which xfs cannot unmap
> > blocks from dm-thin, but it really can't.
> > If anyone has dm-thin installed he can try. This is 100%
> > reproducible for me.
>
> I was initially surprised by this considering the thinp-test-suite does
> test a compilebench workload against xfs and ext4 using online discard
> (-o discard).
>
> But I just modified that test to use a thin-pool with 'ignore_discard'
> and the test still passed on both ext4 and xfs.
>
> So there is more work needed in the thinp-test-suite to use blktrace
> hooks to verify that discards are occuring when the compilebench
> generated files are removed.
>
> I'll work through that and report back.
blktrace shows discards for both xfs and ext4.
But in general xfs is issuing discards with much smaller extents than
ext4 does, e.g. (sizes in 512-byte sectors, as reported by blktrace):
to the thin device:
+ 128 vs + 32
to the thin-pool's data device:
+ 120 vs + 16
On Tue, Jun 19, 2012 at 11:28:56AM -0400, Mike Snitzer wrote:
>
> That is an lvm2 BZ but there is further kernel work needed.
>
> It should be noted that the "external origin" feature was added to the
> thinp target with this commit:
> http://git.kernel.org/linus/2dd9c257fbc243aa76ee6d
>
> It is start, but external origin is kept read-only and any writes
> trigger allocation of new blocks within the thin-pool.
Hmm... maybe this is what I had been told. I thought there was some
feature where you could take a read-only thinp snapshot of an external
volume (i.e., a pre-existing LVM2 volume, or a block device), and then
after that, make read-write snapshots using the read-only snapshot as
a base? Is that something that works today, or is planned? Or am I
totally confused?
And if it is something that works today, is there a web site or
documentation file that gives a recipe for how to use it if we want to
do some performance experiments (i.e., it doesn't have to be a user
friendly interface if that's not ready yet).
Thanks,
- Ted
On Tue, Jun 19, 2012 at 02:48:59PM -0400, Mike Snitzer wrote:
> On Tue, Jun 19 2012 at 10:44am -0400,
> Mike Snitzer <[email protected]> wrote:
>
> > On Tue, Jun 19 2012 at 9:52am -0400,
> > Spelic <[email protected]> wrote:
> >
> > > I do not know what is the mechanism for which xfs cannot unmap
> > > blocks from dm-thin, but it really can't.
> > > If anyone has dm-thin installed he can try. This is 100%
> > > reproducible for me.
> >
> > I was initially surprised by this considering the thinp-test-suite does
> > test a compilebench workload against xfs and ext4 using online discard
> > (-o discard).
> >
> > But I just modified that test to use a thin-pool with 'ignore_discard'
> > and the test still passed on both ext4 and xfs.
> >
> > So there is more work needed in the thinp-test-suite to use blktrace
> > hooks to verify that discards are occuring when the compilebench
> > generated files are removed.
> >
> > I'll work through that and report back.
>
> blktrace shows discards for both xfs and ext4.
>
> But in general xfs is issuing discards with much smaller extents than
> ext4 does, e.g.:
That's normal when you use -o discard - XFS sends extremely
fine-grained discards as they have to be issued during the checkpoint
commit that frees the extent. Hence they can't be aggregated like is
done in ext4.
As it is, no-one really should be using -o discard - it is extremely
inefficient compared to a background fstrim run given that discards
are unqueued, blocking IOs. It's just a bad idea until the lower
layers get fixed to allow asynchronous, vectored discards and SATA
supports queued discards...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Jun 20, 2012 at 06:06:31AM +1000, Dave Chinner wrote:
> > But in general xfs is issuing discards with much smaller extents than
> > ext4 does, e.g.:
>
> THat's normal when you use -o discard - XFS sends extremely
> fine-grained discards as the have to be issued during the checkpoint
> commit that frees the extent. Hence they can't be aggregated like is
> done in ext4.
Actually, ext4 is also sending the discards during (well, actually,
after) the commit which frees the extent/inode. We do aggregate them
while the commit is open, but once the transaction is committed, we
send out the discards. I suspect the difference is in the granularity
of the transactions between ext4 and xfs.
> As it is, no-one really should be using -o discard - it is extremely
> inefficient compared to a background fstrim run given that discards
> are unqueued, blocking IOs. It's just a bad idea until the lower
> layers get fixed to allow asynchronous, vectored discards and SATA
> supports queued discards...
What Dave said. :-) This is true for both ext4 and xfs.
As a result, I can very easily see there being a distinction made
between when we *do* want to pass the discards all the way down to the
device, and when we only want the thinp layer to process them ---
because for current devices, sending discards down to the physical
device is very heavyweight.
I'm not sure how we could do this without a nasty layering violation,
but some way in which we could label fstrim discards versus "we've
committed the unlink/truncate and so thinp can feel free to reuse
these blocks" discards would be interesting to consider.
- Ted
On Tue, Jun 19, 2012 at 04:21:30PM -0400, Ted Ts'o wrote:
> On Wed, Jun 20, 2012 at 06:06:31AM +1000, Dave Chinner wrote:
> > > But in general xfs is issuing discards with much smaller extents than
> > > ext4 does, e.g.:
> >
> > THat's normal when you use -o discard - XFS sends extremely
> > fine-grained discards as the have to be issued during the checkpoint
> > commit that frees the extent. Hence they can't be aggregated like is
> > done in ext4.
>
> Actually, ext4 is also sending the discards during (well, actually,
> after) the commit which frees the extent/inode. We do aggregate them
> while the commit is open, but once the transaction is committed, we
> send out the discards. I suspect the difference is in the granularity
> of the transactions between ext4 and xfs.
Exactly - XFS transactions are fine grained, checkpoints are coarse.
We don't merge extents freed in fine grained transactions inside
checkpoints. We probably could, but, well, it's complex to do in XFS
and merging adjacent requests is something the block layer is
supposed to do....
> > As it is, no-one really should be using -o discard - it is extremely
> > inefficient compared to a background fstrim run given that discards
> > are unqueued, blocking IOs. It's just a bad idea until the lower
> > layers get fixed to allow asynchronous, vectored discards and SATA
> > supports queued discards...
>
> What Dave said. :-) This is true for both ext4 and xfs.
>
> As a result, I can very easily see there being a distinction made
> between when we *do* want to pass the discards all the way down to the
> device, and when we only want the thinp layer to process them ---
> because for current devices, sending discards down to the physical
> device is very heavyweight.
>
> I'm not sure how we could do this without a nasty layering violation,
> but some way in which we could label fstrim discards versus "we've
> committed the unlink/truncate and so thinp can feel free to reuse
> these blocks" discards would be interesting to consider.
I think if we had better discard support from the block layer, it
wouldn't matter from a filesystem POV what discard support is
present in the block layer below it. I think it's better to get the
block layer interface fixed than to add new request types/labels to
filesystems to work around the current deficiencies.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Jun 19 2012 at 3:58pm -0400,
Ted Ts'o <[email protected]> wrote:
> On Tue, Jun 19, 2012 at 11:28:56AM -0400, Mike Snitzer wrote:
> >
> > That is an lvm2 BZ but there is further kernel work needed.
> >
> > It should be noted that the "external origin" feature was added to the
> > thinp target with this commit:
> > http://git.kernel.org/linus/2dd9c257fbc243aa76ee6d
> >
> > It is start, but external origin is kept read-only and any writes
> > trigger allocation of new blocks within the thin-pool.
>
> Hmm... maybe this is what I had been told. I thought there was some
> feature where you could take a read-only thinp snapshot of an external
> volume (i.e., a pre-existing LVM2 volume, or a block device), and then
> after that, make read-write snapshots using the read-only snapshot as
> a base? Is that something that works today, or is planned? Or am I
> totally confused?
The commit I referenced basically provides that capability.
> And if it is something that works today, is there a web site or
> documentation file that gives a recipe for how to use it if we want to
> do some performance experiments (i.e., it doesn't have to be a user
> friendly interface if that's not ready yet).
Documentation/device-mapper/thin-provisioning.txt has details on how to
use dmsetup to create a thin device that uses a read-only external
origin volume (so all reads to unprovisioned areas of the thin device
will be remapped to the external origin -- "external" meaning the volume
outside of the thin-pool).
The creation of a thin device w/ a read-only external origin gets you
started with a thin device that is effectively a snapshot of the origin
volume. That thin device is read-write -- all writes are provisioned
from the thin-pool that is backing the thin device. And you can take
snapshots (or recursive snapshots) of that thin device.
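(A minimal sketch of that recipe, following thin-provisioning.txt --
the sizes and the /dev/image origin are just the documentation's
example values:

# create a new thin device with id 0 in the pool; the origin isn't mentioned at this stage
dmsetup message /dev/mapper/pool 0 "create_thin 0"
# activate it, passing the external read-only origin as an extra parameter to the thin target
dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"

Reads of unprovisioned areas are redirected to /dev/image; writes
provision new blocks from the pool.)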
On 06/19/12 22:06, Dave Chinner wrote:
> On Tue, Jun 19, 2012 at 02:48:59PM -0400, Mike Snitzer wrote:
>> On Tue, Jun 19 2012 at 10:44am -0400,
>> Mike Snitzer<[email protected]> wrote:
>>
>>> On Tue, Jun 19 2012 at 9:52am -0400,
>>> Spelic<[email protected]> wrote:
>>>
>>>> I do not know what is the mechanism for which xfs cannot unmap
>>>> blocks from dm-thin, but it really can't.
>>>> If anyone has dm-thin installed he can try. This is 100%
>>>> reproducible for me.
>>> I was initially surprised by this considering the thinp-test-suite does
>>> test a compilebench workload against xfs and ext4 using online discard
>>> (-o discard).
>>>
>>> But I just modified that test to use a thin-pool with 'ignore_discard'
>>> and the test still passed on both ext4 and xfs.
>>>
>>> So there is more work needed in the thinp-test-suite to use blktrace
>>> hooks to verify that discards are occuring when the compilebench
>>> generated files are removed.
>>>
>>> I'll work through that and report back.
>> blktrace shows discards for both xfs and ext4.
>>
>> But in general xfs is issuing discards with much smaller extents than
>> ext4 does, e.g.:
> THat's normal when you use -o discard - XFS sends extremely
> fine-grained discards as the have to be issued during the checkpoint
> commit that frees the extent. Hence they can't be aggregated like is
> done in ext4.
>
> As it is, no-one really should be using -o discard - it is extremely
> inefficient compared to a background fstrim run given that discards
> are unqueued, blocking IOs. It's just a bad idea until the lower
> layers get fixed to allow asynchronous, vectored discards and SATA
> supports queued discards...
>
Could it be that the thin blocksize is larger than the discard
granularity used by xfs, so nothing ever gets unmapped?
I have tried thin pools with the default blocksize (64k afair with lvm2)
and with 1MB.
HOWEVER, I have also tried fstrim on xfs, and that is likewise unable to
unmap anything from dm-thin.
What is the granularity of fstrim on xfs?
Sorry, I can't access the machine right now; maybe tomorrow, or over the
weekend.
On Tue, Jun 19, 2012 at 11:37:54PM +0200, Spelic wrote:
> On 06/19/12 22:06, Dave Chinner wrote:
> >On Tue, Jun 19, 2012 at 02:48:59PM -0400, Mike Snitzer wrote:
> >>On Tue, Jun 19 2012 at 10:44am -0400,
> >>Mike Snitzer<[email protected]> wrote:
> >>
> >>>On Tue, Jun 19 2012 at 9:52am -0400,
> >>>Spelic<[email protected]> wrote:
> >>>
> >>>>I do not know what is the mechanism for which xfs cannot unmap
> >>>>blocks from dm-thin, but it really can't.
> >>>>If anyone has dm-thin installed he can try. This is 100%
> >>>>reproducible for me.
> >>>I was initially surprised by this considering the thinp-test-suite does
> >>>test a compilebench workload against xfs and ext4 using online discard
> >>>(-o discard).
> >>>
> >>>But I just modified that test to use a thin-pool with 'ignore_discard'
> >>>and the test still passed on both ext4 and xfs.
> >>>
> >>>So there is more work needed in the thinp-test-suite to use blktrace
> >>>hooks to verify that discards are occuring when the compilebench
> >>>generated files are removed.
> >>>
> >>>I'll work through that and report back.
> >>blktrace shows discards for both xfs and ext4.
> >>
> >>But in general xfs is issuing discards with much smaller extents than
> >>ext4 does, e.g.:
> >THat's normal when you use -o discard - XFS sends extremely
> >fine-grained discards as the have to be issued during the checkpoint
> >commit that frees the extent. Hence they can't be aggregated like is
> >done in ext4.
> >
> >As it is, no-one really should be using -o discard - it is extremely
> >inefficient compared to a background fstrim run given that discards
> >are unqueued, blocking IOs. It's just a bad idea until the lower
> >layers get fixed to allow asynchronous, vectored discards and SATA
> >supports queued discards...
> >
>
> Could it be that the thin blocksize is larger than the discard
> granularity by xfs so nothing ever gets unmapped?
For -o discard, possibly. For fstrim, unlikely.
> I have tried thin pools with the default blocksize (64k afair with
> lvm2) and 1MB.
> HOWEVER I also have tried fstrim on xfs, and that is also not
> capable to unmap things from the dm-thin.
> What is the granularity with fstrim in xfs?
Whatever granularity you passed fstrim. You need to run an event
trace on XFS to find out if it is issuing discards before going
any further..
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Jun 20, 2012 at 06:39:38AM +1000, Dave Chinner wrote:
> Exactly - XFS transactions are fine grained, checkpoints are coarse.
> We don't merge extents freed in fine grained transactions inside
> checkpoints. We probably could, but, well, it's complex to do in XFS
> and merging adjacent requests is something the block layer is
> supposed to do....
Last time I checked it actually tries to do that for discard requests,
but then badly falls flat (=oopses). That's the reason why the XFS
transaction commit code still uses the highly suboptimal synchronous
blkdev_issue_discard instead of the async variant I wrote when designing
the code.
Another "issue" with the XFS discard pattern and the current block
layer implementation is that XFS frees a lot of small metadata like
inode clusters and btree blocks and discards them as well. If those
simply fill one of the vectors in a range ATA TRIM command and/or a
queueable command that's not much of an issue, but with the current
combination of non-queueable, non-vetored TRIM that's a fairly nasty
pattern.
So until the block layer is sorted out I can not recommend actually
using -o dicard. I planned to sort out the block layer issues ASAP
when writing that code, but other things have kept me busy every since.
Ok guys, I think I found the bug. One or more bugs.
The pool has a 1MB chunksize.
In sysfs the thin volume has queue/discard_max_bytes and
queue/discard_granularity both equal to 1048576, and it has
discard_alignment = 0, which according to the sysfs-block documentation
is correct (a less misleading name would have been discard_offset imho).
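(For reference, read straight from sysfs -- dm-9 here stands for
whatever dm minor the thin volume got on your system:

cat /sys/block/dm-9/queue/discard_granularity   # 1048576
cat /sys/block/dm-9/queue/discard_max_bytes     # 1048576
cat /sys/block/dm-9/discard_alignment           # 0
)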
Here is the blktrace from ext4 fstrim:
...
252,9 17 498 0.030466556 841 Q D 19898368 + 2048 [fstrim]
252,9 17 499 0.030467501 841 Q D 19900416 + 2048 [fstrim]
252,9 17 500 0.030468359 841 Q D 19902464 + 2048 [fstrim]
252,9 17 501 0.030469313 841 Q D 19904512 + 2048 [fstrim]
252,9 17 502 0.030470144 841 Q D 19906560 + 2048 [fstrim]
252,9 17 503 0.030471381 841 Q D 19908608 + 2048 [fstrim]
252,9 17 504 0.030472473 841 Q D 19910656 + 2048 [fstrim]
252,9 17 505 0.030473504 841 Q D 19912704 + 2048 [fstrim]
252,9 17 506 0.030474561 841 Q D 19914752 + 2048 [fstrim]
252,9 17 507 0.030475571 841 Q D 19916800 + 2048 [fstrim]
252,9 17 508 0.030476423 841 Q D 19918848 + 2048 [fstrim]
252,9 17 509 0.030477341 841 Q D 19920896 + 2048 [fstrim]
252,9 17 510 0.034299630 841 Q D 19922944 + 2048 [fstrim]
252,9 17 511 0.034306880 841 Q D 19924992 + 2048 [fstrim]
252,9 17 512 0.034307955 841 Q D 19927040 + 2048 [fstrim]
252,9 17 513 0.034308928 841 Q D 19929088 + 2048 [fstrim]
252,9 17 514 0.034309945 841 Q D 19931136 + 2048 [fstrim]
252,9 17 515 0.034311007 841 Q D 19933184 + 2048 [fstrim]
252,9 17 516 0.034312008 841 Q D 19935232 + 2048 [fstrim]
252,9 17 517 0.034313122 841 Q D 19937280 + 2048 [fstrim]
252,9 17 518 0.034314013 841 Q D 19939328 + 2048 [fstrim]
252,9 17 519 0.034314940 841 Q D 19941376 + 2048 [fstrim]
252,9 17 520 0.034315835 841 Q D 19943424 + 2048 [fstrim]
252,9 17 521 0.034316662 841 Q D 19945472 + 2048 [fstrim]
252,9 17 522 0.034317547 841 Q D 19947520 + 2048 [fstrim]
...
Here is the blktrace from xfs fstrim:
252,12 16 1 0.000000000 554 Q D 96 + 2048 [fstrim]
252,12 16 2 0.000010149 554 Q D 2144 + 2048 [fstrim]
252,12 16 3 0.000011349 554 Q D 4192 + 2048 [fstrim]
252,12 16 4 0.000012584 554 Q D 6240 + 2048 [fstrim]
252,12 16 5 0.000013685 554 Q D 8288 + 2048 [fstrim]
252,12 16 6 0.000014660 554 Q D 10336 + 2048 [fstrim]
252,12 16 7 0.000015707 554 Q D 12384 + 2048 [fstrim]
252,12 16 8 0.000016692 554 Q D 14432 + 2048 [fstrim]
252,12 16 9 0.000017594 554 Q D 16480 + 2048 [fstrim]
252,12 16 10 0.000018539 554 Q D 18528 + 2048 [fstrim]
252,12 16 11 0.000019434 554 Q D 20576 + 2048 [fstrim]
252,12 16 12 0.000020879 554 Q D 22624 + 2048 [fstrim]
252,12 16 13 0.000021856 554 Q D 24672 + 2048 [fstrim]
252,12 16 14 0.000022786 554 Q D 26720 + 2048 [fstrim]
252,12 16 15 0.000023699 554 Q D 28768 + 2048 [fstrim]
252,12 16 16 0.000024672 554 Q D 30816 + 2048 [fstrim]
252,12 16 17 0.000025467 554 Q D 32864 + 2048 [fstrim]
252,12 16 18 0.000026374 554 Q D 34912 + 2048 [fstrim]
252,12 16 19 0.000027194 554 Q D 36960 + 2048 [fstrim]
252,12 16 20 0.000028137 554 Q D 39008 + 2048 [fstrim]
252,12 16 21 0.000029524 554 Q D 41056 + 2048 [fstrim]
252,12 16 22 0.000030479 554 Q D 43104 + 2048 [fstrim]
252,12 16 23 0.000031306 554 Q D 45152 + 2048 [fstrim]
252,12 16 24 0.000032134 554 Q D 47200 + 2048 [fstrim]
252,12 16 25 0.000032964 554 Q D 49248 + 2048 [fstrim]
252,12 16 26 0.000033794 554 Q D 51296 + 2048 [fstrim]
As you can see, while ext4 correctly aligns the discards to 1MB, xfs
does not.
It looks like an fstrim or xfs bug: they do not look at discard_alignment
(= 0; again, a less misleading name would be discard_offset imho) and
discard_granularity (= 1MB), and they do not base their alignment on those.
Clearly the dm-thin cannot unmap anything if the 1MB regions are not
fully covered by a single discard. Note that specifying a large -m
option for fstrim does NOT widen the discards above 2048 sectors, and
this is correct because discard_max_bytes for that device is 1048576.
If discard_max_bytes could be made much larger, these kinds of bugs
could be ameliorated, especially in complex situations like layers over
layers, virtualization etc.
Note that ext4 also has parts of the discard without the 1MB alignment,
as seen with blktrace (outside my snippet), so this might also need to
be fixed, but most of it is aligned to 1MB. In xfs no part is aligned
to 1MB.
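To make the failure mode concrete: dm-thin can only unmap a chunk that is
entirely covered by a single discard, so whoever splits the range (fstrim
or blkdev_issue_discard) would have to shrink each extent inward to chunk
boundaries first. A rough sketch of that arithmetic using the numbers
above (covered_range() is only an illustrative helper, not kernel code):

/* Trim a byte range inward to discard_granularity boundaries, offset by
 * discard_alignment, so the result covers only whole thin chunks. */
#include <stdio.h>
#include <stdint.h>

static int covered_range(uint64_t start, uint64_t len,
                         uint64_t granularity, uint64_t alignment,
                         uint64_t *out_start, uint64_t *out_len)
{
        /* first granule boundary at or after 'start' */
        uint64_t first = ((start - alignment + granularity - 1) / granularity)
                         * granularity + alignment;
        /* last granule boundary at or before 'start + len' */
        uint64_t last  = ((start + len - alignment) / granularity)
                         * granularity + alignment;

        if (last <= first)
                return 0;               /* no whole chunk is covered */
        *out_start = first;
        *out_len   = last - first;
        return 1;
}

int main(void)
{
        uint64_t s, l;

        /* XFS-style extent: sector 96 (48KiB), 1MiB long, 1MiB granules:
         * no whole chunk is covered, so dm-thin can unmap nothing. */
        if (!covered_range(96 * 512ULL, 1 << 20, 1 << 20, 0, &s, &l))
                printf("nothing dm-thin could unmap\n");

        /* ext4-style extent: sector 19898368 is chunk aligned, so the
         * whole 1MiB can be unmapped. */
        if (covered_range(19898368 * 512ULL, 1 << 20, 1 << 20, 0, &s, &l))
                printf("bytes dm-thin could unmap: %llu\n",
                       (unsigned long long)l);
        return 0;
}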
Now, another problem:
Firstly I wanted to say that in my original post I missed the
conv=notrunc for dd: I complained about the performance because I
expected the zero files to be rewritten in place without block
re-provisioning by dm-thin, but clearly without conv=notrunc this was
not happening. I confirm that with conv=notrunc performance is high at
the first rewrite, also on ext4, and the space occupied in the thin
volume does not increase at every rewrite by dd.
HOWEVER
by NOT specifying conv=notrunc, the behaviour of dd / ext4 / dm-thin is
different depending on whether skip_block_zeroing is specified or not.
If skip_block_zeroing is not specified (provisioned blocks are
pre-zeroed), the space occupied by dd truncate + rewrite INCREASES at
every rewrite, while if skip_block_zeroing IS specified, dd truncate +
rewrite DOES NOT increase the space occupied on the thin volume. Note:
try this on ext4, not xfs.
This looks very strange to me. The only reason I can think of is some
kind of cooperative behaviour of ext4 with the variable
dm-X/queue/discard_zeroes_data
which is different in the two cases. Can anyone give an explanation or
check if this is the intended behaviour?
And still an open question: why does the speed of provisioning new
blocks not increase with increasing chunk size (64K --> 1MB --> 16MB...),
not even when skip_block_zeroing has been set and there is no CoW?
On Wed, Jun 20, 2012 at 02:11:31PM +0200, Spelic wrote:
> Ok guys, I think I found the bug. One or more bugs.
>
>
> Pool has chunksize 1MB.
> In sysfs the thin volume has: queue/discard_max_bytes and
> queue/discard_granularity are 1048576 .
> And it has discard_alignment = 0, which based on sysfs-block
> documentation is correct (a less misleading name would have been
> discard_offset imho).
> Here is the blktrace from ext4 fstrim:
> ...
> 252,9 17 498 0.030466556 841 Q D 19898368 + 2048 [fstrim]
> 252,9 17 499 0.030467501 841 Q D 19900416 + 2048 [fstrim]
> 252,9 17 500 0.030468359 841 Q D 19902464 + 2048 [fstrim]
> ...
>
> Here is the blktrace from xfs fstrim:
> 252,12 16 1 0.000000000 554 Q D 96 + 2048 [fstrim]
> 252,12 16 2 0.000010149 554 Q D 2144 + 2048 [fstrim]
> 252,12 16 3 0.000011349 554 Q D 4192 + 2048 [fstrim]
.....
>
>
> As you can see, while ext4 correctly aligns the discards to 1MB, xfs
> does not.
XFS just sends a large extent to blkdev_issue_discard(), and cares
nothing about discard alignment or granularity.
> It looks like an fstrim or xfs bug: they don't look at
> discard_alignment (=0 ... a less misleading name would be
> discard_offset imho) + discard_granularity (=1MB) and they don't
> base alignments on those.
It looks like blkdev_issue_discard() has reduced each discard to
bios of a single "granule" (1MB), and not aligned them, hence they
are ignored by dm-thinp.
what are the discard parameters exposed by dm-thinp in
/sys/block/<thinp-blkdev>/queue/discard*
It looks to me that dmthinp might be setting discard_max_bytes to
1MB rather than discard_granularity. Looking at dm-thin.c:
static void set_discard_limits(struct pool *pool, struct queue_limits *limits)
{
/*
* FIXME: these limits may be incompatible with the pool's data device
*/
limits->max_discard_sectors = pool->sectors_per_block;
/*
* This is just a hint, and not enforced. We have to cope with
* bios that overlap 2 blocks.
*/
limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
limits->discard_zeroes_data = pool->pf.zero_new_blocks;
}
Yes - discard_max_bytes == discard_granularity, and so
blkdev_issue_discard fails to align the request properly. As it is,
setting discard_max_bytes to the thinp block size is silly - it
means you'll never get range requests, and we send a discard for
every single block in a range rather than having the thinp code
iterate over a range itself.
i.e. this is not a filesystem bug that is causing the problem....
Cheers,
Dave.
--
Dave Chinner
[email protected]
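To illustrate Dave's last point: instead of advertising discard_max_bytes
of one chunk, the thin target could accept an arbitrarily large discard
and walk it chunk by chunk itself. A plain userspace sketch of that loop
(unmap_chunk() is a hypothetical stand-in for the real mapping code, and
the 1MiB chunk matches the pool in this report):

#include <stdint.h>
#include <stdio.h>

#define CHUNK_SECTORS 2048              /* 1MiB pool chunk, in sectors */

static void unmap_chunk(uint64_t chunk) /* placeholder for real mapping code */
{
        printf("unmap thin chunk %llu\n", (unsigned long long)chunk);
}

static void discard_range(uint64_t start_sector, uint64_t nr_sectors)
{
        uint64_t first = (start_sector + CHUNK_SECTORS - 1) / CHUNK_SECTORS;
        uint64_t last  = (start_sector + nr_sectors) / CHUNK_SECTORS;
        uint64_t c;

        /* Only chunks completely inside the range may be unmapped. */
        for (c = first; c < last; c++)
                unmap_chunk(c);
}

int main(void)
{
        /* One big, misaligned discard: the partial chunks at both ends
         * are skipped, the seven whole chunks in between are unmapped. */
        discard_range(96, 8 * CHUNK_SECTORS);
        return 0;
}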
On Wed, Jun 20 2012 at 6:53pm -0400,
Dave Chinner <[email protected]> wrote:
> On Wed, Jun 20, 2012 at 02:11:31PM +0200, Spelic wrote:
> > Ok guys, I think I found the bug. One or more bugs.
> >
> >
> > Pool has chunksize 1MB.
> > In sysfs the thin volume has: queue/discard_max_bytes and
> > queue/discard_granularity are 1048576 .
> > And it has discard_alignment = 0, which based on sysfs-block
> > documentation is correct (a less misleading name would have been
> > discard_offset imho).
> > Here is the blktrace from ext4 fstrim:
> > ...
> > 252,9 17 498 0.030466556 841 Q D 19898368 + 2048 [fstrim]
> > 252,9 17 499 0.030467501 841 Q D 19900416 + 2048 [fstrim]
> > 252,9 17 500 0.030468359 841 Q D 19902464 + 2048 [fstrim]
> > ...
> >
> > Here is the blktrace from xfs fstrim:
> > 252,12 16 1 0.000000000 554 Q D 96 + 2048 [fstrim]
> > 252,12 16 2 0.000010149 554 Q D 2144 + 2048 [fstrim]
> > 252,12 16 3 0.000011349 554 Q D 4192 + 2048 [fstrim]
.....
> >
> >
> > As you can see, while ext4 correctly aligns the discards to 1MB, xfs
> > does not.
>
> XFS just sends a large extent to blkdev_issue_discard(), and cares
> nothing about discard alignment or granularity.
>
> > It looks like an fstrim or xfs bug: they don't look at
> > discard_alignment (=0 ... a less misleading name would be
> > discard_offset imho) + discard_granularity (=1MB) and they don't
> > base alignments on those.
>
> It looks like blkdev_issue_discard() has reduced each discard to
> bios of a single "granule" (1MB), and not aligned them, hence they
> are ignored by dm-thinp.
>
> what are the discard parameters exposed by dm-thinp in
> /sys/block/<thinp-blkdev>/queue/discard*
>
> It looks to me that dmthinp might be setting discard_max_bytes to
> 1MB rather than discard_granularity. Looking at dm-thin.c:
>
> static void set_discard_limits(struct pool *pool, struct queue_limits *limits)
> {
> /*
> * FIXME: these limits may be incompatible with the pool's data device
> */
> limits->max_discard_sectors = pool->sectors_per_block;
>
> /*
> * This is just a hint, and not enforced. We have to cope with
> * bios that overlap 2 blocks.
> */
> limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> limits->discard_zeroes_data = pool->pf.zero_new_blocks;
> }
>
>
> Yes - discard_max_bytes == discard_granularity, and so
> blkdev_issue_discard fails to align the request properly. As it is,
> setting discard_max_bytes to the thinp block size is silly - it
> means you'll never get range requests, and we send a discard for
> every single block in a range rather than having the thinp code
> iterate over a range itself.
So 2 different issues:
1) blkdev_issue_discard isn't properly aligning
2) thinp should accept larger discards (up to the stacked
discard_max_bytes rather than setting an override)
> i.e. this is not a filesystem bug that is causing the problem....
Paolo Bonzini fixed blkdev_issue_discard to properly align some time
ago; unfortunately the patches slipped through the cracks (cc'ing Paolo,
Jens, and Christoph).
Here are references to Paolo's patches:
0/2 https://lkml.org/lkml/2012/3/14/323
1/2 https://lkml.org/lkml/2012/3/14/324
2/2 https://lkml.org/lkml/2012/3/14/325
Patch 2/2 specifically addresses the case where:
discard_max_bytes == discard_granularity
Paolo, any chance you could resend to Jens (maybe with hch's comments on
patch#2 accounted for)? Also, please add hch's Reviewed-by when
reposting.
(would love to see this fixed for 3.5-rcX but if not 3.6 it is?)
On Thu, Jun 21, 2012 at 01:47:43PM -0400, Mike Snitzer wrote:
> On Wed, Jun 20 2012 at 6:53pm -0400,
> Dave Chinner <[email protected]> wrote:
>
> > On Wed, Jun 20, 2012 at 02:11:31PM +0200, Spelic wrote:
> > > Ok guys, I think I found the bug. One or more bugs.
> > >
> > >
> > > Pool has chunksize 1MB.
> > > In sysfs the thin volume has: queue/discard_max_bytes and
> > > queue/discard_granularity are 1048576 .
> > > And it has discard_alignment = 0, which based on sysfs-block
> > > documentation is correct (a less misleading name would have been
> > > discard_offset imho).
> > > Here is the blktrace from ext4 fstrim:
> > > ...
> > > 252,9 17 498 0.030466556 841 Q D 19898368 + 2048 [fstrim]
> > > 252,9 17 499 0.030467501 841 Q D 19900416 + 2048 [fstrim]
> > > 252,9 17 500 0.030468359 841 Q D 19902464 + 2048 [fstrim]
....
> > > Here is the blktrace from xfs fstrim:
> > > 252,12 16 1 0.000000000 554 Q D 96 + 2048 [fstrim]
> > > 252,12 16 2 0.000010149 554 Q D 2144 + 2048 [fstrim]
> > > 252,12 16 3 0.000011349 554 Q D 4192 + 2048 [fstrim]
.....
> > It looks like blkdev_issue_discard() has reduced each discard to
> > bios of a single "granule" (1MB), and not aligned them, hence they
> > are ignored by dm-thinp.
> >
> > what are the discard parameters exposed by dm-thinp in
> > /sys/block/<thinp-blkdev>/queue/discard*
> >
> > It looks to me that dmthinp might be setting discard_max_bytes to
> > 1MB rather than discard_granularity. Looking at dm-thin.c:
> >
> > static void set_discard_limits(struct pool *pool, struct queue_limits *limits)
> > {
> > /*
> > * FIXME: these limits may be incompatible with the pool's data device
> > */
> > limits->max_discard_sectors = pool->sectors_per_block;
> >
> > /*
> > * This is just a hint, and not enforced. We have to cope with
> > * bios that overlap 2 blocks.
> > */
> > limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
> > limits->discard_zeroes_data = pool->pf.zero_new_blocks;
> > }
> >
> >
> > Yes - discard_max_bytes == discard_granularity, and so
> > blkdev_issue_discard fails to align the request properly. As it is,
> > setting discard_max_bytes to the thinp block size is silly - it
> > means you'll never get range requests, and we send a discard for
> > every single block in a range rather than having the thinp code
> > iterate over a range itself.
>
> So 2 different issues:
> 1) blkdev_issue_discard isn't properly aligning
> 2) thinp should accept larger discards (up to the stacked
> discard_max_bytes rather than setting an override)
Yes, in effect, but there's no real reason I can see why thinp can't
accept larger discard requests than the underlying stack and break
them up appropriately itself....
> > i.e. this is not a filesystem bug that is causing the problem....
>
> Paolo Bonzini fixed blkdev_issue_discard to properly align some time
> ago; unfortunately the patches slipped through the cracks (cc'ing Paolo,
> Jens, and Christoph).
>
> Here are references to Paolo's patches:
> 0/2 https://lkml.org/lkml/2012/3/14/323
> 1/2 https://lkml.org/lkml/2012/3/14/324
> 2/2 https://lkml.org/lkml/2012/3/14/325
>
> Patch 2/2 specifically addresses the case where:
> discard_max_bytes == discard_granularity
>
> Paolo, any chance you could resend to Jens (maybe with hch's comments on
> patch#2 accounted for)? Also, please add hch's Reviewed-by when
> reposting.
>
> (would love to see this fixed for 3.5-rcX but if not 3.6 it is?)
That would be good...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 21/06/2012 19:47, Mike Snitzer wrote:
> Paolo Bonzini fixed blkdev_issue_discard to properly align some time
> ago; unfortunately the patches slipped through the cracks (cc'ing Paolo,
> Jens, and Christoph).
>
> Here are references to Paolo's patches:
> 0/2 https://lkml.org/lkml/2012/3/14/323
> 1/2 https://lkml.org/lkml/2012/3/14/324
> 2/2 https://lkml.org/lkml/2012/3/14/325
>
> Patch 2/2 specifically addresses the case where:
> discard_max_bytes == discard_granularity
>
> Paolo, any chance you could resend to Jens (maybe with hch's comments on
> patch#2 accounted for)? Also, please add hch's Reviewed-by when
> reposting.
Sure, I'll do it this week. I just need to retest.
Paolo
On Sun, Jul 01 2012 at 10:53am -0400,
Paolo Bonzini <[email protected]> wrote:
> On 21/06/2012 19:47, Mike Snitzer wrote:
> > Paolo Bonzini fixed blkdev_issue_discard to properly align some time
> > ago; unfortunately the patches slipped through the cracks (cc'ing Paolo,
> > Jens, and Christoph).
> >
> > Here are references to Paolo's patches:
> > 0/2 https://lkml.org/lkml/2012/3/14/323
> > 1/2 https://lkml.org/lkml/2012/3/14/324
> > 2/2 https://lkml.org/lkml/2012/3/14/325
> >
> > Patch 2/2 specifically addresses the case where:
> > discard_max_bytes == discard_granularity
> >
> > Paolo, any chance you could resend to Jens (maybe with hch's comments on
> > patch#2 accounted for)? Also, please add hch's Reviewed-by when
> > reposting.
>
> Sure, I'll do it this week. I just need to retest.
Great, thanks.
(cc'ing mkp)
One thing that seemed odd was your adjustment for discard_alignment (in
patch 1/2).
I need to better understand how discard_alignment (an offset despite the
name not saying as much) relates to alignment_offset.
Could it just be that once a partition tool, or lvm, etc. accounts for
alignment_offset (which they do now), discard_alignment is
automagically accounted for as a side-effect?
(I haven't actually seen discard_alignment != 0 in the wild)
Mike
On 02/07/2012 15:00, Mike Snitzer wrote:
> On Sun, Jul 01 2012 at 10:53am -0400,
> Paolo Bonzini <[email protected]> wrote:
>
>> On 21/06/2012 19:47, Mike Snitzer wrote:
>>> Paolo Bonzini fixed blkdev_issue_discard to properly align some time
>>> ago; unfortunately the patches slipped through the cracks (cc'ing Paolo,
>>> Jens, and Christoph).
>>>
>>> Here are references to Paolo's patches:
>>> 0/2 https://lkml.org/lkml/2012/3/14/323
>>> 1/2 https://lkml.org/lkml/2012/3/14/324
>>> 2/2 https://lkml.org/lkml/2012/3/14/325
>>>
>>> Patch 2/2 specifically addresses the case where:
>>> discard_max_bytes == discard_granularity
>>>
>>> Paolo, any chance you could resend to Jens (maybe with hch's comments on
>>> patch#2 accounted for)? Also, please add hch's Reviewed-by when
>>> reposting.
>>
>> Sure, I'll do it this week. I just need to retest.
>
> Great, thanks.
>
> (cc'ing mkp)
>
> One thing that seemed odd was your adjustment for discard_alignment (in
> patch 1/2).
>
> I need to better understand how discard_alignment (an offset despite the
> name not saying as much) relates to alignment_offset.
In principle, it doesn't. All SBC says is:
The UNMAP GRANULARITY ALIGNMENT field indicates the LBA of the first
logical block to which the OPTIMAL UNMAP GRANULARITY field applies.
The unmap granularity alignment is used to calculate an optimal unmap
request starting LBA as follows:
optimal unmap request starting LBA = (n * optimal unmap granularity)
+ unmap granularity alignment
and what my patch does is ensure that all requests except the first
start at such an LBA.
In practice, there is a connection between the two, because a sane disk
will make all discard_alignment-aligned sectors also
alignment_offset-aligned, or vice versa, or both (depending on whether
1<<phys_exp is less than, greater than, or equal to discard_granularity).
> Could just be that once a partition tool, or lvm, etc account for
> alignment_offset (which they do now) that discard_alignment is
> automagically accounted for as a side-effect?
Yes, if discard_granularity <= 1<<phys_exp. In that case, the condition
above simplifies to discard_alignment == alignment_offset %
discard_granularity. Your partitions will already be aligned to both
alignment_offset and discard_alignment.
It seems more likely that discard_granularity > 1<<phys_exp if they
differ at all, in which case the partition tool will improve the
situation but still not reach an optimal setting.
The optimal positioning of partitions/logical volumes/etc. would be to
align them to lcm(1<<phys_exp, discard_granularity), and "misalign" the
starting sector by max(discard_alignment, alignment_offset).
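A small worked example of that placement rule (the 8KiB physical block
and the nonzero discard_alignment below are invented just to exercise
the formula, not taken from any real device):

#include <stdio.h>
#include <stdint.h>

static uint64_t gcd(uint64_t a, uint64_t b)
{
        while (b) {
                uint64_t t = a % b;
                a = b;
                b = t;
        }
        return a;
}

static uint64_t lcm(uint64_t a, uint64_t b)
{
        return a / gcd(a, b) * b;
}

int main(void)
{
        uint64_t phys_block          = 8192;        /* 1 << phys_exp, made up */
        uint64_t discard_granularity = 1048576;     /* 1MiB, as in this report */
        uint64_t alignment_offset    = 0;
        uint64_t discard_alignment   = 7680;        /* made up, nonzero */

        uint64_t step   = lcm(phys_block, discard_granularity);
        uint64_t offset = discard_alignment > alignment_offset ?
                          discard_alignment : alignment_offset;

        /* first suitable partition start at or after 1GiB, in bytes */
        uint64_t want  = 1ULL << 30;
        uint64_t start = ((want - offset + step - 1) / step) * step + offset;

        printf("place partition at byte %llu\n", (unsigned long long)start);
        return 0;
}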
> (I haven't actually seen discard_alignment != 0 in the wild)
Me neither, but it was easy to account for it in the patch.
Paolo