LinuxLists.cc - "data=writeback" and TRIM don't get along

2010-04-08 00:50:39

Subject: "data=writeback" and TRIM don't get along

Hello.

TRIM command issued to SSD doesn't work with this mount options:

============================
rootfs / rootfs rw 0 0
/dev/root / ext4
rw,noatime,commit=100,barrier=0,nobh,stripe=128,data=writeback,discard 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
rc-svcdir /lib64/rc/init.d tmpfs
rw,nosuid,nodev,noexec,relatime,size=1024k,mode=755 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
udev /dev tmpfs rw,nosuid,relatime,size=10240k,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime 0 0
============================

Removing "data=writeback" option from /etc/fstab and from rootflags in
kernel boot options results in TRIM working just like it should.

So, the question is:

Is this "data=writeback" and TRIM incompatibility a bug or a feature? :)

Thank you.
Nebojsa Trpkovic

2010-04-08 01:22:45

by Eric Sandeen

[permalink] [raw]

Subject: Re: "data=writeback" and TRIM don't get along

Nebojsa Trpkovic wrote:
> Hello.
>
> TRIM command issued to SSD doesn't work with this mount options:
>
> ============================
> rootfs / rootfs rw 0 0
> /dev/root / ext4
> rw,noatime,commit=100,barrier=0,nobh,stripe=128,data=writeback,discard 0 0
> proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
> rc-svcdir /lib64/rc/init.d tmpfs
> rw,nosuid,nodev,noexec,relatime,size=1024k,mode=755 0 0
> sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
> udev /dev tmpfs rw,nosuid,relatime,size=10240k,mode=755 0 0
> devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620 0 0
> shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime 0 0
> ============================

How does it fail?

> Removing "data=writeback" option from /etc/fstab and from rootflags in
> kernel boot options results in TRIM working just like it should.
>
> So, the question is:
>
> Is this "data=writeback" and TRIM incompatibility a bug or a feature? :)

Surely a bug. :) If you can provide details we'll look into it.
(perhaps it's obvious on first try but still worth saying exactly
what problematic behavior you saw, when reporting a bug you
encountered)

Thanks,
-Eric

2010-04-08 01:37:45

by Nebojsa Trpkovic

[permalink] [raw]

Subject: Re: "data=writeback" and TRIM don't get along

On 04/08/10 03:22, Eric Sandeen wrote:
> How does it fail?
>
> Surely a bug. :) If you can provide details we'll look into it.
> (perhaps it's obvious on first try but still worth saying exactly
> what problematic behavior you saw, when reporting a bug you
> encountered)

Well, I've done a simple test, described like:

"get the used sectors for a file
hdparm --fibmap filename
read a sector from the file eg. with
sudo hdparm --read-sector 66385920 /dev/sda
delete the file and sync
rm filename;sync
and read the sector a second time"

And I get something like this:

================================
# dd if=/dev/urandom of=tempfile count=100 bs=512k oflag=direct
100+0 records in
100+0 records out
52428800 bytes (52 MB) copied, 6.47137 s, 8.1 MB/s

# hdparm --fibmap tempfile

tempfile:
filesystem blocksize 4096, begins at LBA 0; assuming 512 byte sectors.
byte_offset begin_LBA end_LBA sectors
0 37094400 37196799 102400

# hdparm --read-sector 37094400 /dev/sdb

/dev/sdb:
reading sector 37094400: succeeded
b0e8 3ad7 d080 84e8 b4b2 7e60 21f1 eff3
0ef9 fa10 b172 89f8 186f 0194 4cb1 e190
d6b5 b2fe 2577 5dba e6f2 5ad7 34a0 f09f
ca5c 07ef 6e86 c3a8 9e77 77f3 78ff 672f
af71 dea7 ac23 a55d e31e ff83 164e bb76
8ea4 416d 343a 9f5e b41f b1d0 b6e9 6ed8
90c0 3cba ec07 1d96 fdd6 3940 1290 7cd2
c506 c3ee c120 3732 17eb 6e68 11aa 721c...

# rm tempfile
# sync
================================

Now, if I have "data=writeback" enable and I run
# hdparm --read-sector 37094400 /dev/sdb
again, I'll get the same random bytes.
Reboot doesn't help.

If I disable "data=writeback" and run
# hdparm --read-sector 37094400 /dev/sdb
I'll get all zeroes:
================================
# hdparm --read-sector 37094400 /dev/sdb

/dev/sdb:
reading sector 37094400: succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
...
================================

So, TRIM command works well, but not with "data=writeback" mount option.

Other info:

Intel X25-V 40GB (latest firmware)

Linux box 2.6.33-gentoo #4 SMP Sat Apr 3 05:02:04 CEST 2010 x86_64
Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz GenuineIntel GNU/Linux

Nebojsa

2010-04-08 04:10:45

by Eric Sandeen

[permalink] [raw]

Subject: Re: "data=writeback" and TRIM don't get along

Nebojsa Trpkovic wrote:
> On 04/08/10 03:22, Eric Sandeen wrote:
>> How does it fail?
>>
>> Surely a bug. :) If you can provide details we'll look into it.
>> (perhaps it's obvious on first try but still worth saying exactly
>> what problematic behavior you saw, when reporting a bug you
>> encountered)
>
> Well, I've done a simple test, described like:
>
> "get the used sectors for a file
> hdparm --fibmap filename
> read a sector from the file eg. with
> sudo hdparm --read-sector 66385920 /dev/sda
> delete the file and sync
> rm filename;sync
> and read the sector a second time"

Ok, thanks, perfect test & explanation.

Well the good news is, at least it's nothing like discarding
the wrong block. :)

Long explanation:

in ext4_free_blocks():

/*
* We need to make sure we don't reuse the freed block until
* after the transaction is committed, which we can do by
* treating the block as metadata, below. We make an
* exception if the inode is to be written in writeback mode
* since writeback mode has weak data consistency guarantees.
*/
if (!ext4_should_writeback_data(inode))
flags |= EXT4_FREE_BLOCKS_METADATA;

so here we don't set this flag for writeback mode.

Later in that function we do:

if ((flags & EXT4_FREE_BLOCKS_METADATA) && ext4_handle_valid(handle)) {
...
ext4_mb_free_metadata(handle, &e4b, new_entry);

ext4_mb_free_metadata adds to t_private_list:

/* Add the extent to transaction's private list */
spin_lock(&sbi->s_md_lock);
list_add(&new_entry->list, &handle->h_transaction->t_private_list);
spin_unlock(&sbi->s_md_lock);

Once we finally get to release_blocks_on_commit, only things on t_private_list
ultimately get discarded.

Anyway, at the root of this is right now the discard really only happens for
things flagged as metadata, which is a pretty funky thing to do.

(and indeed if you have a unique xattr block, you'll see it get discarded
even with data=writeback).

I'll have to think about the right way to do this... it seems pretty
convoluted to me right now.

-Eric

2010-04-08 04:37:41

by Eric Sandeen

[permalink] [raw]

Subject: Re: "data=writeback" and TRIM don't get along

Eric Sandeen wrote:
> I'll have to think about the right way to do this... it seems pretty
> convoluted to me right now.
>
Something like this probably works, but I really REALLY would not test
it on an important filesystem. :)

I'm not sure it's a good idea to discard it before returning it
to the prealloc pool, because it may well get re-used again
quickly.... not sure if that's helpful.

Just a note, I think eventually we may move to more of a batch discard
in the background, because these little discards are actually quite
inefficient on the hardware we've tested so far.

-Eric

p.s. really. Don't test this with important data. I haven't tested
it at all yet.

Index: linux-2.6/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.orig/fs/ext4/mballoc.c
+++ linux-2.6/fs/ext4/mballoc.c
@@ -4602,6 +4606,8 @@ do_more:
mb_clear_bits(bitmap_bh->b_data, bit, count);
ext4_mb_free_metadata(handle, &e4b, new_entry);
} else {
+ ext4_fsblk_t discard_block;
+
/* need to update group_info->bb_free and bitmap
* with group lock held. generate_buddy look at
* them with group lock_held
@@ -4609,6 +4615,11 @@ do_more:
ext4_lock_group(sb, block_group);
mb_clear_bits(bitmap_bh->b_data, bit, count);
mb_free_blocks(inode, &e4b, bit, count);
+ discard_block = bit +
+ ext4_group_first_block_no(sb, block_group);
+ trace_ext4_discard_blocks(sb,
+ (unsigned long long)discard_block, count);
+ sb_issue_discard(sb, discard_block, count);
ext4_mb_return_to_preallocation(inode, &e4b, block, count);
}

2010-04-08 04:47:39

by Eric Sandeen

[permalink] [raw]

Subject: Re: "data=writeback" and TRIM don't get along

Eric Sandeen wrote:
> Eric Sandeen wrote:
>> I'll have to think about the right way to do this... it seems pretty
>> convoluted to me right now.
>>
> Something like this probably works, but I really REALLY would not test
> it on an important filesystem. :)
>
> I'm not sure it's a good idea to discard it before returning it
> to the prealloc pool, because it may well get re-used again
> quickly.... not sure if that's helpful.

(now I'm really talking to myself, but scratch that bit -
ext4_mb_return_to_preallocation is pretty much a no-op)

-Eric

> Just a note, I think eventually we may move to more of a batch discard
> in the background, because these little discards are actually quite
> inefficient on the hardware we've tested so far.
>
> -Eric
>
> p.s. really. Don't test this with important data. I haven't tested
> it at all yet.
>
> Index: linux-2.6/fs/ext4/mballoc.c
> ===================================================================
> --- linux-2.6.orig/fs/ext4/mballoc.c
> +++ linux-2.6/fs/ext4/mballoc.c
> @@ -4602,6 +4606,8 @@ do_more:
> mb_clear_bits(bitmap_bh->b_data, bit, count);
> ext4_mb_free_metadata(handle, &e4b, new_entry);
> } else {
> + ext4_fsblk_t discard_block;
> +
> /* need to update group_info->bb_free and bitmap
> * with group lock held. generate_buddy look at
> * them with group lock_held
> @@ -4609,6 +4615,11 @@ do_more:
> ext4_lock_group(sb, block_group);
> mb_clear_bits(bitmap_bh->b_data, bit, count);
> mb_free_blocks(inode, &e4b, bit, count);
> + discard_block = bit +
> + ext4_group_first_block_no(sb, block_group);
> + trace_ext4_discard_blocks(sb,
> + (unsigned long long)discard_block, count);
> + sb_issue_discard(sb, discard_block, count);
> ext4_mb_return_to_preallocation(inode, &e4b, block, count);
> }
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2010-04-08 07:17:57

by Dmitry Monakhov

[permalink] [raw]

Subject: Re: "data=writeback" and TRIM don't get along

Nebojsa Trpkovic <[email protected]> writes:

> On 04/08/10 03:22, Eric Sandeen wrote:
>> How does it fail?
>>
>> Surely a bug. :) If you can provide details we'll look into it.
>> (perhaps it's obvious on first try but still worth saying exactly
>> what problematic behavior you saw, when reporting a bug you
>> encountered)
>
> Well, I've done a simple test, described like:
>
> "get the used sectors for a file
> hdparm --fibmap filename
> read a sector from the file eg. with
> sudo hdparm --read-sector 66385920 /dev/sda
> delete the file and sync
> rm filename;sync
> and read the sector a second time"
>
>
> And I get something like this:
>
> ================================
> # dd if=/dev/urandom of=tempfile count=100 bs=512k oflag=direct
> 100+0 records in
> 100+0 records out
> 52428800 bytes (52 MB) copied, 6.47137 s, 8.1 MB/s
>
> # hdparm --fibmap tempfile
>
> tempfile:
> filesystem blocksize 4096, begins at LBA 0; assuming 512 byte sectors.
> byte_offset begin_LBA end_LBA sectors
> 0 37094400 37196799 102400
>
> # hdparm --read-sector 37094400 /dev/sdb
>
> /dev/sdb:
> reading sector 37094400: succeeded
> b0e8 3ad7 d080 84e8 b4b2 7e60 21f1 eff3
> 0ef9 fa10 b172 89f8 186f 0194 4cb1 e190
> d6b5 b2fe 2577 5dba e6f2 5ad7 34a0 f09f
> ca5c 07ef 6e86 c3a8 9e77 77f3 78ff 672f
> af71 dea7 ac23 a55d e31e ff83 164e bb76
> 8ea4 416d 343a 9f5e b41f b1d0 b6e9 6ed8
> 90c0 3cba ec07 1d96 fdd6 3940 1290 7cd2
> c506 c3ee c120 3732 17eb 6e68 11aa 721c...
>
> # rm tempfile
> # sync
> ================================
>
> Now, if I have "data=writeback" enable and I run
> # hdparm --read-sector 37094400 /dev/sdb
> again, I'll get the same random bytes.
> Reboot doesn't help.
>
> If I disable "data=writeback" and run
> # hdparm --read-sector 37094400 /dev/sdb
> I'll get all zeroes:
> ================================
> # hdparm --read-sector 37094400 /dev/sdb
>
> /dev/sdb:
> reading sector 37094400: succeeded
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> ...
> ================================
>
> So, TRIM command works well, but not with "data=writeback" mount option.
>
> Other info:
>
> Intel X25-V 40GB (latest firmware)
can you please provide an actual version of firmware.
As soon as i know X25 zeroing was disabled.
Can you please post an output of your queue flags
cat /sys/block/sdXXX/queue/discard_zeroes_data
>
> Linux box 2.6.33-gentoo #4 SMP Sat Apr 3 05:02:04 CEST 2010 x86_64
> Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz GenuineIntel GNU/Linux
>
>
> Nebojsa
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2010-04-08 11:47:51

by Nebojsa Trpkovic

[permalink] [raw]

Subject: Re: "data=writeback" and TRIM don't get along

On 04/08/10 09:17, Dmitry Monakhov wrote:
> Nebojsa Trpkovic <[email protected]> writes:
>> Intel X25-V 40GB (latest firmware)
> can you please provide an actual version of firmware.
> As soon as i know X25 zeroing was disabled.
> Can you please post an output of your queue flags
> cat /sys/block/sdXXX/queue/discard_zeroes_data

======================
cat /sys/block/sda/queue/discard_zeroes_data
1
cat /sys/block/sda/queue/discard_granularity
512
======================

Short history (you're probably talking about Intel's firmware problems):

- Intel releases X25-M (80 and 160GB) G1. G1 doesn't support TRIM.
- Intel releases X25-M (80 and 160GB) G2. G2's initial firmware doesn't
support TRIM.
- Intel makes TRIM-enabled firmware for G2 and publishes it.
- Customers hammer some of their G2 SSDs by flashing new firmware.
- Intel withdraws new firmware.
- Intel fixes firmware and releases new one (this time without nasty
behaviour). Read 2 updates on the start of first page:
http://www.anandtech.com/show/2865
- Intel sells it's technology to Kingston and Kingston starts to produce
40GB version of G2 named Kingston BootDrive SNV125-S2/40GB that doesn't
support TRIM. (Kingston produced some full-grown G1 and G2, but that is
not a point here. SNM125 and SNM225).
http://www.anandtech.com/show/2865/4
- Intel starts selling it's own 40GB G2 named X25-V with TRIM enabled
out of the box.
http://www.anandtech.com/show/2968/intel-s-x25-v-kingston-s-30gb-ssdnow-v-series-battle-of-the-125-ssds
- (guess) Intel needs too much royalties from Kingston for TRIM enabled
firmware, so Kingston SNV125-S2/40GB doesn't get (official) TRIM enabled
firmware and Kingston stops selling Intel-based SSDs.

Anyways, all X25-V are shipped with TRIM enabled firmware.
My current firmware is 2CV102HD.
hdparm -I /dev/sda gives this, too:
Enabled Supported:
* Data Set Management TRIM supported
* Deterministic read ZEROs after TRIM

Nebojsa

2010-04-08 11:48:56

by Nebojsa Trpkovic

[permalink] [raw]

Subject: Re: "data=writeback" and TRIM don't get along

On 04/08/10 06:47, Eric Sandeen wrote:
> Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>> I'll have to think about the right way to do this... it seems pretty
>>> convoluted to me right now.
>>>
>> Something like this probably works, but I really REALLY would not test
>> it on an important filesystem. :)
>>
>> I'm not sure it's a good idea to discard it before returning it
>> to the prealloc pool, because it may well get re-used again
>> quickly.... not sure if that's helpful.
>
> (now I'm really talking to myself, but scratch that bit -
> ext4_mb_return_to_preallocation is pretty much a no-op)
>
> -Eric
>
>> Just a note, I think eventually we may move to more of a batch discard
>> in the background, because these little discards are actually quite
>> inefficient on the hardware we've tested so far.
>>
>> -Eric
>>
>> p.s. really. Don't test this with important data. I haven't tested
>> it at all yet.
>>
>> Index: linux-2.6/fs/ext4/mballoc.c
>> ===================================================================
>> --- linux-2.6.orig/fs/ext4/mballoc.c
>> +++ linux-2.6/fs/ext4/mballoc.c
>> @@ -4602,6 +4606,8 @@ do_more:
>> mb_clear_bits(bitmap_bh->b_data, bit, count);
>> ext4_mb_free_metadata(handle, &e4b, new_entry);
>> } else {
>> + ext4_fsblk_t discard_block;
>> +
>> /* need to update group_info->bb_free and bitmap
>> * with group lock held. generate_buddy look at
>> * them with group lock_held
>> @@ -4609,6 +4615,11 @@ do_more:
>> ext4_lock_group(sb, block_group);
>> mb_clear_bits(bitmap_bh->b_data, bit, count);
>> mb_free_blocks(inode, &e4b, bit, count);
>> + discard_block = bit +
>> + ext4_group_first_block_no(sb, block_group);
>> + trace_ext4_discard_blocks(sb,
>> + (unsigned long long)discard_block, count);
>> + sb_issue_discard(sb, discard_block, count);
>> ext4_mb_return_to_preallocation(inode, &e4b, block, count);
>> }

Well, to be honest, I'm not some programmer guy, so I doubt my skills
can be of any help here.

Second, unfortunately, my SSD is now my root partition (just one big
sda1), so I cannot experiment with it too much.

And finally,
>> I'm not sure it's a good idea to discard it before returning it
>> to the prealloc pool, because it may well get re-used again
>> quickly.... not sure if that's helpful.

I'm not sure I understood you well about this prealloc pool - re-using
mechanism, but...
AFAIK, modern SSDs are using very aggressive wear-leveling algorithms.
Writing two times into the same filesystem sector almost newer goes to
the same hardware sector. Therefore, saving sectors from discarding and
for re-using makes no much sense - once re-used it will be written to
some other physical NAND memory cell anyways. Guess we can discard it as
soon as we have no valid data in it.

Nebojsa

2010-04-08 15:32:09

by Eric Sandeen

[permalink] [raw]

Subject: Re: "data=writeback" and TRIM don't get along

Nebojsa Trpkovic wrote:

> Well, to be honest, I'm not some programmer guy, so I doubt my skills
> can be of any help here.
>
> Second, unfortunately, my SSD is now my root partition (just one big
> sda1), so I cannot experiment with it too much.

Well, you might just keep in mind that:

1) trimming these small amounts has actually looked very inefficient, and
2) data=writeback really isn't very safe in the face of a crash or power loss, and
3) hopefully we'll have a better trim solution eventually.

-Eric

2010-04-08 16:21:53

by Nebojsa Trpkovic

[permalink] [raw]

Subject: Re: "data=writeback" and TRIM don't get along

On 04/08/10 17:32, Eric Sandeen wrote:
> Well, you might just keep in mind that:
>
> 1) trimming these small amounts has actually looked very inefficient, and
> 2) data=writeback really isn't very safe in the face of a crash or power loss, and
> 3) hopefully we'll have a better trim solution eventually.

1) I understand that big TRIMs are better then small ones, but skipping
some TRIMs completely would lead to slow but sure drive degradation as
drive would have less and less spare space for wear leveling.

2) Yes, I'm aware of possible data=writeback inconsistency, but I've
tried to let IO scheduler to merge and reorganize as many writes as it
can, all to avoid small writes to SSD which are main cause of write
amplification.

3) I'll stick with no data=writeback for the time being. I guess I'm
doing just fine even without it. :)

One more noob quotestion completely out-of-topic:
Will md layer pass TRIM command to drive if one has ext4 on linux
software RAID 0/1/5 ?

Nebojsa

2010-04-08 16:54:45

by Eric Sandeen

[permalink] [raw]

Subject: Re: "data=writeback" and TRIM don't get along

Nebojsa Trpkovic wrote:
> On 04/08/10 17:32, Eric Sandeen wrote:
>> Well, you might just keep in mind that:
>>
>> 1) trimming these small amounts has actually looked very inefficient, and
>> 2) data=writeback really isn't very safe in the face of a crash or power loss, and
>> 3) hopefully we'll have a better trim solution eventually.
>
> 1) I understand that big TRIMs are better then small ones, but skipping
> some TRIMs completely would lead to slow but sure drive degradation as
> drive would have less and less spare space for wear leveling.
>
> 2) Yes, I'm aware of possible data=writeback inconsistency, but I've
> tried to let IO scheduler to merge and reorganize as many writes as it
> can, all to avoid small writes to SSD which are main cause of write
> amplification.
>
> 3) I'll stick with no data=writeback for the time being. I guess I'm
> doing just fine even without it. :)
>
>
> One more noob quotestion completely out-of-topic:
> Will md layer pass TRIM command to drive if one has ext4 on linux
> software RAID 0/1/5 ?

I think the answer is "not yet but it's being worked on"

-Eric

> Nebojsa