LinuxLists.cc - Ext4: batched discard support

2010-07-07 07:54:01

Subject: Ext4: batched discard support - simplified version

Hi all,

since my last post I have done some more testing with various SSDs, and the
trend is clear: trim performance is getting better, and the performance loss
without trim is getting smaller. So I have decided to abandon the initial idea
of tracking free blocks in an internal data structure, since it costs time
and memory.

Today there are some SSDs whose performance does not seem to degrade over
time (with writes). I have filled those devices up to 200% and still have not
seen any performance loss. On the other hand, there are still some devices
which show about 300% performance degradation, so I suppose TRIM will still
be needed for some time.

You can try it out with the simple program attached below. Just create an ext4
fs, mount it with -o discard, and invoke the attached program on the ext4
mount point.

From my experience, the time needed to trim the whole file system is strongly
device dependent. It may take a few seconds on one device and up to a minute
on another, and under heavy I/O load the time to trim the whole fs gets longer.

There are two patches:

[PATCH 1/2] Add ioctl FITRIM.
fs/ioctl.c | 31 +++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
2 files changed, 33 insertions(+), 0 deletions(-)

[PATCH 2/2] Add batched discard support for ext4
fs/ext4/ext4.h | 2 +
fs/ext4/mballoc.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++-----
fs/ext4/super.c | 1 +
3 files changed, 118 insertions(+), 11 deletions(-)

Signed-off-by: "Lukas Czerner" <[email protected]>

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Must match the definition added to include/linux/fs.h by patch 1/2. */
#define FITRIM _IOWR('X', 121, int)

int main(int argc, char **argv)
{
	/* Minimum length of a free extent to trim, in bytes. */
	int minsize = 4096;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s mountpoint\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(fd, FITRIM, &minsize)) {
		if (errno == EOPNOTSUPP)
			fprintf(stderr, "TRIM not supported\n");
		else
			perror("FITRIM");
		return 1;
	}

	return 0;
}
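
Since the trim time is so device dependent, it can be handy to time the call.
The following is only a minimal sketch, assuming a POSIX monotonic clock is
available; the helper names elapsed_sec() and time_call() are made up for
illustration and are not part of the patch set. The callback passed to
time_call() would wrap the ioctl(fd, FITRIM, &minsize) call from the program
above.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Seconds between two CLOCK_MONOTONIC samples. */
double elapsed_sec(const struct timespec *start, const struct timespec *end)
{
	return (double)(end->tv_sec - start->tv_sec) +
	       (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}

/*
 * Run fn(arg) once and store the time it took in *secs.
 * fn is a hypothetical callback supplied by the caller.
 * Returns whatever fn returned.
 */
int time_call(int (*fn)(void *), void *arg, double *secs)
{
	struct timespec t0, t1;
	int ret;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	ret = fn(arg);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	*secs = elapsed_sec(&t0, &t1);
	return ret;
}
```

A monotonic clock is used here so that wall-clock adjustments during a long
trim do not distort the result.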

2010-07-07 07:54:03

by Lukas Czerner

Subject: [PATCH 1/2] Add ioctl FITRIM.

Adds a filesystem-independent ioctl to allow implementation of file
system batched discard support.

Signed-off-by: Lukas Czerner <[email protected]>
---
fs/ioctl.c | 31 +++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 7faefb4..09b33ae 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -551,6 +551,33 @@ static int ioctl_fsthaw(struct file *filp)
return thaw_bdev(sb->s_bdev, sb);
}

+static int ioctl_fstrim(struct file *filp, unsigned long arg)
+{
+ struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;
+ unsigned int minlen;
+ int err;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* If filesystem doesn't support trim feature, return. */
+ if (sb->s_op->trim_fs == NULL)
+ return -EOPNOTSUPP;
+
+ /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */
+ if (sb->s_bdev == NULL)
+ return -EINVAL;
+
+ err = get_user(minlen, (unsigned int __user *) arg);
+ if (err)
+ return err;
+
+ err = sb->s_op->trim_fs(minlen, sb);
+ if (err)
+ return err;
+ return 0;
+}
+
/*
* When you add any new common ioctls to the switches above and below
* please update compat_sys_ioctl() too.
@@ -601,6 +628,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
error = ioctl_fsthaw(filp);
break;

+ case FITRIM:
+ error = ioctl_fstrim(filp, arg);
+ break;
+
case FS_IOC_FIEMAP:
return ioctl_fiemap(filp, arg);

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 44f35ae..7a27fa4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -315,6 +315,7 @@ struct inodes_stat_t {
#define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */
#define FIFREEZE _IOWR('X', 119, int) /* Freeze */
#define FITHAW _IOWR('X', 120, int) /* Thaw */
+#define FITRIM _IOWR('X', 121, int) /* Trim */

#define FS_IOC_GETFLAGS _IOR('f', 1, long)
#define FS_IOC_SETFLAGS _IOW('f', 2, long)
@@ -1580,6 +1581,7 @@ struct super_operations {
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+ int (*trim_fs) (unsigned int, struct super_block *);
};

/*
--
1.6.6.1
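
For readers unfamiliar with the _IOWR() encoding used for FITRIM in the patch
above, here is a small user-space sketch that decodes the request number. It
assumes a Linux system where <linux/ioctl.h> provides the _IOC_*() helper
macros; FITRIM_SKETCH and describe_fitrim() are illustrative names, chosen so
they cannot clash with anything a system header may already define.

```c
#include <linux/ioctl.h>
#include <stdio.h>

/* Same encoding as the FITRIM definition in this patch (name is made up). */
#define FITRIM_SKETCH _IOWR('X', 121, int)

/* Print the fields packed into the request number; returns the NR field. */
int describe_fitrim(void)
{
	unsigned int cmd = FITRIM_SKETCH;

	printf("type='%c' nr=%u size=%u read=%d write=%d\n",
	       (char)_IOC_TYPE(cmd),
	       (unsigned int)_IOC_NR(cmd),
	       (unsigned int)_IOC_SIZE(cmd),
	       !!(_IOC_DIR(cmd) & _IOC_READ),
	       !!(_IOC_DIR(cmd) & _IOC_WRITE));
	return (int)_IOC_NR(cmd);
}
```

The size field is derived from the argument type, so changing the argument
from an int to some other type would also change the request number.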

2010-07-07 07:54:04

by Lukas Czerner

Subject: [PATCH 2/2] Add batched discard support for ext4

Walk through each allocation group and trim all free extents. It can be
invoked through TRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).

It searches for free extents in each allocation group. When a free
extent is found, its blocks are marked as used and then trimmed. Afterwards
these blocks are marked as free in the per-group bitmap.

Signed-off-by: Lukas Czerner <[email protected]>
---
fs/ext4/ext4.h | 2 +
fs/ext4/mballoc.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++-----
fs/ext4/super.c | 1 +
3 files changed, 118 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf938cf..ba0fff0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1437,6 +1437,8 @@ extern int ext4_mb_add_groupinfo(struct super_block *sb,
extern int ext4_mb_get_buddy_cache_lock(struct super_block *, ext4_group_t);
extern void ext4_mb_put_buddy_cache_lock(struct super_block *,
ext4_group_t, int);
+extern int ext4_trim_fs(unsigned int, struct super_block *);
+
/* inode.c */
struct buffer_head *ext4_getblk(handle_t *, struct inode *,
ext4_lblk_t, int, int *);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b423a36..c7b541c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2535,17 +2535,6 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
mb_debug(1, "gonna free %u blocks in group %u (0x%p):",
entry->count, entry->group, entry);

- if (test_opt(sb, DISCARD)) {
- ext4_fsblk_t discard_block;
-
- discard_block = entry->start_blk +
- ext4_group_first_block_no(sb, entry->group);
- trace_ext4_discard_blocks(sb,
- (unsigned long long)discard_block,
- entry->count);
- sb_issue_discard(sb, discard_block, entry->count);
- }
-
err = ext4_mb_load_buddy(sb, entry->group, &e4b);
/* we expect to find existing buddy because it's pinned */
BUG_ON(err != 0);
@@ -4640,3 +4629,118 @@ error_return:
kmem_cache_free(ext4_ac_cachep, ac);
return;
}
+
+/**
+ * Trim "count" blocks starting at "start" in "group"
+ * This must be called under group lock
+ */
+static void ext4_trim_extent(struct super_block *sb, int start, int count,
+ ext4_group_t group, struct ext4_buddy *e4b)
+{
+ ext4_fsblk_t discard_block;
+ struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+ struct ext4_free_extent ex;
+
+ assert_spin_locked(ext4_group_lock_ptr(sb, group));
+
+ ex.fe_start = start;
+ ex.fe_group = group;
+ ex.fe_len = count;
+
+ /**
+ * Mark blocks used, so no one can reuse them while
+ * being trimmed.
+ */
+ mb_mark_used(e4b, &ex);
+ ext4_unlock_group(sb, group);
+
+ discard_block = (ext4_fsblk_t)group *
+ EXT4_BLOCKS_PER_GROUP(sb)
+ + start
+ + le32_to_cpu(es->s_first_data_block);
+ trace_ext4_discard_blocks(sb,
+ (unsigned long long)discard_block,
+ count);
+ sb_issue_discard(sb, discard_block, count);
+
+ ext4_lock_group(sb, group);
+ mb_free_blocks(NULL, e4b, start, ex.fe_len);
+}
+
+/**
+ * Trim all free extents in group at least minblocks long
+ */
+ext4_grpblk_t ext4_trim_all_free(struct super_block *sb, struct ext4_buddy *e4b,
+ ext4_grpblk_t minblocks)
+{
+ void *bitmap;
+ ext4_grpblk_t max = EXT4_BLOCKS_PER_GROUP(sb);
+ ext4_grpblk_t start, next, count = 0;
+ ext4_group_t group;
+
+ BUG_ON(e4b == NULL);
+
+ bitmap = e4b->bd_bitmap;
+ group = e4b->bd_group;
+ start = e4b->bd_info->bb_first_free;
+ ext4_lock_group(sb, group);
+
+ while (start < max) {
+
+ start = mb_find_next_zero_bit(bitmap, max, start);
+ if (start >= max)
+ break;
+ next = mb_find_next_bit(bitmap, max, start);
+
+ if ((next - start) >= minblocks) {
+ count += next - start;
+ ext4_trim_extent(sb, start,
+ next - start, group, e4b);
+ }
+ start = next + 1;
+
+ if ((e4b->bd_info->bb_free - count) < minblocks)
+ break;
+ }
+
+ ext4_unlock_group(sb, group);
+
+ ext4_debug("trimmed %d blocks in the group %d\n",
+ count, group);
+
+ return count;
+}
+
+int ext4_trim_fs(unsigned int minlen, struct super_block *sb)
+{
+ struct ext4_buddy e4b;
+ ext4_group_t group;
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
+ ext4_grpblk_t minblocks;
+
+ if (!test_opt(sb, DISCARD))
+ return 0;
+
+ minblocks = DIV_ROUND_UP(minlen, sb->s_blocksize);
+ if (unlikely(minblocks > EXT4_BLOCKS_PER_GROUP(sb)))
+ return -EINVAL;
+
+ for (group = 0; group < ngroups; group++) {
+ int err;
+
+ err = ext4_mb_load_buddy(sb, group, &e4b);
+ if (err) {
+ ext4_error(sb, "Error in loading buddy "
+ "information for %u", group);
+ continue;
+ }
+
+ if (e4b.bd_info->bb_free >= minblocks) {
+ ext4_trim_all_free(sb, &e4b, minblocks);
+ }
+
+ ext4_mb_release_desc(&e4b);
+ }
+
+ return 0;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e14d22c..253eb98 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1109,6 +1109,7 @@ static const struct super_operations ext4_sops = {
.quota_write = ext4_quota_write,
#endif
.bdev_try_to_free_page = bdev_try_to_free_page,
+ .trim_fs = ext4_trim_fs
};

static const struct super_operations ext4_nojournal_sops = {
--
1.6.6.1
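
The free-extent walk done by ext4_trim_all_free() and the block arithmetic in
ext4_trim_extent() can be sketched in user space as follows. This is only an
illustration under simplifying assumptions: the bitmap is modelled as one byte
per block (1 = in use), there is no locking and no mark-used/mark-free step,
and all names (trim_free_runs(), group_to_fs_block(), DIV_ROUND_UP_SKETCH,
trim_cb) are invented for the example.

```c
#include <stddef.h>

/* Round-up division, in the spirit of the kernel's DIV_ROUND_UP(). */
#define DIV_ROUND_UP_SKETCH(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical callback: receives a group-relative start and a length. */
typedef void (*trim_cb)(int start, int count);

/* Absolute block number of block "start" inside "group". */
unsigned long long group_to_fs_block(unsigned int group, int blocks_per_group,
				     int start, int first_data_block)
{
	return (unsigned long long)group * blocks_per_group
		+ start + first_data_block;
}

/*
 * Walk a toy bitmap (one byte per block, nonzero = in use) and hand every
 * free run of at least minblocks blocks to trim(), if one is given.
 * Returns the number of blocks that qualified for trimming.
 */
int trim_free_runs(const unsigned char *bitmap, int nblocks, int minblocks,
		   trim_cb trim)
{
	int start = 0, count = 0;

	while (start < nblocks) {
		int next;

		/* Skip used blocks to find the start of a free run. */
		while (start < nblocks && bitmap[start])
			start++;
		if (start >= nblocks)
			break;

		/* Find the first used block after the run (or the end). */
		next = start;
		while (next < nblocks && !bitmap[next])
			next++;

		if (next - start >= minblocks) {
			if (trim)
				trim(start, next - start);
			count += next - start;
		}
		start = next;
	}
	return count;
}
```

With a 4096-byte block size, a minlen of 4096 bytes maps to
DIV_ROUND_UP_SKETCH(4096, 4096) = 1 block, so every free run qualifies.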

2010-07-14 08:33:10

by Dmitry Monakhov

Subject: Re: [PATCH 2/2] Add batched discard support for ext4

Attachments:

(No filename) (1.06 kB)
0001-ext4-Add-interrupt-points-to-batched-discard.patch (2.18 kB)

2010-07-14 09:41:03

by Lukas Czerner

Subject: Re: [PATCH 2/2] Add batched discard support for ext4

On Wed, 14 Jul 2010, Dmitry Monakhov wrote:

> Lukas Czerner <[email protected]> writes:
>
> > Walk through each allocation group and trim all free extents. It can be
> > invoked through TRIM ioctl on the file system. The main idea is to
> > provide a way to trim the whole file system if needed, since some SSD's
> > may suffer from performance loss after the whole device was filled (it
> > does not mean that fs is full!).
> >
> > It searches for free extents in each allocation group. When a free
> > extent is found, its blocks are marked as used and then trimmed. Afterwards
> > these blocks are marked as free in the per-group bitmap.
> Looks ok, except two small notes:
> 1) trim_fs is a time-consuming operation, so we have to add
> cond_resched() and signal_pending() checks to allow the user to
> interrupt the command if necessary. See the attached patch.
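
The attached patch itself is not reproduced in this archive. As a rough
user-space analogy of the interrupt-point idea (the kernel helpers
cond_resched() and signal_pending() are not available outside the kernel), a
long per-group loop can check a flag set by a signal handler before each
group. Everything below is an assumption made for illustration: the names
walk_groups() and install_stop_handler(), and the choice of -EINTR as the
return value.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>

/* Set from the signal handler, polled by the loop. */
static volatile sig_atomic_t stop_requested;

static void on_signal(int signo)
{
	(void)signo;
	stop_requested = 1;
}

/* Install the stop handler for signo; returns 0 on success, -1 on error. */
int install_stop_handler(int signo)
{
	return signal(signo, on_signal) == SIG_ERR ? -1 : 0;
}

/*
 * Process ngroups units of work, checking for a stop request before each
 * one. *done receives the number of groups processed. Returns 0 if all
 * groups were processed, -EINTR if the walk was interrupted.
 */
int walk_groups(unsigned int ngroups, void (*work)(unsigned int group),
		unsigned int *done)
{
	unsigned int group;

	*done = 0;
	for (group = 0; group < ngroups; group++) {
		if (stop_requested)
			return -EINTR;
		if (work)
			work(group);
		(*done)++;
	}
	return 0;
}
```

Checking once per group keeps the overhead negligible while bounding how long
the caller has to wait after asking for the operation to stop.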

Hi Dmitry,

thanks for your patch! Although I have one question:

for (group = 0; group < ngroups; group++) {
- int err;
-
- err = ext4_mb_load_buddy(sb, group, &e4b);
- if (err) {
+ ret = ext4_mb_load_buddy(sb, group, &e4b);
+ if (ret) {
ext4_error(sb, "Error in loading buddy "
"information for %u", group);
- continue;
+ break;
}

Is there really a need to jump out of the loop and exit when load_buddy
fails? The next group may very well succeed in loading its buddy,
or am I missing something?

> 2) IMHO runtime trim support is useful sometimes, for example when
> user really care about data security i.e. unlinked file should be
> trimmed ASAP. I think we have to provide 'secdel' mount option
> similar to secdeletion flag for inode, but this is another story
> not directly connected with the patch.

I like the idea, but IMO this should work for any underlying storage,
not just for SSDs.

Thanks!
-Lukas

2010-07-14 10:03:13

by Dmitry Monakhov

Subject: Re: [PATCH 2/2] Add batched discard support for ext4

Lukas Czerner <[email protected]> writes:

> On Wed, 14 Jul 2010, Dmitry Monakhov wrote:
>
>> Lukas Czerner <[email protected]> writes:
>>
>> > Walk through each allocation group and trim all free extents. It can be
>> > invoked through TRIM ioctl on the file system. The main idea is to
>> > provide a way to trim the whole file system if needed, since some SSD's
>> > may suffer from performance loss after the whole device was filled (it
>> > does not mean that fs is full!).
>> >
>> > It searches for free extents in each allocation group. When a free
>> > extent is found, its blocks are marked as used and then trimmed. Afterwards
>> > these blocks are marked as free in the per-group bitmap.
>> Looks ok, except two small notes:
>> 1) trim_fs is a time-consuming operation, so we have to add
>> cond_resched() and signal_pending() checks to allow the user to
>> interrupt the command if necessary. See the attached patch.
>
> Hi Dmitry,
>
> thanks for your patch! Although I have one question:
>
>
> for (group = 0; group < ngroups; group++) {
> - int err;
> -
> - err = ext4_mb_load_buddy(sb, group, &e4b);
> - if (err) {
> + ret = ext4_mb_load_buddy(sb, group, &e4b);
> + if (ret) {
> ext4_error(sb, "Error in loading buddy "
> "information for %u", group);
> - continue;
> + break;
> }
>
> Is there really need to jump out from the loop and exit in the case of
> load_buddy failure ? Next group may very well succeed in loading buddy,
> or am I missing something ?
Well, it may fail with -ENOMEM, which is not scary, but in some places
it may fail with -EIO, which is a very bad sign. So I think it is
slightly dangerous to continue once we have found such a group.
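
To make the distinction drawn here concrete, the sketch below skips a group
on -ENOMEM and aborts the walk on any other error such as -EIO. This is
purely illustrative and is not what either patch in this thread does (the
original continues on every error, the follow-up breaks on every error).
trim_groups(), load_fn and fake_load() are hypothetical names.

```c
#include <errno.h>

/* Hypothetical loader: returns 0 or a negative errno for a group. */
typedef int (*load_fn)(unsigned int group);

/* Demo loader: group 1 hits -ENOMEM, group 3 hits -EIO, others succeed. */
int fake_load(unsigned int group)
{
	if (group == 1)
		return -ENOMEM;
	if (group == 3)
		return -EIO;
	return 0;
}

/*
 * Walk all groups. A transient -ENOMEM only skips that group; any other
 * error stops the walk and is returned. *loaded counts the groups that
 * were loaded successfully.
 */
int trim_groups(unsigned int ngroups, load_fn load, unsigned int *loaded)
{
	unsigned int group;

	*loaded = 0;
	for (group = 0; group < ngroups; group++) {
		int err = load(group);

		if (err == -ENOMEM)
			continue;	/* transient: try the next group */
		if (err)
			return err;	/* e.g. -EIO: stop right here */
		(*loaded)++;
	}
	return 0;
}
```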
>
>> 2) IMHO runtime trim support is useful sometimes, for example when
>> user really care about data security i.e. unlinked file should be
>> trimmed ASAP. I think we have to provide 'secdel' mount option
>> similar to secdeletion flag for inode, but this is another story
>> not directly connected with the patch.
>
> I like the idea, but IMO this should work for any underlying storage,
> not just for SSDs.
Of course. We may use blkdev_issue_zeroout() if the disk does not support
discard with zeroing.
>
> Thanks!
> -Lukas

2010-07-14 11:43:38

by Lukas Czerner

Subject: Re: [PATCH 2/2] Add batched discard support for ext4

On Wed, 14 Jul 2010, Dmitry Monakhov wrote:

> Lukas Czerner <[email protected]> writes:
>
> > On Wed, 14 Jul 2010, Dmitry Monakhov wrote:
> >
> >> Lukas Czerner <[email protected]> writes:
> >>
> >> > Walk through each allocation group and trim all free extents. It can be
> >> > invoked through TRIM ioctl on the file system. The main idea is to
> >> > provide a way to trim the whole file system if needed, since some SSD's
> >> > may suffer from performance loss after the whole device was filled (it
> >> > does not mean that fs is full!).
> >> >
> >> > It searches for free extents in each allocation group. When a free
> >> > extent is found, its blocks are marked as used and then trimmed. Afterwards
> >> > these blocks are marked as free in the per-group bitmap.
> >> Looks ok, except two small notes:
> >> 1) trim_fs is a time-consuming operation, so we have to add
> >> cond_resched() and signal_pending() checks to allow the user to
> >> interrupt the command if necessary. See the attached patch.
> >
> > Hi Dmitry,
> >
> > thanks for your patch! Although I have one question:
> >
> >
> > for (group = 0; group < ngroups; group++) {
> > - int err;
> > -
> > - err = ext4_mb_load_buddy(sb, group, &e4b);
> > - if (err) {
> > + ret = ext4_mb_load_buddy(sb, group, &e4b);
> > + if (ret) {
> > ext4_error(sb, "Error in loading buddy "
> > "information for %u", group);
> > - continue;
> > + break;
> > }
> >
> > Is there really need to jump out from the loop and exit in the case of
> > load_buddy failure ? Next group may very well succeed in loading buddy,
> > or am I missing something ?
> Well, it may fail due to -ENOMEM which is not scary but in some places
> it may fail due to EIO which is a very bad sign. So i think it is
> slightly dangerous to continue if we have found a same group.

Ok, it seems reasonable.

> >
> >> 2) IMHO runtime trim support is useful sometimes, for example when
> >> user really care about data security i.e. unlinked file should be
> >> trimmed ASAP. I think we have to provide 'secdel' mount option
> >> similar to secdeletion flag for inode, but this is another story
> >> not directly connected with the patch.
> >
> > I like the idea, but IMO this should work for any underlying storage,
> > not just for SSDs.
> Off course. We may use blkdev_issue_zeroout() if disk does not support
> discard with zeroing.
> >

-Lukas

2010-07-23 14:36:06

by Theodore Ts'o

Subject: Re: Ext4: batched discard support - simplified version

On Wed, Jul 07, 2010 at 09:53:30AM +0200, Lukas Czerner wrote:
>
> Hi all,
>
> since my last post I have done some more testing with various SSD's and the
> trend is clear. Trim performance is getting better and the performance loss
> without trim is getting lower. So I have decided to abandon the initial idea
> to track free blocks within some internal data structure - it takes time and
> memory.

Do you have some numbers about how bad trim actually might be on
various devices? I can imagine some devices where it might be better
(for wear levelling and better write endurance if nothing else) where
it's better to do the trim right away instead of batching things.

So what I'm thinking about doing is keeping the "discard" mount option
to mean non-batched discard. If you want to use the explicit FITRIM
ioctl, I don't think we need to test to see if the discard mount option
is set; if the user issues the ioctl, then we should do the batched
discard, and if we don't trust the user to do that, then well, the
ioctl should be restricted to privileged users only --- especially if
it could take up to a minute.

- Ted

2010-07-23 15:13:56

by Jeff Moyer

Subject: Re: Ext4: batched discard support - simplified version

"Ted Ts'o" <[email protected]> writes:

> On Wed, Jul 07, 2010 at 09:53:30AM +0200, Lukas Czerner wrote:
>>
>> Hi all,
>>
>> since my last post I have done some more testing with various SSD's and the
>> trend is clear. Trim performance is getting better and the performance loss
>> without trim is getting lower. So I have decided to abandon the initial idea
>> to track free blocks within some internal data structure - it takes time and
>> memory.
>
> Do you have some numbers about how bad trim actually might be on
> various devices?

I'll let Lukas answer that when he gets back to the office next week.
The performance of the trim command itself varies by vendor, of course.

> I can imagine some devices where it might be better (for wear
> levelling and better write endurance if nothing else) where it's
> better to do the trim right away instead of batching things.

I don't think so. In all of the configurations tested, I'm pretty sure
we saw a performance hit from doing the TRIMs right away. The queue
flush really hurts. Of course, I have no idea what you had in mind for
the amount of time in between batched discards.

> So what I'm thinking about doing is keeping the "discard" mount option
> to mean non-batched discard. If you want to use the explicit FITRIM
> ioctl, I don't think we need to test to see if the discard mount option
> is set; if the user issues the ioctl, then we should do the batched
> discard, and if we don't trust the user to do that, then well, the
> ioctl should be restricted to privileged users only --- especially if
> it could take up to a minute.

That sounds reasonable to me.

Cheers,
Jeff

2010-07-23 15:19:27

by Theodore Ts'o

Subject: Re: Ext4: batched discard support - simplified version

On Fri, Jul 23, 2010 at 11:13:52AM -0400, Jeff Moyer wrote:
>
> I don't think so. In all of the configurations tested, I'm pretty sure
> we saw a performance hit from doing the TRIMs right away. The queue
> flush really hurts. Of course, I have no idea what you had in mind for
> the amount of time in between batched discards.

Sure, but not all the world is SATA-attached SSD's. I'm thinking in
particular of PCIe-attached SSD's, where the TRIM command might be
very fast indeed... I believe Ric Wheeler tells me you have TMS
RamSan SSD's in house that you are testing? And of course those
aren't the only PCIe-attached flash devices out there....

- Ted

2010-07-23 15:30:34

by Greg Freemyer

Subject: Re: Ext4: batched discard support - simplified version

On Fri, Jul 23, 2010 at 11:13 AM, Jeff Moyer <[email protected]> wrote:
> "Ted Ts'o" <[email protected]> writes:
>
>> On Wed, Jul 07, 2010 at 09:53:30AM +0200, Lukas Czerner wrote:
>>>
>>> Hi all,
>>>
>>> since my last post I have done some more testing with various SSD's and the
>>> trend is clear. Trim performance is getting better and the performance loss
>>> without trim is getting lower. So I have decided to abandon the initial idea
>>> to track free blocks within some internal data structure - it takes time and
>>> memory.
>>
>> Do you have some numbers about how bad trim actually might be on
>> various devices?
>
> I'll let Lukas answer that when he gets back to the office next week.
> The performance of the trim command itself varies by vendor, of course.
>
>> I can imagine some devices where it might be better (for wear
>> levelling and better write endurance if nothing else) where it's
>> better to do the trim right away instead of batching things.
>
> I don't think so. In all of the configurations tested, I'm pretty sure
> we saw a performance hit from doing the TRIMs right away. The queue
> flush really hurts. Of course, I have no idea what you had in mind for
> the amount of time in between batched discards.

It was my understanding that way back when, Intel was pushing for the
TRIMs to be issued right away. That may be why they never fully implemented
the TRIM command to accept more than one sector's worth of vectorized
data. (That is still multiple ranges per discard, just not
hundreds/thousands.)

Along those lines, does this patch create multi-sector discard trim
commands when there is a large list of discard ranges (ie. thousands
of ranges to discard)? And if so, does it have a blacklist for SSDs
like the Intel that don't implement the multi-sector payload part of
the spec?
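
For context on "one sector's worth": as far as I understand the ATA DATA SET
MANAGEMENT (TRIM) payload format, each range is an 8-byte entry with the
starting LBA in the low 48 bits and the sector count in the high 16 bits, so
a 512-byte payload sector holds 64 ranges. The sketch below only does that
arithmetic; the names are made up and it does not talk to any device.

```c
#include <stdint.h>

/* Assumed layout: 512-byte payload sector / 8-byte entry = 64 entries. */
#define TRIM_ENTRIES_PER_SECTOR 64u

/* Pack one range: 48-bit starting LBA, 16-bit sector count. */
uint64_t trim_entry(uint64_t lba, uint16_t nsectors)
{
	return ((uint64_t)nsectors << 48) | (lba & 0xFFFFFFFFFFFFULL);
}

/* Number of 512-byte payload sectors needed to carry nranges entries. */
unsigned int trim_payload_sectors(unsigned int nranges)
{
	return (nranges + TRIM_ENTRIES_PER_SECTOR - 1) /
	       TRIM_ENTRIES_PER_SECTOR;
}
```

Under these assumptions, a list of a thousand ranges would need a 16-sector
payload, which a device limited to a single payload sector could not accept
in one command.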

Greg

2010-07-23 15:41:02

by Jeff Moyer

Subject: Re: Ext4: batched discard support - simplified version

"Ted Ts'o" <[email protected]> writes:

> On Fri, Jul 23, 2010 at 11:13:52AM -0400, Jeff Moyer wrote:
>>
>> I don't think so. In all of the configurations tested, I'm pretty sure
>> we saw a performance hit from doing the TRIMs right away. The queue
>> flush really hurts. Of course, I have no idea what you had in mind for
>> the amount of time in between batched discards.
>
> Sure, but not all the world is SATA-attached SSD's. I'm thinking in
> particular of PCIe-attached SSD's, where the TRIM command might be
> very fast indeed... I believe Ric Wheeler tells me you have TMS
> RamSan SSD's in house that you are testing? And of course those
> aren't the only PCIe-attached flash devices out there....

You are right, and we have to consider thinly provisioned luns, as well.
The only case I can think of where it makes sense to issue those
discards immediately is if you are running tight on allocated space in
your thinly provisioned lun. Aside from that, I'm not sure why you
would want to send those commands down with every journal commit,
instead of batched daily, for example. But, I can certainly understand
wanting to allow this flexibility.

Cheers,
Jeff

2010-07-23 17:00:10

by Theodore Ts'o

Subject: Re: Ext4: batched discard support - simplified version

On Fri, Jul 23, 2010 at 11:40:58AM -0400, Jeff Moyer wrote:
>
> You are right, and we have to consider thinly provisioned luns, as well.
> The only case I can think of where it makes sense to issue those
> discards immediately is if you are running tight on allocated space in
> your thinly provisioned lun. Aside from that, I'm not sure why you
> would want to send those commands down with every journal commit,
> instead of batched daily, for example. But, I can certainly understand
> wanting to allow this flexibility.

The two reasons I could imagine is to give more flexibility to the
wear leveling algorithms (depending on how often you are turning over
files --- i.e., deleting blocks and then reusing them), and to
minimize latency (it might be nicer for the system to send down the
deleted blocks on a continuing basis rather than to send them down all
at once).

The other issue is that by sending TRIM commands for all free extents,
even those that haven't recently been released, the flash
translation layer needs to look up a large number of blocks in its
translation table to see if it needs to update it. This can end up
burning CPU unnecessarily, especially for those flash devices (such as
FusionIO, for example) that manage their FTL using the host CPU.

So this is one of the reasons why I want to leave some flexibility
here; BTW, for some systems, it may make sense for the FITRIM ioctl to
throttle the rate at which it locks block groups and sends down TRIM
requests so it doesn't end up causing performance hiccups for live
applications while the FITRIM ioctl is running.
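
One very simple shape such throttling could take is to pause after every
burst of block groups. The user-space sketch below is only meant to
illustrate the idea; struct trim_throttle, its fields, and throttled_walk()
are hypothetical, and a real implementation would have to pick the burst
size and pause length based on measurements.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical tuning knobs. */
struct trim_throttle {
	unsigned int groups_per_burst;	/* 0 disables throttling */
	long pause_ms;			/* sleep between bursts */
};

/*
 * Call trim_group() (if given) for each group, sleeping pause_ms after
 * every full burst except when the walk is already finished.
 * Returns the number of pauses taken.
 */
unsigned int throttled_walk(unsigned int ngroups,
			    const struct trim_throttle *t,
			    void (*trim_group)(unsigned int group))
{
	unsigned int group, pauses = 0;

	for (group = 0; group < ngroups; group++) {
		if (trim_group)
			trim_group(group);

		if (t->groups_per_burst &&
		    (group + 1) % t->groups_per_burst == 0 &&
		    group + 1 < ngroups) {
			struct timespec ts = {
				t->pause_ms / 1000,
				(t->pause_ms % 1000) * 1000000L
			};

			nanosleep(&ts, NULL);
			pauses++;
		}
	}
	return pauses;
}
```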

- Ted

2010-07-24 16:32:40

by Ric Wheeler

Subject: Re: Ext4: batched discard support - simplified version

On 07/23/2010 11:19 AM, Ted Ts'o wrote:
> On Fri, Jul 23, 2010 at 11:13:52AM -0400, Jeff Moyer wrote:
>
>> I don't think so. In all of the configurations tested, I'm pretty sure
>> we saw a performance hit from doing the TRIMs right away. The queue
>> flush really hurts. Of course, I have no idea what you had in mind for
>> the amount of time in between batched discards.
>>
> Sure, but not all the world is SATA-attached SSD's. I'm thinking in
> particular of PCIe-attached SSD's, where the TRIM command might be
> very fast indeed... I believe Ric Wheeler tells me you have TMS
> RamSan SSD's in house that you are testing? And of course those
> aren't the only PCIe-attached flash devices out there....
>
> - Ted
>

I think that some of the PCI-e cards might want that information right
away, but a lot of high-end arrays actually prefer fewer, larger chunks.

One other user might be virtual devices or some remote replication
mechanism. I wonder if the drbd people for example might (do?) use these?

ric

2010-07-26 10:30:34

by Lukas Czerner

Subject: Re: Ext4: batched discard support - simplified version

On Fri, 23 Jul 2010, Ted Ts'o wrote:

> On Wed, Jul 07, 2010 at 09:53:30AM +0200, Lukas Czerner wrote:
> >
> > Hi all,
> >
> > since my last post I have done some more testing with various SSD's and the
> > trend is clear. Trim performance is getting better and the performance loss
> > without trim is getting lower. So I have decided to abandon the initial idea
> > to track free blocks within some internal data structure - it takes time and
> > memory.
>
> Do you have some numbers about how bad trim actually might be on
> various devices? I can imagine some devices where it might be better
> (for wear levelling and better write endurance if nothing else) where
> it's better to do the trim right away instead of batching things.

Hi,

Yes, I have those numbers.

http://people.redhat.com/jmoyer/discard/ext4_batched_discard/ext4_discard.html
This page presents my test results on three different devices. I have
tested the current ext4 discard implementation (do the trim right away).
However, one tested device is still not on that page. With this
(Vendor4) device I saw only about a 1.83% performance loss, which is
very good.

http://people.redhat.com/jmoyer/discard/ext4_batched_discard/ext4_ioctltrim.html
This page provides test results with my batched discard implementation.
Take those numbers with caution, because the patch still does not
involve journaling and I have tested the "worst case" scenario, which
involves issuing FITRIM in an endless loop without any sleep.

Generally the FITRIM ioctl can take from 2 seconds on fast devices to
several (2-4) minutes on very slow devices, or under heavy load.

>
> So what I'm thinking about doing is keeping the "discard" mount option
> to mean non-batched discard. If you want to use the explicit FITRIM
> ioctl, I don't think we need to test to see if the discard mount option
> is set; if the user issues the ioctl, then we should do the batched
> discard, and if we don't trust the user to do that, then well, the
> ioctl should be restricted to privileged users only --- especially if
> it could take up to a minute.

I agree.

>
> - Ted
>

Thanks.

-Lukas