2010-04-19 10:55:42

by Lukas Czerner

[permalink] [raw]
Subject: Ext4: batched discard support

Hi all,

I would like to present a new way to deal with TRIM in the ext4 file system.
The current solution is not ideal because of its bad performance impact.
The basic idea to improve things is to avoid issuing a discard every time
some blocks are freed, and instead to batch them together into bigger trims,
which tends to be more effective.

The basic idea behind my discard support is to create an ioctl which
walks through all the free extents in each allocation group and discards
those extents. In addition, to improve its performance one can specify a
minimum free extent length, so the ioctl will not bother with shorter extents.
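
To make the walk a bit more concrete, here is a rough userspace-style sketch
of the per-group scanning step (this is only an illustration with made-up
helpers and a toy bitmap, not the mballoc code):

#include <stdio.h>

#define BLOCKS_PER_GROUP 64             /* toy value, real groups are much larger */

/* 1 = block in use, 0 = block free, LSB-first within each byte */
static int block_in_use(const unsigned char *bitmap, int blk)
{
        return (bitmap[blk / 8] >> (blk % 8)) & 1;
}

/*
 * Scan one group's block bitmap and "discard" every run of free blocks
 * that is at least minblocks long.  Shorter runs are skipped, which is
 * what the minimum-extent-length argument of the ioctl is for.
 */
static void trim_group(const unsigned char *bitmap, int minblocks)
{
        int start = 0;

        while (start < BLOCKS_PER_GROUP) {
                int next;

                /* find the next free block */
                while (start < BLOCKS_PER_GROUP && block_in_use(bitmap, start))
                        start++;
                if (start >= BLOCKS_PER_GROUP)
                        break;

                /* find the end of this free extent */
                next = start;
                while (next < BLOCKS_PER_GROUP && !block_in_use(bitmap, next))
                        next++;

                if (next - start >= minblocks)
                        printf("discard extent: start %d, len %d\n",
                               start, next - start);

                start = next + 1;
        }
}

int main(void)
{
        /* blocks 8-23 and 40-41 are free, everything else is in use */
        unsigned char bitmap[BLOCKS_PER_GROUP / 8] = {
                0xff, 0x00, 0x00, 0xff, 0xff, 0xfc, 0xff, 0xff
        };

        trim_group(bitmap, 4);  /* minimum extent length of 4 blocks */
        return 0;
}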

This of course means that with each invocation the ioctl must walk
through the whole file system, checking and discarding free extents, which
is not very efficient. The best way to avoid this is to keep track of
deleted (freed) blocks. Then the ioctl has to trim just those free
extents which were recently freed.

In order to implement this I have added a new bitmap to ext4_group_info
(bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
walks through bb_bitmap_deleted, compares the deleted extents with the free
extents, trims them, and then removes them from bb_bitmap_deleted.

But you may notice that there is one problem: bb_bitmap_deleted does
not survive umount. To work around this, the first ioctl call has to
walk through the whole file system, trimming all free extents. There is a
better solution to this problem: bb_bitmap_deleted could be stored on
disk and restored at mount time along with the other bitmaps, but I
think that is quite a big change and should be discussed further.

I have also benchmarked it a little. You can find results here:

people.redhat.com/jmoyer/discard/ext4_batched_discard/

A comparison with the current solution is included. Keep in mind that the
ideal ioctl invocation interval is yet to be determined, so in the benchmark
I have used the worst-case scenario for performance - without any sleep
between executions.
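
In practice the ioctl would be driven periodically from userspace, roughly
like this (the 1 MiB minimum and the one-hour interval here are just
placeholders, not recommendations):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define FITRIM          _IOWR('X', 121, int)    /* same definition as in the patches */

int main(int argc, char **argv)
{
        int minsize = 1024 * 1024;      /* skip free extents smaller than 1 MiB */
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s mountpoint\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        for (;;) {
                if (ioctl(fd, FITRIM, &minsize)) {
                        perror("FITRIM");
                        return 1;
                }
                sleep(3600);    /* placeholder interval, see above */
        }
}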


There are two patches for this. The first one just creates a
filesystem-independent ioctl for this, and the second one is the batched
discard support itself.

I will very much appreciate any comments on this - your opinions, ideas to
make this better, etc. Thanks.

If you want to try it, just create an ext4 file system, mount it, and invoke
the ioctl on the mount point. You can use the following code for this (I have
taken it from the xfs patch for the same thing). There are also some
debugging messages, but you need to set EXT4FS_DEBUG to see them.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define FITRIM _IOWR('X', 121, int)

int main(int argc, char **argv)
{
        int minsize = 4096;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s mountpoint\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        if (ioctl(fd, FITRIM, &minsize)) {
                if (errno == EOPNOTSUPP)
                        fprintf(stderr, "TRIM not supported\n");
                else
                        perror("EXT4_IOC_TRIM");
                return 1;
        }

        return 0;
}

fs/ioctl.c | 31 +++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
2 files changed, 33 insertions(+), 0 deletions(-)

fs/ext4/ext4.h | 4 +
fs/ext4/mballoc.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++++++---
fs/ext4/super.c | 1 +
3 files changed, 202 insertions(+), 10 deletions(-)


2010-04-19 10:56:07

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH 1/2] Add ioctl FITRIM.

Adds a filesystem-independent ioctl to allow implementation of
filesystem batched discard support.

Signed-off-by: Lukas Czerner <[email protected]>
---
fs/ioctl.c | 31 +++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 6c75110..a43eaea 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -537,6 +537,33 @@ static int ioctl_fsthaw(struct file *filp)
return thaw_bdev(sb->s_bdev, sb);
}

+static int ioctl_fstrim(struct file *filp, unsigned long arg)
+{
+ struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;
+ unsigned int minlen;
+ int err;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* If filesystem doesn't support trim feature, return. */
+ if (sb->s_op->trim_fs == NULL)
+ return -EOPNOTSUPP;
+
+ /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */
+ if (sb->s_bdev == NULL)
+ return -EINVAL;
+
+ err = get_user(minlen, (unsigned int __user *) arg);
+ if (err)
+ return err;
+
+ err = sb->s_op->trim_fs(minlen, sb);
+ if (err)
+ return err;
+ return 0;
+}
+
/*
* When you add any new common ioctls to the switches above and below
* please update compat_sys_ioctl() too.
@@ -587,6 +614,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
error = ioctl_fsthaw(filp);
break;

+ case FITRIM:
+ error = ioctl_fstrim(filp, arg);
+ break;
+
case FS_IOC_FIEMAP:
return ioctl_fiemap(filp, arg);

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 39d57bc..f6ecdff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -315,6 +315,7 @@ struct inodes_stat_t {
#define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */
#define FIFREEZE _IOWR('X', 119, int) /* Freeze */
#define FITHAW _IOWR('X', 120, int) /* Thaw */
+#define FITRIM _IOWR('X', 121, int) /* Trim */

#define FS_IOC_GETFLAGS _IOR('f', 1, long)
#define FS_IOC_SETFLAGS _IOW('f', 2, long)
@@ -1580,6 +1581,7 @@ struct super_operations {
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+ int (*trim_fs) (unsigned int, struct super_block *);
};

/*
--
1.6.6.1
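
For another filesystem, hooking into this would only mean providing the new
super operation; a minimal sketch with hypothetical "myfs" names (the
prototype matches the one added to struct super_operations above):

/* hypothetical filesystem "myfs" - sketch only, not a real implementation */
static int myfs_trim_fs(unsigned int minlen, struct super_block *sb)
{
        /* walk free space and sb_issue_discard() extents of at least minlen bytes */
        return 0;
}

static const struct super_operations myfs_sops = {
        /* ... other operations ... */
        .trim_fs        = myfs_trim_fs,
};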


2010-04-19 10:56:11

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH 2/2] Add batched discard support for ext4.

Create an ioctl which walks through all the free extents in each
allocation group and discards those extents. In addition, to improve
its performance one can specify a minimum free extent length, so the
ioctl will not bother with shorter extents.

This of course means that with each invocation the ioctl must walk
through the whole file system, checking and discarding free extents,
which is not very efficient. The best way to avoid this is to keep
track of deleted (freed) blocks. Then the ioctl has to trim just those
free extents which were recently freed.

In order to implement this I have added a new bitmap to ext4_group_info
(bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
walks through bb_bitmap_deleted, compares the deleted extents with the
free extents, trims them, and then removes them from bb_bitmap_deleted.

But you may notice that there is one problem: bb_bitmap_deleted does
not survive umount. To work around this, the first ioctl call has to
walk through the whole file system, trimming all free extents.

Signed-off-by: Lukas Czerner <[email protected]>
---
fs/ext4/ext4.h | 4 +
fs/ext4/mballoc.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++++++---
fs/ext4/super.c | 1 +
3 files changed, 202 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf938cf..e25f672 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1437,6 +1437,8 @@ extern int ext4_mb_add_groupinfo(struct super_block *sb,
extern int ext4_mb_get_buddy_cache_lock(struct super_block *, ext4_group_t);
extern void ext4_mb_put_buddy_cache_lock(struct super_block *,
ext4_group_t, int);
+extern int ext4_trim_fs(unsigned int, struct super_block *);
+
/* inode.c */
struct buffer_head *ext4_getblk(handle_t *, struct inode *,
ext4_lblk_t, int, int *);
@@ -1682,6 +1684,8 @@ struct ext4_group_info {
#ifdef DOUBLE_CHECK
void *bb_bitmap;
#endif
+ void *bb_bitmap_deleted;
+ ext4_grpblk_t bb_deleted;
struct rw_semaphore alloc_sem;
ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
* regions, index is order.
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bde9d0b..fbc83fe 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2255,6 +2255,9 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
init_rwsem(&meta_group_info[i]->alloc_sem);
meta_group_info[i]->bb_free_root = RB_ROOT;
+ meta_group_info[i]->bb_deleted = -1;
+
+

#ifdef DOUBLE_CHECK
{
@@ -2469,6 +2472,7 @@ int ext4_mb_release(struct super_block *sb)
#ifdef DOUBLE_CHECK
kfree(grinfo->bb_bitmap);
#endif
+ kfree(grinfo->bb_bitmap_deleted);
ext4_lock_group(sb, i);
ext4_mb_cleanup_pa(grinfo);
ext4_unlock_group(sb, i);
@@ -2528,6 +2532,7 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
int err, count = 0, count2 = 0;
struct ext4_free_data *entry;
struct list_head *l, *ltmp;
+ void *bitmap_deleted = NULL;

list_for_each_safe(l, ltmp, &txn->t_private_list) {
entry = list_entry(l, struct ext4_free_data, list);
@@ -2543,6 +2548,14 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
/* there are blocks to put in buddy to make them really free */
count += entry->count;
count2++;
+
+ if (test_opt(sb, DISCARD) &&
+ (db->bb_bitmap_deleted == NULL) &&
+ (db->bb_deleted >= 0)) {
+ bitmap_deleted =
+ kmalloc(sb->s_blocksize, GFP_KERNEL);
+ }
+
ext4_lock_group(sb, entry->group);
/* Take it out of per group rb tree */
rb_erase(&entry->node, &(db->bb_free_root));
@@ -2555,17 +2568,24 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
page_cache_release(e4b.bd_buddy_page);
page_cache_release(e4b.bd_bitmap_page);
}
- ext4_unlock_group(sb, entry->group);
- if (test_opt(sb, DISCARD)) {
- ext4_fsblk_t discard_block;
-
- discard_block = entry->start_blk +
- ext4_group_first_block_no(sb, entry->group);
- trace_ext4_discard_blocks(sb,
- (unsigned long long)discard_block,
- entry->count);
- sb_issue_discard(sb, discard_block, entry->count);
+ if (test_opt(sb, DISCARD) && (db->bb_deleted >= 0)) {
+ if (db->bb_bitmap_deleted == NULL) {
+ db->bb_bitmap_deleted = bitmap_deleted;
+ BUG_ON(db->bb_bitmap_deleted == NULL);
+
+ bitmap_deleted = NULL;
+ mb_clear_bits(db->bb_bitmap_deleted,
+ 0, EXT4_BLOCKS_PER_GROUP(sb));
+ }
+
+ db->bb_deleted += entry->count;
+ mb_set_bits(db->bb_bitmap_deleted, entry->start_blk,
+ entry->count);
}
+ ext4_unlock_group(sb, entry->group);
+
+ kfree(bitmap_deleted);
+
kmem_cache_free(ext4_free_ext_cachep, entry);
ext4_mb_release_desc(&e4b);
}
@@ -4639,3 +4659,170 @@ error_return:
kmem_cache_free(ext4_ac_cachep, ac);
return;
}
+
+/* Trim "count" blocks starting at "start" in "group"
+ * This must be called under group lock
+ */
+void ext4_trim_extent(struct super_block *sb, int start, int count,
+ ext4_group_t group, struct ext4_buddy *e4b)
+{
+ ext4_fsblk_t discard_block;
+ struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+ struct ext4_free_extent ex;
+
+ assert_spin_locked(ext4_group_lock_ptr(sb, group));
+
+ ex.fe_start = start;
+ ex.fe_group = group;
+ ex.fe_len = count;
+
+ mb_mark_used(e4b, &ex);
+ ext4_unlock_group(sb, group);
+
+ discard_block = (ext4_fsblk_t)group *
+ EXT4_BLOCKS_PER_GROUP(sb)
+ + start
+ + le32_to_cpu(es->s_first_data_block);
+ trace_ext4_discard_blocks(sb,
+ (unsigned long long)discard_block,
+ count);
+ sb_issue_discard(sb, discard_block, count);
+
+ ext4_lock_group(sb, group);
+ mb_free_blocks(NULL, e4b, start, ex.fe_len);
+}
+
+/* Trim all free blocks, which have at least minlen length */
+ext4_grpblk_t ext4_trim_all_free(struct super_block *sb, struct ext4_buddy *e4b,
+ ext4_grpblk_t minblocks)
+{
+ void *bitmap;
+ ext4_grpblk_t max = EXT4_BLOCKS_PER_GROUP(sb);
+ ext4_grpblk_t start, next, count = 0;
+ ext4_group_t group;
+
+ BUG_ON(e4b == NULL);
+
+ bitmap = e4b->bd_bitmap;
+ group = e4b->bd_group;
+ start = 0;
+ ext4_lock_group(sb, group);
+
+ while (start < max) {
+
+ start = mb_find_next_zero_bit(bitmap, max, start);
+ if (start >= max)
+ break;
+ next = mb_find_next_bit(bitmap, max, start);
+
+ if ((next - start) >= minblocks) {
+
+ count += next - start;
+ ext4_trim_extent(sb, start,
+ next - start, group, e4b);
+
+ }
+ start = next + 1;
+ }
+
+ e4b->bd_info->bb_deleted = 0;
+ ext4_unlock_group(sb, group);
+
+ return count;
+}
+
+/* Trim only blocks which is free and in bb_bitmap_deleted */
+ext4_grpblk_t ext4_trim_deleted(struct super_block *sb, struct ext4_buddy *e4b,
+ ext4_grpblk_t minblocks)
+{
+ void *bitmap;
+ struct ext4_group_info *grp;
+ ext4_group_t group;
+ ext4_grpblk_t max, next, count = 0, start = 0;
+
+ BUG_ON(e4b == NULL);
+
+ bitmap = e4b->bd_bitmap;
+ group = e4b->bd_group;
+ grp = ext4_get_group_info(sb, group);
+
+ if (grp->bb_deleted < minblocks)
+ return 0;
+
+ ext4_lock_group(sb, group);
+
+ while (start < EXT4_BLOCKS_PER_GROUP(sb)) {
+ start = mb_find_next_bit(grp->bb_bitmap_deleted,
+ EXT4_BLOCKS_PER_GROUP(sb), start);
+ max = mb_find_next_zero_bit(grp->bb_bitmap_deleted,
+ EXT4_BLOCKS_PER_GROUP(sb), start);
+
+ while (start < max) {
+ start = mb_find_next_zero_bit(bitmap, max, start);
+ if (start >= max)
+ break;
+ next = mb_find_next_bit(bitmap, max, start);
+ if (next > max)
+ next = max;
+
+ if ((next - start) >= minblocks) {
+ count += next - start;
+
+ ext4_trim_extent(sb, start,
+ next - start, group, e4b);
+
+ mb_clear_bits(grp->bb_bitmap_deleted,
+ start, next - start);
+ }
+ start = next + 1;
+ }
+ }
+ grp->bb_deleted -= count;
+
+ ext4_unlock_group(sb, group);
+
+ ext4_debug("trimmed %d blocks in the group %d\n",
+ count, group);
+
+ return count;
+}
+
+int ext4_trim_fs(unsigned int minlen, struct super_block *sb)
+{
+ struct ext4_buddy e4b;
+ struct ext4_group_info *grp;
+ ext4_group_t group;
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
+ ext4_grpblk_t minblocks;
+
+ if (!test_opt(sb, DISCARD))
+ return 0;
+
+ minblocks = DIV_ROUND_UP(minlen, sb->s_blocksize);
+ if (unlikely(minblocks > EXT4_BLOCKS_PER_GROUP(sb)))
+ return -EINVAL;
+
+ for (group = 0; group < ngroups; group++) {
+ int err;
+
+ err = ext4_mb_load_buddy(sb, group, &e4b);
+ if (err) {
+ ext4_error(sb, "Error in loading buddy "
+ "information for %u", group);
+ continue;
+ }
+ grp = ext4_get_group_info(sb, group);
+
+ if (grp->bb_deleted < 0) {
+ /* First run after mount */
+ ext4_trim_all_free(sb, &e4b, minblocks);
+ } else if (grp->bb_deleted >= minblocks) {
+ /* Trim only blocks deleted since first run */
+ ext4_trim_deleted(sb, &e4b, minblocks);
+ }
+
+ ext4_mb_release_desc(&e4b);
+ }
+
+ return 0;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e14d22c..253eb98 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1109,6 +1109,7 @@ static const struct super_operations ext4_sops = {
.quota_write = ext4_quota_write,
#endif
.bdev_try_to_free_page = bdev_try_to_free_page,
+ .trim_fs = ext4_trim_fs
};

static const struct super_operations ext4_nojournal_sops = {
--
1.6.6.1
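
To make the ext4_trim_deleted() walk above easier to follow, here is a
standalone toy sketch of the same idea - only runs that are both free and
marked in the deleted bitmap get discarded (illustrative code with made-up
names, not the mballoc implementation):

#include <stdio.h>

#define BLOCKS_PER_GROUP 32     /* toy value */

static int test_bit(const unsigned char *map, int blk)
{
        return (map[blk / 8] >> (blk % 8)) & 1;
}

/*
 * bitmap:      1 = block in use, 0 = block free
 * deleted_map: 1 = block freed since the last trim run
 * Only runs of at least minblocks blocks that are free *and* marked in
 * deleted_map are discarded; the real code then clears them from
 * bb_bitmap_deleted.
 */
static void trim_deleted(const unsigned char *bitmap,
                         const unsigned char *deleted_map, int minblocks)
{
        int start = 0;

        while (start < BLOCKS_PER_GROUP) {
                int next;

                while (start < BLOCKS_PER_GROUP &&
                       !(!test_bit(bitmap, start) && test_bit(deleted_map, start)))
                        start++;
                if (start >= BLOCKS_PER_GROUP)
                        break;

                next = start;
                while (next < BLOCKS_PER_GROUP &&
                       !test_bit(bitmap, next) && test_bit(deleted_map, next))
                        next++;

                if (next - start >= minblocks)
                        printf("discard extent: start %d, len %d\n",
                               start, next - start);
                start = next + 1;
        }
}

int main(void)
{
        /* blocks 0-15 free, 16-31 in use */
        unsigned char bitmap[4]      = { 0x00, 0x00, 0xff, 0xff };
        /* blocks 4-11 and 20-23 were freed since the last run */
        unsigned char deleted_map[4] = { 0xf0, 0x0f, 0xf0, 0x00 };

        trim_deleted(bitmap, deleted_map, 4);
        return 0;
}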


2010-04-19 16:20:58

by Greg Freemyer

[permalink] [raw]
Subject: Re: Ext4: batched discard support

Adding Mark Lord in cc.

He wrote a preliminary discard solution last summer. I'm not sure how
it has progressed.

Mark, you can find the 2 patches at:

http://patchwork.ozlabs.org/patch/50441/
http://patchwork.ozlabs.org/patch/50442/

Greg

On Mon, Apr 19, 2010 at 6:55 AM, Lukas Czerner <[email protected]> wrote:
> Hi all,
>
> I would like to present a new way to deal with TRIM in ext4 file system.
> The current solution is not ideal because of its bad performance impact.
> So basic idea to improve things is to avoid discarding every time some
> blocks are freed. and instead batching is together into bigger trims,
> which tends to be more effective.
>
> The basic idea behind my discard support is to create an ioctl which
> walks through all the free extents in each allocating group and discard
> those extents. As an addition to improve its performance one can specify
> minimum free extent length, so ioctl will not bother with shorter extents.
>
> This of course means, that with each invocation the ioctl must walk
> through whole file system, checking and discarding free extents, which
> is not very efficient. The best way to avoid this is to keep track of
> deleted (freed) blocks. Then the ioctl have to trim just those free
> extents which were recently freed.
>
> In order to implement this I have added new bitmap into ext4_group_info
> (bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
> walk through bb_bitmap_deleted, compare deleted extents with free
> extents trim them and then removes it from the bb_bitmap_deleted.
>
> But you may notice, that there is one problem. bb_bitmap_deleted does
> not survive umount. To bypass the problem the first ioctl call have to
> walk through whole file system trimming all free extents. But there is a
> better solution to this problem. The bb_bitmap_deleted can be stored on
> disk an can be restored in mount time along with other bitmaps, but I
> think it is a quite big change and should be discussed further.
>
> I have also benchmarked it a little. You can find results here:
>
> people.redhat.com/jmoyer/discard/ext4_batched_discard/
>
> comparison with current solution included. Keep in mind that ideal ioctl
> invocation interval is yet to be determined, so in benchmark I have used
> the performance-worst scenario - without any sleep between execution.
>
>
> There are two patches for this. The first one just creates file system
> independent ioctl for this and the second one it the batched discard
> support itself.
>
> I will very much appreciate any comment on this, your opinions, ideas to
> make this better etc. Thanks.
>
> If you want to try it, just create EXT4 file system mount it and invoke
> ioctl on the mount point. You can use following code for this (I have
> taken this from xfs patch for the same thing). You can also see some
> debugging messages, but you may want to set EXT4FS_DEBUG for this.
>
> #include <errno.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdint.h>
> #include <sys/ioctl.h>
>
> #define FITRIM          _IOWR('X', 121, int)
>
> int main(int argc, char **argv)
> {
>        int minsize = 4096;
>        int fd;
>
>        if (argc != 2) {
>                fprintf(stderr, "usage: %s mountpoint\n", argv[0]);
>                return 1;
>        }
>
>        fd = open(argv[1], O_RDONLY);
>        if (fd < 0) {
>                perror("open");
>                return 1;
>        }
>
>        if (ioctl(fd, FITRIM, &minsize)) {
>                if (errno == EOPNOTSUPP)
>                        fprintf(stderr, "TRIM not supported\n");
>                else
>                        perror("EXT4_IOC_TRIM");
>                return 1;
>        }
>
>        return 0;
> }
>
>  fs/ioctl.c         |   31 +++++++++++++++++++++++++++++++
>  include/linux/fs.h |    2 ++
>  2 files changed, 33 insertions(+), 0 deletions(-)
>
>  fs/ext4/ext4.h    |    4 +
>  fs/ext4/mballoc.c |  207 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>  fs/ext4/super.c   |    1 +
>  3 files changed, 202 insertions(+), 10 deletions(-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>

2010-04-19 16:56:11

by Eric Sandeen

[permalink] [raw]
Subject: Re: Ext4: batched discard support

Greg Freemyer wrote:
> Adding Mark Lord in cc.
>
> He wrote a preliminary discard solution last summer. I'm not sure how
> it has progressed.

The difference here is that Mark's stuff wasn't as tightly integrated
with the kernel, IIRC. What I saw was more at a user level - make a big
file, map it, discard all the blocks, unlink the file.

It was a good first step, but I think we can do a lot better by using
fs-specific calls to be efficient & targeted about the discards.

Christoph has a similar approach for XFS, FWIW.

-Eric

> Mark, you can find the 2 patches at:
>
> http://patchwork.ozlabs.org/patch/50441/
> http://patchwork.ozlabs.org/patch/50442/
>
> Greg
>
> On Mon, Apr 19, 2010 at 6:55 AM, Lukas Czerner <[email protected]> wrote:
>> Hi all,
>>
>> I would like to present a new way to deal with TRIM in ext4 file system.
>> The current solution is not ideal because of its bad performance impact.
>> So basic idea to improve things is to avoid discarding every time some
>> blocks are freed. and instead batching is together into bigger trims,
>> which tends to be more effective.
>>
>> The basic idea behind my discard support is to create an ioctl which
>> walks through all the free extents in each allocating group and discard
>> those extents. As an addition to improve its performance one can specify
>> minimum free extent length, so ioctl will not bother with shorter extents.
>>
>> This of course means, that with each invocation the ioctl must walk
>> through whole file system, checking and discarding free extents, which
>> is not very efficient. The best way to avoid this is to keep track of
>> deleted (freed) blocks. Then the ioctl have to trim just those free
>> extents which were recently freed.
>>
>> In order to implement this I have added new bitmap into ext4_group_info
>> (bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
>> walk through bb_bitmap_deleted, compare deleted extents with free
>> extents trim them and then removes it from the bb_bitmap_deleted.
>>
>> But you may notice, that there is one problem. bb_bitmap_deleted does
>> not survive umount. To bypass the problem the first ioctl call have to
>> walk through whole file system trimming all free extents. But there is a
>> better solution to this problem. The bb_bitmap_deleted can be stored on
>> disk an can be restored in mount time along with other bitmaps, but I
>> think it is a quite big change and should be discussed further.
>>
>> I have also benchmarked it a little. You can find results here:
>>
>> people.redhat.com/jmoyer/discard/ext4_batched_discard/
>>
>> comparison with current solution included. Keep in mind that ideal ioctl
>> invocation interval is yet to be determined, so in benchmark I have used
>> the performance-worst scenario - without any sleep between execution.
>>
>>
>> There are two patches for this. The first one just creates file system
>> independent ioctl for this and the second one it the batched discard
>> support itself.
>>
>> I will very much appreciate any comment on this, your opinions, ideas to
>> make this better etc. Thanks.
>>
>> If you want to try it, just create EXT4 file system mount it and invoke
>> ioctl on the mount point. You can use following code for this (I have
>> taken this from xfs patch for the same thing). You can also see some
>> debugging messages, but you may want to set EXT4FS_DEBUG for this.
>>
>> #include <errno.h>
>> #include <fcntl.h>
>> #include <stdio.h>
>> #include <stdint.h>
>> #include <sys/ioctl.h>
>>
>> #define FITRIM _IOWR('X', 121, int)
>>
>> int main(int argc, char **argv)
>> {
>> int minsize = 4096;
>> int fd;
>>
>> if (argc != 2) {
>> fprintf(stderr, "usage: %s mountpoint\n", argv[0]);
>> return 1;
>> }
>>
>> fd = open(argv[1], O_RDONLY);
>> if (fd < 0) {
>> perror("open");
>> return 1;
>> }
>>
>> if (ioctl(fd, FITRIM, &minsize)) {
>> if (errno == EOPNOTSUPP)
>> fprintf(stderr, "TRIM not supported\n");
>> else
>> perror("EXT4_IOC_TRIM");
>> return 1;
>> }
>>
>> return 0;
>> }
>>
>> fs/ioctl.c | 31 +++++++++++++++++++++++++++++++
>> include/linux/fs.h | 2 ++
>> 2 files changed, 33 insertions(+), 0 deletions(-)
>>
>> fs/ext4/ext4.h | 4 +
>> fs/ext4/mballoc.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>> fs/ext4/super.c | 1 +
>> 3 files changed, 202 insertions(+), 10 deletions(-)
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>


2010-04-19 17:58:20

by Greg Freemyer

[permalink] [raw]
Subject: Re: Ext4: batched discard support

On Mon, Apr 19, 2010 at 12:30 PM, Eric Sandeen <[email protected]> wrote:
> Greg Freemyer wrote:
>> Adding Mark Lord in cc.
>>
>> He wrote a preliminary discard solution last summer. I'm not sure how
>> it has progressed.
>
> The difference here is that Mark's stuff wasn't as tightly integrated
> with the kernel, IIRC. What I saw was more at a user level - make a big
> file, map it, discard all the blocks, unlink the file.
>
> It was a good first step, but I think we can do a lot better by using
> fs-specific calls to be efficient & targeted about the discards.
>
> Christoph has a similar approach for XFS, FWIW.
>
> -Eric

I haven't looked closely at this patch, but I recall Mark consolidated
numerous discontinuous trim/discard/unmap ranges into a single command
to the SSD drive.

That was why he felt he was getting superior performance, i.e. there
was an overhead per command to the drive that was eliminated if a
single, more complex command with multiple ranges went to the SSD
drive.

But he's the one that did the work and the benchmarking, so I'll let
him take it from here, especially if I misunderstood what he was
doing.

Greg

2010-04-19 18:02:42

by Ric Wheeler

[permalink] [raw]
Subject: Re: Ext4: batched discard support

On 04/19/2010 01:58 PM, Greg Freemyer wrote:
> On Mon, Apr 19, 2010 at 12:30 PM, Eric Sandeen<[email protected]> wrote:
>> Greg Freemyer wrote:
>>> Adding Mark Lord in cc.
>>>
>>> He wrote a preliminary discard solution last summer. I'm not sure how
>>> it has progressed.
>>
>> The difference here is that Mark's stuff wasn't as tightly integrated
>> with the kernel, IIRC. What I saw was more at a user level - make a big
>> file, map it, discard all the blocks, unlink the file.
>>
>> It was a good first step, but I think we can do a lot better by using
>> fs-specific calls to be efficient& targeted about the discards.
>>
>> Christoph has a similar approach for XFS, FWIW.
>>
>> -Eric
>
> I haven't looked closely at this patch, but I recall Mark consolidated
> numerous discontinuous trim/discard/unmap ranges into a single command
> to the SSD drive.
>
> That was why he felt he was getting superior performance. ie. There
> was an overhead per command to the drive that was eliminated if a
> single more complex command with multiple ranges went to the SSD
> drive.
>
> But he's the one that did the work and the benchmarking, so I'll let
> him take it from here, especially if I mis-understood what he was
> doing.
>
> Greg

This work certainly builds on Mark's early results, which clearly showed that
several devices see a win from doing larger/batched discards instead of
fine-grained ones.

Thanks!

Ric


2010-04-20 20:24:47

by Mark Lord

[permalink] [raw]
Subject: Re: Ext4: batched discard support

On 19/04/10 12:20 PM, Greg Freemyer wrote:
> Adding Mark Lord in cc.
..
> On Mon, Apr 19, 2010 at 6:55 AM, Lukas Czerner<[email protected]> wrote:
..
>> The basic idea behind my discard support is to create an ioctl which
>> walks through all the free extents in each allocating group and discard
>> those extents. As an addition to improve its performance one can specify
>> minimum free extent length, so ioctl will not bother with shorter extents.
..

Perfect. I proposed exactly this last year, but never found time to work it further.
Please proceed with it!

..
>> But you may notice, that there is one problem. bb_bitmap_deleted does
>> not survive umount. To bypass the problem the first ioctl call have to
>> walk through whole file system trimming all free extents. But there is a
>> better solution to this problem. The bb_bitmap_deleted can be stored on
>> disk an can be restored in mount time along with other bitmaps, but I
>> think it is a quite big change and should be discussed further.
..

That's not necessary. It is a simple matter to TRIM a clean filesystem
before it is mounted R/W during the boot sequence. No new information
is required for that operation.

My wiper.sh script already shows walking the freelists and trimming all
available space, something which takes only a couple of seconds on a 120GB
filesystem that has not been kept up-to-date with on-the-fly trim.

Cheers

2010-04-20 20:34:40

by Mark Lord

[permalink] [raw]
Subject: Re: Ext4: batched discard support

On 20/04/10 04:24 PM, Mark Lord wrote:
> On 19/04/10 12:20 PM, Greg Freemyer wrote:
>> Adding Mark Lord in cc.
> ..
>> On Mon, Apr 19, 2010 at 6:55 AM, Lukas Czerner<[email protected]>
>> wrote:
> ..
>>> The basic idea behind my discard support is to create an ioctl which
>>> walks through all the free extents in each allocating group and discard
>>> those extents. As an addition to improve its performance one can specify
>>> minimum free extent length, so ioctl will not bother with shorter
>>> extents.
> ..
>
> Perfect. I proposed exactly this last year, but never found time to work it further.
> Please proceed with it!
..

Ahh. here it is/was:

On August 17, 2009, Mark Lord wrote:
> Taking care of mounted RAID / LVM filesystems requires in-kernel TRIM
> support, possibly exported via an ioctl().
>
> Taking care of unmounted RAID / LVM filesystems is possible in userland,
> but would also benefit from in-kernel support, where layouts are defined
> and known better than in userland.
>
> The XFS_TRIM was an idea that Cristoph floated, as a concept for examination.
>
> I think something along those lines would be best, but perhaps with an
> interface at the VFS layer. Something that permits a userland tool
> to work like this (below) might be nearly ideal:
>
> main() {
> int fd = open(filesystem_device);
> while (1) {
> int g, ngroups = ioctl(fd, GET_NUMBER_OF_BLOCK_GROUPS);
> for (g = 0; g < ngroups; ++g) {
> ioctl(fd, TRIM_ALL_FREE_EXTENTS_OF_GROUP, g);
> }
> sleep(3600);
> }
> }
>
> Not all filesystems have a "block group", or "allocation group" structure,
> but I suspect that it's an easy mapping in most cases.
>
> With this scheme, the kernel is absolved of the need to track/coallesce
> TRIM requests entirely.
>
> Something like that, perhaps.
>
>

2010-04-20 21:28:28

by Greg Freemyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

Mark,

This is the patch implementing the new discard logic.

You did the benchmarking last year, but I thought you found calling
trim one contiguous sector range at a time was too inefficient.

See my highlight below:

On Mon, Apr 19, 2010 at 6:55 AM, Lukas Czerner <[email protected]> wrote:
> Create an ioctl which walks through all the free extents in each
> allocating group and discard those extents. As an addition to improve
> its performance one can specify minimum free extent length, so ioctl
> will not bother with shorter extents.
>
> This of course means, that with each invocation the ioctl must walk
> through whole file system, checking and discarding free extents, which
> is not very efficient. The best way to avoid this is to keep track of
> deleted (freed) blocks. Then the ioctl have to trim just those free
> extents which were recently freed.
>
> In order to implement this I have added new bitmap into ext4_group_info
> (bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
> walk through bb_bitmap_deleted, compare deleted extents with free
> extents trim them and then removes it from the bb_bitmap_deleted.
>
> But you may notice, that there is one problem. bb_bitmap_deleted does
> not survive umount. To bypass the problem the first ioctl call have to
> walk through whole file system trimming all free extents.
>
> Signed-off-by: Lukas Czerner <[email protected]>
> ---
> ?fs/ext4/ext4.h ? ?| ? ?4 +
> ?fs/ext4/mballoc.c | ?207 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> ?fs/ext4/super.c ? | ? ?1 +
> ?3 files changed, 202 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index bf938cf..e25f672 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1437,6 +1437,8 @@ extern int ext4_mb_add_groupinfo(struct super_block *sb,
> ?extern int ext4_mb_get_buddy_cache_lock(struct super_block *, ext4_group_t);
> ?extern void ext4_mb_put_buddy_cache_lock(struct super_block *,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ext4_group_t, int);
> +extern int ext4_trim_fs(unsigned int, struct super_block *);
> +
> ?/* inode.c */
> ?struct buffer_head *ext4_getblk(handle_t *, struct inode *,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ext4_lblk_t, int, int *);
> @@ -1682,6 +1684,8 @@ struct ext4_group_info {
> ?#ifdef DOUBLE_CHECK
> ? ? ? ?void ? ? ? ? ? ?*bb_bitmap;
> ?#endif
> + ? ? ? void ? ? ? ? ? ?*bb_bitmap_deleted;
> + ? ? ? ext4_grpblk_t ? bb_deleted;
> ? ? ? ?struct rw_semaphore alloc_sem;
> ? ? ? ?ext4_grpblk_t ? bb_counters[]; ?/* Nr of free power-of-two-block
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? * regions, index is order.
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index bde9d0b..fbc83fe 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2255,6 +2255,9 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
> ? ? ? ?INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
> ? ? ? ?init_rwsem(&meta_group_info[i]->alloc_sem);
> ? ? ? ?meta_group_info[i]->bb_free_root = RB_ROOT;
> + ? ? ? meta_group_info[i]->bb_deleted = -1;
> +
> +
>
> ?#ifdef DOUBLE_CHECK
> ? ? ? ?{
> @@ -2469,6 +2472,7 @@ int ext4_mb_release(struct super_block *sb)
> ?#ifdef DOUBLE_CHECK
> ? ? ? ? ? ? ? ? ? ? ? ?kfree(grinfo->bb_bitmap);
> ?#endif
> + ? ? ? ? ? ? ? ? ? ? ? kfree(grinfo->bb_bitmap_deleted);
> ? ? ? ? ? ? ? ? ? ? ? ?ext4_lock_group(sb, i);
> ? ? ? ? ? ? ? ? ? ? ? ?ext4_mb_cleanup_pa(grinfo);
> ? ? ? ? ? ? ? ? ? ? ? ?ext4_unlock_group(sb, i);
> @@ -2528,6 +2532,7 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
> ? ? ? ?int err, count = 0, count2 = 0;
> ? ? ? ?struct ext4_free_data *entry;
> ? ? ? ?struct list_head *l, *ltmp;
> + ? ? ? void *bitmap_deleted = NULL;
>
> ? ? ? ?list_for_each_safe(l, ltmp, &txn->t_private_list) {
> ? ? ? ? ? ? ? ?entry = list_entry(l, struct ext4_free_data, list);
> @@ -2543,6 +2548,14 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
> ? ? ? ? ? ? ? ?/* there are blocks to put in buddy to make them really free */
> ? ? ? ? ? ? ? ?count += entry->count;
> ? ? ? ? ? ? ? ?count2++;
> +
> + ? ? ? ? ? ? ? if (test_opt(sb, DISCARD) &&
> + ? ? ? ? ? ? ? ? ? ? ? (db->bb_bitmap_deleted == NULL) &&
> + ? ? ? ? ? ? ? ? ? ? ? (db->bb_deleted >= 0)) {
> + ? ? ? ? ? ? ? ? ? ? ? bitmap_deleted =
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? kmalloc(sb->s_blocksize, GFP_KERNEL);
> + ? ? ? ? ? ? ? }
> +
> ? ? ? ? ? ? ? ?ext4_lock_group(sb, entry->group);
> ? ? ? ? ? ? ? ?/* Take it out of per group rb tree */
> ? ? ? ? ? ? ? ?rb_erase(&entry->node, &(db->bb_free_root));
> @@ -2555,17 +2568,24 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
> ? ? ? ? ? ? ? ? ? ? ? ?page_cache_release(e4b.bd_buddy_page);
> ? ? ? ? ? ? ? ? ? ? ? ?page_cache_release(e4b.bd_bitmap_page);
> ? ? ? ? ? ? ? ?}
> - ? ? ? ? ? ? ? ext4_unlock_group(sb, entry->group);
> - ? ? ? ? ? ? ? if (test_opt(sb, DISCARD)) {
> - ? ? ? ? ? ? ? ? ? ? ? ext4_fsblk_t discard_block;
> -
> - ? ? ? ? ? ? ? ? ? ? ? discard_block = entry->start_blk +
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ext4_group_first_block_no(sb, entry->group);
> - ? ? ? ? ? ? ? ? ? ? ? trace_ext4_discard_blocks(sb,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? (unsigned long long)discard_block,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? entry->count);
> - ? ? ? ? ? ? ? ? ? ? ? sb_issue_discard(sb, discard_block, entry->count);
> + ? ? ? ? ? ? ? if (test_opt(sb, DISCARD) && (db->bb_deleted >= 0)) {
> + ? ? ? ? ? ? ? ? ? ? ? if (db->bb_bitmap_deleted == NULL) {
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? db->bb_bitmap_deleted = bitmap_deleted;
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? BUG_ON(db->bb_bitmap_deleted == NULL);
> +
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? bitmap_deleted = NULL;
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? mb_clear_bits(db->bb_bitmap_deleted,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0, EXT4_BLOCKS_PER_GROUP(sb));
> + ? ? ? ? ? ? ? ? ? ? ? }
> +
> + ? ? ? ? ? ? ? ? ? ? ? db->bb_deleted += entry->count;
> + ? ? ? ? ? ? ? ? ? ? ? mb_set_bits(db->bb_bitmap_deleted, entry->start_blk,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?entry->count);
> ? ? ? ? ? ? ? ?}
> + ? ? ? ? ? ? ? ext4_unlock_group(sb, entry->group);
> +
> + ? ? ? ? ? ? ? kfree(bitmap_deleted);
> +
> ? ? ? ? ? ? ? ?kmem_cache_free(ext4_free_ext_cachep, entry);
> ? ? ? ? ? ? ? ?ext4_mb_release_desc(&e4b);
> ? ? ? ?}
> @@ -4639,3 +4659,170 @@ error_return:
> ? ? ? ? ? ? ? ?kmem_cache_free(ext4_ac_cachep, ac);
> ? ? ? ?return;
> ?}
> +
> +/* Trim "count" blocks starting at "start" in "group"
> + * This must be called under group lock
> + */
> +void ext4_trim_extent(struct super_block *sb, int start, int count,
> + ? ? ? ? ? ? ? ext4_group_t group, struct ext4_buddy *e4b)
> +{
> + ? ? ? ext4_fsblk_t discard_block;
> + ? ? ? struct ext4_super_block *es = EXT4_SB(sb)->s_es;
> + ? ? ? struct ext4_free_extent ex;
> +
> + ? ? ? assert_spin_locked(ext4_group_lock_ptr(sb, group));
> +
> + ? ? ? ex.fe_start = start;
> + ? ? ? ex.fe_group = group;
> + ? ? ? ex.fe_len = count;
> +
> + ? ? ? mb_mark_used(e4b, &ex);
> + ? ? ? ext4_unlock_group(sb, group);
> +
> + ? ? ? discard_block = (ext4_fsblk_t)group *
> + ? ? ? ? ? ? ? ? ? ? ? EXT4_BLOCKS_PER_GROUP(sb)
> + ? ? ? ? ? ? ? ? ? ? ? + start
> + ? ? ? ? ? ? ? ? ? ? ? + le32_to_cpu(es->s_first_data_block);
> + ? ? ? trace_ext4_discard_blocks(sb,
> + ? ? ? ? ? ? ? ? ? ? ? (unsigned long long)discard_block,
> + ? ? ? ? ? ? ? ? ? ? ? count);
> + ? ? ? sb_issue_discard(sb, discard_block, count);
> +
> + ? ? ? ext4_lock_group(sb, group);
> + ? ? ? mb_free_blocks(NULL, e4b, start, ex.fe_len);
> +}

Mark, unless I'm missing something, sb_issue_discard() above is going
to trigger a trim command for just the one range. I thought the
benchmarks you did showed that a collection of ranges needed to be
built, then a single trim command invoked that trimmed that group of
ranges.

Greg

> +
> +/* Trim all free blocks, which have at least minlen length */
> +ext4_grpblk_t ext4_trim_all_free(struct super_block *sb, struct ext4_buddy *e4b,
> + ? ? ? ? ? ? ? ext4_grpblk_t minblocks)
> +{
> + ? ? ? void *bitmap;
> + ? ? ? ext4_grpblk_t max = EXT4_BLOCKS_PER_GROUP(sb);
> + ? ? ? ext4_grpblk_t start, next, count = 0;
> + ? ? ? ext4_group_t group;
> +
> + ? ? ? BUG_ON(e4b == NULL);
> +
> + ? ? ? bitmap = e4b->bd_bitmap;
> + ? ? ? group = e4b->bd_group;
> + ? ? ? start = 0;
> + ? ? ? ext4_lock_group(sb, group);
> +
> + ? ? ? while (start < max) {
> +
> + ? ? ? ? ? ? ? start = mb_find_next_zero_bit(bitmap, max, start);
> + ? ? ? ? ? ? ? if (start >= max)
> + ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? ? ? ? ? next = mb_find_next_bit(bitmap, max, start);
> +
> + ? ? ? ? ? ? ? if ((next - start) >= minblocks) {
> +
> + ? ? ? ? ? ? ? ? ? ? ? count += next - start;
> + ? ? ? ? ? ? ? ? ? ? ? ext4_trim_extent(sb, start,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? next - start, group, e4b);
> +
> + ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? start = next + 1;
> + ? ? ? }
> +
> + ? ? ? e4b->bd_info->bb_deleted = 0;
> + ? ? ? ext4_unlock_group(sb, group);
> +
> + ? ? ? return count;
> +}
> +
> +/* Trim only blocks which is free and in bb_bitmap_deleted */
> +ext4_grpblk_t ext4_trim_deleted(struct super_block *sb, struct ext4_buddy *e4b,
> + ? ? ? ? ? ? ? ext4_grpblk_t minblocks)
> +{
> + ? ? ? void *bitmap;
> + ? ? ? struct ext4_group_info *grp;
> + ? ? ? ext4_group_t group;
> + ? ? ? ext4_grpblk_t max, next, count = 0, start = 0;
> +
> + ? ? ? BUG_ON(e4b == NULL);
> +
> + ? ? ? bitmap = e4b->bd_bitmap;
> + ? ? ? group = e4b->bd_group;
> + ? ? ? grp = ext4_get_group_info(sb, group);
> +
> + ? ? ? if (grp->bb_deleted < minblocks)
> + ? ? ? ? ? ? ? return 0;
> +
> + ? ? ? ext4_lock_group(sb, group);
> +
> + ? ? ? while (start < EXT4_BLOCKS_PER_GROUP(sb)) {
> + ? ? ? ? ? ? ? start = mb_find_next_bit(grp->bb_bitmap_deleted,
> + ? ? ? ? ? ? ? ? ? ? ? EXT4_BLOCKS_PER_GROUP(sb), start);
> + ? ? ? ? ? ? ? max = mb_find_next_zero_bit(grp->bb_bitmap_deleted,
> + ? ? ? ? ? ? ? ? ? ? ? EXT4_BLOCKS_PER_GROUP(sb), start);
> +
> + ? ? ? ? ? ? ? while (start < max) {
> + ? ? ? ? ? ? ? ? ? ? ? start = mb_find_next_zero_bit(bitmap, max, start);
> + ? ? ? ? ? ? ? ? ? ? ? if (start >= max)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? ? ? ? ? ? ? ? ? next = mb_find_next_bit(bitmap, max, start);
> + ? ? ? ? ? ? ? ? ? ? ? if (next > max)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? next = max;
> +
> + ? ? ? ? ? ? ? ? ? ? ? if ((next - start) >= minblocks) {
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? count += next - start;
> +
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ext4_trim_extent(sb, start,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? next - start, group, e4b);
> +
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? mb_clear_bits(grp->bb_bitmap_deleted,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? start, next - start);
> + ? ? ? ? ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? ? ? ? ? start = next + 1;
> + ? ? ? ? ? ? ? }
> + ? ? ? }
> + ? ? ? grp->bb_deleted -= count;
> +
> + ? ? ? ext4_unlock_group(sb, group);
> +
> + ? ? ? ext4_debug("trimmed %d blocks in the group %d\n",
> + ? ? ? ? ? ? ? count, group);
> +
> + ? ? ? return count;
> +}
> +
> +int ext4_trim_fs(unsigned int minlen, struct super_block *sb)
> +{
> + ? ? ? struct ext4_buddy e4b;
> + ? ? ? struct ext4_group_info *grp;
> + ? ? ? ext4_group_t group;
> + ? ? ? ext4_group_t ngroups = ext4_get_groups_count(sb);
> + ? ? ? ext4_grpblk_t minblocks;
> +
> + ? ? ? if (!test_opt(sb, DISCARD))
> + ? ? ? ? ? ? ? return 0;
> +
> + ? ? ? minblocks = DIV_ROUND_UP(minlen, sb->s_blocksize);
> + ? ? ? if (unlikely(minblocks > EXT4_BLOCKS_PER_GROUP(sb)))
> + ? ? ? ? ? ? ? return -EINVAL;
> +
> + ? ? ? for (group = 0; group < ngroups; group++) {
> + ? ? ? ? ? ? ? int err;
> +
> + ? ? ? ? ? ? ? err = ext4_mb_load_buddy(sb, group, &e4b);
> + ? ? ? ? ? ? ? if (err) {
> + ? ? ? ? ? ? ? ? ? ? ? ext4_error(sb, "Error in loading buddy "
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? "information for %u", group);
> + ? ? ? ? ? ? ? ? ? ? ? continue;
> + ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? grp = ext4_get_group_info(sb, group);
> +
> + ? ? ? ? ? ? ? if (grp->bb_deleted < 0) {
> + ? ? ? ? ? ? ? ? ? ? ? /* First run after mount */
> + ? ? ? ? ? ? ? ? ? ? ? ext4_trim_all_free(sb, &e4b, minblocks);
> + ? ? ? ? ? ? ? } else if (grp->bb_deleted >= minblocks) {
> + ? ? ? ? ? ? ? ? ? ? ? /* Trim only blocks deleted since first run */
> + ? ? ? ? ? ? ? ? ? ? ? ext4_trim_deleted(sb, &e4b, minblocks);
> + ? ? ? ? ? ? ? }
> +
> + ? ? ? ? ? ? ? ext4_mb_release_desc(&e4b);
> + ? ? ? }
> +
> + ? ? ? return 0;
> +}
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index e14d22c..253eb98 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1109,6 +1109,7 @@ static const struct super_operations ext4_sops = {
> ? ? ? ?.quota_write ? ?= ext4_quota_write,
> ?#endif
> ? ? ? ?.bdev_try_to_free_page = bdev_try_to_free_page,
> + ? ? ? .trim_fs ? ? ? ?= ext4_trim_fs
> ?};
>
> ?static const struct super_operations ext4_nojournal_sops = {
> --
> 1.6.6.1

2010-04-21 02:26:15

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 20/04/10 05:21 PM, Greg Freemyer wrote:
> Mark,
>
> This is the patch implementing the new discard logic.
..
> Signed-off-by: Lukas Czerner <[email protected]>
..
>> +void ext4_trim_extent(struct super_block *sb, int start, int count,
>> + ext4_group_t group, struct ext4_buddy *e4b)
>> +{
>> + ext4_fsblk_t discard_block;
>> + struct ext4_super_block *es = EXT4_SB(sb)->s_es;
>> + struct ext4_free_extent ex;
>> +
>> + assert_spin_locked(ext4_group_lock_ptr(sb, group));
>> +
>> + ex.fe_start = start;
>> + ex.fe_group = group;
>> + ex.fe_len = count;
>> +
>> + mb_mark_used(e4b,&ex);
>> + ext4_unlock_group(sb, group);
>> +
>> + discard_block = (ext4_fsblk_t)group *
>> + EXT4_BLOCKS_PER_GROUP(sb)
>> + + start
>> + + le32_to_cpu(es->s_first_data_block);
>> + trace_ext4_discard_blocks(sb,
>> + (unsigned long long)discard_block,
>> + count);
>> + sb_issue_discard(sb, discard_block, count);
>> +
>> + ext4_lock_group(sb, group);
>> + mb_free_blocks(NULL, e4b, start, ex.fe_len);
>> +}
>
> Mark, unless I'm missing something, sb_issue_discard() above is going
> to trigger a trim command for just the one range. I thought the
> benchmarks you did showed that a collection of ranges needed to be
> built, then a single trim command invoked that trimmed that group of
> ranges.
..

Mmm.. If that's what it is doing, then this patch set would be a complete disaster.
It would take *hours* to do the initial TRIM.

Lukas ?

2010-04-21 02:45:35

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

Mark Lord wrote:
> On 20/04/10 05:21 PM, Greg Freemyer wrote:
>> Mark,
>>
>> This is the patch implementing the new discard logic.
> ..
>> Signed-off-by: Lukas Czerner <[email protected]>
> ..
>>> +void ext4_trim_extent(struct super_block *sb, int start, int count,
>>> + ext4_group_t group, struct ext4_buddy *e4b)
>>> +{
>>> + ext4_fsblk_t discard_block;
>>> + struct ext4_super_block *es = EXT4_SB(sb)->s_es;
>>> + struct ext4_free_extent ex;
>>> +
>>> + assert_spin_locked(ext4_group_lock_ptr(sb, group));
>>> +
>>> + ex.fe_start = start;
>>> + ex.fe_group = group;
>>> + ex.fe_len = count;
>>> +
>>> + mb_mark_used(e4b,&ex);
>>> + ext4_unlock_group(sb, group);
>>> +
>>> + discard_block = (ext4_fsblk_t)group *
>>> + EXT4_BLOCKS_PER_GROUP(sb)
>>> + + start
>>> + + le32_to_cpu(es->s_first_data_block);
>>> + trace_ext4_discard_blocks(sb,
>>> + (unsigned long long)discard_block,
>>> + count);
>>> + sb_issue_discard(sb, discard_block, count);
>>> +
>>> + ext4_lock_group(sb, group);
>>> + mb_free_blocks(NULL, e4b, start, ex.fe_len);
>>> +}
>>
>> Mark, unless I'm missing something, sb_issue_discard() above is going
>> to trigger a trim command for just the one range. I thought the
>> benchmarks you did showed that a collection of ranges needed to be
>> built, then a single trim command invoked that trimmed that group of
>> ranges.
> ..
>
> Mmm.. If that's what it is doing, then this patch set would be a
> complete disaster.
> It would take *hours* to do the initial TRIM.
>
> Lukas ?

I'm confused; do we have an interface to send a trim command for multiple ranges?

I didn't think so ... Lukas' patch is finding free ranges (above a size threshold)
to discard; it's not doing it a block at a time, if that's the concern.

-Eric

2010-04-21 18:59:24

by Greg Freemyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On Tue, Apr 20, 2010 at 10:45 PM, Eric Sandeen <[email protected]> wrote:
> Mark Lord wrote:
>> On 20/04/10 05:21 PM, Greg Freemyer wrote:
>>> Mark,
>>>
>>> This is the patch implementing the new discard logic.
>> ..
>>> Signed-off-by: Lukas Czerner <[email protected]>
>> ..
>>>> +void ext4_trim_extent(struct super_block *sb, int start, int count,
>>>> + ? ? ? ? ? ? ? ext4_group_t group, struct ext4_buddy *e4b)
>>>> +{
>>>> + ? ? ? ext4_fsblk_t discard_block;
>>>> + ? ? ? struct ext4_super_block *es = EXT4_SB(sb)->s_es;
>>>> + ? ? ? struct ext4_free_extent ex;
>>>> +
>>>> + ? ? ? assert_spin_locked(ext4_group_lock_ptr(sb, group));
>>>> +
>>>> + ? ? ? ex.fe_start = start;
>>>> + ? ? ? ex.fe_group = group;
>>>> + ? ? ? ex.fe_len = count;
>>>> +
>>>> + ? ? ? mb_mark_used(e4b,&ex);
>>>> + ? ? ? ext4_unlock_group(sb, group);
>>>> +
>>>> + ? ? ? discard_block = (ext4_fsblk_t)group *
>>>> + ? ? ? ? ? ? ? ? ? ? ? EXT4_BLOCKS_PER_GROUP(sb)
>>>> + ? ? ? ? ? ? ? ? ? ? ? + start
>>>> + ? ? ? ? ? ? ? ? ? ? ? + le32_to_cpu(es->s_first_data_block);
>>>> + ? ? ? trace_ext4_discard_blocks(sb,
>>>> + ? ? ? ? ? ? ? ? ? ? ? (unsigned long long)discard_block,
>>>> + ? ? ? ? ? ? ? ? ? ? ? count);
>>>> + ? ? ? sb_issue_discard(sb, discard_block, count);
>>>> +
>>>> + ? ? ? ext4_lock_group(sb, group);
>>>> + ? ? ? mb_free_blocks(NULL, e4b, start, ex.fe_len);
>>>> +}
>>>
>>> Mark, unless I'm missing something, sb_issue_discard() above is going
>>> to trigger a trim command for just the one range. ?I thought the
>>> benchmarks you did showed that a collection of ranges needed to be
>>> built, then a single trim command invoked that trimmed that group of
>>> ranges.
>> ..
>>
>> Mmm.. If that's what it is doing, then this patch set would be a
>> complete disaster.
>> It would take *hours* to do the initial TRIM.
>>
>> Lukas ?
>
> I'm confused; do we have an interface to send a trim command for multiple ranges?
>
> I didn't think so ... ?Lukas' patch is finding free ranges (above a size threshold)
> to discard; it's not doing it a block at a time, if that's the concern.
>
> -Eric

Eric,

I don't know what kernel APIs have been created to support discard,
but the ATA8 draft spec. allows for specifying multiple ranges in one
trim command.

See section 7.10.3.1 and .2 of the latest draft spec.

Both talk about multiple trim ranges per trim command (think thousands
of ranges per command).

Recent hdparm versions accept a trim command argument that causes
multiple ranges to be trimmed per command.

--trim-sector-ranges         Tell SSD firmware to discard unneeded
                             data sectors: lba:count ..
--trim-sector-ranges-stdin   Same as above, but reads lba:count pairs from stdin

As I understand it, this is critical from a performance perspective
for the SSDs Mark tested with, i.e. he found that a single trim command
with 1000 ranges takes much less time than 1000 discrete trim
commands.

Per Mark's comments in wiper.sh, a trim command can have a minimum of
128KB of associated range information, so it is thousands of ranges
that can be discarded in a single command.

That is, hdparm can accept extremely large lists of ranges on stdin, but it
parses the list into discrete trim commands with thousands of ranges
per command.

A kernel implementation which is trying to implement after-the-fact
discards, as this patch is doing, also needs to somehow craft trim
commands with a large payload of ranges if it is going to be
efficient.
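
For reference, an ATA TRIM (DATA SET MANAGEMENT) payload is just an array of
8-byte entries, each holding a 48-bit starting LBA and a 16-bit sector count,
which is how so many ranges fit into a single command. A rough sketch of the
packing (illustrative userspace code only, not anything in the kernel):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * One TRIM range entry is 8 bytes: bits 47:0 are the starting LBA,
 * bits 63:48 are the number of sectors (max 65535 per entry).
 * A 512-byte payload block therefore carries up to 64 ranges.
 */
static int pack_trim_range(uint8_t *buf, uint64_t lba, uint16_t nsectors)
{
        uint64_t entry;
        int i;

        if (lba >> 48)                  /* LBA must fit in 48 bits */
                return -1;

        entry = ((uint64_t)nsectors << 48) | lba;
        for (i = 0; i < 8; i++)         /* entries are little-endian */
                buf[i] = (entry >> (8 * i)) & 0xff;
        return 0;
}

int main(void)
{
        uint8_t payload[512];

        memset(payload, 0, sizeof(payload));    /* unused entries left as zeroes */
        pack_trim_range(payload + 0, 123456, 2048);     /* range 1 */
        pack_trim_range(payload + 8, 987654,  512);     /* range 2 */
        /* ... up to 64 ranges per 512-byte payload block ... */
        printf("packed 2 of 64 possible ranges\n");
        return 0;
}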

If the block layer cannot do this yet, then in my opinion this type of
batched discarding needs to stay in user space as done with Mark's
wiper.sh script and enhanced hdparm until the block layer grows that
ability.

Greg

2010-04-21 19:02:20

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 04/21/2010 02:59 PM, Greg Freemyer wrote:
> On Tue, Apr 20, 2010 at 10:45 PM, Eric Sandeen<[email protected]> wrote:
>> Mark Lord wrote:
>>> On 20/04/10 05:21 PM, Greg Freemyer wrote:
>>>> Mark,
>>>>
>>>> This is the patch implementing the new discard logic.
>>> ..
>>>> Signed-off-by: Lukas Czerner<[email protected]>
>>> ..
>>>>> +void ext4_trim_extent(struct super_block *sb, int start, int count,
>>>>> + ext4_group_t group, struct ext4_buddy *e4b)
>>>>> +{
>>>>> + ext4_fsblk_t discard_block;
>>>>> + struct ext4_super_block *es = EXT4_SB(sb)->s_es;
>>>>> + struct ext4_free_extent ex;
>>>>> +
>>>>> + assert_spin_locked(ext4_group_lock_ptr(sb, group));
>>>>> +
>>>>> + ex.fe_start = start;
>>>>> + ex.fe_group = group;
>>>>> + ex.fe_len = count;
>>>>> +
>>>>> + mb_mark_used(e4b,&ex);
>>>>> + ext4_unlock_group(sb, group);
>>>>> +
>>>>> + discard_block = (ext4_fsblk_t)group *
>>>>> + EXT4_BLOCKS_PER_GROUP(sb)
>>>>> + + start
>>>>> + + le32_to_cpu(es->s_first_data_block);
>>>>> + trace_ext4_discard_blocks(sb,
>>>>> + (unsigned long long)discard_block,
>>>>> + count);
>>>>> + sb_issue_discard(sb, discard_block, count);
>>>>> +
>>>>> + ext4_lock_group(sb, group);
>>>>> + mb_free_blocks(NULL, e4b, start, ex.fe_len);
>>>>> +}
>>>>
>>>> Mark, unless I'm missing something, sb_issue_discard() above is going
>>>> to trigger a trim command for just the one range. I thought the
>>>> benchmarks you did showed that a collection of ranges needed to be
>>>> built, then a single trim command invoked that trimmed that group of
>>>> ranges.
>>> ..
>>>
>>> Mmm.. If that's what it is doing, then this patch set would be a
>>> complete disaster.
>>> It would take *hours* to do the initial TRIM.
>>>
>>> Lukas ?
>>
>> I'm confused; do we have an interface to send a trim command for multiple ranges?
>>
>> I didn't think so ... Lukas' patch is finding free ranges (above a size threshold)
>> to discard; it's not doing it a block at a time, if that's the concern.
>>
>> -Eric
>
> Eric,
>
> I don't know what kernel APIs have been created to support discard,
> but the ATA8 draft spec. allows for specifying multiple ranges in one
> trim command.
>

Greg,

We have full support for this in the "discard" support at the file system layer
for several file systems.

The block layer effectively muxes the "discard" into the right target device
command. TRIM for ATA, WRITE_SAME (with unmap) or UNMAP for SCSI...

If your favourite fs supports this, you can enable this feature with
"-o discard" for fine-grained discards.

ric

2010-04-21 19:22:26

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

Ric Wheeler <[email protected]> writes:

> On 04/21/2010 02:59 PM, Greg Freemyer wrote:
>> On Tue, Apr 20, 2010 at 10:45 PM, Eric Sandeen<[email protected]> wrote:
>>> Mark Lord wrote:
>>>> On 20/04/10 05:21 PM, Greg Freemyer wrote:
>>>>> Mark,
>>>>>
>>>>> This is the patch implementing the new discard logic.
>>>> ..
>>>>> Signed-off-by: Lukas Czerner<[email protected]>
>>>> ..
>>>>>> +void ext4_trim_extent(struct super_block *sb, int start, int count,
>>>>>> + ext4_group_t group, struct ext4_buddy *e4b)
>>>>>> +{
>>>>>> + ext4_fsblk_t discard_block;
>>>>>> + struct ext4_super_block *es = EXT4_SB(sb)->s_es;
>>>>>> + struct ext4_free_extent ex;
>>>>>> +
>>>>>> + assert_spin_locked(ext4_group_lock_ptr(sb, group));
>>>>>> +
>>>>>> + ex.fe_start = start;
>>>>>> + ex.fe_group = group;
>>>>>> + ex.fe_len = count;
>>>>>> +
>>>>>> + mb_mark_used(e4b,&ex);
>>>>>> + ext4_unlock_group(sb, group);
>>>>>> +
>>>>>> + discard_block = (ext4_fsblk_t)group *
>>>>>> + EXT4_BLOCKS_PER_GROUP(sb)
>>>>>> + + start
>>>>>> + + le32_to_cpu(es->s_first_data_block);
>>>>>> + trace_ext4_discard_blocks(sb,
>>>>>> + (unsigned long long)discard_block,
>>>>>> + count);
>>>>>> + sb_issue_discard(sb, discard_block, count);
>>>>>> +
>>>>>> + ext4_lock_group(sb, group);
>>>>>> + mb_free_blocks(NULL, e4b, start, ex.fe_len);
>>>>>> +}
>>>>>
>>>>> Mark, unless I'm missing something, sb_issue_discard() above is going
>>>>> to trigger a trim command for just the one range. I thought the
>>>>> benchmarks you did showed that a collection of ranges needed to be
>>>>> built, then a single trim command invoked that trimmed that group of
>>>>> ranges.
>>>> ..
>>>>
>>>> Mmm.. If that's what it is doing, then this patch set would be a
>>>> complete disaster.
>>>> It would take *hours* to do the initial TRIM.

Except it doesn't. Lukas did provide numbers in his original email.

>>>> Lukas ?
>>>
>>> I'm confused; do we have an interface to send a trim command for multiple ranges?
>>>
>>> I didn't think so ... Lukas' patch is finding free ranges (above a size threshold)
>>> to discard; it's not doing it a block at a time, if that's the concern.
>>>
>>> -Eric
>>
>> Eric,
>>
>> I don't know what kernel APIs have been created to support discard,
>> but the ATA8 draft spec. allows for specifying multiple ranges in one
>> trim command.

Well, sb_issue_discard is what ext4 is using, and that takes a single
range. I don't know if anyone has looked into adding a vectored API.
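
Purely as an illustration, a vectored variant might look something like the
sketch below - nothing like this exists in the tree today, and the names are
invented:

/* hypothetical - sketch only, not an existing kernel interface */
struct discard_range {
	sector_t	start;		/* first sector of the range */
	sector_t	nr_sects;	/* number of sectors to discard */
};

int sb_issue_discard_ranges(struct super_block *sb,
			    struct discard_range *ranges, unsigned int nr);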

>
> Greg,
>
> We have full support for this in the "discard" support at the file
> system layer for several file systems.

Actually, we don't support what Greg is talking about, to my knowledge.

> The block layer effectively muxes the "discard" into the right target
> device command. TRIM for ATA, WRITE_SAME (with unmap) or UNMAP for
> SCSI...
>
> If your favourite fs supports this, you can enable this feature with
> "-o
> discard" for fine grained discards,

Thanks, it's worth pointing out that TRIM is not the only backend to the
discard API. However, even if we do implement a vectored API, we can
translate that to dumber commands if a given spec doesn't support it.

Getting back to the problem...

From the file system, you want to discard discrete ranges of blocks.
The API to support this can either take care of the data integrity
guarantees by itself, or make the upper layer ensure that trim and write
do not pass each other. The current implementation does the latter. In
order to do the former, there is the potential for a lot of overhead to
be introduced into the block allocation layers for the file systems.
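
Concretely, the ext4 patch implements the latter by briefly marking the
extent as allocated around the discard, so the allocator cannot hand those
blocks out while the TRIM is in flight; paraphrasing ext4_trim_extent() from
patch 2 (the caller holds the group lock on entry and keeps it on exit):

	mb_mark_used(e4b, &ex);            /* hide the extent from the allocator */
	ext4_unlock_group(sb, group);      /* don't issue I/O under a spinlock   */

	sb_issue_discard(sb, discard_block, count);   /* the actual discard      */

	ext4_lock_group(sb, group);
	mb_free_blocks(NULL, e4b, start, ex.fe_len);  /* hand the blocks back    */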

So, given the above, it is up to the file system to send down the
biggest discard requests it can in order to reduce the overhead of the
command. If a vectored approach is made available, then that would be
even better. Christoph, is this something that's on your radar?

Cheers,
Jeff

2010-04-21 20:44:52

by Greg Freemyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On Wed, Apr 21, 2010 at 3:22 PM, Jeff Moyer <[email protected]> wrote:
> Ric Wheeler <[email protected]> writes:
>
>> On 04/21/2010 02:59 PM, Greg Freemyer wrote:
>>> On Tue, Apr 20, 2010 at 10:45 PM, Eric Sandeen<[email protected]> wrote:
>>>> Mark Lord wrote:
>>>>> On 20/04/10 05:21 PM, Greg Freemyer wrote:
>>>>>> Mark,
>>>>>>
>>>>>> This is the patch implementing the new discard logic.
>>>>> ..
>>>>>> Signed-off-by: Lukas Czerner<[email protected]>
>>>>> ..
>>>>>>> +void ext4_trim_extent(struct super_block *sb, int start, int count,
>>>>>>> + ext4_group_t group, struct ext4_buddy *e4b)
>>>>>>> +{
>>>>>>> + ext4_fsblk_t discard_block;
>>>>>>> + struct ext4_super_block *es = EXT4_SB(sb)->s_es;
>>>>>>> + struct ext4_free_extent ex;
>>>>>>> +
>>>>>>> + assert_spin_locked(ext4_group_lock_ptr(sb, group));
>>>>>>> +
>>>>>>> + ex.fe_start = start;
>>>>>>> + ex.fe_group = group;
>>>>>>> + ex.fe_len = count;
>>>>>>> +
>>>>>>> + mb_mark_used(e4b,&ex);
>>>>>>> + ext4_unlock_group(sb, group);
>>>>>>> +
>>>>>>> + discard_block = (ext4_fsblk_t)group *
>>>>>>> + EXT4_BLOCKS_PER_GROUP(sb)
>>>>>>> + + start
>>>>>>> + + le32_to_cpu(es->s_first_data_block);
>>>>>>> + trace_ext4_discard_blocks(sb,
>>>>>>> + (unsigned long long)discard_block,
>>>>>>> + count);
>>>>>>> + sb_issue_discard(sb, discard_block, count);
>>>>>>> +
>>>>>>> + ext4_lock_group(sb, group);
>>>>>>> + mb_free_blocks(NULL, e4b, start, ex.fe_len);
>>>>>>> +}
>>>>>>
>>>>>> Mark, unless I'm missing something, sb_issue_discard() above is going
>>>>>> to trigger a trim command for just the one range. I thought the
>>>>>> benchmarks you did showed that a collection of ranges needed to be
>>>>>> built, then a single trim command invoked that trimmed that group of
>>>>>> ranges.
>>>>> ..
>>>>>
>>>>> Mmm.. If that's what it is doing, then this patch set would be a
>>>>> complete disaster.
>>>>> It would take *hours* to do the initial TRIM.
>
> Except it doesn't. Lukas did provide numbers in his original email.
>

Looking at the benchmarks (for the first time) at
http://people.redhat.com/jmoyer/discard/ext4_batched_discard/

I don't see anything that says how long the proposed trim ioctl takes
to complete on the full filesystem.

What they do show is that with the 3 test SSDs used for this
benchmark, the currently released discard implementation is a net loss,
i.e. you are better off running without the discards for all 3 vendors
(at least under the conditions tested).

After the patch is applied and the discards are limited to large free
extents only, performance works out the same with or without the
discards, i.e. no net gain or loss.

That is extremely cool because one assumes that the non-discard case
would degrade over time, but that the discard case will not.

So that argues for the current proposed patch going in.

But quoting from the first email:

==
The basic idea behind my discard support is to create an ioctl which
walks through all the free extents in each allocating group and discard
those extents. As an addition to improve its performance one can specify
minimum free extent length, so ioctl will not bother with shorter extents.

This of course means, that with each invocation the ioctl must walk
through whole file system, checking and discarding free extents, which
is not very efficient. The best way to avoid this is to keep track of
deleted (freed) blocks. Then the ioctl have to trim just those free
extents which were recently freed.

In order to implement this I have added new bitmap into ext4_group_info
(bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
walk through bb_bitmap_deleted, compare deleted extents with free
extents trim them and then removes it from the bb_bitmap_deleted.

But you may notice, that there is one problem. bb_bitmap_deleted does
not survive umount. To bypass the problem the first ioctl call have to
walk through whole file system trimming all free extents. But there is a
better solution to this problem. The bb_bitmap_deleted can be stored on
disk an can be restored in mount time along with other bitmaps, but I
think it is a quite big change and should be discussed further.
==

The above seems to argue against the patch going in until the
mount/umount issues are addressed.

So in addition to this patch, Lukas is proposing an on-disk change to
address the fact that calling trim umpteen times at mount time is too
slow.

Per Mark's testing of last summer, an alternative solution is to use a
vectored trim approach that is far more efficient.

Mark's benchmarks showed this as doable in seconds which seems like a
reasonable amount of time for a mount time operation.

Greg

2010-04-21 20:52:57

by Greg Freemyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

correcting Christoph's email address - no other edits/comments

On Wed, Apr 21, 2010 at 3:22 PM, Jeff Moyer <[email protected]> wrote:
> Ric Wheeler <[email protected]> writes:
>
>> On 04/21/2010 02:59 PM, Greg Freemyer wrote:
>>> On Tue, Apr 20, 2010 at 10:45 PM, Eric Sandeen<[email protected]> wrote:
>>>> Mark Lord wrote:
>>>>> On 20/04/10 05:21 PM, Greg Freemyer wrote:
>>>>>> Mark,
>>>>>>
>>>>>> This is the patch implementing the new discard logic.
>>>>> ..
>>>>>> Signed-off-by: Lukas Czerner<[email protected]>
>>>>> ..
>>>>>>> +void ext4_trim_extent(struct super_block *sb, int start, int count,
>>>>>>> + ext4_group_t group, struct ext4_buddy *e4b)
>>>>>>> +{
>>>>>>> + ext4_fsblk_t discard_block;
>>>>>>> + struct ext4_super_block *es = EXT4_SB(sb)->s_es;
>>>>>>> + struct ext4_free_extent ex;
>>>>>>> +
>>>>>>> + assert_spin_locked(ext4_group_lock_ptr(sb, group));
>>>>>>> +
>>>>>>> + ex.fe_start = start;
>>>>>>> + ex.fe_group = group;
>>>>>>> + ex.fe_len = count;
>>>>>>> +
>>>>>>> + mb_mark_used(e4b,&ex);
>>>>>>> + ext4_unlock_group(sb, group);
>>>>>>> +
>>>>>>> + discard_block = (ext4_fsblk_t)group *
>>>>>>> + EXT4_BLOCKS_PER_GROUP(sb)
>>>>>>> + + start
>>>>>>> + + le32_to_cpu(es->s_first_data_block);
>>>>>>> + trace_ext4_discard_blocks(sb,
>>>>>>> + (unsigned long long)discard_block,
>>>>>>> + count);
>>>>>>> + sb_issue_discard(sb, discard_block, count);
>>>>>>> +
>>>>>>> + ext4_lock_group(sb, group);
>>>>>>> + mb_free_blocks(NULL, e4b, start, ex.fe_len);
>>>>>>> +}
>>>>>>
>>>>>> Mark, unless I'm missing something, sb_issue_discard() above is going
>>>>>> to trigger a trim command for just the one range. I thought the
>>>>>> benchmarks you did showed that a collection of ranges needed to be
>>>>>> built, then a single trim command invoked that trimmed that group of
>>>>>> ranges.
>>>>> ..
>>>>>
>>>>> Mmm.. If that's what it is doing, then this patch set would be a
>>>>> complete disaster.
>>>>> It would take *hours* to do the initial TRIM.
>
> Except it doesn't. Lukas did provide numbers in his original email.
>
>>>>> Lukas ?
>>>>
>>>> I'm confused; do we have an interface to send a trim command for multiple ranges?
>>>>
>>>> I didn't think so ... Lukas' patch is finding free ranges (above a size threshold)
>>>> to discard; it's not doing it a block at a time, if that's the concern.
>>>>
>>>> -Eric
>>>
>>> Eric,
>>>
>>> I don't know what kernel APIs have been created to support discard,
>>> but the ATA8 draft spec. allows for specifying multiple ranges in one
>>> trim command.
>
> Well, sb_issue_discard is what ext4 is using, and that takes a single
> range. I don't know if anyone has looked into adding a vectored API.
>
>>
>> Greg,
>>
>> We have full support for this in the "discard" support at the file
>> system layer for several file systems.
>
> Actually, we don't support what Greg is talking about, to my knowledge.
>
>> The block layer effectively muxes the "discard" into the right target
>> device command. TRIM for ATA, WRITE_SAME (with unmap) or UNMAP for
>> SCSI...
>>
>> If your favourite fs supports this, you can enable this feature with
>> "-o
>> discard" for fine grained discards,
>
> Thanks, it's worth pointing out that TRIM is not the only backend to the
> discard API. However, even if we do implement a vectored API, we can
> translate that to dumber commands if a given spec doesn't support it.
>
> Getting back to the problem...
>
> From the file system, you want to discard discrete ranges of blocks.
> The API to support this can either take care of the data integrity
> guarantees by itself, or make the upper layer ensure that trim and write
> do not pass each other. The current implementation does the latter. In
> order to do the former, there is the potential for a lot of overhead to
> be introduced into the block allocation layers for the file systems.
>
> So, given the above, it is up to the file system to send down the
> biggest discard requests it can in order to reduce the overhead of the
> command. If a vectored approach is made available, then that would be
> even better. Christoph, is this something that's on your radar?
>
> Cheers,
> Jeff
>



--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

2010-04-21 20:53:54

by Greg Freemyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

correcting Christoph's email address - no other edits/comments

On Wed, Apr 21, 2010 at 4:44 PM, Greg Freemyer <[email protected]> wrote:
> On Wed, Apr 21, 2010 at 3:22 PM, Jeff Moyer <[email protected]> wrote:
>> Ric Wheeler <[email protected]> writes:
>>
>>> On 04/21/2010 02:59 PM, Greg Freemyer wrote:
>>>> On Tue, Apr 20, 2010 at 10:45 PM, Eric Sandeen<[email protected]> wrote:
>>>>> Mark Lord wrote:
>>>>>> On 20/04/10 05:21 PM, Greg Freemyer wrote:
>>>>>>> Mark,
>>>>>>>
>>>>>>> This is the patch implementing the new discard logic.
>>>>>> ..
>>>>>>> Signed-off-by: Lukas Czerner<[email protected]>
>>>>>> ..
>>>>>>>> +void ext4_trim_extent(struct super_block *sb, int start, int count,
>>>>>>>> + ext4_group_t group, struct ext4_buddy *e4b)
>>>>>>>> +{
>>>>>>>> + ext4_fsblk_t discard_block;
>>>>>>>> + struct ext4_super_block *es = EXT4_SB(sb)->s_es;
>>>>>>>> + struct ext4_free_extent ex;
>>>>>>>> +
>>>>>>>> + assert_spin_locked(ext4_group_lock_ptr(sb, group));
>>>>>>>> +
>>>>>>>> + ex.fe_start = start;
>>>>>>>> + ex.fe_group = group;
>>>>>>>> + ex.fe_len = count;
>>>>>>>> +
>>>>>>>> + mb_mark_used(e4b,&ex);
>>>>>>>> + ext4_unlock_group(sb, group);
>>>>>>>> +
>>>>>>>> + discard_block = (ext4_fsblk_t)group *
>>>>>>>> + EXT4_BLOCKS_PER_GROUP(sb)
>>>>>>>> + + start
>>>>>>>> + + le32_to_cpu(es->s_first_data_block);
>>>>>>>> + trace_ext4_discard_blocks(sb,
>>>>>>>> + (unsigned long long)discard_block,
>>>>>>>> + count);
>>>>>>>> + sb_issue_discard(sb, discard_block, count);
>>>>>>>> +
>>>>>>>> + ext4_lock_group(sb, group);
>>>>>>>> + mb_free_blocks(NULL, e4b, start, ex.fe_len);
>>>>>>>> +}
>>>>>>>
>>>>>>> Mark, unless I'm missing something, sb_issue_discard() above is going
>>>>>>> to trigger a trim command for just the one range. I thought the
>>>>>>> benchmarks you did showed that a collection of ranges needed to be
>>>>>>> built, then a single trim command invoked that trimmed that group of
>>>>>>> ranges.
>>>>>> ..
>>>>>>
>>>>>> Mmm.. If that's what it is doing, then this patch set would be a
>>>>>> complete disaster.
>>>>>> It would take *hours* to do the initial TRIM.
>>
>> Except it doesn't. Lukas did provide numbers in his original email.
>>
>
> Looking at the benchmarks (for the first time) at
> http://people.redhat.com/jmoyer/discard/ext4_batched_discard/
>
> I don't see anything that says how long the proposed trim ioctl takes
> to complete on the full filesystem.
>
> What they do show is that with the 3 test SSDs used for this
> benchmark, the current released discard implementation is a net loss.
> ie. You are better off running without the discards for all 3 vendors.
> (at least under the conditions tested.)
>
> After the patch is applied and optimizing the discards to large free
> extents only, it works out to same performance with or without the
> discards. ie. no net gain or loss.
>
> That is extremely cool because one assumes that the non-discard case
> would degrade over time, but that the discard case will not.
>
> So that argues for the current proposed patch going in.
>
> But quoting from the first email:
>
> ==
> The basic idea behind my discard support is to create an ioctl which
> walks through all the free extents in each allocating group and discard
> those extents. As an addition to improve its performance one can specify
> minimum free extent length, so ioctl will not bother with shorter extents.
>
> This of course means, that with each invocation the ioctl must walk
> through whole file system, checking and discarding free extents, which
> is not very efficient. The best way to avoid this is to keep track of
> deleted (freed) blocks. Then the ioctl have to trim just those free
> extents which were recently freed.
>
> In order to implement this I have added new bitmap into ext4_group_info
> (bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
> walk through bb_bitmap_deleted, compare deleted extents with free
> extents trim them and then removes it from the bb_bitmap_deleted.
>
> But you may notice, that there is one problem. bb_bitmap_deleted does
> not survive umount. To bypass the problem the first ioctl call have to
> walk through whole file system trimming all free extents. But there is a
> better solution to this problem. The bb_bitmap_deleted can be stored on
> disk an can be restored in mount time along with other bitmaps, but I
> think it is a quite big change and should be discussed further.
> ==
>
> The above seems to argue against the patch going in until the
> mount/umount issues are addressed.
>
> So in addition to this patch, Lukas is proposing a on disk change to
> address the fact that calling trim upteen times at mount time is too
> slow.
>
> Per Mark's testing of last summer, an alternative solution is to use a
> vectored trim approach that is far more efficient.
>
> Mark's benchmarks showed this as doable in seconds which seems like a
> reasonable amount of time for a mount time operation.
>
> Greg
>



--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

2010-04-21 21:01:33

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 04/21/2010 03:44 PM, Greg Freemyer wrote:

> Mark's benchmarks showed this as doable in seconds which seems like a
> reasonable amount of time for a mount time operation.

All the other things aside, mount-time is interesting, but it's an
infrequent operation, at least in my world. I think we need something
that can be done at runtime.

For anything with uptime, I don't think it's acceptable to wait until
the next mount to trim unused blocks.

But as long as the mechanism can be called at mount time and/or
kicked off at runtime somehow, I'm happy.

-Eric

2010-04-21 21:03:49

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 04/21/2010 05:01 PM, Eric Sandeen wrote:
> On 04/21/2010 03:44 PM, Greg Freemyer wrote:
>
>
>> Mark's benchmarks showed this as doable in seconds which seems like a
>> reasonable amount of time for a mount time operation.
>>
> All the other things aside, mount-time is interesting, but it's an
> infrequent operation, at least in my world. I think we need something
> that can be done runtime.
>
> For anything with uptime, I don't think it's acceptable to wait until
> the next mount to trim unused blocks.
>
> But as long as the mechanism can be called either at mount time and/or
> kicked off runtime somehow, I'm happy.
>
> -Eric
>

That makes sense to me. Most enterprise servers will go without
remounting a file system for (hopefully!) a very long time.

It is really important to keep in mind that this is not just a laptop
feature for laptop SSD's, this is also used by high end arrays and
*could* be useful for virt IO, etc as well :-)

ric


2010-04-21 21:47:28

by Greg Freemyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

Adding James Bottomley because high-end scsi is entering the
discussion. James, I have a couple scsi questions for you at the end.

On Wed, Apr 21, 2010 at 5:03 PM, Ric Wheeler <[email protected]> wrote:
> On 04/21/2010 05:01 PM, Eric Sandeen wrote:
>>
>> On 04/21/2010 03:44 PM, Greg Freemyer wrote:
>>
>>
>>>
>>> Mark's benchmarks showed this as doable in seconds which seems like a
>>> reasonable amount of time for a mount time operation.
>>>
>>
>> All the other things aside, mount-time is interesting, but it's an
>> infrequent operation, at least in my world. I think we need something
>> that can be done runtime.
>>
>> For anything with uptime, I don't think it's acceptable to wait until
>> the next mount to trim unused blocks.
>>
>> But as long as the mechanism can be called either at mount time and/or
>> kicked off runtime somehow, I'm happy.
>>
>> -Eric
>>
>
> That makes sense to me. Most enterprise servers will go without remounting
> a file system for (hopefully!) a very long time.
>
> It is really important to keep in mind that this is not just a laptop
> feature for laptop SSD's, this is also used by high end arrays and *could*
> be useful for virt IO, etc as well :-)
>
> ric

I'm not arguing that a runtime solution is not needed.

I'm arguing that at least for SSD backed filesystems Mark's userspace
implementation shows how the mount time initialization of the runtime
bitmap can be accomplished in a few seconds by leveraging the hardware
and using vector'ed trims as opposed to having to build an additional
on-disk structure.

At least for SSDs, the primary purpose of the proposed on-disk
structure seems to be to overcome the current lack of a vector'ed
discard implementation.

If it is too difficult to implement a fully functional vector'ed
discard in the block layer due to locking issues, possibly a special
purpose version could be written that is only used at mount time when
one can be assured no other i/o is occurring to the filesystem.

James,

The ATA-8 spec supports vectored trims and requires that a minimum of 255
sectors' worth of range payload be supported. That equates to a single
trim command being able to cover thousands of ranges.
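
As a back-of-the-envelope check (assuming the common 8-byte range entry of
a 48-bit LBA plus a 16-bit sector count, which is my reading of the draft
rather than something stated in this thread), 255 sectors of payload work
out to a very large number of ranges per command:

#include <stdio.h>

int main(void)
{
	const unsigned entry_bytes = 8;		/* assumed: 48-bit LBA + 16-bit count */
	const unsigned sector_bytes = 512;
	const unsigned payload_sectors = 255;	/* minimum the spec requires */

	unsigned per_sector = sector_bytes / entry_bytes;	/* 64 */
	unsigned max_ranges = payload_sectors * per_sector;	/* 16320 */

	printf("%u ranges per command, each up to 65535 sectors\n", max_ranges);
	return 0;
}

Even a single 512-byte payload already carries 64 ranges, which is why
batching ranges into one command pays off so quickly.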

Mark Lord has benchmarked this and found a vectored trim to be drastically
faster than calling trim individually for each of those ranges.

Does scsi support vector'ed discard? (ie. write-same commands)

Or are high-end scsi arrays so fast that they can process tens of thousands
of discard commands in a reasonable amount of time, unlike what the SSDs
have so far proven able to do?

It would be interesting to find out whether an SSD can discard thousands
of ranges drastically faster than a high-end scsi device can. If it can,
that might argue for the on-disk bitmap to track previously
discarded blocks/extents.

Greg

2010-04-21 21:56:51

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On Wed, 2010-04-21 at 17:47 -0400, Greg Freemyer wrote:
> Adding James Bottomley because high-end scsi is entering the
> discussion. James, I have a couple scsi questions for you at the end.
>
> On Wed, Apr 21, 2010 at 5:03 PM, Ric Wheeler <[email protected]> wrote:
> > On 04/21/2010 05:01 PM, Eric Sandeen wrote:
> >>
> >> On 04/21/2010 03:44 PM, Greg Freemyer wrote:
> >>
> >>
> >>>
> >>> Mark's benchmarks showed this as doable in seconds which seems like a
> >>> reasonable amount of time for a mount time operation.
> >>>
> >>
> >> All the other things aside, mount-time is interesting, but it's an
> >> infrequent operation, at least in my world. I think we need something
> >> that can be done runtime.
> >>
> >> For anything with uptime, I don't think it's acceptable to wait until
> >> the next mount to trim unused blocks.

So what's wrong with using wiper.sh? It can do online discard of
filesystems that support delayed allocation (ext4, xfs etc.)?

> >> But as long as the mechanism can be called either at mount time and/or
> >> kicked off runtime somehow, I'm happy.
> >>
> >> -Eric
> >>
> >
> > That makes sense to me. Most enterprise servers will go without remounting
> > a file system for (hopefully!) a very long time.
> >
> > It is really important to keep in mind that this is not just a laptop
> > feature for laptop SSD's, this is also used by high end arrays and *could*
> > be useful for virt IO, etc as well :-)
> >
> > ric
>
> I'm not arguing that a runtime solution is not needed.
>
> I'm arguing that at least for SSD backed filesystems Mark's userspace
> implementation shows how the mount time initialization of the runtime
> bitmap can be accomplished in a few seconds by leveraging the hardware
> and using vector'ed trims as opposed to having to build an additional
> on-disk structure.
>
> At least for SSDs, the primary purpose of the proposed on-disk
> structure seems to be to overcome the current lack of a vector'ed
> discard implementation.
>
> If it is too difficult to implement a fully functional vector'ed
> discard in the block layer due to locking issues, possibly a special
> purpose version could be written that is only used at mount time when
> one can be assured no other i/o is occurring to the filesystem.
>
> James,
>
> The ATA-8 spec. supports vectored trims and requires a minimum of 255
> sectors worth of range payload be supported. That equates to a single
> trim being able to trim thousands of ranges in one command.
>
> Mark Lord has benchmarked in found a vectored trim to be drastically
> faster than calling trim individually for each of those ranges.
>
> Does scsi support vector'ed discard? (ie. write-same commands)

only with UNMAP. WRITE SAME is effectively single range.

> Or are high-end scsi arrays so fast they can process tens of thousands
> of discard commands in a reasonable amount of time, unlike the SSDs
> have so far proven to do.

No ... they actually have two problems: firstly they can only use
discard ranges which align with their internal block size (usually
something huge like 3/4MB) and then a trim operation tends to be O(1)
and slow, so they'd actually like discard accumulation.

> It would be interesting to find out that a SSD can discard thousands
> of ranges drastically faster than a high-end scsi device can. But if
> true, that might argue for the on-disk bitmap to track previously
> discarded blocks/extents.

I think SSDs and Arrays both have discard problems, arrays more to do
with the time and expense of the operation, SSDs because the TRIM
command isn't queued.

James



2010-04-21 21:59:19

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 21/04/10 05:47 PM, Greg Freemyer wrote:
>
> The ATA-8 spec. supports vectored trims and requires a minimum of 255
> sectors worth of range payload be supported. That equates to a single
> trim being able to trim thousands of ranges in one command.
>
> Mark Lord has benchmarked in found a vectored trim to be drastically
> faster than calling trim individually for each of those ranges.
..

Does anyone have an Intel-based SSD they could spare? :)

Rumour has it that they do not comply with the ATA-8 spec for TRIM,
in that they don't support more than one "sector" of range data.
The SSDs I have here all support much more than that.

Still, even if we just do a single 512-byte payload of TRIM ranges,
it would be a huge win.

Cheers

2010-04-23 08:23:26

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On Wed, 21 Apr 2010, Greg Freemyer wrote:

> >>>>> Mmm.. If that's what it is doing, then this patch set would be a
> >>>>> complete disaster.
> >>>>> It would take *hours* to do the initial TRIM.
> >
> > Except it doesn't. Lukas did provide numbers in his original email.
> >
>
> Looking at the benchmarks (for the first time) at
> http://people.redhat.com/jmoyer/discard/ext4_batched_discard/
>
> I don't see anything that says how long the proposed trim ioctl takes
> to complete on the full filesystem.

Well, it strongly depends on how fragmented the file system is. On a
fresh file system (147G) the initial ioctl takes 2 seconds to finish (it
may be worth mentioning that on another SSD (111G) it takes 56s). I will
try to get some numbers for a "usual" file system (not full, not
fresh).

>
> What they do show is that with the 3 test SSDs used for this
> benchmark, the current released discard implementation is a net loss.
> ie. You are better off running without the discards for all 3 vendors.
> (at least under the conditions tested.)
>
> After the patch is applied and optimizing the discards to large free
> extents only, it works out to same performance with or without the
> discards. ie. no net gain or loss.
>
> That is extremely cool because one assumes that the non-discard case
> would degrade over time, but that the discard case will not.
>
> So that argues for the current proposed patch going in.
>
> But quoting from the first email:
>
> ==
> The basic idea behind my discard support is to create an ioctl which
> walks through all the free extents in each allocating group and discard
> those extents. As an addition to improve its performance one can specify
> minimum free extent length, so ioctl will not bother with shorter extents.
>
> This of course means, that with each invocation the ioctl must walk
> through whole file system, checking and discarding free extents, which
> is not very efficient. The best way to avoid this is to keep track of
> deleted (freed) blocks. Then the ioctl have to trim just those free
> extents which were recently freed.
>
> In order to implement this I have added new bitmap into ext4_group_info
> (bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
> walk through bb_bitmap_deleted, compare deleted extents with free
> extents trim them and then removes it from the bb_bitmap_deleted.
>
> But you may notice, that there is one problem. bb_bitmap_deleted does
> not survive umount. To bypass the problem the first ioctl call have to
> walk through whole file system trimming all free extents. But there is a
> better solution to this problem. The bb_bitmap_deleted can be stored on
> disk an can be restored in mount time along with other bitmaps, but I
> think it is a quite big change and should be discussed further.
> ==
>
> The above seems to argue against the patch going in until the
> mount/umount issues are addressed.

I do not know much about how production systems are used, but I
doubt this is that big an issue. Sure, the initial ioctl takes long to
finish and there is room for improvement; there was a proposal to
do the initial trim at mount time. I do not think that is wise:
why block the mount when the trim can run in the background once the fs
is mounted? Of course there will be some performance loss while the ioctl
is in progress, but it will not block.

There is also another way to overcome this problem: we can ensure that
the file system is left trimmed after umount. To do this, we can simply
trim the fs at umount time. I think this won't be any problem, and we
would not even prolong the umount time too much, because we would not
trim the whole fs, just the recently freed blocks.

This of course brings up another problem: when the system is not properly
shut down, the umount is not properly finished (or done at all). But
this can be solved in fsck at boot time, I think. This would entirely
eliminate the need to trim the whole fs (except in fsck, obviously),
since that is already done when the fs is created.

>
> So in addition to this patch, Lukas is proposing a on disk change to
> address the fact that calling trim upteen times at mount time is too
> slow.
>
> Per Mark's testing of last summer, an alternative solution is to use a
> vectored trim approach that is far more efficient.

Vectored trim would be great; I have not tested anything like that, but
obviously it would strongly reduce the time needed to trim the fs. We do
not have this support just yet, though.

>
> Mark's benchmarks showed this as doable in seconds which seems like a
> reasonable amount of time for a mount time operation.
>
> Greg
>


Also, I am currently rewriting the patch to use an rbtree instead of the
bitmap, because there were some concerns about memory consumption. It is an
open question whether or not the rbtree will be more memory friendly.
Generally I think that in most "normal" cases it will be, but there are some
extreme scenarios where the rbtree will be much worse. Any comment on
this?


Thanks.
-Lukas
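
For what it is worth, here is a tiny stand-alone sketch of the memory
trade-off being asked about; the blocks-per-group and per-node figures are
assumptions picked for illustration, not numbers taken from the patch:

#include <stdio.h>

int main(void)
{
	const unsigned blocks_per_group = 32768;	/* typical 4k-block group */
	const unsigned bitmap_bytes = blocks_per_group / 8;	/* 4096, fixed cost */
	const unsigned node_bytes = 64;		/* assumed rb_node + start + len */

	unsigned break_even = bitmap_bytes / node_bytes;

	printf("bitmap:  %u bytes per group, regardless of activity\n",
	       bitmap_bytes);
	printf("rbtree:  %u bytes per tracked freed extent\n", node_bytes);
	printf("rbtree wins while a group has fewer than ~%u freed extents,\n",
	       break_even);
	printf("but loses badly once deletes fragment into thousands of runs\n");
	return 0;
}

So the "extreme scenarios" are exactly the heavily fragmented groups: a
handful of large freed extents favours the tree, while a sea of tiny ones
favours the fixed-size bitmap.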

2010-04-24 13:24:10

by Greg Freemyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On Fri, Apr 23, 2010 at 4:23 AM, Lukas Czerner <[email protected]> wrote:
> On Wed, 21 Apr 2010, Greg Freemyer wrote:
>
>> >>>>> Mmm.. If that's what it is doing, then this patch set would be a
>> >>>>> complete disaster.
>> >>>>> It would take *hours* to do the initial TRIM.
>> >
>> > Except it doesn't. Lukas did provide numbers in his original email.
>> >
>>
>> Looking at the benchmarks (for the first time) at
>> http://people.redhat.com/jmoyer/discard/ext4_batched_discard/
>>
>> I don't see anything that says how long the proposed trim ioctl takes
>> to complete on the full filesystem.
>
> Well, it strongly depends on how is the file system fragmented. On the
> fresh file system (147G) the initial ioctl takes 2 seconds to finish (it
> may be worth to mention that on another SSD (111G) it takes 56s). I will
> try to get some numbers for the "usual" file system (not full, not
> fresh).
>
>>
>> What they do show is that with the 3 test SSDs used for this
>> benchmark, the current released discard implementation is a net loss.
>> ie. You are better off running without the discards for all 3 vendors.
>> (at least under the conditions tested.)
>>
>> After the patch is applied and optimizing the discards to large free
>> extents only, it works out to same performance with or without the
>> discards. ie. no net gain or loss.
>>
>> That is extremely cool because one assumes that the non-discard case
>> would degrade over time, but that the discard case will not.
>>
>> So that argues for the current proposed patch going in.
>>
>> But quoting from the first email:
>>
>> ==
>> The basic idea behind my discard support is to create an ioctl which
>> walks through all the free extents in each allocating group and discard
>> those extents. As an addition to improve its performance one can specify
>> minimum free extent length, so ioctl will not bother with shorter extents.
>>
>> This of course means, that with each invocation the ioctl must walk
>> through whole file system, checking and discarding free extents, which
>> is not very efficient. The best way to avoid this is to keep track of
>> deleted (freed) blocks. Then the ioctl have to trim just those free
>> extents which were recently freed.
>>
>> In order to implement this I have added new bitmap into ext4_group_info
>> (bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
>> walk through bb_bitmap_deleted, compare deleted extents with free
>> extents trim them and then removes it from the bb_bitmap_deleted.
>>
>> But you may notice, that there is one problem. bb_bitmap_deleted does
>> not survive umount. To bypass the problem the first ioctl call have to
>> walk through whole file system trimming all free extents. But there is a
>> better solution to this problem. The bb_bitmap_deleted can be stored on
>> disk an can be restored in mount time along with other bitmaps, but I
>> think it is a quite big change and should be discussed further.
>> ==
>>
>> The above seems to argue against the patch going in until the
>> mount/umount issues are addressed.
>
> I do not know much about how production system is being used, but I
> doubt this is that big issue. Sure the initial ioctl takes long to
> finish and there is a place for improvement, there was a proposal to
> do the initial trim at mount time. I do not think that it is wise,
> why to block mount, when the trim can be run at background when the fs
> is mounted ? Of course there will be some performance loss, while ioctl
> will be in progress, but it will not block.
>
> There are also another way to overcome this problem. We can assure that
> the file system is left trimmed after umount. To do this, we can simply
> trim the fs at umount time. I think this won't be any problem and we even
> do not prolong the umount time too much, because we will not trim whole
> fs, but just recently freed blocks.
>
> This of course bring another problem, when the system is not properly
> terminated and the umount is not properly finished (or done at all). But
> this can be solved in fsck at boot time I think. This will entirely
> eliminate the need to trim the whole fs (except the fsck obviously),
> since it is done when fs is created.
>
>>
>> So in addition to this patch, Lukas is proposing a on disk change to
>> address the fact that calling trim upteen times at mount time is too
>> slow.
>>
>> Per Mark's testing of last summer, an alternative solution is to use a
>> vectored trim approach that is far more efficient.
>
> Vectored trim will be great, I did not tested anything like that but
> obviously it will strongly reduce time needed to trim the fs. But we do
> not have this support just yet.
>
>>
>> Mark's benchmarks showed this as doable in seconds which seems like a
>> reasonable amount of time for a mount time operation.
>>
>> Greg
>>
>
>
> And also, currently I am rewriting the patch do use rbtree instead of the
> bitmap, because there were some concerns of memory consumption. It is a
> question whether or not the rbtree will be more memory friendly.
> Generally I think that in most "normal" cases it will, but there are some
> extreme scenarios, where the rbtree will be much worse. Any comment on
> this ?

I know I've been arguing against this patch for the single SSD case
and I still think that use case should be handled by userspace as
hdparm/wiper.sh currently does. In particular for those extreme
scenarios with JBOD SSDs, the user space solution wins because it
knows how to optimize the trim calls via vectorized ranges in the
payload.

Thus I think the community and distro's should be testing that pair
and pushing it out in the distro's for typical laptop use.

But, that still leaves high-end external raid arrays, mdraid, and lvm
unaddressed.

Those use cases are the ones most likely to benefit from the approach this
patch takes. In particular, mdraid with raid 5/6 requires an approach
like the one this patch provides, or else it has to create its own in-kernel
discard aggregator, which seems like a waste of time.

In general, those use cases have large minimum discard units. Thus I
think this patch should be tuned to work with large discard units and
ignore small ones. That means it needs to get the underlying block
layer topology and ignore unused space smaller than the underlying layer's
minimum discard unit. That should allow an rbtree to be used and
eliminate the extreme scenarios. (i.e. I assume your extreme scenarios
involve large numbers of very small unused ranges.)

That may mean the topology information needs to grow some discard
info. Does anyone know if that info is easily derived from the
currently existing topo info?

Greg






--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

2010-04-24 13:48:22

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 04/24/2010 09:24 AM, Greg Freemyer wrote:
> On Fri, Apr 23, 2010 at 4:23 AM, Lukas Czerner<[email protected]> wrote:
>
>> On Wed, 21 Apr 2010, Greg Freemyer wrote:
>>
>>
>>>>>>>> Mmm.. If that's what it is doing, then this patch set would be a
>>>>>>>> complete disaster.
>>>>>>>> It would take *hours* to do the initial TRIM.
>>>>>>>>
>>>> Except it doesn't. Lukas did provide numbers in his original email.
>>>>
>>>>
>>> Looking at the benchmarks (for the first time) at
>>> http://people.redhat.com/jmoyer/discard/ext4_batched_discard/
>>>
>>> I don't see anything that says how long the proposed trim ioctl takes
>>> to complete on the full filesystem.
>>>
>> Well, it strongly depends on how is the file system fragmented. On the
>> fresh file system (147G) the initial ioctl takes 2 seconds to finish (it
>> may be worth to mention that on another SSD (111G) it takes 56s). I will
>> try to get some numbers for the "usual" file system (not full, not
>> fresh).
>>
>>
>>> What they do show is that with the 3 test SSDs used for this
>>> benchmark, the current released discard implementation is a net loss.
>>> ie. You are better off running without the discards for all 3 vendors.
>>> (at least under the conditions tested.)
>>>
>>> After the patch is applied and optimizing the discards to large free
>>> extents only, it works out to same performance with or without the
>>> discards. ie. no net gain or loss.
>>>
>>> That is extremely cool because one assumes that the non-discard case
>>> would degrade over time, but that the discard case will not.
>>>
>>> So that argues for the current proposed patch going in.
>>>
>>> But quoting from the first email:
>>>
>>> ==
>>> The basic idea behind my discard support is to create an ioctl which
>>> walks through all the free extents in each allocating group and discard
>>> those extents. As an addition to improve its performance one can specify
>>> minimum free extent length, so ioctl will not bother with shorter extents.
>>>
>>> This of course means, that with each invocation the ioctl must walk
>>> through whole file system, checking and discarding free extents, which
>>> is not very efficient. The best way to avoid this is to keep track of
>>> deleted (freed) blocks. Then the ioctl have to trim just those free
>>> extents which were recently freed.
>>>
>>> In order to implement this I have added new bitmap into ext4_group_info
>>> (bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
>>> walk through bb_bitmap_deleted, compare deleted extents with free
>>> extents trim them and then removes it from the bb_bitmap_deleted.
>>>
>>> But you may notice, that there is one problem. bb_bitmap_deleted does
>>> not survive umount. To bypass the problem the first ioctl call have to
>>> walk through whole file system trimming all free extents. But there is a
>>> better solution to this problem. The bb_bitmap_deleted can be stored on
>>> disk an can be restored in mount time along with other bitmaps, but I
>>> think it is a quite big change and should be discussed further.
>>> ==
>>>
>>> The above seems to argue against the patch going in until the
>>> mount/umount issues are addressed.
>>>
>> I do not know much about how production system is being used, but I
>> doubt this is that big issue. Sure the initial ioctl takes long to
>> finish and there is a place for improvement, there was a proposal to
>> do the initial trim at mount time. I do not think that it is wise,
>> why to block mount, when the trim can be run at background when the fs
>> is mounted ? Of course there will be some performance loss, while ioctl
>> will be in progress, but it will not block.
>>
>> There are also another way to overcome this problem. We can assure that
>> the file system is left trimmed after umount. To do this, we can simply
>> trim the fs at umount time. I think this won't be any problem and we even
>> do not prolong the umount time too much, because we will not trim whole
>> fs, but just recently freed blocks.
>>
>> This of course bring another problem, when the system is not properly
>> terminated and the umount is not properly finished (or done at all). But
>> this can be solved in fsck at boot time I think. This will entirely
>> eliminate the need to trim the whole fs (except the fsck obviously),
>> since it is done when fs is created.
>>
>>
>>> So in addition to this patch, Lukas is proposing a on disk change to
>>> address the fact that calling trim upteen times at mount time is too
>>> slow.
>>>
>>> Per Mark's testing of last summer, an alternative solution is to use a
>>> vectored trim approach that is far more efficient.
>>>
>> Vectored trim will be great, I did not tested anything like that but
>> obviously it will strongly reduce time needed to trim the fs. But we do
>> not have this support just yet.
>>
>>
>>> Mark's benchmarks showed this as doable in seconds which seems like a
>>> reasonable amount of time for a mount time operation.
>>>
>>> Greg
>>>
>>>
>>
>> And also, currently I am rewriting the patch do use rbtree instead of the
>> bitmap, because there were some concerns of memory consumption. It is a
>> question whether or not the rbtree will be more memory friendly.
>> Generally I think that in most "normal" cases it will, but there are some
>> extreme scenarios, where the rbtree will be much worse. Any comment on
>> this ?
>>
> I know I've been arguing against this patch for the single SSD case
> and I still think that use case should be handled by userspace as
> hdparm/wiper.sh currently does. In particular for those extreme
> scenarios with JBOD SSDs, the user space solution wins because it
> knows how to optimize the trim calls via vectorized ranges in the
> payload.
>

I think that you have missed the broader point. This is not on by
default, so you can mount without discard and use whatever user space
utility you like at your discretion.

ric

> Thus I think the community and distro's should be testing that pair
> and pushing it out in the distro's for typical laptop use.
>
> But, that still leaves high-end external raid arrays, mdraid, and lvm
> unaddressed.
>
> Those use cases will likely benefit from the approach this patch takes
> the most. In particular, mdraid with raid 5/6 requires an approach
> like this patch provides, or it has to create its own in kernel
> discard aggregator which seems like a waste of time.
>
> In general, those use cases have large minimum discard units. Thus I
> think this patch should be tuned to work with large discard units and
> ignore small ones. That means it needs to get the underlying block
> layer topology and ignore unused space smaller than underlying layers
> minimum discard unit. That should allow a rb tree to be used and
> eliminate the extreme scenarios. (ie. I assume your extreme scenarios
> involve large numbers of very small unused ranges.)
>
> That may mean the topology information needs to grow some discard
> info. Does anyone know if that info is easily derived from the
> currently existing topo info?
>
> Greg


2010-04-24 14:30:46

by Greg Freemyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On Sat, Apr 24, 2010 at 9:48 AM, Ric Wheeler <[email protected]> wrote:
> On 04/24/2010 09:24 AM, Greg Freemyer wrote:
>>
>> On Fri, Apr 23, 2010 at 4:23 AM, Lukas Czerner<[email protected]>
>> wrote:
>>
>>>
>>> On Wed, 21 Apr 2010, Greg Freemyer wrote:
>>>
>>>
>>>>>>>>>
>>>>>>>>> Mmm.. If that's what it is doing, then this patch set would be a
>>>>>>>>> complete disaster.
>>>>>>>>> It would take *hours* to do the initial TRIM.
>>>>>>>>>
>>>>>
>>>>> Except it doesn't. Lukas did provide numbers in his original email.
>>>>>
>>>>>
>>>>
>>>> Looking at the benchmarks (for the first time) at
>>>> http://people.redhat.com/jmoyer/discard/ext4_batched_discard/
>>>>
>>>> I don't see anything that says how long the proposed trim ioctl takes
>>>> to complete on the full filesystem.
>>>>
>>>
>>> Well, it strongly depends on how is the file system fragmented. On the
>>> fresh file system (147G) the initial ioctl takes 2 seconds to finish (it
>>> may be worth to mention that on another SSD (111G) it takes 56s). I will
>>> try to get some numbers for the "usual" file system (not full, not
>>> fresh).
>>>
>>>
>>>>
>>>> What they do show is that with the 3 test SSDs used for this
>>>> benchmark, the current released discard implementation is a net loss.
>>>> ie. You are better off running without the discards for all 3 vendors.
>>>> (at least under the conditions tested.)
>>>>
>>>> After the patch is applied and optimizing the discards to large free
>>>> extents only, it works out to same performance with or without the
>>>> discards. ie. no net gain or loss.
>>>>
>>>> That is extremely cool because one assumes that the non-discard case
>>>> would degrade over time, but that the discard case will not.
>>>>
>>>> So that argues for the current proposed patch going in.
>>>>
>>>> But quoting from the first email:
>>>>
>>>> ==
>>>> The basic idea behind my discard support is to create an ioctl which
>>>> walks through all the free extents in each allocating group and discard
>>>> those extents. As an addition to improve its performance one can specify
>>>> minimum free extent length, so ioctl will not bother with shorter
>>>> extents.
>>>>
>>>> This of course means, that with each invocation the ioctl must walk
>>>> through whole file system, checking and discarding free extents, which
>>>> is not very efficient. The best way to avoid this is to keep track of
>>>> deleted (freed) blocks. Then the ioctl have to trim just those free
>>>> extents which were recently freed.
>>>>
>>>> In order to implement this I have added new bitmap into ext4_group_info
>>>> (bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
>>>> walk through bb_bitmap_deleted, compare deleted extents with free
>>>> extents trim them and then removes it from the bb_bitmap_deleted.
>>>>
>>>> But you may notice, that there is one problem. bb_bitmap_deleted does
>>>> not survive umount. To bypass the problem the first ioctl call have to
>>>> walk through whole file system trimming all free extents. But there is a
>>>> better solution to this problem. The bb_bitmap_deleted can be stored on
>>>> disk an can be restored in mount time along with other bitmaps, but I
>>>> think it is a quite big change and should be discussed further.
>>>> ==
>>>>
>>>> The above seems to argue against the patch going in until the
>>>> mount/umount issues are addressed.
>>>>
>>>
>>> I do not know much about how production system is being used, but I
>>> doubt this is that big issue. Sure the initial ioctl takes long to
>>> finish and there is a place for improvement, there was a proposal to
>>> do the initial trim at mount time. I do not think that it is wise,
>>> why to block mount, when the trim can be run at background when the fs
>>> is mounted ? Of course there will be some performance loss, while ioctl
>>> will be in progress, but it will not block.
>>>
>>> There are also another way to overcome this problem. We can assure that
>>> the file system is left trimmed after umount. To do this, we can simply
>>> trim the fs at umount time. I think this won't be any problem and we even
>>> do not prolong the umount time too much, because we will not trim whole
>>> fs, but just recently freed blocks.
>>>
>>> This of course bring another problem, when the system is not properly
>>> terminated and the umount is not properly finished (or done at all). But
>>> this can be solved in fsck at boot time I think. This will entirely
>>> eliminate the need to trim the whole fs (except the fsck obviously),
>>> since it is done when fs is created.
>>>
>>>
>>>>
>>>> So in addition to this patch, Lukas is proposing a on disk change to
>>>> address the fact that calling trim upteen times at mount time is too
>>>> slow.
>>>>
>>>> Per Mark's testing of last summer, an alternative solution is to use a
>>>> vectored trim approach that is far more efficient.
>>>>
>>>
>>> Vectored trim will be great, I did not tested anything like that but
>>> obviously it will strongly reduce time needed to trim the fs. But we do
>>> not have this support just yet.
>>>
>>>
>>>>
>>>> Mark's benchmarks showed this as doable in seconds which seems like a
>>>> reasonable amount of time for a mount time operation.
>>>>
>>>> Greg
>>>>
>>>>
>>>
>>> And also, currently I am rewriting the patch do use rbtree instead of the
>>> bitmap, because there were some concerns of memory consumption. It is a
>>> question whether or not the rbtree will be more memory friendly.
>>> Generally I think that in most "normal" cases it will, but there are some
>>> extreme scenarios, where the rbtree will be much worse. Any comment on
>>> this ?
>>>
>>
>> I know I've been arguing against this patch for the single SSD case
>> and I still think that use case should be handled by userspace as
>> hdparm/wiper.sh currently does. In particular for those extreme
>> scenarios with JBOD SSDs, the user space solution wins because it
>> knows how to optimize the trim calls via vectorized ranges in the
>> payload.
>>
>
> I think that you have missed the broader point. This is not on by default,
> so you can mount without discard and use whatever user space utility you
> like at your discretion.
>
> ric

Ric,

I was trying to say the design should be driven by the large discard
range use case, not the potentially pathological small discard range
use case that would only benefit SSDs.

Greg

2010-04-24 14:43:43

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

Greg Freemyer wrote:
> On Sat, Apr 24, 2010 at 9:48 AM, Ric Wheeler <[email protected]> wrote:
>> On 04/24/2010 09:24 AM, Greg Freemyer wrote:

...

>>> I know I've been arguing against this patch for the single SSD case
>>> and I still think that use case should be handled by userspace as
>>> hdparm/wiper.sh currently does. In particular for those extreme
>>> scenarios with JBOD SSDs, the user space solution wins because it
>>> knows how to optimize the trim calls via vectorized ranges in the
>>> payload.
>>>
>> I think that you have missed the broader point. This is not on by default,
>> so you can mount without discard and use whatever user space utility you
>> like at your discretion.
>>
>> ric
>
> Ric,
>
> I was trying to say the design should be driven by the large discard
> range use case, not the potentially pathological small discard range
> use case that would only benefit SSDs.
>
> Greg

Bear in mind that this patch makes the discard range requests substantially
-larger- than what mount -o discard does on ext4 today; in fact, that was
a main goal.

If the kernel could assemble vectors of ranges and pass them down, I think it
could be extended to use that as well.

-Eric

2010-04-24 15:03:28

by Greg Freemyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On Sat, Apr 24, 2010 at 10:43 AM, Eric Sandeen <[email protected]> wrote:
> Greg Freemyer wrote:
>> On Sat, Apr 24, 2010 at 9:48 AM, Ric Wheeler <[email protected]> wrote:
>>> On 04/24/2010 09:24 AM, Greg Freemyer wrote:
>
> ...
>
>>>> I know I've been arguing against this patch for the single SSD case
>>>> and I still think that use case should be handled by userspace as
>>>> hdparm/wiper.sh currently does. In particular for those extreme
>>>> scenarios with JBOD SSDs, the user space solution wins because it
>>>> knows how to optimize the trim calls via vectorized ranges in the
>>>> payload.
>>>>
>>> I think that you have missed the broader point. This is not on by default,
>>> so you can mount without discard and use whatever user space utility you
>>> like at your discretion.
>>>
>>> ric
>>
>> Ric,
>>
>> I was trying to say the design should be driven by the large discard
>> range use case, not the potentially pathological small discard range
>> use case that would only benefit SSDs.
>>
>> Greg
>
> Bear in mind that this patch makes the discard range requests substantially
> -larger- than what mount -o discard does on ext4 today, in fact that was
> a main goal.
>
> If the kernel could assemble vectors of ranges and pass them down, I think it
> could be extended to use that as well.
>
> -Eric
>
Eric/Ric,

I was responding to Lukas's latest post, which stated:

==
And also, currently I am rewriting the patch do use rbtree instead of the
bitmap, because there were some concerns of memory consumption. It is a
question whether or not the rbtree will be more memory friendly.
Generally I think that in most "normal" cases it will, but there are some
extreme scenarios, where the rbtree will be much worse. Any comment on
this ?
==

If one optimizes for large discard ranges and ignores the pathological
cases that only benefit SSDs, then an rbtree wins.

Correct?

Greg

2010-04-24 17:05:00

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 04/24/2010 11:03 AM, Greg Freemyer wrote:
> On Sat, Apr 24, 2010 at 10:43 AM, Eric Sandeen<[email protected]> wrote:
>
>> Greg Freemyer wrote:
>>
>>> On Sat, Apr 24, 2010 at 9:48 AM, Ric Wheeler<[email protected]> wrote:
>>>
>>>> On 04/24/2010 09:24 AM, Greg Freemyer wrote:
>>>>
>> ...
>>
>>
>>>>> I know I've been arguing against this patch for the single SSD case
>>>>> and I still think that use case should be handled by userspace as
>>>>> hdparm/wiper.sh currently does. In particular for those extreme
>>>>> scenarios with JBOD SSDs, the user space solution wins because it
>>>>> knows how to optimize the trim calls via vectorized ranges in the
>>>>> payload.
>>>>>
>>>>>
>>>> I think that you have missed the broader point. This is not on by default,
>>>> so you can mount without discard and use whatever user space utility you
>>>> like at your discretion.
>>>>
>>>> ric
>>>>
>>> Ric,
>>>
>>> I was trying to say the design should be driven by the large discard
>>> range use case, not the potentially pathological small discard range
>>> use case that would only benefit SSDs.
>>>
>>> Greg
>>>
>> Bear in mind that this patch makes the discard range requests substantially
>> -larger- than what mount -o discard does on ext4 today, in fact that was
>> a main goal.
>>
>> If the kernel could assemble vectors of ranges and pass them down, I think it
>> could be extended to use that as well.
>>
>> -Eric
>>
>>
> Eric/Ric,
>
> I was responding to the Lukas latest post which stated:
>
> ==
> And also, currently I am rewriting the patch do use rbtree instead of the
> bitmap, because there were some concerns of memory consumption. It is a
> question whether or not the rbtree will be more memory friendly.
> Generally I think that in most "normal" cases it will, but there are some
> extreme scenarios, where the rbtree will be much worse. Any comment on
> this ?
> ==
>
> If one optimizes for large discard ranges and ignores the pathological
> cases only beneficial to SSDs, then a rbtree wins.
>
> Correct?
>
> Greg
>

Let's summarize.

1. Everyone agrees that doing larger discard instead of little discards
is a good thing.

2. Some devices care about this more than others (various types of SSD's
and arrays have different designs and performance with discards). Some
devices do small discards well, others don't.

3. How you get to those bigger discards in our implementation - using a
series of single-range requests, using vectored requests, tracking
extents that can be combined in an rbtree or not - is something that we
are working on. The rbtree-versus-bitmap question is about DRAM
consumption, not the performance of the resulting discard on the target.

4. Devices (some devices) can export their preferences in a standard way
(look in /sys/block/....).

If you want to influence the code, please do try the various options on
devices you have at hand and report results. That is what we are doing
("we" includes Lukas, Eric, Jeff and others on this thread) with real
devices from vendors that have given us access. We are talking to them
directly and trying out different workloads, but we certainly welcome
real-world results and suggestions.

Thanks!

Ric



2010-04-24 18:30:56

by Greg Freemyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On Sat, Apr 24, 2010 at 1:04 PM, Ric Wheeler <[email protected]> wrote:
<snip>
>
> Let's summarize.
>
> 1. Everyone agrees that doing larger discard instead of little discards is a
> good thing.
>
> 2. Some devices care about this more than others (various types of SSD's and
> arrays have different designs and performance with discards). Some devices
> do small discards well, others don't.
>
> 3. How you get to those bigger discards in our implementation - using a
> series of single range requests, using vectored requests, tracking extents
> that can be combined in an rbtree or not - is something that we are working
> on. Using rbtrees versus a bitmap efficiency is about DRAM consumption, not
> performance of the resulting discard on the target.
>
> 4. Devices (some devices) can export their preferences in a standard way
> (look in /sys/block/....).
>
> If you want to influence the code, please do try the various options on
> devices you have at hand and report results. That is what we are doing (we
> includes Lukas, Eric, Jeff and others on this thread) will real devices from
> vendors that have given us access. We are talking to them directly and
> trying out different work loads but certainly welcome real world results and
> suggestions.
>
> Thanks!
>
> Ric

Ric,

Is it also agreed by all that the current ext4 kernel implementation
of calling discard is a poor solution for most hardware / block-layer
stacks / workloads, and therefore is not worth retaining or benchmarking
further?

I've not seen anyone arguing to keep the current kernel implementation
and I for one accept the previously posted benchmarks that show it is
not adding any value relative to the traditional non-discard case.

Therefore, benchmarks comparing the current hdparm/wiper.sh userspace
implementation with a proposed new kernel implementation would be the
most beneficial?

I have yet to see any of those benchmarks posted.

Greg



--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

2010-04-24 18:40:49

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

>>>>> "Greg" == Greg Freemyer <[email protected]> writes:

Greg> That may mean the topology information needs to grow some discard
Greg> info. Does anyone know if that info is easily derived from the
Greg> currently existing topo info?

The values are already part of the topology information (assuming the
device reports them). Went into 2.6.33.

--
Martin K. Petersen Oracle Linux Engineering
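
For reference, a minimal userspace sketch of reading the discard-related
queue limits from sysfs (the device preferences Ric's point 4 refers to).
The attribute names below are assumed from the 2.6.33 queue limits Martin
mentions, and "sda" is only an example device:

#include <stdio.h>

static long read_queue_limit(const char *dev, const char *attr)
{
	char path[256];
	long val = -1;
	FILE *f;

	/* e.g. /sys/block/sda/queue/discard_granularity */
	snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;	/* attribute missing on this kernel */
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	const char *dev = "sda";	/* hypothetical device */

	printf("discard_granularity: %ld\n",
	       read_queue_limit(dev, "discard_granularity"));
	printf("discard_max_bytes:   %ld\n",
	       read_queue_limit(dev, "discard_max_bytes"));
	return 0;
}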

2010-04-24 18:41:45

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 04/24/2010 02:30 PM, Greg Freemyer wrote:
> On Sat, Apr 24, 2010 at 1:04 PM, Ric Wheeler<[email protected]> wrote:
> <snip>
>
>> Let's summarize.
>>
>> 1. Everyone agrees that doing larger discard instead of little discards is a
>> good thing.
>>
>> 2. Some devices care about this more than others (various types of SSD's and
>> arrays have different designs and performance with discards). Some devices
>> do small discards well, others don't.
>>
>> 3. How you get to those bigger discards in our implementation - using a
>> series of single range requests, using vectored requests, tracking extents
>> that can be combined in an rbtree or not - is something that we are working
>> on. Using rbtrees versus a bitmap efficiency is about DRAM consumption, not
>> performance of the resulting discard on the target.
>>
>> 4. Devices (some devices) can export their preferences in a standard way
>> (look in /sys/block/....).
>>
>> If you want to influence the code, please do try the various options on
>> devices you have at hand and report results. That is what we are doing (we
>> includes Lukas, Eric, Jeff and others on this thread) will real devices from
>> vendors that have given us access. We are talking to them directly and
>> trying out different work loads but certainly welcome real world results and
>> suggestions.
>>
>> Thanks!
>>
>> Ric
>>
> Ric,
>
> Is it also agreed by all that the current ext4 kernel implementation
> of calling discard is a poor solution for most hardware / block layers
> stacks / workloads and therefore is not worth retaining nor performing
> further benchmarks?
>
> I've not seen anyone arguing to keep the current kernel implementation
> and I for one accept the previously posted benchmarks that show it is
> not adding any value relative to the traditional non-discard case.
>
> Therefore benchmarks between the current hdparm/wiper.sh userspace
> implementation and a proposed new kernel implementation would be the
> most beneficial?
>
> I have yet to see any of those benchmarks posted.
>
> Greg
>

Greg,

I don't like the user space wiper.sh approach in general, but running
wiper.sh is entirely your choice.

Most users prefer having the file system and the IO stack take care of
this for them, but again, entirely configurable.

The benchmarks we have done are often run on hardware that is under NDA,
so we cannot publish results across the board.

Feel free to run on hardware that you buy and share the results.

Thanks!

Ric


2010-04-24 19:06:57

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

>>>>> "Greg" == Greg Freemyer <[email protected]> writes:

Greg> Is it also agreed by all that the current ext4 kernel
Greg> implementation of calling discard is a poor solution for most
Greg> hardware / block layers stacks / workloads and therefore is not
Greg> worth retaining nor performing further benchmarks?

Our discard implementation is meant to accommodate a wide range of
devices. Just because some of the currently shipping low-end consumer
SSDs implement TRIM poorly does not mean we're going to scrap what we
have.

We are not in the business of designing for the past. Especially not
when the past can be handled by a shell script.

For future devices TRIM/UNMAP is going to be an inherent part of the
command set. And that's what our kernel support is aimed at. There are
some deficiencies right now because the block layer was not built to
handle what is essentially a hybrid between a filesystem and a blk_pc
type request. I'm working on fixing that.


Greg> I've not seen anyone arguing to keep the current kernel
Greg> implementation and I for one accept the previously posted
Greg> benchmarks that show it is not adding any value relative to the
Greg> traditional non-discard case.

For enterprise storage the cost of sending discards is limited to the
overhead of sending the command. I.e. negligible.

Eventually that's going to be the case for ATA devices as well. And let
me just reiterate that Windows 7 does issue TRIM like we do (at
runtime). And consequently Windows 7 will help weed out the crap SSD
implementations from the market. That's already happening.

There is also work underway to make TRIM a queueable command which would
further alleviate the situation.


Greg> Therefore benchmarks between the current hdparm/wiper.sh userspace
Greg> implementation and a proposed new kernel implementation would be
Greg> the most beneficial?

I don't know what you mean by "new" kernel implementation. We're
working on tweaking what we have so that we can split, merge, and
coalesce requests.

I'm also not sure why you're so hung up on benchmarks. The purpose of
TRIM is to increase longevity of a flash-based device. For dumb devices
there's currently a tradeoff - do you want speed or endurance? Once
devices get smarter that tradeoff will disappear.

--
Martin K. Petersen Oracle Linux Engineering

2010-04-26 14:06:49

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 24/04/10 03:06 PM, Martin K. Petersen wrote:
..
> Our discard implementation is meant to accommodate a wide range of
> devices. Just because some of the currently shipping low-end consumer
> SSDs implement TRIM poorly does not mean we're going to scrap what we
> have.
..

The current implementation works extremely poorly for the single most
common style of hardware that's out there: millions and millions
of SATA SSDs, and they're becoming more and more ubiquitous.

A "solution" that ignores reality isn't very helpful.
We can do better than that, a lot better.

> We are not in the business of designing for the past. Especially not
> when the past can be handled by a shell script.
..

We are in the business of supporting the hardware that people run Linux on.
Today, and tomorrow, and the next few years, that means SATA SSDs by the gazillions,
as well as a relatively smaller number of enterprise behemoths.

A shell script cannot currently deal with LVM, RAID, or btrfs filesystems.

Cheers

2010-04-26 14:10:15

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 24/04/10 02:41 PM, Ric Wheeler wrote:
..
> I don't like the user space wiper.sh approach in general, but running
> wiper.sh is entirely your choice.
>
> Most users prefer having the file system and the IO stack take care of
> this for them, but again, entirely configurable.
..

I wrote wiper.sh, and I would prefer more of an in-kernel solution,
simply because it could potentially have the smarts to deal with RAID,
LVM, and btrfs's own duplicate implementation of RAID/LVM.

Those cannot be done from userspace on a mounted filesystem.

So.. again.. in an ideal kernel, I'd like to see use of larger ranges
(and _multiple_ ranges) per TRIM command. And options for the kernel
to do it automatically (-o discard), as well as an ioctl() interface
for userspace to "scrub" (or "wipe") all free ranges in a gradual fashion.

Cheers


2010-04-26 14:43:10

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

>>>>> "Mark" == Mark Lord <[email protected]> writes:

Mark> So.. again.. in an ideal kernel, I'd like to see use of larger
Mark> ranges (and _multiple_ ranges) per TRIM command. And options for
Mark> the kernel to do it automatically (-o discard), as well as an
Mark> ioctl() interface for userspace to "scrub" (or "wipe") all free
Mark> ranges in a gradual fashion.

Again: Discard splitting, merging, and coalescing is coming. I'm
working on it. It's not exactly trivial.

My objection was purely wrt. nuking realtime discard in favor of
scrubbing. We absolutely need both approaches.

But I'm not going to let crappy SSD firmware implementations set the
direction for what I'm working on. It is much more interesting to
ensure that we work well for the devices that'll be shipping 6-12 months
from now (at the time when this stuff will realistically end up in
distro kernels).

--
Martin K. Petersen Oracle Linux Engineering

2010-04-26 15:27:14

by Greg Freemyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On Mon, Apr 26, 2010 at 10:42 AM, Martin K. Petersen
<[email protected]> wrote:
>>>>>> "Mark" == Mark Lord <[email protected]> writes:
>
> Mark> So.. again.. in an ideal kernel, I'd like to see use of larger
> Mark> ranges (and _multiple_ ranges) per TRIM command. And options for
> Mark> the kernel to do it automatically (-o discard), as well as an
> Mark> ioctl() interface for userspace to "scrub" (or "wipe") all free
> Mark> ranges in a gradual fashion.
>
> Again: Discard splitting, merging, and coalescing is coming. I'm
> working on it. It's not exactly trivial.
>
> My objection was purely wrt. nuking realtime discard in favor of
> scrubbing. We absolutely need both approaches.
>
> But I'm not going to let crappy SSD firmware implementations set the
> direction for what I'm working on. It is much more interesting to
> ensure that we work well for the devices that'll be shipping 6-12 months
> from now (at the time where this stuff will realistically end up in
> distro kernels).
>
> --
> Martin K. Petersen      Oracle Linux Engineering
>

Is this an agreed summary as it relates to ext4:

1) Current "-o discard" causes real-time calls to discard. Although
not optimized for current generation hardware, both block-layer
optimizations and new hardware are coming, so it needs to be kept.

2) Kernel based discard scrubbing - Doesn't currently exist in 2.6.34.
All agree that for the LVM, mdraid, and btrfs cases it is needed, and there
is no Linux tool (kernel or userspace) at present. Lukas's proposed
patch is userspace invoked, so a mount option is not needed.

3) Userspace discard is available today and works well with ext4 on a
single JBOD SSD, which will be the typical laptop use case, for example.

Mark, or anyone, do you think it would be a bad idea for me to push
for Opensuse 11.3 (2.6.34 based) to use wiper.sh as a standard ext4
discard tool? hdparm and wiper.sh are already in the distro, it just
needs a cron entry and maybe some basic management infrastructure.
They're at feature freeze for 11.3, so I don't know if I can get it in
or not.

If and when the suse team want to move to the kernel based scrubber,
they can. ie. a year from now when the next release after 11.3 comes
out.

Greg

2010-04-26 15:45:36

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 04/26/2010 10:42 AM, Martin K. Petersen wrote:
>>>>>> "Mark" == Mark Lord<[email protected]> writes:
>
> Mark> So.. again.. in an ideal kernel, I'd like to see use of larger
> Mark> ranges (and _multiple_ ranges) per TRIM command. And options for
> Mark> the kernel to do it automatically (-o discard), as well as an
> Mark> ioctl() interface for userspace to "scrub" (or "wipe") all free
> Mark> ranges in a gradual fashion.
>
> Again: Discard splitting, merging, and coalescing is coming. I'm
> working on it. It's not exactly trivial.
>
> My objection was purely wrt. nuking realtime discard in favor of
> scrubbing. We absolutely need both approaches.
>
> But I'm not going to let crappy SSD firmware implementations set the
> direction for what I'm working on. It is much more interesting to
> ensure that we work well for the devices that'll be shipping 6-12 months
> from now (at the time where this stuff will realistically end up in
> distro kernels).
>

And one thing to keep in mind is that we are likely to need both run time
support for discard as well as occasional resync in bulk, since the storage can
choose to ignore the command and still be in spec. Kind of like the unfortunate
"defrag" thing that Windows people do.

ric

2010-04-26 15:51:34

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On Mon, 26 Apr 2010, Greg Freemyer wrote:

> On Mon, Apr 26, 2010 at 10:42 AM, Martin K. Petersen
> <[email protected]> wrote:
> >>>>>> "Mark" == Mark Lord <[email protected]> writes:
> >
> > Mark> So.. again.. in an ideal kernel, I'd like to see use of larger
> > Mark> ranges (and _multiple_ ranges) per TRIM command. And options for
> > Mark> the kernel to do it automatically (-o discard), as well as an
> > Mark> ioctl() interface for userspace to "scrub" (or "wipe") all free
> > Mark> ranges in a gradual fashion.
> >
> > Again: Discard splitting, merging, and coalescing is coming. I'm
> > working on it. It's not exactly trivial.
> >
> > My objection was purely wrt. nuking realtime discard in favor of
> > scrubbing. We absolutely need both approaches.
> >
> > But I'm not going to let crappy SSD firmware implementations set the
> > direction for what I'm working on. It is much more interesting to
> > ensure that we work well for the devices that'll be shipping 6-12 months
> > from now (at the time where this stuff will realistically end up in
> > distro kernels).
> >
> > --
> > Martin K. Petersen      Oracle Linux Engineering
> >
>
> Is this an agreed summary as relates to ext4:
>
> 1) Current "-o discard" causes real-time calls to discard. Although
> not optimized for current generation hardware, both block-layer
> optimizations and new hardware are coming, so it needs to be kept.
>
> 2) Kernel based discard scrubbing - Doesn't currently exist in 2.6.34,
> all agree that for the LVM, mdraid, btrfs cases it is needed and there
> is no linux tool (kernel or userspace) at present. The Lukas proposed
> patch is userspace invoked so a mount option is not needed.

It is indeed userspace invoked, but once you set the nodiscard mount
option it will not do anything. It seems only logical to me NOT to
discard anything when the "nodiscard" option is set.

>
> 3) Userspace discard is available today and works well with ext4 on a
> single JBOD SSD which will be the typical laptop use as one example.

It (wiper) does not work on devices which do not support vectored trim
ranges.

>
> Mark, or anyone, do you think it would be a bad idea for me to push
> for Opensuse 11.3 (2.6.34 based) to use wiper.sh as a standard ext4
> discard tool? hdparm and wiper.sh are already in the distro, it just
> needs a cron entry and maybe some basic management infrastructure.
> They're at feature freeze for 11.3, so I don't know if I can get it in
> or not.
>
> If and when the suse team want to move to the kernel based scrubber,
> they can. ie. a year from now when the next release after 11.3 comes
> out.
>
> Greg
>

-Lukas
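
As an aside, the "vectored trim ranges" mentioned above refer to the ATA
DATA SET MANAGEMENT (TRIM) payload, which packs many ranges into a single
command. A minimal sketch of that packing, purely as an illustration of
the payload format and not the actual hdparm/wiper.sh code:

#include <stdint.h>
#include <string.h>

struct trim_range {
	uint64_t lba;		/* starting sector */
	uint16_t nsectors;	/* length in sectors, max 65535 per entry */
};

/*
 * Each payload entry is 8 bytes: bits 0-47 hold the LBA, bits 48-63 the
 * sector count.  The payload is padded to a multiple of 512 bytes.
 * Returns the padded payload size, or -1 if the buffer is too small.
 */
int pack_trim_payload(uint8_t *buf, size_t buflen,
		      const struct trim_range *r, size_t nr)
{
	size_t i, b, used = nr * 8;

	if (used > buflen || buflen % 512)
		return -1;
	memset(buf, 0, buflen);
	for (i = 0; i < nr; i++) {
		uint64_t entry = (r[i].lba & 0xFFFFFFFFFFFFULL) |
				 ((uint64_t)r[i].nsectors << 48);

		for (b = 0; b < 8; b++)		/* little-endian layout */
			buf[i * 8 + b] = (uint8_t)(entry >> (8 * b));
	}
	return (int)((used + 511) & ~(size_t)511);
}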

2010-04-26 16:55:29

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

> On Wed, 21 Apr 2010, Greg Freemyer wrote:
> And also, currently I am rewriting the patch do use rbtree instead of the
> bitmap, because there were some concerns of memory consumption. It is a
> question whether or not the rbtree will be more memory friendly.
> Generally I think that in most "normal" cases it will, but there are some
> extreme scenarios, where the rbtree will be much worse. Any comment on
> this ?
I see two possible improvements here:
a) At the cost of some code complexity, you can bound the worst case by combining
RB-trees with bitmaps. The basic idea is that when the space to TRIM gets too
fragmented (the memory needed to keep to-TRIM blocks in an RB-tree for a given group
exceeds the memory needed to keep them in a bitmap), you convert the RB-tree for the
problematic group to a bitmap and attach it to an appropriate RB-node. If you
also track the number of to-TRIM extents in the bitmap, you can
decide whether it's beneficial to switch back to an RB-tree.

b) Another idea might be: when the to-TRIM space is fragmented (again, let's say
in some block group), there's not much point in sending tiny trim commands
anyway (at least that's what I've understood from this discussion). So you
might as well stop maintaining the information about which blocks need to be trimmed
for that group. When the situation gets better, you can always walk the block
bitmap and issue trim commands. You might even trigger this rescan from the
kernel - if you maintained the number of free block extents for each block group
(which is rather easy), you could trigger the bitmap rescan and trim as soon
as the ratio of free blocks to free extents gets above a reasonable
threshold.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
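
A minimal userspace sketch of the two triggers Jan describes above; the
sizes and thresholds here are hypothetical, not taken from the posted patch:

#include <stdio.h>

#define BLOCKS_PER_GROUP	32768	/* e.g. 4k blocks, 128MB groups */
#define TREE_NODE_BYTES		48	/* rough cost of one rbtree entry */
#define BITMAP_BYTES		(BLOCKS_PER_GROUP / 8)

/* Idea (a): convert the group's rbtree to a bitmap once it costs more. */
static int should_convert_to_bitmap(unsigned long nr_tree_extents)
{
	return nr_tree_extents * TREE_NODE_BYTES > BITMAP_BYTES;
}

/* Idea (b): rescan and trim once free space is "unfragmented enough". */
static int should_rescan_group(unsigned long free_blocks,
			       unsigned long free_extents,
			       unsigned long min_ratio)
{
	return free_extents && free_blocks / free_extents >= min_ratio;
}

int main(void)
{
	printf("convert to bitmap above %u extents per group\n",
	       BITMAP_BYTES / TREE_NODE_BYTES);
	printf("8000 free blocks in 100 extents, ratio >= 64: %s\n",
	       should_rescan_group(8000, 100, 64) ? "rescan" : "wait");
	printf("tree with 200 extents: %s\n",
	       should_convert_to_bitmap(200) ? "convert" : "keep tree");
	return 0;
}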

2010-04-26 17:46:43

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On Mon, 26 Apr 2010, Jan Kara wrote:

> > On Wed, 21 Apr 2010, Greg Freemyer wrote:
> > And also, currently I am rewriting the patch do use rbtree instead of the
> > bitmap, because there were some concerns of memory consumption. It is a
> > question whether or not the rbtree will be more memory friendly.
> > Generally I think that in most "normal" cases it will, but there are some
> > extreme scenarios, where the rbtree will be much worse. Any comment on
> > this ?
> I see two possible improvements here:
> a) At a cost of some code complexity, you can bound the worst case by combining
> RB-trees with bitmaps. The basic idea is that when space to TRIM gets too
> fragmented (memory to keep to-TRIM blocks in RB-tree for a given group exceeds
> the memory needed to keep it in a bitmap), you convert RB-tree for a
> problematic group to a bitmap and attach it to an appropriate RB-node. If you
> track with a bitmap also a number of to-TRIM extents in the bitmap, you can
> also decide whether it's benefitial to switch back to an RB-tree.

This sounds like a good idea, but I wonder if it is worth it:
1. The tree will have a very short life, because with the next ioctl all
stored deleted extents will be trimmed and removed from the tree.
2. Also note that the longer it lives, the less fragmented it is likely
to become.
3. I do not expect that deleted ranges will be too fragmented, and
even if they are, they will probably be merged into one big extent very
soon.

>
> b) Another idea might be: When to-TRIM space is fragmented (again, let's say
> in some block group), there's not much point in sending tiny trim commands
> anyway (at least that's what I've understood from this discussion). So you
> might as well stop maintaining information which blocks we need to trim
> for that group. When the situation gets better, you can always walk block
> bitmap and issue trim commands. You might even trigger this rescan from
> kernel - if you'd maintain number of free block extents for each block group
> (which is rather easy), you could trigger the bitmap rescan and trim as soon
> as ratio number of free blocks / number of extents gets above a reasonable
> threshold.
>
> Honza
>

In what I am preparing now, I simply ignore small extents which would
be created by splitting a deleted extent into smaller pieces by chunks
of used blocks. This, in my opinion, will prevent the fragmentation
which might otherwise occur in the longer term (between ioctl calls).

Thanks for suggestions.
-Lukas

2010-04-26 17:49:14

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 04/26/2010 01:46 PM, Lukas Czerner wrote:
> On Mon, 26 Apr 2010, Jan Kara wrote:
>
>>> On Wed, 21 Apr 2010, Greg Freemyer wrote:
>>> And also, currently I am rewriting the patch do use rbtree instead of the
>>> bitmap, because there were some concerns of memory consumption. It is a
>>> question whether or not the rbtree will be more memory friendly.
>>> Generally I think that in most "normal" cases it will, but there are some
>>> extreme scenarios, where the rbtree will be much worse. Any comment on
>>> this ?
>> I see two possible improvements here:
>> a) At a cost of some code complexity, you can bound the worst case by combining
>> RB-trees with bitmaps. The basic idea is that when space to TRIM gets too
>> fragmented (memory to keep to-TRIM blocks in RB-tree for a given group exceeds
>> the memory needed to keep it in a bitmap), you convert RB-tree for a
>> problematic group to a bitmap and attach it to an appropriate RB-node. If you
>> track with a bitmap also a number of to-TRIM extents in the bitmap, you can
>> also decide whether it's benefitial to switch back to an RB-tree.
>
> This sounds like a good idea, but I wonder if it is worth it :
> 1. The tree will have very short life, because with next ioctl all
> stored deleted extents will be trimmed and removed from the tree.
> 2. Also note, that the longer it lives the less fragmented it possibly
> became.
> 3. I do not expect, that deleted ranges can be too fragmented, and
> even if it is, it will be probably merged into one big extent very
> soon.
>
>>
>> b) Another idea might be: When to-TRIM space is fragmented (again, let's say
>> in some block group), there's not much point in sending tiny trim commands
>> anyway (at least that's what I've understood from this discussion). So you
>> might as well stop maintaining information which blocks we need to trim
>> for that group. When the situation gets better, you can always walk block
>> bitmap and issue trim commands. You might even trigger this rescan from
>> kernel - if you'd maintain number of free block extents for each block group
>> (which is rather easy), you could trigger the bitmap rescan and trim as soon
>> as ratio number of free blocks / number of extents gets above a reasonable
>> threshold.
>>
>> Honza
>>
>
> In what I am preparing now, I simple ignore small extents, which would
> be created by splitting the deleted extent into smaller pieces by chunks
> of used blocks. This, in my opinion, will prevent the fragmentation,
> which otherwise may occur in the longer term (between ioctl calls).
>
> Thanks for suggestions.
> -Lukas

I am not convinced that ignoring small extents is a good idea. Remember that
SSDs specifically remap *everything* internally, so our "fragmentation" - the
set of small spaces - could be useful for them.

That does not mean that we should not try to send larger requests down to the
target device, which is always a good idea I think :-)

ric


2010-04-26 18:14:40

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On Mon, 26 Apr 2010, Ric Wheeler wrote:

> On 04/26/2010 01:46 PM, Lukas Czerner wrote:
> > On Mon, 26 Apr 2010, Jan Kara wrote:
> >
> > > > On Wed, 21 Apr 2010, Greg Freemyer wrote:
> > > > And also, currently I am rewriting the patch do use rbtree instead of
> > > > the
> > > > bitmap, because there were some concerns of memory consumption. It is a
> > > > question whether or not the rbtree will be more memory friendly.
> > > > Generally I think that in most "normal" cases it will, but there are
> > > > some
> > > > extreme scenarios, where the rbtree will be much worse. Any comment on
> > > > this ?
> > > I see two possible improvements here:
> > > a) At a cost of some code complexity, you can bound the worst case by
> > > combining
> > > RB-trees with bitmaps. The basic idea is that when space to TRIM gets too
> > > fragmented (memory to keep to-TRIM blocks in RB-tree for a given group
> > > exceeds
> > > the memory needed to keep it in a bitmap), you convert RB-tree for a
> > > problematic group to a bitmap and attach it to an appropriate RB-node. If
> > > you
> > > track with a bitmap also a number of to-TRIM extents in the bitmap, you
> > > can
> > > also decide whether it's benefitial to switch back to an RB-tree.
> >
> > This sounds like a good idea, but I wonder if it is worth it :
> > 1. The tree will have very short life, because with next ioctl all
> > stored deleted extents will be trimmed and removed from the tree.
> > 2. Also note, that the longer it lives the less fragmented it possibly
> > became.
> > 3. I do not expect, that deleted ranges can be too fragmented, and
> > even if it is, it will be probably merged into one big extent very
> > soon.
> >
> > >
> > > b) Another idea might be: When to-TRIM space is fragmented (again, let's
> > > say
> > > in some block group), there's not much point in sending tiny trim commands
> > > anyway (at least that's what I've understood from this discussion). So you
> > > might as well stop maintaining information which blocks we need to trim
> > > for that group. When the situation gets better, you can always walk block
> > > bitmap and issue trim commands. You might even trigger this rescan from
> > > kernel - if you'd maintain number of free block extents for each block
> > > group
> > > (which is rather easy), you could trigger the bitmap rescan and trim as
> > > soon
> > > as ratio number of free blocks / number of extents gets above a reasonable
> > > threshold.
> > >
> > > Honza
> > >
> >
> > In what I am preparing now, I simple ignore small extents, which would
> > be created by splitting the deleted extent into smaller pieces by chunks
> > of used blocks. This, in my opinion, will prevent the fragmentation,
> > which otherwise may occur in the longer term (between ioctl calls).
> >
> > Thanks for suggestions.
> > -Lukas
>
> I am not convinced that ignoring small extents is a good idea. Remember that
> for SSD's specifically, they remap *everything* internally so our
> "fragmentation" set of small spaces could be useful for them.
>
> That does not mean that we should not try to send larger requests down to the
> target device which is always a good idea I think :-)
>
> ric
>

That's right, so the other approach would probably be better: merge
small extents together into one, but there must be some limit, because I
do not want two little extents at the beginning and the end of a group
to force trimming of the whole group. The whole rbtree thing is getting a
little complicated :)

-Lukas
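
A minimal sketch of the gap-limited merging Lukas describes above; the
structure and the max_gap threshold are hypothetical and not part of the
posted patch:

#include <stdio.h>

struct deleted_extent {
	unsigned int start;	/* group-relative first block */
	unsigned int count;	/* number of blocks */
};

/*
 * Merge b into a only when they overlap or the run of used blocks between
 * them is at most max_gap, so two tiny extents at opposite ends of a group
 * never force a whole-group trim.  Assumes a->start <= b->start.
 */
static int try_merge(struct deleted_extent *a,
		     const struct deleted_extent *b, unsigned int max_gap)
{
	unsigned int a_end = a->start + a->count;
	unsigned int b_end = b->start + b->count;

	if (b->start > a_end && b->start - a_end > max_gap)
		return 0;		/* gap too large, keep them separate */
	if (b_end > a_end)
		a->count = b_end - a->start;
	return 1;
}

int main(void)
{
	struct deleted_extent a = { 0, 8 }, b = { 100, 4 };

	printf("merge across 92-block gap (max_gap 16): %d\n",
	       try_merge(&a, &b, 16));
	return 0;
}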

2010-04-26 18:28:17

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

Lukas Czerner <[email protected]> writes:

> On Mon, 26 Apr 2010, Ric Wheeler wrote:
>
>> On 04/26/2010 01:46 PM, Lukas Czerner wrote:
>> > On Mon, 26 Apr 2010, Jan Kara wrote:
>> >
>> > > > On Wed, 21 Apr 2010, Greg Freemyer wrote:
>> > > > And also, currently I am rewriting the patch do use rbtree instead of
>> > > > the
>> > > > bitmap, because there were some concerns of memory consumption. It is a
>> > > > question whether or not the rbtree will be more memory friendly.
>> > > > Generally I think that in most "normal" cases it will, but there are
>> > > > some
>> > > > extreme scenarios, where the rbtree will be much worse. Any comment on
>> > > > this ?
>> > > I see two possible improvements here:
>> > > a) At a cost of some code complexity, you can bound the worst case by
>> > > combining
>> > > RB-trees with bitmaps. The basic idea is that when space to TRIM gets too
>> > > fragmented (memory to keep to-TRIM blocks in RB-tree for a given group
>> > > exceeds
>> > > the memory needed to keep it in a bitmap), you convert RB-tree for a
>> > > problematic group to a bitmap and attach it to an appropriate RB-node. If
>> > > you
>> > > track with a bitmap also a number of to-TRIM extents in the bitmap, you
>> > > can
>> > > also decide whether it's benefitial to switch back to an RB-tree.
>> >
>> > This sounds like a good idea, but I wonder if it is worth it :
>> > 1. The tree will have very short life, because with next ioctl all
>> > stored deleted extents will be trimmed and removed from the tree.
>> > 2. Also note, that the longer it lives the less fragmented it possibly
>> > became.
>> > 3. I do not expect, that deleted ranges can be too fragmented, and
>> > even if it is, it will be probably merged into one big extent very
>> > soon.
>> >
>> > >
>> > > b) Another idea might be: When to-TRIM space is fragmented (again, let's
>> > > say
>> > > in some block group), there's not much point in sending tiny trim commands
>> > > anyway (at least that's what I've understood from this discussion). So you
>> > > might as well stop maintaining information which blocks we need to trim
>> > > for that group. When the situation gets better, you can always walk block
>> > > bitmap and issue trim commands. You might even trigger this rescan from
>> > > kernel - if you'd maintain number of free block extents for each block
>> > > group
>> > > (which is rather easy), you could trigger the bitmap rescan and trim as
>> > > soon
>> > > as ratio number of free blocks / number of extents gets above a reasonable
>> > > threshold.
>> > >
>> > > Honza
>> > >
>> >
>> > In what I am preparing now, I simple ignore small extents, which would
>> > be created by splitting the deleted extent into smaller pieces by chunks
>> > of used blocks. This, in my opinion, will prevent the fragmentation,
>> > which otherwise may occur in the longer term (between ioctl calls).
>> >
>> > Thanks for suggestions.
>> > -Lukas
>>
>> I am not convinced that ignoring small extents is a good idea. Remember that
>> for SSD's specifically, they remap *everything* internally so our
>> "fragmentation" set of small spaces could be useful for them.
>>
>> That does not mean that we should not try to send larger requests down to the
>> target device which is always a good idea I think :-)
>>
>> ric
>>
>
> That's right, so the other approach would be probably better. Merge
> small extents together into one, but there must be some limit, because I
> do not want two little extents at the beginning and the end of the group
> to force trimming whole group. The whole rbtree thing gets a little
> complicated :)

This discussion is getting a bit too abstract for me. Show us the code
and we can make some progress. =)

On the topic of discarding small blocks, I agree with Ric, it should be
done.

Cheers,
Jeff

2010-04-26 18:38:17

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4 - using rbtree

Create an ioctl which walks through all the free extents in each
allocation group and discards those extents. In addition, to improve
its performance, one can specify a minimum free extent length, so the
ioctl will not bother with shorter extents.

This of course means that with each invocation the ioctl must walk
through the whole file system, checking and discarding free extents, which
is not very efficient. The best way to avoid this is to keep track of
deleted (freed) blocks. Then the ioctl has to trim just those free
extents which were recently freed.

In order to implement this I have created a new structure,
ext4_deleted_data, which represents a deleted extent in a per-group rbtree.
When blocks are freed, extents are stored in the tree (possibly merged
with other extents). The ioctl can then walk through the tree, take
extents out of the tree, compare them with the free extents and possibly
trim them.

Note that there is support for setting a minimum extent length in the
ioctl call, so it will not bother with shorter extents. Also, when a
previously deleted range is taken from the tree and it is not entirely
free, the free fragments are discarded and extents shorter than minlen
are NOT returned to the tree, to avoid fragmentation of the tree
which could lead to high memory consumption.

But you may notice that there is one problem: the per-group tree of
deleted extents does not survive umount. To bypass the problem the first
ioctl call has to walk through the whole file system, trimming all free
extents.

Signed-off-by: Lukas Czerner <[email protected]>
---
fs/ext4/ext4.h | 5 +
fs/ext4/mballoc.c | 321 +++++++++++++++++++++++++++++++++++++++++++++++++++--
fs/ext4/mballoc.h | 9 ++
fs/ext4/super.c | 1 +
4 files changed, 325 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf938cf..41fe9ec 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1437,6 +1437,8 @@ extern int ext4_mb_add_groupinfo(struct super_block *sb,
extern int ext4_mb_get_buddy_cache_lock(struct super_block *, ext4_group_t);
extern void ext4_mb_put_buddy_cache_lock(struct super_block *,
ext4_group_t, int);
+extern int ext4_trim_fs(unsigned int, struct super_block *);
+
/* inode.c */
struct buffer_head *ext4_getblk(handle_t *, struct inode *,
ext4_lblk_t, int, int *);
@@ -1682,6 +1684,9 @@ struct ext4_group_info {
#ifdef DOUBLE_CHECK
void *bb_bitmap;
#endif
+ ext4_grpblk_t bb_deleted;
+ struct rb_root bb_deleted_root;
+
struct rw_semaphore alloc_sem;
ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
* regions, index is order.
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b423a36..ebfe9d8 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -338,6 +338,7 @@
static struct kmem_cache *ext4_pspace_cachep;
static struct kmem_cache *ext4_ac_cachep;
static struct kmem_cache *ext4_free_ext_cachep;
+static struct kmem_cache *ext4_deleted_ext_cachep;
static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
ext4_group_t group);
static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
@@ -2255,6 +2256,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
init_rwsem(&meta_group_info[i]->alloc_sem);
meta_group_info[i]->bb_free_root = RB_ROOT;
+ meta_group_info[i]->bb_deleted_root = RB_ROOT;
+ meta_group_info[i]->bb_deleted = -1;
+
+

#ifdef DOUBLE_CHECK
{
@@ -2516,6 +2521,106 @@ int ext4_mb_release(struct super_block *sb)
return 0;
}

+static int can_merge_deleted(struct ext4_deleted_data *entry1,
+ struct ext4_deleted_data *entry2)
+{
+ if (((entry1->start_blk + entry1->count) >= entry2->start_blk))
+ return 1;
+ return 0;
+}
+
+static int store_deleted_entry(struct ext4_free_data *free_entry,
+ struct ext4_group_info *db)
+{
+ struct rb_node **n = &db->bb_deleted_root.rb_node, *node;
+ struct rb_node *parent = NULL, *new_node;
+ ext4_grpblk_t block, blk_end;
+ struct ext4_deleted_data *new_entry;
+ int count = 0;
+
+ block = free_entry->start_blk;
+ blk_end = free_entry->start_blk + free_entry->count;
+
+ /* Find out where to put the new entry */
+ while (*n) {
+ struct ext4_deleted_data *entry;
+
+ parent = *n;
+ entry = rb_entry(parent, struct ext4_deleted_data, node);
+ if ((block >= entry->start_blk) &&
+ (blk_end <= (entry->start_blk + entry->count))) {
+ /* Embedded interval */
+ return count;
+ } else if (block < entry->start_blk)
+ n = &(*n)->rb_left;
+ else if (block >= entry->start_blk)
+ n = &(*n)->rb_right;
+ else
+ break;
+ }
+
+ /* Allocate and insert the new entry */
+ new_entry = kmem_cache_alloc(ext4_deleted_ext_cachep, GFP_NOFS);
+ new_entry->start_blk = free_entry->start_blk;
+ new_entry->count = count = free_entry->count;
+
+ new_node = &new_entry->node;
+
+ rb_link_node(new_node, parent, n);
+ rb_insert_color(new_node, &db->bb_deleted_root);
+
+ /* Now lets see if the new entry can be merged to left */
+ node = rb_prev(new_node);
+ if (node) {
+ struct ext4_deleted_data *entry;
+
+ entry = rb_entry(node, struct ext4_deleted_data, node);
+ if (can_merge_deleted(entry, new_entry)) {
+ count -= entry->count;
+
+
+ new_entry->count = count =
+ (new_entry->count + new_entry->start_blk)
+ - entry->start_blk;
+ new_entry->start_blk = entry->start_blk;
+
+
+ rb_erase(node, &(db->bb_deleted_root));
+ kmem_cache_free(ext4_deleted_ext_cachep, entry);
+ }
+ }
+
+ /*
+ * Lets see if the new entry can be merged to the right
+ * There can be multiple merges as the new entry's interval
+ * can overlap other entries.
+ */
+ node = rb_next(new_node);
+ while (node) {
+ struct ext4_deleted_data *entry;
+
+ entry = rb_entry(node, struct ext4_deleted_data, node);
+ if (can_merge_deleted(new_entry, entry)) {
+ ext4_grpblk_t new_count;
+ count -= entry->count;
+
+ new_count = (entry->start_blk + entry->count)
+ - new_entry->start_blk;
+ if (new_count > new_entry->count)
+ new_entry->count = new_count;
+
+ rb_erase(node, &(db->bb_deleted_root));
+ kmem_cache_free(ext4_deleted_ext_cachep, entry);
+ } else {
+ /* No more merging needed */
+ break;
+ }
+ node = rb_next(new_node);
+ }
+
+ return count;
+}
+
/*
* This function is called by the jbd2 layer once the commit has finished,
* so we know we can free the blocks that were released with that commit.
@@ -2535,17 +2640,6 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
mb_debug(1, "gonna free %u blocks in group %u (0x%p):",
entry->count, entry->group, entry);

- if (test_opt(sb, DISCARD)) {
- ext4_fsblk_t discard_block;
-
- discard_block = entry->start_blk +
- ext4_group_first_block_no(sb, entry->group);
- trace_ext4_discard_blocks(sb,
- (unsigned long long)discard_block,
- entry->count);
- sb_issue_discard(sb, discard_block, entry->count);
- }
-
err = ext4_mb_load_buddy(sb, entry->group, &e4b);
/* we expect to find existing buddy because it's pinned */
BUG_ON(err != 0);
@@ -2554,6 +2648,7 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
/* there are blocks to put in buddy to make them really free */
count += entry->count;
count2++;
+
ext4_lock_group(sb, entry->group);
/* Take it out of per group rb tree */
rb_erase(&entry->node, &(db->bb_free_root));
@@ -2566,6 +2661,9 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
page_cache_release(e4b.bd_buddy_page);
page_cache_release(e4b.bd_bitmap_page);
}
+ if (test_opt(sb, DISCARD) && (db->bb_deleted >= 0))
+ db->bb_deleted += store_deleted_entry(entry, db);
+
ext4_unlock_group(sb, entry->group);
kmem_cache_free(ext4_free_ext_cachep, entry);
ext4_mb_release_desc(&e4b);
@@ -2635,6 +2733,17 @@ int __init init_ext4_mballoc(void)
kmem_cache_destroy(ext4_ac_cachep);
return -ENOMEM;
}
+
+ ext4_deleted_ext_cachep =
+ kmem_cache_create("ext4_deleted_block_extents",
+ sizeof(struct ext4_deleted_data),
+ 0, SLAB_RECLAIM_ACCOUNT, NULL);
+ if (ext4_deleted_ext_cachep == NULL) {
+ kmem_cache_destroy(ext4_pspace_cachep);
+ kmem_cache_destroy(ext4_ac_cachep);
+ kmem_cache_destroy(ext4_free_ext_cachep);
+ return -ENOMEM;
+ }
ext4_create_debugfs_entry();
return 0;
}
@@ -2649,6 +2758,7 @@ void exit_ext4_mballoc(void)
kmem_cache_destroy(ext4_pspace_cachep);
kmem_cache_destroy(ext4_ac_cachep);
kmem_cache_destroy(ext4_free_ext_cachep);
+ kmem_cache_destroy(ext4_deleted_ext_cachep);
ext4_remove_debugfs_entry();
}

@@ -4640,3 +4750,192 @@ error_return:
kmem_cache_free(ext4_ac_cachep, ac);
return;
}
+
+/*
+ * Trim "count" blocks starting at "start" in "group"
+ * This must be called under group lock
+ */
+void ext4_trim_extent(struct super_block *sb, int start, int count,
+ ext4_group_t group, struct ext4_buddy *e4b)
+{
+ ext4_fsblk_t discard_block;
+ struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+ struct ext4_free_extent ex;
+
+ assert_spin_locked(ext4_group_lock_ptr(sb, group));
+
+ ex.fe_start = start;
+ ex.fe_group = group;
+ ex.fe_len = count;
+
+ mb_mark_used(e4b, &ex);
+ ext4_unlock_group(sb, group);
+
+ discard_block = (ext4_fsblk_t)group *
+ EXT4_BLOCKS_PER_GROUP(sb)
+ + start
+ + le32_to_cpu(es->s_first_data_block);
+ trace_ext4_discard_blocks(sb,
+ (unsigned long long)discard_block,
+ count);
+ sb_issue_discard(sb, discard_block, count);
+
+ ext4_lock_group(sb, group);
+ mb_free_blocks(NULL, e4b, start, ex.fe_len);
+}
+
+/*
+ * Trim all free blocks, which have at least minlen length
+ * */
+ext4_grpblk_t ext4_trim_all_free(struct super_block *sb, struct ext4_buddy *e4b,
+ ext4_grpblk_t minblocks)
+{
+ void *bitmap;
+ ext4_grpblk_t max = EXT4_BLOCKS_PER_GROUP(sb);
+ ext4_grpblk_t start, next, count = 0;
+ ext4_group_t group;
+
+ BUG_ON(e4b == NULL);
+
+ bitmap = e4b->bd_bitmap;
+ group = e4b->bd_group;
+ start = 0;
+ ext4_lock_group(sb, group);
+
+ while (start < max) {
+
+ start = mb_find_next_zero_bit(bitmap, max, start);
+ if (start >= max)
+ break;
+ next = mb_find_next_bit(bitmap, max, start);
+
+ if ((next - start) >= minblocks) {
+
+ count += next - start;
+ ext4_trim_extent(sb, start,
+ next - start, group, e4b);
+
+ }
+ start = next + 1;
+ }
+
+ e4b->bd_info->bb_deleted = 0;
+
+ ext4_unlock_group(sb, group);
+
+ return count;
+}
+
+/*
+ *Trim only blocks which is free and in bb_deleted rbtree
+ */
+ext4_grpblk_t ext4_trim_deleted(struct super_block *sb, struct ext4_buddy *e4b,
+ ext4_grpblk_t minblocks)
+{
+ void *bitmap;
+ struct ext4_group_info *grp;
+ ext4_group_t group;
+ ext4_grpblk_t max, next, count = 0, start = 0;
+ struct rb_node *node;
+ struct ext4_deleted_data *entry;
+
+ BUG_ON(e4b == NULL);
+
+ bitmap = e4b->bd_bitmap;
+ group = e4b->bd_group;
+ grp = ext4_get_group_info(sb, group);
+
+ if (grp->bb_deleted < minblocks)
+ return 0;
+
+ ext4_lock_group(sb, group);
+
+ for (node = rb_first(&grp->bb_deleted_root);
+ node; node = rb_next(node))
+ {
+
+ entry = rb_entry(node, struct ext4_deleted_data, node);
+
+ start = entry->start_blk;
+ max = entry->start_blk + entry->count;
+
+ if (entry->count < minblocks)
+ continue;
+
+ count += entry->count;
+
+ rb_erase(node, &(grp->bb_deleted_root));
+ kmem_cache_free(ext4_deleted_ext_cachep, entry);
+
+ while (start < max) {
+ start = mb_find_next_zero_bit(bitmap, max, start);
+ if (start >= max)
+ break;
+ next = mb_find_next_bit(bitmap, max, start);
+ if (next > max)
+ next = max;
+
+ if ((next - start) >= minblocks) {
+
+ ext4_trim_extent(sb, start,
+ next - start, group, e4b);
+ } else {
+ /*
+ * Do not bother with shorter extents to
+ * eliminate extreme scenarios, when rbtree
+ * can consume just too much memory.
+ */
+ }
+ start = next + 1;
+ }
+ }
+
+ grp->bb_deleted -= count;
+
+ ext4_unlock_group(sb, group);
+
+ ext4_debug("trimmed %d blocks in the group %d\n",
+ count, group);
+
+ return count;
+}
+
+int ext4_trim_fs(unsigned int minlen, struct super_block *sb)
+{
+ struct ext4_buddy e4b;
+ struct ext4_group_info *grp;
+ ext4_group_t group;
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
+ ext4_grpblk_t minblocks;
+
+ if (!test_opt(sb, DISCARD))
+ return 0;
+
+ minblocks = DIV_ROUND_UP(minlen, sb->s_blocksize);
+ if (unlikely(minblocks > EXT4_BLOCKS_PER_GROUP(sb)))
+ return -EINVAL;
+
+ for (group = 0; group < ngroups; group++) {
+ int err;
+
+ err = ext4_mb_load_buddy(sb, group, &e4b);
+ if (err) {
+ ext4_error(sb, "Error in loading buddy "
+ "information for %u", group);
+ continue;
+ }
+ grp = ext4_get_group_info(sb, group);
+
+ if (grp->bb_deleted < 0) {
+ /* First run after mount */
+ ext4_trim_all_free(sb, &e4b, minblocks);
+ } else if (grp->bb_deleted >= minblocks) {
+ /* Trim only blocks deleted since first run */
+ ext4_trim_deleted(sb, &e4b, minblocks);
+ }
+
+ ext4_mb_release_desc(&e4b);
+ }
+
+ return 0;
+}
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index b619322..b25397a 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -95,6 +95,15 @@ extern u8 mb_enable_debug;
#define MB_DEFAULT_GROUP_PREALLOC 512


+struct ext4_deleted_data {
+ /* this links the deleted block information from group_info */
+ struct rb_node node;
+
+ /* deleted block extent */
+ ext4_grpblk_t start_blk;
+ ext4_grpblk_t count;
+};
+
struct ext4_free_data {
/* this links the free block information from group_info */
struct rb_node node;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e14d22c..253eb98 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1109,6 +1109,7 @@ static const struct super_operations ext4_sops = {
.quota_write = ext4_quota_write,
#endif
.bdev_try_to_free_page = bdev_try_to_free_page,
+ .trim_fs = ext4_trim_fs
};

static const struct super_operations ext4_nojournal_sops = {
--
1.6.6.1


2010-04-26 18:43:02

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4 - using rbtree

On Mon, 26 Apr 2010, Lukas Czerner wrote:

> Create an ioctl which walks through all the free extents in each
> allocating group and discard those extents. As an addition to improve
> its performance one can specify minimum free extent length, so ioctl
> will not bother with shorter extents.
>
> This of course means, that with each invocation the ioctl must walk
> through whole file system, checking and discarding free extents, which
> is not very efficient. The best way to avoid this is to keep track of
> deleted (freed) blocks. Then the ioctl have to trim just those free
> extents which were recently freed.
>
> In order to implement this I have created new structure
> ext4_deleted_data which represents deleted extent in per-group rbtree.
> When blocks are freed, extents are stored in the tree (possibly merged
> with other extents). The ioctl then can walk through the tree, take out
> extents from the tree, compare them with the free extents and possibly
> trim them.
>
> Note that there is support for setting minimum extent length in ioctl
> call, so it will not bother with shorter extents. Also, when the
> previously deleted range is taken from the tree and it is not entirely
> free, the free fragments are discarded and extents shorter than minlen
> are NOT returned back to the tree to avoid fragmentation of the tree
> which could lead to the big memory consumption.
>
> But you may notice, that there is one problem. bb_bitmap_deleted does
> not survive umount. To bypass the problem the first ioctl call have to
> walk through whole file system trimming all free extents.
>
> Signed-off-by: Lukas Czerner <[email protected]>

For now it just ignores the small extents to avoid fragmentation. As I
said before, I agree that they should not be ignored; I just need to
figure out a way to do this efficiently.

Also, it has not been properly tested yet.


-Lukas.

2010-04-27 15:29:17

by Edward Shishkin

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4 - using rbtree

Lukas Czerner wrote:
> On Mon, 26 Apr 2010, Lukas Czerner wrote:
>
>
>> Create an ioctl which walks through all the free extents in each
>> allocating group and discard those extents. As an addition to improve
>> its performance one can specify minimum free extent length, so ioctl
>> will not bother with shorter extents.
>>
>> This of course means, that with each invocation the ioctl must walk
>> through whole file system, checking and discarding free extents, which
>> is not very efficient. The best way to avoid this is to keep track of
>> deleted (freed) blocks. Then the ioctl have to trim just those free
>> extents which were recently freed.
>>
>> In order to implement this I have created new structure
>> ext4_deleted_data which represents deleted extent in per-group rbtree.
>> When blocks are freed, extents are stored in the tree (possibly merged
>> with other extents). The ioctl then can walk through the tree, take out
>> extents from the tree, compare them with the free extents and possibly
>> trim them.
>>
>> Note that there is support for setting minimum extent length in ioctl
>> call, so it will not bother with shorter extents. Also, when the
>> previously deleted range is taken from the tree and it is not entirely
>> free, the free fragments are discarded and extents shorter than minlen
>> are NOT returned back to the tree to avoid fragmentation of the tree
>> which could lead to the big memory consumption.
>>
>> But you may notice, that there is one problem. bb_bitmap_deleted does
>> not survive umount. To bypass the problem the first ioctl call have to
>> walk through whole file system trimming all free extents.
>>
>> Signed-off-by: Lukas Czerner <[email protected]>
>>
>
> For now it just ignores the small extents to avoid fragmentation. As I
> said before, I agree that they should not be ignored, I just need to
> figure out the way to do this efficiently.

I suggest not starting with optimisations: let's first take a look at
what is going on in the simple case. Benchmarks are our friends.

Edward.

>
>
> Also it was not properly tested yet.
>
>
> -Lukas.
>


2010-04-28 01:25:04

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH 2/2] Add batched discard support for ext4.

On 26/04/10 11:27 AM, Greg Freemyer wrote:
..
> Mark, or anyone, do you think it would be a bad idea for me to push
> for Opensuse 11.3 (2.6.34 based) to use wiper.sh as a standard ext4
> discard tool? hdparm and wiper.sh are already in the distro, it just
> needs a cron entry and maybe some basic management infrastructure.
> They're at feature freeze for 11.3, so I don't know if I can get it in
> or not.
..

I'd hold off for now. Rumour has it that Intel SSDs are not 100% compliant
with TRIM -- they fail with large amounts of range data.

I don't have an Intel SSD to verify that or fix it with.

Cheers