LinuxLists.cc - [PATCH 1/4] block: implement compatible DISCARD support

2010-02-11 12:00:10

Subject: [PATCH 1/4] block: implement compatible DISCARD support

Currently there are no many discs has native TRIM (aka) discard
feature support. But in fact this is good feature. We can easily
simlulate it for devices which has not native support.
In compat mode discard dequest transforms in to simple zerofiled
write request.
In fact currently blkdev_issue_discard function implemented
incorrectly.
1) Whait flags not optimal we dont have to wait for each bio in flight.
2) Not wait by default. Which makes it fairly useless.
3) Send each bio with barrier flag(if requested). Which result in
bad performance. In fact caller just want to make shure that full
request is completed and ordered against other requests.
5) It use allocated_page instead of ZERO_PAGE.

This patch introduce generic blkdev_issue_zeroout() function which also
may be used for native discard request support, in this case zero payload
simply ignored.

Signed-off-by: Dmitry Monakhov <[email protected]>
---
block/blk-barrier.c | 190 +++++++++++++++++++++++++++++-------------------
block/ioctl.c | 3 +-
include/linux/blkdev.h | 6 +-
3 files changed, 121 insertions(+), 78 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index 8618d89..5894a38 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -340,106 +340,148 @@ int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
}
EXPORT_SYMBOL(blkdev_issue_flush);

-static void blkdev_discard_end_io(struct bio *bio, int err)
+struct bio_batch
{
+ atomic_t done;
+ unsigned long flags;
+ struct completion *wait;
+};
+
+static void bio_batch_end_io(struct bio *bio, int err)
+{
+ struct bio_batch *bb = bio->bi_private;
if (err) {
if (err == -EOPNOTSUPP)
- set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
- clear_bit(BIO_UPTODATE, &bio->bi_flags);
+ set_bit(BIO_EOPNOTSUPP, &bb->flags);
+ else
+ clear_bit(BIO_UPTODATE, &bb->flags);
}
-
- if (bio->bi_private)
- complete(bio->bi_private);
- __free_page(bio_page(bio));
-
+ atomic_inc(&bb->done);
+ complete(bb->wait);
bio_put(bio);
}

/**
- * blkdev_issue_discard - queue a discard
- * @bdev: blockdev to issue discard for
+ * blkdev_issue_zeroout generate number of zero filed write bios
+ * @bdev: blockdev to issue
* @sector: start sector
* @nr_sects: number of sectors to discard
* @gfp_mask: memory allocation flags (for bio_alloc)
- * @flags: DISCARD_FL_* flags to control behaviour
+ * @rw: RW flags
*
* Description:
- * Issue a discard request for the sectors in question.
+ * Generate and issue number of bios with zerofiled pages.
+ * Send barrier at the end if requested. This guarantie that. All bios
+ * submitted before the barrier will be completed before the barrier.
+ * Empty barrier allow us to avoid post queue flush.
*/
-int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
- sector_t nr_sects, gfp_t gfp_mask, int flags)
+
+int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, sector_t max_sects, gfp_t mask, int rw)
{
- DECLARE_COMPLETION_ONSTACK(wait);
- struct request_queue *q = bdev_get_queue(bdev);
- int type = flags & DISCARD_FL_BARRIER ?
- DISCARD_BARRIER : DISCARD_NOBARRIER;
- struct bio *bio;
- struct page *page;
int ret = 0;
+ struct bio *bio;
+ struct bio_batch bb;
+ unsigned int sz, issued = 0;
+ DECLARE_COMPLETION_ONSTACK(wait);
+ unsigned int do_barrier = rw | (1 << BIO_RW_BARRIER);
+ rw &= ~(1 << BIO_RW_BARRIER);
+ BUG_ON(!(rw | (1 << BIO_RW)));

- if (!q)
- return -ENXIO;
-
- if (!blk_queue_discard(q))
- return -EOPNOTSUPP;
-
- while (nr_sects && !ret) {
- unsigned int sector_size = q->limits.logical_block_size;
- unsigned int max_discard_sectors =
- min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+ atomic_set(&bb.done, 0);
+ bb.flags = 1 << BIO_UPTODATE;
+ bb.wait = &wait;

- bio = bio_alloc(gfp_mask, 1);
+submit:
+ while (nr_sects != 0) {
+ bio = bio_alloc(mask, min(nr_sects, (sector_t)BIO_MAX_PAGES));
if (!bio)
- goto out;
- bio->bi_sector = sector;
- bio->bi_end_io = blkdev_discard_end_io;
- bio->bi_bdev = bdev;
- if (flags & DISCARD_FL_WAIT)
- bio->bi_private = &wait;
-
- /*
- * Add a zeroed one-sector payload as that's what
- * our current implementations need. If we'll ever need
- * more the interface will need revisiting.
- */
- page = alloc_page(gfp_mask | __GFP_ZERO);
- if (!page)
- goto out_free_bio;
- if (bio_add_pc_page(q, bio, page, sector_size, 0) < sector_size)
- goto out_free_page;
+ break;

+ bio->bi_sector = sector;
+ bio->bi_bdev = bdev;
+ bio->bi_private = &bb;
+ bio->bi_end_io = bio_batch_end_io;
+
+ while(nr_sects != 0) {
+ sz = min(PAGE_SIZE >> 9 , nr_sects);
+ if (max_sects - bio_sectors(bio) < sz)
+ sz = max_sects - bio_sectors(bio);
+ if (sz == 0)
+ /* bio has maximum size possible */
+ break;
+ ret = bio_add_page(bio, ZERO_PAGE(0), sz << 9, 0);
+ nr_sects -= ret >> 9;
+ sector += ret >> 9;
+ if (ret != sz)
+ break;
+ }
+ issued++;
+ submit_bio(rw, bio);
+ }
+ if (nr_sects == 0) {
/*
- * And override the bio size - the way discard works we
- * touch many more blocks on disk than the actual payload
- * length.
+ * We have issued all data. Send final barrier if necessery.
*/
- if (nr_sects > max_discard_sectors) {
- bio->bi_size = max_discard_sectors << 9;
- nr_sects -= max_discard_sectors;
- sector += max_discard_sectors;
- } else {
- bio->bi_size = nr_sects << 9;
- nr_sects = 0;
- }
+ if (do_barrier)
+ ret = blkdev_issue_flush(bdev, NULL);
+ }
+ /* Wait for submitted bios and then continue */
+ while (issued != atomic_read(&bb.done))
+ wait_for_completion(&wait);

- bio_get(bio);
- submit_bio(type, bio);
+ if (!test_bit(BIO_UPTODATE, &bb.flags))
+ /* One of bios in the batch was completed with error.*/
+ ret = -EIO;

- if (flags & DISCARD_FL_WAIT)
- wait_for_completion(&wait);
+ if (ret)
+ goto out;

- if (bio_flagged(bio, BIO_EOPNOTSUPP))
- ret = -EOPNOTSUPP;
- else if (!bio_flagged(bio, BIO_UPTODATE))
- ret = -EIO;
- bio_put(bio);
+ if (test_bit(BIO_EOPNOTSUPP, &bb.flags)) {
+ ret = -EOPNOTSUPP;
+ goto out;
}
- return ret;
-out_free_page:
- __free_page(page);
-out_free_bio:
- bio_put(bio);
+ if (nr_sects != 0)
+ goto submit;
out:
- return -ENOMEM;
+ return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+/**
+ * blkdev_issue_discard - queue a discard
+ * @bdev: blockdev to issue discard for
+ * @sector: start sector
+ * @nr_sects: number of sectors to discard
+ * @gfp_mask: memory allocation flags (for bio_alloc)
+ * @flags: DISCARD_FL_* flags to control behaviour
+ *
+ * Description:
+ * Issue a discard request for the sectors in question.
+ */
+int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, gfp_t gfp_mask, int flags)
+{
+ int type = flags & DISCARD_FL_BARRIER ?
+ DISCARD_BARRIER : DISCARD_NOBARRIER;
+ struct request_queue *q = bdev_get_queue(bdev);
+ unsigned int max_size;
+
+ if (!blk_queue_discard(q) && !(flags & DISCARD_FL_BARRIER))
+ return -EOPNOTSUPP;
+ /*
+ * Generate request with zeroed payload.
+ * If device has native discard support it simply ignore this payload.
+ * In case of compat mode this request will be sent as a simple
+ * write request.
+ */
+ if (!blk_queue_discard(q)) {
+ type &= ~(1 << BIO_RW_DISCARD);
+ max_size = UINT_MAX >> 9;
+ } else {
+ max_size = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+ }
+ return blkdev_issue_zeroout(bdev, sector, nr_sects, max_size,
+ gfp_mask, type);
}
EXPORT_SYMBOL(blkdev_issue_discard);
diff --git a/block/ioctl.c b/block/ioctl.c
index be48ea5..384b71c 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -124,8 +124,7 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,

if (start + len > (bdev->bd_inode->i_size >> 9))
return -EINVAL;
- return blkdev_issue_discard(bdev, start, len, GFP_KERNEL,
- DISCARD_FL_WAIT);
+ return blkdev_issue_discard(bdev, start, len, GFP_KERNEL, 0);
}

static int put_ushort(unsigned long arg, unsigned short val)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ffb13ad..c762c9f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -997,8 +997,10 @@ static inline struct request *blk_map_queue_find_tag(struct blk_queue_tag *bqt,
}

extern int blkdev_issue_flush(struct block_device *, sector_t *);
-#define DISCARD_FL_WAIT 0x01 /* wait for completion */
-#define DISCARD_FL_BARRIER 0x02 /* issue DISCARD_BARRIER request */
+#define DISCARD_FL_BARRIER 0x01 /* issue DISCARD_BARRIER request */
+#define DISCARD_FL_COMPAT 0x02 /* allow discard compat request mode */
+extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
+ sector_t nr_sects, sector_t max_sects, gfp_t mask, int rw);
extern int blkdev_issue_discard(struct block_device *, sector_t sector,
sector_t nr_sects, gfp_t, int flags);

--
1.6.6

2010-02-11 11:06:09

by Dmitry Monakhov

[permalink] [raw]

Subject: [PATCH 3/4] ext4: convert extent zeroout to generic function

Signed-off-by: Dmitry Monakhov <[email protected]>
---
fs/ext4/extents.c | 70 +++-------------------------------------------------
1 files changed, 4 insertions(+), 66 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 765a482..ff4473a 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2432,76 +2432,14 @@ void ext4_ext_release(struct super_block *sb)
#endif
}

-static void bi_complete(struct bio *bio, int error)
-{
- complete((struct completion *)bio->bi_private);
-}
-
/* FIXME!! we need to try to merge to left or right after zero-out */
static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)
{
- int ret = -EIO;
- struct bio *bio;
- int blkbits, blocksize;
- sector_t ee_pblock;
- struct completion event;
- unsigned int ee_len, len, done, offset;
-
-
- blkbits = inode->i_blkbits;
- blocksize = inode->i_sb->s_blocksize;
- ee_len = ext4_ext_get_actual_len(ex);
- ee_pblock = ext_pblock(ex);
+ sector_t nr_sects = ((sector_t)ext4_ext_get_actual_len(ex)) <<
+ inode->i_blkbits;

- /* convert ee_pblock to 512 byte sectors */
- ee_pblock = ee_pblock << (blkbits - 9);
-
- while (ee_len > 0) {
-
- if (ee_len > BIO_MAX_PAGES)
- len = BIO_MAX_PAGES;
- else
- len = ee_len;
-
- bio = bio_alloc(GFP_NOIO, len);
- bio->bi_sector = ee_pblock;
- bio->bi_bdev = inode->i_sb->s_bdev;
-
- done = 0;
- offset = 0;
- while (done < len) {
- ret = bio_add_page(bio, ZERO_PAGE(0),
- blocksize, offset);
- if (ret != blocksize) {
- /*
- * We can't add any more pages because of
- * hardware limitations. Start a new bio.
- */
- break;
- }
- done++;
- offset += blocksize;
- if (offset >= PAGE_CACHE_SIZE)
- offset = 0;
- }
-
- init_completion(&event);
- bio->bi_private = &event;
- bio->bi_end_io = bi_complete;
- submit_bio(WRITE, bio);
- wait_for_completion(&event);
-
- if (test_bit(BIO_UPTODATE, &bio->bi_flags))
- ret = 0;
- else {
- ret = -EIO;
- break;
- }
- bio_put(bio);
- ee_len -= done;
- ee_pblock += done << (blkbits - 9);
- }
- return ret;
+ return blkdev_issue_zeroout(inode->i_sb->s_bdev, ext_pblock(ex),
+ nr_sects, UINT_MAX >> 9, GFP_NOIO, WRITE);
}

#define EXT4_EXT_ZERO_LEN 7
--
1.6.6

2010-02-11 11:07:31

by Dmitry Monakhov

[permalink] [raw]

Subject: [PATCH 4/4] btrfs: add discard_compat support

If any device in the list has no native discard support we add
discard compat support. Devices with native discard support still
getting real discard requests.

Signed-off-by: Dmitry Monakhov <[email protected]>
---
fs/btrfs/ctree.h | 1 +
fs/btrfs/disk-io.c | 8 ++++++++
fs/btrfs/extent-tree.c | 14 ++++++++------
fs/btrfs/super.c | 8 +++++++-
fs/btrfs/volumes.c | 6 ++++++
fs/btrfs/volumes.h | 6 +++++-
6 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9f806dd..54854d1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1161,6 +1161,7 @@ struct btrfs_root {
#define BTRFS_MOUNT_SSD_SPREAD (1 << 8)
#define BTRFS_MOUNT_NOSSD (1 << 9)
#define BTRFS_MOUNT_DISCARD (1 << 10)
+#define BTRFS_MOUNT_DISCARD_COMPAT (1 << 11)

#define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt)
#define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 009e3bd..c7d4812 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1945,6 +1945,14 @@ struct btrfs_root *open_ctree(struct super_block *sb,
"mode\n");
btrfs_set_opt(fs_info->mount_opt, SSD);
}
+ if (btrfs_test_opt(tree_root, DISCARD) &&
+ fs_info->fs_devices->discard_compat) {
+ printk(KERN_INFO "Btrfs detected devices without native discard"
+ " support, enabling discard_compat mode mode\n");
+ btrfs_clear_opt(fs_info->mount_opt, DISCARD);
+ btrfs_set_opt(fs_info->mount_opt, DISCARD_COMPAT);
+ }
+

if (btrfs_super_log_root(disk_super) != 0) {
u64 bytenr = btrfs_super_log_root(disk_super);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 432a2da..b4c3124 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1585,20 +1585,21 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
}

static void btrfs_issue_discard(struct block_device *bdev,
- u64 start, u64 len)
+ u64 start, u64 len, int do_compat)
{
blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL,
- DISCARD_FL_BARRIER);
+ DISCARD_FL_BARRIER |
+ (do_compat ? DISCARD_FL_COMPAT : 0));
}

static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
u64 num_bytes)
{
- int ret;
+ int ret, do_compat = btrfs_test_opt(root, DISCARD_COMPAT);
u64 map_length = num_bytes;
struct btrfs_multi_bio *multi = NULL;

- if (!btrfs_test_opt(root, DISCARD))
+ if (!btrfs_test_opt(root, DISCARD) && !do_compat)
return 0;

/* Tell the block device(s) that the sectors can be discarded */
@@ -1614,7 +1615,8 @@ static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
for (i = 0; i < multi->num_stripes; i++, stripe++) {
btrfs_issue_discard(stripe->dev->bdev,
stripe->physical,
- map_length);
+ map_length,
+ do_compat);
}
kfree(multi);
}
@@ -3700,7 +3702,7 @@ static int pin_down_bytes(struct btrfs_trans_handle *trans,
* individual btree blocks isn't a good plan. Just
* pin everything in discard mode.
*/
- if (btrfs_test_opt(root, DISCARD))
+ if (btrfs_test_opt(root, DISCARD) || btrfs_test_opt(root, DISCARD_COMPAT))
goto pinit;

buf = btrfs_find_tree_block(root, bytenr, num_bytes);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 3f9b457..fe7ccf4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -67,7 +67,7 @@ enum {
Opt_max_extent, Opt_max_inline, Opt_alloc_start, Opt_nobarrier,
Opt_ssd, Opt_nossd, Opt_ssd_spread, Opt_thread_pool, Opt_noacl,
Opt_compress, Opt_notreelog, Opt_ratio, Opt_flushoncommit,
- Opt_discard, Opt_err,
+ Opt_discard, Opt_discard_compat, Opt_err,
};

static match_table_t tokens = {
@@ -90,6 +90,7 @@ static match_table_t tokens = {
{Opt_flushoncommit, "flushoncommit"},
{Opt_ratio, "metadata_ratio=%d"},
{Opt_discard, "discard"},
+ {Opt_discard_compat, "discard_compat"},
{Opt_err, NULL},
};

@@ -263,6 +264,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
case Opt_discard:
btrfs_set_opt(info->mount_opt, DISCARD);
break;
+ case Opt_discard_compat:
+ btrfs_set_opt(info->mount_opt, DISCARD_COMPAT);
+ break;
case Opt_err:
printk(KERN_INFO "btrfs: unrecognized mount option "
"'%s'\n", p);
@@ -459,6 +463,8 @@ static int btrfs_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_puts(seq, ",flushoncommit");
if (btrfs_test_opt(root, DISCARD))
seq_puts(seq, ",discard");
+ if (btrfs_test_opt(root, DISCARD_COMPAT))
+ seq_puts(seq, ",discard_compat");
if (!(root->fs_info->sb->s_flags & MS_POSIXACL))
seq_puts(seq, ",noacl");
return 0;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 220dad5..02e18c8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -621,6 +621,9 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
if (!blk_queue_nonrot(bdev_get_queue(bdev)))
fs_devices->rotating = 1;

+ if (!blk_queue_discard(bdev_get_queue(bdev)))
+ fs_devices->discard_compat = 1;
+
fs_devices->open_devices++;
if (device->writeable) {
fs_devices->rw_devices++;
@@ -1522,6 +1525,9 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
if (!blk_queue_nonrot(bdev_get_queue(bdev)))
root->fs_info->fs_devices->rotating = 1;

+ if (!blk_queue_discard(bdev_get_queue(bdev)))
+ root->fs_info->fs_devices->discard_compat = 1;
+
total_bytes = btrfs_super_total_bytes(&root->fs_info->super_copy);
btrfs_set_super_total_bytes(&root->fs_info->super_copy,
total_bytes + device->total_bytes);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 31b0fab..ceb66f0 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -116,7 +116,11 @@ struct btrfs_fs_devices {
/* set when we find or add a device that doesn't have the
* nonrot flag set
*/
- int rotating;
+ unsigned int rotating:1;
+ /* set then we find or add a device that doesn't have the
+ * DISCARD flags set
+ */
+ unsigned int discard_compat:1;
};

struct btrfs_bio_stripe {
--
1.6.6

2010-02-11 11:15:59

by Dmitry Monakhov

[permalink] [raw]

Subject: Re: [PATCH 4/4] btrfs: add discard_compat support

Dmitry Monakhov <[email protected]> writes:

> If any device in the list has no native discard support we add
> discard compat support. Devices with native discard support still
> getting real discard requests.
Note: Seems what enabling discard requests result in triggering
hidden bug. Orphan list corruption.

------------[ cut here ]------------
WARNING: at lib/list_debug.c:26 __list_add+0x3f/0x81()
Hardware name:
list_add corruption. next->prev should be prev (ffff88012029eb20), but was ffff88012239c730. (next=ffff8801223bc348).
Modules linked in: btrfs zlib_deflate libcrc32c cpufreq_ondemand acpi_cpufreq freq_table ppdev parport_pc parport pcspkr [last unloaded: btrfs]
Pid: 1254, comm: fsstress Not tainted 2.6.33-rc5 #3
Call Trace:
[<ffffffff81038f79>] warn_slowpath_common+0x7c/0x94
[<ffffffff81038fe8>] warn_slowpath_fmt+0x41/0x43
[<ffffffffa00df87a>] ? btrfs_unlink_inode+0x243/0x258 [btrfs]
[<ffffffff811c0e22>] __list_add+0x3f/0x81
[<ffffffffa00dfac7>] btrfs_orphan_add+0x61/0x80 [btrfs]
[<ffffffffa00e015a>] btrfs_unlink+0xb2/0xf2 [btrfs]
[<ffffffff810ec5a3>] vfs_unlink+0x67/0xa2
[<ffffffff810eafec>] ? lookup_hash+0x36/0x3a
[<ffffffff810edd97>] do_unlinkat+0xcd/0x15b
[<ffffffff810eabc4>] ? path_put+0x22/0x27
[<ffffffff8107edac>] ? audit_syscall_entry+0x11e/0x14a
[<ffffffff810ede3b>] sys_unlink+0x16/0x18
[<ffffffff81002adb>] system_call_fastpath+0x16/0x1b
---[ end trace 18d84c057a649cf7 ]---
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:3320!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/uevent
CPU 3
Pid: 1250, comm: fsstress Tainted: G W 2.6.33-rc5 #3 DH55TC /
RIP: 0010:[<ffffffffa00e071b>] [<ffffffffa00e071b>] btrfs_setattr+0x101/0x244 [btrfs]
RSP: 0018:ffff88012095de08 EFLAGS: 00010292
RAX: 0000000000000021 RBX: ffff8801223b4bd8 RCX: 000000000000711a
RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffffff81730984
RBP: ffff88012095de48 R08: ffff88012095ddb4 R09: 0000000000000000
R10: 0000000000000001 R11: ffff88012095dd00 R12: ffff88012095ded8
R13: ffff88012029e800 R14: ffff88012233d140 R15: 0000000000000000
FS: 00007f313c3c6700(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffa0ac3f4c CR3: 00000001212c7000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process fsstress (pid: 1250, threadinfo ffff88012095c000, task ffff880122c68000)
Stack:
ffff8801223b4cb0 ffff88011c452180 ffff88012095de28 ffff88012095ded8
<0> ffff88011c452180 ffff8801223b4bd8 0000000000000008 0000000000000008
<0> ffff88012095deb8 ffffffff810f5e52 ffff88012095df38 ffffffff810ee418
Call Trace:
[<ffffffff810f5e52>] notify_change+0x17b/0x2c9
[<ffffffff810ee418>] ? user_path_at+0x64/0x93
[<ffffffff810e1e0d>] do_truncate+0x6a/0x87
[<ffffffff810ea962>] ? generic_permission+0x1c/0x9f
[<ffffffff810eaa5e>] ? get_write_access+0x1d/0x48
[<ffffffff810e1f46>] sys_truncate+0x11c/0x160
[<ffffffff81002adb>] system_call_fastpath+0x16/0x1b
Code: e0 48 89 de 4c 89 f7 49 89 46 20 e8 66 f3 ff ff 85 c0 74 1b 89 c2 48 c7 c6 e0 b4 10 a0 48 c7 c7 9c d1 10 a0 31 c0 e8 7c 5c 25 e1 <0f> 0b eb 3d 49 8b 46 10 4c 89 ee 4c 89 f7 48 89 45 c8 e8 47 7d
RIP [<ffffffffa00e071b>] btrfs_setattr+0x101/0x244 [btrfs]
RSP <ffff88012095de08>
---[ end trace 18d84c057a649cf8 ]---

2010-02-11 11:25:17

by Dmitry Monakhov

[permalink] [raw]

Subject: Re: [PATCH 2/4] block: support compat discard mode by default.

Dmitry Monakhov <[email protected]> writes:

> Currently there are many filesystems which has implemented
> discard support, but ssd discs not widely used yet.
> Let's allow user to use compat discard mode by default.
> After this feature is enabled by default for all devices which has
> no native discard support it will be possible to use this feature
> simply by passing appropriate mount option to fs (-odiscard in ext4)
BTW i've run tested ext4 with -odiscard option, and it survived more
24hour of stress tests test which consists of fsstress, compilation,
and etc
>
> This default option has many advantages:
> - Hope that this helps in real filesystem testing.
> - People who are crazy about data security whould be really happy.
> - Virtual machine developers also would like this feature.
>
> Signed-off-by: Dmitry Monakhov <[email protected]>
> ---
> include/linux/blkdev.h | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c762c9f..d7d028c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1010,7 +1010,7 @@ static inline int sb_issue_discard(struct super_block *sb,
> block <<= (sb->s_blocksize_bits - 9);
> nr_blocks <<= (sb->s_blocksize_bits - 9);
> return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
> - DISCARD_FL_BARRIER);
> + DISCARD_FL_BARRIER|DISCARD_FL_COMPAT);
> }
>
> extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);

2010-02-11 12:04:11

by Dmitry Monakhov

[permalink] [raw]

Subject: [PATCH 2/4] block: support compat discard mode by default.

Currently there are many filesystems which has implemented
discard support, but ssd discs not widely used yet.
Let's allow user to use compat discard mode by default.
After this feature is enabled by default for all devices which has
no native discard support it will be possible to use this feature
simply by passing appropriate mount option to fs (-odiscard in ext4)

This default option has many advantages:
- Hope that this helps in real filesystem testing.
- People who are crazy about data security whould be really happy.
- Virtual machine developers also would like this feature.

Signed-off-by: Dmitry Monakhov <[email protected]>
---
include/linux/blkdev.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c762c9f..d7d028c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1010,7 +1010,7 @@ static inline int sb_issue_discard(struct super_block *sb,
block <<= (sb->s_blocksize_bits - 9);
nr_blocks <<= (sb->s_blocksize_bits - 9);
return blkdev_issue_discard(sb->s_bdev, block, nr_blocks, GFP_KERNEL,
- DISCARD_FL_BARRIER);
+ DISCARD_FL_BARRIER|DISCARD_FL_COMPAT);
}

extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
--
1.6.6

2010-02-11 12:21:57

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 1/4] block: implement compatible DISCARD support

On Thu, Feb 11, 2010 at 01:53:45PM +0300, Dmitry Monakhov wrote:
> Currently there are no many discs has native TRIM (aka) discard
> feature support. But in fact this is good feature. We can easily
> simlulate it for devices which has not native support.
> In compat mode discard dequest transforms in to simple zerofiled
> write request.
> In fact currently blkdev_issue_discard function implemented
> incorrectly.
> 1) Whait flags not optimal we dont have to wait for each bio in flight.

Indeed.

> 2) Not wait by default. Which makes it fairly useless.

Not sure what you mean with that. We do not need to wait for the
discard request for the "online" type of use - the barrier flag
means we can't re-order I/O before and after the request so there
is no reason to wait - it stays in the places where it needs to be
in the I/O stream.

> 3) Send each bio with barrier flag(if requested). Which result in
> bad performance. In fact caller just want to make shure that full
> request is completed and ordered against other requests.

Yes, we need to ensure ordering only against reordering with other
requests. Your patch only issues a cache flush, which means that
we miss the queue drain before submitting the discard bios

> 5) It use allocated_page instead of ZERO_PAGE.

That's incorrect - both the scsi WRITE SAME and ATA UNMAP
implementations write to the payload. I have some plans to change that
an get rid of the payload entirely, but I need to get back to the
discard work and look at it in more detail.

> This patch introduce generic blkdev_issue_zeroout() function which also
> may be used for native discard request support, in this case zero payload
> simply ignored.

Which is a bit different from fixing efficiency issues in discard, I'd
prefer that to be split into a separate patch, especially as there might
be quite a bit of discussion on the zeroout behaviour.

Note that a discard is not actually required to zero out data it
has discarded, it's an optional feature in the command sets.

2010-02-11 12:22:51

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 2/4] block: support compat discard mode by default.

On Thu, Feb 11, 2010 at 01:57:50PM +0300, Dmitry Monakhov wrote:
> Currently there are many filesystems which has implemented
> discard support, but ssd discs not widely used yet.
> Let's allow user to use compat discard mode by default.
> After this feature is enabled by default for all devices which has
> no native discard support it will be possible to use this feature
> simply by passing appropriate mount option to fs (-odiscard in ext4)

So why do you even bother adding the DISCARD_FL_COMPAT option?
This is a debug option that should probably be controlled at the
block level, not anywhere near the filesystem.

2010-02-11 12:59:38

by Dmitry Monakhov

[permalink] [raw]

Subject: Re: [PATCH 1/4] block: implement compatible DISCARD support

Christoph Hellwig <[email protected]> writes:

> On Thu, Feb 11, 2010 at 01:53:45PM +0300, Dmitry Monakhov wrote:
>> Currently there are no many discs has native TRIM (aka) discard
>> feature support. But in fact this is good feature. We can easily
>> simlulate it for devices which has not native support.
>> In compat mode discard dequest transforms in to simple zerofiled
>> write request.
>> In fact currently blkdev_issue_discard function implemented
>> incorrectly.
>> 1) Whait flags not optimal we dont have to wait for each bio in flight.
>
> Indeed.
>
>> 2) Not wait by default. Which makes it fairly useless.
>
> Not sure what you mean with that. We do not need to wait for the
> discard request for the "online" type of use - the barrier flag
> means we can't re-order I/O before and after the request so there
> is no reason to wait - it stays in the places where it needs to be
> in the I/O stream.
I mean that it is impossible to know was it really successful or not.
We may just replace this function with this following function.
const char* make_me_hapy()
{
return "you are happy already.";
}

AFAIK Currently there is no any generic block interface which send
io requests without ability to check the result.
The question is should it be sync or async it is not easy to design
simple async interface so let's use sync by default
BTW: That's why blkdev_issue_barrier has to wait by default.
>
>> 3) Send each bio with barrier flag(if requested). Which result in
>> bad performance. In fact caller just want to make shure that full
>> request is completed and ordered against other requests.
>
> Yes, we need to ensure ordering only against reordering with other
> requests. Your patch only issues a cache flush, which means that
> we miss the queue drain before submitting the discard bios
Yes you right from theoretical point of view. But practically
It is hard to imagine that some one send something like this:
read_bio(sectcor,..)
discard_bio(sector,..BARRIER)
But off course you right, I'll add ability to pass pre_flush
barrier in next patch version.
>
>> 5) It use allocated_page instead of ZERO_PAGE.
>
> That's incorrect - both the scsi WRITE SAME and ATA UNMAP
> implementations write to the payload.
WOW. What for?
> I have some plans to change that
> an get rid of the payload entirely, but I need to get back to the
> discard work and look at it in more detail.

>
>> This patch introduce generic blkdev_issue_zeroout() function which also
>> may be used for native discard request support, in this case zero payload
>> simply ignored.
>
> Which is a bit different from fixing efficiency issues in discard, I'd
> prefer that to be split into a separate patch, especially as there might
> be quite a bit of discussion on the zeroout behaviour.
Seems that you also right here. At list it is not obvious how we should
send compat_discard bios WRITE,WRITE_SYNC or WRITE_SYNC_PLUG?
But blkdev_issue_zeroout() interface allow all this flags.
let's wait a bit and i'll redesign the patchset correspondingly
>
> Note that a discard is not actually required to zero out data it
> has discarded, it's an optional feature in the command sets.
Yes, i know. But zero payload will be used only by compat mode.
We assumes that default compat mode zero data on descard.

2010-02-11 13:09:47

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 1/4] block: implement compatible DISCARD support

On Thu, Feb 11, 2010 at 03:59:31PM +0300, Dmitry Monakhov wrote:
> I mean that it is impossible to know was it really successful or not.
> We may just replace this function with this following function.
> const char* make_me_hapy()
> {
> return "you are happy already.";
> }

Which is an entirely valid, although suboptimal implementation.

> AFAIK Currently there is no any generic block interface which send
> io requests without ability to check the result.
> The question is should it be sync or async it is not easy to design
> simple async interface so let's use sync by default
> BTW: That's why blkdev_issue_barrier has to wait by default.

This is going to kill performance.

> > That's incorrect - both the scsi WRITE SAME and ATA UNMAP
> > implementations write to the payload.
> WOW. What for?

Becuase these commands contain ranges of to be flushed blocks
in their payload.

> > Which is a bit different from fixing efficiency issues in discard, I'd
> > prefer that to be split into a separate patch, especially as there might
> > be quite a bit of discussion on the zeroout behaviour.
> Seems that you also right here. At list it is not obvious how we should
> send compat_discard bios WRITE,WRITE_SYNC or WRITE_SYNC_PLUG?
> But blkdev_issue_zeroout() interface allow all this flags.
> let's wait a bit and i'll redesign the patchset correspondingly

The !wait case is ansynchronous, so WRITE seems fine. The wait and
!barrier case is more interesting, as this one is very close to
synchrous write semantics. But I'm not sure it's really worth
optimization for that, this case is not used for online discarding
but things like mkfs that have the filesystem for itself, or that
not yet merged background discard for xfs which doesn't care too
much about I/O priority.

2010-02-11 13:45:20

by Dmitry Monakhov

[permalink] [raw]

Subject: Re: [PATCH 1/4] block: implement compatible DISCARD support

Christoph Hellwig <[email protected]> writes:

> On Thu, Feb 11, 2010 at 03:59:31PM +0300, Dmitry Monakhov wrote:
>> I mean that it is impossible to know was it really successful or not.
>> We may just replace this function with this following function.
>> const char* make_me_hapy()
>> {
>> return "you are happy already.";
>> }
>
> Which is an entirely valid, although suboptimal implementation.
>
>> AFAIK Currently there is no any generic block interface which send
>> io requests without ability to check the result.
>> The question is should it be sync or async it is not easy to design
>> simple async interface so let's use sync by default
>> BTW: That's why blkdev_issue_barrier has to wait by default.
In fact wait is the only interface for issue_barrier.
> This is going to kill performance.
But it may be reasonable to allow caller to choose would it
wait and work fair, or to cheat in a name of performance.
>
>> > That's incorrect - both the scsi WRITE SAME and ATA UNMAP
>> > implementations write to the payload.
>> WOW. What for?
>
> Becuase these commands contain ranges of to be flushed blocks
> in their payload.
Ok i've found.
libata-scsi.c: ata_scsi_write_same_xlat
ata_set_lba_range_entries
It's was not obvious from the first glance. But it is the way how it
works for now. But seems what we still optimize things a bit
1) alloc page with GFP_HIGHUSER (because x86 arch still used)
2) Share page between eight bios.
>
>> > Which is a bit different from fixing efficiency issues in discard, I'd
>> > prefer that to be split into a separate patch, especially as there might
>> > be quite a bit of discussion on the zeroout behaviour.
>> Seems that you also right here. At list it is not obvious how we should
>> send compat_discard bios WRITE,WRITE_SYNC or WRITE_SYNC_PLUG?
>> But blkdev_issue_zeroout() interface allow all this flags.
>> let's wait a bit and i'll redesign the patchset correspondingly
>
> The !wait case is ansynchronous, so WRITE seems fine. The wait and
> !barrier case is more interesting, as this one is very close to
> synchrous write semantics. But I'm not sure it's really worth
> optimization for that, this case is not used for online discarding
> but things like mkfs that have the filesystem for itself, or that
> not yet merged background discard for xfs which doesn't care too
> much about I/O priority.

2010-02-11 14:06:50

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 1/4] block: implement compatible DISCARD support

On Thu, Feb 11, 2010 at 04:45:13PM +0300, Dmitry Monakhov wrote:
> >> The question is should it be sync or async it is not easy to design
> >> simple async interface so let's use sync by default
> >> BTW: That's why blkdev_issue_barrier has to wait by default.
> In fact wait is the only interface for issue_barrier.
> > This is going to kill performance.
> But it may be reasonable to allow caller to choose would it
> wait and work fair, or to cheat in a name of performance.

It's not a cheat. The discard is a _hint_ to the hardware that it
can reclaim space. Think become a bit different when we start relying
on the zeroing behaviour for hardware that supports it, but so far
we don't. And given that out of two TRIM capable devices I have one
does not reliably zero the trimmed regions I'm not sure it's a good
idea to rely on that yet, either.

> libata-scsi.c: ata_scsi_write_same_xlat
> ata_set_lba_range_entries
> It's was not obvious from the first glance. But it is the way how it
> works for now. But seems what we still optimize things a bit
> 1) alloc page with GFP_HIGHUSER (because x86 arch still used)

I'm not sure it's worth over-optimization this.

> 2) Share page between eight bios.

If we introduce the common completion handler for a batch of bios as
your patch does we can do that.

2010-02-11 15:10:41

by Martin K. Petersen

[permalink] [raw]

Subject: Re: [PATCH 1/4] block: implement compatible DISCARD support

>>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:
>> 5) It use allocated_page instead of ZERO_PAGE.

Christoph> That's incorrect - both the scsi WRITE SAME and ATA UNMAP
Christoph> implementations write to the payload. I have some plans to
Christoph> change that an get rid of the payload entirely, but I need to
Christoph> get back to the discard work and look at it in more detail.

I've been away for a couple of weeks (got back yesterday). Just a heads
up that I've been working on block layer support for WRITE SAME as well
as a discard revamp the last little while. I'll try to post those
patches today or tomorrow.

Just to give you an idea: I distinguish between discarding, clearing,
and filling a block range.

blkdev_issue_clear() is used for explicitly zeroing a region. If the
target supports RZAT/TPRZ, discard will be used instead of zero filling.

blkdev_issue_discard() is for trimming without caring about what's left
in the blocks. And blkdev_issue_fill() can be used to do a write same
pass on a range (for bulk initializing RAID blocks, for instance).

--
Martin K. Petersen Oracle Linux Engineering