2008-11-23 04:46:20

by Tejun Heo

[permalink] [raw]
Subject: about TRIM/DISCARD support and barriers

Hello, all.

Dongjun Shin who works for Samsung SSD dep asked me about libata TRIM
support and pointed me to the new DISCARD support David Woodhouse got
merged for 2.6.28. I took a look at the code and blk-layer
interface-wise we seemed to be ready both for filesystems and userland
(so that fsck or something which runs background can mark unused
blocks) but there doesn't seem to be any low level driver which
actually implements ->prepare_discard_fn or fs which sets the DISCARD
flag.

Adding ->prepare_discard_fn wouldn't be difficult at all but I became
curious about a few things after looking at the DISCARD interface.
First of all - how to avoid racing aginst reusing and how to schedule
DISCARDs.

* There are two variants of DISCARD - DISCARD w/o barrier and DISCARD
w/ barrier, if a fs uses the former, it would need to make sure that
it the DISCARD finishes before re-using the block. Block layer will
make sure order will be kept for the latter but depending on how
often those DICARDs are issued it can disrupt IO scheduling.

* It looks like non-barrier DISCARD will be put into the IO sched and
scheduled the same way as regular IOs. I don't relly think this is
necessary or a good idea. DISCARDs probably don't need any kind of
sorting anyway and it's likely to disrupt IO sched heuristics.
Also, DISCARDs can be postponed w/o affecting correct operation.
However, DISCARDs are not likely to take a long time and we might
not have to worry about it too much unless it starves regular IOs.

With the above three points, I think it might be better to make block
layer manage and order DISCARD requests than putting it onto the
filesystem or barrier mechanism. If block layer manages map of
pending DISCARDs and FSes just tell block layer newly freed blocks,
block layer can schedule DISCARDs as it sees fit and cancels pending
ones if IO access to it occurs before the DISCARD is issued to the
drive. This way, adding DISCARD support to FSes become much easier -
it can just put blk_discard(lba, range) where it's discarding blocks
and don't have to worry about ordering or error handling.

What do you think?

Also, I have a question regarding the current barrier implementation.
When I asked it to Chris Mason some time ago, I was told that btrfs
doesn't really make use of barrier in that btrfs itself waits for the
barrier to complete before proceeding. I've been thinking about
colored barrier implementation because I heard that the current
barrier ordering is too crude or heavy handed. But, then again, if
the filesystem waits for requests to complete itself and those
dependent requests are marked SYNC as necessary so that they don't get
postponed too much, all that's needed is flush cache. Doing it that
way will add a bit of latency but as long as things can progress in
parallel, it will probably perform better than the current barrier.

After all, it's not like we have selective FLUSH on actual devices
anyway. Where the selective barriering can make difference is how
it's handled in the IO scheduler and FS waiting for requests to finish
and then issuing barrier achieves that quite alright and communicating
the partitial ordering of requests to block layer wouldn't be much
simpler than doing it in FS proper and there's also the problem of how
to communicate or handle when one of the request in the partial
ordering fails. So, would selective / more intelligent barrier be
beneficial to filesystems or is the way things are just fine?

Thanks.

--
tejun


2008-11-23 07:12:18

by Tejun Heo

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

Hello,

cc'ing Matthew whom I forgot to cc in the first message. Matthew, the
original message is...

http://lkml.org/lkml/2008/11/22/260

Tejun Heo wrote:
> Dongjun Shin who works for Samsung SSD dep asked me about libata TRIM
> support and pointed me to the new DISCARD support David Woodhouse got
> merged for 2.6.28. I took a look at the code and blk-layer
> interface-wise we seemed to be ready both for filesystems and userland
> (so that fsck or something which runs background can mark unused
> blocks) but there doesn't seem to be any low level driver which
> actually implements ->prepare_discard_fn or fs which sets the DISCARD
> flag.

Dongjun, the only doc I can find about ATA TRIM is the following one.

http://t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf

And AFAICS this hasn't made into ACS yet. Is this what you guys are
gonna implement and Windows7 is gonna use?

Thanks.

--
tejun

2008-11-23 07:57:50

by Tejun Heo

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

Tejun Heo wrote:
> Dongjun, the only doc I can find about ATA TRIM is the following one.
>
> http://t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf
>
> And AFAICS this hasn't made into ACS yet. Is this what you guys are
> gonna implement and Windows7 is gonna use?

Just went over it. Matthew, if ATA trim is gonna be implemented as
described in the above document, it will support multiple ranges per
command.

Dongjun, the above document strikes out all the latency/performance
related stuff, which looks like the right move to me. Most of those
information can be extracted from access pattern by the device itself
and exposing such optimization parameters to outside seldom works well.
I'm fairly sure such over complexity will end up being
counter-optimization due to different interpretations and executions by
different parties (be it harddrive vendors or different filesystems).

So, can you please confirm that, what we eventually get is simple TRIM
w/ multiple ranges? Which, BTW, makes sense as it's something the
device can't infer from the access pattern. Also, if there still is
wiggle room, what would be a worthy optimization is to allow TRIM
commands to be sent together with other NCQ commands as otherwise the
drive will have to drain all other commands to process a TRIM command
which will be inefficient.

Thanks.

--
tejun

2008-11-23 12:35:42

by Matthew Wilcox

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Sun, Nov 23, 2008 at 04:11:56PM +0900, Tejun Heo wrote:
> Hello,
>
> cc'ing Matthew whom I forgot to cc in the first message. Matthew, the
> original message is...
>
> http://lkml.org/lkml/2008/11/22/260
>
> Tejun Heo wrote:
> > Dongjun Shin who works for Samsung SSD dep asked me about libata TRIM
> > support and pointed me to the new DISCARD support David Woodhouse got
> > merged for 2.6.28. I took a look at the code and blk-layer
> > interface-wise we seemed to be ready both for filesystems and userland
> > (so that fsck or something which runs background can mark unused
> > blocks) but there doesn't seem to be any low level driver which
> > actually implements ->prepare_discard_fn or fs which sets the DISCARD
> > flag.

The current VFAT filesystem implements DISCARD. Here's the patch I've
been testing -- it's not right; it causes SSDs from two different vendors
to hang. I'm waiting for a free slot on the SATA protocol analyser to
figure out what's we're doing wrong.

Note that the SCSI UNMAP command does not yet have an official number,
so this patch cannot yet be applied (... how much longer until ATA is no
longer part of SCSI?)

We don't attempt to put non-contiguous ranges into a single TRIM yet.

We currently assume all SCSI devices support UNMAP instead of checking
the feature flag (because last time I looked, I couldn't tell what flag
I was supposed to check). Once we do that, I need to set that flag in
libata-scsi depending on the support for the TRIM bit in the identify
command.

It's been a couple of weeks since I looked at this patch, there's
probably something other caveats too.

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index 5c99ff8..531c18c 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -324,6 +324,8 @@ static void blkdev_discard_end_io(struct bio *bio, int err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
}

+ if (bio_has_data(bio))
+ __free_page(bio_page(bio));
bio_put(bio);
}

@@ -355,7 +357,7 @@ int blkdev_issue_discard(struct block_device *bdev,
return -EOPNOTSUPP;

while (nr_sects && !ret) {
- bio = bio_alloc(gfp_mask, 0);
+ bio = bio_alloc(gfp_mask, 1);
if (!bio)
return -ENOMEM;

diff --git a/block/blk-core.c b/block/blk-core.c
index c3df30c..7ec7ab5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1085,6 +1085,8 @@ EXPORT_SYMBOL(blk_put_request);

void init_request_from_bio(struct request *req, struct bio *bio)
{
+ might_sleep();
+
req->cpu = bio->bi_comp_cpu;
req->cmd_type = REQ_TYPE_FS;

@@ -1108,7 +1110,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
req->cmd_flags |= REQ_DISCARD;
if (bio_barrier(bio))
req->cmd_flags |= REQ_SOFTBARRIER;
- req->q->prepare_discard_fn(req->q, req);
+ req->q->prepare_discard_fn(req->q, req, bio);
} else if (unlikely(bio_barrier(bio)))
req->cmd_flags |= (REQ_HARDBARRIER | REQ_NOMERGE);

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index bbb30d8..303cc82 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1700,6 +1700,40 @@ nothing_to_do:
return 1;
}

+static unsigned int ata_scsi_unmap_xlat(struct ata_queued_cmd *qc)
+{
+ struct scsi_cmnd *scmd = qc->scsicmd;
+ struct request *req = scmd->request;
+ char *buffer = bio_data(req->bio);
+ struct ata_taskfile *tf = &qc->tf;
+ unsigned i = 0, size;
+
+ /*
+ * Ignore what SCSI has written to the buffer. Will make it easier
+ * to implement TRIM when ATA is no longer part of SCSI.
+ */
+
+ i = ata_set_lba_range_entries(buffer, PAGE_SIZE / 8,
+ req->sector, req->nr_sectors);
+ size = ALIGN(i * 8, 512);
+ memset(buffer + i * 8, 0, size - i * 8);
+
+ qc->flags |= ATA_QCFLAG_IO;
+ qc->nbytes = size;
+
+ tf->command = 0x06; /* Data Set Management */
+ tf->feature = 0x01; /* TRIM */
+ tf->hob_feature = 0x00;
+ tf->nsect = size / 512;
+ tf->hob_nsect = (size / 512) >> 8;
+ tf->protocol = ATA_PROT_DMA;
+ tf->flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE | ATA_TFLAG_WRITE |
+ ATA_TFLAG_LBA48 | ATA_TFLAG_LBA;
+ tf->device = ATA_LBA;
+
+ return 0;
+}
+
static void ata_scsi_qc_complete(struct ata_queued_cmd *qc)
{
struct ata_port *ap = qc->ap;
@@ -2936,6 +2970,9 @@ static inline ata_xlat_func_t ata_get_xlat_func(struct ata_device *dev, u8 cmd)
case VERIFY_16:
return ata_scsi_verify_xlat;

+ case UNMAP:
+ return ata_scsi_unmap_xlat;
+
case ATA_12:
case ATA_16:
return ata_scsi_pass_thru;
diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
index eb9fac4..db2ec75 100644
--- a/drivers/ide/ide-disk.c
+++ b/drivers/ide/ide-disk.c
@@ -409,6 +409,49 @@ static void idedisk_prepare_flush(struct request_queue *q, struct request *rq)
rq->special = task;
}

+static int idedisk_prepare_discard(struct request_queue *q, struct request *rq,
+ struct bio *bio)
+{
+ ide_task_t *task;
+ struct page *page;
+ char *buffer;
+ unsigned i, size;
+
+ /* FIXME: map struct ide_taskfile on rq->cmd[] */
+ task = kzalloc(sizeof(*task), GFP_KERNEL);
+ if (!task)
+ return -ENOMEM;
+
+ page = alloc_page(GFP_KERNEL);
+ if (!page) {
+ kfree(task);
+ return -ENOMEM;
+ }
+
+ buffer = page_address(page);
+ i = ata_set_lba_range_entries(buffer, PAGE_SIZE / 8,
+ bio->bi_sector, bio_sectors(bio));
+ size = ALIGN(i * 8, 512);
+ memset(buffer + i * 8, 0, size - i * 8);
+ bio_add_pc_page(q, bio, page, i * 8, 0);
+
+ task->tf.command = 0x06; /* Data Set Management */
+ task->tf.feature = 0x01; /* TRIM */
+ task->tf.hob_feature = 0x00;
+ task->tf.nsect = size / 512;
+ task->tf.hob_nsect = (size / 512) >> 8;
+
+ task->tf_flags = IDE_TFLAG_LBA48 | IDE_TFLAG_OUT_HOB |
+ IDE_TFLAG_OUT_TF | IDE_TFLAG_OUT_DEVICE |
+ IDE_TFLAG_DYN;
+ task->data_phase = TASKFILE_OUT_DMA;
+
+ rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
+ rq->special = task;
+
+ return 0;
+}
+
ide_devset_get(multcount, mult_count);

/*
@@ -526,6 +569,9 @@ static int set_wcache(ide_drive_t *drive, int arg)

update_ordered(drive);

+ if (ata_id_has_trim(drive->id))
+ blk_queue_set_discard(drive->queue, idedisk_prepare_discard);
+
return err;
}

diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index 7162d67..103a2cc 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -576,7 +576,12 @@ void ide_map_sg(ide_drive_t *drive, struct request *rq)
if (hwif->sg_mapped) /* needed by ide-scsi */
return;

- if (rq->cmd_type != REQ_TYPE_ATA_TASKFILE) {
+ if (blk_discard_rq(rq)) {
+ ide_task_t *task = rq->special;
+ unsigned len = (task->tf.hob_nsect << 8) + task->tf.nsect;
+ sg_init_one(sg, rq->buffer, len * 512);
+ hwif->sg_nents = 1;
+ } else if (rq->cmd_type != REQ_TYPE_ATA_TASKFILE) {
hwif->sg_nents = blk_rq_map_sg(drive->queue, rq, sg);
} else {
sg_init_one(sg, rq->buffer, rq->nr_sectors * SECTOR_SIZE);
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index c9e1242..c54b5da 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -49,6 +49,7 @@
#include <linux/mutex.h>
#include <linux/string_helpers.h>
#include <asm/uaccess.h>
+#include <asm/unaligned.h>

#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
@@ -367,6 +368,32 @@ static void scsi_disk_put(struct scsi_disk *sdkp)
mutex_unlock(&sd_ref_mutex);
}

+/*
+ * sd_discard_fn - Prepare a discard request for transmission
+ * XXX: Add support for sending multiple extents to UNMAP.
+ */
+static int sd_discard_fn(struct request_queue *q, struct request *rq,
+ struct bio *bio)
+{
+ char *unmap;
+ struct page *page = alloc_page(GFP_KERNEL);
+ if (!page)
+ return -ENOMEM;
+ bio_add_pc_page(q, bio, page, 24, 0);
+
+ unmap = bio_data(bio);
+ memset(unmap, 0, 12);
+ put_unaligned_be64(bio->bi_sector, unmap + 12);
+ put_unaligned_be32(bio_sectors(bio), unmap + 20);
+
+ rq->cmd_type = REQ_TYPE_BLOCK_PC;
+ rq->cmd_len = 10;
+ rq->cmd[0] = UNMAP;
+ put_unaligned_be16(24, rq->cmd + 7);
+
+ return 0;
+}
+
/**
* sd_init_command - build a scsi (read or write) command from
* information in the request structure.
@@ -1894,6 +1921,7 @@ static int sd_probe(struct device *dev)
sd_revalidate_disk(gd);

blk_queue_prep_rq(sdp->request_queue, sd_prep_fn);
+ blk_queue_set_discard(sdp->request_queue, sd_discard_fn);

gd->driverfs_dev = &sdp->sdev_gendev;
gd->flags = GENHD_FL_EXT_DEVT | GENHD_FL_DRIVERFS;
diff --git a/include/linux/ata.h b/include/linux/ata.h
index a53318b..d15196d 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -89,6 +89,7 @@ enum {
ATA_ID_DLF = 128,
ATA_ID_CSFO = 129,
ATA_ID_CFA_POWER = 160,
+ ATA_ID_DATA_SET_MGMT = 169,
ATA_ID_ROT_SPEED = 217,
ATA_ID_PIO4 = (1 << 1),

@@ -717,6 +718,14 @@ static inline int ata_id_has_unload(const u16 *id)
return 0;
}

+static inline int ata_id_has_trim(const u16 *id)
+{
+ if (ata_id_major_version(id) >= 7 &&
+ (id[ATA_ID_DATA_SET_MGMT] & 1))
+ return 1;
+ return 0;
+}
+
static inline int ata_id_current_chs_valid(const u16 *id)
{
/* For ATA-1 devices, if the INITIALIZE DEVICE PARAMETERS command
@@ -852,6 +861,29 @@ static inline void ata_id_to_hd_driveid(u16 *id)
#endif
}

+/*
+ * Write up to 'max' LBA Range Entries to the buffer that will cover the
+ * extent from sector to sector + count. This is used for TRIM and for
+ * ADD LBA(S) TO NV CACHE PINNED SET.
+ */
+static inline unsigned ata_set_lba_range_entries(void *_buffer, unsigned max,
+ u64 sector, unsigned long count)
+{
+ __le64 *buffer = _buffer;
+ unsigned i = 0;
+ while (i < max) {
+ u64 entry = sector |
+ ((u64)(count > 0xffff ? 0xffff : count) << 48);
+ buffer[i++] = __cpu_to_le64(entry);
+ if (count <= 0xffff)
+ break;
+ count -= 0xffff;
+ sector += 0xffff;
+ }
+
+ return i;
+}
+
static inline int is_multi_taskfile(struct ata_taskfile *tf)
{
return (tf->command == ATA_CMD_READ_MULTI) ||
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a135256..79dd6d8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -259,7 +259,8 @@ typedef void (request_fn_proc) (struct request_queue *q);
typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);
typedef int (prep_rq_fn) (struct request_queue *, struct request *);
typedef void (unplug_fn) (struct request_queue *);
-typedef int (prepare_discard_fn) (struct request_queue *, struct request *);
+typedef int (prepare_discard_fn) (struct request_queue *, struct request *,
+ struct bio *bio);

struct bio_vec;
struct bvec_merge_data {
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index a109165..e91c0b7 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -93,6 +93,7 @@
#define WRITE_LONG 0x3f
#define CHANGE_DEFINITION 0x40
#define WRITE_SAME 0x41
+#define UNMAP 0x42 /* XXX: Provisional */
#define READ_TOC 0x43
#define LOG_SELECT 0x4c
#define LOG_SENSE 0x4d

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-11-23 13:40:05

by David Woodhouse

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Sun, 2008-11-23 at 05:35 -0700, Matthew Wilcox wrote:
> The current VFAT filesystem implements DISCARD.

As does btrfs. I think I posted ext2 patches a little while ago too? Or
someone else did. Did they get merged?

> Here's the patch I've
> been testing -- it's not right; it causes SSDs from two different vendors
> to hang. I'm waiting for a free slot on the SATA protocol analyser to
> figure out what's we're doing wrong.
>
> Note that the SCSI UNMAP command does not yet have an official number,
> so this patch cannot yet be applied (... how much longer until ATA is no
> longer part of SCSI?)
>
> We don't attempt to put non-contiguous ranges into a single TRIM yet.

We don't even merge contiguous ranges -- I still need to fix the
elevators to stop writes crossing writes, before we can stop discards
from also being barriers. (Discards are just writes, for the purpose of
that conversation).

> We currently assume all SCSI devices support UNMAP instead of checking
> the feature flag (because last time I looked, I couldn't tell what flag
> I was supposed to check). Once we do that, I need to set that flag in
> libata-scsi depending on the support for the TRIM bit in the identify
> command.

The code in drivers/ide does check the feature, although I'm not 100%
sure I got that part right.

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2008-11-23 22:53:19

by James Bottomley

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Sun, 2008-11-23 at 13:39 +0000, David Woodhouse wrote:
> > We don't attempt to put non-contiguous ranges into a single TRIM yet.
>
> We don't even merge contiguous ranges -- I still need to fix the
> elevators to stop writes crossing writes,

I don't think we want to do that ... it's legal if the write isn't a
barrier and it will inhibit merging. That may be just fine for a SSD,
but it's not for spinning media since they get better performance out of
merged writes.

> before we can stop discards
> from also being barriers. (Discards are just writes, for the purpose of
> that conversation).

Perhaps they shouldn't be ... they have different characteristics. For
instance, a discard may cross a read or write that has no sectors in
common with it; a discard may be merged as a non contiguous range
(assuming the drive supports multiple ranges), etc.

I think it might be better to give it its own type for the elevators.

James

2008-11-24 03:06:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Sun, Nov 23, 2008 at 01:39:44PM +0000, David Woodhouse wrote:
> On Sun, 2008-11-23 at 05:35 -0700, Matthew Wilcox wrote:
> > The current VFAT filesystem implements DISCARD.
>
> As does btrfs. I think I posted ext2 patches a little while ago too? Or
> someone else did. Did they get merged?

Looks like they didn't get merged. I thought akpm was going to merge
them. Ext4 has discard support though. It doesn't port easily to
ext3 since it ext4 has a new block allocator.

- Ted

2008-11-24 05:40:40

by Dongjun Shin

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Sun, Nov 23, 2008 at 4:57 PM, Tejun Heo <[email protected]> wrote:
> Tejun Heo wrote:
>> Dongjun, the only doc I can find about ATA TRIM is the following one.
>>
>> http://t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf
>>
>> And AFAICS this hasn't made into ACS yet. Is this what you guys are
>> gonna implement and Windows7 is gonna use?
>

Sorry, I thought the trim spec has got into the final ACS, but it's not yet.
Microsoft can collaborate with SSD vendors for their internal evaluation &
feedback for Win7(with NDA between them, of course).

> Just went over it. Matthew, if ATA trim is gonna be implemented as
> described in the above document, it will support multiple ranges per
> command.
>
> Dongjun, the above document strikes out all the latency/performance
> related stuff, which looks like the right move to me. Most of those
> information can be extracted from access pattern by the device itself
> and exposing such optimization parameters to outside seldom works well.
> I'm fairly sure such over complexity will end up being
> counter-optimization due to different interpretations and executions by
> different parties (be it harddrive vendors or different filesystems).
>
> So, can you please confirm that, what we eventually get is simple TRIM
> w/ multiple ranges? Which, BTW, makes sense as it's something the
> device can't infer from the access pattern. Also, if there still is
> wiggle room, what would be a worthy optimization is to allow TRIM
> commands to be sent together with other NCQ commands as otherwise the
> drive will have to drain all other commands to process a TRIM command
> which will be inefficient.
>

AFAIK, only the trim feature of data set management command will be included
and a single trim may contain multiple ranges. But, it's not possible to send
non-read/write command during NCQ as specified by the t13(1).

--
Dongjun

(1) In 4.15 of ATA8-ACS,
"If the device receives a command that is not an NCQ command while NCQ commands
are in the queue, then the device shall return command aborted for the
new command
and for all of the NCQ commands that are in thequeue."

2008-11-24 05:45:45

by Tejun Heo

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

Dongjun Shin wrote:
>> So, can you please confirm that, what we eventually get is simple TRIM
>> w/ multiple ranges? Which, BTW, makes sense as it's something the
>> device can't infer from the access pattern. Also, if there still is
>> wiggle room, what would be a worthy optimization is to allow TRIM
>> commands to be sent together with other NCQ commands as otherwise the
>> drive will have to drain all other commands to process a TRIM command
>> which will be inefficient.
>>
>
> AFAIK, only the trim feature of data set management command will be included
> and a single trim may contain multiple ranges. But, it's not possible to send
> non-read/write command during NCQ as specified by the t13(1).

Yeah, I know the scheduling restriction and that's exactly why I was
asking. To rephrase: I think it would be nice to make TRIM an NCQ
command as it's likely to get mixed with regular IO traffic a lot but
maybe we can schedule it strategically such that it follows FLUSH or the
queue is mostly idle. Anyways, if someone knows anyone working on this,
please feel free to poke.

Thanks.

--
tejun

2008-11-24 05:58:13

by James Bottomley

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Mon, 2008-11-24 at 14:40 +0900, Dongjun Shin wrote:
> On Sun, Nov 23, 2008 at 4:57 PM, Tejun Heo <[email protected]> wrote:
> > Tejun Heo wrote:
> >> Dongjun, the only doc I can find about ATA TRIM is the following one.
> >>
> >> http://t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf
> >>
> >> And AFAICS this hasn't made into ACS yet. Is this what you guys are
> >> gonna implement and Windows7 is gonna use?
> >
>
> Sorry, I thought the trim spec has got into the final ACS, but it's not yet.
> Microsoft can collaborate with SSD vendors for their internal evaluation &
> feedback for Win7(with NDA between them, of course).

Actually so can we assuming the developers are willing. If we have to
sign an NDA to work with them, the Linux Foundation can stand as
corporate guarantor for it:

http://www.linuxfoundation.org/en/NDA_program

It's crafted to allow development of open source drivers. It still
requires individual developers to sign an NDA agreement with the LF (and
the LF then does a corporate one with the companies involved).

James

2008-11-24 09:05:32

by David Woodhouse

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Mon, 2008-11-24 at 07:52 +0900, James Bottomley wrote:
> On Sun, 2008-11-23 at 13:39 +0000, David Woodhouse wrote:
> > > We don't attempt to put non-contiguous ranges into a single TRIM yet.
> >
> > We don't even merge contiguous ranges -- I still need to fix the
> > elevators to stop writes crossing writes,
>
> I don't think we want to do that ... it's legal if the write isn't a
> barrier and it will inhibit merging. That may be just fine for a SSD,
> but it's not for spinning media since they get better performance out of
> merged writes.

No, I just mean writes _to the same sector_. At the moment, we happily
let those cross each other in the queue.

We do notice this situation and preserve the ordering if the two
requests cover _precisely_ the same range, but _overlapping_ writes may
happen in any order.

We should fix that, and it's only for _that_ purpose that I'm saying we
treat writes and discards as identical. And then we can drop the barrier
flag on discards.

And _then_ we can think about special cases which let us merge
non-contiguous discards.

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2008-11-24 18:43:21

by James Bottomley

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Mon, 2008-11-24 at 09:03 +0000, David Woodhouse wrote:
> On Mon, 2008-11-24 at 07:52 +0900, James Bottomley wrote:
> > On Sun, 2008-11-23 at 13:39 +0000, David Woodhouse wrote:
> > > > We don't attempt to put non-contiguous ranges into a single TRIM yet.
> > >
> > > We don't even merge contiguous ranges -- I still need to fix the
> > > elevators to stop writes crossing writes,
> >
> > I don't think we want to do that ... it's legal if the write isn't a
> > barrier and it will inhibit merging. That may be just fine for a SSD,
> > but it's not for spinning media since they get better performance out of
> > merged writes.
>
> No, I just mean writes _to the same sector_. At the moment, we happily
> let those cross each other in the queue.

That's legal ... if you want the ordering to matter, you either wait or
insert a barrier.

> We do notice this situation and preserve the ordering if the two
> requests cover _precisely_ the same range, but _overlapping_ writes may
> happen in any order.
>
> We should fix that, and it's only for _that_ purpose that I'm saying we
> treat writes and discards as identical. And then we can drop the barrier
> flag on discards.

It's not a bug ... but changing it might be feasible ... as long as it
doesn't affect write performance too much (which I don't think it will),
since it is in the critical path.

> And _then_ we can think about special cases which let us merge
> non-contiguous discards.

I still think that treating discards as a special command from the
outset is the better way forwards.

James

2008-11-24 18:53:04

by David Woodhouse

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Mon, 2008-11-24 at 13:42 -0500, James Bottomley wrote:
> On Mon, 2008-11-24 at 09:03 +0000, David Woodhouse wrote:
> > On Mon, 2008-11-24 at 07:52 +0900, James Bottomley wrote:
> > > On Sun, 2008-11-23 at 13:39 +0000, David Woodhouse wrote:
> > > > > We don't attempt to put non-contiguous ranges into a single TRIM yet.
> > > >
> > > > We don't even merge contiguous ranges -- I still need to fix the
> > > > elevators to stop writes crossing writes,
> > >
> > > I don't think we want to do that ... it's legal if the write isn't a
> > > barrier and it will inhibit merging. That may be just fine for a SSD,
> > > but it's not for spinning media since they get better performance out of
> > > merged writes.
> >
> > No, I just mean writes _to the same sector_. At the moment, we happily
> > let those cross each other in the queue.
...
> It's not a bug ... but changing it might be feasible ... as long as it
> doesn't affect write performance too much (which I don't think it will),
> since it is in the critical path.

We could argue about how much sense it makes to let two writes to the
same sector actually happen in reverse order.

Especially given the fact that we actually _do_ preserve ordering in
some cases; just not in others. (We preserve ordering only if the start
and end of the duplicate writes are _precisely_ matching; if it's just
overlapping (which may well happen in the presence of merges), then this
check doesn't trigger.

But that's just semantics. Yes, changing it should be feasible. I talked
to Jens about that at the kernel summit, and we agreed that it should
probably be done.

> > And _then_ we can think about special cases which let us merge
> > non-contiguous discards.
>
> I still think that treating discards as a special command from the
> outset is the better way forwards.

They're already treated as a special command and you can special-case
them wherever you like, so I'm not entirely sure what you're suggesting.

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2008-11-24 18:59:57

by Jens Axboe

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Mon, Nov 24 2008, David Woodhouse wrote:
> On Mon, 2008-11-24 at 13:42 -0500, James Bottomley wrote:
> > On Mon, 2008-11-24 at 09:03 +0000, David Woodhouse wrote:
> > > On Mon, 2008-11-24 at 07:52 +0900, James Bottomley wrote:
> > > > On Sun, 2008-11-23 at 13:39 +0000, David Woodhouse wrote:
> > > > > > We don't attempt to put non-contiguous ranges into a single TRIM yet.
> > > > >
> > > > > We don't even merge contiguous ranges -- I still need to fix the
> > > > > elevators to stop writes crossing writes,
> > > >
> > > > I don't think we want to do that ... it's legal if the write isn't a
> > > > barrier and it will inhibit merging. That may be just fine for a SSD,
> > > > but it's not for spinning media since they get better performance out of
> > > > merged writes.
> > >
> > > No, I just mean writes _to the same sector_. At the moment, we happily
> > > let those cross each other in the queue.
> ...
> > It's not a bug ... but changing it might be feasible ... as long as it
> > doesn't affect write performance too much (which I don't think it will),
> > since it is in the critical path.
>
> We could argue about how much sense it makes to let two writes to the
> same sector actually happen in reverse order.
>
> Especially given the fact that we actually _do_ preserve ordering in
> some cases; just not in others. (We preserve ordering only if the start
> and end of the duplicate writes are _precisely_ matching; if it's just
> overlapping (which may well happen in the presence of merges), then this
> check doesn't trigger.
>
> But that's just semantics. Yes, changing it should be feasible. I talked
> to Jens about that at the kernel summit, and we agreed that it should
> probably be done.

The way this currently works is that we sort based on the first sector
in the request. So if you have have an overlap condition between rq1 and
rq2 and then a write gets merged into rq1, then you may have passing
writes. Linux has never guarenteed any write ordering for non-barriers,
so we've never attempted to handle it. Direct aliases (matching first
sectors) are handled as you mention, but that's more of an algorithmic
thing than by design.

My main worry is that this will add considerable overhead to request
sorting. For the rbtree based sorting, we'd have to do a rb_next/rb_prev
to look at adjacent requests. For CFQ it's even worse, since there's no
per-queue big rbtree for sorting.

--
Jens Axboe

2008-11-24 19:08:49

by James Bottomley

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Mon, 2008-11-24 at 19:57 +0100, Jens Axboe wrote:
> On Mon, Nov 24 2008, David Woodhouse wrote:
> > On Mon, 2008-11-24 at 13:42 -0500, James Bottomley wrote:
> > > On Mon, 2008-11-24 at 09:03 +0000, David Woodhouse wrote:
> > > > On Mon, 2008-11-24 at 07:52 +0900, James Bottomley wrote:
> > > > > On Sun, 2008-11-23 at 13:39 +0000, David Woodhouse wrote:
> > > > > > > We don't attempt to put non-contiguous ranges into a single TRIM yet.
> > > > > >
> > > > > > We don't even merge contiguous ranges -- I still need to fix the
> > > > > > elevators to stop writes crossing writes,
> > > > >
> > > > > I don't think we want to do that ... it's legal if the write isn't a
> > > > > barrier and it will inhibit merging. That may be just fine for a SSD,
> > > > > but it's not for spinning media since they get better performance out of
> > > > > merged writes.
> > > >
> > > > No, I just mean writes _to the same sector_. At the moment, we happily
> > > > let those cross each other in the queue.
> > ...
> > > It's not a bug ... but changing it might be feasible ... as long as it
> > > doesn't affect write performance too much (which I don't think it will),
> > > since it is in the critical path.
> >
> > We could argue about how much sense it makes to let two writes to the
> > same sector actually happen in reverse order.
> >
> > Especially given the fact that we actually _do_ preserve ordering in
> > some cases; just not in others. (We preserve ordering only if the start
> > and end of the duplicate writes are _precisely_ matching; if it's just
> > overlapping (which may well happen in the presence of merges), then this
> > check doesn't trigger.
> >
> > But that's just semantics. Yes, changing it should be feasible. I talked
> > to Jens about that at the kernel summit, and we agreed that it should
> > probably be done.
>
> The way this currently works is that we sort based on the first sector
> in the request. So if you have have an overlap condition between rq1 and
> rq2 and then a write gets merged into rq1, then you may have passing
> writes. Linux has never guarenteed any write ordering for non-barriers,
> so we've never attempted to handle it. Direct aliases (matching first
> sectors) are handled as you mention, but that's more of an algorithmic
> thing than by design.
>
> My main worry is that this will add considerable overhead to request
> sorting. For the rbtree based sorting, we'd have to do a rb_next/rb_prev
> to look at adjacent requests. For CFQ it's even worse, since there's no
> per-queue big rbtree for sorting.

Which is why I suggest special casing: Only invoke the expensive
overlap checking if one of the requests is a discard. Otherwise use the
standard path for writes.

James

2008-11-24 19:09:52

by James Bottomley

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Mon, 2008-11-24 at 18:52 +0000, David Woodhouse wrote:
> On Mon, 2008-11-24 at 13:42 -0500, James Bottomley wrote:
> > On Mon, 2008-11-24 at 09:03 +0000, David Woodhouse wrote:
> > > On Mon, 2008-11-24 at 07:52 +0900, James Bottomley wrote:
> > > > On Sun, 2008-11-23 at 13:39 +0000, David Woodhouse wrote:
> > > > > > We don't attempt to put non-contiguous ranges into a single TRIM yet.
> > > > >
> > > > > We don't even merge contiguous ranges -- I still need to fix the
> > > > > elevators to stop writes crossing writes,
> > > >
> > > > I don't think we want to do that ... it's legal if the write isn't a
> > > > barrier and it will inhibit merging. That may be just fine for a SSD,
> > > > but it's not for spinning media since they get better performance out of
> > > > merged writes.
> > >
> > > No, I just mean writes _to the same sector_. At the moment, we happily
> > > let those cross each other in the queue.
> ...
> > It's not a bug ... but changing it might be feasible ... as long as it
> > doesn't affect write performance too much (which I don't think it will),
> > since it is in the critical path.
>
> We could argue about how much sense it makes to let two writes to the
> same sector actually happen in reverse order.
>
> Especially given the fact that we actually _do_ preserve ordering in
> some cases; just not in others. (We preserve ordering only if the start
> and end of the duplicate writes are _precisely_ matching; if it's just
> overlapping (which may well happen in the presence of merges), then this
> check doesn't trigger.
>
> But that's just semantics. Yes, changing it should be feasible. I talked
> to Jens about that at the kernel summit, and we agreed that it should
> probably be done.
>
> > > And _then_ we can think about special cases which let us merge
> > > non-contiguous discards.
> >
> > I still think that treating discards as a special command from the
> > outset is the better way forwards.
>
> They're already treated as a special command and you can special-case
> them wherever you like, so I'm not entirely sure what you're suggesting.

I mean that since it's not a bug, you don't have to do it for every
write, just between a write and a discard, thus special casing the
overlap checking code.

James

2008-11-25 03:29:20

by Matthew Wilcox

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Mon, Nov 24, 2008 at 09:03:50AM +0000, David Woodhouse wrote:
> And _then_ we can think about special cases which let us merge
> non-contiguous discards.

I've been thinking about what a solution would look like that lets us
send non-contiguous discards down, and I don't like the look of any of
them. Right now, discard bios get turned into discard requests and
those are handled by the block device drivers as being a discard of the
range (sector, sector + nr_sectors).

If we're going to allow discontiguous ranges to be merged into one
request, then we need somewhere to store the ranges. The obvious answer
is in the ->bio list. But then, an unconverted driver will discard the
wrong sectors (presuming nr_sectors gets updated to the length of all
discarded sectors).

There's also murmurings from vendors that they want to restrict the
number of ranges transmitted in a single UNMAP/TRIM command, and that's
more information to be passed to the elevator.

How about an interface that lets the driver's discard function scan back
through the queue and see if there are any more discard bios queued up?
If there are, (and it has room for them) it can retire them from the
queue early.

I'm open to other ideas, of course.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-11-25 09:17:51

by Jens Axboe

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Mon, Nov 24 2008, Matthew Wilcox wrote:
> On Mon, Nov 24, 2008 at 09:03:50AM +0000, David Woodhouse wrote:
> > And _then_ we can think about special cases which let us merge
> > non-contiguous discards.
>
> I've been thinking about what a solution would look like that lets us
> send non-contiguous discards down, and I don't like the look of any of
> them. Right now, discard bios get turned into discard requests and
> those are handled by the block device drivers as being a discard of the
> range (sector, sector + nr_sectors).
>
> If we're going to allow discontiguous ranges to be merged into one
> request, then we need somewhere to store the ranges. The obvious answer
> is in the ->bio list. But then, an unconverted driver will discard the
> wrong sectors (presuming nr_sectors gets updated to the length of all
> discarded sectors).

It's completely parallel to normal fs merges, I think it's a good fit.
And it's not like there are a lot of drivers to check for this
particular command type.

> There's also murmurings from vendors that they want to restrict the
> number of ranges transmitted in a single UNMAP/TRIM command, and that's
> more information to be passed to the elevator.

If need be, then we can just add a setting for that like we have for
request sizes, segments, etc.

> How about an interface that lets the driver's discard function scan back
> through the queue and see if there are any more discard bios queued up?
> If there are, (and it has room for them) it can retire them from the
> queue early.

Irk, that sounds pretty horrible and slow...

--
Jens Axboe

2008-11-25 09:19:07

by Jens Axboe

[permalink] [raw]
Subject: Re: about TRIM/DISCARD support and barriers

On Mon, Nov 24 2008, James Bottomley wrote:
> On Mon, 2008-11-24 at 19:57 +0100, Jens Axboe wrote:
> > On Mon, Nov 24 2008, David Woodhouse wrote:
> > > On Mon, 2008-11-24 at 13:42 -0500, James Bottomley wrote:
> > > > On Mon, 2008-11-24 at 09:03 +0000, David Woodhouse wrote:
> > > > > On Mon, 2008-11-24 at 07:52 +0900, James Bottomley wrote:
> > > > > > On Sun, 2008-11-23 at 13:39 +0000, David Woodhouse wrote:
> > > > > > > > We don't attempt to put non-contiguous ranges into a single TRIM yet.
> > > > > > >
> > > > > > > We don't even merge contiguous ranges -- I still need to fix the
> > > > > > > elevators to stop writes crossing writes,
> > > > > >
> > > > > > I don't think we want to do that ... it's legal if the write isn't a
> > > > > > barrier and it will inhibit merging. That may be just fine for a SSD,
> > > > > > but it's not for spinning media since they get better performance out of
> > > > > > merged writes.
> > > > >
> > > > > No, I just mean writes _to the same sector_. At the moment, we happily
> > > > > let those cross each other in the queue.
> > > ...
> > > > It's not a bug ... but changing it might be feasible ... as long as it
> > > > doesn't affect write performance too much (which I don't think it will),
> > > > since it is in the critical path.
> > >
> > > We could argue about how much sense it makes to let two writes to the
> > > same sector actually happen in reverse order.
> > >
> > > Especially given the fact that we actually _do_ preserve ordering in
> > > some cases; just not in others. (We preserve ordering only if the start
> > > and end of the duplicate writes are _precisely_ matching; if it's just
> > > overlapping (which may well happen in the presence of merges), then this
> > > check doesn't trigger.
> > >
> > > But that's just semantics. Yes, changing it should be feasible. I talked
> > > to Jens about that at the kernel summit, and we agreed that it should
> > > probably be done.
> >
> > The way this currently works is that we sort based on the first sector
> > in the request. So if you have have an overlap condition between rq1 and
> > rq2 and then a write gets merged into rq1, then you may have passing
> > writes. Linux has never guarenteed any write ordering for non-barriers,
> > so we've never attempted to handle it. Direct aliases (matching first
> > sectors) are handled as you mention, but that's more of an algorithmic
> > thing than by design.
> >
> > My main worry is that this will add considerable overhead to request
> > sorting. For the rbtree based sorting, we'd have to do a rb_next/rb_prev
> > to look at adjacent requests. For CFQ it's even worse, since there's no
> > per-queue big rbtree for sorting.
>
> Which is why I suggest special casing: Only invoke the expensive
> overlap checking if one of the requests is a discard. Otherwise use the
> standard path for writes.

Good point, we can easily track if we have discard requests pending.
That doesn't really make it a lot better for CFQ though, currently it'll
still have to dump all queues if a discard comes in.

--
Jens Axboe