2009-05-09 21:14:14

by Theodore Ts'o

[permalink] [raw]
Subject: Is TRIM/DISCARD going to be a performance problem?


Currently, ext4 is wired up to call sb_issue_discard, which is a wrapper
around blkdev_issue_discard(). The way we do this is we keep track of
deleted extents, coalescing them as much as possible, and then once we
commit the transaction where they are deleted, we send the discards down
the pipe via sb_issue_discard. For example, after marking approximately
200 mail messages as deleted, and running the mbsync command which
synchronizes my local Maildir store with my IMAP server (and thus
deleting approximately 200 files), and the next commit, we see this:

3480.770129: jbd2_start_commit: dev dm-0 transaction 760204 sync 0
3480.783797: ext4_discard_blocks: dev dm-0 blk 15967955 count 1
3480.783830: ext4_discard_blocks: dev dm-0 blk 15970048 count 104
3480.783839: ext4_discard_blocks: dev dm-0 blk 17045096 count 14
3480.783842: ext4_discard_blocks: dev dm-0 blk 15702398 count 2
.
.
.
3480.784009: ext4_discard_blocks: dev dm-0 blk 15461632 count 32
3480.784015: ext4_discard_blocks: dev dm-0 blk 17057632 count 32
3480.784023: ext4_discard_blocks: dev dm-0 blk 17049120 count 32
3480.784026: ext4_discard_blocks: dev dm-0 blk 17045408 count 32
3480.784031: ext4_discard_blocks: dev dm-0 blk 15448634 count 6
3480.784036: ext4_discard_blocks: dev dm-0 blk 17146618 count 1
3480.784039: ext4_discard_blocks: dev dm-0 blk 17146370 count 1
3480.784043: ext4_discard_blocks: dev dm-0 blk 15967947 count 6
3480.784046: jbd2_end_commit: dev dm-0 transaction 760204 sync 0 head 758551

There were 42 calls to blkdev_issue_discard (I omitted some for the
sake of brevity), and that's a relatively minimal example. A "make
mrclean" in the kernel tree, especially one that tends to be more
fragmented due to a mix of source and binary files getting updated via
"git pull", will be much, much worse, and could result in potentially
hundreds of calls to blkdev_issue_discard(). Given that each call to
blkdev_issue_discard() acts like a barrier command and requires that the
queue be completely drained (of both read and write requests, if I
understand things correctly) if there's anything else happening in
parallel, such as other write or read requests, performance is going to
go down the tubes.

What I'm thinking that we might have to do is:

*) Batch the trim requests more than a single commit, by having a
separate rbtree for trim requests
*) If blocks get reused, we'll need to remove them from the rbtree
*) In some cases, we may be able to collapse the rbtree by querying the
filesystem block allocation data structures to determine that if
we have an entry for blocks 1003-1008 and 1011-1050, and block
1009 and 1010 are unused, we can combine this into a single
trim request for 1003-1050.
*) Create an upcall from the block layer to the trim management layer
indicating that the I/O device is idle, so this would be a good
time to send down a whole bunch of trim requests.
*) Optionally have a mode to support stupid thin-provision
devices that require the trim request to be aligned on some
large 1 or 4 megabyte boundaries, and be multiples of 1-4
megabyte ranges, or they will ignore them.
*) Optionally have a mode which allows the filesystem's block allocator
to query the list of blocks on the "to be trimmed" list, so they
can be reused and hopefully avoid needing to send the trim
request in the first place.

This could either be done as ext4-specific code, or as a generic "trim
management layer" which could be utilized by any filesystem.

So, a couple of questions: First of all, do people agree with my
concerns? Secondly, does the above design seem sane? And finally, if
the answers to the first two questions are yes, I'm rather busy and
could really use a minion to implement my evil plans --- anyone have any
ideas about how to contact the vendors of these large thin-provisioning
devices, and perhaps gently suggest to them that if they plan to make
$$$ off their devices, maybe they should fund this particular piece of
work? :-)

- Ted


2009-05-10 16:53:00

by Jörn Engel

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Sat, 9 May 2009 17:14:14 -0400, Theodore Ts'o wrote:
>
> 3480.784009: ext4_discard_blocks: dev dm-0 blk 15461632 count 32
> 3480.784015: ext4_discard_blocks: dev dm-0 blk 17057632 count 32
> 3480.784023: ext4_discard_blocks: dev dm-0 blk 17049120 count 32
> 3480.784026: ext4_discard_blocks: dev dm-0 blk 17045408 count 32
> 3480.784031: ext4_discard_blocks: dev dm-0 blk 15448634 count 6
> 3480.784036: ext4_discard_blocks: dev dm-0 blk 17146618 count 1
> 3480.784039: ext4_discard_blocks: dev dm-0 blk 17146370 count 1
> 3480.784043: ext4_discard_blocks: dev dm-0 blk 15967947 count 6
>
> What I'm thinking that we might have to do is:
>
> *) Batch the trim requests more than a single commit, by having a
> separate rbtree for trim requests
> *) If blocks get reused, we'll need to remove them from the rbtree
> *) In some cases, we may be able to collapse the rbtree by querying the
> filesystem block allocation data structures to determine that if
> we have an entry for blocks 1003-1008 and 1011-1050, and block
> 1009 and 1010 are unused, we can combine this into a single
> trim request for 1003-1050.
> *) Create an upcall from the block layer to the trim management layer
> indicating that the I/O device is idle, so this would be a good
> time to send down a whole bunch of trim requests.
> *) Optionally have a mode to support stupid thin-provision
> devices that require the trim request to be aligned on some
> large 1 or 4 megabyte boundaries, and be multiples of 1-4
> megabyte ranges, or they will ignore them.
> *) Optionally have a mode which allows the filesystem's block allocator
> to query the list of blocks on the "to be trimmed" list, so they
> can be reused and hopefully avoid needing to send the trim
> request in the first place.

I'm somewhat surprised. Imo both the current performance impact and
much of your proposal above are ludicrous. Given the alternative, I
would much rather accept that overlapping writes and discards (and
possibly reads) are illegal and will give undefined results than deal
with an rbtree. If necessary, the filesystem itself can generate
barriers - and hopefully not an insane number of them.

Independently of that question, though, you seem to send down a large
number of fairly small discard requests. And I'd wager that many, if
not most, will be completely useless for the underlying device. Unless
at least part of the discard matches the granularity, it will be
ignored. And even on large discards, the head and tail bits will likely
be ignored. So I would have expected that you already handle discard by
looking at the allocator and combining the current request with any free
space on either side.
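
(The arithmetic for that is simple enough; a hedged sketch of clamping a
discard to a device that only honours aligned multiples of some granule --
the granule size itself is a made-up parameter, nothing current devices
actually report:)

#include <stdio.h>

/* Round a discard inward to a device granule; head and tail fragments are
 * dropped, which is presumably what such a device does internally anyway.
 * Returns 0 if nothing usable survives. */
static int clamp_discard(unsigned long long start, unsigned long long len,
                         unsigned long long granule,
                         unsigned long long *out_start, unsigned long long *out_len)
{
    unsigned long long first = (start + granule - 1) / granule * granule;
    unsigned long long last = (start + len) / granule * granule;

    if (last <= first)
        return 0;
    *out_start = first;
    *out_len = last - first;
    return 1;
}

int main(void)
{
    unsigned long long s, l;

    /* The 104-block extent at 15970048 from the trace above sits entirely
     * inside one 2048-block granule, so it would be thrown away whole. */
    if (clamp_discard(15970048, 104, 2048, &s, &l))
        printf("discard %llu +%llu\n", s, l);
    else
        printf("discard dropped entirely\n");
    return 0;
}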

Also, if the devices actually announced their granularity, useless
discards could already get ignored at the block layer or filesystem
level. Even better, if devices are known to ignore discards, none
should ever be sent. That may be wishful thinking, though.

Jörn

--
There is no worse hell than that provided by the regrets
for wasted opportunities.
-- Andre-Louis Moreau in Scarabouche


2009-05-11 08:12:16

by Jens Axboe

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Sat, May 09 2009, Theodore Ts'o wrote:
>
> Currently, ext4 is wired up to call sb_issue_discard, which is a wrapper
> around blkdev_issue_discard(). The way we do this is we keep track of
> deleted extents, coalescing them as much as possible, and then once we
> commit the transaction where they are deleted, we send the discards down
> the pipe via sb_issue_discard. For example, after marking approximately
> 200 mail messages as deleted, and running the mbsync command which
> synchronizes my local Maildir store with my IMAP server (and thus
> deleting approximately 200 files), and the next commit, we see this:
>
> 3480.770129: jbd2_start_commit: dev dm-0 transaction 760204 sync 0
> 3480.783797: ext4_discard_blocks: dev dm-0 blk 15967955 count 1
> 3480.783830: ext4_discard_blocks: dev dm-0 blk 15970048 count 104
> 3480.783839: ext4_discard_blocks: dev dm-0 blk 17045096 count 14
> 3480.783842: ext4_discard_blocks: dev dm-0 blk 15702398 count 2
> .
> .
> .
> 3480.784009: ext4_discard_blocks: dev dm-0 blk 15461632 count 32
> 3480.784015: ext4_discard_blocks: dev dm-0 blk 17057632 count 32
> 3480.784023: ext4_discard_blocks: dev dm-0 blk 17049120 count 32
> 3480.784026: ext4_discard_blocks: dev dm-0 blk 17045408 count 32
> 3480.784031: ext4_discard_blocks: dev dm-0 blk 15448634 count 6
> 3480.784036: ext4_discard_blocks: dev dm-0 blk 17146618 count 1
> 3480.784039: ext4_discard_blocks: dev dm-0 blk 17146370 count 1
> 3480.784043: ext4_discard_blocks: dev dm-0 blk 15967947 count 6
> 3480.784046: jbd2_end_commit: dev dm-0 transaction 760204 sync 0 head 758551
>
> There were 42 calls to blkdev_issue_discard (I omitted some for the
> sake of brevity), and that's a relatively minimal example. A "make
> mrclean" in the kernel tree, especially one that tends to be more
> fragmented due to a mix of source and binary files getting updated via
> "git pull", will be much, much worse, and could result in potential
> hundreds of calls to blkev_issue_discard(). Given that each call to
> blkdeV_issue_discard() acts like a barrier command and requires that the
> queue be completely drained (of both read and write requests, if I
> understand things correctly) if there's anything else happening in
> parallel, such as other write or read requests, performance is going to
> go down the tubes.
>
> What I'm thinking that we might have to do is:
>
> *) Batch the trim requests more than a single commit, by having a
> separate rbtree for trim requests
> *) If blocks get reused, we'll need to remove them from the rbtree
> *) In some cases, we may be able to collapse the rbtree by querying the
> filesystem block allocation data structures to determine that if
> we have an entry for blocks 1003-1008 and 1011-1050, and block
> 1009 and 1010 are unused, we can combine this into a single
> trim request for 1003-1050.
> *) Create an upcall from the block layer to the trim management layer
> indicating that the I/O device is idle, so this would be a good
> time to send down a whole bunch of trim requests.
> *) Optionally have a mode to support stupid thin-provision
> devices that require the trim request to be aligned on some
> large 1 or 4 megabyte boundaries, and be multiples of 1-4
> megabyte ranges, or they will ignore them.
> *) Optionally have a mode which allows the filesystem's block allocator
> to query the list of blocks on the "to be trimmed" list, so they
> can be reused and hopefully avoid needing to send the trim
> request in the first place.

I largely agree with this. I think that trims should be queued and
postponed until the drive is largely idle. I don't want to put this IO
tracking in the block layer though, it's going to slow down our iops
rates for writes. Providing the functionality in the block layer does
make sense though, since it sits between that and the fs anyway. So just
not part of the generic IO path, but a set of helpers on the side.
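
(Purely to make the "helpers on the side" shape concrete -- every name and
signature below is invented, nothing like this exists yet, and the stub
types are only there so the sketch stands alone:)

/* Hypothetical helper API living beside the normal submission path. */
typedef unsigned long long sector_t;    /* stand-in for the kernel type */
struct block_device;                    /* opaque here */
struct discard_batch;                   /* e.g. an rbtree of pending ranges */

/* Filesystem side: feed in freed extents as transactions commit, and pull
 * back any extent whose blocks get reallocated before the flush. */
struct discard_batch *blk_discard_batch_alloc(struct block_device *bdev);
void blk_discard_batch_add(struct discard_batch *b, sector_t start, sector_t nr);
void blk_discard_batch_cancel(struct discard_batch *b, sector_t start, sector_t nr);

/* Block layer side: once the queue has gone idle, quiesce it a single time
 * and push the whole accumulated batch down as discards. */
void blk_discard_batch_flush_idle(struct discard_batch *b);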

I don't think that issuing trims alongside normal IO is going to
be very fast, since it does add barriers. It's much better to be able to
generate a set of deleted blocks at certain intervals, and then inform
the drive of those at a suitable time.

> This could either be done as ext4-specific code, or as a generic "trim
> management layer" which could be utilized by any filesystem.

The latter please, it's completely generic functionality and other
filesystems will want to use it as well.

> So, a couple of questions: First of all, do people agree with my
> concerns? Secondly, does the above design seem sane? And finally, if
> the answers to the first two questions are yes, I'm rather busy and
> could really use a minion to implement my evil plans --- anyone have any
> ideas about how to contact the vendors of these large thin-provisioning
> devices, and perhaps gently suggest to them that if they plan to make
> $$$ off their devices, maybe they should fund this particular piece of
> work? :-)

I've ordered a few SSD's that have TRIM support, so I can start playing
with this soonish I hope.

--
Jens Axboe


2009-05-11 08:38:03

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Sun, May 10, 2009 at 06:53:00PM +0200, Jörn Engel wrote:
> I'm somewhat surprised. Imo both the current performance impact and
> much of your proposal above are ludicrous. Given the alternative, I
> would much rather accept that overlapping writes and discards (and
> possibly reads) are illegal and will give undefined results than deal
> with an rbtree. If necessary, the filesystem itself can generate
> barriers - and hopefully not an insane number of them.
>
> Independently of that question, though, you seem to send down a large
> number of fairly small discard requests. And I'd wager that many, if
> not most, will be completely useless for the underlying device. Unless
> at least part of the discard matches the granularity, it will be
> ignored.

Well, no one has actually implemented the low-level TRIM support yet;
and what I did is basically the same as the TRIM support which Matthew
Wilcox implemented (most of which was never merged, although the call
so that the FAT filesystem would call TRIM is in mainline ---
currently the two users of sb_issue_discard() are the FAT and ext4
filesystems). And actually, what I did is much *better* than what
Matthew implemented --- he sent the sb_issue_discard() after every
single unlink command, whereas with ext4 at least we combined the trim
requests and only issued them after the journal commit. So for
example, in the test where I deleted 200 files, ext4 only sent 42
discard requests. For the FAT filesystem, which issues the discard
after each unlink() system call, it would have issued at least 200
discard requests, and perhaps significantly more if the file system
was fragmented.

> And even on large discards, the head and tail bits will likely
> be ignored. So I would have expected that you already handle discard by
> looking at the allocator and combining the current request with any free
> space on either side.

Well, no, Matthew's changes didn't do any of that, I suspect because
most SSD's, including X25-M, are expected to have a granularity size
of 1 block. It's the crazy people in the SCSI standards world who've
been pushing for granularity sizes in the 1-4 megabyte range; as I
understand things, the granularity issue was not going to be a problem
for the ATA TRIM command.

Hence my suggestion that if they want to support these large
granularity writes, since they're the ones who are going to be making
$$$ on these thin-provisioned clients, we ought to hit them up for
funding to implement discard management layer. Personally, I only
care about SSD's (because I have one in my laptop) and the associated
performance issues. If they want to make huge amounts of money, and
they're too lazy to track unallocated regions on a smaller granularity
than multiple megabytes, and want to push this complexity into Linux,
let *them* help pay for the development work. :-)


As far as thinking that the proposal is ludicrous --- what precisely
did you find ludicrous about it? These are problems that all
filesystems will have to face; so we might as well solve the problem
once, generically. Figuring out when we have to issue discards is a
very hard problem. It may very well be that for thin-provisioned
clients, the answer may be that we should only issue the discard
requests at unmount time. That means that the system won't be
informed about a large-scale "rm -rf", but at least it will be much
simpler; we can have a program that reads out the block allocation
bitmaps, and then updates the thin-provisioned client after the
filesystem has been unmounted.

However, the requirements are different for SSD's, where (a) the SSD's
want the trim information on a fine-grained basis, and (b) from a
wear-management point of view, giving the SSD the information sooner
rather than later is a *good* thing, since if the blocks have been
deleted, you want the SSD to know right away, to avoid needlessly
GC'ing that region of disk, since that will improve the SSD's write
endurance.

The only problem with SSD's is that the people who designed the ATA TRIM
command require us to completely drain the I/O queue before issuing
it. Because of this incompetence, we need to be a bit more careful
about how we issue them.

- Ted

2009-05-11 08:41:21

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 10:12:16AM +0200, Jens Axboe wrote:
>
> I largely agree with this. I think that trims should be queued and
> postponed until the drive is largely idle. I don't want to put this IO
> tracking in the block layer though, it's going to slow down our iops
> rates for writes. Providing the functionality in the block layer does
> make sense though, since it sits between that and the fs anyway. So just
> not part of the generic IO path, but a set of helpers on the side.

Yes, I agree. However, in that case, we need two things from the
block I/O path. (A) The discard management layer needs a way of
knowing that the block device has become idle, and (B) ideally there
should be a more efficient method for sending trim requests to the I/O
submission path. If we batch the results, when we *do* send the
discard requests, we may be sending several hundred discards, and it
would be useful if we could pass into the I/O submission path a linked
list of regions, so the queue can be drained *once*, and then a whole
series of discards can be sent to the device all at once.
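
(For what it's worth, the ATA DATA SET MANAGEMENT payload is already built
for this: as far as I can tell each 512-byte data block carries up to 64
little-endian 8-byte entries, 48 bits of LBA plus a 16-bit sector count.
A sketch of packing a batch into one payload, reusing a few extents from
the trace above as if they were already converted to sectors:)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pack up to 64 (lba, nr_sectors) ranges into one 512-byte TRIM payload:
 * each 8-byte entry is bits 0-47 = LBA, bits 48-63 = sector count, stored
 * little-endian.  Returns the number of entries actually packed. */
static int pack_trim_payload(uint8_t payload[512],
                             const uint64_t *lba, const uint32_t *nr, int count)
{
    int i, j;

    memset(payload, 0, 512);
    if (count > 64)
        count = 64;
    for (i = 0; i < count; i++) {
        uint64_t entry = (lba[i] & 0xFFFFFFFFFFFFULL) |
                         ((uint64_t)(nr[i] & 0xFFFF) << 48);
        for (j = 0; j < 8; j++)
            payload[i * 8 + j] = entry >> (8 * j);
    }
    return i;
}

int main(void)
{
    uint8_t buf[512];
    uint64_t lba[] = { 15461632, 17057632, 17049120, 17045408 };
    uint32_t nr[]  = { 32, 32, 32, 32 };

    printf("packed %d ranges into one payload\n",
           pack_trim_payload(buf, lba, nr, 4));
    return 0;
}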

Does that make sense to you?

- Ted

2009-05-11 08:49:41

by Jens Axboe

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11 2009, Theodore Tso wrote:
> On Mon, May 11, 2009 at 10:12:16AM +0200, Jens Axboe wrote:
> >
> > I largely agree with this. I think that trims should be queued and
> > postponed until the drive is largely idle. I don't want to put this IO
> > tracking in the block layer though, it's going to slow down our iops
> > rates for writes. Providing the functionality in the block layer does
> > make sense though, since it sits between that and the fs anyway. So just
> > not part of the generic IO path, but a set of helpers on the side.
>
> Yes, I agree. However, in that case, we need two things from the
> block I/O path. (A) The discard management layer needs a way of
> knowing that the block device has become idle, and (B) ideally there

We don't have to inform of such a condition, the block layer can check
for existing pending trims and kick those off at an appropriate time.

> should be a more efficient method for sending trim requests to the I/O
> submission path. If we batch the results, when we *do* send the
> discard requests, we may be sending several hundred discards, and it
> would be useful if we could pass into the I/O submission path a linked
> list of regions, so the queue can be drained *once*, and then a whole
> series of discards can be sent to the device all at once.
>
> Does that make sense to you?

Agree, we definitely only want to do the queue quiesce once for passing
down a series of trims. With the delayed trim queuing, that isn't very
difficult.

--
Jens Axboe


2009-05-11 10:06:43

by Jörn Engel

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, 11 May 2009 04:37:54 -0400, Theodore Tso wrote:
>
> Well, no one has actually implemented the low-level TRIM support yet;

Iirc dwmw2 did so for some of the FTL drivers. More a curiosity than a
useful device, though.

> Well, no, Matthew's changes didn't do any of that, I suspect because
> most SSD's, including X25-M, are expected to have a granularity size
> of 1 block. It's the crazy people in the SCSI standards world who've
> been pushing for granularity sizes in the 1-4 megabyte range; as I
> understand things, the granularity issue was not going to be a problem
> for the ATA TRIM command.

I am not sure about this part. So far Intel has been the only party to
release any information about their dark-grey box. All other boxes are
still solid black. And until I'm told otherwise I'd consider them to be
stupid devices that use erase block size as trim granularity.

Given the total lack of information, your guess is as good as mine,
though.

> As far as thinking that the proposal is ludicrous --- what precisely
> did you find ludicrous about it?

Mainly the idea that discard requests should act as barriers and instead
of fixing that, you propose a lot of complexity to work around it.

> The only problem with SSD's is that the people who designed the ATA TRIM
> command require us to completely drain the I/O queue before issuing
> it. Because of this incompetence, we need to be a bit more careful
> about how we issue them.

And this bit that I wasn't aware of. Such a requirement in the standard
is a trainwreck indeed.

Jörn

--
Victory in war is not repetitious.
-- Sun Tzu

2009-05-11 10:18:15

by Jens Axboe

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11 2009, Jörn Engel wrote:
> On Mon, 11 May 2009 04:37:54 -0400, Theodore Tso wrote:
> >
> > Well, no one has actually implemented the low-level TRIM support yet;
>
> Iirc dwmw2 did so for some of the FTL drivers. More a curiosity than a
> useful device, though.
>
> > Well, no, Matthew's changes didn't do any of that, I suspect because
> > most SSD's, including X25-M, are expected to have a granularity size
> > of 1 block. It's the crazy people in the SCSI standards world who've
> > been pushing for granularity sizes in the 1-4 megabyte range; as I
> > understand things, the granularity issue was not going to be a problem
> > for the ATA TRIM command.
>
> I am not sure about this part. So far Intel has been the only party to
> release any information about their dark-grey box. All other boxes are
> still solid black. And until I'm told otherwise I'd consider them to be
> stupid devices that use erase block size as trim granularity.
>
> Given the total lack of information, your guess is as good as mine,
> though.
>
> > As far as thinking that the proposal is ludicrous --- what precisely
> > did you find ludicrous about it?
>
> Mainly the idea that discard requests should act as barriers and instead
> of fixing that, you propose a lot of complexity to work around it.

But the command is effectively a barrier at the device level anyway,
since you need to drain the hardware queue before issuing.

> > The only problem with SSD's is that the people who designed the ATA TRIM
> > command require us to completely drain the I/O queue before issuing
> > it. Because of this incompetence, we need to be a bit more careful
> > about how we issue them.
>
> And this bit that I wasn't aware of. Such a requirement in the standard
> is a trainwreck indeed.

Precisely, but that's how basically anything works with SATA NCQ, only
read/write dma commands may be queued. Anything else requires an idle
drive before issue.

--
Jens Axboe

2009-05-11 11:27:42

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 12:06:24PM +0200, Jörn Engel wrote:
> I am not sure about this part. So far Intel has been the only party to
> release any information about their dark-grey box. All other boxes are
> still solid black. And until I'm told otherwise I'd consider them to be
> stupid devices that use erase block size as trim granularity.

I believe the ATA TRIM draft standards specs don't have the 1-4
megabyte; that craziness is only coming from the SCSI world. So we do
have more information than what Intel has released; also, note that
OCZ is the first vendor who has shipped publicly available SSD
firmware with Trim support. Supposedly Intel is going to try to get me
their trim-enabled firmware under NDA, but that hasn't happened yet.

> > As far as thinking that the proposal is ludicrous --- what precisely
> > did you find ludicrous about it?
>
> Mainly the idea that discard requests should act as barriers and instead
> of fixing that, you propose a lot of complexity to work around it.

I can't fix hardware braindamage. Given that the standard
specification is terminally broken (and we can't really fix it
without getting the drive manufacturers to rip out and replace NCQ
with something sane --- good luck with that) the complexity is pretty
much unavoidable. Still think my proposal is ludicrous?

- Ted

2009-05-11 12:10:10

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 07:27:29AM -0400, Theodore Tso wrote:
>
> I believe the ATA TRIM draft standards specs don't have the 1-4
> megabyte; that craziness is only coming from the SCSI world. So we do
^^^^ I left out the words "granularity requirement", sorry.

> have more information than what Intel has released; also, note that
> OCZ is the first vendor who has shipped publicly available SSD
> firmware with Trim support. Supposedly Intel is going to try to get me
> their trim-enabled firmware under NDA, but that hasn't happened yet.

I just did a bit more web browsing, and it appears that OCZ's
userspace support for trim is currently Windows-only, and they've
implemented it by taking the filesystem offline, and running a
userspace utility that sends TRIM requests for all of the free space
on the drive.

After doing this, write speeds for sequential writes, random 512k
writes, and random 4k writes all went up by approximately 15-20% on
the OCZ Vertex, at least according to one user who did some
benchmarks. I'm not sure how repeatable that is, and how many random
writes you can do before performance levels fall back to the pre-TRIM
levels.

It's also supported only on 32-bit Windows XP. On 64-bit platforms,
there seems to be an unfortunate tendency (probability around 50%)
that the TRIM enablement software trashes the data stored on the SSD.
So there is currently a warning on the OCZ discussion web forum that
the tools should only be used on 32-bit Windows platforms.

All of the web browsing I've done confirms that the ATA folks expect
trim to work on 512-sector granularity. It's only the lazy b*stards
who don't want to change how their large high-end storage boxes work
that are trying to push for 1-4 megabyte alignment and granularity
requirements in the SCSI standards. I'm not that worried about the
crappy flash devices (mostly SD and Compact flash devices, not SSD's)
that don't do sub-erase block wear-leveling and management; those will
probably get weeded out of the market pretty quickly, since SSD's that
crappy will have really lousy small random write performance as well,
and web sites like Anandtech and PC Perspectives have picked up on why
that really hurts your OS performance on said crappy SSD's.

- Ted

2009-05-11 12:43:49

by Jörn Engel

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, 11 May 2009 07:27:29 -0400, Theodore Tso wrote:
>
> I can't fix hardware braindamage. Given that the standard
> specification is terminally broken (and we can't really fix it
> without getting the drive manufacturers to rip out and replace NCQ
> with something sane --- good luck with that) the complexity is pretty
> much unavoidable. Still think my proposal is ludicrous?

Given the hardware braindamage it is relatively sane. As always, it
would be much better to fix the problem and not add workarounds, but we
seem to lack the gods' favor this time around.

Can't anyone explain to the SATA folks that a discard is much closer to
a write than to a secure erase or some other rare and slow command?

Jörn

--
Mundie uses a textbook tactic of manipulation: start with some
reasonable talk, and lead the audience to an unreasonable conclusion.
-- Bruce Perens

2009-05-11 12:48:22

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 02:43:25PM +0200, Jörn Engel wrote:
> Given the hardware braindamage it is relatively sane. As always, it
> would be much better to fix the problem and not add workarounds, but we
> seem to lack the gods' favor this time around.
>
> Can't anyone explain to the SATA folks that a discard is much closer to
> a write than to a secure erase or some other rare and slow command?

I've heard the ATA committee are working on an NCQ version of TRIM.
Don't expect support soon.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-05-11 13:10:15

by Greg Freemyer

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 8:09 AM, Theodore Tso wrote:
> All of the web browsing I've doen confirms that the ATA folks expect
> trim to work on 512-sector granularity.

Ted,

That implies that the SSD folks are not treating erase blocks as a
contiguous group of sectors. For some reason, I thought there was
only one mapping per erase block and within the erase block the
sectors were contiguous.

If I'm right, then the ata spec may allow you to send sub-erase block
trim commands down, but the spec does not prevent the (blackbox)
hardware from clipping the size of the trim to be on erase block
boundaries and ignoring the sub-erase block portions on each end. Or
ignoring the whole command if your trim command does not span a whole
erase block.

Also the mdraid people plan to clip at the stripe width boundary for
raid 5, 6, etc. Their expectation is that discards will be coalesced
into bigger blocks before it gets to the mdraid layer.

I still think reshaping a raid 5 online will be next to impossible
when some of the stripes may contain indeterminate data.

More realistic is to figure out a way to make it deterministic at
least for the short term (by writing data to all the trimmed blocks?),
then reshaping, then having a tool to scan the filesystem and re-issue
all the trim commands.

Obviously, if the ata spec had a signaling mechanism that
differentiated between deterministic data and non-deterministic data
then the above excess code could be simplified greatly.

Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

2009-05-11 13:21:53

by Ric Wheeler

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On 05/11/2009 08:09 AM, Theodore Tso wrote:
> On Mon, May 11, 2009 at 07:27:29AM -0400, Theodore Tso wrote:
>> I believe the ATA TRIM draft standards specs don't have the 1-4
>> megabyte; that craziness is only coming from the SCSI world. So we do
> ^^^^ I left out the words "granularity requirement", sorry.
>
>> have more information than what Intel has released; also, note that
>> OCZ is the first vendor who has shipped publicly available SSD
>> firmware with Trim support. Supposedly Intel is going to try to get me
>> their trim-enabled firmware under NDA, but that hasn't happened yet.
>
> I just did a bit more web browsing, and it appears that OCZ's
> userspace support for trim is currently Windows-only, and they've
> implemented it by taking the filesystem offline, and running a
> userspace utility that sends TRIM requests for all of the free space
> on the drive.
>
> After doing this, write speeds for sequential writes, random 512k
> writes, and random 4k writes all went up by approximately 15-20% on
> the OCZ Vertex, at least according to one user who did some
> benchmarks. I'm not sure how repeatable that is, and how many random
> writes you can do before performance levels fall back to the pre-TRIM
> levels.
>
> It's also supported only on 32-bit Windows XP. On 64-bit platforms,
> there seems to be an unfortunate tendency (probability around 50%)
> that the TRIM enablement software trashes the data stored on the SSD.
> So there is currently a warning on the OCZ discussion web forum that
> the tools should only be used on 32-bit Windows platforms.
>
> All of the web browsing I've done confirms that the ATA folks expect
> trim to work on 512-sector granularity. It's only the lazy b*stards
> who don't want to change how their large high-end storage boxes work
> that are trying to push for 1-4 megabyte alignment and granularity
> requirements in the SCSI standards. I'm not that worried about the
> crappy flash devices (mostly SD and Compact flash devices, not SSD's)
> that don't do sub-erase block wear-leveling and management; those will
> probably get weeded out of the market pretty quickly, since SSD's that
> crappy will have really lousy small random write performance as well,
> and web sites like Anandtech and PC Perspectives have picked up on why
> that really hurts your OS performance on said crappy SSD's.
>
> - Ted

I don't think that the large arrays are being lazy - it is more a matter of
having to track an enormous amount of storage and running out of bits. There has
been some movement towards using smaller erase chunk sizes which should make
this less of an issue.

One thing that will bite people in the SCSI space might be the WRITE_SAME with
discard bit set. (Adding linux-scsi to this thread)

On the plus side, this has very clear semantics, but if you send down requests
that are not aligned or too small, the device will have to "zero" the contents
of the specified sectors in order to be compliant, if I understand correctly.

In this case, coalescing would almost always be a win as well.

ric


2009-05-11 13:39:48

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 09:10:15AM -0400, Greg Freemyer wrote:
> That implies that the SSD folks are not treating erase blocks as a
> contiguous group of sectors. For some reason, I thought there was
> only one mapping per erase block and within the erase block the
> sectors were contiguous.

I believe there is a mapping per LBA, not per erase block. Of course,
different technologies will have different limitations here, but it
would be foolish to assume anything about SSDs at this point.

(For those who haven't heard my disclaimer before, the Intel SSD team
don't tell me anything fun about how the drives work internally).

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-05-11 14:27:58

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 09:10:15AM -0400, Greg Freemyer wrote:
>
> That implies that the SSD folks are not treating erase blocks as a
> contiguous group of sectors.

Correct.

> For some reason, I thought there was
> only one mapping per erase block and within the erase block the
> sectors were contiguous.

No, if you try to treat erase blocks as a contiguous group of
sectors, you'll have terrible write amplification problems (leading to
premature death of the SSD) and terrible small random write
performance. Flash devices optimized for digital cameras might have
done that, but for SSD's, this will result in catastrophically bad
performance, and very limited lifespan. As I said, I expect these
SSD's to be weeded out of the market very shortly.

For any sane implementation of an SSD, the mapping will be on a per
LBA basis, not on a per-erase block basis.
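
(To put a rough number on the write amplification argument: with a
per-erase-block mapping and, say, a 512 KB erase block, every 4 KB random
write forces a 512 KB read/erase/rewrite cycle, a write amplification of
128x, whereas a per-LBA mapping can in principle stay close to 1x until
garbage collection has to move things around.)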

> More realistic is to figure out a way to make it deterministic at
> least for the short term (by writing data to all the trimmed blocks?),
> then reshaping, then having a tool to scan the filesystem and re-issue
> all the trim commands.

Writing data to all of the trimmed blocks? Um, no. That would be a
disaster, since it accelerates the wear and tear of the SSD. The whole
*point* of the TRIM command is to avoid needing to do that.

The whole worry about determinism is highly overrated. If the
filesystem doesn't need a block, then it doesn't need it. What you
read after you send a TRIM command, whether it is the old data because
the device applied some kind of rounding, or random data, or all
zero's, won't matter to the filesystem. Why should the filesystem
care? I know I certainly don't....

- Ted

2009-05-11 14:29:51

by Ric Wheeler

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On 05/11/2009 10:27 AM, Theodore Tso wrote:
> On Mon, May 11, 2009 at 09:10:15AM -0400, Greg Freemyer wrote:
>> That implies that the SSD folks are not treating erase blocks as a
>> contiguous group of sectors.
>
> Correct.
>
>> For some reason, I thought there was
>> only one mapping per erase block and within the erase block the
>> sectors were contiguous.
>
> No, if you try to treat erase blocks as a contiguous group of
> sectors, you'll have terrible write amplification problems (leading to
> premature death of the SSD) and terrible small random write
> performance. Flash devices optimized for digital cameras might have
> done that, but for SSD's, this will result in catastrophically bad
> performance, and very limited lifespan. As I said, I expect these
> SSD's to be weeded out of the market very shortly.
>
> For any sane implementation of an SSD, the mapping will be on a per
> LBA basis, not on a per-erase block basis.
>
>> More realistic is to figure out a way to make it deterministic at
>> least for the short term (by writing data to all the trimmed blocks?),
>> then reshaping, then having a tool to scan the filesystem and re-issue
>> all the trim commands.
>
> Writing data to all of the trimmed blocks? Um, no. That would be a
> disaster, since it accelerates the wear and tear of the SSD. The whole
> *point* of the TRIM command is to avoid needing to do that.
>
> The whole worry about determinism is highly overrated. If the
> filesystem doesn't need a block, then it doesn't need it. What you
> read after you send a TRIM command, whether it is the old data because
> the device applied some kind of rounding, or random data, or all
> zero's, won't matter to the filesystem. Why should the filesystem
> care? I know I certainly don't....
>
> - Ted


The key is not at the FS layer - this is an issue for people who RAID these
beasts together and want to actually check that the bits are what they should be
(say doing a checksum validity check for a stripe).

ric


2009-05-11 14:50:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 10:29:51AM -0400, Ric Wheeler wrote:
>
> The key is not at the FS layer - this is an issue for people who RAID
> these beasts together and want to actually check that the bits are what
> they should be (say doing a checksum validity check for a stripe).
>

Good point, yes I can see why they need that. In that case, the
storage device can't just silently truncate a TRIM request; it would
have to expose to the OS its alignment requirements. The risk though
is that the more they try to push this complexity into the OS, the higher
the risk that the OS will simply decide not to take advantage of the
functionality. Of course, there is the question why anyone would want
to build a software-raid device on top of a thin-provisioned hardware
storage unit. :-)

- Ted

2009-05-11 14:58:38

by Ric Wheeler

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On 05/11/2009 10:50 AM, Theodore Tso wrote:
> On Mon, May 11, 2009 at 10:29:51AM -0400, Ric Wheeler wrote:
>> The key is not at the FS layer - this is an issue for people who RAID
>> these beasts together and want to actually check that the bits are what
>> they should be (say doing a checksum validity check for a stripe).
>>
>
> Good point, yes I can see why they need that. In that case, the
> storage device can't just silently truncate a TRIM request; it would
> have to expose to the OS its alignment requirements. The risk though
> is that the more they try to push this complexity into the OS, the higher
> the risk that the OS will simply decide not to take advantage of the
> functionality. Of course, there is the question why anyone would want
> to build a software-raid device on top of a thin-provisioned hardware
> storage unit. :-)
>
> - Ted


Probably not as uncommon as you would think, though not quite as you suggest,
i.e. RAIDing thin-provisioned LUNs (those are usually done as RAID devices
inside an array).

Think more of the array providing a thinly provisioned LUN made up out of T13
TRIM enabled SSD's devices internally. RAID makes sense here (data protection
is still needed to avoid a single point of failure) and the relative expense of
the SSD's devices makes "thin provisioning" really attractive to external users :-)

ric

2009-05-11 15:00:40

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 10:50:59AM -0400, Theodore Tso wrote:
> On Mon, May 11, 2009 at 10:29:51AM -0400, Ric Wheeler wrote:
> > The key is not at the FS layer - this is an issue for people who RAID
> > these beasts together and want to actually check that the bits are what
> > they should be (say doing a checksum validity check for a stripe).
>
> Good point, yes I can see why they need that. In that case, the
> storage device can't just silently truncate a TRIM request; it would
> have to expose to the OS its alignment requirements. The risk though
>> is that the more they try to push this complexity into the OS, the higher
> the risk that the OS will simply decide not to take advantage of the
> functionality. Of course, there is the question why anyone would want
> to build a software-raid device on top of a thin-provisioned hardware
> storage unit. :-)

It's not a problem for people who use Thin Provisioning, it's a problem
for people who want to run RAID-5 on top of SSDs. If you have a sector
whose reads are indeterminate, your parity for that stripe will always
be wrong.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-05-11 15:43:44

by Jeff Garzik

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

Jens Axboe wrote:
>>> The only problem with SSD's is that the people who designed the ATA TRIM
>>> command require us to completely drain the I/O queue before issuing
>>> it. Because of this incompetence, we need to be a bit more careful
>>> about how we issue them.
>> And this bit that I wasn't aware of. Such a requirement in the standard
>> is a trainwreck indeed.
>
> Precisely, but that's how basically anything works with SATA NCQ, only
> read/write dma commands may be queued. Anything else requires an idle
> drive before issue.

Very true -- but FWIW, one option being considered at T13 is having a
queue-able TRIM.

Jeff



2009-05-11 16:30:08

by Chris Worley

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 4:06 AM, Jörn Engel <[email protected]> wrote:
>
> I am not sure about this part. So far Intel has been the only party to
> release any information about their dark-grey box. All other boxes are
> still solid black. And until I'm told otherwise I'd consider them to be
> stupid devices that use erase block size as trim granularity.

Consider yourself informed otherwise: I can't imagine a vendor
brain-dead enough to require that. It's the same thing as requiring
that the only way to re-write an LBA is to re-write all LBA's in the
same erase block.

Note that Fusion-io supports "discard" in their released 1.2.5 drivers.

Chris

2009-05-11 17:19:08

by Chris Mason

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, 2009-05-11 at 04:41 -0400, Theodore Tso wrote:
> On Mon, May 11, 2009 at 10:12:16AM +0200, Jens Axboe wrote:
> >
> > I largely agree with this. I think that trims should be queued and
> > postponed until the drive is largely idle. I don't want to put this IO
> > tracking in the block layer though, it's going to slow down our iops
> > rates for writes. Providing the functionality in the block layer does
> > make sense though, since it sits between that and the fs anyway. So just
> > not part of the generic IO path, but a set of helpers on the side.
>
> Yes, I agree. However, in that case, we need two things from the
> block I/O path. (A) The discard management layer needs a way of
> knowing that the block device has become idle, and (B) ideally there
> should be a more efficient method for sending trim requests to the I/O
> submission path.

Just a quick me too on the performance problem. The way btrfs does
trims today is going to be pretty slow as well.

For both btrfs and lvm, the filesystem is going to maintain free block
information based on logical block numbers. The generic trim layer
should probably be based on a logical address that is stored per-bdi.

Then the bdi will need a callback to turn the logical address based trim
extent into physical extents on N number of physical devices.

The tricky part is how will the FS decide a given block is actually
reusable. We'll need a call back into the FS that indicates trim is
complete on a given logical extent.
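
(A hedged sketch of what those per-bdi hooks might look like -- every name
here is invented, and the stub types exist only so the fragment stands alone:)

typedef unsigned long long sector_t;    /* stand-in for the kernel type */

struct trim_extent {
    sector_t logical_start;
    sector_t len;
};

struct trim_phys_extent {
    void *bdev;                 /* stand-in for the physical device */
    sector_t phys_start;
    sector_t len;
};

struct bdi_trim_ops {
    /* Translate one logical extent into up to max_phys physical extents
     * (for lvm or a multi-device btrfs this can fan out across devices). */
    int (*map_trim)(void *bdi, const struct trim_extent *ext,
                    struct trim_phys_extent *phys, int max_phys);

    /* Called once the discard has actually gone down, so the filesystem
     * knows the logical blocks are safe to hand out again. */
    void (*trim_done)(void *fs_private, const struct trim_extent *ext);
};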

-chris
>


2009-05-11 18:43:23

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 01:18:45PM -0400, Chris Mason wrote:
> For both btrfs and lvm, the filesystem is going to maintain free block
> information based on logical block numbers. The generic trim layer
> should probably be based on a logical address that is stored per-bdi.
>
> Then the bdi will need a callback to turn the logical address based trim
> extent into physical extents on N number of physical device.
>
> The tricky part is how will the FS decide a given block is actually
> reusable. We'll need a call back into the FS that indicates trim is
> complete on a given logical extent.

Actually, that's the exact opposite of what you want. You want to try
to reuse blocks that are scheduled for trimming so that we never have to
send the command at all.

2009-05-11 18:47:30

by Greg Freemyer

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 11:00 AM, Matthew Wilcox <[email protected]> wrote:
> On Mon, May 11, 2009 at 10:50:59AM -0400, Theodore Tso wrote:
>> On Mon, May 11, 2009 at 10:29:51AM -0400, Ric Wheeler wrote:
>> > The key is not at the FS layer - this is an issue for people who RAID
>> > these beasts together and want to actually check that the bits are what
>> > they should be (say doing a checksum validity check for a stripe).
>>
>> Good point, yes I can see why they need that. In that case, the
>> storage device can't just silently truncate a TRIM request; it would
>> have to expose to the OS its alignment requirements. The risk though
>> is that the more they try to push this complexity into the OS, the higher
>> the risk that the OS will simply decide not to take advantage of the
>> functionality. Of course, there is the question why anyone would want
>> to build a software-raid device on top of a thin-provisioned hardware
>> storage unit. :-)
>
> It's not a problem for people who use Thin Provisioning, it's a problem
> for people who want to run RAID-5 on top of SSDs. If you have a sector
> whose reads are indeterminate, your parity for that stripe will always
> be wrong.

Thus my understanding that an entire stripe will either be discarded or
not by the mdraid layer.

And if a discard comes along from above that is smaller than a stripe,
then it will be tossed by the mdraid layer.

And if it is not aligned to the stripe geometry, then the start/end of
the discard area will be adjusted to be stripe aligned.

And since the mdraid layer is not currently planning to track what has
been discarded over time, when a re-shape comes along, it will
effectively un-trim everything and rewrite 100% of the FS.

The same thing will happen if a drive is cloned via dd as happens
pretty routinely.

Overall, I think Linux will need a mechanism to scan a filesystem and
re-issue all the trim commands in order to get the hardware back in
sync as a major maintenance activity. That mechanism could either be
admin-invoked or an always-on maintenance task.

Personally, I think the best option is a background task (kernel I
assume) to scan the filesystem and issue discards for all the data on
a slow but steady basis. If it takes a week to make its way around
the disk/volume, then it takes a week. Who really cares.

Once you assume you have that background task in place, I'm not sure
how important it is to even have the filesystem manage this in
realtime with the file deletes.

Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

2009-05-11 18:54:23

by Chris Mason

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, 2009-05-11 at 11:43 -0700, Matthew Wilcox wrote:
> On Mon, May 11, 2009 at 01:18:45PM -0400, Chris Mason wrote:
> > For both btrfs and lvm, the filesystem is going to maintain free block
> > information based on logical block numbers. The generic trim layer
> > should probably be based on a logical address that is stored per-bdi.
> >
> > Then the bdi will need a callback to turn the logical address based trim
> > extent into physical extents on N number of physical device.
> >
> > The tricky part is how will the FS decide a given block is actually
> > reusable. We'll need a call back into the FS that indicates trim is
> > complete on a given logical extent.
>
> Actually, that's the exact opposite of what you want. You want to try
> to reuse blocks that are scheduled for trimming so that we never have to
> send the command at all.

Regardless of the optimal way to reuse blocks, we need some way of
knowing the discard is done, or at least sent down to the device in such
a way that any writes will happen after the discard and not before.

-chris





2009-05-11 19:19:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 02:53:15PM -0400, Chris Mason wrote:
> > Actually, that's the exact opposite of what you want. You want to try
> > to reuse blocks that are scheduled for trimming so that we never have to
> > send the command at all.
>
> Regardless of the optimal way to reuse blocks, we need some way of
> knowing the discard is done, or at least sent down to the device in such
> a way that any writes will happen after the discard and not before.

An easy way of solving this is simply to have a way for the block
allocator to inform the discard management layer that a particular
block is now in use again. That will prevent the discard from
happening. If the discard is in flight, then the interface won't be
able to return until the discard is done. (This is where real
OS-controlled ordering via dependency --- which NCQ doesn't provide
--- combined with discard/trim as a queuable operation --- would be
really handy.)

One of the things which I worry about is that the discard management layer
could be an SMP contention point, since the filesystem will need to
call it before every block allocation or deallocation.

Hmm... maybe the better approach is to let the filesystem keep the
authoritative list of what's free and not free, and only keep a range
of blocks where some deallocation has taken place. Then when the
filesystem is quiescent, we can lock out block allocations and scan the
block bitmaps, and then send a trim request for anything that's not in
use in a particular region (i.e. allocation group) of the filesystem.

After all, quiescing the block I/O queues is what is expensive;
sending a large number of block ranges attached to a single ATA TRIM
command looks cheap by comparison. So maybe we just lock out the
block group, and send a TRIM for all the unused blocks in that block
group, and only keep track of which block groups should be scanned via
a flag in the block group descriptors. That might be a much simpler
approach.

- Ted


2009-05-11 19:23:01

by Andreas Dilger

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On May 11, 2009 14:47 -0400, Greg Freemyer wrote:
> Overall, I think Linux will need a mechanism to scan a filesystem and
> re-issue all the trim commands in order to get the hardware back in
> sync as a major maintenance activity. That mechanism could either be
> admin-invoked or an always-on maintenance task.
>
> Personally, I think the best option is a background task (kernel I
> assume) to scan the filesystem and issue discards for all the data on
> a slow but steady basis. If it takes a week to make its way around
> the disk/volume, then it takes a week. Who really cares.

I'd suggested that we can also modify e2fsck to (optionally) send the
definitive list of blocks to be trimmed at that time. It shouldn't
necessarily be done every time e2fsck is run, because that would
kill any chance of data recovery, but should be optional.

Other filesystem checking tools (say btrfs online check) can periodically
do the same - lock an idle group from new allocations, scan the allocation
bitmap for all unused blocks, send a trim command for any regions >=
erase block size, unlock group. It might make more sense to do this
than send thousands of trim operations while the filesystem is busy.
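
(A toy version of that "scan the allocation bitmap, trim only the big free
runs" pass, with made-up numbers -- a set bit meaning the block is in use:)

#include <stdio.h>

/* Trim only free runs that are at least an erase block long. */
static void trim_free_runs(const unsigned char *bitmap, unsigned long nblocks,
                           unsigned long group_start, unsigned long min_len)
{
    unsigned long i = 0;

    while (i < nblocks) {
        unsigned long run_start, run_len;

        while (i < nblocks && (bitmap[i / 8] & (1 << (i % 8))))
            i++;                                /* skip used blocks */
        run_start = i;
        while (i < nblocks && !(bitmap[i / 8] & (1 << (i % 8))))
            i++;                                /* measure the free run */
        run_len = i - run_start;
        if (run_len >= min_len)
            printf("trim %lu +%lu\n", group_start + run_start, run_len);
    }
}

int main(void)
{
    /* 32 blocks in the group: 0-3 and 20-23 in use, the rest free. */
    unsigned char bitmap[4] = { 0x0f, 0x00, 0xf0, 0x00 };

    trim_free_runs(bitmap, 32, 100000, 8);      /* min run = 8 blocks */
    return 0;
}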

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-05-11 22:03:38

by Chris Worley

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 6:48 AM, Matthew Wilcox <[email protected]> wrote:
>
> On Mon, May 11, 2009 at 02:43:25PM +0200, Jörn Engel wrote:
> > Given the hardware braindamage it is relatively sane. As always, it
> > would be much better to fix the problem and not add workarounds, but we
> > seem to lack the gods' favor this time around.
> >
> > Can't anyone explain to the SATA folks that a discard is much closer to
> > a write than to a secure erase or some other rare and slow command?
>
> I've heard the ATA committee are working on an NCQ version of TRIM.

Doesn't this fact make this discussion moot?

If the ATA committee knows they've got a problem, and are fixing it at
the level where the problem exists, why is it Linux's job to fix it at a
higher level?

The proposed solutions are going to consume CPU and slow down I/O
unnecessarily, as well as inefficiently dispatch Discards (i.e. the
longer the time between the discard and the reuse of a block, the
better). If they are going to be implemented, then have a special
"brain-dead ATA mode" that doesn't inhibit solutions that can
implement Discard w/o the "queue draining" required by the broken
implementation.

Chris
P.S. Why was ext2/discard functionality removed?

2009-05-11 23:38:08

by NeilBrown

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Monday May 11, [email protected] wrote:
>
> And since the mdraid layer is not currently planning to track what has
> been discarded over time, when a re-shape comes along, it will
> effectively un-trim everything and rewrite 100% of the FS.

You might not call them "plans" exactly, but I have had thoughts
about tracking which parts of a raid5 had 'live' data and which were
trimmed. I think that is the only way I could support TRIM, unless
devices guarantee that all trimmed blocks read as zeros, and that seems
unlikely.
You are right that the granularity would have to be at least
one stripe.
And a re-shape would be interesting, wouldn't it! We could probably
avoid instantiating every trimmed block, but in general quite a few
would get instantiated. I hadn't thought about that...

NeilBrown

2009-05-12 13:28:52

by Greg Freemyer

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, May 11, 2009 at 7:38 PM, Neil Brown <[email protected]> wrote:
> On Monday May 11, [email protected] wrote:
>>
>> And since the mdraid layer is not currently planning to track what has
>> been discarded over time, when a re-shape comes along, it will
>> effectively un-trim everything and rewrite 100% of the FS.
>
> You might not call them "plans" exactly, but I have had thoughts
> about tracking which parts of a raid5 had 'live' data and which were
> trimmed. I think that is the only way I could support TRIM, unless
> devices guarantee that all trimmed blocks read as zeros, and that seems
> unlikely.

Neil,

Re: raid 5, etc. No FS info/discussion

The latest T13 proposed spec I saw explicitly allows reads from
trimmed sectors to return non-determinate data in some devices. There
is a per-device flag you can read to see if a device does that or not.
I think mdraid needs to simply assume all trimmed sectors return
non-determinate data. Either that, or simply check that per device
flag and refuse to accept a drive that supports returning
non-determinate data.

Regardless, ignoring reshape, why do you need to track it?

... thinking

Oh yes, you will have to track it at least at the stripe level.

If p = d1 ^ d2 is not guaranteed to be true due to a stripe discard
and p, d1, d2 are all potentially non-determinate, all is good at first
because who cares that d1 = p ^ d2 is not true for your discarded
stripe. d1 is effectively just random data anyway.

But as soon as either d1 or d2 is written to, you will need to force
the entire stripe back into a determinate state or else you will have
unprotected data sitting on that stripe. You can only do that if you
know the entire stripe was previously indeterminate, thus you have no
option but to track the state of the stripes if mdraid is going to
support discards with devices that advertise themselves as returning
indeterminate data.
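
(A one-byte-per-chunk illustration of exactly that failure, with made-up
values for what the discarded chunks happen to read back as:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* One byte per chunk, just to show the arithmetic of a healthy stripe. */
    uint8_t d1 = 0xAA, d2 = 0x55, p = d1 ^ d2;
    printf("healthy stripe: parity 0x%02x\n", p);

    /* Whole stripe discarded; what the drives return afterwards is
     * indeterminate -- pretend they now read back as: */
    uint8_t d1_r = 0x12, d2_r = 0x34, p_r = 0x56;

    /* d1 is rewritten using the usual read-modify-write parity update,
     * computed from those indeterminate "old" values: */
    uint8_t d1_new = 0xC3;
    uint8_t p_new = p_r ^ d1_r ^ d1_new;

    /* Now lose the d2 disk and try to reconstruct it: */
    uint8_t d2_rebuilt = p_new ^ d1_new;
    printf("d2 reads as 0x%02x but rebuilds as 0x%02x -> %s\n",
           d2_r, d2_rebuilt,
           d2_r == d2_rebuilt ? "consistent" : "stripe was unprotected");
    return 0;
}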

So Neil, it looks like you need to move from thoughts about tracking
discards to planning to track discards.

FYI: I don't know if it is just for show, or if people really plan to do
it, but I have seen several people build up very high performance raid
arrays from SSDs already. Seems that about 8 SSDs maxes out the
current group of sata controllers, pci-express, etc.

Since SSDs with trim support should be even faster, I suspect these
ultra-high performance setups will want to use them.

Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

2009-05-29 11:05:22

by Florian Weimer

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

* Matthew Wilcox:

> Actually, that's the exact opposite of what you want. You want to try
> to reuse blocks that are scheduled for trimming so that we never have to
> send the command at all.

I thought that the device would receive as many TRIM commands as
possible, to aid its internal reorganization process? A write which
overwrites a whole block could be equivalent, but it may not be a good
idea to artificially increase I/O traffic to turn partial writes to
unused blocks into full-block writes.

--
Florian Weimer <[email protected]>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

2010-04-24 17:12:00

by Phillip Susi

[permalink] [raw]
Subject: Re: Is TRIM/DISCARD going to be a performance problem?

On Mon, 2009-05-11 at 08:09 -0400, Theodore Tso wrote:
> I just did a bit more web browsing, and it appears that OCZ's
> userspace support for trim is currently Windows-only, and they've
> implemented it by taking the filesystme off line, and running a
> userspace utility that sends TRIM requests for all of the free space
> on the drive.

Recent releases of hdparm also support sending down TRIM commands and
have a script that maps the unused blocks of the fs to trim them.
Someone even posted a GUI on the OCZ forums that monitors the amount of
data written to the fs since the last trim and reminds you to run
another pass. The script can trim the fs while it is online by creating
a large temporary file to occupy the remaining free space, mapping the
blocks it is allocated, and trimming those.
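
(For the curious, the core of that trick is easy to sketch on Linux with
FIEMAP: fallocate a big temporary file, ask the filesystem where its blocks
landed, and those physical ranges are the free space one would trim. The
actual discard step and most error handling are left out of this sketch:)

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define MAX_EXTENTS 256

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "trim-me.tmp";
    long long size = 64LL << 20;                /* 64 MB, just for the example */
    struct fiemap *fm;
    unsigned int i;
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0) {
        perror(path);
        return 1;
    }
    /* Grab a chunk of the filesystem's free space. */
    if ((errno = posix_fallocate(fd, 0, size))) {
        perror("fallocate");
        return 1;
    }

    /* Ask where those blocks actually landed on the device. */
    fm = calloc(1, sizeof(*fm) + MAX_EXTENTS * sizeof(struct fiemap_extent));
    fm->fm_start = 0;
    fm->fm_length = size;
    fm->fm_flags = FIEMAP_FLAG_SYNC;
    fm->fm_extent_count = MAX_EXTENTS;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
        perror("FIEMAP");
        return 1;
    }

    for (i = 0; i < fm->fm_mapped_extents; i++)
        printf("would trim byte range %llu +%llu\n",
               (unsigned long long)fm->fm_extents[i].fe_physical,
               (unsigned long long)fm->fm_extents[i].fe_length);

    unlink(path);       /* the real tool would trim first, then delete */
    return 0;
}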