I spent an hour talking to an architecture guy from a major flash
manufacturer, who makes everything from SSDs to SD cards to eMMC
devices, and he said a few things that were interesting.
One is that he would actually be very happy if we sent lots of extra
trim commands; in particular, he would actually *like* us to send trims
at unlink/commit time, *and* trims periodically via FITRIM. The reason
is that, if the disk is busy, it would be OK for him to drop the TRIM
on the floor, knowing that he would get another bite at the apple later
on. But if the disk has time to process the trim, he would be able to
use that information as quickly as possible.
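(For reference, the periodic side of this already exists in userspace:
the FITRIM ioctl is what fstrim uses. A minimal sketch of a periodic
trimmer, where the mount point /mnt is just a placeholder:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>

    int main(void)
    {
        /* Ask the filesystem to trim all of its free space. */
        struct fstrim_range range = {
            .start = 0,
            .len = UINT64_MAX,  /* cover the whole filesystem */
            .minlen = 0,        /* let the fs pick a minimum extent */
        };
        int fd = open("/mnt", O_RDONLY);  /* placeholder mount point */

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (ioctl(fd, FITRIM, &range) < 0) {
            perror("FITRIM");
        } else {
            /* On success the kernel updates len to bytes trimmed. */
            printf("trimmed %llu bytes\n",
                   (unsigned long long) range.len);
        }
        close(fd);
        return 0;
    }

Run something like that out of cron and the drive gets the periodic
"another bite at the apple" behavior he is asking for.)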
One of the other things we talked about was that it would be really
nice if we could send TRIM commands at journal checkpoint time, and
perhaps send checkpoints more aggressively (although the requirement to
send a SYNCHRONIZE CACHE command may make this too expensive, unless we
have ways of reliably knowing when the disk is idle; unlike the
enterprise server case, when ext4 is used in a mobile device, the fs
access patterns tend to have more gaps where this sort of maintenance
can take place).
We also talked about ways that we might write some application notes so
that handset OEMs understand how to use mke2fs parameters to optimize
their file systems for different types of flash systems, and perhaps
ways that the eMMC spec could be enhanced so that key parameters such as
erase block size, flash page size, and translation table granularity
could be passed back to the block layer and made available to the file
system and mkfs.
Anyway, going back to TRIM, I suspect that efforts to optimize out TRIM
requests may not make as much sense once we have devices which are SATA
3.1 compliant, when we will have a queueable TRIM command. Also,
presumably SATA 3.1 compliant devices are less likely to have
disastrous firmware bugs that make TRIM such a performance dog, and in
fact they may be devices that would very much like as much TRIM
information as we are willing to send to them.
Regards,
- Ted
On 3/2/12 3:00 PM, Theodore Ts'o wrote:
> I spent an hour talking to an architecture guy from a major flash
> manufacturer, who makes everything from SSDs to SD cards to eMMC
> devices, and he said a few things that were interesting.
>
> One is that he would actually be very happy if we sent lots of extra
> trim commands; in particular, he would actually *like* us to send trims
> at unlink/commit time, *and* trims periodically via FITRIM. The reason
> is that, if the disk is busy, it would be OK for him to drop the TRIM
> on the floor, knowing that he would get another bite at the apple later
> on. But if the disk has time to process the trim, he would be able to
> use that information as quickly as possible.
Is that within spec?
> One of the other things we talked about was that it would be really
> nice if we could send TRIM commands at journal checkpoint time, and
> perhaps send checkpoints more aggressively (although the requirement to
> send a SYNCHRONIZE CACHE command may make this too expensive, unless we
> have ways of reliably knowing when the disk is idle; unlike the
> enterprise server case, when ext4 is used in a mobile device, the fs
> access patterns tend to have more gaps where this sort of maintenance
> can take place).
>
> We also talked about ways that we might write some application notes so
> that handset OEMs understand how to use mke2fs parameters to optimize
> their file systems for different types of flash systems, and perhaps
> ways that the eMMC spec could be enhanced so that key parameters such as
> erase block size, flash page size, and translation table granularity
> could be passed back to the block layer and made available to the file
> system and mkfs.
Now that would be nice. Could some of this just be piggybacked on the
existing preferred_io_size-type geometry interfaces?
-Eric
> Anyway, going back to TRIM, I suspect that efforts to optimize out TRIM
> requests may not make as much sense once we have devices which are SATA
> 3.1 compliant, when we will have a queueable TRIM command. Also,
> presumably SATA 3.1 compliant devices are less likely to have
> disastrous firmware bugs that make TRIM such a performance dog, and in
> fact they may be devices that would very much like as much TRIM
> information as we are willing to send to them.
>
> Regards,
>
> - Ted
On Fri, Mar 02, 2012 at 03:04:48PM -0600, Eric Sandeen wrote:
> > One is that he would actually be very happy if we sent lots of extra
> > trim commands; in particular, he would actually *like* us to send trims
> > at unlink/commit time, *and* trims periodically via FITRIM. The reason
> > is that, if the disk is busy, it would be OK for him to drop the TRIM
> > on the floor, knowing that he would get another bite at the apple later
> > on. But if the disk has time to process the trim, he would be able to
> > use that information as quickly as possible.
>
> Is that within spec?
Yup; the drive manufacturer is free to do anything they want with the
TRIM command; it's purely advisory. So dropping it on the floor because
you're too busy (say, some other process is sending random 4k writes to
you at a high rate) is within spec.
Or if the thin provisioning service is only tracking blocks with a
granularity of 4 megabytes, and it receives a trim request for less than
4 megabytes, it is again perfectly free to drop the trim request on the
floor. I'm even aware of one implementation which remembers the trim
request while the system is powered on, but since it doesn't
(necessarily) write the trim information to stable store, you could trim
the block, read the block and get zeros, then take a power failure, and
afterwards read the block and get the previous contents.
As far as I know, the TRIM spec allows all of this.
> > We also talked about ways that we might write some application notes so
> > that handset OEMs understand how to use mke2fs parameters to optimize
> > their file systems for different types of flash systems, and perhaps
> > ways that the eMMC spec could be enhanced so that key parameters such as
> > erase block size, flash page size, and translation table granularity
> > could be passed back to the block layer and made available to the file
> > system and mkfs.
>
> Now that would be nice. Could some of this just be piggybacked on the
> existing preferred_io_size-type geometry interfaces?
As far as the /sys/block/XXX/queue/* framework, certainly. It's not
clear, however, whether or not we should use entirely new parameters,
or try to reuse the existing parameters. For example, would it be
better to use optimal_io_size for the flash page size, or the erase
block size?
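(For concreteness, each of these is just a one-value sysfs file, so here
is a quick sketch of reading the current parameters from userspace, with
"sda" as a placeholder device name:

    #include <stdio.h>

    /* Read one value from /sys/block/<disk>/queue/<name>;
     * return -1 if the attribute is missing or unreadable. */
    static long queue_param(const char *disk, const char *name)
    {
        char path[256];
        long val = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s",
                 disk, name);
        f = fopen(path, "r");
        if (!f)
            return -1;
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
        return val;
    }

    int main(void)
    {
        printf("minimum_io_size:     %ld\n",
               queue_param("sda", "minimum_io_size"));
        printf("optimal_io_size:     %ld\n",
               queue_param("sda", "optimal_io_size"));
        printf("discard_granularity: %ld\n",
               queue_param("sda", "discard_granularity"));
        return 0;
    }

So the plumbing for new attributes would be cheap; the open question is
purely the semantics.)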
- Ted
On Fri, 2 Mar 2012, Theodore Ts'o wrote:
>
> I spent an hour talking to an architecture guy from a major flash
> manufacturer, who makes everything from SSDs to SD cards to eMMC
> devices, and he said a few things that were interesting.
>
> One is that he would actually be very happy if we sent lots of extra
> trim commands; in particular, he would actually *like* us to send trims
> at unlink/commit time, *and* trims periodically via FITRIM. The reason
> is that, if the disk is busy, it would be OK for him to drop the TRIM
> on the floor, knowing that he would get another bite at the apple later
> on. But if the disk has time to process the trim, he would be able to
> use that information as quickly as possible.
Hi Ted,
yes, they can do a lot of things behind the curtain, and dropping the
TRIM on the floor is clearly one of them. We do not actually care all
that much, but they should export proper flags accordingly. So if TRIMs
can be dropped on the floor, or if the unmapped regions can be read
again after a power cycle, they should not export the "discard zeroes
data" flag. Of course we do not want them to drop every TRIM command
either :).
I think that we would very much like to enable '-o discard'; however,
it is still very slow, due to the fact that it is a non-queueable
command and it takes a while to process as well. Moreover, I have
noticed that some devices become 'busy' after they get the TRIM command,
hence performance is lower for a short period of time after the TRIM.
>
> One of the other things we talked about was that it would be really
> nice if we could send TRIM commands at journal checkpoint time, and
> perhaps send checkpoints more aggressively (although the requirement to
> send a SYNCHRONIZE CACHE command may make this too expensive, unless we
> have ways of reliably knowing when the disk is idle; unlike the
> enterprise server case, when ext4 is used in a mobile device, the fs
> access patterns tend to have more gaps where this sort of maintenance
> can take place).
>
> We also talked about ways that we might write some application notes so
> that handset OEMs understand how to use mke2fs parameters to optimize
> their file systems for different types of flash systems, and perhaps
> ways that the eMMC spec could be enhanced so that key parameters such as
> erase block size, flash page size, and translation table granularity
> could be passed back to the block layer and made available to the file
> system and mkfs.
Regarding eMMC, it would also be very nice of them to stop optimizing
their flash for FAT, and instead take a more general approach and
advertise which parts of the flash are faster than others :). Also, from
what I know, doing frequent discards on those flashes might make them
wear out much faster, because the wear leveling involves copying data
around the flash so that whole erase blocks can be freed.
>
> Anyway, going back to TRIM, I suspect that efforts to optimize out TRIM
> requests may not make as much sense once we have devices which are SATA
> 3.1 compliant, when we will have a queueable TRIM command. Also,
> presumably SATA 3.1 compliant devices are less likely to have
> disastrous firmware bugs that make TRIM such a performance dog, and in
> fact they may be devices that would very much like as much TRIM
> information as we are willing to send to them.
That is definitely very good news; however, those optimizations still
make sense. SSDs are not the only discard-capable devices out there, nor
will the SATA 3.1 compliant SSDs be. So we still need some kind of
optimization so that it does not hurt performance on thin-provisioned
storage, or today's SSDs, right?
But I definitely agree that we should start looking into enabling the
new SSDs to be more effective, and if frequent discard can help them,
then we could start to look at how to enable -o discard for such devices
by default. Maybe /sys/block/sda/queue/discard_queuable or something.
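To make the idea concrete, mount tooling could gate the default on such
a flag. A hypothetical sketch; note that discard_queuable does not exist
today, it is exactly the attribute being proposed above:

    #include <stdio.h>

    /* Check the *hypothetical* discard_queuable attribute proposed
     * above; treat a missing attribute as "not queueable". */
    static int discard_queuable(const char *disk)
    {
        char path[256];
        int val = 0;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/block/%s/queue/discard_queuable", disk);
        f = fopen(path, "r");
        if (!f)
            return 0;
        if (fscanf(f, "%d", &val) != 1)
            val = 0;
        fclose(f);
        return val;
    }

    int main(void)
    {
        printf("default -o discard for sda: %s\n",
               discard_queuable("sda") ? "yes" : "no");
        return 0;
    }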
Thanks!
-Lukas
>
> Regards,
>
> - Ted
>
On Fri, Mar 2, 2012 at 6:11 PM, Ted Ts'o <[email protected]> wrote:
> I'm even aware of one implementation which remembers the trim
> request while the system is powered on, but since it doesn't
> (necessarily) write the trim information to stable store, you could
> trim the block, read the block and get zeros, then take a power
> failure, and afterwards read the block and get the previous contents.
>
> As far as I know, the TRIM spec allows all of this.
It's been a while since I read the spec, but I believe the read
operation above changes the rules.
That is, if the SSD advertises itself as having deterministic reads
after a trim, that read should lock in the values, and a power cycle
should not change that, as I understood the spec.
Otherwise what you describe would be a non-deterministic read. That is
also allowed, but the drive would need to advertise itself as
non-deterministic after trim.
Greg
>>>>> "Eric" == Eric Sandeen <[email protected]> writes:
>> We also talked about ways that we might write some application notes
>> so that handset OEMs understand how to use mke2fs parameters to
>> optimize their file systems for different types of flash systems, and
>> perhaps ways that the eMMC spec could be enhanced so that key
>> parameters such as erase block size, flash page size, and translation
>> table granularity could be passed back to the block layer and made
>> available to the file system and mkfs.
Eric> Now that would be nice. Could some of this just be piggybacked on
Eric> the existing preferred_io_size-type geometry interfaces?
So far the barrier has been that the flash manufacturers did not want to
disclose the erase block size, etc. That's why the original
standardization efforts in that department were shelved.
If the devices actually start exporting this information I'll be happy
to put it in the topology.
--
Martin K. Petersen Oracle Linux Engineering
>>>>> "Ted" == Ted Ts'o <[email protected]> writes:
Ted> As far as the /sys/block/XXX/queue/* framework, certainly. It's
Ted> not clear, however, whether or not we should use entirely new
Ted> parameters, or try to reuse the existing parameters. For example,
Ted> would it be better to use optimal_io_size for the flash page size,
Ted> or the erase block size?
If we were to use the existing fields we'd probably set min_io to the
flash page size and optimal_io to the erase block size.
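To make that concrete, mkfs could derive its layout hints straight from
those two fields. A sketch with made-up example sizes (128k flash page,
4 meg erase block):

    #include <stdio.h>

    int main(void)
    {
        unsigned long fs_block = 4096;             /* ext4 block size */
        unsigned long min_io   = 128 * 1024;       /* flash page size */
        unsigned long opt_io   = 4 * 1024 * 1024;  /* erase block size */

        /* ext4 expresses stride and stripe-width in fs blocks. */
        printf("mke2fs -b %lu -E stride=%lu,stripe-width=%lu ...\n",
               fs_block, min_io / fs_block, opt_io / fs_block);
        return 0;
    }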
--
Martin K. Petersen Oracle Linux Engineering
On Tue, Mar 06, 2012 at 01:44:28PM -0500, Martin K. Petersen wrote:
> >>>>> "Ted" == Ted Ts'o <[email protected]> writes:
>
> Ted> As far as the /sys/block/XXX/queue/* framework, certainly. It's
> Ted> not clear, however, whether or not we should use entirely new
> Ted> parameters, or try to reuse the existing parameters. For example,
> Ted> would it be better to use optimal_io_size for the flash page size,
> Ted> or the erase block size?
>
> If we were to use the existing fields we'd probably set min_io to the
> flash page size and optimal_io to the erase block size.
But min_io currently means the smallest size that we're allowed to
write, correct? And the flash page size could be 128k while 512-byte
writes might be perfectly OK; it's just that writes are more optimal at
128k, and would be even more optimal at the erase block size of 4 megs.
That's why I'm not sure it makes sense to use the existing fields, since
it will confuse file system utilities that are reading those fields.
- Ted
>>>>> "Ted" == Ted Ts'o <[email protected]> writes:
Ted> But min_io currently means the smallest size that we're allowed to
Ted> write, correct?
Without incurring a penalty, yes. That was conceived in the standards
with 4K sectors and RAID RMW in mind. But I think it would apply to SSDs
as well. Depending on how the mkfs.* tools interpret the field,
obviously.
Ted> And the flash page size could be 128k while 512-byte writes might
Ted> be perfectly OK; it's just that writes are more optimal at 128k,
Ted> and would be even more optimal at the erase block size of 4 megs.
Yep. Just like in the RAID case, where writing a full chunk is better
than just a logical block, and a full stripe is even better.
Ted> That's why I'm not sure it makes sense to use the existing fields,
Ted> since it will confuse file system utilities that are reading those
Ted> fields.
Happy to add new fields if it makes sense. But right now ATA ACS doesn't
even have anything corresponding to the SCSI fields that populate min_io
and opt_io.
--
Martin K. Petersen Oracle Linux Engineering