2011-03-28 10:34:45

by Kyungmin Park

Subject: [PATCH v6] fat: Batched discard support for fat

From: Kyungmin Park <[email protected]>

FAT supports batched discard like ext4 does.

Quoting Lukas:
"The current solution is not ideal because of its bad performance impact.
So the basic idea to improve things is to avoid discarding every time some
blocks are freed, and instead batch them together into bigger trims,
which tends to be more effective."

You can find detailed information at the following URLs.
http://lwn.net/Articles/397538/
http://lwn.net/Articles/383933/

Clarify the meaning of "len" (cited from Lukas's mail):

Let the "O" be free (bytes, blocks, whatever), and "=" be used.
Now, we have a filesystem like this.

OOOO==O===OO===OOOOO==O===O===OOOOOOO===
^                                       ^
0                                       40

This is how it is supposed to work if you called FITRIM with these parameters:

start = 0
minlen = 2
len = 20

So you will go through (blocks, bytes...) 0 -> 20

OOOO==O===OO===OOOOO==O===O===OOOOOOO===
^                   ^
0                   20

So, you will call discard on extents:

0-3
You'll skip 6 because it is smaller than minlen
10-11
15-19

instead of

0-3
10-11
15-19
30-36

Signed-off-by: Kyungmin Park <[email protected]>
---
Changelog v6:
Check the user input tightly
Fix cyclic trim routine from OGAWA

Changelog v5:
Exit when the start cluster is greater than the max cluster

Changelog v4:
Simplify the exit condition

Changelog v3:
Adjust the minlen from the queue discard_granularity
Use the correct len usage

Changelog v2:
Use the given start and len as per Lukas's comments
Check the queue supports discard feature
---
diff --git a/fs/fat/fat.h b/fs/fat/fat.h
index f504089..08b53e1 100644
--- a/fs/fat/fat.h
+++ b/fs/fat/fat.h
@@ -299,6 +299,7 @@ extern int fat_alloc_clusters(struct inode *inode, int *cluster,
int nr_cluster);
extern int fat_free_clusters(struct inode *inode, int cluster);
extern int fat_count_free_clusters(struct super_block *sb);
+extern int fat_trim_fs(struct super_block *sb, struct fstrim_range *range);

/* fat/file.c */
extern long fat_generic_ioctl(struct file *filp, unsigned int cmd,
diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c
index b47d2c9..81efef2 100644
--- a/fs/fat/fatent.c
+++ b/fs/fat/fatent.c
@@ -1,6 +1,8 @@
/*
* Copyright (C) 2004, OGAWA Hirofumi
* Released under GPL v2.
+ *
+ * Batched discard support by Kyungmin Park <[email protected]>
*/

#include <linux/module.h>
@@ -541,6 +543,16 @@ out:
return err;
}

+static int fat_issue_discard(struct super_block *sb, int cluster, int nr_clus)
+{
+ struct msdos_sb_info *sbi = MSDOS_SB(sb);
+ sector_t block, nr_blocks;
+
+ block = fat_clus_to_blknr(sbi, cluster);
+ nr_blocks = nr_clus * sbi->sec_per_clus;
+ return sb_issue_discard(sb, block, nr_blocks, GFP_NOFS, 0);
+}
+
int fat_free_clusters(struct inode *inode, int cluster)
{
struct super_block *sb = inode->i_sb;
@@ -575,11 +587,7 @@ int fat_free_clusters(struct inode *inode, int cluster)
if (cluster != fatent.entry + 1) {
int nr_clus = fatent.entry - first_cl + 1;

- sb_issue_discard(sb,
- fat_clus_to_blknr(sbi, first_cl),
- nr_clus * sbi->sec_per_clus,
- GFP_NOFS, 0);
-
+ fat_issue_discard(sb, first_cl, nr_clus);
first_cl = cluster;
}
}
@@ -683,3 +691,80 @@ out:
unlock_fat(sbi);
return err;
}
+
+int fat_trim_fs(struct super_block *sb, struct fstrim_range *range)
+{
+ struct msdos_sb_info *sbi = MSDOS_SB(sb);
+ struct fatent_operations *ops = sbi->fatent_ops;
+ struct fat_entry fatent;
+ unsigned long reada_blocks, reada_mask, cur_block;
+ int err, free, count, entry;
+ int start, len, minlen, trimmed;
+
+ start = range->start >> sb->s_blocksize_bits;
+ len = range->len >> sb->s_blocksize_bits;
+ len = round_down(start + len, sbi->sec_per_clus);
+ start = round_up(start, sbi->sec_per_clus);
+ minlen = range->minlen >> sb->s_blocksize_bits;
+ minlen = round_up(minlen, sbi->sec_per_clus);
+ trimmed = 0;
+ count = 0;
+ err = -EINVAL;
+
+ if (start >= sbi->max_cluster)
+ goto out;
+
+ len = (len > sbi->max_cluster) ? sbi->max_cluster : len;
+
+ lock_fat(sbi);
+
+ reada_blocks = FAT_READA_SIZE >> sb->s_blocksize_bits;
+ reada_mask = reada_blocks - 1;
+ cur_block = 0;
+ entry = 0;
+ free = 0;
+ fatent_init(&fatent);
+
+ if (start < FAT_START_ENT)
+ start = FAT_START_ENT;
+
+ fatent_set_entry(&fatent, start);
+
+ for (count = start; count <= len; count++) {
+ /* readahead of fat blocks */
+ if ((cur_block & reada_mask) == 0) {
+ unsigned long rest = sbi->fat_length - cur_block;
+ fat_ent_reada(sb, &fatent, min(reada_blocks, rest));
+ }
+ cur_block++;
+
+ err = fat_ent_read_block(sb, &fatent);
+ if (err)
+ goto out;
+
+ do {
+ if (ops->ent_get(&fatent) == FAT_ENT_FREE) {
+ free++;
+ if (!entry)
+ entry = fatent.entry;
+ } else if (entry) {
+ if (free >= minlen) {
+ fat_issue_discard(sb, entry, free);
+ trimmed += free;
+ }
+ free = 0;
+ entry = 0;
+ }
+ } while (fat_ent_next(sbi, &fatent));
+ }
+
+ if (free >= minlen) {
+ fat_issue_discard(sb, entry, free);
+ trimmed += free;
+ }
+ range->len = (u64)(trimmed * sbi->sec_per_clus) << sb->s_blocksize_bits;
+ fatent_brelse(&fatent);
+out:
+ unlock_fat(sbi);
+ return err;
+}
diff --git a/fs/fat/file.c b/fs/fat/file.c
index 7257752..8d7901f 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -125,6 +125,36 @@ long fat_generic_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
return fat_ioctl_get_attributes(inode, user_attr);
case FAT_IOCTL_SET_ATTRIBUTES:
return fat_ioctl_set_attributes(filp, user_attr);
+ case FITRIM:
+ {
+ struct super_block *sb = inode->i_sb;
+ struct request_queue *q = bdev_get_queue(sb->s_bdev);
+ struct fstrim_range range;
+ int ret = 0;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (!blk_queue_discard(q))
+ return -EOPNOTSUPP;
+
+ if (copy_from_user(&range, (struct fstrim_range __user *)arg,
+ sizeof(range)))
+ return -EFAULT;
+
+ range.minlen = max_t(unsigned int, range.minlen,
+ q->limits.discard_granularity);
+ ret = fat_trim_fs(sb, &range);
+ if (ret < 0)
+ return ret;
+
+ if (copy_to_user((struct fstrim_range __user *)arg, &range,
+ sizeof(range)))
+ return -EFAULT;
+
+ return 0;
+ }
+
default:
return -ENOTTY; /* Inappropriate ioctl for device */
}


2011-03-29 05:04:12

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

Kyungmin Park <[email protected]> writes:

> +int fat_trim_fs(struct super_block *sb, struct fstrim_range *range)
> +{
> + struct msdos_sb_info *sbi = MSDOS_SB(sb);
> + struct fatent_operations *ops = sbi->fatent_ops;
> + struct fat_entry fatent;
> + unsigned long reada_blocks, reada_mask, cur_block;
> + int err, free, count, entry;
> + int start, len, minlen, trimmed;
> +
> + start = range->start >> sb->s_blocksize_bits;
> + len = range->len >> sb->s_blocksize_bits;
> + len = round_down(start + len, sbi->sec_per_clus);
> + start = round_up(start, sbi->sec_per_clus);
> + minlen = range->minlen >> sb->s_blocksize_bits;
> + minlen = round_up(minlen, sbi->sec_per_clus);
> + trimmed = 0;
> + count = 0;
> + err = -EINVAL;

Sorry I didn't mention it previously: you can use ->cluster_size and
->cluster_bits.

> + if (start >= sbi->max_cluster)
> + goto out;
> +
> + len = (len > sbi->max_cluster) ? sbi->max_cluster : len;

[...]

> + trimmed += free;
> + }
> + range->len = (u64)(trimmed * sbi->sec_per_clus) << sb->s_blocksize_bits;
> + fatent_brelse(&fatent);
> +out:
> + unlock_fat(sbi);
> + return err;

Again, this ioctl's design is unclear and seems strange. I wouldn't
want to add this before clearing it up. Please explain what the right
behavior is.

E.g. if the user specified 0-1024 and the FS data blocks actually
started at 2048, what is the right behavior? And at the end of the
blocks, what is returned? For now, it seems to return range->len == 0 in
both cases.

Well, so my suggestion is to provide this as if it were one flat
extent file. I.e. the FS has to map the actual block placement onto a
flat range, and the result works like write/read (return the bytes
trimmed, and at EOF return 0).

Thanks.
--
OGAWA Hirofumi <[email protected]>

2011-03-29 05:11:38

by Kyungmin Park

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Tue, Mar 29, 2011 at 2:04 PM, OGAWA Hirofumi
<[email protected]> wrote:
> Kyungmin Park <[email protected]> writes:
>
>> +int fat_trim_fs(struct super_block *sb, struct fstrim_range *range)
>> +{
>> +	struct msdos_sb_info *sbi = MSDOS_SB(sb);
>> +	struct fatent_operations *ops = sbi->fatent_ops;
>> +	struct fat_entry fatent;
>> +	unsigned long reada_blocks, reada_mask, cur_block;
>> +	int err, free, count, entry;
>> +	int start, len, minlen, trimmed;
>> +
>> +	start = range->start >> sb->s_blocksize_bits;
>> +	len = range->len >> sb->s_blocksize_bits;
>> +	len = round_down(start + len, sbi->sec_per_clus);
>> +	start = round_up(start, sbi->sec_per_clus);
>> +	minlen = range->minlen >> sb->s_blocksize_bits;
>> +	minlen = round_up(minlen, sbi->sec_per_clus);
>> +	trimmed = 0;
>> +	count = 0;
>> +	err = -EINVAL;
>
> Sorry for didn't mention at previous. You can use ->cluster_size, and
> ->cluster_bits.
>
>> +	if (start >= sbi->max_cluster)
>> +		goto out;
>> +
>> +	len = (len > sbi->max_cluster) ? sbi->max_cluster : len;
>
> [...]
>
>> +			trimmed += free;
>> +	}
>> +	range->len = (u64)(trimmed * sbi->sec_per_clus) << sb->s_blocksize_bits;
>> +	fatent_brelse(&fatent);
>> +out:
>> +	unlock_fat(sbi);
>> +	return err;
>
> Again, this ioctl's design is unclear, and seems to be strange. I
> wouldn't want to add this before clearing it. Please explain what is
> right behavior.

Umm, it's out of my scope; that's the trim design.
See also the btrfs batched discard support; it has no consideration of
the cases you mentioned either.

As far as I know, xfs, ext4, and btrfs now support fstrim without these
concerns.

http://thread.gmane.org/gmane.comp.file-systems.btrfs/9758

Thank you,
Kyungmin Park

>
> E.g. if user specified 0-1024 and FS data block was actually started at
> 2048. What is right behavior? And if the end of blocks, what returned?
> For now, it seems to return range->len == 0 on both cases.
>
> Well, so, my suggestion is providing this like flat one extent
> file. I.e. FS have to map actual block placement to flat. And result
> also like write/read (return bytes as trimed, and at EOF returns 0).
>
> Thanks.
> --
> OGAWA Hirofumi <[email protected]>

2011-03-29 06:37:22

by OGAWA Hirofumi

Subject: Re: [PATCH v6] fat: Batched discard support for fat

Kyungmin Park <[email protected]> writes:

>> Again, this ioctl's design is unclear, and seems to be strange. I
>> wouldn't want to add this before clearing it. Please explain what is
>> right behavior.
>
> Umm it's out of my scope. it's trim design.
> See also btrfs batched discard support. it's also no consideration as
> you mentioned.
>
> As I know, now xfs, ext4, and btrfs support this fstrim without these concern.
>
> http://thread.gmane.org/gmane.comp.file-systems.btrfs/9758

I looked at fstrim.c. It is a low-level tool that just issues FITRIM; it
doesn't use the result at all.

Um. Honestly, I'd rather see more, or wait for a real user, instead of
providing an unclear design for now.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2011-03-29 06:42:09

by Kyungmin Park

Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Tue, Mar 29, 2011 at 3:37 PM, OGAWA Hirofumi
<[email protected]> wrote:
> Kyungmin Park <[email protected]> writes:
>
>>> Again, this ioctl's design is unclear, and seems to be strange. I
>>> wouldn't want to add this before clearing it. Please explain what is
>>> right behavior.
>>
>> Umm it's out of my scope. it's trim design.
>> See also btrfs batched discard support. it's also no consideration as
>> you mentioned.
>>
>> As I know, now xfs, ext4, and btrfs support this fstrim without these concern.
>>
>> http://thread.gmane.org/gmane.comp.file-systems.btrfs/9758
>
> I looked at fstrim.c. It is lowlevel tools to just issue FITRIM - it
> doesn't use the result at all.
>
> Um. Honestly I'd like to see more or wait until real user for now,
> instead of providing unclear design.

You can find the testcase xfs251(?)

And it's our usage: after UMS support is finished, we execute the trim
command from 0 to MAXINT, and we periodically trim the FAT partitions.

Thank you,
Kyungmin Park

>
> Thanks.
> --
> OGAWA Hirofumi <[email protected]>
>

2011-03-29 07:28:37

by OGAWA Hirofumi

Subject: Re: [PATCH v6] fat: Batched discard support for fat

Kyungmin Park <[email protected]> writes:

>> I looked at fstrim.c. It is lowlevel tools to just issue FITRIM - it
>> doesn't use the result at all.
>>
>> Um. Honestly I'd like to see more or wait until real user for now,
>> instead of providing unclear design.
>
> You can find the testcase xfs251(?)
>
> and it's our our usage. After UMS support finished, execute the trim
> command from 0 to MAXINT.
> and periodically call the trim FAT partitions.

It seems to be using fstrim, so the state is the same as with fstrim.
In more detail:

1) This means trimming 0-ULLONG

fstrim mntpoint

2) stepping in 1GB chunks

# fs_size is the filesystem size in bytes (illustrative)
start=0; step=$((1 << 30))
while [ "$start" -lt "$fs_size" ]; do
	fstrim -s "$start" -l "$step" mntpoint
	start=$((start + step))
done

These bring the real problem up in this example: what is the right
behavior if start + len is beyond the end of the FS? I think the other
issues in this thread are in the same state, too.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2011-03-30 13:22:26

by Arnd Bergmann

Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Monday 28 March 2011, Kyungmin Park wrote:
> So you will go through (blocks, bytes...) 0 -> 20
>
> OOOO==O===OO===OOOOO==O===O===OOOOOOO===
> ^                   ^
> 0                   20
>
> So, you will call discard on extents:
>
> 0-3
> You'll skip 6 because is smaller than minlen
> 10-11
> 15-19
>
> instead of
>
> 0-3
> 10-11
> 15-19
> 30-36

Sorry for joining the discussion late, but shouldn't you also pass
the alignment of the discards?

FAT is typically used on cheap media that have very limited support
for garbage-collection, such as eMMC or SD cards.

On most SDHC cards, you only ever want to issue discard on full erase
blocks (allocation units per spec), typically sized 4 MB.

If you just pass the minimum length, the file system could end up
erasing a 4 MB section that spans two half erase blocks, or it
could span a few clusters of the following erase block, neither of
which is desirable from a performance point of view.

On other media, you have the same problem inside an erase block:
These might be able to discard parts of an erase block efficiently,
but normally not less than a flash page (typically 8 to 32 KB).

Again, you don't want to discard partial pages in this case, and
that is much more important than discarding a large number of pages
because it would result in an immediate copy-on-write operation.

Further, when you erase some pages inside of an erase block, you
probably should not span multiple erase blocks but instead issue
separate requests for each set of pages in one erase block.

Arnd

2011-03-30 13:51:00

by Lukas Czerner

Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Wed, 30 Mar 2011, Arnd Bergmann wrote:

> On Monday 28 March 2011, Kyungmin Park wrote:
> > So you will go through (blocks, bytes...) 0 -> 20
> >
> > OOOO==O===OO===OOOOO==O===O===OOOOOOO===
> > ^                   ^
> > 0                   20
> >
> > So, you will call discard on extents:
> >
> > 0-3
> > You'll skip 6 because is smaller than minlen
> > 10-11
> > 15-19
> >
> > instead of
> >
> > 0-3
> > 10-11
> > 15-19
> > 30-36
>
> Sorry for joining the discussion late, but shouldn't you also pass
> the alignment of the discards?
>
> FAT is typically used on cheap media that have very limited support
> for garbage-collection, such as eMMC or SD cards.
>
> On most SDHC cards, you only ever want to issue discard on full erase
> blocks (allocation units per spec), typically sized 4 MB.

I was not aware of the fact that SD cards (etc.) have garbage
collection of some sort, or that they even support discard, since I
thought that we only have the TRIM and UNMAP/WRITE_SAME commands for
SATA or SCSI drives.

Or is there some sort of kernel mechanism doing garbage collection like
this for the cheap media?

>
> If you just pass the minimum length, the file system could end up
> erasing a 4 MB section that spans two half erase blocks, or it
> could span a few clusters of the following erase block, both of
> which is not desirable from a performance point of view.

Do those cards export such information correctly?

>
> On other media, you have the same problem inside an erase block:
> These might be able to discard parts of an erase block efficiently,
> but normally not less than a flash page (typically 8 to 32 KB).

Well, I have tested several SSDs and thinly provisioned devices, but I
have not seen any strange behaviour, other than that it was terribly
inefficient to do so. See my results here:

http://people.redhat.com/lczerner/discard/test_discard.html

The fact is that I have not tried discard sizes smaller than 4K, since
this is the most usual block size for a filesystem.

>
> Again, you don't want to discard partial pages in this case, and
> that is much more important than discarding a large number of pages
> because it would result in an immediate copy-on-write operation.
>
> Further, when you erase some pages inside of an erase block, you
> probably should not span multiple erase blocks but instead issue
> separate requests for each set of pages in one erase block.

Does it mean that we should not issue discards bigger than an erase
block? That does not sound good given my test results. Or maybe I
misunderstood your point?

>
> Arnd
>
> Arnd
>

Thanks!
-Lukas

2011-03-30 13:58:50

by Kyungmin Park

Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Wed, Mar 30, 2011 at 10:50 PM, Lukas Czerner <[email protected]> wrote:
> On Wed, 30 Mar 2011, Arnd Bergmann wrote:
>
>> On Monday 28 March 2011, Kyungmin Park wrote:
>> > So you will go through (blocks, bytes...) 0 -> 20
>> >
>> >   OOOO==O===OO===OOOOO==O===O===OOOOOOO===
>> >   ^                   ^
>> >   0                   20
>> >
>> > So, you will call discard on extents:
>> >
>> > 0-3
>> > You'll skip 6 because is smaller than minlen
>> > 10-11
>> > 15-19
>> >
>> > instead of
>> >
>> > 0-3
>> > 10-11
>> > 15-19
>> > 30-36
>>
>> Sorry for joining the discussion late, but shouldn't you also pass
>> the alignment of the discards?
>>
>> FAT is typically used on cheap media that have very limited support
>> for garbage-collection, such as eMMC or SD cards.
>>
>> On most SDHC cards, you only ever want to issue discard on full erase
>> blocks (allocation units per spec), typically sized 4 MB.
>
> I was not aware of the fact that SD cards (etc..) does have garbage
> collection of some sort, or that they even have support discard, since I
> thought that we have only TRIM,UNAMP/WRITE_SAME comands for SATA or SCSI
> drives.
>
> Or is there some sort of kernel mechanism doing garbage collection such
> is this for the cheap media ?
>
>>
>> If you just pass the minimum length, the file system could end up
>> erasing a 4 MB section that spans two half erase blocks, or it
>> could span a few clusters of the following erase block, both of
>> which is not desirable from a performance point of view.
>
> Does those cards export such information correctly ?

Each chip vendor knows it exactly but usually can't disclose it
officially. It's not as big as you think, though. Also, if the request
size is under the trim size, then it's discarded internally; I mean the
firmware does nothing.

Thank you,
Kyungmin Park

>
>>
>> On other media, you have the same problem inside an erase block:
>> These might be able to discard parts of an erase block efficiently,
>> but normally not less than a flash page (typically 8 to 32 KB).
>
> Well I have tested several SSD's and thinly provisioned devices, but I
> have not seen any strange behaviour, other than it was terribly
> unefficient to do so. See my results here:
>
> http://people.redhat.com/lczerner/discard/test_discard.html
>
> the fact is that I have not tried discard size smaller than 4K, since
> this is the most usual block size for the filesystem.
>
>>
>> Again, you don't want to discard partial pages in this case, and
>> that is much more important than discarding a large number of pages
>> because it would result in an immediate copy-on-write operation.
>>
>> Further, when you erase some pages inside of an erase block, you
>> probably should not span multiple erase blocks but instead issue
>> separate requests for each set of pages in one erase block.
>
> Does it mean that we should not issue bigger discards that erase block ?
> That does not sound good given my test results. Or maybe I misunderstood
> your point ?
>
>>
>>	Arnd
>>
>
> Thanks!
> -Lukas
>

2011-03-30 14:21:07

by Arnd Bergmann

Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Wednesday 30 March 2011, Lukas Czerner wrote:
> On Wed, 30 Mar 2011, Arnd Bergmann wrote:
> > Sorry for joining the discussion late, but shouldn't you also pass
> > the alignment of the discards?
> >
> > FAT is typically used on cheap media that have very limited support
> > for garbage-collection, such as eMMC or SD cards.
> >
> > On most SDHC cards, you only ever want to issue discard on full erase
> > blocks (allocation units per spec), typically sized 4 MB.
>
> I was not aware of the fact that SD cards (etc..) does have garbage
> collection of some sort, or that they even have support discard, since I
> thought that we have only TRIM,UNAMP/WRITE_SAME comands for SATA or SCSI
> drives.
>
> Or is there some sort of kernel mechanism doing garbage collection such
> is this for the cheap media ?

The garbage collection is what happens on the device internally. Each
card has only a small number of erase blocks that it can write to
at a time (between 1 and ten, typically). When you "open" a new erase
block by writing to it, the card will "close" another erase block
by garbage-collecting the data, i.e. it copies the data that has been
recently written together with the data that was in the erase block
previously and has not been touched since the last GC into a single
erase block, and then erases the ones that have been freed up.

I've explained this in more detail a few weeks ago in an article, see
https://lwn.net/Articles/428584/.

Discarding full erase blocks makes the wear levelling more efficient,
since there is a larger number of unused erase blocks to choose from.

The SD card standard does not specify what happens when you erase
less than an erase block, other than that the data will be in a
known state (all-ones or all-zeros). On cheap cards that might
require actually writing that data, while better ones use sector
remapping inside an erase block that makes it possible to mark
sectors or pages as unused, which can speed up future garbage
collection on the erase block.

> > If you just pass the minimum length, the file system could end up
> > erasing a 4 MB section that spans two half erase blocks, or it
> > could span a few clusters of the following erase block, both of
> > which is not desirable from a performance point of view.
>
> Does those cards export such information correctly ?

Most of the time, they export the erase block size correctly, but
sometimes they lie. In fact, all of the cards I've seen report
4 MB erase blocks, but some cards in fact use 1.5 MB, 2 MB or 8 MB
instead. I'm sure we will see 3 MB, 6 MB, 12 MB and 16 MB soon.

SD cards do not export the page size, but most of them today use
16 KB. See my list at

https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey

> > On other media, you have the same problem inside an erase block:
> > These might be able to discard parts of an erase block efficiently,
> > but normally not less than a flash page (typically 8 to 32 KB).
>
> Well I have tested several SSD's and thinly provisioned devices, but I
> have not seen any strange behaviour, other than it was terribly
> unefficient to do so. See my results here:
>
> http://people.redhat.com/lczerner/discard/test_discard.html
>
> the fact is that I have not tried discard size smaller than 4K, since
> this is the most usual block size for the filesystem.

Good SSDs can deal with remapping 4k blocks, while cheap ones
(SD cards, USB sticks, or low-end ATA SSDs) can only do operations
on full blocks.

> > Again, you don't want to discard partial pages in this case, and
> > that is much more important than discarding a large number of pages
> > because it would result in an immediate copy-on-write operation.
> >
> > Further, when you erase some pages inside of an erase block, you
> > probably should not span multiple erase blocks but instead issue
> > separate requests for each set of pages in one erase block.
>
> Does it mean that we should not issue bigger discards that erase block ?
> That does not sound good given my test results. Or maybe I misunderstood
> your point ?

I think ideally you want to have multiples of full erase blocks, but I
would not combine discards of a partial erase block with any other
discard.

Arnd

2011-03-30 14:44:56

by Kyungmin Park

Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Wed, Mar 30, 2011 at 11:20 PM, Arnd Bergmann <[email protected]> wrote:
> On Wednesday 30 March 2011, Lukas Czerner wrote:
>> On Wed, 30 Mar 2011, Arnd Bergmann wrote:
>> > Sorry for joining the discussion late, but shouldn't you also pass
>> > the alignment of the discards?
>> >
>> > FAT is typically used on cheap media that have very limited support
>> > for garbage-collection, such as eMMC or SD cards.
>> >
>> > On most SDHC cards, you only ever want to issue discard on full erase
>> > blocks (allocation units per spec), typically sized 4 MB.
>>
>> I was not aware of the fact that SD cards (etc..) does have garbage
>> collection of some sort, or that they even have support discard, since I
>> thought that we have only TRIM,UNAMP/WRITE_SAME comands for SATA or SCSI
>> drives.
>>
>> Or is there some sort of kernel mechanism doing garbage collection such
>> is this for the cheap media ?
>
> The garbage collection is what happens on the device internally. Each
> card has only a small number of erase blocks that it can write to
> at a time (between 1 and ten, typically). When you "open" a new erase
> block by writing to it, the card will "close" another erase block
> by garbage-collecting the data, i.e. it copies the data that has been
> recently written together with the data that was in the erase block
> previously and has not been touched since the last GC into a single
> erase block, and then erases ones that has been freed up.
>
> I've explained this in more detail a few weeks ago in an article, see
> https://lwn.net/Articles/428584/.
>
> Discarding full erase blocks makes the wear levelling more efficient,
> since there is a larger number of unused erase blocks to choose from.
>
> The SD card standard does not specify what happens when you erase
> less than an erase block, other than that the data will be in a
> known state (all-ones or all-zeros). On cheap cards that might
> require actually writing that data, while better ones use a sector
> remapping inside of an erase block that makes is possible mark
> sectors or pages as unused, which can speed up future garbage
> collection on the erase block.
>
>> > If you just pass the minimum length, the file system could end up
>> > erasing a 4 MB section that spans two half erase blocks, or it
>> > could span a few clusters of the following erase block, both of
>> > which is not desirable from a performance point of view.
>>
>> Does those cards export such information correctly ?
>
> Most of the time, they export the erase block size correctly, but
> sometimes they lie. In fact, all of the cards I've seen report
> 4 MB erase blocks, but some cards in fact use 1.5 MB, 2 MB or 8 MB
> instead. I'm sure we will see 3 MB, 6 MB, 12 MB and 16 MB soon.
>
> SD cards do not export the page size, but most of them today use
> 16 KB. See my list at
>
> https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey

Thank you for your effort, but consider the original purpose of the
batched discard feature. Do you really want to run batched discard on an
SD card? I mean, in our environment the SD card is usually optional,
and my goal is eMMC. And, as you know, the internal algorithms of eMMC
and SD cards are different. The goal of an SD card is performance, but
eMMC is different: it also has to consider reliability, e.g. sudden
power-off recovery.

>
>> > On other media, you have the same problem inside an erase block:
>> > These might be able to discard parts of an erase block efficiently,
>> > but normally not less than a flash page (typically 8 to 32 KB).
>>
>> Well I have tested several SSD's and thinly provisioned devices, but I
>> have not seen any strange behaviour, other than it was terribly
>> unefficient to do so. See my results here:
>>
>> http://people.redhat.com/lczerner/discard/test_discard.html
>>
>> the fact is that I have not tried discard size smaller than 4K, since
>> this is the most usual block size for the filesystem.
>
> Good SSDs can deal with remapping 4k blocks, while cheap ones
> (SD cards, USB sticks, or low-end ATA SSDs) can only do operations
> on full blocks.

It really depends on the device. To increase performance we have to
increase the channel and way units internally. Most NAND devices are
similar; there is no way except to increase the interleaving unit, and
that affects the price.

That's the reason I like batched discard: it's not easy to guess the
erase block size at the system level, so it's helpful to scan the unused
blocks and *try* to trim them at once.

Please remember that discard is just a hint. If we write the flash
sequentially then we may not need to trim it at all.

Thank you,
Kyungmin Park
>
>> > Again, you don't want to discard partial pages in this case, and
>> > that is much more important than discarding a large number of pages
>> > because it would result in an immediate copy-on-write operation.
>> >
>> > Further, when you erase some pages inside of an erase block, you
>> > probably should not span multiple erase blocks but instead issue
>> > separate requests for each set of pages in one erase block.
>>
>> Does it mean that we should not issue bigger discards that erase block ?
>> That does not sound good given my test results. Or maybe I misunderstood
>> your point ?
>
> I think ideally you want to have multiples of full erase blocks, but I
> would not combine discards of a partial erase block with any other
> discard.
>
>	Arnd
>

2011-03-30 14:46:14

by Arnd Bergmann

Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Wednesday 30 March 2011, Kyungmin Park wrote:
> >> If you just pass the minimum length, the file system could end up
> >> erasing a 4 MB section that spans two half erase blocks, or it
> >> could span a few clusters of the following erase block, both of
> >> which is not desirable from a performance point of view.
> >
> > Does those cards export such information correctly ?
>
> Each chip vendor knows it exactly and usually can't tell it outside officially.

I have written a tool for measuring the erase block size reliably
and guessing the page size.

> but it's not big as you think.

The largest erase block I have measured is 12 MB (on a USB stick),
while the smallest typical erase block that gets used in SDHC cards
is 2 MB. I have seen one card using a 1.5 MB erase block.

Older SD cards (up to 2 GB) commonly have erase blocks as small as
128 KB, but for all I know these do not get produced any more. The
smallest cards on the market today are 2 GB cards with 2 MB erase
blocks.

> also if the request size is under trim size, then it's discarded internally.
> I mean do nothing at firmware level.

I don't think that's possible with SD cards, since the command is "erase",
not "trim", i.e. the data gets inaccessible after the command returns.

On eMMC, the command is actually "trim", which can probably be discarded
as you say.

SD cards can also do pre-erase when writing. This has a similar meaning
to trim, but is not currently supported by Linux, because it is not
as easy to use.

Arnd

2011-03-30 15:06:46

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Wednesday 30 March 2011, Kyungmin Park wrote:
> On Wed, Mar 30, 2011 at 11:20 PM, Arnd Bergmann <[email protected]> wrote:
> >> > If you just pass the minimum length, the file system could end up
> >> > erasing a 4 MB section that spans two half erase blocks, or it
> >> > could span a few clusters of the following erase block, both of
> >> > which is not desirable from a performance point of view.
> >>
> >> Does those cards export such information correctly ?
> >
> > Most of the time, they export the erase block size correctly, but
> > sometimes they lie. In fact, all of the cards I've seen report
> > 4 MB erase blocks, but some cards in fact use 1.5 MB, 2 MB or 8 MB
> > instead. I'm sure we will see 3 MB, 6 MB, 12 MB and 16 MB soon.
> >
> > SD cards do not export the page size, but most of them today use
> > 16 KB. See my list at
> >
> > https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey
>
> Thank you for your effort. but think it the original purpose of
> batched discard feature.
> Do you want to run the batched discard for SD card. I mean usually in
> our environment SD card is optional.
> and my goal is used for eMMC. and you know, eMMC and SD card internal
> algorithm is different.

I've done only a few measurements on eMMC, but for all I know, they
use the same fundamental principles, just with a slightly different
command set and physical interface. A number of manufacturers advertise
controller chips that can be integrated on either MMC or SD packages,
presumably using a different firmware.

> The goal of SD card is performance but eMMC is different. it should
> consider the reliability also. e.g., Sudden Power Off Recovery.

I would have guessed the opposite, that sudden power off is much
more important on SD cards that can simply be removed while writing.

In practice, I would expect very little effort getting put into
reliability on either of the two.

Note that there was a patch on the mmc mailing list very recently
to support the reliable write mode of eMMC.

Also, when you want reliability, you probably shouldn't use the
FAT file system, since it lacks transactional writes. If you target
eMMC media, there is usually no advantage in using FAT because the
data does not get accessed by other operating systems.

> > Good SSDs can deal with remapping 4k blocks, while cheap ones
> > (SD cards, USB sticks, or low-end ATA SSDs) can only do operations
> > on full blocks.
> It really depends on devices. To increase the performance we have to
> increase the channel and the way unit at internally.
> Most NAND devices are similar there's no way except increase the
> interleaving unit. and it affects the price.
>
> That's reason I like the batched discard. it's not easy to guess the
> erase block at system level. so it's helpful scan the unused blocks
> and try to trim at once.

I'm absolutely in favor of batched discard, I just think we should
apply some stricter rules to what blocks can get batched.

> Please remember that the discard is just hint. If we write the flash
> sequentially then we maybe don't need to trim it.

With the current trends in NAND technology, the number of erase cycles
keeps going down, so I'd assume that we increasingly rely on discard
in order to keep the wear levelling working, at least for the media
that only do dynamic wear levelling. When you have a drive that supports
static wear levelling, discard becomes more a performance optimization
than a reliability requirement.

Arnd

2011-05-24 01:18:24

by Kyungmin Park

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

Hi again,

To Ogawa Hirofumi,

Do you still object to merging this feature for .40? As I said, the
batched discard design is out of scope for this patch.
Batched discard is already implemented elsewhere: ext4, xfs and so on.

I hope FAT will also support batched discard.

Any opinions?

Thank you,
Kyungmin Park

On Thu, Mar 31, 2011 at 12:06 AM, Arnd Bergmann <[email protected]> wrote:
> On Wednesday 30 March 2011, Kyungmin Park wrote:
>> On Wed, Mar 30, 2011 at 11:20 PM, Arnd Bergmann <[email protected]> wrote:
>> >> > If you just pass the minimum length, the file system could end up
>> >> > erasing a 4 MB section that spans two half erase blocks, or it
>> >> > could span a few clusters of the following erase block, both of
>> >> > which is not desirable from a performance point of view.
>> >>
>> >> Does those cards export such information correctly ?
>> >
>> > Most of the time, they export the erase block size correctly, but
>> > sometimes they lie. In fact, all of the cards I've seen report
>> > 4 MB erase blocks, but some cards in fact use 1.5 MB, 2 MB or 8 MB
>> > instead. I'm sure we will see 3 MB, 6 MB, 12 MB and 16 MB soon.
>> >
>> > SD cards do not export the page size, but most of them today use
>> > 16 KB. See my list at
>> >
>> > https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey
>>
>> Thank you for your effort. but think it the original purpose of
>> batched discard feature.
>> Do you want to run the batched discard for SD card. I mean usually in
>> our environment SD card is optional.
>> and my goal is used for eMMC. and you know, eMMC and SD card internal
>> algorithm is different.
>
> I've done only a few measurements on eMMC, but for all I know, they
> use the same fundamental principles, just with a slightly different
> command set and physical interface. A number of manufacturers advertise
> controller chips that can be integrated on either MMC or SD packages,
> presumably using a different firmware.
>
>> The goal of SD card is performance but eMMC is different. it should
>> consider the reliability also. e.g., Sudden Power Off Recovery.
>
> I would have guessed the opposite, that sudden power off is much
> more important on SD cards that can simply be removed while writing.
>
> In practice, I would expect very little effort getting put into
> reliability on either of the two.
>
> Note that there was a patch on the mmc mailing list very recently
> to support the reliable write mode of eMMC.
>
> Also, when you want reliablility, you probably shouldn't use the
> FAT file system, since it lacks transactional writes. If you target
> eMMC media, there is usually no advantage in using FAT because the
> data does not get accessed by other operating systems.
>
>> > Good SSDs can deal with remapping 4k blocks, while cheap ones
>> > (SD cards, USB sticks, or low-end ATA SSDs) can only do operations
>> > on full blocks.
>> It really depends on devices. To increase the performance we have to
>> increase the channel and the way unit at internally.
>> Most NAND devices are similar there's no way except increase the
>> interleaving unit. and it affects the price.
>>
>> That's reason I like the batched discard. it's not easy to guess the
>> erase block at system level. so it's helpful scan the unused blocks
>> and try to trim at once.
>
> I'm absolutely in favor of batched discard, I just think we should
> apply some stricter rules to what blocks can get batched.
>
>> Please remember that the discard is just hint. If we write the flash
>> sequentially then we maybe don't need to trim it.
>
> With the current trends in NAND technology, the number of erase cycles
> keep going down, so I'd assume that we increasingly rely on discard
> in order to keep the wear levelling working, at least for the media
> that only do dynamic wear levelling. When you have a drive that supports
> static wear levelling, discard becomes more a performance optimization
> than a reliability requirement.
>
>        Arnd
>

2011-05-24 04:47:53

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

Kyungmin Park <[email protected]> writes:

> Hi again,

Hi

> Do you still object to merge this feature for .40? As I said the
> batched discard design is out-of-scope of this patch.
> It's implemented at other batched discard, ext4, xfs and so on.
>
> I hope fat is also support the batched discard.
>
> Any opinions?

I also think implementing this would be good. Sorry, but I'm not going
to apply this for now; I would like to wait for it to be used by real
userland (I hope people notice the problem, or userland tackles it
somehow, sadly).

I think exposing wrong behavior like this would be worse.

E.g., one of the problems: userland might do something like this (trim
chunks from 0 to the number of blocks)

for chunk in number_of_blocks
        do_trim chunk
done

But this is actually wrong; this interface doesn't map blocks to 0..max,
so userland has to somehow know the real maximum block number for each FS
(and maybe the real minimum block number).

So, how to fix this? The solutions would be a userland workaround, or a
kernel fix. If it's userland, userland has to know FS internals, sadly. If
it's the kernel, we would have a backward-compatibility nightmare, or
have to ignore compatibility :(.

I really think we have to rethink this and reach agreement on a common
interface. Or, if there were a real userland app, FAT could be implemented
to work with that app.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2011-05-24 05:23:46

by Kyungmin Park

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Tue, May 24, 2011 at 1:47 PM, OGAWA Hirofumi
<[email protected]> wrote:
> Kyungmin Park <[email protected]> writes:
>
>> Hi again,
>
> Hi
>
>> Do you still object to merge this feature for .40? As I said the
>> batched discard design is out-of-scope of this patch.
>> It's implemented at other batched discard, ext4, xfs and so on.
>>
>> I hope fat is also support the batched discard.
>>
>> Any opinions?
>
> I'm also thinking implementing this is good though. Sorry, I'm not going
> to apply this for now, and would like to wait to be used by real
> userland (I hope guys notice the problem, or userland tackles it somehow
> sadly).
>
> I think, to expose the wrong behavior like this would be worse.
>
> E.g. one of problems, userland might do like this (trim chunk from 0 to
> number of block)
>
> for chunk in number_of_blocks
>        do_trim chunk
> done

It's handled in the trim implementation. It only trims the blocks FAT is
aware of, not blocks which FAT doesn't know about.
As FAT doesn't use clusters 0 and 1, the kernel adjusts the start:

+       if (start < FAT_START_ENT)
+               start = FAT_START_ENT;

and don't exceed the max cluster size.

+       len = (len > sbi->max_cluster) ? sbi->max_cluster : len;

+       for (count = start; count <= len; count++) {

>
> But this is actually wrong, this interface doesn't map blocks to 0-max,
> so userland have to know real maximum-block-number somehow for each FS
> (and maybe have to know real minimum-block-number).
>
> So, how to fix this? The solutions would be userland workaround, or fix
> kernel. If it was userland, userland have to know FS internal sadly. If
> it was kernel, we would have backward compatible nightmare, or ignore
> compatible :(.

I think the basic concept of batched discard is to trim everything at
once, instead of trimming entries as they are deleted in the filesystem
(the original scheme). So it can take a specific start address, or any
start (usually 0), with the maximum length.

>
> I really think we have to rethink this and have agreement about common
> interface. Or there was real userland app, I think FAT can implement to
> work that app.

IMO, we should use the same user-space application; the user program
doesn't know which filesystem is underlying. So the sample user-space
program used with ext4 and xfs should work here too, and I tested it.

Thank you,
Kyungmin Park

2011-05-24 06:39:28

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

Kyungmin Park <[email protected]> writes:

>>> Do you still object to merge this feature for .40? As I said the
>>> batched discard design is out-of-scope of this patch.
>>> It's implemented at other batched discard, ext4, xfs and so on.
>>>
>>> I hope fat is also support the batched discard.
>>>
>>> Any opinions?
>>
>> I'm also thinking implementing this is good though. Sorry, I'm not going
>> to apply this for now, and would like to wait to be used by real
>> userland (I hope guys notice the problem, or userland tackles it somehow
>> sadly).
>>
>> I think, to expose the wrong behavior like this would be worse.
>>
>> E.g. one of problems, userland might do like this (trim chunk from 0 to
>> number of block)
>>
>> for chunk in number_of_blocks
>>        do_trim chunk
>> done
>
> It's handled at trim implementation. It just trim the fat aware block.
> Not trim the blocks which fat doesn't know.
> As fat don't use the block 0, 1, it adjust the start block at kernel.
>
> + if (start < FAT_START_ENT)
> + start = FAT_START_ENT;
>
> and don't exceed the max cluster size.
>
> + len = (len > sbi->max_cluster) ? sbi->max_cluster : len;
>
> + for (count = start; count <= len; count++) {

Yes. We _adjust_ the start from 0 to 2 here, so the end block also has to
be _adjusted_.

From another point of view: if userland specifies 0 - max-length
(i.e. the number of blocks), what happens? It would trim blocks from 2 to
max-length, i.e. only (max-length - 2) blocks, right?

So, actually, userland has to specify 2 - (max-length + 2). And userland
has to know that 2 is the real start, not 0.

But how would it know that? Yes, it's FS-internal layout; AFAICS, other
FSes also expose internals like this.

>> But this is actually wrong, this interface doesn't map blocks to 0-max,
>> so userland have to know real maximum-block-number somehow for each FS
>> (and maybe have to know real minimum-block-number).
>>
>> So, how to fix this? The solutions would be userland workaround, or fix
>> kernel. If it was userland, userland have to know FS internal sadly. If
>> it was kernel, we would have backward compatible nightmare, or ignore
>> compatible :(.
>
> I think basic concept of batched discard is trim at once instead of
> deleted entries at filesystem (original one).
> So it can set the specific start address or any start (usually 0) with
> maximum length.

Yes, I think so too (i.e. 0 - max length). But the implement is not
working like it. It exposes FS internal layout.

>> I really think we have to rethink this and have agreement about common
>> interface. Or there was real userland app, I think FAT can implement to
>> work that app.
>
> IMO, we should use the same user space application. user program
> doesn't know the which filesystem are underlying.

Indeed.

> So sample user space program used at ext4, xfs should be used. and I
> tested it.

I bet it doesn't actually trim the whole FS (because it exposes the FS
layout), even if the user asks as in the above example.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2011-05-24 06:55:37

by Kyungmin Park

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Tue, May 24, 2011 at 3:39 PM, OGAWA Hirofumi
<[email protected]> wrote:
> Kyungmin Park <[email protected]> writes:
>
>>>> Do you still object to merge this feature for .40? As I said the
>>>> batched discard design is out-of-scope of this patch.
>>>> It's implemented at other batched discard, ext4, xfs and so on.
>>>>
>>>> I hope fat is also support the batched discard.
>>>>
>>>> Any opinions?
>>>
>>> I'm also thinking implementing this is good though. Sorry, I'm not going
>>> to apply this for now, and would like to wait to be used by real
>>> userland (I hope guys notice the problem, or userland tackles it somehow
>>> sadly).
>>>
>>> I think, to expose the wrong behavior like this would be worse.
>>>
>>> E.g. one of problems, userland might do like this (trim chunk from 0 to
>>> number of block)
>>>
>>> for chunk in number_of_blocks
>>>        do_trim chunk
>>> done
>>
>> It's handled at trim implementation. It just trim the fat aware block.
>> Not trim the blocks which fat doesn't know.
>> As fat don't use the block 0, 1, it adjust the start block at kernel.
>>
>> +       if (start < FAT_START_ENT)
>> +               start = FAT_START_ENT;
>>
>> and don't exceed the max cluster size.
>>
>> +       len = (len > sbi->max_cluster) ? sbi->max_cluster : len;
>>
>> +       for (count = start; count <= len; count++) {
>
> Yes. We _adjust_ from 0 to 2 here, so, the end of block also have to be
> _adjusted_.
>
> From other point of view, if userland specified 0 - max-length
> (i.e. number of blocks), what happens? It would trim block of 2 -
> (max-length - 2), right?

No, the length is not changed, so max-length is used.
>
> So, actually, userland specify 2 - (max-length + 2). And userland has to
> know 2 is real start, not 0.
>
> But how to know it? Yes, it's FS internal layout, AFAICS, other FSes also
> expose internal like this.

User space doesn't know the internal FS layout; it should be handled
by each FS's trim implementation.

>
>>> But this is actually wrong, this interface doesn't map blocks to 0-max,
>>> so userland have to know real maximum-block-number somehow for each FS
>>> (and maybe have to know real minimum-block-number).
>>>
>>> So, how to fix this? The solutions would be userland workaround, or fix
>>> kernel. If it was userland, userland have to know FS internal sadly. If
>>> it was kernel, we would have backward compatible nightmare, or ignore
>>> compatible :(.
>>
>> I think basic concept of batched discard is trim at once instead of
>> deleted entries at filesystem (original one).
>> So it can set the specific start address or any start (usually 0) with
>> maximum length.
>
> Yes, I think so too (i.e. 0 - max length). But the implement is not
> working like it. It exposes FS internal layout.
>
>>> I really think we have to rethink this and have agreement about common
>>> interface. Or there was real userland app, I think FAT can implement to
>>> work that app.
>>
>> IMO, we should use the same user space application. user program
>> doesn't know the which filesystem are underlying.
>
> Indeed.
>
>> So sample user space program used at ext4, xfs should be used. and I
>> tested it.
>
> I bet it doesn't trim whole FS (due to expose FS layout) actually even
> if user asked like above example.

Right, that's the FS-dependent part. Do you think batched discard should
touch all FS blocks? I think the original discard doesn't support that
either.

Batched discard support is just an extension of the original discard
implementation. It doesn't modify the original design.

Thank you,
Kyungmin Park

>
> Thanks.
> --
> OGAWA Hirofumi <[email protected]>
>

2011-05-24 07:32:20

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

Kyungmin Park <[email protected]> writes:

>>> It's handled at trim implementation. It just trim the fat aware block.
>>> Not trim the blocks which fat doesn't know.
>>> As fat don't use the block 0, 1, it adjust the start block at kernel.
>>>
>>> +       if (start < FAT_START_ENT)
>>> +               start = FAT_START_ENT;
>>>
>>> and don't exceed the max cluster size.
>>>
>>> +       len = (len > sbi->max_cluster) ? sbi->max_cluster : len;
>>>
>>> +       for (count = start; count <= len; count++) {
>>
>> Yes. We _adjust_ from 0 to 2 here, so, the end of block also have to be
>> _adjusted_.
>>
>> From other point of view, if userland specified 0 - max-length
>> (i.e. number of blocks), what happens? It would trim block of 2 -
>> (max-length - 2), right?
>
> No, length is not changed. so max-length is used.

No, no. Userland will know max-length from statvfs, right? So, let's
assume it is 100 (->f_blocks) * 1024 (->f_bsize).

Now userland knows the max length, 102400, OK? Let's start to trim.

Assume userland wants to trim the whole filesystem, so it will specify

        trim(0, 102400).

What actually happens in the kernel?

The current implementation doesn't map blocks. So, in the case of FAT, it
adjusts the start from 0 to 2 * 1024.

So it trims between 2048 and 102400. The problem is here: the FS layout
actually spans 2048 to (102400 + 2048). I.e., userland actually has to do

        trim(2048, 102400 + 2048)

to specify the whole filesystem. But how does it know 2048?

See what I'm saying?

FAT has a linear block space, so the problem is small compared to the
mapping issue. But other FSes have a bigger problem.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2011-05-24 08:54:38

by Kyungmin Park

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Tue, May 24, 2011 at 4:32 PM, OGAWA Hirofumi
<[email protected]> wrote:
> Kyungmin Park <[email protected]> writes:
>
>>>> It's handled at trim implementation. It just trim the fat aware block.
>>>> Not trim the blocks which fat doesn't know.
>>>> As fat don't use the block 0, 1, it adjust the start block at kernel.
>>>>
>>>> +       if (start < FAT_START_ENT)
>>>> +               start = FAT_START_ENT;
>>>>
>>>> and don't exceed the max cluster size.
>>>>
>>>> +       len = (len > sbi->max_cluster) ? sbi->max_cluster : len;
>>>>
>>>> +       for (count = start; count <= len; count++) {
>>>
>>> Yes. We _adjust_ from 0 to 2 here, so, the end of block also have to be
>>> _adjusted_.
>>>
>>> From other point of view, if userland specified 0 - max-length
>>> (i.e. number of blocks), what happens? It would trim block of 2 -
>>> (max-length - 2), right?
>>
>> No, length is not changed. so max-length is used.
>
> No, no. Userland will know max-length from statvfs, right? So, let's
> assume it is 100 (->f_blocks) * 1024 (->f_bsize).
>
> Now, userland know about max length, 102400, ok? Let's start to trim.
>
> Assume, userland want to trim whole. So, userland will specify like
>
>        trim(0, 102400).
>
> What happen in kernel actually?
>
> Current implement doesn't map blocks. So, in the case of FAT, it adjusts
> from 0 to 2 * 1024.
>
> So, it trims between 2048 and 102400. The problem is here. FS layout is
> actually, 2048 and (102400 + 2048). I.e. actually userland has to do
>
>        trim(2048, 102400 + 2048)

Umm, maybe the first implementation did it like this, but Lukas mentioned
it's wrong, so I modified it for the batched discard concept.

You want the loop to be like this:

for (count = start; count <= (start + len); count++)

>
> to specify whole. How to know 2048?
>
> See what I'm saying?
>
> FAT has liner block space, so the problem is small against mapping. But
> other FSes has bigger problem.
>
> Thanks.
> --
> OGAWA Hirofumi <[email protected]>
>

2011-05-24 09:25:27

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Tue, 24 May 2011, OGAWA Hirofumi wrote:

> Kyungmin Park <[email protected]> writes:
>
> >>> It's handled at trim implementation. It just trim the fat aware block.
> >>> Not trim the blocks which fat doesn't know.
> >>> As fat don't use the block 0, 1, it adjust the start block at kernel.
> >>>
> >>> +       if (start < FAT_START_ENT)
> >>> +               start = FAT_START_ENT;
> >>>
> >>> and don't exceed the max cluster size.
> >>>
> >>> +       len = (len > sbi->max_cluster) ? sbi->max_cluster : len;
> >>>
> >>> +       for (count = start; count <= len; count++) {
> >>
> >> Yes. We _adjust_ from 0 to 2 here, so, the end of block also have to be
> >> _adjusted_.
> >>
> >> From other point of view, if userland specified 0 - max-length
> >> (i.e. number of blocks), what happens? It would trim block of 2 -
> >> (max-length - 2), right?
> >
> > No, length is not changed. so max-length is used.
>
> No, no. Userland will know max-length from statvfs, right? So, let's
> assume it is 100 (->f_blocks) * 1024 (->f_bsize).

You do not need to know the filesystem size to do the discard; it should
be adjusted within the kernel. Just specify ULLONG_MAX as the length. See
the fstrim tool in util-linux-ng.

>
> Now, userland know about max length, 102400, ok? Let's start to trim.
>
> Assume, userland want to trim whole. So, userland will specify like
>
> trim(0, 102400).
>
> What happen in kernel actually?
>
> Current implement doesn't map blocks. So, in the case of FAT, it adjusts
> from 0 to 2 * 1024.
>
> So, it trims between 2048 and 102400. The problem is here. FS layout is
> actually, 2048 and (102400 + 2048). I.e. actually userland has to do
>
> trim(2048, 102400 + 2048)
>
> to specify whole. How to know 2048?

You do not need to know anything in userspace. If you want to trim the
whole filesystem you just do trim(0, ULLONG_MAX) - which is what fstrim
does when you do not specify range. And you just skip the filesystem
metadata obviously, regardless if they are at the beginning of the
filesystem or in the middle. Just do whatever you need to do within your
filesystem.

What we do in ext4 is that we convert the length and start passed in
struct fstrim_range into filesystem block units, then get the last
allocation group and the block offset within that group (we do the same
for the start block), and we try to discard free block ranges from the
starting block to the last block.

It is really not rocket science, and since every filesystem is different
and has different internal data structures, it is up to you how to do
this. And if you shift a block or two, it really does not matter much,
since user land does not know how the filesystem blocks are laid out
anyway, nor which blocks are free and which are not.

I agree that the interface is a little bit fuzzy, but that is mainly
because it is intended to be filesystem independent, and we do have a lot
of various filesystems, so I wanted it to be as flexible as possible,
hence the start and len in bytes.

Hope it helped.

Thanks!
-Lukas

>
> See what I'm saying?
>
> FAT has liner block space, so the problem is small against mapping. But
> other FSes has bigger problem.
>
> Thanks.
>

--

2011-05-24 09:45:06

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

Kyungmin Park <[email protected]> writes:

>>> No, length is not changed. so max-length is used.
>>
>> No, no. Userland will know max-length from statvfs, right? So, let's
>> assume it is 100 (->f_blocks) * 1024 (->f_bsize).
>>
>> Now, userland know about max length, 102400, ok? Let's start to trim.
>>
>> Assume, userland want to trim whole. So, userland will specify like
>>
>>        trim(0, 102400).
>>
>> What happen in kernel actually?
>>
>> Current implement doesn't map blocks. So, in the case of FAT, it adjusts
>> from 0 to 2 * 1024.
>>
>> So, it trims between 2048 and 102400. The problem is here. FS layout is
>> actually, 2048 and (102400 + 2048). I.e. actually userland has to do
>>
>>        trim(2048, 102400 + 2048)
>
> Umm maybe first implementation does as like this, but Lukas mentioned
> it's wrong. So I modified it for batched discard concept.

I don't know what Lukas thought/said, though; I hope he notices the
problem once a real app starts to use this. Or I'd like to know how it
can work without knowledge of FS internals.

> You want the loop like this
>
> for (count = start; count <= (start + len); count++)

I'm not sure this loop does what the above example describes. Like I said
in the previous thread, I'd like to see a linear mapping from 0, like a
simple regular file, in _all_ FSes. If I read ext* correctly, it exposes
the raw FS layout.

I.e., FAT has a "2 - (2 + number_of_blocks)" layout; in FAT's case, I
think it should map to "0 - number_of_blocks" and be exposed as linear
from 0.

But it should be this way for all FSes, otherwise there is no benefit for
userland. If there is an exception, userland would have to know FS
internals after all.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2011-05-24 10:07:31

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

Lukas Czerner <[email protected]> writes:

>> No, no. Userland will know max-length from statvfs, right? So, let's
>> assume it is 100 (->f_blocks) * 1024 (->f_bsize).
>
> You do not need to know the filesystem size to do the discard, it should
> be adjusted within the kernel. Just specify ULLONG_MAX as a length. See
> fstrim tool in util-linux-ng.
>
>>
>> Now, userland know about max length, 102400, ok? Let's start to trim.
>>
>> Assume, userland want to trim whole. So, userland will specify like
>>
>> trim(0, 102400).
>>
>> What happen in kernel actually?
>>
>> Current implement doesn't map blocks. So, in the case of FAT, it adjusts
>> from 0 to 2 * 1024.
>>
>> So, it trims between 2048 and 102400. The problem is here. FS layout is
>> actually, 2048 and (102400 + 2048). I.e. actually userland has to do
>>
>> trim(2048, 102400 + 2048)
>>
>> to specify whole. How to know 2048?
>
> You do not need to know anything in userspace. If you want to trim the
> whole filesystem you just do trim(0, ULLONG_MAX) - which is what fstrim
> does when you do not specify range. And you just skip the filesystem
> metadata obviously, regardless if they are at the beginning of the
> filesystem or in the middle. Just do whatever you need to do within your
> filesystem.
>
> What we do in ext4 is, that we convert length and start passed in struct
> fstrim_range into filesystem block units and then get the last
> allocation group and block offset within that group (we do the same for
> the start block) and we try to discard free block ranges in from staring
> block to the last block.
>
> It is really not a rocket science and since every filesystem is
> different and has different internal data structures it is up to you how
> to do this. And if you shift a block or two, it really does not matter
> as much since user-land does not know about how the filesystem block are
> laid out anyway, nor user land knows which are free and which are not.
>
> I agree that the interface is a little bit fuzzy, but that is mainly
> because it is intended to be filesystem independent and we do have a lot
> of various filesystems, so I wanted it to be as flexibile as it should,
> hence the start, len in Bytes.
>
> Hope it helped.

No. If you want to trim the whole filesystem periodically in chunks of,
say, 1GB (IIRC xfstests does this), what then? Do we have to trim to
ULLONG_MAX for each 1GB chunk?

Thanks.
--
OGAWA Hirofumi <[email protected]>

2011-05-24 10:44:37

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Tue, 24 May 2011, OGAWA Hirofumi wrote:

> Lukas Czerner <[email protected]> writes:
>
> >> No, no. Userland will know max-length from statvfs, right? So, let's
> >> assume it is 100 (->f_blocks) * 1024 (->f_bsize).
> >
> > You do not need to know the filesystem size to do the discard, it should
> > be adjusted within the kernel. Just specify ULLONG_MAX as a length. See
> > fstrim tool in util-linux-ng.
> >
> >>
> >> Now, userland know about max length, 102400, ok? Let's start to trim.
> >>
> >> Assume, userland want to trim whole. So, userland will specify like
> >>
> >> trim(0, 102400).
> >>
> >> What happen in kernel actually?
> >>
> >> Current implement doesn't map blocks. So, in the case of FAT, it adjusts
> >> from 0 to 2 * 1024.
> >>
> >> So, it trims between 2048 and 102400. The problem is here. FS layout is
> >> actually, 2048 and (102400 + 2048). I.e. actually userland has to do
> >>
> >> trim(2048, 102400 + 2048)
> >>
> >> to specify whole. How to know 2048?
> >
> > You do not need to know anything in userspace. If you want to trim the
> > whole filesystem you just do trim(0, ULLONG_MAX) - which is what fstrim
> > does when you do not specify range. And you just skip the filesystem
> > metadata obviously, regardless if they are at the beginning of the
> > filesystem or in the middle. Just do whatever you need to do within your
> > filesystem.
> >
> > What we do in ext4 is, that we convert length and start passed in struct
> > fstrim_range into filesystem block units and then get the last
> > allocation group and block offset within that group (we do the same for
> > the start block) and we try to discard free block ranges in from staring
> > block to the last block.
> >
> > It is really not a rocket science and since every filesystem is
> > different and has different internal data structures it is up to you how
> > to do this. And if you shift a block or two, it really does not matter
> > as much since user-land does not know about how the filesystem block are
> > laid out anyway, nor user land knows which are free and which are not.
> >
> > I agree that the interface is a little bit fuzzy, but that is mainly
> > because it is intended to be filesystem independent and we do have a lot
> > of various filesystems, so I wanted it to be as flexible as it should,
> > hence the start, len in Bytes.
> >
> > Hope it helped.
>
> No. If you want to trim the whole filesystem in chunks, say 1GB at a
> time, and do it periodically (IIRC xfstests does this), what do you do?
> Do we have to trim up to ULLONG_MAX for each 1GB chunk?
>
> Thanks.
>

What? No, of course not. As I said, just go through 1G worth of filesystem
blocks, skipping metadata. However, we do have a special case where we
adjust start and len according to the first data block (which is only
the case for a 1024B blocksize):

	if (start < first_data_blk) {
		len -= first_data_blk - start;
		start = first_data_blk;
	}

This means that we just skip the first block (or whatever the first data
block is), and that is the same as skipping metadata.

-Lukas

2011-05-24 11:14:43

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

Lukas Czerner <[email protected]> writes:

>> No. If you want to trim the whole filesystem in chunks, say 1GB at a
>> time, and do it periodically (IIRC xfstests does this), what do you do?
>> Do we have to trim up to ULLONG_MAX for each 1GB chunk?
>>
>> Thanks.
>>
>
> What? No, of course not. As I said, just go through 1G worth of filesystem
> blocks, skipping metadata. However, we do have a special case where we
> adjust start and len according to the first data block (which is only
> the case for a 1024B blocksize):
>
> 	if (start < first_data_blk) {
> 		len -= first_data_blk - start;
> 		start = first_data_blk;
> 	}
>
> This means that we just skip the first block (or whatever the first data
> block is), and that is the same as skipping metadata.

Did you read my email? So, FAT adjusts by 2 blocks, ext* by 1 block; what
do other filesystems do? Is the middle guaranteed to be contiguous? And
where is the end of the blocks?
--
OGAWA Hirofumi <[email protected]>

2011-05-24 11:32:58

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Tue, 24 May 2011, OGAWA Hirofumi wrote:

> Lukas Czerner <[email protected]> writes:
>
> >> No. If you want to trim the whole filesystem in chunks, say 1GB at a
> >> time, and do it periodically (IIRC xfstests does this), what do you do?
> >> Do we have to trim up to ULLONG_MAX for each 1GB chunk?
> >>
> >> Thanks.
> >>
> >
> > What? No, of course not. As I said, just go through 1G worth of filesystem
> > blocks, skipping metadata. However, we do have a special case where we
> > adjust start and len according to the first data block (which is only
> > the case for a 1024B blocksize):
> >
> > 	if (start < first_data_blk) {
> > 		len -= first_data_blk - start;
> > 		start = first_data_blk;
> > 	}
> >
> > This means that we just skip the first block (or whatever the first data
> > block is), and that is the same as skipping metadata.
>
> Did you read my email? So, FAT adjusts by 2 blocks, ext* by 1 block; what
> do other filesystems do? Is the middle guaranteed to be contiguous? And
> where is the end of the blocks?

I am sorry, but I am not sure what you are asking for. Just grab the size
of the filesystem, or even better the size of the underlying device (since
the filesystem cannot be bigger than that), and use that.

Of course, if you grab the count of filesystem blocks (and the filesystem
does not account for the first data block), you'll end up with len
smaller than the first data block. So it might make sense not to
decrement len and just start from the first data block, since it is not
accounted for anyway. However, it is not a big deal, is it? Are you
expecting any problems with this behaviour?

-Lukas

2011-05-24 12:19:52

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

Lukas Czerner <[email protected]> writes:

>> Did you read my email? So, FAT adjusts by 2 blocks, ext* by 1 block; what
>> do other filesystems do? Is the middle guaranteed to be contiguous? And
>> where is the end of the blocks?
>
> I am sorry, but I am not sure what you are asking for. Just grab the size
> of the filesystem, or even better the size of the underlying device (since
> the filesystem cannot be bigger than that), and use that.
>
> Of course, if you grab the count of filesystem blocks (and the filesystem
> does not account for the first data block), you'll end up with len
> smaller than the first data block. So it might make sense not to
> decrement len and just start from the first data block, since it is not
> accounted for anyway. However, it is not a big deal, is it? Are you
> expecting any problems with this behaviour?

What is the size of the filesystem or of the underlying device? Do you
force userland to find the device the target FS is using? Even if you
can get the size of the underlying device, do you force the user into an
insane loop over the size of the device?

Why can you guarantee that it's not a big deal in the design? Why can't
you admit that userland can't build an optimized loop?
--
OGAWA Hirofumi <[email protected]>

2011-05-24 13:30:35

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

On Tue, 24 May 2011, OGAWA Hirofumi wrote:

> Lukas Czerner <[email protected]> writes:
>
> >> Did you read my email? So, FAT adjusts by 2 blocks, ext* by 1 block; what
> >> do other filesystems do? Is the middle guaranteed to be contiguous? And
> >> where is the end of the blocks?
> >
> > I am sorry, but I am not sure what you are asking for. Just grab the size
> > of the filesystem, or even better the size of the underlying device (since
> > the filesystem cannot be bigger than that), and use that.
> >
> > Of course, if you grab the count of filesystem blocks (and the filesystem
> > does not account for the first data block), you'll end up with len
> > smaller than the first data block. So it might make sense not to
> > decrement len and just start from the first data block, since it is not
> > accounted for anyway. However, it is not a big deal, is it? Are you
> > expecting any problems with this behaviour?
>
> What is the size of the filesystem or of the underlying device? Do you
> force userland to find the device the target FS is using? Even if you
> can get the size of the underlying device, do you force the user into an
> insane loop over the size of the device?

Look, I do not have time to argue with you forever, and I do not even
understand what your point is. Just go and read the other filesystems'
implementations of FITRIM (ext4, ext3, xfs, btrfs?) and you'll see what
you need to do.

If you do not want to get the filesystem size, then FINE! Just pass the
damn ULLONG_MAX as the length. I have no clue what insane loop you are
talking about! It is *easy*: just discard the whole thing (with
ULLONG_MAX) or, if you want to do it per partes, keep going as long as
it does not return EINVAL; once it does, you know that your "start" is
outside the filesystem and you are done!


>
> Why can you guarantee that it's not a big deal in the design? Why can't
> you admit that userland can't build an optimized loop?

And what do you mean by that?

-Lukas

2011-05-24 14:19:07

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH v6] fat: Batched discard support for fat

Lukas Czerner <[email protected]> writes:

>> What is the size of the filesystem or of the underlying device? Do you
>> force userland to find the device the target FS is using? Even if you
>> can get the size of the underlying device, do you force the user into an
>> insane loop over the size of the device?
>
> Look, I do not have time to argue with you forever, and I do not even
> understand what your point is. Just go and read the other filesystems'
> implementations of FITRIM (ext4, ext3, xfs, btrfs?) and you'll see what
> you need to do.
>
> If you do not want to get the filesystem size, then FINE! Just pass the
> damn ULLONG_MAX as the length. I have no clue what insane loop you are
> talking about! It is *easy*: just discard the whole thing (with
> ULLONG_MAX) or, if you want to do it per partes, keep going as long as
> it does not return EINVAL; once it does, you know that your "start" is
> outside the filesystem and you are done!

You do not even understand the current implementations. See
ext3_trim_fs() and ext4_trim_fs().

What happens if "start" is outside max_blks:

	ext3 returns 0
	ext4 returns EINVAL

What "start" == 0 means:

	ext3 maps it to 1
	ext4 just removes 0 from the request

Am I missing something?

>> Why can you guarantee that it's not a big deal in the design? Why can't
>> you admit that userland can't build an optimized loop?
>
> And what do you mean by that?
>
> -Lukas

--
OGAWA Hirofumi <[email protected]>