2013-03-07 13:57:14

by Namjae Jeon

[permalink] [raw]
Subject: [PATCH v3] fat: editions to support fat_fallocate

From: Namjae Jeon <[email protected]>

Implement preallocation via the fallocate syscall on VFAT partitions.

Change Log:
v3: Release preallocated blocks at file release.

With FALLOC_FL_KEEP_SIZE, there is no way to distinguish if the mismatch
between i_size and no. of clusters allocated is a consequence of
fallocate or just plain corruption. When a non fallocate aware (old)
linux fat driver tries to write to such a file, it throws an error.
Also, fsck detects this as inconsistency and truncates the prealloc'd blocks.

To avoid this, as suggested by OGAWA, remove changes that make fallocate
persistent across mounts and restrict lifetime of blocks from
fallocate(2) to file release.

v2: On an area preallocated with FALLOC_FL_KEEP_SIZE, when a seek was
done to an offset beyond i_size, the old (garbage) data was exposed as
we did not zero out the area at allocation time. Added
fat_zero_falloc_area() to fix this.

v1: Reworked an earlier patch of the same name
(https://lkml.org/lkml/2007/12/22/130) to fix some bugs:
i)Preallocated space was not persistent and was lost on remount. Fixed
it.
ii)Did not zero out allocated clusters when FALLOC_FL_KEEP_SIZE was set,
thereby speeding up preallocation time.

Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Ravishankar N <[email protected]>
---
fs/fat/file.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fat/inode.c | 47 ++++++++++++++++++++++++++++++++
2 files changed, 128 insertions(+)

diff --git a/fs/fat/file.c b/fs/fat/file.c
index b0b632e..76df547 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -17,8 +17,11 @@
#include <linux/blkdev.h>
#include <linux/fsnotify.h>
#include <linux/security.h>
+#include <linux/falloc.h>
#include "fat.h"

+static long fat_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t len);
static int fat_ioctl_get_attributes(struct inode *inode, u32 __user *user_attr)
{
u32 attr;
@@ -140,6 +143,12 @@ static long fat_generic_compat_ioctl(struct file *filp, unsigned int cmd,

static int fat_file_release(struct inode *inode, struct file *filp)
{
+ struct super_block *sb = inode->i_sb;
+ loff_t mmu_private_ideal = (inode->i_size + (sb->s_blocksize-1)) &
+ ~(sb->s_blocksize-1);
+ if (mmu_private_ideal < MSDOS_I(inode)->mmu_private &&
+ filp->f_dentry->d_count == 1)
+ fat_truncate_blocks(inode, inode->i_size);
if ((filp->f_mode & FMODE_WRITE) &&
MSDOS_SB(inode->i_sb)->options.flush) {
fat_flush_inodes(inode->i_sb, inode, NULL);
@@ -174,6 +183,7 @@ const struct file_operations fat_file_operations = {
#endif
.fsync = fat_file_fsync,
.splice_read = generic_file_splice_read,
+ .fallocate = fat_fallocate,
};

static int fat_cont_expand(struct inode *inode, loff_t size)
@@ -211,7 +221,78 @@ static int fat_cont_expand(struct inode *inode, loff_t size)
out:
return err;
}
+/*
+ * preallocate space for a file. This implements fat's fallocate file
+ * operation, which gets called from sys_fallocate system call. User
+ * space requests len bytes at offset.If FALLOC_FL_KEEP_SIZE is set
+ * we just allocate clusters without zeroing them out.Otherwise we
+ * allocate and zero out clusters via an expanding truncate.
+ */
+static long fat_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t len)
+{
+ int err = 0;
+ struct inode *inode = file->f_mapping->host;
+ int cluster, nr_cluster, fclus, dclus, free_bytes, nr_bytes;
+ struct super_block *sb = inode->i_sb;
+ struct msdos_sb_info *sbi = MSDOS_SB(sb);
+
+ /* No support for hole punch or other fallocate flags. */
+ if (mode & ~FALLOC_FL_KEEP_SIZE)
+ return -EOPNOTSUPP;
+
+ if ((offset + len) <= MSDOS_I(inode)->mmu_private) {
+ fat_msg(sb, KERN_ERR,
+ "fat_fallocate():Blocks already allocated");
+ return -EINVAL;
+ }

+ if ((mode & FALLOC_FL_KEEP_SIZE)) {
+ /* First compute the number of clusters to be allocated */
+ if (inode->i_size > 0) {
+ err = fat_get_cluster(inode, FAT_ENT_EOF,
+ &fclus, &dclus);
+ if (err < 0) {
+ fat_msg(sb, KERN_ERR,
+ "fat_fallocate():fat_get_cluster() error");
+ return err;
+ }
+ free_bytes = ((fclus+1) << sbi->cluster_bits)-
+ (inode->i_size);
+ nr_bytes = (offset + len - inode->i_size) - free_bytes;
+ } else
+ nr_bytes = (offset + len - inode->i_size);
+ nr_cluster = (nr_bytes + (sbi->cluster_size - 1)) >>
+ sbi->cluster_bits;
+ mutex_lock(&inode->i_mutex);
+ /* Start the allocation.We are not zeroing out the clusters */
+ while (nr_cluster-- > 0) {
+ err = fat_alloc_clusters(inode, &cluster, 1);
+ if (err) {
+ fat_msg(sb, KERN_ERR,
+ "fat_fallocate():fat_alloc_clusters() error");
+ goto error;
+ }
+ err = fat_chain_add(inode, cluster, 1);
+ if (err) {
+ fat_free_clusters(inode, cluster);
+ goto error;
+ }
+ MSDOS_I(inode)->mmu_private += sbi->cluster_size;
+ }
+ } else {
+ mutex_lock(&inode->i_mutex);
+ /* This is just an expanding truncate */
+ err = fat_cont_expand(inode, (offset + len));
+ if (err) {
+ fat_msg(sb, KERN_ERR,
+ "fat_fallocate():fat_cont_expand() error");
+ }
+ }
+error:
+ mutex_unlock(&inode->i_mutex);
+ return err;
+}
/* Free all clusters after the skip'th cluster. */
static int fat_free(struct inode *inode, int skip)
{
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index dfce656..ddf2969 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -152,11 +152,58 @@ static void fat_write_failed(struct address_space *mapping, loff_t to)
}
}

+static int fat_zero_falloc_area(struct file *file,
+ struct address_space *mapping, loff_t pos)
+{
+ struct page *page;
+ struct inode *inode = mapping->host;
+ loff_t curpos = inode->i_size;
+ size_t count = pos-curpos;
+ int err;
+ do {
+ unsigned offset, bytes;
+ void *fsdata;
+
+ offset = (curpos & (PAGE_CACHE_SIZE - 1));
+ bytes = PAGE_CACHE_SIZE - offset;
+ if (bytes > count)
+ bytes = count;
+
+ err = pagecache_write_begin(NULL, mapping, curpos, bytes,
+ AOP_FLAG_UNINTERRUPTIBLE,
+ &page, &fsdata);
+ if (err)
+ break;
+
+ zero_user(page, offset, bytes);
+
+ err = pagecache_write_end(NULL, mapping, curpos, bytes, bytes,
+ page, fsdata);
+ WARN_ON(err <= 0);
+ curpos += bytes;
+ count -= bytes;
+ err = 0;
+ } while (count);
+
+ return -err;
+}
+
static int fat_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
int err;
+ struct inode *inode = mapping->host;
+ struct super_block *sb = inode->i_sb;
+ loff_t mmu_private_actual = MSDOS_I(inode)->mmu_private;
+ loff_t mmu_private_ideal = (inode->i_size + (sb->s_blocksize-1)) &
+ ~(sb->s_blocksize-1);
+
+ if ((mmu_private_actual > mmu_private_ideal) && (pos > inode->i_size)) {
+ err = fat_zero_falloc_area(file, mapping, pos);
+ if (err)
+ fat_msg(sb, KERN_ERR, "error zeroing fallocated area");
+ }

*pagep = NULL;
err = cont_write_begin(file, mapping, pos, len, flags,
--
1.7.9.5


2013-03-08 23:37:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3] fat: editions to support fat_fallocate

On Thu, 7 Mar 2013 22:56:57 +0900 Namjae Jeon <[email protected]> wrote:

> From: Namjae Jeon <[email protected]>
>
> Implement preallocation via the fallocate syscall on VFAT partitions.
>
> Change Log:
> v3: Release preallocated blocks at file release.
>
> With FALLOC_FL_KEEP_SIZE, there is no way to distinguish if the mismatch
> between i_size and no. of clusters allocated is a consequence of
> fallocate or just plain corruption. When a non fallocate aware (old)
> linux fat driver tries to write to such a file, it throws an error.
> Also, fsck detects this as inconsistency and truncates the prealloc'd blocks.
>
> To avoid this, as suggested by OGAWA, remove changes that make fallocate
> persistent across mounts and restrict lifetime of blocks from
> fallocate(2) to file release.
>
> v2: On an area preallocated with FALLOC_FL_KEEP_SIZE, when a seek was
> done to an offset beyond i_size, the old (garbage) data was exposed as
> we did not zero out the area at allocation time. Added
> fat_zero_falloc_area() to fix this.
>
> v1: Reworked an earlier patch of the same name
> (https://lkml.org/lkml/2007/12/22/130) to fix some bugs:
> i)Preallocated space was not persistent and was lost on remount. Fixed
> it.
> ii)Did not zero out allocated clusters when FALLOC_FL_KEEP_SIZE was set,
> thereby speeding up preallocation time.
>
> ...
>
> --- a/fs/fat/file.c
> +++ b/fs/fat/file.c
> @@ -17,8 +17,11 @@
> #include <linux/blkdev.h>
> #include <linux/fsnotify.h>
> #include <linux/security.h>
> +#include <linux/falloc.h>
> #include "fat.h"
>
> +static long fat_fallocate(struct file *file, int mode,
> + loff_t offset, loff_t len);
> static int fat_ioctl_get_attributes(struct inode *inode, u32 __user *user_attr)
> {
> u32 attr;
> @@ -140,6 +143,12 @@ static long fat_generic_compat_ioctl(struct file *filp, unsigned int cmd,
>
> static int fat_file_release(struct inode *inode, struct file *filp)
> {
> + struct super_block *sb = inode->i_sb;
> + loff_t mmu_private_ideal = (inode->i_size + (sb->s_blocksize-1)) &
> + ~(sb->s_blocksize-1);

Stylistically, it looks better to do

loff_t mmu_private_ideal;

mmu_private_ideal = (inode->i_size + (sb->s_blocksize-1)) &
~(sb->s_blocksize-1);

Note the blank line between end-of-definitions and start-of-code. The
patch fails to do this in numerous places.

Also, I think and hope we can use round_up() here.

And we're not using i_size_read(). Probably that's OK if it is
guaranteed that fat_file_release() is always called under i_mutex, but
I might have forgotten the rules there.


> + if (mmu_private_ideal < MSDOS_I(inode)->mmu_private &&
> + filp->f_dentry->d_count == 1)
> + fat_truncate_blocks(inode, inode->i_size);

I suggest that a comment be added here. It is unobvious why this code
is here, and what role d_count plays.

> if ((filp->f_mode & FMODE_WRITE) &&
> MSDOS_SB(inode->i_sb)->options.flush) {
> fat_flush_inodes(inode->i_sb, inode, NULL);
> @@ -174,6 +183,7 @@ const struct file_operations fat_file_operations = {
> #endif
> .fsync = fat_file_fsync,
> .splice_read = generic_file_splice_read,
> + .fallocate = fat_fallocate,
> };
>
> static int fat_cont_expand(struct inode *inode, loff_t size)
> @@ -211,7 +221,78 @@ static int fat_cont_expand(struct inode *inode, loff_t size)
> out:
> return err;
> }
> +/*
> + * preallocate space for a file. This implements fat's fallocate file
> + * operation, which gets called from sys_fallocate system call. User
> + * space requests len bytes at offset.If FALLOC_FL_KEEP_SIZE is set
> + * we just allocate clusters without zeroing them out.Otherwise we
> + * allocate and zero out clusters via an expanding truncate.

This comment is a bit lazy :( Capital letters at the start of
sentences, a space after a full stop etc, please.

> + */
> +static long fat_fallocate(struct file *file, int mode,
> + loff_t offset, loff_t len)
> +{
> + int err = 0;
> + struct inode *inode = file->f_mapping->host;
> + int cluster, nr_cluster, fclus, dclus, free_bytes, nr_bytes;

I'm rather allergic to multiple-definitions-on-one-line like this.
They make the code harder to read and they result in messy patch resolution
efforts. Most significantly, one-definition-per-line leaves a little
room on the right for a brief comment explaining the variable's role.
Such comments appear to be needed in this function!

Are you sure that `int' is the best type for all these? Do they need
to be signed? For example nr_bytes and free_bytes are derived from
loff_t's and it is unobvious that there is no risk of overflowing.


> + struct super_block *sb = inode->i_sb;
> + struct msdos_sb_info *sbi = MSDOS_SB(sb);
> +
> + /* No support for hole punch or other fallocate flags. */
> + if (mode & ~FALLOC_FL_KEEP_SIZE)
> + return -EOPNOTSUPP;
> +
> + if ((offset + len) <= MSDOS_I(inode)->mmu_private) {
> + fat_msg(sb, KERN_ERR,
> + "fat_fallocate():Blocks already allocated");

Place a space after the colon.

> + return -EINVAL;
> + }
>
> + if ((mode & FALLOC_FL_KEEP_SIZE)) {

Unneeded parentheses.

> + /* First compute the number of clusters to be allocated */
> + if (inode->i_size > 0) {

i_size_read()?

> + err = fat_get_cluster(inode, FAT_ENT_EOF,
> + &fclus, &dclus);
> + if (err < 0) {
> + fat_msg(sb, KERN_ERR,
> + "fat_fallocate():fat_get_cluster() error");

space after colon

> + return err;
> + }
> + free_bytes = ((fclus+1) << sbi->cluster_bits)-

Place spaces around + and -

> + (inode->i_size);

More overparenthesization.

> + nr_bytes = (offset + len - inode->i_size) - free_bytes;
> + } else
> + nr_bytes = (offset + len - inode->i_size);

Overparenthesization.

> + nr_cluster = (nr_bytes + (sbi->cluster_size - 1)) >>
> + sbi->cluster_bits;
> + mutex_lock(&inode->i_mutex);

whoa, darn. We weren't holding i_mutex? Then yes, i_size_read() is
needed.

And this code reads i_size multiple times, while not holding any lock
which will prevent i_size from changing between those two reads. It
seems racy.


> + /* Start the allocation.We are not zeroing out the clusters */
> + while (nr_cluster-- > 0) {
> + err = fat_alloc_clusters(inode, &cluster, 1);
> + if (err) {
> + fat_msg(sb, KERN_ERR,
> + "fat_fallocate():fat_alloc_clusters() error");

space after colon

> + goto error;
> + }
> + err = fat_chain_add(inode, cluster, 1);
> + if (err) {
> + fat_free_clusters(inode, cluster);
> + goto error;
> + }
> + MSDOS_I(inode)->mmu_private += sbi->cluster_size;
> + }
> + } else {
> + mutex_lock(&inode->i_mutex);
> + /* This is just an expanding truncate */
> + err = fat_cont_expand(inode, (offset + len));

Overparenthesization.

> + if (err) {
> + fat_msg(sb, KERN_ERR,
> + "fat_fallocate():fat_cont_expand() error");

space

> + }
> + }
> +error:
> + mutex_unlock(&inode->i_mutex);
> + return err;
> +}
> /* Free all clusters after the skip'th cluster. */
> static int fat_free(struct inode *inode, int skip)
> {
> diff --git a/fs/fat/inode.c b/fs/fat/inode.c
> index dfce656..ddf2969 100644
> --- a/fs/fat/inode.c
> +++ b/fs/fat/inode.c
> @@ -152,11 +152,58 @@ static void fat_write_failed(struct address_space *mapping, loff_t to)
> }
> }
>
> +static int fat_zero_falloc_area(struct file *file,
> + struct address_space *mapping, loff_t pos)
> +{
> + struct page *page;
> + struct inode *inode = mapping->host;
> + loff_t curpos = inode->i_size;
> + size_t count = pos-curpos;

spaces around -

> + int err;

Newline after end-of-locals.

> + do {
> + unsigned offset, bytes;
> + void *fsdata;
> +
> + offset = (curpos & (PAGE_CACHE_SIZE - 1));
> + bytes = PAGE_CACHE_SIZE - offset;

OK, so use of 32-bit scalars are safe here. They are "offset within a
page", yes? That's unobvious from the chosen names...

> + if (bytes > count)
> + bytes = count;

Use min()?

> + err = pagecache_write_begin(NULL, mapping, curpos, bytes,
> + AOP_FLAG_UNINTERRUPTIBLE,
> + &page, &fsdata);
> + if (err)
> + break;

hm, so if we were only able to fallocate 1MB from a requested 2MB, we
don't tell userspace about this? As far as userspace is concerned, the
whole thing failed? Seems so... Is there no requirement to clean up
the partial allocation on failure?

> + zero_user(page, offset, bytes);
> +
> + err = pagecache_write_end(NULL, mapping, curpos, bytes, bytes,
> + page, fsdata);
> + WARN_ON(err <= 0);

Why? That could make the kernel extremely noisy if something goes
wrong.

> + curpos += bytes;
> + count -= bytes;
> + err = 0;
> + } while (count);
> +
> + return -err;

What? So if pagecache_write_begin() returned -ENOMEM,
fat_zero_falloc_area() will return --ENOMEM - that's +12.

> +}
> +
> static int fat_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> {
> int err;
> + struct inode *inode = mapping->host;
> + struct super_block *sb = inode->i_sb;
> + loff_t mmu_private_actual = MSDOS_I(inode)->mmu_private;
> + loff_t mmu_private_ideal = (inode->i_size + (sb->s_blocksize-1)) &
> + ~(sb->s_blocksize-1);

See earlier comments.

> + if ((mmu_private_actual > mmu_private_ideal) && (pos > inode->i_size)) {

Overparenthesization.

> + err = fat_zero_falloc_area(file, mapping, pos);
> + if (err)
> + fat_msg(sb, KERN_ERR, "error zeroing fallocated area");

a) the errno should be displayed

b) why is it OK to just ignore the error and proceed?

> + }
>
> *pagep = NULL;
> err = cont_write_begin(file, mapping, pos, len, flags,

2013-03-09 14:18:12

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH v3] fat: editions to support fat_fallocate

Namjae Jeon <[email protected]> writes:

> static int fat_file_release(struct inode *inode, struct file *filp)
> {
> + struct super_block *sb = inode->i_sb;
> + loff_t mmu_private_ideal = (inode->i_size + (sb->s_blocksize-1)) &
> + ~(sb->s_blocksize-1);
> + if (mmu_private_ideal < MSDOS_I(inode)->mmu_private &&
> + filp->f_dentry->d_count == 1)
> + fat_truncate_blocks(inode, inode->i_size);

Without locking, truncate is racy.

This choose ->release(). BTW, we would also be able to do this only
->evict_inode(), although I'm not thinking yet which one is better.

If you had conclusion, it would be nice to explain it.

> +static long fat_fallocate(struct file *file, int mode,
> + loff_t offset, loff_t len)
> +{
>
> + if ((offset + len) <= MSDOS_I(inode)->mmu_private) {
> + fat_msg(sb, KERN_ERR,
> + "fat_fallocate():Blocks already allocated");
> + return -EINVAL;
> + }

Also this looks like totally racy.

> static int fat_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> {
> int err;
> + struct inode *inode = mapping->host;
> + struct super_block *sb = inode->i_sb;
> + loff_t mmu_private_actual = MSDOS_I(inode)->mmu_private;
> + loff_t mmu_private_ideal = (inode->i_size + (sb->s_blocksize-1)) &
> + ~(sb->s_blocksize-1);
> +
> + if ((mmu_private_actual > mmu_private_ideal) && (pos > inode->i_size)) {
> + err = fat_zero_falloc_area(file, mapping, pos);
> + if (err)
> + fat_msg(sb, KERN_ERR, "error zeroing fallocated area");
> + }
>
> *pagep = NULL;
> err = cont_write_begin(file, mapping, pos, len, flags,

Hm, only write_begin is enough to handle mmap, truncate, and etc.?

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-03-11 09:43:21

by Namjae Jeon

[permalink] [raw]
Subject: Re: [PATCH v3] fat: editions to support fat_fallocate

Hi. Andrew.
>>
>> static int fat_file_release(struct inode *inode, struct file *filp)
>> {
>> + struct super_block *sb = inode->i_sb;
>> + loff_t mmu_private_ideal = (inode->i_size + (sb->s_blocksize-1)) &
>> + ~(sb->s_blocksize-1);
>
> Stylistically, it looks better to do
>
> loff_t mmu_private_ideal;
>
> mmu_private_ideal = (inode->i_size + (sb->s_blocksize-1)) &
> ~(sb->s_blocksize-1);
Agreed. I will fix this.
>
> Note the blank line between end-of-definitions and start-of-code. The
> patch fails to do this in numerous places.
Okay. I will.

>
> Also, I think and hope we can use round_up() here.
Okay, I will.
>
> And we're not using i_size_read(). Probably that's OK if it is
> guaranteed that fat_file_release() is always called under i_mutex, but
> I might have forgotten the rules there.
Okay, I will fix it.

>
>
>> + if (mmu_private_ideal < MSDOS_I(inode)->mmu_private &&
>> + filp->f_dentry->d_count == 1)
>> + fat_truncate_blocks(inode, inode->i_size);
>
> I suggest that a comment be added here. It is unobvious why this code
> is here, and what role d_count plays.
Ok. This is the code which releases prealloced (and not yet written
to) area when file is released.
The d_count check ensures release happens only when the last file
descriptor for that file is closed.
I will add a comment on next version patch.
>
>> if ((filp->f_mode & FMODE_WRITE) &&
>> MSDOS_SB(inode->i_sb)->options.flush) {
>> fat_flush_inodes(inode->i_sb, inode, NULL);
>> @@ -174,6 +183,7 @@ const struct file_operations fat_file_operations = {
>> #endif
>> .fsync = fat_file_fsync,
>> .splice_read = generic_file_splice_read,
>> + .fallocate = fat_fallocate,
>> };
>>
>> static int fat_cont_expand(struct inode *inode, loff_t size)
>> @@ -211,7 +221,78 @@ static int fat_cont_expand(struct inode *inode,
>> loff_t size)
>> out:
>> return err;
>> }
>> +/*
>> + * preallocate space for a file. This implements fat's fallocate file
>> + * operation, which gets called from sys_fallocate system call. User
>> + * space requests len bytes at offset.If FALLOC_FL_KEEP_SIZE is set
>> + * we just allocate clusters without zeroing them out.Otherwise we
>> + * allocate and zero out clusters via an expanding truncate.
>
> This comment is a bit lazy :( Capital letters at the start of
> sentences, a space after a full stop etc, please.
Okay!

>
>> + */
>> +static long fat_fallocate(struct file *file, int mode,
>> + loff_t offset, loff_t len)
>> +{
>> + int err = 0;
>> + struct inode *inode = file->f_mapping->host;
>> + int cluster, nr_cluster, fclus, dclus, free_bytes, nr_bytes;
>
> I'm rather allergic to multiple-definitions-on-one-line like this.
> They make the code harder to read and they result in messy patch resolution
> efforts. Most significantly, one-definition-per-line leaves a little
> room on the right for a brief comment explaining the variable's role.
> Such comments appear to be needed in this function!
Okay, I will add detailed comments.

>
> Are you sure that `int' is the best type for all these? Do they need
> to be signed? For example nr_bytes and free_bytes are derived from
> loff_t's and it is unobvious that there is no risk of overflowing.
Yes, right. I will change free_bytes and nr_bytes to loff_t.

>
>
>> + struct super_block *sb = inode->i_sb;
>> + struct msdos_sb_info *sbi = MSDOS_SB(sb);
>> +
>> + /* No support for hole punch or other fallocate flags. */
>> + if (mode & ~FALLOC_FL_KEEP_SIZE)
>> + return -EOPNOTSUPP;
>> +
>> + if ((offset + len) <= MSDOS_I(inode)->mmu_private) {
>> + fat_msg(sb, KERN_ERR,
>> + "fat_fallocate():Blocks already allocated");
>
> Place a space after the colon.
Okay, Thanks.

>
>> + return -EINVAL;
>> + }
>>
>> + if ((mode & FALLOC_FL_KEEP_SIZE)) {
>
> Unneeded parentheses.
Okay.

>
>> + /* First compute the number of clusters to be allocated */
>> + if (inode->i_size > 0) {
>
> i_size_read()?
Okay, I will change it.

>
>> + err = fat_get_cluster(inode, FAT_ENT_EOF,
>> + &fclus, &dclus);
>> + if (err < 0) {
>> + fat_msg(sb, KERN_ERR,
>> + "fat_fallocate():fat_get_cluster() error");
>
> space after colon
Okay. I will add a space after colon.

>
>> + return err;
>> + }
>> + free_bytes = ((fclus+1) << sbi->cluster_bits)-
>
> Place spaces around + and -
Okay, I will fix it.

>
>> + (inode->i_size);
>
> More overparenthesization.
Okay, I will.

>
>> + nr_bytes = (offset + len - inode->i_size) - free_bytes;
>> + } else
>> + nr_bytes = (offset + len - inode->i_size);
>
> Overparenthesization.
Okay.

>
>> + nr_cluster = (nr_bytes + (sbi->cluster_size - 1)) >>
>> + sbi->cluster_bits;
>> + mutex_lock(&inode->i_mutex);
>
> whoa, darn. We weren't holding i_mutex? Then yes, i_size_read() is
> needed.
Yes, right. I will fix it.

>
> And this code reads i_size multiple times, while not holding any lock
> which will prevent i_size from changing between those two reads. It
> seems racy.
Right, I will fix it.

>
>
>> + /* Start the allocation.We are not zeroing out the clusters */
>> + while (nr_cluster-- > 0) {
>> + err = fat_alloc_clusters(inode, &cluster, 1);
>> + if (err) {
>> + fat_msg(sb, KERN_ERR,
>> + "fat_fallocate():fat_alloc_clusters() error");
>
> space after colon
Okay!

>
>> + goto error;
>> + }
>> + err = fat_chain_add(inode, cluster, 1);
>> + if (err) {
>> + fat_free_clusters(inode, cluster);
>> + goto error;
>> + }
>> + MSDOS_I(inode)->mmu_private += sbi->cluster_size;
>> + }
>> + } else {
>> + mutex_lock(&inode->i_mutex);
>> + /* This is just an expanding truncate */
>> + err = fat_cont_expand(inode, (offset + len));
>
> Overparenthesization.
Okay!

>
>> + if (err) {
>> + fat_msg(sb, KERN_ERR,
>> + "fat_fallocate():fat_cont_expand() error");
>
> space
Okay!

>
>> + }
>> + }
>> +error:
>> + mutex_unlock(&inode->i_mutex);
>> + return err;
>> +}
>> /* Free all clusters after the skip'th cluster. */
>> static int fat_free(struct inode *inode, int skip)
>> {
>> diff --git a/fs/fat/inode.c b/fs/fat/inode.c
>> index dfce656..ddf2969 100644
>> --- a/fs/fat/inode.c
>> +++ b/fs/fat/inode.c
>> @@ -152,11 +152,58 @@ static void fat_write_failed(struct address_space
>> *mapping, loff_t to)
>> }
>> }
>>
>> +static int fat_zero_falloc_area(struct file *file,
>> + struct address_space *mapping, loff_t pos)
>> +{
>> + struct page *page;
>> + struct inode *inode = mapping->host;
>> + loff_t curpos = inode->i_size;
>> + size_t count = pos-curpos;
>
> spaces around -
Okay, I will add it.

>
>> + int err;
>
> Newline after end-of-locals.
Okay.

>
>> + do {
>> + unsigned offset, bytes;
>> + void *fsdata;
>> +
>> + offset = (curpos & (PAGE_CACHE_SIZE - 1));
>> + bytes = PAGE_CACHE_SIZE - offset;
>
> OK, so use of 32-bit scalars are safe here. They are "offset within a
> page", yes? That's unobvious from the chosen names...
Yes, I will fix it.

>
>> + if (bytes > count)
>> + bytes = count;
>
> Use min()?
Okay. I will use it.

>
>> + err = pagecache_write_begin(NULL, mapping, curpos, bytes,
>> + AOP_FLAG_UNINTERRUPTIBLE,
>> + &page, &fsdata);
>> + if (err)
>> + break;
>
> hm, so if we were only able to fallocate 1MB from a requested 2MB, we
> don't tell userspace about this? As far as userspace is concerned, the
> whole thing failed? Seems so... Is there no requirement to clean up
> the partial allocation on failure?
Other filesystem like EXT4 do not release partial allocation. we
verified this by fallocating 300MB on a 256MB EXT4 partition.
So I followed the same.
>
>> + zero_user(page, offset, bytes);
>> +
>> + err = pagecache_write_end(NULL, mapping, curpos, bytes, bytes,
>> + page, fsdata);
>> + WARN_ON(err <= 0);
>
> Why? That could make the kernel extremely noisy if something goes
> wrong.
Yes, There was a mistake. I will fix it.

>
>> + curpos += bytes;
>> + count -= bytes;
>> + err = 0;
>> + } while (count);
>> +
>> + return -err;
>
> What? So if pagecache_write_begin() returned -ENOMEM,
> fat_zero_falloc_area() will return --ENOMEM - that's +12.
I had literally used the xfs_iozero() function which does these same
things and will fix it.

>
>> +}
>> +
>> static int fat_write_begin(struct file *file, struct address_space
>> *mapping,
>> loff_t pos, unsigned len, unsigned flags,
>> struct page **pagep, void **fsdata)
>> {
>> int err;
>> + struct inode *inode = mapping->host;
>> + struct super_block *sb = inode->i_sb;
>> + loff_t mmu_private_actual = MSDOS_I(inode)->mmu_private;
>> + loff_t mmu_private_ideal = (inode->i_size + (sb->s_blocksize-1)) &
>> + ~(sb->s_blocksize-1);
>
> See earlier comments.
Okay.

>
>> + if ((mmu_private_actual > mmu_private_ideal) && (pos > inode->i_size))
>> {
>
> Overparenthesization.
Okay.

>
>> + err = fat_zero_falloc_area(file, mapping, pos);
>> + if (err)
>> + fat_msg(sb, KERN_ERR, "error zeroing fallocated area");
>
> a) the errno should be displayed
Yes, I will add it.
>
> b) why is it OK to just ignore the error and proceed?
Right, we should not proceed and return the error values.

I will post the next v2 patch after fixing totally.
Thanks for your review!
>
>> + }
>>
>> *pagep = NULL;
>> err = cont_write_begin(file, mapping, pos, len, flags,
>
>

2013-03-11 09:52:28

by Namjae Jeon

[permalink] [raw]
Subject: Re: [PATCH v3] fat: editions to support fat_fallocate

2013/3/9, OGAWA Hirofumi <[email protected]>:
> Namjae Jeon <[email protected]> writes:
>
>> static int fat_file_release(struct inode *inode, struct file *filp)
>> {
>> + struct super_block *sb = inode->i_sb;
>> + loff_t mmu_private_ideal = (inode->i_size + (sb->s_blocksize-1)) &
>> + ~(sb->s_blocksize-1);
>> + if (mmu_private_ideal < MSDOS_I(inode)->mmu_private &&
>> + filp->f_dentry->d_count == 1)
>> + fat_truncate_blocks(inode, inode->i_size);
>
Hi OGAWA.
> Without locking, truncate is racy.
I will check for races and provide locking.
>
> This choose ->release(). BTW, we would also be able to do this only
> ->evict_inode(), although I'm not thinking yet which one is better.
>
> If you had conclusion, it would be nice to explain it.
evict_inode() will be called only when we unlink the file or if inode
is evicted from cache.
As we discussed with you before, We considered preallocated blocks is
discarded on all close file cases(unlink and muliple openning file).
So we think it would be better to do this in ->release().
>
>> +static long fat_fallocate(struct file *file, int mode,
>> + loff_t offset, loff_t len)
>> +{
>>
>> + if ((offset + len) <= MSDOS_I(inode)->mmu_private) {
>> + fat_msg(sb, KERN_ERR,
>> + "fat_fallocate():Blocks already allocated");
>> + return -EINVAL;
>> + }
>
> Also this looks like totally racy.
Okay, I will fix it.

>
>> static int fat_write_begin(struct file *file, struct address_space
>> *mapping,
>> loff_t pos, unsigned len, unsigned flags,
>> struct page **pagep, void **fsdata)
>> {
>> int err;
>> + struct inode *inode = mapping->host;
>> + struct super_block *sb = inode->i_sb;
>> + loff_t mmu_private_actual = MSDOS_I(inode)->mmu_private;
>> + loff_t mmu_private_ideal = (inode->i_size + (sb->s_blocksize-1)) &
>> + ~(sb->s_blocksize-1);
>> +
>> + if ((mmu_private_actual > mmu_private_ideal) && (pos > inode->i_size))
>> {
>> + err = fat_zero_falloc_area(file, mapping, pos);
>> + if (err)
>> + fat_msg(sb, KERN_ERR, "error zeroing fallocated area");
>> + }
>>
>> *pagep = NULL;
>> err = cont_write_begin(file, mapping, pos, len, flags,
>
> Hm, only write_begin is enough to handle mmap, truncate, and etc.?
We had not checked these use cases. We will check these.

Thanks for review!

> Thanks.
> --
> OGAWA Hirofumi <[email protected]>
>

2013-03-11 15:01:42

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH v3] fat: editions to support fat_fallocate

Namjae Jeon <[email protected]> writes:

>> This choose ->release(). BTW, we would also be able to do this only
>> ->evict_inode(), although I'm not thinking yet which one is better.
>>
>> If you had conclusion, it would be nice to explain it.
> evict_inode() will be called only when we unlink the file or if inode
> is evicted from cache.
> As we discussed with you before, We considered preallocated blocks is
> discarded on all close file cases(unlink and muliple openning file).
> So we think it would be better to do this in ->release().

If so, probably, I didn't clear my opinion/suggestion, or misled
you. Sorry about it.

My opinion/suggestion is, "it should be before umount()".
I.e. fallocate() doesn't have any affect to FAT on clean state (clean
umount).

To clear my state, I don't have strong opinion about implementation yet.
For example, about ->release() or ->evict_inode().

So, if you had reason to use ->release() over "we discussed", it would
be good. (Or, if you still didn't have reasons, we would be better to
think about it)

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-03-12 08:29:10

by Namjae Jeon

[permalink] [raw]
Subject: Re: [PATCH v3] fat: editions to support fat_fallocate

2013/3/12, OGAWA Hirofumi <[email protected]>:
> Namjae Jeon <[email protected]> writes:
>
>>> This choose ->release(). BTW, we would also be able to do this only
>>> ->evict_inode(), although I'm not thinking yet which one is better.
>>>
>>> If you had conclusion, it would be nice to explain it.
>> evict_inode() will be called only when we unlink the file or if inode
>> is evicted from cache.
>> As we discussed with you before, We considered preallocated blocks is
>> discarded on all close file cases(unlink and muliple openning file).
>> So we think it would be better to do this in ->release().
>
> If so, probably, I didn't clear my opinion/suggestion, or misled
> you. Sorry about it.
>
> My opinion/suggestion is, "it should be before umount()".
> I.e. fallocate() doesn't have any affect to FAT on clean state (clean
> umount).
>
> To clear my state, I don't have strong opinion about implementation yet.
> For example, about ->release() or ->evict_inode().
>
> So, if you had reason to use ->release() over "we discussed", it would
> be good. (Or, if you still didn't have reasons, we would be better to
> think about it)
We considered all the possible points where we can release the
pre-allocated blocks.
As the "pre-allocation" is file based, we think it need to have
release mechanism.

In case of evict, Eviction will either happen when the file is removed
or the inode is evicted from the cache by memory pressure when no
references are present for that file.
It is good that evict will be triggered in umount. but there is a risk
that umount time can be increased.
And It show users inconsistent result. e.g sometime user can not get
file discarded fallocated blocks by memory pressure.

Let me know your opinion.

Thanks.
>
> Thanks.
> --
> OGAWA Hirofumi <[email protected]>
>

2013-03-12 09:07:52

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH v3] fat: editions to support fat_fallocate

Namjae Jeon <[email protected]> writes:

>> If so, probably, I didn't clear my opinion/suggestion, or misled
>> you. Sorry about it.
>>
>> My opinion/suggestion is, "it should be before umount()".
>> I.e. fallocate() doesn't have any affect to FAT on clean state (clean
>> umount).
>>
>> To clear my state, I don't have strong opinion about implementation yet.
>> For example, about ->release() or ->evict_inode().
>>
>> So, if you had reason to use ->release() over "we discussed", it would
>> be good. (Or, if you still didn't have reasons, we would be better to
>> think about it)
> We considered all the possible points where we can release the
> pre-allocated blocks.
> As the "pre-allocation" is file based, we think it need to have
> release mechanism.
>
> In case of evict, Eviction will either happen when the file is removed
> or the inode is evicted from the cache by memory pressure when no
> references are present for that file.
> It is good that evict will be triggered in umount. but there is a risk
> that umount time can be increased.
> And It show users inconsistent result. e.g sometime user can not get
> file discarded fallocated blocks by memory pressure.
>
> Let me know your opinion.

Yes. My personal opinion is almost same, but I see some trade-off, and
it is why I can't decide easily.

Although I don't care about inconsistency on umount time. Because
inconsistency window is opening while user is holding reference to
inode, the window can already be enough big, and inconsistent only
happens with crash. And if crashed, after all, both of ->release and
->evict requires fsck to recover FS.

Possibly longer umount time (batch modify) and longer close time (on
demand modify) are simply trade-off.

Predictable behavior would matter. Yes, evict time doesn't guarantee to
fallocate() is still available, however evict can be able to say it just
guarantees to available fallocated blocks between open() and
close(). Users have to call fallocate() again after close(), but
fallocate() may skip actual allocation if blocks still available.

I see optimization window with this evict behavior, because it can do
batch discard of fallocated blocks on multiple files, and update FAT at
once.

However, ->release() would be simpler, and would more predictable.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-03-12 09:38:44

by Namjae Jeon

[permalink] [raw]
Subject: Re: [PATCH v3] fat: editions to support fat_fallocate

2013/3/12, OGAWA Hirofumi <[email protected]>:
> Namjae Jeon <[email protected]> writes:
>
>>> If so, probably, I didn't clear my opinion/suggestion, or misled
>>> you. Sorry about it.
>>>
>>> My opinion/suggestion is, "it should be before umount()".
>>> I.e. fallocate() doesn't have any affect to FAT on clean state (clean
>>> umount).
>>>
>>> To clear my state, I don't have strong opinion about implementation yet.
>>> For example, about ->release() or ->evict_inode().
>>>
>>> So, if you had reason to use ->release() over "we discussed", it would
>>> be good. (Or, if you still didn't have reasons, we would be better to
>>> think about it)
>> We considered all the possible points where we can release the
>> pre-allocated blocks.
>> As the "pre-allocation" is file based, we think it need to have
>> release mechanism.
>>
>> In case of evict, Eviction will either happen when the file is removed
>> or the inode is evicted from the cache by memory pressure when no
>> references are present for that file.
>> It is good that evict will be triggered in umount. but there is a risk
>> that umount time can be increased.
>> And It show users inconsistent result. e.g sometime user can not get
>> file discarded fallocated blocks by memory pressure.
>>
>> Let me know your opinion.
>
> Yes. My personal opinion is almost same, but I see some trade-off, and
> it is why I can't decide easily.
>
> Although I don't care about inconsistency on umount time. Because
> inconsistency window is opening while user is holding reference to
> inode, the window can already be enough big, and inconsistent only
> happens with crash. And if crashed, after all, both of ->release and
> ->evict requires fsck to recover FS.
>
> Possibly longer umount time (batch modify) and longer close time (on
> demand modify) are simply trade-off.
>
> Predictable behavior would matter. Yes, evict time doesn't guarantee to
> fallocate() is still available, however evict can be able to say it just
> guarantees to available fallocated blocks between open() and
> close(). Users have to call fallocate() again after close(), but
> fallocate() may skip actual allocation if blocks still available.
>
> I see optimization window with this evict behavior, because it can do
> batch discard of fallocated blocks on multiple files, and update FAT at
> once.
>
> However, ->release() would be simpler, and would more predictable.
Yes, agreed so let’s prepare the solution with freeing blocks part of ->release.
This will also make solution align with the concept of normal default
reserved block allocation like ext2/XFS in which it allocates extra
blocks at initial write request and extra blocks are freed in
->release.
Once I will share the patch with all the review comments.

Thanks OGAWA :)
>
> Thanks.
> --
> OGAWA Hirofumi <[email protected]>
>