LinuxLists.cc - [PATCH RESEND v5] fat: editions to support fat

2013-04-28 00:08:39

Subject: [PATCH RESEND v5] fat: editions to support fat_fallocate

From: Namjae Jeon <[email protected]>

Implement preallocation via the fallocate syscall on VFAT partitions.

Change Log:
v5: change to avoid compilation warning:
fs/fat/inode.c: In function 'fat_zero_falloc_area':
>> fs/fat/inode.c:169:11: warning: comparison of distinct pointer
types lacks a cast [enabled by default]

v4: Rework based on review comments.
Add check in fat_setattr to release fallocated blocks on a truncate

v3: Release preallocated blocks at file release.

With FALLOC_FL_KEEP_SIZE, there is no way to distinguish if the
mismatch between i_size and no. of clusters allocated is a consequence
of fallocate or just plain corruption. When a non fallocate aware (old)
linux fat driver tries to write to such a file, it throws an error.Also,
fsck detects this as inconsistency and truncates the prealloc'd blocks.

To avoid this, as suggested by OGAWA, remove changes that make fallocate
persistent across mounts and restrict lifetime of blocks from
fallocate(2) to file release.

v2: On an area preallocated with FALLOC_FL_KEEP_SIZE, when a seek was
done to an offset beyond i_size, the old (garbage) data was exposed as
we did not zero out the area at allocation time. Added
fat_zero_falloc_area() to fix this.

v1: Reworked an earlier patch of the same name
(https://lkml.org/lkml/2007/12/22/130) to fix some bugs:
i) Preallocated space was not persistent and was lost on remount. Fixed
it.
ii) Did not zero out allocated clusters when FALLOC_FL_KEEP_SIZE was set,
thereby speeding up preallocation time.

Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Amit Sahrawat <[email protected]>
---
fs/fat/file.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
fs/fat/inode.c | 54 ++++++++++++++++++++++++++++
2 files changed, 161 insertions(+), 1 deletion(-)

diff --git a/fs/fat/file.c b/fs/fat/file.c
index b0b632e..7326439 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -17,8 +17,11 @@
#include <linux/blkdev.h>
#include <linux/fsnotify.h>
#include <linux/security.h>
+#include <linux/falloc.h>
#include "fat.h"

+static long fat_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t len);
static int fat_ioctl_get_attributes(struct inode *inode, u32 __user *user_attr)
{
u32 attr;
@@ -140,6 +143,22 @@ static long fat_generic_compat_ioctl(struct file *filp, unsigned int cmd,

static int fat_file_release(struct inode *inode, struct file *filp)
{
+
+ struct super_block *sb = inode->i_sb;
+ loff_t mmu_private_ideal;
+
+ /*
+ * Release unwritten fallocated blocks on file release.
+ * Do this only when the last open file descriptor is closed.
+ */
+ mutex_lock(&inode->i_mutex);
+ mmu_private_ideal = round_up(inode->i_size, sb->s_blocksize);
+
+ if (mmu_private_ideal < MSDOS_I(inode)->mmu_private &&
+ filp->f_dentry->d_count == 1)
+ fat_truncate_blocks(inode, inode->i_size);
+ mutex_unlock(&inode->i_mutex);
+
if ((filp->f_mode & FMODE_WRITE) &&
MSDOS_SB(inode->i_sb)->options.flush) {
fat_flush_inodes(inode->i_sb, inode, NULL);
@@ -174,6 +193,7 @@ const struct file_operations fat_file_operations = {
#endif
.fsync = fat_file_fsync,
.splice_read = generic_file_splice_read,
+ .fallocate = fat_fallocate,
};

static int fat_cont_expand(struct inode *inode, loff_t size)
@@ -212,6 +232,88 @@ out:
return err;
}

+/*
+ * Preallocate space for a file. This implements fat's fallocate file
+ * operation, which gets called from sys_fallocate system call. User
+ * space requests len bytes at offset. If FALLOC_FL_KEEP_SIZE is set
+ * we just allocate clusters without zeroing them out. Otherwise we
+ * allocate and zero out clusters via an expanding truncate. The
+ * allocated clusters are freed in fat_file_release().
+ */
+static long fat_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t len)
+{
+ int cluster, fclus, dclus;
+ int nr_cluster; /* Number of clusters to be allocated */
+ loff_t nr_bytes; /* Number of bytes to be allocated*/
+ loff_t free_bytes; /* Unused bytes in the last cluster of file*/
+ struct inode *inode = file->f_mapping->host;
+ struct super_block *sb = inode->i_sb;
+ struct msdos_sb_info *sbi = MSDOS_SB(sb);
+ int err = 0;
+
+ /* No support for hole punch or other fallocate flags. */
+ if (mode & ~FALLOC_FL_KEEP_SIZE)
+ return -EOPNOTSUPP;
+
+ mutex_lock(&inode->i_mutex);
+ if ((offset + len) <= MSDOS_I(inode)->mmu_private) {
+ fat_msg(sb, KERN_ERR,
+ "fat_fallocate(): Blocks already allocated");
+ err = -EINVAL;
+ goto error;
+ }
+
+ if (mode & FALLOC_FL_KEEP_SIZE) {
+ /* First compute the number of clusters to be allocated */
+ if (inode->i_size > 0) {
+ err = fat_get_cluster(inode, FAT_ENT_EOF,
+ &fclus, &dclus);
+ if (err < 0) {
+ fat_msg(sb, KERN_ERR,
+ "fat_fallocate(): fat_get_cluster() error");
+ goto error;
+ }
+ free_bytes = ((fclus + 1) << sbi->cluster_bits) -
+ inode->i_size;
+ nr_bytes = offset + len - inode->i_size - free_bytes;
+ MSDOS_I(inode)->mmu_private = (fclus + 1) <<
+ sbi->cluster_bits;
+ } else
+ nr_bytes = offset + len - inode->i_size;
+
+ nr_cluster = (nr_bytes + (sbi->cluster_size - 1)) >>
+ sbi->cluster_bits;
+
+ /* Start the allocation.We are not zeroing out the clusters */
+ while (nr_cluster-- > 0) {
+ err = fat_alloc_clusters(inode, &cluster, 1);
+ if (err) {
+ fat_msg(sb, KERN_ERR,
+ "fat_fallocate(): fat_alloc_clusters() error");
+ goto error;
+ }
+ err = fat_chain_add(inode, cluster, 1);
+ if (err) {
+ fat_free_clusters(inode, cluster);
+ goto error;
+ }
+ MSDOS_I(inode)->mmu_private += sbi->cluster_size;
+ }
+ } else {
+ /* This is just an expanding truncate */
+ err = fat_cont_expand(inode, (offset + len));
+ if (err) {
+ fat_msg(sb, KERN_ERR,
+ "fat_fallocate(): fat_cont_expand() error");
+ }
+ }
+
+error:
+ mutex_unlock(&inode->i_mutex);
+ return err;
+}
+
/* Free all clusters after the skip'th cluster. */
static int fat_free(struct inode *inode, int skip)
{
@@ -378,6 +480,9 @@ int fat_setattr(struct dentry *dentry, struct iattr *attr)
struct inode *inode = dentry->d_inode;
unsigned int ia_valid;
int error;
+ loff_t mmu_private_ideal;
+
+ mmu_private_ideal = round_up(inode->i_size, dentry->d_sb->s_blocksize);

/* Check for setting the inode time. */
ia_valid = attr->ia_valid;
@@ -403,7 +508,8 @@ int fat_setattr(struct dentry *dentry, struct iattr *attr)
if (attr->ia_valid & ATTR_SIZE) {
inode_dio_wait(inode);

- if (attr->ia_size > inode->i_size) {
+ if (attr->ia_size > inode->i_size &&
+ MSDOS_I(inode)->mmu_private <= mmu_private_ideal) {
error = fat_cont_expand(inode, attr->ia_size);
if (error || attr->ia_valid == ATTR_SIZE)
goto out;
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index dfce656..3300da0c 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -152,11 +152,65 @@ static void fat_write_failed(struct address_space *mapping, loff_t to)
}
}

+static int fat_zero_falloc_area(struct file *file,
+ struct address_space *mapping, loff_t pos)
+{
+ struct page *page;
+ struct inode *inode = mapping->host;
+ loff_t curpos = i_size_read(inode);
+ size_t count = pos - curpos;
+ int err;
+
+ do {
+ unsigned offset;
+ size_t bytes;
+ void *fsdata;
+
+ offset = (curpos & (PAGE_CACHE_SIZE - 1));
+ bytes = PAGE_CACHE_SIZE - offset;
+ bytes = min(bytes, count);
+
+ err = pagecache_write_begin(NULL, mapping, curpos, bytes,
+ AOP_FLAG_UNINTERRUPTIBLE,
+ &page, &fsdata);
+ if (err)
+ break;
+
+ zero_user(page, offset, bytes);
+
+ err = pagecache_write_end(NULL, mapping, curpos, bytes, bytes,
+ page, fsdata);
+ if (err < 0)
+ break;
+ curpos += bytes;
+ count -= bytes;
+ err = 0;
+ } while (count);
+
+ return err;
+}
+
static int fat_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
int err;
+ loff_t mmu_private_ideal, mmu_private_actual;
+ loff_t size;
+ struct inode *inode = mapping->host;
+ struct super_block *sb = inode->i_sb;
+
+ size = i_size_read(inode);
+ mmu_private_actual = MSDOS_I(inode)->mmu_private;
+ mmu_private_ideal = round_up(size, sb->s_blocksize);
+ if ((mmu_private_actual > mmu_private_ideal) && (pos > size)) {
+ err = fat_zero_falloc_area(file, mapping, pos);
+ if (err) {
+ fat_msg(sb, KERN_ERR,
+ "Error (%d) zeroing fallocated area", err);
+ return err;
+ }
+ }

*pagep = NULL;
err = cont_write_begin(file, mapping, pos, len, flags,
--
1.7.9.5

2013-04-29 14:31:40

by OGAWA Hirofumi

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

Namjae Jeon <[email protected]> writes:

I couldn't review fully though.

> + if (mmu_private_ideal < MSDOS_I(inode)->mmu_private &&
> + filp->f_dentry->d_count == 1)
> + fat_truncate_blocks(inode, inode->i_size);

Hm, why d_count == 1 check is needed? Feel strange and racy.

> + /* Start the allocation.We are not zeroing out the clusters */
> + while (nr_cluster-- > 0) {
> + err = fat_alloc_clusters(inode, &cluster, 1);

Why doesn't allocate clusters at once by fat_alloc_clusters()?

> + size = i_size_read(inode);
> + mmu_private_actual = MSDOS_I(inode)->mmu_private;
> + mmu_private_ideal = round_up(size, sb->s_blocksize);
> + if ((mmu_private_actual > mmu_private_ideal) && (pos > size)) {
> + err = fat_zero_falloc_area(file, mapping, pos);
> + if (err) {
> + fat_msg(sb, KERN_ERR,
> + "Error (%d) zeroing fallocated area", err);
> + return err;
> + }
> + }

This way probably inefficient. This would write data twice times (one is
zeroed, one is actual data). So, cpu time would be twice higher if
user uses fallocated, right?

Difference of fallocated area would be whether get_block() set
buffer_new() or not? If true, we should change get_block(), not
write_begin()?

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-05-01 04:14:16

by Namjae Jeon

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

2013/4/29, OGAWA Hirofumi <[email protected]>:
> Namjae Jeon <[email protected]> writes:
>
Hi OGAWA.
> I couldn't review fully though.
>
>> + if (mmu_private_ideal < MSDOS_I(inode)->mmu_private &&
>> + filp->f_dentry->d_count == 1)
>> + fat_truncate_blocks(inode, inode->i_size);
>
> Hm, why d_count == 1 check is needed? Feel strange and racy.
Since, fat_file_release() is called on every close for the file.
But we want to free up the reserved blocks only at the last reference
for the file exits.
So, we have used “d_count ==1” i.e., when there is only one reference
left for the file and it is being closed.
Then call the truncate blocks to free up the space.

>
>> + /* Start the allocation.We are not zeroing out the clusters */
>> + while (nr_cluster-- > 0) {
>> + err = fat_alloc_clusters(inode, &cluster, 1);
>
> Why doesn't allocate clusters at once by fat_alloc_clusters()?
It is because of default design, where we cannot allocate all the
clusters at once. For reference if we try to allocate all clusters at
once, it will trigger a BUG_ON in
fat_alloc_clusters()->
BUG_ON(nr_cluster > (MAX_BUF_PER_PAGE / 2)); /* fixed limit */
And we needed to update the fat chain after each allocation and take
care of the failure cases as well. So, we have done that sequential.
That optimization of allocating all clusters at once can be considered
as a separate changeline.
>
>> + size = i_size_read(inode);
>> + mmu_private_actual = MSDOS_I(inode)->mmu_private;
>> + mmu_private_ideal = round_up(size, sb->s_blocksize);
>> + if ((mmu_private_actual > mmu_private_ideal) && (pos > size)) {
>> + err = fat_zero_falloc_area(file, mapping, pos);
>> + if (err) {
>> + fat_msg(sb, KERN_ERR,
>> + "Error (%d) zeroing fallocated area", err);
>> + return err;
>> + }
>> + }
>
> This way probably inefficient. This would write data twice times (one is
> zeroed, one is actual data). So, cpu time would be twice higher if
> user uses fallocated, right?
We introduced the “zeroing out” after there was a comment regarding
the security loophole of accessing invalid data.
So, while doing fallocate we reserved the space. But, if there is a
request to access the pre-allocated space we zeroout the complete area
to avoid any security issue.

Let me know your opinion :)

Thanks.
>
> Difference of fallocated area would be whether get_block() set
> buffer_new() or not? If true, we should change get_block(), not
> write_begin()?
>
> Thanks.
> --
> OGAWA Hirofumi <[email protected]>
>

2013-05-01 06:42:17

by OGAWA Hirofumi

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

Namjae Jeon <[email protected]> writes:

> Hi OGAWA.

Hi,

>> I couldn't review fully though.
>>
>>> + if (mmu_private_ideal < MSDOS_I(inode)->mmu_private &&
>>> + filp->f_dentry->d_count == 1)
>>> + fat_truncate_blocks(inode, inode->i_size);
>>
>> Hm, why d_count == 1 check is needed? Feel strange and racy.
> Since, fat_file_release() is called on every close for the file.

What is wrong? IIRC, it is what you choose (i.e. for each last close for
the file descriptor).

> But we want to free up the reserved blocks only at the last reference
> for the file exits.
> So, we have used $B!H(Bd_count ==1$B!I(B i.e., when there is only one reference
> left for the file and it is being closed.
> Then call the truncate blocks to free up the space.

It probably doesn't work. E.g. if unlink(2) is grabbing refcount, then
close(2) may not be last referencer, right?

So, then, nobody truncates anymore.

>>> + /* Start the allocation.We are not zeroing out the clusters */
>>> + while (nr_cluster-- > 0) {
>>> + err = fat_alloc_clusters(inode, &cluster, 1);
>>
>> Why doesn't allocate clusters at once by fat_alloc_clusters()?
> It is because of default design, where we cannot allocate all the
> clusters at once. For reference if we try to allocate all clusters at
> once, it will trigger a BUG_ON in
> fat_alloc_clusters()->
> BUG_ON(nr_cluster > (MAX_BUF_PER_PAGE / 2)); /* fixed limit */
> And we needed to update the fat chain after each allocation and take
> care of the failure cases as well. So, we have done that sequential.
> That optimization of allocating all clusters at once can be considered
> as a separate changeline.

OK.

>>> + size = i_size_read(inode);
>>> + mmu_private_actual = MSDOS_I(inode)->mmu_private;
>>> + mmu_private_ideal = round_up(size, sb->s_blocksize);
>>> + if ((mmu_private_actual > mmu_private_ideal) && (pos > size)) {
>>> + err = fat_zero_falloc_area(file, mapping, pos);
>>> + if (err) {
>>> + fat_msg(sb, KERN_ERR,
>>> + "Error (%d) zeroing fallocated area", err);
>>> + return err;
>>> + }
>>> + }
>>
>> This way probably inefficient. This would write data twice times (one is
>> zeroed, one is actual data). So, cpu time would be twice higher if
>> user uses fallocated, right?
> We introduced the $B!H(Bzeroing out$B!I(B after there was a comment regarding
> the security loophole of accessing invalid data.
> So, while doing fallocate we reserved the space. But, if there is a
> request to access the pre-allocated space we zeroout the complete area
> to avoid any security issue.

I know. Question is, why do we need to initialize twice.

1) zeroed for uninitialized area, 2) then copy user data area. We need
only either, right? This seems to be doing both for all fallocated area.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-05-02 04:46:03

by Namjae Jeon

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

2013/5/1, OGAWA Hirofumi <[email protected]>:
> Namjae Jeon <[email protected]> writes:
>
>> Hi OGAWA.
>
> Hi,
>
>>> I couldn't review fully though.
>>>
>>>> + if (mmu_private_ideal < MSDOS_I(inode)->mmu_private &&
>>>> + filp->f_dentry->d_count == 1)
>>>> + fat_truncate_blocks(inode, inode->i_size);
>>>
>>> Hm, why d_count == 1 check is needed? Feel strange and racy.
>> Since, fat_file_release() is called on every close for the file.
>
> What is wrong? IIRC, it is what you choose (i.e. for each last close for
> the file descriptor).
Yes, this is what we had chosen after discussion. Freeing reserved
space point being the file release path.
But if there are multiple accessors for the file then file_release
will be called by each process.
Freeing the space in first call will result in wrong file attributes
for the other points. So, we needed a differentiation of last close
for the file.
Am I missing something ?

>
>> But we want to free up the reserved blocks only at the last reference
>> for the file exits.
>> So, we have used “d_count ==1” i.e., when there is only one reference
>> left for the file and it is being closed.
>> Then call the truncate blocks to free up the space.
>
> It probably doesn't work. E.g. if unlink(2) is grabbing refcount, then
> close(2) may not be last referencer, right?
Yes, Right. I will check :)

>
> So, then, nobody truncates anymore.
>
>>>> + /* Start the allocation.We are not zeroing out the clusters */
>>>> + while (nr_cluster-- > 0) {
>>>> + err = fat_alloc_clusters(inode, &cluster, 1);
>>>
>>> Why doesn't allocate clusters at once by fat_alloc_clusters()?
>> It is because of default design, where we cannot allocate all the
>> clusters at once. For reference if we try to allocate all clusters at
>> once, it will trigger a BUG_ON in
>> fat_alloc_clusters()->
>> BUG_ON(nr_cluster > (MAX_BUF_PER_PAGE / 2)); /* fixed limit */
>> And we needed to update the fat chain after each allocation and take
>> care of the failure cases as well. So, we have done that sequential.
>> That optimization of allocating all clusters at once can be considered
>> as a separate changeline.
>
> OK.
>
>>>> + size = i_size_read(inode);
>>>> + mmu_private_actual = MSDOS_I(inode)->mmu_private;
>>>> + mmu_private_ideal = round_up(size, sb->s_blocksize);
>>>> + if ((mmu_private_actual > mmu_private_ideal) && (pos > size)) {
>>>> + err = fat_zero_falloc_area(file, mapping, pos);
>>>> + if (err) {
>>>> + fat_msg(sb, KERN_ERR,
>>>> + "Error (%d) zeroing fallocated area", err);
>>>> + return err;
>>>> + }
>>>> + }
>>>
>>> This way probably inefficient. This would write data twice times (one is
>>> zeroed, one is actual data). So, cpu time would be twice higher if
>>> user uses fallocated, right?
>> We introduced the “zeroing out” after there was a comment regarding
>> the security loophole of accessing invalid data.
>> So, while doing fallocate we reserved the space. But, if there is a
>> request to access the pre-allocated space we zeroout the complete area
>> to avoid any security issue.
>
> I know. Question is, why do we need to initialize twice.
>
> 1) zeroed for uninitialized area, 2) then copy user data area. We need
> only either, right? This seems to be doing both for all fallocated area.
We did not initialize twice. We are using the ‘pos’ as the attribute
to define zeroing length in case of pre-allocation.
Zeroing out occurs till the ‘pos’ while actual write occur after ‘pos’.
If we file size is 100KB and we pre-allocated till 1MB. Next if we try
to write at 500KB,
Then zeroing out will occur only for 100KB->500KB, after that there
will be normal write. There is no duplication for the same space.

Let me know your opinion.

Thanks~
>
> Thanks.
> --
> OGAWA Hirofumi <[email protected]>
>

2013-05-02 05:12:58

by OGAWA Hirofumi

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

Namjae Jeon <[email protected]> writes:

>>>> Hm, why d_count == 1 check is needed? Feel strange and racy.
>>> Since, fat_file_release() is called on every close for the file.
>>
>> What is wrong? IIRC, it is what you choose (i.e. for each last close for
>> the file descriptor).
> Yes, this is what we had chosen after discussion. Freeing reserved
> space point being the file release path.
> But if there are multiple accessors for the file then file_release
> will be called by each process.
> Freeing the space in first call will result in wrong file attributes
> for the other points. So, we needed a differentiation of last close
> for the file.
> Am I missing something ?

Then, per-file discard fallocate space sounds like wrong. fallocate
space probably is inode attribute.

>> I know. Question is, why do we need to initialize twice.
>>
>> 1) zeroed for uninitialized area, 2) then copy user data area. We need
>> only either, right? This seems to be doing both for all fallocated area.
> We did not initialize twice. We are using the $B!F(Bpos$B!G(B as the attribute
> to define zeroing length in case of pre-allocation.
> Zeroing out occurs till the $B!F(Bpos$B!G(B while actual write occur after $B!F(Bpos$B!G(B.
> If we file size is 100KB and we pre-allocated till 1MB. Next if we try
> to write at 500KB,
> Then zeroing out will occur only for 100KB->500KB, after that there
> will be normal write. There is no duplication for the same space.

Ah. Then write_begin() really initialize after i_size until page cache
boudary for append write? I wonder if this patch works correctly for
mmap.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-05-02 06:12:58

by Namjae Jeon

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

2013/5/2, OGAWA Hirofumi <[email protected]>:
> Namjae Jeon <[email protected]> writes:
>
>>>>> Hm, why d_count == 1 check is needed? Feel strange and racy.
>>>> Since, fat_file_release() is called on every close for the file.
>>>
>>> What is wrong? IIRC, it is what you choose (i.e. for each last close for
>>> the file descriptor).
>> Yes, this is what we had chosen after discussion. Freeing reserved
>> space point being the file release path.
>> But if there are multiple accessors for the file then file_release
>> will be called by each process.
>> Freeing the space in first call will result in wrong file attributes
>> for the other points. So, we needed a differentiation of last close
>> for the file.
>> Am I missing something ?
>
> Then, per-file discard fallocate space sounds like wrong. fallocate
> space probably is inode attribute.
Since, our preallocation will not be persistent after umount. So, we
need to free up the space at some point.
If we consider for normal pre-allocation in ext4, in that case also
the blocks are removed in ext4_release_file when the last writer
closes the file.

ext4_release_file()
{
...
/* if we are the last writer on the inode, drop the block reservation */
if ((filp->f_mode & FMODE_WRITE) &&
(atomic_read(&inode->i_writecount) == 1) &&
!EXT4_I(inode)->i_reserved_data_blocks)
{
down_write(&EXT4_I(inode)->i_data_sem);
ext4_discard_preallocations(inode);
up_write(&EXT4_I(inode)->i_data_sem);
}

So, we will need to have this per file . May be the condition for
checking is wrong which can be correct but the correctness points
should be same. We can give a thought on using "i_writecount" for
controlling the parallel write in FAT also.
how do you think ?

>
>>> I know. Question is, why do we need to initialize twice.
>>>
>>> 1) zeroed for uninitialized area, 2) then copy user data area. We need
>>> only either, right? This seems to be doing both for all fallocated area.
>> We did not initialize twice. We are using the ‘pos’ as the attribute
>> to define zeroing length in case of pre-allocation.
>> Zeroing out occurs till the ‘pos’ while actual write occur after ‘pos’.
>> If we file size is 100KB and we pre-allocated till 1MB. Next if we try
>> to write at 500KB,
>> Then zeroing out will occur only for 100KB->500KB, after that there
>> will be normal write. There is no duplication for the same space.
>
> Ah. Then write_begin() really initialize after i_size until page cache
> boudary for append write? I wonder if this patch works correctly for
> mmap.
Since you already provided me review comments to check truncate and
mmap, we checked all points for those cases.

Thanks~
>
> Thanks.
> --
> OGAWA Hirofumi <[email protected]>
>

2013-05-02 07:36:28

by OGAWA Hirofumi

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

Namjae Jeon <[email protected]> writes:

>> Then, per-file discard fallocate space sounds like wrong. fallocate
>> space probably is inode attribute.
> Since, our preallocation will not be persistent after umount. So, we
> need to free up the space at some point.
> If we consider for normal pre-allocation in ext4, in that case also
> the blocks are removed in ext4_release_file when the last writer
> closes the file.
>
> ext4_release_file()
> {
> ...
> /* if we are the last writer on the inode, drop the block reservation */
> if ((filp->f_mode & FMODE_WRITE) &&
> (atomic_read(&inode->i_writecount) == 1) &&
> !EXT4_I(inode)->i_reserved_data_blocks)
> {
> down_write(&EXT4_I(inode)->i_data_sem);
> ext4_discard_preallocations(inode);
> up_write(&EXT4_I(inode)->i_data_sem);
> }
>
> So, we will need to have this per file . May be the condition for
> checking is wrong which can be correct but the correctness points
> should be same. We can give a thought on using "i_writecount" for
> controlling the parallel write in FAT also.
> how do you think ?

AFAIK, preallocation != fallocate. ext*'s preallocation was there at
before fallocation to optimize block allocation for user data blocks.

>>>> I know. Question is, why do we need to initialize twice.
>>>>
>>>> 1) zeroed for uninitialized area, 2) then copy user data area. We need
>>>> only either, right? This seems to be doing both for all fallocated area.
>>> We did not initialize twice. We are using the $B!F(Bpos$B!G(B as the attribute
>>> to define zeroing length in case of pre-allocation.
>>> Zeroing out occurs till the $B!F(Bpos$B!G(B while actual write occur after $B!F(Bpos$B!G(B.
>>> If we file size is 100KB and we pre-allocated till 1MB. Next if we try
>>> to write at 500KB,
>>> Then zeroing out will occur only for 100KB->500KB, after that there
>>> will be normal write. There is no duplication for the same space.
>>
>> Ah. Then write_begin() really initialize after i_size until page cache
>> boudary for append write? I wonder if this patch works correctly for
>> mmap.
> Since you already provided me review comments to check truncate and
> mmap, we checked all points for those cases.

cluster size == 512b

1) create new file
2) fallocate 100MB
3) write(2) data for each 512b

With this, write_begin() will be called for each 512b data. When we
allocates new page for this file, write_begin() writes data 0-512. Then,
we have to initialize 512-4096 by zero. Because mmap read maps 0-4096,
even if i_size == 512.

Who is initializing area for 512-4096?

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-05-02 07:43:19

by OGAWA Hirofumi

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

OGAWA Hirofumi <[email protected]> writes:

> Namjae Jeon <[email protected]> writes:
>
>>> Then, per-file discard fallocate space sounds like wrong. fallocate
>>> space probably is inode attribute.
>> Since, our preallocation will not be persistent after umount. So, we
>> need to free up the space at some point.
>> If we consider for normal pre-allocation in ext4, in that case also
>> the blocks are removed in ext4_release_file when the last writer
>> closes the file.
>>
>> ext4_release_file()
>> {
>> ...
>> /* if we are the last writer on the inode, drop the block reservation */
>> if ((filp->f_mode & FMODE_WRITE) &&
>> (atomic_read(&inode->i_writecount) == 1) &&
>> !EXT4_I(inode)->i_reserved_data_blocks)
>> {
>> down_write(&EXT4_I(inode)->i_data_sem);
>> ext4_discard_preallocations(inode);
>> up_write(&EXT4_I(inode)->i_data_sem);
>> }
>>
>> So, we will need to have this per file . May be the condition for
>> checking is wrong which can be correct but the correctness points
>> should be same. We can give a thought on using "i_writecount" for
>> controlling the parallel write in FAT also.
>> how do you think ?
>
> AFAIK, preallocation != fallocate. ext*'s preallocation was there at
> before fallocation to optimize block allocation for user data blocks.
>
>>>>> I know. Question is, why do we need to initialize twice.
>>>>>
>>>>> 1) zeroed for uninitialized area, 2) then copy user data area. We need
>>>>> only either, right? This seems to be doing both for all fallocated area.
>>>> We did not initialize twice. We are using the $B!F(Bpos$B!G(B as the attribute
>>>> to define zeroing length in case of pre-allocation.
>>>> Zeroing out occurs till the $B!F(Bpos$B!G(B while actual write occur after $B!F(Bpos$B!G(B.
>>>> If we file size is 100KB and we pre-allocated till 1MB. Next if we try
>>>> to write at 500KB,
>>>> Then zeroing out will occur only for 100KB->500KB, after that there
>>>> will be normal write. There is no duplication for the same space.
>>>
>>> Ah. Then write_begin() really initialize after i_size until page cache
>>> boudary for append write? I wonder if this patch works correctly for
>>> mmap.
>> Since you already provided me review comments to check truncate and
>> mmap, we checked all points for those cases.
>
> cluster size == 512b
>
> 1) create new file
> 2) fallocate 100MB
> 3) write(2) data for each 512b
>
> With this, write_begin() will be called for each 512b data. When we
> allocates new page for this file, write_begin() writes data 0-512. Then,
> we have to initialize 512-4096 by zero. Because mmap read maps 0-4096,
> even if i_size == 512.
>
> Who is initializing area for 512-4096?

>From other view, I guess fat_zero_falloc_area() is for filling zero for
0-10000, in the following case?

1) create new file
2) lseek(10000)
3) write data by write(2)

This job is for cont_write_begin(). If example is correct, why
cont_write_begin() doesn't work? I guess, because get_block() doesn't
set buffer_new() for those area.

If above is correct, right implement to change get_block().

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-05-02 09:15:24

by Namjae Jeon

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

2013/5/2, OGAWA Hirofumi <[email protected]>:
> OGAWA Hirofumi <[email protected]> writes:
>
>> Namjae Jeon <[email protected]> writes:
>>
>>>> Then, per-file discard fallocate space sounds like wrong. fallocate
>>>> space probably is inode attribute.
>>> Since, our preallocation will not be persistent after umount. So, we
>>> need to free up the space at some point.
>>> If we consider for normal pre-allocation in ext4, in that case also
>>> the blocks are removed in ext4_release_file when the last writer
>>> closes the file.
>>>
>>> ext4_release_file()
>>> {
>>> ...
>>> /* if we are the last writer on the inode, drop the block reservation */
>>> if ((filp->f_mode & FMODE_WRITE) &&
>>> (atomic_read(&inode->i_writecount) == 1) &&
>>> !EXT4_I(inode)->i_reserved_data_blocks)
>>> {
>>> down_write(&EXT4_I(inode)->i_data_sem);
>>> ext4_discard_preallocations(inode);
>>> up_write(&EXT4_I(inode)->i_data_sem);
>>> }
>>>
>>> So, we will need to have this per file . May be the condition for
>>> checking is wrong which can be correct but the correctness points
>>> should be same. We can give a thought on using "i_writecount" for
>>> controlling the parallel write in FAT also.
>>> how do you think ?
>>
>> AFAIK, preallocation != fallocate. ext*'s preallocation was there at
>> before fallocation to optimize block allocation for user data blocks.
yes, this is correct , preallocation!= fallocate, we just adopted only
the "release part" from that approach
Sorry, Would you mind to adopt this approach :) ?

>>
>>>>>> I know. Question is, why do we need to initialize twice.
>>>>>>
>>>>>> 1) zeroed for uninitialized area, 2) then copy user data area. We
>>>>>> need
>>>>>> only either, right? This seems to be doing both for all fallocated
>>>>>> area.
>>>>> We did not initialize twice. We are using the ‘pos’ as the attribute
>>>>> to define zeroing length in case of pre-allocation.
>>>>> Zeroing out occurs till the ‘pos’ while actual write occur after
>>>>> ‘pos’.
>>>>> If we file size is 100KB and we pre-allocated till 1MB. Next if we try
>>>>> to write at 500KB,
>>>>> Then zeroing out will occur only for 100KB->500KB, after that there
>>>>> will be normal write. There is no duplication for the same space.
>>>>
>>>> Ah. Then write_begin() really initialize after i_size until page cache
>>>> boudary for append write? I wonder if this patch works correctly for
>>>> mmap.
>>> Since you already provided me review comments to check truncate and
>>> mmap, we checked all points for those cases.
>>
>> cluster size == 512b
>>
>> 1) create new file
>> 2) fallocate 100MB
>> 3) write(2) data for each 512b
>>
>> With this, write_begin() will be called for each 512b data. When we
>> allocates new page for this file, write_begin() writes data 0-512. Then,
>> we have to initialize 512-4096 by zero. Because mmap read maps 0-4096,
>> even if i_size == 512.
>>
>> Who is initializing area for 512-4096?
>
> From other view, I guess fat_zero_falloc_area() is for filling zero for
> 0-10000, in the following case?
>
> 1) create new file
> 2) lseek(10000)
> 3) write data by write(2)
>
> This job is for cont_write_begin(). If example is correct, why
> cont_write_begin() doesn't work? I guess, because get_block() doesn't
> set buffer_new() for those area.
>
> If above is correct, right implement to change get_block().
We will check your case.

Thanks~
>
> Thanks.
> --
> OGAWA Hirofumi <[email protected]>
>

2013-05-02 09:34:14

by OGAWA Hirofumi

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

Namjae Jeon <[email protected]> writes:

>>> AFAIK, preallocation != fallocate. ext*'s preallocation was there at
>>> before fallocation to optimize block allocation for user data blocks.
> yes, this is correct , preallocation!= fallocate, we just adopted only
> the "release part" from that approach
> Sorry, Would you mind to adopt this approach :) ?

I'm ok as start if it works.

But from this discussion, discard at last close(2) doesn't look like
working for your requirement. Since you want to discard at last close of
inode, so, rather, I guess you actually want to discard at last
dereference of inode.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-05-02 09:56:20

by Namjae Jeon

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

2013/5/2, OGAWA Hirofumi <[email protected]>:
> Namjae Jeon <[email protected]> writes:
>
>>>> AFAIK, preallocation != fallocate. ext*'s preallocation was there at
>>>> before fallocation to optimize block allocation for user data blocks.
>> yes, this is correct , preallocation!= fallocate, we just adopted only
>> the "release part" from that approach
>> Sorry, Would you mind to adopt this approach :) ?
>
> I'm ok as start if it works.
>
> But from this discussion, discard at last close(2) doesn't look like
> working for your requirement. Since you want to discard at last close of
> inode, so, rather, I guess you actually want to discard at last
> dereference of inode.
Okay, I see. I will check your review points again.

Thanks OGAWA for review!
>
> Thanks.
> --
> OGAWA Hirofumi <[email protected]>
>

2013-06-18 05:47:12

by Namjae Jeon

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

Hi, OGAWA.

We checked several cases with respect to your questions. But we cannot
find any issue.
Also, We compare the results with Ext4. It is same.
>cluster size == 512b

>1) create new file
>2) fallocate 100MB
>3) write(2) data for each 512b

>With this, write_begin() will be called for each 512b data. When we
>allocates new page for this file, write_begin() writes data 0-512. Then,
>we have to initialize 512-4096 by zero. Because mmap read maps 0-4096,
>even if i_size == 512.

>Who is initializing area for 512-4096?
When we checked, we found that page which is allocated to mmap is
already filled with zeros. And 512byte is filled in this page. If we
try to read mmap beyond the file size of 512 bytes , nulls are
returned . Hence , mmap works correctly in such cases .

>From other view, I guess fat_zero_falloc_area() is for filling zero for
>0-10000, in the following case?

> 1) create new file
> 2) lseek(10000)
> 3) write data by write(2)

>This job is for cont_write_begin(). If example is correct, why
>cont_write_begin() doesn't work? I guess, because get_block() doesn't
>set buffer_new() for those area.
If we will not call fallocate with keep size flags, cont_write_begin
will work and zero out on lseek area (this works like expanding
truncate).

>If above is correct, right implement to change get_block().
Now when we try to write in the fallocated region ( with keep size) in
the fat_write_begin when it is called first time it checks that the
mismatch is present between the mmu_private and mmu_actual , and hence
zero out the region ; since buffer_new is not set for fallocated
region by fat_get_block , we explicitly zero out the lseeked region
using “fat_zero_falloc_area” and normal write occurs beyond that , and
i_size is updated accordingly , and as such there is no need to move
the code to fat_get_block .
>I'm ok as start if it works.
>
>But from this discussion, discard at last close(2) doesn't look like
>working for your requirement. Since you want to discard at last close of
>inode, so, rather, I guess you actually want to discard at last
>dereference of inode.
We also moved the release of fallocated clusters from the
“fat_file_release” to “fat_evict_inode” and on the basis of i_count so
that the fallocated clusters are released at the last reference of the
inode.

Thanks.

2013-06-18 17:33:08

by OGAWA Hirofumi

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

Namjae Jeon <[email protected]> writes:

>>If above is correct, right implement to change get_block().
> Now when we try to write in the fallocated region ( with keep size) in
> the fat_write_begin when it is called first time it checks that the
> mismatch is present between the mmu_private and mmu_actual , and hence
> zero out the region ; since buffer_new is not set for fallocated
> region by fat_get_block , we explicitly zero out the lseeked region
> using "fat_zero_falloc_area" and normal write occurs beyond that , and
> i_size is updated accordingly , and as such there is no need to move
> the code to fat_get_block .

OK. So, like I said, you *changed* the behavior of get_block() via
fallocate() change, right? (I think, now, you noticed fat_get_block()
was changed.) Since you changed the behavior of get_block(), you had to
hack write_begin(). (IMO, that patch is dirty hack to fix write_begin()
path only)

Likewise, you have to prove all callers of get_block() must work
collectedly with that change.

What happen on direct I/O, bmap ioctl, etc.? Well, anyway, please fix
the root cause of change of behavior.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2013-06-20 06:28:13

by Namjae Jeon

[permalink] [raw]

Subject: Re: [PATCH RESEND v5] fat: editions to support fat_fallocate

2013/6/19, OGAWA Hirofumi <[email protected]>:
> Namjae Jeon <[email protected]> writes:
>
>>>If above is correct, right implement to change get_block().
>> Now when we try to write in the fallocated region ( with keep size) in
>> the fat_write_begin when it is called first time it checks that the
>> mismatch is present between the mmu_private and mmu_actual , and hence
>> zero out the region ; since buffer_new is not set for fallocated
>> region by fat_get_block , we explicitly zero out the lseeked region
>> using "fat_zero_falloc_area" and normal write occurs beyond that , and
>> i_size is updated accordingly , and as such there is no need to move
>> the code to fat_get_block .
>
> OK. So, like I said, you *changed* the behavior of get_block() via
> fallocate() change, right? (I think, now, you noticed fat_get_block()
> was changed.) Since you changed the behavior of get_block(), you had to
> hack write_begin(). (IMO, that patch is dirty hack to fix write_begin()
> path only)
>
> Likewise, you have to prove all callers of get_block() must work
> collectedly with that change.
>
Hi OGAWA.
> What happen on direct I/O, bmap ioctl, etc.? Well, anyway, please fix
> the root cause of change of behavior.
Good point!
Yes, I didn't consider direct I/O and bmap yet. Sure, I will fix it.
Thanks for review :)
>
> Thanks.
> --
> OGAWA Hirofumi <[email protected]>
>