2012-04-17 16:47:21

by Zheng Liu

[permalink] [raw]
Subject: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

Hi list,

fallocate is a useful system call because it can preallocate disk blocks
for a file and keep those blocks contiguous. However, it has a defect: the
file system will convert an uninitialized extent to an initialized one when
the user writes data to the file, because the file system creates an
uninitialized extent when it preallocates blocks in fallocate (e.g. ext4).
In particular, it causes a severe degradation when the user does some
random write operations, which frequently modify the metadata of this file.
We hit this problem in our production systems at Taobao. Last month, at the
ext4 workshop, we discussed this problem, and Google faces the same one. So
a new flag, FALLOC_FL_NO_HIDE_STALE, is added in order to solve it.
When this flag is set, the file system will create an initialized extent for
the file, which avoids the conversion from uninitialized to initialized. Users
of this flag must guarantee that they have initialized the file themselves
before it is read at a given offset. The flag is added in the VFS so that
other file systems can also support it to improve performance.

I have made ext4 support this new flag, and run a simple test on my own
desktop to verify it. The machine has an Intel(R) Core(TM)2 Duo CPU E8400, 4G
memory and a WDC WD1600AAJS-75M0A0 160G SATA disk. I use the following script
to test the performance.

#!/bin/sh
mkfs.ext4 ${DEVICE}
mount -t ext4 ${DEVICE} ${TARGET}
fallocate -l 27262976 ${TARGET}/test # the size of the file is 256M (*)
time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/sda1/test_256M \
conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
2>/dev/null; done

* I wrote a wrapper program to call fallocate(2) with the FALLOC_FL_NO_HIDE_STALE
flag because the userspace tool doesn't support the new flag.

The result:
             w/o          w/
real         1m16.043s    0m17.946s    (-76.4%)
user         0m0.195s     0m0.192s     (-1.54%)
sys          0m0.468s     0m0.462s     (-1.28%)

Obviously, this flag brings a security issue: a malicious user could use it
to read another user's data if (s)he doesn't initialize the file before
reading it. Thus, a sysctl parameter, 'fs.falloc_no_hide_stale', is defined
to let the administrator determine whether this flag is enabled. Currently,
the flag is disabled by default. I am not sure whether this is enough.
Another option is to create a new Kconfig entry so that the flag can be
removed when the kernel is compiled. So any suggestions or comments are
appreciated.

Regards,
Zheng

Zheng Liu (3):
vfs: add FALLOC_FL_NO_HIDE_STALE flag in fallocate
vfs: add security check for _NO_HIDE_STALE flag
ext4: add FALLOC_FL_NO_HIDE_STALE support

fs/ext4/extents.c | 7 +++++--
fs/open.c | 12 +++++++++++-
include/linux/falloc.h | 5 +++++
include/linux/sysctl.h | 1 +
kernel/sysctl.c | 10 ++++++++++
5 files changed, 32 insertions(+), 3 deletions(-)


2012-04-17 16:53:36

by Zheng Liu

[permalink] [raw]
Subject: [RFC][PATCH 1/3] vfs: add FALLOC_FL_NO_HIDE_STALE flag in fallocate

From: Zheng Liu <[email protected]>

The FALLOC_FL_NO_HIDE_STALE flag is defined in fallocate to avoid the
uninitialized->initialized extent conversion.

Signed-off-by: Zheng Liu <[email protected]>
---
fs/open.c | 2 +-
include/linux/falloc.h | 5 +++++
2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 5720854..1d3117f 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -223,7 +223,7 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -EINVAL;

/* Return error if mode is not supported */
- if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ if (mode & ~FALLOC_FL_SUPPORTED_FLAGS)
return -EOPNOTSUPP;

/* Punch hole must have keep size set */
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 73e0b62..0d5d5aa 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -3,6 +3,11 @@

#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */
#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
+#define FALLOC_FL_NO_HIDE_STALE 0x04 /* no stale allocation */
+
+#define FALLOC_FL_SUPPORTED_FLAGS (FALLOC_FL_KEEP_SIZE | \
+ FALLOC_FL_PUNCH_HOLE | \
+ FALLOC_FL_NO_HIDE_STALE)

#ifdef __KERNEL__

--
1.7.4.1

2012-04-17 16:53:38

by Zheng Liu

[permalink] [raw]
Subject: [RFC][PATCH 3/3] ext4: add FALLOC_FL_NO_HIDE_STALE support

From: Zheng Liu <[email protected]>

When the FALLOC_FL_NO_HIDE_STALE flag is set, we create an initialized
extent directly.

Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/extents.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 1421938..efb5150 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4291,7 +4291,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -EOPNOTSUPP;

/* Return error if mode is not supported */
- if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ if (mode & ~FALLOC_FL_SUPPORTED_FLAGS)
return -EOPNOTSUPP;

if (mode & FALLOC_FL_PUNCH_HOLE)
@@ -4316,7 +4316,10 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
return ret;
}
- flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
+ if (mode & FALLOC_FL_NO_HIDE_STALE)
+ flags = EXT4_GET_BLOCKS_CREATE;
+ else
+ flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
if (mode & FALLOC_FL_KEEP_SIZE)
flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
/*
--
1.7.4.1

2012-04-17 16:53:37

by Zheng Liu

[permalink] [raw]
Subject: [RFC][PATCH 2/3] vfs: add security check for _NO_HIDE_STALE flag

From: Zheng Liu <[email protected]>

A new sysctl variable is added to enable/disable the _NO_HIDE_STALE flag
because the flag could be abused.

Signed-off-by: Zheng Liu <[email protected]>
---
fs/open.c | 10 ++++++++++
include/linux/sysctl.h | 1 +
kernel/sysctl.c | 10 ++++++++++
3 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 1d3117f..43fd39c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -213,6 +213,11 @@ SYSCALL_ALIAS(sys_ftruncate64, SyS_ftruncate64);
#endif
#endif /* BITS_PER_LONG == 32 */

+/*
+ * enable/disable FALLOC_FL_NO_HIDE_STALE flag
+ * 0: disable (default), 1: enable
+ */
+int sysctl_enable_falloc_no_hide_stale = 0;

int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
{
@@ -249,6 +254,11 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
if (ret)
return ret;

+ /* Check for enabling _NO_HIDE_STALE flag */
+ if (mode & FALLOC_FL_NO_HIDE_STALE &&
+ !sysctl_enable_falloc_no_hide_stale)
+ return -EPERM;
+
if (S_ISFIFO(inode->i_mode))
return -ESPIPE;

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index c34b4c8..484e5dc 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -820,6 +820,7 @@ enum
FS_AIO_NR=18, /* current system-wide number of aio requests */
FS_AIO_MAX_NR=19, /* system-wide maximum number of aio requests */
FS_INOTIFY=20, /* inotify submenu */
+ FS_FALLOC_STALE=21, /* enable fallocate _NO_HIDE_STALE flag */
FS_OCFS2=988, /* ocfs2 */
};

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4ab1187..37de7c2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -108,6 +108,7 @@ extern int percpu_pagelist_fraction;
extern int compat_log;
extern int latencytop_enabled;
extern int sysctl_nr_open_min, sysctl_nr_open_max;
+extern int sysctl_enable_falloc_no_hide_stale;
#ifndef CONFIG_MMU
extern int sysctl_nr_trim_pages;
#endif
@@ -1517,6 +1518,15 @@ static struct ctl_table fs_table[] = {
.proc_handler = &pipe_proc_fn,
.extra1 = &pipe_min_size,
},
+ {
+ .procname = "falloc_no_hide_stale",
+ .data = &sysctl_enable_falloc_no_hide_stale,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
{ }
};

--
1.7.4.1


2012-04-17 17:40:21

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On 4/17/12 11:53 AM, Zheng Liu wrote:
> Hi list,
>
> fallocate is a useful system call because it can preallocate disk blocks
> for a file and keep those blocks contiguous. However, it has a defect: the
> file system will convert an uninitialized extent to an initialized one when
> the user writes data to the file, because the file system creates an
> uninitialized extent when it preallocates blocks in fallocate (e.g. ext4).

That's a security-driven design feature, not a defect. :)

> In particular, it causes a severe degradation when the user does some
> random write operations, which frequently modify the metadata of this file.
> We hit this problem in our production systems at Taobao. Last month, at the
> ext4 workshop, we discussed this problem, and Google faces the same one. So
> a new flag, FALLOC_FL_NO_HIDE_STALE, is added in order to solve it.

Which then opens up severe security problems.

> When this flag is set, the file system will create an initialized extent for
> the file, which avoids the conversion from uninitialized to initialized. Users
> of this flag must guarantee that they have initialized the file themselves
> before it is read at a given offset. The flag is added in the VFS so that
> other file systems can also support it to improve performance.
>
> I have made ext4 support this new flag, and run a simple test on my own
> desktop to verify it. The machine has an Intel(R) Core(TM)2 Duo CPU E8400, 4G
> memory and a WDC WD1600AAJS-75M0A0 160G SATA disk. I use the following script
> to test the performance.
>
> #!/bin/sh
> mkfs.ext4 ${DEVICE}
> mount -t ext4 ${DEVICE} ${TARGET}
> fallocate -l 27262976 ${TARGET}/test # the size of the file is 256M (*)

That's only 26MB, but the below loop writes to a max offset of around
256M.

> time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/sda1/test_256M \
> conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> 2>/dev/null; done

You fallocate ${TARGET}/test but dd to /mnt/sda1/test_256M ? I presume
those should be the same file.

So the testcase as shown above seems fairly wrong, no? Is that what you
used for the numbers below?

> * I wrote a wrapper program to call fallocate(2) with the FALLOC_FL_NO_HIDE_STALE
> flag because the userspace tool doesn't support the new flag.
>
> The result:
>              w/o          w/
> real         1m16.043s    0m17.946s    (-76.4%)
> user         0m0.195s     0m0.192s     (-1.54%)
> sys          0m0.468s     0m0.462s     (-1.28%)

I think that the missing information here is some profiling to show where
the time was spent in the "w/o" case. What, exactly, in ext4 extent
management is so darned slow that we have to poke security holes in the
filesystem to get decent performance?

However, the above also seems like an alarmingly large difference, and
one that I can't attribute to unwritten extent conversion overhead.

If I test the seeky dd to a prewritten file (to eliminate extent
conversion):

# dd if=/dev/zero of=/mnt/scratch/test bs=1M count=256
# sync

vs. to a fallocated file (which requires extent conversion):

# fallocate -l 256m /mnt/scratch/test

and then do your seeky dd test after each of the above:

# time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/scratch/test \
conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
2>/dev/null; done

On ext4, I get about 59.9 seconds in the pre-written case, 65.2 seconds in the fallocated case.

On xfs, I get about 52.5 seconds in the pre-written case, 52.7 seconds in the fallocated case.

I don't see anywhere near the slowdown you show above, certainly
nothing bad enough to warrant opening the big security hole.
Am I missing something?

The ext4 delta is a bit larger, though, so it seems worth investigating
the *root cause* of the extra overhead if it's problematic in your
workloads...

-Eric




2012-04-17 17:59:42

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On 04/17/2012 12:53 PM, Zheng Liu wrote:
> Hi list,
>
> fallocate is a useful system call because it can preallocate disk blocks
> for a file and keep those blocks contiguous. However, it has a defect: the
> file system will convert an uninitialized extent to an initialized one when
> the user writes data to the file, because the file system creates an
> uninitialized extent when it preallocates blocks in fallocate (e.g. ext4).
> In particular, it causes a severe degradation when the user does some
> random write operations, which frequently modify the metadata of this file.
> We hit this problem in our production systems at Taobao. Last month, at the
> ext4 workshop, we discussed this problem, and Google faces the same one. So
> a new flag, FALLOC_FL_NO_HIDE_STALE, is added in order to solve it.
> When this flag is set, the file system will create an initialized extent for
> the file, which avoids the conversion from uninitialized to initialized. Users
> of this flag must guarantee that they have initialized the file themselves
> before it is read at a given offset. The flag is added in the VFS so that
> other file systems can also support it to improve performance.

I really, really don't like exposing stale data to users and applications.

This is something that no enterprise file system (or distribution) would be able
to support, and it totally breaks one of the age-old promises that file systems
have always made to applications: if you preallocate and don't write, you will
read back zeros.

Sounds like we are proposing the introduction of a huge security hole instead of
addressing the performance issue head on. Let's not punt on solving the design
challenge by relying on the inherent goodness of arbitrary users just yet, please!

You could get both security and avoid the run time hit by fully writing the file
or by having a variation that relied on "discard" (i.e., no need to zero data if
we can discard or track it as unwritten).

Ric




2012-04-17 18:43:11

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Tue, Apr 17, 2012 at 01:59:37PM -0400, Ric Wheeler wrote:
>
> You could get both security and avoid the run time hit by fully
> writing the file or by having a variation that relied on "discard"
> (i.e., no need to zero data if we can discard or track it as
> unwritten).

It's certainly the case that if the device supports persistent
discard, something which we definitely *should* do is to send the
discard at fallocate time and then mark the space as initialized.

Unfortunately, not all devices, and in particular no HDDs that I am
aware of, support persistent discard. And writing all zeros to the file
is in fact what a number of programs I am aware of (including an
enterprise database) do, precisely because they tend to write into the
fallocated space in a somewhat random order, and the extent conversion
cost is in fact quite significant. But writing all zeros to the file
before you can use it is quite costly; at the very least it burns disk
bandwidth --- one of the main motivations of fallocate was to avoid
needing a "write all zeros" pass, and while it does solve the problem
for some use cases (such as DVRs), it's not a complete solution.

Whether or not it is a security issue is debatable. If using the
fallocate flag requires CAP_SYS_RAWIO, and the process has to
explicitly ask for the privilege, a process with those privileges can
already access memory and I/O ports directly, via the ioperm(2) and
iopl(2) system calls. So I think it's possible to be a bit nuanced
over whether or not this is as horrible as you might think.

Ultimately, if there are application programmers who are really
desperate for that last bit of performance, they can always use
FIBMAP/FIEMAP and then read/write directly to the block device. (And
no, that's not a theoretical example.) I think it is a worthwhile
goal to provide file system interfaces that allow a trusted process
which has the appropriate security capabilities to do things in a
safer way than that.

Regards,

- Ted

2012-04-17 18:52:23

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On 04/17/2012 02:43 PM, Ted Ts'o wrote:
> On Tue, Apr 17, 2012 at 01:59:37PM -0400, Ric Wheeler wrote:
>> You could get both security and avoid the run time hit by fully
>> writing the file or by having a variation that relied on "discard"
>> (i.e., no need to zero data if we can discard or track it as
>> unwritten).
> It's certainly the case that if the device supports persistent
> discard, something which we definitely *should* do is to send the
> discard at fallocate time and then mark the space as initialized.

This should all be advertised in /sys/block/sda - definitely worth encouraging
this for devices. I think that the device-mapper "thin" target also supports
discard, so you could get this behaviour with all devices if needed.

>
> Unfortunately, not all devices, and in particular no HDDs that I am
> aware of, support persistent discard. And writing all zeros to the file
> is in fact what a number of programs I am aware of (including an
> enterprise database) do, precisely because they tend to write into the
> fallocated space in a somewhat random order, and the extent conversion
> cost is in fact quite significant. But writing all zeros to the file
> before you can use it is quite costly; at the very least it burns disk
> bandwidth --- one of the main motivations of fallocate was to avoid
> needing a "write all zeros" pass, and while it does solve the problem
> for some use cases (such as DVRs), it's not a complete solution.

We also have WRITE_SAME (with a default pattern of zero data), which has long
been used in SCSI to initialize data.

>
> Whether or not it is a security issue is debatable. If using the
> fallocate flag requires CAP_SYS_RAWIO, and the process has to
> explicitly ask for the privilege, a process with those privileges can
> already access memory and I/O ports directly, via the ioperm(2) and
> iopl(2) system calls. So I think it's possible to be a bit nuanced
> over whether or not this is as horrible as you might think.

We are still papering over an issue that seems to not be a challenge for XFS.

>
> Ultimately, if there are application programmers who are really
> desperate for that last bit of performance, they can always use
> FIBMAP/FIEMAP and then read/write directly to the block device. (And
> no, that's not a theoretical example.) I think it is a worthwhile
> goal to provide file system interfaces that allow a trusted process
> which has the appropriate security capabilities to do things in a
> safer way than that.
>

I would prefer to let the very few crazy application programmers who need this
do insane things themselves instead of exposing stale data to these applications.

Or have them use a different file system that does not have this same penalty
(or not to the same degree).

Thanks!

Ric


2012-04-17 18:53:27

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On 4/17/12 1:43 PM, Ted Ts'o wrote:
> On Tue, Apr 17, 2012 at 01:59:37PM -0400, Ric Wheeler wrote:
>>
>> You could get both security and avoid the run time hit by fully
>> writing the file or by having a variation that relied on "discard"
>> (i.e., no need to zero data if we can discard or track it as
>> unwritten).
>
> It's certainly the case that if the device supports persistent
> discard, something which we definitely *should* do is to send the
> discard at fallocate time and then mark the space as initialized.
>
> Unfortunately, not all devices, and in particular no HDDs that I am
> aware of, support persistent discard. And writing all zeros to the file
> is in fact what a number of programs I am aware of (including an
> enterprise database) do, precisely because they tend to write into the
> fallocated space in a somewhat random order, and the extent conversion
> cost is in fact quite significant. But writing all zeros to the file
> before you can use it is quite costly; at the very least it burns disk
> bandwidth --- one of the main motivations of fallocate was to avoid
> needing a "write all zeros" pass, and while it does solve the problem
> for some use cases (such as DVRs), it's not a complete solution.

Can we please start with profiling the workload causing trouble, see why
ext4 takes such a hit, and see if anything can be done there to fix
it surgically, rather than just throwing this big hammer at it?

In my (admittedly quick, hacky) test, xfs suffered about a 1% perf degradation,
ext4 about 8%. Until we at least know why ext4 is so much worse, I'll
signal a strong NAK for this change, for whatever that may or may not be worth. :)

-Eric

2012-04-17 19:05:02

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Tue, Apr 17, 2012 at 01:53:20PM -0500, Eric Sandeen wrote:
>
> In my (admittedly quick, hacky) test, xfs suffered about a 1% perf
> degradation, ext4 about 8%. Until we at least know why ext4 is so
> much worse, I'll signal a strong NAK for this change, for whatever
> that may or may not be worth. :)

Was this on an HDD? Try it on a PCIe-attached flash device that
doesn't support persistent discard. :-) I suspect in that case even
with XFS you will see a bigger performance penalty than that.

That being said, I agree that it would be good to see if we can
improve ext4's performance without the big hammer, regardless of
whether or not we can bring the uninit->init overhead down to zero.
Some of the overhead, I suspect, is due to the fact that ext4 is doing
physical block journalling, which we won't be able to work around, but
it may be that there is room for improvement in the extent manipulation
codepaths and how they get called from the I/O completion handler.

(Oh, and that's assuming you were using DIO or AIO; if you were using
buffered writes, you were probably getting hit by the ordered mode
flush requirements, since in buffered mode we do the uninit->init
conversion before we write out the data block, so we have to do an
implicit fsync in the commit transaction. Fixing that is something
we've talked about before, and that's certainly worth doing; it
requires the I/O tree work that we talked about at the ext4 workshop
as a prerequisite, though.)

- Ted

2012-04-18 03:02:13

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Tue, Apr 17, 2012 at 01:53:20PM -0500, Eric Sandeen wrote:
> On 4/17/12 1:43 PM, Ted Ts'o wrote:
> > On Tue, Apr 17, 2012 at 01:59:37PM -0400, Ric Wheeler wrote:
> >>
> >> You could get both security and avoid the run time hit by fully
> >> writing the file or by having a variation that relied on "discard"
> >> (i.e., no need to zero data if we can discard or track it as
> >> unwritten).
> >
> > It's certainly the case that if the device supports persistent
> > discard, something which we definitely *should* do is to send the
> > discard at fallocate time and then mark the space as initialized.
> >
> > Unfortunately, not all devices, and in particular no HDDs that I am
> > aware of, support persistent discard. And writing all zeros to the file
> > is in fact what a number of programs I am aware of (including an
> > enterprise database) do, precisely because they tend to write into the
> > fallocated space in a somewhat random order, and the extent conversion
> > cost is in fact quite significant. But writing all zeros to the file
> > before you can use it is quite costly; at the very least it burns disk
> > bandwidth --- one of the main motivations of fallocate was to avoid
> > needing a "write all zeros" pass, and while it does solve the problem
> > for some use cases (such as DVRs), it's not a complete solution.
>
> Can we please start with profiling the workload causing trouble, see why
> ext4 takes such a hit, and see if anything can be done there to fix
> it surgically, rather than just throwing this big hammer at it?
>
> In my (admittedly quick, hacky) test, xfs suffered about a 1% perf degradation,
> ext4 about 8%. Until we at least know why ext4 is so much worse, I'll
> signal a strong NAK for this change, for whatever that may or may not be worth. :)

In actual fact, on my 12 disk RAID0 array, XFS is faster with
unwritten extents *enabled* than when hacked to turn them off. Yes,
you can turn off unwritten extent tracking in XFS if you know what
you are doing, we just don't provide any interfaces to users to do
so because of all the security problems it entails.

The results (using a 256MB prealloc file, 2000 sparse 4k block writes,
one run with O_SYNC, the other done async with a post-write sync), with
averages over 5 runs, are:

             O_SYNC     post-sync
unwritten    7.297s     5.734s
stale        7.641s     6.108s

These results are consistently repeatable, and only reinforce the
point that if ext4 is slow using unwritten extent tracking, then
it's an implementation problem and not an excuse to add an interface
to expose stale data....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-04-18 04:02:55

by Zheng Liu

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Tue, Apr 17, 2012 at 12:40:14PM -0500, Eric Sandeen wrote:
> On 4/17/12 11:53 AM, Zheng Liu wrote:
> > Hi list,
> >
> > fallocate is a useful system call because it can preallocate disk blocks
> > for a file and keep those blocks contiguous. However, it has a defect: the
> > file system will convert an uninitialized extent to an initialized one when
> > the user writes data to the file, because the file system creates an
> > uninitialized extent when it preallocates blocks in fallocate (e.g. ext4).
>
> That's a security-driven design feature, not a defect. :)
>
> > In particular, it causes a severe degradation when the user does some
> > random write operations, which frequently modify the metadata of this file.
> > We hit this problem in our production systems at Taobao. Last month, at the
> > ext4 workshop, we discussed this problem, and Google faces the same one. So
> > a new flag, FALLOC_FL_NO_HIDE_STALE, is added in order to solve it.
>
> Which then opens up severe security problems.
>
> > When this flag is set, the file system will create an initialized extent for
> > the file, which avoids the conversion from uninitialized to initialized. Users
> > of this flag must guarantee that they have initialized the file themselves
> > before it is read at a given offset. The flag is added in the VFS so that
> > other file systems can also support it to improve performance.
> >
> > I have made ext4 support this new flag, and run a simple test on my own
> > desktop to verify it. The machine has an Intel(R) Core(TM)2 Duo CPU E8400, 4G
> > memory and a WDC WD1600AAJS-75M0A0 160G SATA disk. I use the following script
> > to test the performance.
> >
> > #!/bin/sh
> > mkfs.ext4 ${DEVICE}
> > mount -t ext4 ${DEVICE} ${TARGET}
> > fallocate -l 27262976 ${TARGET}/test # the size of the file is 256M (*)
>
> That's only 26MB, but the below loop writes to a max offset of around
> 256M.

Yes, you are right. I meant to preallocate a file that is 256MB.

>
> > time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/sda1/test_256M \
> > conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> > 2>/dev/null; done
>
> You fallocate ${TARGET}/test but dd to /mnt/sda1/test_256M ? I presume
> those should be the same file.

Yes, it is the same file.

>
> So the testcase as shown above seems fairly wrong, no? Is that what you
> used for the numbers below?
>
> > * I wrote a wrapper program to call fallocate(2) with the FALLOC_FL_NO_HIDE_STALE
> > flag because the userspace tool doesn't support the new flag.
> >
> > The result:
> >              w/o          w/
> > real         1m16.043s    0m17.946s    (-76.4%)
> > user         0m0.195s     0m0.192s     (-1.54%)
> > sys          0m0.468s     0m0.462s     (-1.28%)
>
> I think that the missing information here is some profiling to show where
> the time was spent in the "w/o" case. What, exactly, in ext4 extent
> management is so darned slow that we have to poke security holes in the
> filesystem to get decent performance?
>
> However, the above also seems like an alarmingly large difference, and
> one that I can't attribute to unwritten extent conversion overhead.
>
> If I test the seeky dd to a prewritten file (to eliminate extent
> conversion):
>
> # dd if=/dev/zero of=/mnt/scratch/test bs=1M count=256
> # sync
>
> vs. to a fallocated file (which requires extent conversion):
>
> # fallocate -l 256m /mnt/scratch/test
>
> and then do your seeky dd test after each of the above:
>
> # time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/scratch/test \
> conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> 2>/dev/null; done
>
> On ext4, I get about 59.9 seconds in the pre-written case, 65.2 seconds in the fallocated case.
>
> On xfs, I get about 52.5 seconds in the pre-written case, 52.7 seconds in the fallocated case.
>
> I don't see anywhere near the slowdown you show above, certainly
> nothing bad enough to warrant opening the big security hole.
> Am I missing something?

I will run more detailed benchmarks to trace this issue. When I have
the latest results, I will send a new mail to let you know. :)

I fully understand that this flag brings a security hole, and I
totally agree that we should dig out the *root cause* in ext4. But,
IMHO, a proper interface, gated by appropriate capabilities, might
still be useful for users.
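
As for the wrapper program mentioned above, it is not included in the thread. A minimal sketch of how such a wrapper might reach fallocate(2) with an extra mode flag is shown below; treat the 0x04 flag value as an assumption (it would come from the patched falloc.h), the 16 MiB length is arbitrary (the real test used 256MB), and the code is Linux-specific. On an unpatched kernel the flag is rejected, so the sketch falls back to an ordinary preallocation.

```python
import ctypes
import ctypes.util
import os
import tempfile

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                           ctypes.c_longlong, ctypes.c_longlong]

FALLOC_FL_NO_HIDE_STALE = 0x04   # assumed value from the patch series
LENGTH = 16 * 1024 * 1024        # small stand-in for the 256MB test file

def fallocate(fd, mode, offset, length):
    # Thin wrapper over the raw fallocate(2) syscall via glibc.
    if libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

fd, path = tempfile.mkstemp()
try:
    try:
        # With the patch applied (and the sysctl enabled), this would
        # preallocate *initialized* extents over possibly stale blocks.
        fallocate(fd, FALLOC_FL_NO_HIDE_STALE, 0, LENGTH)
    except OSError:
        # Kernels without the patch reject the unknown flag; fall back
        # to a plain preallocation, which uses uninitialized extents.
        os.posix_fallocate(fd, 0, LENGTH)
    size = os.fstat(fd).st_size
finally:
    os.close(fd)
    os.unlink(path)

print(size)
```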

Regards,
Zheng

>
> The ext4 delta is a bit larger, though, so it seems worth investigating
> the *root cause* of the extra overhead if it's problematic in your
> workloads...
>
> -Eric
>
>
> > Obviously, this flag brings a security issue, because a malicious user
> > could use it to get another user's data if (s)he doesn't do an
> > initialization before reading the file. Thus, a sysctl parameter,
> > 'fs.falloc_no_hide_stale', is defined to let the administrator determine
> > whether or not this flag is enabled. Currently, the flag is disabled by
> > default. I am not sure whether this is enough. Another option is to
> > add a new Kconfig entry so that the flag can be removed when the kernel is
> > compiled. Any suggestions or comments are appreciated.
> >
> > Regards,
> > Zheng
> >
> > Zheng Liu (3):
> > vfs: add FALLOC_FL_NO_HIDE_STALE flag in fallocate
> > vfs: add security check for _NO_HIDE_STALE flag
> > ext4: add FALLOC_FL_NO_HIDE_STALE support
> >
> > fs/ext4/extents.c | 7 +++++--
> > fs/open.c | 12 +++++++++++-
> > include/linux/falloc.h | 5 +++++
> > include/linux/sysctl.h | 1 +
> > kernel/sysctl.c | 10 ++++++++++
> > 5 files changed, 32 insertions(+), 3 deletions(-)
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2012-04-18 04:59:37

by Andreas Dilger

Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On 2012-04-17, at 10:40, Eric Sandeen <[email protected]> wrote:
> On 4/17/12 11:53 AM, Zheng Liu wrote:
>>
>> fallocate is a useful system call because it can preallocate some disk blocks
>> for a file and keep the blocks contiguous. However, it has a defect: the file
>> system will convert an uninitialized extent to an initialized one when the user
>> wants to write some data to the file, because the file system creates an
>> uninitialized extent when it preallocates blocks in fallocate (e.g. ext4).
>
> That's a security-driven design feature, not a defect. :)
>
>> I tried to make ext4 support this new flag, and ran a simple test on my own
>> desktop to verify it. The machine has an Intel(R) Core(TM)2 Duo CPU E8400, 4GB of
>> memory and a WDC WD1600AAJS-75M0A0 160GB SATA disk. I use the following script
>> to test the performance.
>>
>> #!/bin/sh
>> mkfs.ext4 ${DEVICE}
>> mount -t ext4 ${DEVICE} ${TARGET}
>> fallocate -l 27262976 ${TARGET}/test # the size of the file is 256M (*)
>
> That's only 26MB, but the below loop writes to a max offset of around
> 256M.
>
>> time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/sda1/test_256M \
>> conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
>> 2>/dev/null; done
>
> You fallocate ${TARGET}/test but dd to /mnt/sda1/test_256M ? I presume
> those should be the same file.
>
> So the testcase as shown above seems fairly wrong, no? Is that what you
> used for the numbers below?

I suspect the real problem may be the lazy inode table zeroing being done in the kernel. If the test is run immediately after formatting the filesystem, then the kernel thread is busy writing to the disk in the background and interfering with your benchmark.

Please format the filesystem with "-E lazy_itable_init=0" to avoid this behavior.

Secondly, your test program is not doing random writes to disk, but rather writes at 64kB intervals. There is logic in the uninitialized extent handling that will write zeros to an entire extent rather than create many fragmented uninitialized extents. It may be that you are zeroing out the entire file, and writing 16x as much data as you expect.
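
A quick back-of-the-envelope check of that last point, using the dd parameters from the script. This is only arithmetic on the write pattern, not a claim about what ext4 actually did in this particular run:

```python
BLOCK_SIZE = 4096        # dd bs=4k, count=1
STRIDE_BLOCKS = 16       # dd seek=`expr $i \* 16`
ITERATIONS = 2000        # loop bound in the script

# Data dd actually submits: 2000 writes of 4 KiB each.
requested = ITERATIONS * BLOCK_SIZE

# Data written if each 64 KiB neighbourhood is zero-filled instead of
# splitting the uninitialized extent around the 4 KiB write.
zeroed = ITERATIONS * STRIDE_BLOCKS * BLOCK_SIZE

print(requested)              # 8192000 bytes (~7.8 MiB)
print(zeroed)                 # 131072000 bytes (~125 MiB)
print(zeroed // requested)    # 16x amplification, as noted above
```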

Cheers, Andreas

>> * I write a wrapper program to call fallocate(2) with FALLOC_FL_NO_HIDE_STALE
>> flag because the userspace tool doesn't support the new flag.
>>
>> The result:
>> w/o w/
>> real 1m16.043s 0m17.946s -76.4%
>> user 0m0.195s 0m0.192s -1.54%
>> sys 0m0.468s 0m0.462s -1.28%
>
> I think that the missing information here is some profiling to show where
> the time was spent in the "w/o" case. What, exactly, in ext4 extent
> management is so darned slow that we have to poke security holes in the
> filesystem to get decent performance?
>
> However, the above also seems like an alarmingly large difference, and
> one that I can't attribute to unwritten extent conversion overhead.
>
> If I test the seeky dd to a prewritten file (to eliminate extent
> conversion):
>
> # dd if=/dev/zero of=/mnt/scratch/test bs=1M count=256
> # sync
>
> vs. to a fallocated file (which requires extent conversion):
>
> # fallocate -l 256m /mnt/scratch/test
>
> and then do your seeky dd test after each of the above:
>
> # time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/scratch/test \
> conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> 2>/dev/null; done
>
> On ext4, I get about 59.9 seconds in the pre-written case, 65.2 seconds in the fallocated case.
>
> On xfs, I get about 52.5 seconds in the pre-written case, 52.7 seconds in the fallocated case.
>
> I don't see anywhere near the slowdown you show above, certainly
> nothing bad enough to warrant opening the big security hole.
> Am I missing something?
>
> The ext4 delta is a bit larger, though, so it seems worth investigating
> the *root cause* of the extra overhead if it's problematic in your
> workloads...
>
> -Eric
>
>
>> Obviously, this flag brings a security issue, because a malicious user
>> could use it to get another user's data if (s)he doesn't do an
>> initialization before reading the file. Thus, a sysctl parameter,
>> 'fs.falloc_no_hide_stale', is defined to let the administrator determine
>> whether or not this flag is enabled. Currently, the flag is disabled by
>> default. I am not sure whether this is enough. Another option is to
>> add a new Kconfig entry so that the flag can be removed when the kernel is
>> compiled. Any suggestions or comments are appreciated.
>>
>> Regards,
>> Zheng
>>
>> Zheng Liu (3):
>> vfs: add FALLOC_FL_NO_HIDE_STALE flag in fallocate
>> vfs: add security check for _NO_HIDE_STALE flag
>> ext4: add FALLOC_FL_NO_HIDE_STALE support
>>
>> fs/ext4/extents.c | 7 +++++--
>> fs/open.c | 12 +++++++++++-
>> include/linux/falloc.h | 5 +++++
>> include/linux/sysctl.h | 1 +
>> kernel/sysctl.c | 10 ++++++++++
>> 5 files changed, 32 insertions(+), 3 deletions(-)

2012-04-18 07:48:45

by Lukas Czerner

Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Wed, 18 Apr 2012, Zheng Liu wrote:

> On Tue, Apr 17, 2012 at 12:40:14PM -0500, Eric Sandeen wrote:
> > On 4/17/12 11:53 AM, Zheng Liu wrote:
> > > Hi list,
> > >
> > > fallocate is a useful system call because it can preallocate some disk blocks
> > > for a file and keep the blocks contiguous. However, it has a defect: the file
> > > system will convert an uninitialized extent to an initialized one when the user
> > > wants to write some data to the file, because the file system creates an
> > > uninitialized extent when it preallocates blocks in fallocate (e.g. ext4).
> >
> > That's a security-driven design feature, not a defect. :)

Exactly! It really surprises me that you call it a defect. Moreover, the
name you've chosen for the flag is IMO wrong. With a security hole like
this you really want to give the user a very strong sense that he is NOT
supposed to use it unless he knows what he is doing. Maybe something like
FALLOC_FL_EXPOSE_USER_DATA, because that is what it does.

> >
> > > Especially, it causes a severe degradation when the user does some
> > > random write operations, which frequently modify the metadata of this file.
> > > We met this problem in our production system at Taobao. Last month, at the ext4
> > > workshop, we discussed this problem, and Google faces the same problem. So
> > > a new flag, FALLOC_FL_NO_HIDE_STALE, is added in order to solve this problem.
> >
> > Which then opens up severe security problems.
> >
> > > When this flag is set, the file system will create an initialized extent for this
> > > file, so it avoids the conversion from uninitialized to initialized. If users
> > > want to use this flag, they must guarantee that the file has been initialized by
> > > themselves before it is read at the same offset. This flag is added in the VFS so
> > > that other file systems can also support it to improve performance.
> > >
> > > I tried to make ext4 support this new flag, and ran a simple test on my own
> > > desktop to verify it. The machine has an Intel(R) Core(TM)2 Duo CPU E8400, 4GB of
> > > memory and a WDC WD1600AAJS-75M0A0 160GB SATA disk. I use the following script
> > > to test the performance.
> > >
> > > #!/bin/sh
> > > mkfs.ext4 ${DEVICE}
> > > mount -t ext4 ${DEVICE} ${TARGET}
> > > fallocate -l 27262976 ${TARGET}/test # the size of the file is 256M (*)
> >
> > That's only 26MB, but the below loop writes to a max offset of around
> > 256M.
>
> Yes, you are right. I preallocate a file that is 256MB.
>
> >
> > > time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/sda1/test_256M \
> > > conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> > > 2>/dev/null; done
> >
> > You fallocate ${TARGET}/test but dd to /mnt/sda1/test_256M ? I presume
> > those should be the same file.
>
> Yes, it is the same file.

But it is not the same file in your test case. Does that mean your test
case is wrong?

>
> >
> > So the testcase as shown above seems fairly wrong, no? Is that what you
> > used for the numbers below?
> >
> > > * I write a wrapper program to call fallocate(2) with FALLOC_FL_NO_HIDE_STALE
> > > flag because the userspace tool doesn't support the new flag.
> > >
> > > The result:
> > > w/o w/
> > > real 1m16.043s 0m17.946s -76.4%
> > > user 0m0.195s 0m0.192s -1.54%
> > > sys 0m0.468s 0m0.462s -1.28%
> >
> > I think that the missing information here is some profiling to show where
> > the time was spent in the "w/o" case. What, exactly, in ext4 extent
> > management is so darned slow that we have to poke security holes in the
> > filesystem to get decent performance?
> >
> > However, the above also seems like an alarmingly large difference, and
> > one that I can't attribute to unwritten extent conversion overhead.
> >
> > If I test the seeky dd to a prewritten file (to eliminate extent
> > conversion):
> >
> > # dd if=/dev/zero of=/mnt/scratch/test bs=1M count=256
> > # sync
> >
> > vs. to a fallocated file (which requires extent conversion):
> >
> > # fallocate -l 256m /mnt/scratch/test
> >
> > and then do your seeky dd test after each of the above:
> >
> > # time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/scratch/test \
> > conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> > 2>/dev/null; done
> >
> > On ext4, I get about 59.9 seconds in the pre-written case, 65.2 seconds in the fallocated case.
> >
> > On xfs, I get about 52.5 seconds in the pre-written case, 52.7 seconds in the fallocated case.
> >
> > I don't see anywhere near the slowdown you show above, certainly
> > nothing bad enough to warrant opening the big security hole.
> > Am I missing something?
>
> I will run more detailed benchmarks to trace this issue. If I have a
> lastest result, I will send a new mail to let you know. :)

Yes, please do. I do not think that simply doing 'time dd' is enough in
this case. We need to know what is happening inside the kernel to see
what is causing the slowdown. It may very well be that we are able
to fix it, rather than introduce a crazy security hole to bypass the
problem.

>
> I fully understand that this flag will bring a security hole, and I
> totally agree that we should dig *root cause* in ext4. But, IMHO, a
> proper interface, which is limited by a proper capabilities, might be
> useful for the user.

No, you're not getting the point. In the kernel (or fs) we are supposed to
prevent security problems as much as possible, and it is
indisputably the job of the file system to prevent reading data which is
NOT in the file system and might have belonged to another user. If you
want to do this, just use a raw device.

Limiting this with proper capabilities is not nearly enough, for tons of
reasons. Applications have bugs, which may very well cause stale data
to be exposed directly to a user of the application who might not
even have root capabilities. Consider a buggy database returning stale
data (possibly deleted secret information) from one user to
another; it _is_ possible, and things like that _will_ happen with
this "feature". What is even worse is that in this case the application
writers actually used the fallocate flag correctly.

Moreover, a lot of applications drop root capabilities
once they are executed and initialized, for the obvious reason. Consider
the case where the application does a fallocate-expose-stale-data
before dropping the root cap; then, after it drops the root
cap, the "regular user" has access to other users' old data.

And it might not even be the case of a buggy application. Consider the very
simple case where root fallocate-expose-stale-data's a file and does not
set the read permissions right; then *any* user can read other users' data.
This will definitely happen, as this kind of error happens all the time.

And you're also assuming that the application developer or root
administrator knows exactly what the problem is and can see all the
corner cases, such as the few I've described above. That's not the case;
they will get it wrong in some cases, especially when it is promoted
as a performance boost.

I do not like this "feature" at all; what about fixing the
problem instead? XFS does not seem to have this issue, at least not as
noticeably, so it is definitely possible.

Thanks!
-Lukas

>
> Regards,
> Zheng
>
> >
> > The ext4 delta is a bit larger, though, so it seems worth investigating
> > the *root cause* of the extra overhead if it's problematic in your
> > workloads...
> >
> > -Eric
> >
> >
> > > Obviously, this flag brings a security issue, because a malicious user
> > > could use it to get another user's data if (s)he doesn't do an
> > > initialization before reading the file. Thus, a sysctl parameter,
> > > 'fs.falloc_no_hide_stale', is defined to let the administrator determine
> > > whether or not this flag is enabled. Currently, the flag is disabled by
> > > default. I am not sure whether this is enough. Another option is to
> > > add a new Kconfig entry so that the flag can be removed when the kernel is
> > > compiled. Any suggestions or comments are appreciated.
> > >
> > > Regards,
> > > Zheng
> > >
> > > Zheng Liu (3):
> > > vfs: add FALLOC_FL_NO_HIDE_STALE flag in fallocate
> > > vfs: add security check for _NO_HIDE_STALE flag
> > > ext4: add FALLOC_FL_NO_HIDE_STALE support
> > >
> > > fs/ext4/extents.c | 7 +++++--
> > > fs/open.c | 12 +++++++++++-
> > > include/linux/falloc.h | 5 +++++
> > > include/linux/sysctl.h | 1 +
> > > kernel/sysctl.c | 10 ++++++++++
> > > 5 files changed, 32 insertions(+), 3 deletions(-)
>

--

2012-04-18 08:04:21

by Lukas Czerner

Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Tue, 17 Apr 2012, Ted Ts'o wrote:

> On Tue, Apr 17, 2012 at 01:59:37PM -0400, Ric Wheeler wrote:
> >
> > You could get both security and avoid the run time hit by fully
> > writing the file or by having a variation that relied on "discard"
> > (i.e., no need to zero data if we can discard or track it as
> > unwritten).
>
> It's certainly the case that if the device supports persistent
> discard, something which we definitely *should* do is to send the
> discard at fallocate time and then mark the space as initialized.

Yes, using discard to avoid reading stale data might be a partial solution
on some hardware, although it has its own problems, like minimum discard
range size, cache flushes, slowness on some hardware, and the fact that
it might not actually zero all of the data; all of that should get
better in the future, I guess. Unfortunately, this is definitely not a
solution for now.

>
> Unfortunately, not all devices, and in particular no HDDs of which I am
> aware, support persistent discard. And writing all zeros to the file
> is in fact what a number of programs of which I am aware (including
> an enterprise database) are doing, precisely because they tend to
> write into the fallocated space in a somewhat random order, and the
> extent conversion cost is in fact quite significant. But writing all
> zeros to the file before you can use it is quite costly; at the very
> least it burns disk bandwidth --- one of the main motivations of
> fallocate was to avoid needing a "write all zeros" pass, and
> while it does solve the problem for some use cases (such as DVRs),
> it's not a complete solution.
>
> Whether or not it is a security issue is debatable. If using the
> fallocate flag requires CAP_SYS_RAWIO, and the process has to
> explicitly ask for the privilege, then a process with those privileges
> can already access memory and I/O ports directly, via the ioperm(2) and
> iopl(2) system calls. So I think it's possible to be a bit nuanced
> about whether or not this is as horrible as you might think.

I do not think this is debatable at all. It _is_ a security hole no matter
what. What you're talking about is just the privilege to create such a
dangerous file. After it is created, it is protected _only_ by the
application logic (and a bug in the application can expose stale data to
a regular user) or by file permissions, which is even worse, since people
get those wrong all the time. So it is not as if it is perfectly safe to
let only root create such a file; not at all.

Moreover, users will get this wrong, because they will not understand
every corner case. And when you add the "performance boost" and the
vague name of the flag (really, FALLOC_FL_EXPOSE_USER_DATA would have
been much better) into the mix, you'll get a disaster.

>
> Ultimately, if there are application programmers who are really
> desperate for the last bit of performance, they can always use
> FIBMAP/FIEMAP and then read/write directly to the block device. (And
> no, that's not a theoretical example.) I think it is a worthwhile
> goal to provide file system interfaces that allow a trusted process
> which has the appropriate security capabilities to do things in a
> safer way than that.

Well, then let them use XFS, since it does not have this problem; or
rather, let's fix the root cause.

Regards
-Lukas

>
> Regards,
>
> - Ted
>

2012-04-18 08:19:16

by Lukas Czerner

Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Tue, 17 Apr 2012, Andreas Dilger wrote:

> On 2012-04-17, at 10:40, Eric Sandeen <[email protected]> wrote:
> > On 4/17/12 11:53 AM, Zheng Liu wrote:
> >>
> >> fallocate is a useful system call because it can preallocate some disk blocks
> >> for a file and keep the blocks contiguous. However, it has a defect: the file
> >> system will convert an uninitialized extent to an initialized one when the user
> >> wants to write some data to the file, because the file system creates an
> >> uninitialized extent when it preallocates blocks in fallocate (e.g. ext4).
> >
> > That's a security-driven design feature, not a defect. :)
> >
> >> I tried to make ext4 support this new flag, and ran a simple test on my own
> >> desktop to verify it. The machine has an Intel(R) Core(TM)2 Duo CPU E8400, 4GB of
> >> memory and a WDC WD1600AAJS-75M0A0 160GB SATA disk. I use the following script
> >> to test the performance.
> >>
> >> #!/bin/sh
> >> mkfs.ext4 ${DEVICE}
> >> mount -t ext4 ${DEVICE} ${TARGET}
> >> fallocate -l 27262976 ${TARGET}/test # the size of the file is 256M (*)
> >
> > That's only 26MB, but the below loop writes to a max offset of around
> > 256M.
> >
> >> time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/sda1/test_256M \
> >> conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> >> 2>/dev/null; done
> >
> > You fallocate ${TARGET}/test but dd to /mnt/sda1/test_256M ? I presume
> > those should be the same file.
> >
> > So the testcase as shown above seems fairly wrong, no? Is that what you
> > used for the numbers below?
>
> I suspect the real problem may be the lazy inode table zeroing being done in the kernel. If the test is run immediately after formatting the filesystem, then the kernel thread is busy writing to the disk in the background and interfering with your benchmark.
>
> Please format the filesystem with "-E lazy_itable_init=0" to avoid this behavior.

That's right. Or you can mount it with -o noinit_itable to prevent the
kernel from initializing the inode table. But I would expect this not to
have such a huge impact, since it happens in both the w/ and w/o
cases.

-Lukas

>
> Secondly, your test program is not doing random writes to disk, but rather writes at 64kB intervals. There is logic in the uninitialized extent handling that will write zeros to an entire extent rather than create many fragmented uninitialized extents. It may be that you are zeroing out the entire file, and writing 16x as much data as you expect.
>
> Cheers, Andreas
>
> >> * I write a wrapper program to call fallocate(2) with FALLOC_FL_NO_HIDE_STALE
> >> flag because the userspace tool doesn't support the new flag.
> >>
> >> The result:
> >> w/o w/
> >> real 1m16.043s 0m17.946s -76.4%
> >> user 0m0.195s 0m0.192s -1.54%
> >> sys 0m0.468s 0m0.462s -1.28%
> >
> > I think that the missing information here is some profiling to show where
> > the time was spent in the "w/o" case. What, exactly, in ext4 extent
> > management is so darned slow that we have to poke security holes in the
> > filesystem to get decent performance?
> >
> > However, the above also seems like an alarmingly large difference, and
> > one that I can't attribute to unwritten extent conversion overhead.
> >
> > If I test the seeky dd to a prewritten file (to eliminate extent
> > conversion):
> >
> > # dd if=/dev/zero of=/mnt/scratch/test bs=1M count=256
> > # sync
> >
> > vs. to a fallocated file (which requires extent conversion):
> >
> > # fallocate -l 256m /mnt/scratch/test
> >
> > and then do your seeky dd test after each of the above:
> >
> > # time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/scratch/test \
> > conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> > 2>/dev/null; done
> >
> > On ext4, I get about 59.9 seconds in the pre-written case, 65.2 seconds in the fallocated case.
> >
> > On xfs, I get about 52.5 seconds in the pre-written case, 52.7 seconds in the fallocated case.
> >
> > I don't see anywhere near the slowdown you show above, certainly
> > nothing bad enough to warrant opening the big security hole.
> > Am I missing something?
> >
> > The ext4 delta is a bit larger, though, so it seems worth investigating
> > the *root cause* of the extra overhead if it's problematic in your
> > workloads...
> >
> > -Eric
> >
> >
> >> Obviously, this flag brings a security issue, because a malicious user
> >> could use it to get another user's data if (s)he doesn't do an
> >> initialization before reading the file. Thus, a sysctl parameter,
> >> 'fs.falloc_no_hide_stale', is defined to let the administrator determine
> >> whether or not this flag is enabled. Currently, the flag is disabled by
> >> default. I am not sure whether this is enough. Another option is to
> >> add a new Kconfig entry so that the flag can be removed when the kernel is
> >> compiled. Any suggestions or comments are appreciated.
> >>
> >> Regards,
> >> Zheng
> >>
> >> Zheng Liu (3):
> >> vfs: add FALLOC_FL_NO_HIDE_STALE flag in fallocate
> >> vfs: add security check for _NO_HIDE_STALE flag
> >> ext4: add FALLOC_FL_NO_HIDE_STALE support
> >>
> >> fs/ext4/extents.c | 7 +++++--
> >> fs/open.c | 12 +++++++++++-
> >> include/linux/falloc.h | 5 +++++
> >> include/linux/sysctl.h | 1 +
> >> kernel/sysctl.c | 10 ++++++++++
> >> 5 files changed, 32 insertions(+), 3 deletions(-)
>

--

2012-04-18 11:31:42

by Zheng Liu

Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Tue, Apr 17, 2012 at 09:59:37PM -0700, Andreas Dilger wrote:
> On 2012-04-17, at 10:40, Eric Sandeen <[email protected]> wrote:
> > On 4/17/12 11:53 AM, Zheng Liu wrote:
> >>
> >> fallocate is a useful system call because it can preallocate some disk blocks
> >> for a file and keep the blocks contiguous. However, it has a defect: the file
> >> system will convert an uninitialized extent to an initialized one when the user
> >> wants to write some data to the file, because the file system creates an
> >> uninitialized extent when it preallocates blocks in fallocate (e.g. ext4).
> >
> > That's a security-driven design feature, not a defect. :)
> >
> >> I tried to make ext4 support this new flag, and ran a simple test on my own
> >> desktop to verify it. The machine has an Intel(R) Core(TM)2 Duo CPU E8400, 4GB of
> >> memory and a WDC WD1600AAJS-75M0A0 160GB SATA disk. I use the following script
> >> to test the performance.
> >>
> >> #!/bin/sh
> >> mkfs.ext4 ${DEVICE}
> >> mount -t ext4 ${DEVICE} ${TARGET}
> >> fallocate -l 27262976 ${TARGET}/test # the size of the file is 256M (*)
> >
> > That's only 26MB, but the below loop writes to a max offset of around
> > 256M.
> >
> >> time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/sda1/test_256M \
> >> conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> >> 2>/dev/null; done
> >
> > You fallocate ${TARGET}/test but dd to /mnt/sda1/test_256M ? I presume
> > those should be the same file.
> >
> > So the testcase as shown above seems fairly wrong, no? Is that what you
> > used for the numbers below?
>
> I suspect the real problem may be the lazy inode table zeroing being done in the kernel. If the test is run immediately after formatting the filesystem, then the kernel thread is busy writing to the disk in the background and interfering with your benchmark.
>
> Please format the filesystem with "-E lazy_itable_init=0" to avoid this behavior.

I formatted the filesystem with this option, but it seems to make no
difference.

>
> Secondly, your test program is not doing random writes to disk, but rather writes at 64kB intervals. There is logic in the uninitialized extent handling that will write zeros to an entire extent rather than create many fragmented uninitialized extents. It may be that you are zeroing out the entire file, and writing 16x as much data as you expect.

No, I don't see any entire extents being zeroed out. The extent is
split into many fragmented initialized and uninitialized extents.

I have run a more detailed benchmark and will post it to the mailing
list later. IMHO, I guess that the journal might be the root cause.

Regards,
Zheng

>
> Cheers, Andreas
>
> >> * I write a wrapper program to call fallocate(2) with FALLOC_FL_NO_HIDE_STALE
> >> flag because the userspace tool doesn't support the new flag.
> >>
> >> The result:
> >> w/o w/
> >> real 1m16.043s 0m17.946s -76.4%
> >> user 0m0.195s 0m0.192s -1.54%
> >> sys 0m0.468s 0m0.462s -1.28%
> >
> > I think that the missing information here is some profiling to show where
> > the time was spent in the "w/o" case. What, exactly, in ext4 extent
> > management is so darned slow that we have to poke security holes in the
> > filesystem to get decent performance?
> >
> > However, the above also seems like an alarmingly large difference, and
> > one that I can't attribute to unwritten extent conversion overhead.
> >
> > If I test the seeky dd to a prewritten file (to eliminate extent
> > conversion):
> >
> > # dd if=/dev/zero of=/mnt/scratch/test bs=1M count=256
> > # sync
> >
> > vs. to a fallocated file (which requires extent conversion):
> >
> > # fallocate -l 256m /mnt/scratch/test
> >
> > and then do your seeky dd test after each of the above:
> >
> > # time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/scratch/test \
> > conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> > 2>/dev/null; done
> >
> > On ext4, I get about 59.9 seconds in the pre-written case, 65.2 seconds in the fallocated case.
> >
> > On xfs, I get about 52.5 seconds in the pre-written case, 52.7 seconds in the fallocated case.
> >
> > I don't see anywhere near the slowdown you show above, certainly
> > nothing bad enough to warrant opening the big security hole.
> > Am I missing something?
> >
> > The ext4 delta is a bit larger, though, so it seems worth investigating
> > the *root cause* of the extra overhead if it's problematic in your
> > workloads...
> >
> > -Eric
> >
> >
> >> Obviously, this flag brings a security issue, because a malicious user
> >> could use it to get another user's data if (s)he doesn't do an
> >> initialization before reading the file. Thus, a sysctl parameter,
> >> 'fs.falloc_no_hide_stale', is defined to let the administrator determine
> >> whether or not this flag is enabled. Currently, the flag is disabled by
> >> default. I am not sure whether this is enough. Another option is to
> >> add a new Kconfig entry so that the flag can be removed when the kernel is
> >> compiled. Any suggestions or comments are appreciated.
> >>
> >> Regards,
> >> Zheng
> >>
> >> Zheng Liu (3):
> >> vfs: add FALLOC_FL_NO_HIDE_STALE flag in fallocate
> >> vfs: add security check for _NO_HIDE_STALE flag
> >> ext4: add FALLOC_FL_NO_HIDE_STALE support
> >>
> >> fs/ext4/extents.c | 7 +++++--
> >> fs/open.c | 12 +++++++++++-
> >> include/linux/falloc.h | 5 +++++
> >> include/linux/sysctl.h | 1 +
> >> kernel/sysctl.c | 10 ++++++++++
> >> 5 files changed, 32 insertions(+), 3 deletions(-)
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> >> the body of a message to [email protected]
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-04-18 11:39:06

by Lukas Czerner

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Wed, 18 Apr 2012, Zheng Liu wrote:

> > I suspect the real problem may be the lazy inode table zeroing being done in the kernel. If the test is run immediately after formatting the filesystem, then the kernel thread is busy writing to the disk in the background and interfering with your benchmark.
> >
> > Please format the filesystem with "-E lazy-itable-init=0", to avoid this behavior.
>
> I formatted the filesystem with this option, but it seems to make no
> difference.

What do you mean by that? Does it not work as it's supposed to? Is the
ext4lazyinit kernel thread still running? Or was the slowdown not
caused by the lazy init in the first place?

Thanks!
-Lukas

2012-04-18 12:03:08

by Zheng Liu

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Wed, Apr 18, 2012 at 09:48:45AM +0200, Lukas Czerner wrote:
> On Wed, 18 Apr 2012, Zheng Liu wrote:
>
> > On Tue, Apr 17, 2012 at 12:40:14PM -0500, Eric Sandeen wrote:
> > > On 4/17/12 11:53 AM, Zheng Liu wrote:
> > > > Hi list,
> > > >
> > > > fallocate is a useful system call because it can preallocate some disk blocks
> > > > for a file and keep blocks contiguous. However, it has a defect that the file
> > > > system will convert an uninitialized extent to an initialized one when the user
> > > > wants to write some data to this file, because the file system creates an
> > > > uninitialized extent while it preallocates blocks in fallocate (e.g. ext4).
> > >
> > > That's a security-driven design feature, not a defect. :)
>
> Exactly! It really surprises me that you call it a defect; moreover, the
> name you've chosen for the flag is IMO wrong. With a security hole like
> this you really want to give the user a very strong signal that he is NOT
> supposed to use it unless he knows exactly what he is doing. Maybe
> something like FALLOC_FL_EXPOSE_USER_DATA, because that is what it does.
>
> > >
> > > > Especially, it causes a severe degradation when the user tries to do some
> > > > random write operations, which frequently modify the metadata of this file.
> > > > We met this problem in our production system at Taobao. Last month, at the
> > > > ext4 workshop, we discussed this problem and Google faces the same problem.
> > > > So a new flag, FALLOC_FL_NO_HIDE_STALE, is added in order to solve it.
> > >
> > > Which then opens up severe security problems.
> > >
> > > > When this flag is set, the file system will create an initialized extent for
> > > > this file, so it avoids the conversion from uninitialized to initialized. If
> > > > users want to use this flag, they must guarantee that the file has been
> > > > initialized by themselves before it is read at the same offset. This flag is
> > > > added in the vfs so that other file systems can also support it to improve
> > > > performance.
> > > >
> > > > I tried to make ext4 support this new flag, and ran a simple test on my own
> > > > desktop to verify it. The machine has an Intel(R) Core(TM)2 Duo CPU E8400, 4G
> > > > of memory and a WDC WD1600AAJS-75M0A0 160G SATA disk. I used the following
> > > > script to test the performance.
> > > >
> > > > #/bin/sh
> > > > mkfs.ext4 ${DEVICE}
> > > > mount -t ext4 ${DEVICE} ${TARGET}
> > > > fallocate -l 27262976 ${TARGET}/test # the size of the file is 256M (*)
> > >
> > > That's only 26MB, but the loop below writes to a max offset of around
> > > 128M.
> >
> > Yes, you are right. I preallocate a file that is 256MB.
> >
> > >
> > > > time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/sda1/test_256M \
> > > > conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> > > > 2>/dev/null; done
> > >
> > > You fallocate ${TARGET}/test but dd to /mnt/sda1/test_256M? I presume
> > > those should be the same file.
> >
> > Yes, it is the same file.
>
> But it is not in your test case. Does it mean that your test case is
> wrong?
>
> >
> > >
> > > So the testcase as shown above seems fairly wrong, no? Is that what you
> > > used for the numbers below?
> > >
> > > > * I wrote a wrapper program to call fallocate(2) with the FALLOC_FL_NO_HIDE_STALE
> > > > flag because the userspace tool doesn't support the new flag.
> > > >
> > > > The result:
> > > > w/o w/
> > > > real 1m16.043s 0m17.946s -76.4%
> > > > user 0m0.195s 0m0.192s -1.54%
> > > > sys 0m0.468s 0m0.462s -1.28%
> > >
> > > I think that the missing information here is some profiling to show where
> > > the time was spent in the "w/o" case. What, exactly, in ext4 extent
> > > management is so darned slow that we have to poke security holes in the
> > > filesystem to get decent performance?
> > >
> > > However, the above also seems like an alarmingly large difference, and
> > > one that I can't attribute to unwritten extent conversion overhead.
> > >
> > > If I test the seeky dd to a prewritten file (to eliminate extent
> > > conversion):
> > >
> > > # dd if=/dev/zero of=/mnt/scratch/test bs=1M count=256
> > > # sync
> > >
> > > vs. to a fallocated file (which requires extent conversion):
> > >
> > > # fallocate -l 256m /mnt/scratch/test
> > >
> > > and then do your seeky dd test after each of the above:
> > >
> > > # time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/scratch/test \
> > > conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> > > 2>/dev/null; done
> > >
> > > On ext4, I get about 59.9 seconds in the pre-written case, 65.2 seconds in the fallocated case.
> > >
> > > On xfs, I get about 52.5 seconds in the pre-written case, 52.7 seconds in the fallocated case.
> > >
> > > I don't see anywhere near the slowdown you show above, certainly
> > > nothing bad enough to warrant opening the big security hole.
> > > Am I missing something?
> >
> > I will run more detailed benchmarks to trace this issue. When I have
> > the latest results, I will send a new mail to let you know. :)
>
> Yes, please do. I do not think that in this case simply doing 'time
> dd' is enough. We need to know what is happening inside the kernel to see
> what is causing the slowdown. It may very well be that we can
> fix it, rather than introduce a crazy security hole to bypass the
> problem.

Sorry, maybe my expression was not clear. Now we have two problems that
need to be discussed in this thread. One is my patch set adding a
new fallocate flag, and the other is that the result of the preallocated
case with conversion is much slower than the result of the case without
conversion.

For the new flag: last year we encountered this problem in our production
system. We decided to add this flag to avoid the overhead because our
application can guarantee to initialize the file before it reads data
from it. As I said before, last month at the ext4 workshop, we
discussed this problem and Ted said that Google has the same problem
too. So we think that this flag may be useful for other users. That
is the reason why I sent this patch set to the mailing list.

As for the benchmark result, that was a simple test. When I sent this
patch set to the mailing list, I thought I needed some data to compare
the performance w/ and w/o this new flag. I was astonished by the result.
So, as I said above, the goal of my patch set is not to solve this issue.

I have run a more detailed benchmark and I will post it to the mailing
list. Please review it after I send it. Any comments are welcome.

Regards,
Zheng

>
> >
> > I fully understand that this flag will bring a security hole, and I
> > totally agree that we should dig out the *root cause* in ext4. But,
> > IMHO, a proper interface, gated by the proper capabilities, might be
> > useful for the user.
>
> No, you're not getting the point. In the kernel (or fs) we are supposed
> to prevent security problems as much as possible, and it is
> indisputably the job of file systems to prevent reading data which is
> NOT in the file system and might have belonged to another user. If you
> want to do this, just use a raw device.
>
> Limiting this by proper capabilities is nowhere near enough, for tons of
> reasons. Applications have bugs, which may very well cause the stale data
> to be exposed directly to a user of the application who might not
> even have root capabilities. Consider a buggy database returning stale
> data (possibly deleted secret information) from one user to
> another; it _is_ possible and things like that _will_ happen with
> this "feature". What is even worse, in this case the application
> writers actually used the fallocate flag right.
>
> Moreover, a lot of applications drop their root capabilities once they
> are executed and initialized, for obvious reasons. Consider
> the case where the application does fallocate-expose-stale-data
> before getting rid of the root cap; then after it drops the root
> cap, a "regular user" has access to other users' old data.
>
> And it might not even be a case of a buggy application. Consider the very
> simple case where root fallocate-exposes-stale-data in a file and did not
> set the read permissions right; then *any* user can read other users' data.
> This will definitely happen, as this kind of error happens all the time.
>
> And you're also assuming that the application developer or root
> administrator knows exactly what the problem is and can see all the
> corner cases, such as the few I've described above. That's not the case;
> they will get it wrong in some cases, especially when it is promoted
> as a performance boost.
>
> I do not like this "feature" at all; what about fixing the
> problem instead? XFS does not seem to have this issue, at least not as
> noticeably, so it is definitely possible.
>
> Thanks!
> -Lukas
>
> >
> > Regards,
> > Zheng
> >
> > >
> > > The ext4 delta is a bit larger, though, so it seems worth investigating
> > > the *root cause* of the extra overhead if it's problematic in your
> > > workloads...
> > >
> > > -Eric
> > >
> > >
> > > > Obviously, this flag will bring a security issue because a malicious user
> > > > could use this flag to get other users' data if (s)he doesn't do an
> > > > initialization before reading this file. Thus, a sysctl parameter
> > > > 'fs.falloc_no_hide_stale' is defined in order to let the administrator
> > > > determine whether or not this flag is enabled. Currently, this flag is
> > > > disabled by default. I am not sure whether this is enough or not. Another
> > > > option is to create a new Kconfig entry that removes this flag when the
> > > > kernel is compiled. So any suggestions or comments are appreciated.
> > > >
> > > > Regards,
> > > > Zheng
> > > >
> > > > Zheng Liu (3):
> > > > vfs: add FALLOC_FL_NO_HIDE_STALE flag in fallocate
> > > > vfs: add security check for _NO_HIDE_STALE flag
> > > > ext4: add FALLOC_FL_NO_HIDE_STALE support
> > > >
> > > > fs/ext4/extents.c | 7 +++++--
> > > > fs/open.c | 12 +++++++++++-
> > > > include/linux/falloc.h | 5 +++++
> > > > include/linux/sysctl.h | 1 +
> > > > kernel/sysctl.c | 10 ++++++++++
> > > > 5 files changed, 32 insertions(+), 3 deletions(-)
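As a side note for anyone reproducing the disputed test: the seek arithmetic of the dd loop can be modeled directly. The sketch below is illustrative only (it is not part of the patch set) and assumes the loop exactly as quoted: bs=4k, seek=i*16, 2000 iterations. It shows the writes land at 64 KiB strides, topping out near 125 MiB, well beyond the 26 MB actually preallocated by `fallocate -l 27262976` in the original script:

```python
# Model of the dd loop: for i in 0..1999, write 4 KiB at byte offset i * 16 * 4096.
# Illustrative only; mirrors the test script quoted above.

BLOCK = 4096          # dd bs=4k
STRIDE = 16           # seek=`expr $i \* 16`, in 4 KiB blocks
ITERS = 2000

def write_offsets(iters=ITERS, block=BLOCK, stride=STRIDE):
    """Byte offset of each 4 KiB write issued by the loop."""
    return [i * stride * block for i in range(iters)]

offs = write_offsets()
max_end = offs[-1] + BLOCK                 # end of the last write
print("writes:", len(offs))
print("stride between writes: %d KiB" % (STRIDE * BLOCK // 1024))
print("max offset reached: %.1f MiB" % (max_end / 2**20))
# Each write touches 1 block out of every 16, so only 1/16 of the file
# range is ever written; the rest of each fallocated extent stays
# uninitialized, which is what forces the per-write extent conversion.
```

This is why a 256 MiB preallocation is needed for the loop to stay inside the fallocated range, while the 26 MB file from the script as posted would not cover it.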

2012-04-18 11:59:32

by Zheng Liu

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Wed, Apr 18, 2012 at 01:39:06PM +0200, Lukas Czerner wrote:
> On Wed, 18 Apr 2012, Zheng Liu wrote:
>
> > > I suspect the real problem may be the lazy inode table zeroing being done in the kernel. If the test is run immediately after formatting the filesystem, then the kernel thread is busy writing to the disk in the background and interfering with your benchmark.
> > >
> > > Please format the filesystem with "-E lazy-itable-init=0", to avoid this behavior.
> >
> > I formatted the filesystem with this option, but it seems to make no
> > difference.
>
> What do you mean by that? Does it not work as it's supposed to? Is the
> ext4lazyinit kernel thread still running? Or was the slowdown not
> caused by the lazy init in the first place?

Sorry, I used this option and re-ran the benchmark, but the
slowdown is still there.

Regards,
Zheng

2012-04-18 12:07:09

by Lukas Czerner

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Wed, 18 Apr 2012, Zheng Liu wrote:

--snip--

> > >
> > > I will run more detailed benchmarks to trace this issue. When I have
> > > the latest results, I will send a new mail to let you know. :)
> >
> > Yes, please do. I do not think that in this case simply doing 'time
> > dd' is enough. We need to know what is happening inside the kernel to see
> > what is causing the slowdown. It may very well be that we can
> > fix it, rather than introduce a crazy security hole to bypass the
> > problem.
>
> Sorry, maybe my expression was not clear. Now we have two problems that
> need to be discussed in this thread. One is my patch set adding a
> new fallocate flag, and the other is that the result of the preallocated
> case with conversion is much slower than the result of the case without
> conversion.
>
> For the new flag: last year we encountered this problem in our production
> system. We decided to add this flag to avoid the overhead because our
> application can guarantee to initialize the file before it reads data
> from it. As I said before, last month at the ext4 workshop, we
> discussed this problem and Ted said that Google has the same problem
> too. So we think that this flag may be useful for other users. That
> is the reason why I sent this patch set to the mailing list.
>
> As for the benchmark result, that was a simple test. When I sent this
> patch set to the mailing list, I thought I needed some data to compare
> the performance w/ and w/o this new flag. I was astonished by the result.
> So, as I said above, the goal of my patch set is not to solve this issue.
>
> I have run a more detailed benchmark and I will post it to the mailing
> list. Please review it after I send it. Any comments are welcome.

OK, I do understand this, but as other people in the thread have already
told you, this is not a proper fix. If you have a performance problem when
the conversion from uninitialized to initialized happens, then the obvious
answer is to figure out why and fix it, rather than create a crazy
workaround without even properly investigating why this is happening.

Regards,
-Lukas

>
> Regards,
> Zheng

--snip--

2012-04-18 12:41:44

by Zheng Liu

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

I ran a more detailed benchmark again. The environment is as before:
the machine has an Intel(R) Core(TM)2 Duo CPU E8400, 4G of memory and a
WDC WD1600AAJS-75M0A0 160G SATA disk.

I use 'fallocate' and 'dd' to create a 256M file. I compare three
cases: fallocate w/o the new flag, fallocate w/ the new flag, and dd.
Meanwhile, w/ journal and w/o journal are compared. When I format the
filesystem, I use '-E lazy_itable_init=0' to avoid its impact. I use
this command to do the comparison:

time for((i=0;i<2000;i++)); \
do \
dd if=/dev/zero of=/mnt/sda1/testfile conv=notrunc bs=4k \
count=1 seek=`expr $i \* 16` oflag=sync,direct 2>/dev/null; \
done


The result:

nojournal:
fallocate dd fallocate w/ new flag
real 0m4.196s 0m3.720s 0m3.782s
user 0m0.167s 0m0.194s 0m0.192s
sys 0m0.404s 0m0.393s 0m0.390s

data=journal:
fallocate dd fallocate w/ new flag
real 1m9.673s 1m10.241s 1m9.773s
user 0m0.183s 0m0.205s 0m0.192s
sys 0m0.397s 0m0.407s 0m0.398s

data=ordered
fallocate dd fallocate w/ new flag
real 1m16.006s 0m18.291s 0m18.449s
user 0m0.193s 0m0.193s 0m0.201s
sys 0m0.384s 0m0.387s 0m0.381s

data=writeback
fallocate dd fallocate w/ new flag
real 1m16.247s 0m18.133s 0m18.417s
user 0m0.187s 0m0.193s 0m0.205s
sys 0m0.401s 0m0.398s 0m0.387s

As the result shows, in nojournal mode the slowdown in the w/ conversion
case is not very severe. Obviously it is caused by initializing an
unwritten extent. This patch set aims to reduce this overhead, so we can
go on discussing whether it is acceptable or not. Certainly, if most
developers strongly object to it, we can just leave it on the mailing list.

In journal mode, we can see that when data is set to 'journal', the result
is almost the same. However, when data is set to 'ordered' or 'writeback',
the slowdown in the w/ conversion case is severe. Then I ran the same test
without 'oflag=sync,direct', and the result doesn't change. IMHO, I
guess that the journal is the *root cause*. Until now, I don't have a
definite conclusion, and I will keep tracing this issue. Please feel
free to comment on it.
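For quick reference, the relative cost of the conversion in each mode can be pulled straight from the 'real' times in the tables above (a throwaway sketch; the numbers are copied verbatim from the tables):

```python
# Ratio of plain fallocate (conversion on every write) to fallocate with
# the proposed flag, per journal mode. Times are the 'real' seconds above.
times = {
    "nojournal":      (4.196,  3.782),
    "data=journal":   (69.673, 69.773),
    "data=ordered":   (76.006, 18.449),
    "data=writeback": (76.247, 18.417),
}

for mode, (conv, no_conv) in times.items():
    print("%-15s conversion costs %.2fx" % (mode, conv / no_conv))
```

The ~4.1x ratios in ordered and writeback modes are where the conversion overhead shows up; in data=journal mode the journalled data writes dominate everything and the flag makes no measurable difference.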

Thanks to all who replied to this mail. :-)

Regards,
Zheng

2012-04-18 14:57:17

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On 4/17/12 11:59 PM, Andreas Dilger wrote:

...

> Secondly, your test program is not doing random writes to disk, but
> rather doing writes at 64kB intervals. There is logic in the
> uninitialized extent handling that will write zeros to an entire
> extent, rather than create many fragmented uninitialized extents. It
> may be possible that you are zeroing out the entire file, and writing
> 16x as much data as you expect.
>
> Cheers, Andreas

I don't think the testcase as written is triggering that behavior, though
other similar testcases might. In this case the left-over uninit extents
are large enough that they don't get zeroed:

File size of /mnt/scratch/test is 268435456 (65536 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 34816 1
1 1 34817 15 unwritten
2 16 34832 1
3 17 34833 15 unwritten
4 32 34848 1
5 33 34849 15 unwritten
...

Good guess though :)

-Eric
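Eric's filefrag output follows mechanically from the write pattern: each 4 KiB write splits one 16-block chunk of the fallocated range into a 1-block written extent plus a 15-block unwritten remainder, and 15 blocks is above the EXT4_EXT_ZERO_LEN threshold of 7 mentioned later in the thread, so the zero-out path never fires. A small model of that bookkeeping (illustrative only, not the actual ext4 logic):

```python
# Split a 16-block uninitialized chunk with a 1-block write at its start,
# then ask whether ext4's zero-out heuristic would fire on the remainder.
# EXT4_EXT_ZERO_LEN = 7 per the thread; everything else is illustrative.

EXT4_EXT_ZERO_LEN = 7

def split_chunk(chunk_blocks=16, write_blocks=1):
    """Return (written_len, unwritten_len, zeroed_out) for one write."""
    remainder = chunk_blocks - write_blocks
    # ext4 zero-fills a small leftover piece instead of creating a tiny
    # uninitialized extent; here the remainder is too large for that.
    zeroed = remainder <= EXT4_EXT_ZERO_LEN
    return write_blocks, remainder, zeroed

written, unwritten, zeroed = split_chunk()
print(written, unwritten, zeroed)   # the "1 written / 15 unwritten" pattern
```

With a smaller stride (say, writes every 8 blocks) the 7-block remainder would be zeroed out, which is the behavior Andreas guessed at.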

2012-04-18 15:09:39

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On 2012-04-18, at 5:48, Zheng Liu <[email protected]> wrote:
> I ran a more detailed benchmark again. The environment is as before:
> the machine has an Intel(R) Core(TM)2 Duo CPU E8400, 4G of memory and a
> WDC WD1600AAJS-75M0A0 160G SATA disk.
>
> I use 'fallocate' and 'dd' to create a 256M file. I compare three
> cases: fallocate w/o the new flag, fallocate w/ the new flag, and dd.
> Meanwhile, w/ journal and w/o journal are compared. When I format the
> filesystem, I use '-E lazy_itable_init=0' to avoid its impact. I use
> this command to do the comparison:
>
> time for((i=0;i<2000;i++)); \
> do \
> dd if=/dev/zero of=/mnt/sda1/testfile conv=notrunc bs=4k \
> count=1 seek=`expr $i \* 16` oflag=sync,direct 2>/dev/null; \
> done
>
>
> The result:
>
> nojournal:
> fallocate dd fallocate w/ new flag
> real 0m4.196s 0m3.720s 0m3.782s
> user 0m0.167s 0m0.194s 0m0.192s
> sys 0m0.404s 0m0.393s 0m0.390s
>
> data=journal:
> fallocate dd fallocate w/ new flag
> real 1m9.673s 1m10.241s 1m9.773s
> user 0m0.183s 0m0.205s 0m0.192s
> sys 0m0.397s 0m0.407s 0m0.398s
>
> data=ordered
> fallocate dd fallocate w/ new flag
> real 1m16.006s 0m18.291s 0m18.449s
> user 0m0.193s 0m0.193s 0m0.201s
> sys 0m0.384s 0m0.387s 0m0.381s
>
> data=writeback
> fallocate dd fallocate w/ new flag
> real 1m16.247s 0m18.133s 0m18.417s
> user 0m0.187s 0m0.193s 0m0.205s
> sys 0m0.401s 0m0.398s 0m0.387s
>
> In journal mode, we can see that when data is set to 'journal', the result
> is almost the same. However, when data is set to 'ordered' or 'writeback',
> the slowdown in the w/ conversion case is severe. Then I ran the same test
> without 'oflag=sync,direct', and the result doesn't change. IMHO, I
> guess that the journal is the *root cause*. Until now, I don't have a
> definite conclusion, and I will keep tracing this issue. Please feel
> free to comment on it.

Looking at these performance numbers again, it would seem better if ext4 _was_ zero filling the whole file and converting the whole thing to initialized extents instead of leaving so many uninitialized extents behind.

The file size is 256MB, and the disk would have to be doing only 3.5MB/s for linear streaming writes to match the performance that you report, so a modern disk doing 50MB/s should be able to zero the whole file in 5s.

It seems the threshold for zeroing uninitialized extents is incorrect. EXT4_EXT_ZERO_LEN is only 7 blocks (28kB normally), but typical disks can write 64kB as easily as 4kB, so it would be interesting to change EXT4_EXT_ZERO_LEN to 16 and re-run your test.

If that solves this particular test case, it won't necessarily solve the general case, but it is still a useful fix. If you submit a patch for this, please change this code to compare against 64kB instead of a block count, and also take s_raid_stride into account if set, like:

    ext_zero_len = max(EXT4_EXT_ZERO_LEN * 1024 >> inode->i_blkbits,
                       EXT4_SB(inode->i_sb)->s_es->s_raid_stride);

This would write up to 64kB, or a full RAID stripe (since it already needs to seek that spindle), whichever is larger. It isn't perfect, since it should really align the zero-out to the RAID stripe to avoid seeking two spindles, but it is a starting point.

Cheers, Andreas
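Andreas's proposed heuristic is easy to sanity-check numerically. The sketch below is my own restatement of the formula, not kernel code; it works in units of filesystem blocks and ignores the le16 byte-swap that real code would need when reading s_raid_stride from the superblock:

```python
# Proposed zero-out threshold: at least 64 KiB worth of blocks, or one
# full RAID stride if that is larger. blkbits is log2(block size).

def ext_zero_len(blkbits, raid_stride_blocks=0):
    """Zero-out length in filesystem blocks, per Andreas's suggestion."""
    return max((64 * 1024) >> blkbits, raid_stride_blocks)

print(ext_zero_len(12))        # 4 KiB blocks, no RAID -> 16 blocks = 64 KiB
print(ext_zero_len(12, 32))    # a 32-block RAID stride wins -> 32
print(ext_zero_len(10))        # 1 KiB blocks -> 64 blocks, still 64 KiB
```

Note that for 4 KiB blocks this yields exactly the "change EXT4_EXT_ZERO_LEN to 16" experiment suggested above, while staying constant in bytes across block sizes.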


2012-04-18 16:15:20

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Wed, Apr 18, 2012 at 01:02:08PM +1000, Dave Chinner wrote:
> In actual fact, on my 12 disk RAID0 array, XFS is faster with
> unwritten extents *enabled* than when hacked to turn them off.

Can you explain why this is the case? It seems... counterintuitive.

The only explanation I can think of is that your code paths when
unwritten extents are disabled haven't been optimized, in which case
the comparison between using and not using unwritten extents might not
be valid.

Is there anything going on other than _not_ mutating the extent tree
(and all of the logical journaling that would go along with it)?
Hacking to turn them off means it should be doing *less* work, so I
would expect at worst it would be the same speed as using unwritten
extents. If it's faster to use unwritten extents, something
very weird must be going on....

- Ted

2012-04-18 23:37:53

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Wed, Apr 18, 2012 at 12:07:17PM -0400, Ted Ts'o wrote:
> On Wed, Apr 18, 2012 at 01:02:08PM +1000, Dave Chinner wrote:
> > In actual fact, on my 12 disk RAID0 array, XFS is faster with
> > unwritten extents *enabled* than when hacked to turn them off.
>
> Can you explain why this is the case? It seems... counterintuitive.

I'll admit that I was slightly surprised, too.

> The only explanation I can think of is that your code paths when
> unwritten extents are disabled haven't been optimized,

When unwritten extents are disabled, the IO takes the normal
overwrite/extending write path, which is as optimised as
the unwritten path because it is more commonly used.

> in which case
> the comparison between using and not using unwritten extents might not
> be valid.

It's an apples to apples comparison.

The code paths are identical on IO submission, the only difference
is the IO completion path. In the case of unwritten extents, it has
to do a transaction to convert the extents and update the file size.
In the case of the stale extent, it has to modify the file size
because the IO is beyond EOF.

My point, however, is that despite the complexity of unwritten
extent conversion, it is not noticeably more expensive than a file
size update transaction on XFS. Hence the reasoning that unwritten
conversion has too much overhead to be useful, and so justifies
exposing stale data, is demonstrably false.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-04-20 09:46:13

by Zheng Liu

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

Hi all,

These days I have been digging into this slowdown, and I find that the
*root cause* might be the journal. I used blktrace to capture the detailed
behaviour and found the phenomenon below. I post blktrace's result here.

[Test command]
time for((i=0;i<2000;i++)); do \
dd if=/dev/zero of=/mnt/sda1/testfile conv=notrunc bs=4k count=1 \
seek=`expr $i \* 16` oflag=sync,direct 2>/dev/null; \
done

[scripts]
blktrace -d /dev/sda1
blkparse -i sda1.* -o blktrace.log
cat blktrace.log | grep 'D' | grep 'W' > result.log

[result (ext4)]
[cut]
8,1 0 14 49.969995286 10 D WS 38011023 + 40 [kworker/0:1]
8,1 0 20 49.996170768 0 D WS 38011063 + 8 [swapper/0]
8,1 0 29 50.006811878 10011 D WS 278719 + 8 [dd]
8,1 0 31 50.013996421 10 D WS 38011071 + 24 [kworker/0:1]
8,1 0 37 50.029656811 0 D WS 38011095 + 8 [swapper/0]
8,1 1 70 50.039768259 10013 D WS 278847 + 8 [dd]
8,1 0 41 50.046996403 10 D WS 38011103 + 24 [kworker/0:1]
8,1 0 47 50.071458802 0 D WS 38011127 + 8 [swapper/0]
8,1 1 89 50.081529060 10015 D WS 278975 + 8 [dd]
8,1 0 51 50.088996276 10 D WS 38011135 + 24 [kworker/0:1]
8,1 0 57 50.113247880 0 D WS 38011159 + 8 [swapper/0]
8,1 1 108 50.123329330 10017 D WS 279103 + 8 [dd]
8,1 0 61 50.130995672 10 D WS 38011167 + 24 [kworker/0:1]
8,1 0 67 50.155052076 0 D WS 38011191 + 8 [swapper/0]
8,1 1 126 50.165154127 10019 D WS 279231 + 8 [dd]
8,1 0 71 50.172995678 10 D WS 38011199 + 24 [kworker/0:1]
8,1 0 77 50.196855020 0 D WS 38011223 + 8 [swapper/0]
8,1 1 145 50.206945237 10021 D WS 279359 + 8 [dd]
8,1 0 81 50.214997236 10 D WS 38011231 + 24 [kworker/0:1]
8,1 0 87 50.238643778 0 D WS 38011255 + 8 [swapper/0]
8,1 1 164 50.248738960 10023 D WS 279487 + 8 [dd]
8,1 0 91 50.255996776 10 D WS 38011263 + 24 [kworker/0:1]
8,1 0 97 50.280447549 0 D WS 38011287 + 8 [swapper/0]
8,1 1 183 50.290550957 10025 D WS 279615 + 8 [dd]
[cut]

We can see that every write operation needs to do two journal
writes: one writes the journal descriptor block and data, and the other
writes the commit block.

Then I ran the same benchmark on xfs to do a comparison. This
comparison just aims to explain why the slowdown occurs in ext4.

[result (xfs)]
[cut]
8,1 0 70 0.256951000 0 D WSM 40162600 + 3 [swapper/0]
8,1 1 50 0.271551873 12575 D WS 1311 + 8 [dd]
8,1 0 78 0.282466586 0 D WSM 40162603 + 3 [swapper/0]
8,1 1 55 0.296547264 12577 D WS 1439 + 8 [dd]
8,1 0 86 0.307978442 0 D WSM 40162606 + 3 [swapper/0]
8,1 1 60 0.321578789 12579 D WS 1567 + 8 [dd]
8,1 0 94 0.333494988 0 D WSM 40162609 + 3 [swapper/0]
8,1 1 65 0.346582549 12581 D WS 1695 + 8 [dd]
8,1 0 102 0.359005937 0 D WSM 40162612 + 3 [swapper/0]
8,1 1 70 0.371613387 12583 D WS 1823 + 8 [dd]
8,1 0 110 0.384552158 0 D WSM 40162615 + 3 [swapper/0]
8,1 1 75 0.396604067 12585 D WS 1951 + 8 [dd]
8,1 0 118 0.410062404 0 D WSM 40162618 + 3 [swapper/0]
8,1 1 80 0.421614702 12587 D WS 2079 + 8 [dd]
8,1 0 126 0.436783655 0 D WSM 40162621 + 3 [swapper/0]
8,1 1 85 0.454989457 12589 D WS 2207 + 8 [dd]
8,1 0 134 0.470633321 0 D WSM 40162624 + 3 [swapper/0]
8,1 1 90 0.488311574 12591 D WS 2335 + 8 [dd]
8,1 0 142 0.504477295 0 D WSM 40162627 + 3 [swapper/0]
8,1 1 95 0.521675622 12593 D WS 2463 + 8 [dd]
8,1 0 150 0.538326978 0 D WSM 40162630 + 3 [swapper/0]
8,1 1 100 0.555016257 12595 D WS 2591 + 8 [dd]
8,1 0 158 0.563839349 0 D WSM 40162633 + 4 [swapper/0]
8,1 1 105 0.580049767 12597 D WS 2719 + 8 [dd]
8,1 0 166 0.589336947 0 D WSM 40162637 + 4 [swapper/0]
8,1 1 110 0.605037173 12599 D WS 2847 + 8 [dd]
8,1 0 174 0.614850369 0 D WSM 40162641 + 4 [swapper/0]
8,1 1 115 0.630078920 12601 D WS 2975 + 8 [dd]
[cut]

The result shows that each write on xfs only needs one
journal write. Meanwhile, the journal write is smaller than on ext4.
Certainly, a modern disk writes 64k as easily as 4k. So, IMO, the two
journal writes per operation are the *root cause*. I use 'wc -l' to
roughly count the number of I/Os.

$ wc -l result.log
ext4: 6018
xfs: 4021
ratio: 1.5:1

The benchmark result:
ext4: 75.615s
xfs: 54.395s
ratio: 1.4:1

Now, JBD2 needs at least two journal writes per commit. I have an
idea to solve this problem, and I will send a new RFC to the mailing
list.

Please feel free to review it. Any comments are welcome. Thank you.

Regards,
Zheng
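The 6018:4021 count is consistent with a simple per-operation model: in ordered mode, each synchronous 4k write costs ext4 a data write plus the two journal writes seen in the trace, while the xfs trace shows a data write plus a single log write. A back-of-the-envelope check (my own arithmetic, not derived from the jbd2 code):

```python
# Back-of-the-envelope model of the blktrace counts: per-sync-write I/Os
# times 2000 operations, compared with the observed 'wc -l' numbers.

WRITES = 2000

ext4_per_op = 3   # data block + journal descriptor/data + commit block
xfs_per_op = 2    # data block + one log write

ext4_model = WRITES * ext4_per_op
xfs_model = WRITES * xfs_per_op

print("ext4 model:", ext4_model, "observed: 6018")
print("xfs  model:", xfs_model,  "observed: 4021")
print("model ratio: %.2f" % (ext4_model / xfs_model))   # ~1.5, as measured
```

The small surplus over the model (18 and 21 I/Os respectively) is plausibly mount-time and metadata traffic outside the loop.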

2012-04-20 09:53:21

by Zheng Liu

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

On Wed, Apr 18, 2012 at 08:09:02AM -0700, Andreas Dilger wrote:
> Looking at these performance numbers again, it would seem better if ext4 _was_ zero filling the whole file and converting the whole thing to initialized extents instead of leaving so many uninitialized extents behind.
>
> The file size is 256MB, and the disk would have to be doing only 3.5MB/s for linear streaming writes to match the performance that you report, so a modern disk doing 50MB/s should be able to zero the whole file in 5s.
>
> It seems the threshold for zeroing uninitialized extents is incorrect. EXT4_EXT_ZERO_LEN is only 7 blocks (28kB normally), but typical disks can write 64kB as easily as 4kB, so it would be interesting to change EXT4_EXT_ZERO_LEN to 16 and re-run your test.
>
> If that solves this particular test case, it won't necessarily solve the general case, but it is still a useful fix. If you submit a patch for this, please change this code to compare against 64kB instead of a block count, and also take s_raid_stride into account if set, like:
>
>     ext_zero_len = max(EXT4_EXT_ZERO_LEN * 1024 >> inode->i_blkbits,
>                        EXT4_SB(inode->i_sb)->s_es->s_raid_stride);
>
> This would write up to 64kB, or a full RAID stripe (since it already needs to seek that spindle), whichever is larger. It isn't perfect, since it should really align the zero-out to the RAID stripe to avoid seeking two spindles, but it is a starting point.

Hi Andreas,

I set EXT4_EXT_ZERO_LEN to 16 and ran the same benchmark again. The
result is the same as before.

I noticed commit 3977c965, which set EXT4_EXT_ZERO_LEN to 7, but the
commit log doesn't describe why this value was chosen. As you said,
I believe that a modern disk writes 64K as easily as 4K. So maybe we
can consider setting it to 16, or to the RAID stripe. :)

Regards,
Zheng

2012-04-23 02:04:50

by Szabolcs Szakacsits

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate


On 4/17/12 11:53 AM, Zheng Liu wrote:

> fallocate is a useful system call because it can preallocate some disk
> blocks for a file and keep blocks contiguous. However, it has a defect
> that the file system will convert an uninitialized extent to an
> initialized one when the user wants to write some data to this file,
> because the file system creates an uninitialized extent while it
> preallocates blocks in fallocate (e.g. ext4). Especially, it causes a
> severe degradation when the user tries to do some random write
> operations, which frequently modify the metadata of this file. We met
> this problem in our production system at Taobao. Last month, at the ext4
> workshop, we discussed this problem and Google faces the same problem.
> So a new flag, FALLOC_FL_NO_HIDE_STALE, is added in order to solve it.

I think a more explicit name would be better like FALLOC_FL_EXPOSE_DATA,
FALLOC_FL_EXPOSE_STALE_DATA, FALLOC_FL_EXPOSE_UNINITIALIZED_DATA, etc.

> When this flag is set, the file system will create an initialized extent
> for this file, so it avoids the conversion from uninitialized to
> initialized. If users want to use this flag, they must guarantee that
> the file has been initialized by themselves before it is read at the
> same offset. This flag is added in the vfs so that other file systems
> can also support it to improve performance.

This flag could indeed be helpful for filesystems which, unlike XFS and
ext4, can't support uninitialized allocated blocks efficiently. We are
supporting several such interoperable filesystems (NTFS, exFAT, FAT) where
changing the specification is unfortunately not possible.

There is real user need for this, even after the potential security
consequences are explained. Typical usage scenarios use a large file as a
container for an application which tracks free/used blocks itself. Windows
supports this feature via SetFileValidData() if an extra privilege is
granted.

The performance gain can be fairly large on embedded systems using low-end
storage and CPUs. In one of our cases it took 5 days vs. 12 minutes to
fully set up a large file for use.

Regards,
Szaka