2010-12-13 05:39:40

by Kazuya Mio

Subject: [PATCH] ext4: Fix 32bit overflow in ext4_ext_find_goal()

Hi,

ext4_ext_find_goal() returns an ideal physical block number that the block
allocator tries to allocate first. However, if the requested file offset is
smaller than the existing extent's starting offset, ext4_ext_find_goal()
returns a wrong block number because the 32-bit subtraction
"block - le32_to_cpu(ex->ee_block)" may overflow. This patch fixes the problem.

ext4_ext_find_goal() will also return a wrong block number if the file
offset of the existing extent is too large. In that case, the ideal
physical block number is corrected in ext4_mb_initialize_context(),
so it is not a problem.
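
The wraparound described above can be reproduced outside the kernel. The
following is only a userspace sketch: fsblk_t and lblk_t are stand-ins for
ext4_fsblk_t (64-bit) and ext4_lblk_t (32-bit), the function names are made
up for illustration, and it assumes a platform where unsigned int is 32 bits
wide. The numbers are taken from the reproducer below (existing extent at
logical block 128, physical block 67456, new write at logical block 0).

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the kernel typedefs; names are assumptions of this sketch. */
typedef uint64_t fsblk_t;	/* plays the role of ext4_fsblk_t */
typedef uint32_t lblk_t;	/* plays the role of ext4_lblk_t */

/*
 * Old expression: the subtraction is carried out in 32 bits, so it wraps
 * when block < ext_block and a huge positive offset gets added.
 */
static fsblk_t goal_before(fsblk_t ext_pblk, lblk_t ext_block, lblk_t block)
{
	return ext_pblk + (block - ext_block);
}

/* Patched logic: always subtract the smaller logical block from the larger. */
static fsblk_t goal_after(fsblk_t ext_pblk, lblk_t ext_block, lblk_t block)
{
	if (block > ext_block)
		return ext_pblk + (block - ext_block);
	else
		return ext_pblk - (ext_block - block);
}

/* Worked example: extent at logical 128 / physical 67456, write at block 0. */
static void check_goal_example(void)
{
	/* 0 - 128 wraps to 4294967168, which is then added to 67456. */
	assert(goal_before(67456, 128, 0) == 67456ULL + 4294967168ULL);
	/* The patched version yields the block 128 before the extent. */
	assert(goal_after(67456, 128, 0) == 67328);
}
```

The goal is only a hint to the allocator, so the physical block that ends up
being allocated need not equal the value computed here.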

How to reproduce:
# dd if=/dev/zero of=/mnt/mp1/tmp bs=127M count=1 oflag=sync
# dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 seek=1 oflag=sync
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
 ext logical physical expected length flags
   0     128    67456             128 eof
/mnt/mp1/file: 2 extents found
# rm -rf /mnt/mp1/tmp
# echo $((512*4096)) > /sys/fs/ext4/loop0/mb_stream_req
# dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 oflag=sync conv=notrunc

result (linux-2.6.37-rc2 + ext4 patch queue):
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0    33280             128
   1     128    67456    33407    128 eof
/mnt/mp1/file: 2 extents found

result (with this patch applied):
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0    66560             128
   1     128    67456    66687    128 eof
/mnt/mp1/file: 2 extents found

Signed-off-by: Kazuya Mio <[email protected]>
---
fs/ext4/extents.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0554c48..ef76b70 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -119,9 +119,15 @@ static ext4_fsblk_t ext4_ext_find_goal(struct inode *inode,

 		/* try to predict block placement */
 		ex = path[depth].p_ext;
-		if (ex)
-			return (ext4_ext_pblock(ex) +
-				(block - le32_to_cpu(ex->ee_block)));
+		if (ex) {
+			ext4_fsblk_t ext_pblk = ext4_ext_pblock(ex);
+			ext4_lblk_t ext_block = le32_to_cpu(ex->ee_block);
+
+			if (block > ext_block)
+				return (ext_pblk + (block - ext_block));
+			else
+				return (ext_pblk - (ext_block - block));
+		}
 
 		/* it looks like index is empty;
 		 * try to find starting block from index itself */


2011-01-03 02:49:49

by Theodore Ts'o

Subject: Re: ext4: Fix 32bit overflow in ext4_ext_find_goal()

On Sun, Dec 12, 2010 at 07:37:54PM -0000, Kazuya Mio wrote:
> Hi,
>
> ext4_ext_find_goal() returns an ideal physical block number that the block
> allocator tries to allocate first. However, if the requested file offset is
> smaller than the existing extent's starting offset, ext4_ext_find_goal()
> returns a wrong block number because the 32-bit subtraction
> "block - le32_to_cpu(ex->ee_block)" may overflow. This patch fixes the problem.

Thanks, applied. One comment which I've added to the code:

The block placement algorithm in this section of code assumes that we
are filling in a file which will eventually be non-sparse --- e.g., in
the case of libbfd writing an ELF object's sections out-of-order but in
a way that eventually results in a contiguous object or executable
file, or the old BSD dbm library writing dbm files. However, this is
actually somewhat non-ideal if we are writing a sparse file such as
qemu or KVM writing a raw image file, as it will result in the free
space getting unnecessarily fragmented. Maybe we should have some
heuristics to determine whether we are in the first or second case?

I don't currently think using raw image files is that common in most
virtualization applications, but if someone can think of some common
use cases where we would care, it might be worth adding either some
heuristics to detect this, or perhaps some way that userspace can pass
a hint to the file system that what we're doing is writing a raw
sparse file. For now I'm going to consider the first scenario more
common than the second....

- Ted

2011-01-03 03:36:12

by Andreas Dilger

Subject: Re: ext4: Fix 32bit overflow in ext4_ext_find_goal()

It was written that way because HPC applications writing to a shared file normally write to an offset of task_num * task_data_size so they do not overlap, and end up with a dense file. Similarly, bittorrent and parallel FTP clients will write dense files after seeking randomly around the file, and database files often end up dense as well.

I'd rather fix the relatively few applications that expect permanently sparse files to use fadvise() to notify the kernel of this.

Cheers, Andreas

On 2011-01-02, at 14:40, Ted Ts'o <[email protected]> wrote:

> On Sun, Dec 12, 2010 at 07:37:54PM -0000, Kazuya Mio wrote:
>> Hi,
>>
>> ext4_ext_find_goal() returns an ideal physical block number that the block
>> allocator tries to allocate first. However, if the requested file offset is
>> smaller than the existing extent's starting offset, ext4_ext_find_goal()
>> returns a wrong block number because the 32-bit subtraction
>> "block - le32_to_cpu(ex->ee_block)" may overflow. This patch fixes the problem.
>
> Thanks, applied. One comment which I've added to the code:
>
> The block placement algorithm in this section of code assumes that we
> are filling in a file which will eventually be non-sparse --- e.g., in
> the case of libbfd writing an ELF object's sections out-of-order but in
> a way that eventually results in a contiguous object or executable
> file, or the old BSD dbm library writing dbm files. However, this is
> actually somewhat non-ideal if we are writing a sparse file such as
> qemu or KVM writing a raw image file, as it will result in the free
> space getting unnecessarily fragmented. Maybe we should have some
> heuristics to determine whether we are in the first or second case?
>
> I don't currently think using raw image files is that common in most
> virtualization applications, but if someone can think of some common
> use cases where we would care, it might be worth adding either some
> heuristics to detect this, or perhaps some way that userspace can pass
> a hint to the file system that what we're doing is writing a raw
> sparse file. For now I'm going to consider the first scenario more
> common than the second....
>
> - Ted

2011-01-03 04:12:05

by Theodore Ts'o

Subject: Re: ext4: Fix 32bit overflow in ext4_ext_find_goal()

On Sun, Jan 02, 2011 at 08:35:39PM -0700, Andreas Dilger wrote:
> It was written that way because HPC applications writing to a shared
> file normally write to an offset of task_num * task_data_size so
> they do not overlap, and end up with a dense file. Similarly,
> bittorrent and parallel FTP clients will write dense files after
> seeking randomly around the file, and database files often end up
> dense as well.
>
> I'd rather fix the relatively few applications that expect
> permanently sparse files to use fadvise() to notify the kernel of
> this.

Agreed, and I'm not sure there are enough applications that expect
permanently sparse files to make it worth adding a new fadvise(). But if
we do add a new fadvise(), the default should clearly be the current
behavior.

If someone knows of use cases where permanently sparse files are
common, please let us know!

- Ted

2011-01-03 07:02:56

by Amir Goldstein

Subject: Re: ext4: Fix 32bit overflow in ext4_ext_find_goal()

On Mon, Jan 3, 2011 at 6:11 AM, Ted Ts'o <[email protected]> wrote:
> On Sun, Jan 02, 2011 at 08:35:39PM -0700, Andreas Dilger wrote:
>> It was written that way because HPC applications writing to a shared
>> file normally write to an offset of task_num * task_data_size so
>> they do not overlap, and end up with a dense file. Similarly,
>> bittorrent and parallel FTP clients will write dense files after
>> seeking randomly around the file, and database files often end up
>> dense as well.
>>
>> I'd rather fix the relatively few applications that expect
>> permanently sparse files to use fadvise() to notify the kernel of
>> this.
>
> Agreed, and I'm not sure there are enough applications that expect
> permanently sparse files to make it worth adding a new fadvise(). But if
> we do add a new fadvise(), the default should clearly be the current
> behavior.
>
> If someone knows of use cases where permanently sparse files are
> common, please let us know!
>

Well, there's e2image of course, but using the qcow2 format is a better
solution than fadvise in this case.

Also, I believe that if one chooses to use a VM with the raw image format,
it is mostly for the purpose of read performance, which implies that
the image was fallocate'd.
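
For context, "fallocate'd" here means that the blocks of the image are
reserved up front, so the file is never sparse and the placement question
does not arise. A minimal sketch of such preallocation, assuming the POSIX
posix_fallocate() call; the helper name preallocate_image() and its
signature are hypothetical and not taken from any VM tool:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or open) a raw image at `path` and reserve `size` bytes for it.
 * Returns 0 on success or an errno-style error number on failure.
 */
static int preallocate_image(const char *path, off_t size)
{
	int fd, err;

	assert(path != NULL);

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return errno;

	/* posix_fallocate() returns the error number directly; it does not set errno. */
	err = posix_fallocate(fd, 0, size);
	close(fd);
	return err;
}
```

After a successful call the file size is at least `size` and the space is
allocated, so later writes inside that range do not need new blocks.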

Am I wrong about this?

Amir.

2011-01-03 15:36:20

by Rogier Wolff

Subject: Re: ext4: Fix 32bit overflow in ext4_ext_find_goal()

On Sun, Jan 02, 2011 at 04:40:31PM -0500, Ted Ts'o wrote:
> On Sun, Dec 12, 2010 at 07:37:54PM -0000, Kazuya Mio wrote:
> > Hi,
> >
> > ext4_ext_find_goal() returns an ideal physical block number that the block
> > allocator tries to allocate first. However, if the requested file offset is
> > smaller than the existing extent's starting offset, ext4_ext_find_goal()
> > returns a wrong block number because the 32-bit subtraction
> > "block - le32_to_cpu(ex->ee_block)" may overflow. This patch fixes the problem.
>
> Thanks, applied. One comment which I've added to the code:
>
> The block placement algorithm in this section of code assumes that we
> are filling in a file which will eventually be non-sparse --- e.g., in
> the case of libbfd writing an ELF object's sections out-of-order but in
> a way that eventually results in a contiguous object or executable
> file, or the old BSD dbm library writing dbm files. However, this is
> actually somewhat non-ideal if we are writing a sparse file such as
> qemu or KVM writing a raw image file, as it will result in the free
> space getting unnecessarily fragmented. Maybe we should have some
> heuristics to determine whether we are in the first or second case?

Heuristics are good, if they can be made to work.

Otherwise, a file attribute that can be set using an ioctl could be
used. Now of course you don't want to be changing all your POSIX apps
to start calling that filesystem-specific ioctl.....

But how about: it's a little flag on a directory. So you can tag a
directory as "files will be sparse" or "files will fill in
eventually". All files created below a directory will inherit the
flag.

> I don't currently think using raw image files is that common in most
> virtualization applications, but if someone can think of some common

I do data recovery. Instead of writing one big 500GB file, we still use
500 files of 1GB each, dating from the time when Linux didn't really
support files > 2GB that well.

Anyway, sometimes we read the disk "out of order", so we'll write the
image files out-of-order.

The difficult part is that the drives we're mirroring will contain
patches of zeroes. Those will never get written.

Sometimes we don't manage to recover a lot of the patient drive. So
then the images will remain mostly empty. Other times we first recover
all the sectors that contain files, and only then recover the
remaining sectors.

So I'm not sure which strategy would benefit us most....

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2011-01-03 17:26:03

by Eric Sandeen

Subject: Re: ext4: Fix 32bit overflow in ext4_ext_find_goal()

On 01/02/2011 10:11 PM, Ted Ts'o wrote:
> On Sun, Jan 02, 2011 at 08:35:39PM -0700, Andreas Dilger wrote:
>> It was written that way because HPC applications writing to a shared
>> file normally write to an offset of task_num * task_data_size so
>> they do not overlap, and end up with a dense file. Similarly,
>> bittorrent and parallel FTP clients will write dense files after
>> seeking randomly around the file, and database files often end up
>> dense as well.
>>
>> I'd rather fix the relatively few applications that expect
>> permanently sparse files to use fadvise() to notify the kernel of
>> this.
>
> Agreed, and I'm not sure there are enough applications that expect
> permanently sparse files to make it worth adding a new fadvise(). But if
> we do add a new fadvise(), the default should clearly be the current
> behavior.
>
> If someone knows of use cases where permanently sparse files are
> common, please let us know!

RPM database files stay sparse (Berkeley DB)

$ pwd
/var/lib/rpm
$ ls -lh Basenames; du -h Basenames
-rw-r--r-- 1 rpm rpm 11M Dec 8 12:55 Basenames
9.1M Basenames

etc.

-Eric