2013-06-03 19:00:30

by 宋柏翰

Subject: Question on delalloc

Hi everybody,

I am new to ext4 and doing research on Android with ext4 as file
system. These days, I have a question on ext4's delayed allocation
against ext4_sync_file.
I have learned that delalloc won't guarantee file data's integrity on
power failure, since those delayed allocated buffer heads won't be
handled by jbd2. In order to protect data, user programs need to fsync
those files to be secured.
But I don't understand how ext4_sync_file writes that delalloc'd
data to disk.
This is how I traced it.
I split ext4_sync_file into roughly three parts where I think I/O
could happen:
1. filemap_write_and_wait_range
2. ext4_flush_completed_IO
3. ext4_force_commit or jbd2_log_start_commit

Since we know that jbd2 doesn't handle delalloc'd data, part 3
can be excluded.
Also, after I traced into filemap_write_and_wait_range, I found that it
eventually calls ext4_writepage to do most of the work, and its
comment says "We don't do any block allocation in this function."
It redirties the page and does nothing else whenever it finds that a
page has delayed or unwritten buffer heads.
Last, I found that ext4_flush_completed_IO won't do anything, since most
of the time list_empty(&ei->i_completed_io_list) holds.

So, can anyone kindly shed any light on my question, or point out my mistakes?

thanks,

Sung Po-Han


2013-06-04 05:22:07

by Theodore Ts'o

Subject: Re: Question on delalloc

On Tue, Jun 04, 2013 at 03:00:29AM +0800, 宋柏翰 wrote:
> Also, after I traced into filemap_write_and_wait_range, I found that it
> eventually calls ext4_writepage to do most of the work, and its
> comment says "We don't do any block allocation in this function."
>
> So, can anyone kindly shed any light on my question, or point out my mistakes?

There are three possible address_space_operations structures that can
be used for ext4 files. They are ext4_aops, ext4_journalled_aops, and
ext4_da_aops. It is the last one which is used for delayed allocation
files, and in that case filemap_write_and_wait_range will use
ext4_da_writepages().

These days, if there is a writepages function, the writepage function
is not used at all. It used to be used for direct reclaim, but that
has been replaced by I/O-less reclaim.
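
The preference for writepages over writepage can be illustrated with a small userspace toy model. This is only a sketch: struct toy_aops, toy_do_writepages() and the two callbacks are hypothetical names invented for illustration, and they only mimic the shape of the kernel's dispatch (use ->writepages when it is set, otherwise fall back to calling ->writepage page by page). It is not kernel code.

```c
#include <stddef.h>

/* Hypothetical stand-in for address_space_operations (illustration only). */
struct toy_aops {
    int (*writepage)(int page);
    int (*writepages)(int first, int count);
};

/* Call counters so the dispatch can be observed. */
int writepage_calls;
int writepages_calls;

/* Toy per-page writer: just counts how often it is invoked. */
int toy_writepage(int page)
{
    (void)page;
    writepage_calls++;
    return 0;
}

/* Toy batch writer: handles the whole range in one call. */
int toy_writepages(int first, int count)
{
    (void)first;
    (void)count;
    writepages_calls++;
    return 0;
}

/* If a batch writer exists, the per-page writer is never consulted. */
int toy_do_writepages(const struct toy_aops *ops, int first, int count)
{
    if (ops->writepages)
        return ops->writepages(first, count);

    for (int i = 0; i < count; i++) {
        int ret = ops->writepage(first + i);
        if (ret)
            return ret;
    }
    return 0;
}
```

With an ops table that sets writepages (as ext4_da_aops does with ext4_da_writepages), the writepage callback is never reached, which is why tracing into ext4_writepage gives a misleading picture for delalloc files.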

- Ted

2013-06-04 14:26:30

by Eric Sandeen

Subject: Re: Question on delalloc

On 6/3/13 2:00 PM, 宋柏翰 wrote:
> Hi everybody,
>
> I am new to ext4 and doing research on Android with ext4 as file
> system. These days, I have a question on ext4's delayed allocation
> against ext4_sync_file.
> I have learned that delalloc won't guarantee file data's integrity on
> power failure, since those delayed allocated buffer heads won't be
> handled by jbd2.

I just want to address your first assertion regarding delalloc:

> In order to protect data, user programs need to fsync
> those files to be secured.

This is true with or without delayed allocation.

With delayed allocation, the blocks are chosen at the time of the IO.
Without delayed allocation, the blocks are chosen at the write syscall time.

But in both cases, data is only in memory after the write(), and is not
guaranteed to be on disk until an fsync or similar data integrity syscall.
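
A minimal userspace sketch of that rule, assuming a POSIX system: the helper name write_durably() is hypothetical, and the only point it makes is that the function reports success only after fsync() has returned, never after write() alone.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path and report success only
 * after fsync() has succeeded. Returns 0 on success, -1 on error.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            close(fd);
            return -1;
        }
        p += n;                 /* handle short writes */
        len -= (size_t)n;
    }

    /* At this point the data may still exist only in memory. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The same code is needed whether or not the file system uses delayed allocation; delalloc only changes when blocks are chosen, not what write() promises.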

http://lwn.net/Articles/457667/

-Eric

2013-06-04 17:30:22

by 宋柏翰

Subject: Re: Question on delalloc

Hi Eric,
I've found a mistake I made. I had taken for granted that ext4 with
an ordered-mode journal would have ext4_writepage do the writing.
But with delalloc enabled, no matter which journal mode one uses, ext4
will use ext4_da_writepages to handle writing those delayed buffer
heads.
Thanks for your advice, I see what you meant :) What I tried to say,
though, is that delalloc'd data will stop kjournald from writing it
back when it would like to do so, and therefore delalloc'd data will
stay in memory longer than non-delalloc'd data, as described in
http://lwn.net/Articles/322823/. Sorry for my unclear statement.

Sung Po-Han


2013/6/4 Eric Sandeen <[email protected]>:
> On 6/3/13 2:00 PM, 宋柏翰 wrote:
>> Hi everybody,
>>
>> I am new to ext4 and doing research on Android with ext4 as file
>> system. These days, I have a question on ext4's delayed allocation
>> against ext4_sync_file.
>> I have learned that delalloc won't guarantee file data's integrity on
>> power failure, since those delayed allocated buffer heads won't be
>> handled by jbd2.
>
> I just want to address your first assertion regarding delalloc:
>
>> In order to protect data, user programs need to fsync
>> those files to be secured.
>
> This is true with or without delayed allocation.
>
> With delayed allocation, the blocks are chosen at the time of the IO.
> Without delayed allocation, the blocks are chosen at the write syscall time.
>
> But in both cases, data is only in memory after the write(), and is not
> guaranteed to be on disk until an fsync or similar data integrity syscall.
>
> http://lwn.net/Articles/457667/
>
> -Eric

2013-06-04 19:01:54

by Theodore Ts'o

Subject: Re: Question on delalloc

On Wed, Jun 05, 2013 at 01:30:22AM +0800, 宋柏翰 wrote:
> Though what I tried to
> say is that delalloc'd data will stop kjournald from writing it
> back when it would like to do so, and therefore delalloc'd data will
> stay in memory longer than non-delalloc'd data.

That's not the reason data can stay in memory longer when delayed
allocation is enabled than when delalloc is disabled (or when ext3 is
used). So you were wrong about why this happens, although the
observation is correct.

> As described in http://lwn.net/Articles/322823/.

Please note that the current behaviour vis-a-vis buffered writes and
when they will be written to disk is generally true for all modern
file systems: xfs, btrfs, and ext4. The workaround described at the
end of the above article has been adopted by all of the modern file
systems, as a workaround for buggy user space applications (which at
one point included core libraries for both GNOME and KDE); this
workaround, however, is __not__ guaranteed by POSIX, and there are other
operating systems, such as MacOS X, where you can't count on
these semantics.
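
For applications that replace a file by writing a new copy and renaming it over the old one, the way to avoid depending on the file system workaround is to fsync the new file before the rename. The following is a minimal sketch assuming a POSIX system; replace_file() and its tmp_path argument are hypothetical names chosen for illustration, and syncing the containing directory (to make the rename itself durable) is left out to keep it short.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace path with new contents by writing
 * tmp_path, fsync()ing it, and only then renaming it over path.
 * Returns 0 on success, -1 on error.
 */
int replace_file(const char *path, const char *tmp_path,
                 const void *buf, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted before writing: retry */
        if (n < 0)
            goto fail;
        p += n;                 /* handle short writes */
        len -= (size_t)n;
    }

    /* Data must be on disk before the new name can point at it. */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }

    /* rename() swaps the name; readers see the old or the new file. */
    if (rename(tmp_path, path) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp_path);
    return -1;
}
```

Skipping the fsync() step is what makes an application rely on the non-POSIX behaviour described above.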

(In fact, on MacOS fsync() is by default even weaker than what POSIX
guarantees; you need to use fcntl(F_FULLFSYNC) to get POSIX guarantees.)
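
A portable wrapper for this might look like the sketch below. The function name full_sync() is hypothetical; the fcntl command is spelled F_FULLFSYNC in <fcntl.h> on MacOS, and the sketch assumes that falling back to plain fsync() is acceptable when that command is unavailable or fails.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: request the strongest flush the platform offers.
 * Returns 0 on success, -1 on error.
 */
int full_sync(int fd)
{
#ifdef F_FULLFSYNC
    /* Defined on MacOS: asks the drive to flush its own cache too. */
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
    /* Not supported for this file: fall back to fsync() below. */
#endif
    return fsync(fd);
}
```

On Linux the #ifdef branch is compiled out and the helper is just fsync().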

More context can be found here:

http://blahg.josefsipek.net/?p=364

- Ted