Hi,
I am trying to clean up the ext3_ordered_writepage() and
ext3_writeback_writepage() routines. I am confused about what
ext3_ordered_writepage() is currently doing, and I hope you can help me
understand it a little better.
1) Why do we do journal_start() all the time? I was hoping to skip
journal_start()/journal_stop() if the blocks are already allocated.
Since we allocate blocks in prepare_write() in most cases (non-mapped
writes), I was hoping to avoid the whole journal machinery in
writepage() when the blocks are already there (we can check the buffers
attached to the page and find out whether they are mapped or not; see
the sketch below).
2) Why do we call journal_dirty_data_fn() on the buffers? We already
issued IO on all those buffers in block_write_full_page(). Why do we
need to add them to the transaction? I understand we need to do this
for the block-allocation case, but do we need it for the non-allocation
case as well? Can we skip the whole journal_start(),
journal_dirty_data(), journal_stop() sequence for non-allocation cases?
I have coded this up, but I am confused about what I am missing here.
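Regarding the check mentioned in (1), what I have in mind is roughly
the following (just a sketch using the standard buffer-head helpers,
not the actual patch; the helper name is made up):

static int page_fully_mapped(struct page *page)
{
	struct buffer_head *bh, *head;

	if (!page_has_buffers(page))
		return 0;

	bh = head = page_buffers(page);
	do {
		/* Any unmapped buffer means writepage() still needs to
		 * allocate blocks, and hence needs the journal. */
		if (!buffer_mapped(bh))
			return 0;
		bh = bh->b_this_page;
	} while (bh != head);

	return 1;
}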
Please let me know.
Thanks,
Badari
Badari Pulavarty <[email protected]> wrote:
>
> Hi,
>
> I am trying to cleanup ext3_ordered and ext3_writeback_writepage() routines.
> I am confused on what ext3_ordered_writepage() is currently doing ? I hope
> you can help me understand it little better.
>
> 1) Why do we do journal_start() all the time ?
Because we're lame.
> 2) Why do we call journal_dirty_data_fn() on the buffers ? We already
> issued IO on all those buffers() in block_full_write_page(). Why do we
> need to add them to transaction ? I understand we need to do this for
> block allocation case. But do we need it for non-allocation case also ?
Yup. Ordered-mode JBD commit needs to write and wait upon all dirty
file-data buffers prior to journalling the metadata. If we didn't run
journal_dirty_data_fn() against those buffers then they'd still be under
I/O after commit had completed.
Consequently a crash+recover would occasionally allow a read() to read
uninitialised data blocks - those blocks for which we had a) started the
I/O, b) journalled the metadata which refers to that block and c) not yet
completed the I/O when the crash happened.
Now, if the write was known to be an overwrite then we know that the block
isn't uninitialised, so we could perhaps avoid writing that block in the
next commit - just let pdflush handle it. We'd need to work out whether a
particular block has initialised data on-disk under it when we dirty it,
then track that all the way through to writepage. It's all possible, but it
adds a significant semantic change to ordered-mode.
If that change offered significant performance benefits then yeah, we could
scratch our heads over it.
But I think if you're looking for CPU consumption reductions, you'd be much
better off attacking prepare_write() and commit_write(), rather than
writepage().
Andrew Morton wrote:
>Badari Pulavarty <[email protected]> wrote:
>
>>Hi,
>>
>>I am trying to cleanup ext3_ordered and ext3_writeback_writepage() routines.
>>I am confused on what ext3_ordered_writepage() is currently doing ? I hope
>>you can help me understand it little better.
>>
>>1) Why do we do journal_start() all the time ?
>>
>
>Because we're lame.
>
>>2) Why do we call journal_dirty_data_fn() on the buffers ? We already
>>issued IO on all those buffers() in block_full_write_page(). Why do we
>>need to add them to transaction ? I understand we need to do this for
>>block allocation case. But do we need it for non-allocation case also ?
>>
>
>Yup. Ordered-mode JBD commit needs to write and wait upon all dirty
>file-data buffers prior to journalling the metadata. If we didn't run
>journal_dirty_data_fn() against those buffers then they'd still be under
>I/O after commit had completed.
>
In the non-block-allocation case, what metadata are we journaling in
writepage()?
Block allocation happened in prepare_write(), and commit_write()
journaled the transaction; all the metadata updates should be done
there. What JBD commit are you referring to here?
>
>But I think if you're looking for CPU consumption reductions, you'd be much
>better off attacking prepare_write() and commit_write(), rather than
>writepage().
>
Yes, you are right. I never realized that we call
prepare_write()/commit_write() for each write; I was under the
impression that we do it only on the first instantiation of the page.
I will take a closer look at it.
The reasons for looking at writepage() are:
- I want to support writepages() for ext3. Last time I tried, I ran
into a page lock vs. journal_start() deadlock. That's why I want to
understand the journalling better and clean it up while looking at it.
- Eventually, I want to add delayed allocation to make use of
multiblock allocation.
Right now, we can't make use of multiblock allocation in buffered mode :(
Thanks,
Badari
> >Yup. Ordered-mode JBD commit needs to write and wait upon all dirty
> >file-data buffers prior to journalling the metadata. If we didn't run
> >journal_dirty_data_fn() against those buffers then they'd still be under
> >I/O after commit had completed.
> >
> In non-block allocation case, what metadata are we journaling in
> writepage() ?
> block allocation happend in prepare_write() and commit_write() journaled the
> transaction. All the meta data updates should be done there. What JBD
> commit are you refering to here ?
Basically, this boils down to: what is our definition of ordered-mode?
If the goal is to make sure we avoid the security exposure of
allocating a block and then crashing before we write the data block,
potentially exposing previously written data that might belong to
another user, then what Badari is suggesting would avoid this
particular problem.
However, if what we are doing is overwriting our own data with an
updated, more recent version of the data block, do we guarantee that
any ordering semantics apply? For example, what if we write a data
block, and then follow it up with some kind of metadata update (say we
touch atime, or add an extended attribute)? Do we guarantee that, if
the metadata update is committed, the data block will have made it to
disk as well? Today that is the way things work, but is that
guarantee part of the contract of ordered-mode?
- Ted
On Thu, 2006-03-16 at 13:09 -0500, Theodore Ts'o wrote:
> > >Yup. Ordered-mode JBD commit needs to write and wait upon all dirty
> > >file-data buffers prior to journalling the metadata. If we didn't run
> > >journal_dirty_data_fn() against those buffers then they'd still be under
> > >I/O after commit had completed.
> > >
> > In non-block allocation case, what metadata are we journaling in
> > writepage() ?
> > block allocation happend in prepare_write() and commit_write() journaled the
> > transaction. All the meta data updates should be done there. What JBD
> > commit are you refering to here ?
>
> Basically, this boils down to what is our definition of ordered-mode?
>
> If the goal is to make sure we avoid the security exposure of
> allocating a block and then crashing before we write the data block,
> potentially exposing previously written data that might be belong to
> another user, then what Badari is suggesting would avoid this
> particular problem.
Yes, if block allocation is needed, my patch is basically a no-op;
we go through the regular code.
> However, if what we are doing is overwriting our own data with more an
> updated, more recent version of the data block, do we guarantee that
> any ordering semantics apply? For example, what if we write a data
> block, and then follow it up with some kind of metadata update (say we
> touch atime, or add an extended attribute). Do we guarantee that if
> the metadata update is committed, that the data block will have made
> it to disk as well?
I don't see how we do this today. Yes, metadata updates are journalled,
but I don't see how the current approach of adding buffers through the
journal_dirty_data(bh) call can guarantee that these buffers get added
to the metadata-update transaction.
> Today that is the way things work, but is that
> guarantee part of the contract of ordered-mode?
BTW, thanks Ted for putting this in human-readable terms :)
Thanks,
Badari
On Thu, Mar 16, 2006 at 10:22:40AM -0800, Badari Pulavarty wrote:
> > However, if what we are doing is overwriting our own data with more an
> > updated, more recent version of the data block, do we guarantee that
> > any ordering semantics apply? For example, what if we write a data
> > block, and then follow it up with some kind of metadata update (say we
> > touch atime, or add an extended attribute). Do we guarantee that if
> > the metadata update is committed, that the data block will have made
> > it to disk as well?
>
> I don't see how we do this today. Yes. Metadata updates are jounalled,
> but I don't see how current adding buffers through journal_dirty_data
> (bh) call can guarantee that these buffers get added to metadata-update
> transaction ?
Even if there aren't any updates to any metadata blocks between the
journal_start() and journal_stop() calls, if journal_dirty_data() is
called (for example in ordered_writepage), those buffers will be
associated with the currently open transaction, so they are guaranteed
to be written before the transaction is allowed to commit.
Remember, journal_start and journal_stop do not delineate a full
ext3/jbd transaction, but rather an operation; a large number of
operations are bundled together to form a transaction. When you call
journal_start and request a certain number of credits (the maximum
number of buffers that you intend to dirty), that opens up an
operation. If the operation turns out not to have dirtied any metadata
blocks by the time of journal_stop(), all of the credits that were
reserved by journal_start() are returned to the currently open
transaction. However, any data blocks which are marked via
journal_dirty_data() are still going to be associated with the
currently open transaction, and they will still be forced out before
the transaction is allowed to commit.
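In code, the pattern looks roughly like this (a simplified sketch of
the JBD calls, not the actual ext3 code; the function is only
illustrative):

static int order_one_buffer(journal_t *journal, struct buffer_head *bh)
{
	handle_t *handle;
	int err, err2;

	/* Open an operation, reserving one credit; this joins the
	 * currently running transaction (or starts a new one). */
	handle = journal_start(journal, 1);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	/* Attach the data buffer to the running transaction; ordered-mode
	 * commit will write and wait on it before the commit record. */
	err = journal_dirty_data(handle, bh);

	/* Close the operation.  Unused credits go back to the transaction,
	 * but the buffer stays attached until the transaction commits. */
	err2 = journal_stop(handle);

	return err ? err : err2;
}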
Does that make sense?
- Ted
On Thu, 2006-03-16 at 16:04 -0500, Theodore Ts'o wrote:
> On Thu, Mar 16, 2006 at 10:22:40AM -0800, Badari Pulavarty wrote:
> > > However, if what we are doing is overwriting our own data with more an
> > > updated, more recent version of the data block, do we guarantee that
> > > any ordering semantics apply? For example, what if we write a data
> > > block, and then follow it up with some kind of metadata update (say we
> > > touch atime, or add an extended attribute). Do we guarantee that if
> > > the metadata update is committed, that the data block will have made
> > > it to disk as well?
> >
> > I don't see how we do this today. Yes. Metadata updates are jounalled,
> > but I don't see how current adding buffers through journal_dirty_data
> > (bh) call can guarantee that these buffers get added to metadata-update
> > transaction ?
>
> Even though there aren't any updates to any metadata blocks that take
> place between the journal_start() and journal_stop() calls, if
> journal_dirty_data() is called (for example in ordered_writepage),
> those buffers will be associated with the currently open transaction,
> so they will be guaranteed to be written before the transaction is
> allowed to commit.
>
> Remember, journal_start and journal_stop do not delineate a full
> ext3/jbd transaction, but rather an operation, where a large number of
> operations are bundled together to form a transaction. When you call
> journal_start, and request a certain number of credits (number of
> buffers that you maximally intend to dirty), that opens up an
> operation. If the operation turns out not to dirty any metadata
> blocks at the time of journal_stop(), all of the credits that were
> reserved by jouranl_start() are returned to the currently open
> transaction. However, any data blocks which are marked via
> journal_dirty_data() are still going to be associated with the
> currently open transaction, and they will still be forced out before
> the transaction is allowed to commit.
>
> Does that make sense?
Makes perfect sense, except that it doesn't match what I see through
"debugfs" logdump :(
I wrote a testcase that re-writes the same blocks again and again, and
absolutely nothing showed up in the log. That implied to me that all
the journal_dirty_data() calls we did on those buffers did nothing,
since there is no current transaction. What am I missing?
Thanks,
Badari
> On Thu, 2006-03-16 at 16:04 -0500, Theodore Ts'o wrote:
> > On Thu, Mar 16, 2006 at 10:22:40AM -0800, Badari Pulavarty wrote:
> > > > However, if what we are doing is overwriting our own data with more an
> > > > updated, more recent version of the data block, do we guarantee that
> > > > any ordering semantics apply? For example, what if we write a data
> > > > block, and then follow it up with some kind of metadata update (say we
> > > > touch atime, or add an extended attribute). Do we guarantee that if
> > > > the metadata update is committed, that the data block will have made
> > > > it to disk as well?
> > >
> > > I don't see how we do this today. Yes. Metadata updates are jounalled,
> > > but I don't see how current adding buffers through journal_dirty_data
> > > (bh) call can guarantee that these buffers get added to metadata-update
> > > transaction ?
> >
> > Even though there aren't any updates to any metadata blocks that take
> > place between the journal_start() and journal_stop() calls, if
> > journal_dirty_data() is called (for example in ordered_writepage),
> > those buffers will be associated with the currently open transaction,
> > so they will be guaranteed to be written before the transaction is
> > allowed to commit.
> >
> > Remember, journal_start and journal_stop do not delineate a full
> > ext3/jbd transaction, but rather an operation, where a large number of
> > operations are bundled together to form a transaction. When you call
> > journal_start, and request a certain number of credits (number of
> > buffers that you maximally intend to dirty), that opens up an
> > operation. If the operation turns out not to dirty any metadata
> > blocks at the time of journal_stop(), all of the credits that were
> > reserved by jouranl_start() are returned to the currently open
> > transaction. However, any data blocks which are marked via
> > journal_dirty_data() are still going to be associated with the
> > currently open transaction, and they will still be forced out before
> > the transaction is allowed to commit.
> >
> > Does that make sense?
>
> Makes perfect sense, except that it doesn't match what I see through
> "debugfs" - logdump :(
>
> I wrote a testcase to re-write same blocks again & again - there is
> absolutely nothing showed up in log. Which implied to me that, all
> the jorunal_dirty_data we did on all those buffers, did nothing -
> since there is no current transaction. What am I missing ?
The data buffers are not journaled. The buffers are just attached to the
transaction, and when the transaction is committed, they are written
directly to their final location. This ensures the ordering, but no data
goes via the log... I guess you should see empty transactions in the log
which are eventually committed when they become too old.
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
On Thu, 2006-03-16 at 23:05 +0100, Jan Kara wrote:
> > On Thu, 2006-03-16 at 16:04 -0500, Theodore Ts'o wrote:
> > > On Thu, Mar 16, 2006 at 10:22:40AM -0800, Badari Pulavarty wrote:
> > > > > However, if what we are doing is overwriting our own data with more an
> > > > > updated, more recent version of the data block, do we guarantee that
> > > > > any ordering semantics apply? For example, what if we write a data
> > > > > block, and then follow it up with some kind of metadata update (say we
> > > > > touch atime, or add an extended attribute). Do we guarantee that if
> > > > > the metadata update is committed, that the data block will have made
> > > > > it to disk as well?
> > > >
> > > > I don't see how we do this today. Yes. Metadata updates are jounalled,
> > > > but I don't see how current adding buffers through journal_dirty_data
> > > > (bh) call can guarantee that these buffers get added to metadata-update
> > > > transaction ?
> > >
> > > Even though there aren't any updates to any metadata blocks that take
> > > place between the journal_start() and journal_stop() calls, if
> > > journal_dirty_data() is called (for example in ordered_writepage),
> > > those buffers will be associated with the currently open transaction,
> > > so they will be guaranteed to be written before the transaction is
> > > allowed to commit.
> > >
> > > Remember, journal_start and journal_stop do not delineate a full
> > > ext3/jbd transaction, but rather an operation, where a large number of
> > > operations are bundled together to form a transaction. When you call
> > > journal_start, and request a certain number of credits (number of
> > > buffers that you maximally intend to dirty), that opens up an
> > > operation. If the operation turns out not to dirty any metadata
> > > blocks at the time of journal_stop(), all of the credits that were
> > > reserved by jouranl_start() are returned to the currently open
> > > transaction. However, any data blocks which are marked via
> > > journal_dirty_data() are still going to be associated with the
> > > currently open transaction, and they will still be forced out before
> > > the transaction is allowed to commit.
> > >
> > > Does that make sense?
> >
> > Makes perfect sense, except that it doesn't match what I see through
> > "debugfs" - logdump :(
> >
> > I wrote a testcase to re-write same blocks again & again - there is
> > absolutely nothing showed up in log. Which implied to me that, all
> > the jorunal_dirty_data we did on all those buffers, did nothing -
> > since there is no current transaction. What am I missing ?
> The data buffers are not journaled. The buffers are just attached to the
> transaction and when the transaction is committed, they are written
> directly to their final location. This ensures the ordering but no data
> goes via the log... I guess you should see empty transactions in the log
> which are eventually commited when they become too old.
Yep. I wasn't expecting to see the buffers in the transaction/log. I was
expecting to see some "dummy" transaction to which these buffers are
attached to provide ordering (even though we are not doing metadata
updates). In fact, I was expecting to see the "ctime" update in the
transaction.
Thanks,
Badari
On Thu, Mar 16, 2006 at 03:45:21PM -0800, Badari Pulavarty wrote:
> Yep. I wasn't expecting to see buffers in the transaction/log. I was
> expecting to see some "dummy" transaction - which these buffers are
> attached to provide ordering. (even though we are not doing metadata
> updates). In fact, I was expecting to see "ctime" update in the
> transaction.
What you're missing is that journal_start() and journal_stop() don't
create a transaction. They delimit an operation, yes, but multiple
operations are grouped together to form a transaction. Transactions
are only closed after the commit_interval expires or if the journal runs
out of space. So you're not going to see a dummy transaction which
the buffers are attached to; instead, all of the various operations
happening within the commit_interval are grouped into a single transaction.
This is all explained in Stephen's paper; see page #4 in:
http://ftp.kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz
- Ted
On Mar 16, 2006 15:45 -0800, Badari Pulavarty wrote:
> On Thu, 2006-03-16 at 23:05 +0100, Jan Kara wrote:
> > The data buffers are not journaled. The buffers are just attached to the
> > transaction and when the transaction is committed, they are written
> > directly to their final location. This ensures the ordering but no data
> > goes via the log... I guess you should see empty transactions in the log
> > which are eventually commited when they become too old.
>
> Yep. I wasn't expecting to see buffers in the transaction/log. I was
> expecting to see some "dummy" transaction - which these buffers are
> attached to provide ordering. (even though we are not doing metadata
> updates). In fact, I was expecting to see "ctime" update in the
> transaction.
The U. Wisconsin group that was doing journal-guided fast RAID resync
actually ended up putting dummy transactions, with the block numbers,
into the log for this even if there were no metadata changes.
That way the journal could tell the MD RAID layer what blocks might
need resyncing instead of having to scan the whole block device for
inconsistencies.
That code hasn't been merged, or even posted anywhere yet AFAICS, though
I'd be very interested in seeing it. It changes MD RAID recovery time
from O(device size) to O(journal size), and that is a huge deal when you
have an 8TB filesystem.
As for the ctime update, I'm not sure what happened to that, though
ext3 would currently only update the inode at most once a second.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Theodore Ts'o wrote:
> > >Yup. Ordered-mode JBD commit needs to write and wait upon all dirty
> > >file-data buffers prior to journalling the metadata. If we didn't run
> > >journal_dirty_data_fn() against those buffers then they'd still be under
> > >I/O after commit had completed.
> > >
> > In non-block allocation case, what metadata are we journaling in
> > writepage() ? block allocation happend in prepare_write() and
> > commit_write() journaled the transaction. All the meta data
> > updates should be done there. What JBD commit are you refering to
> > here ?
>
> Basically, this boils down to what is our definition of ordered-mode?
>
> If the goal is to make sure we avoid the security exposure of
> allocating a block and then crashing before we write the data block,
> potentially exposing previously written data that might be belong to
> another user, then what Badari is suggesting would avoid this
> particular problem.
>
> However, if what we are doing is overwriting our own data with more an
> updated, more recent version of the data block, do we guarantee that
> any ordering semantics apply? For example, what if we write a data
> block, and then follow it up with some kind of metadata update (say we
> touch atime, or add an extended attribute). Do we guarantee that if
> the metadata update is committed, that the data block will have made
> it to disk as well? Today that is the way things work, but is that
> guarantee part of the contract of ordered-mode?
That's the wrong way around for uses which check mtimes to revalidate
information about a file's contents.
Local search engines like Beagle come to mind, and also anything
involving "make" or "rsync".
In those cases the mtime update should be committed _before_ the data
begins to be written, not after, so that after a crash, programs will
know those files may not contain the expected data.
A notable example is "rsync". After a power cycle, you may want to
synchronise some files from another machine.
Ideally, running "rsync" to copy from the other machine would do the trick.
But if data is committed to files on the power cycled machine, and
mtime updates for those writes did not get committed, when "rsync" is
later run it will assume those files are already correct and not
update them. The result is that the data is not copied properly in
the way the user expects.
With "rsync" this problem can be avoided using the --checksum option,
but that's not something a person is likely to think of needing when
they assume ext3 provides reasonable consistency guarantees, so that
it's safe to pull the plug.
-- Jamie
Hi,
On Thu, 2006-03-16 at 17:54 -0700, Andreas Dilger wrote:
> That way the journal could tell the MD RAID layer what blocks might
> need resyncing instead of having to scan the whole block device for
> inconsistencies.
The current md layer supports write-intent bitmaps to deal with this.
--Stephen
Hi Stephen,
Now that we have your attention, I am wondering what your opinion on
this is.
I have a patch which eliminates adding buffers to the journal if
we are just re-writing existing disk blocks. In theory it should
be fine, but it does change the current behavior for ordered-mode
writes. I guess the current code adds the buffers to the journal so
that, since metadata updates to any file in the filesystem go through
the journal, our buffers are guaranteed to be flushed out before that
transaction completes.
My patch *breaks* that guarantee, but provides a significant improvement
for the re-write case. My micro benchmark shows:
2.6.16-rc6 2.6.16-rc6+patch
real 0m6.606s 0m3.705s
user 0m0.124s 0m0.108s
sys 0m6.456s 0m3.600s
In the real world, does this ordering guarantee matter? Waiting for your
advice.
Thanks,
Badari
Make use of PageMappedToDisk(page) to find out whether we need to do
block allocation, and skip the calls to it if not needed.
When we are not doing block allocation, also avoid the calls to
journal_start() and adding buffers to the transaction.
Signed-off-by: Badari Pulavarty <[email protected]>
Index: linux-2.6.16-rc6/fs/buffer.c
===================================================================
--- linux-2.6.16-rc6.orig/fs/buffer.c 2006-03-11 14:12:55.000000000 -0800
+++ linux-2.6.16-rc6/fs/buffer.c 2006-03-16 08:22:37.000000000 -0800
@@ -2029,6 +2029,7 @@ static int __block_commit_write(struct i
int partial = 0;
unsigned blocksize;
struct buffer_head *bh, *head;
+ int fullymapped = 1;
blocksize = 1 << inode->i_blkbits;
@@ -2043,6 +2044,8 @@ static int __block_commit_write(struct i
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
}
+ if (!buffer_mapped(bh))
+ fullymapped = 0;
}
/*
@@ -2053,6 +2056,9 @@ static int __block_commit_write(struct i
*/
if (!partial)
SetPageUptodate(page);
+
+ if (fullymapped)
+ SetPageMappedToDisk(page);
return 0;
}
Index: linux-2.6.16-rc6/fs/ext3/inode.c
===================================================================
--- linux-2.6.16-rc6.orig/fs/ext3/inode.c 2006-03-11 14:12:55.000000000 -0800
+++ linux-2.6.16-rc6/fs/ext3/inode.c 2006-03-15 13:30:04.000000000 -0800
@@ -999,6 +999,12 @@ static int ext3_prepare_write(struct fil
handle_t *handle;
int retries = 0;
+ /*
+ * If the page is already mapped to disk and we are not
+ * journalling the data - there is nothing to do.
+ */
+ if (PageMappedToDisk(page) && !ext3_should_journal_data(inode))
+ return 0;
retry:
handle = ext3_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
@@ -1059,8 +1065,14 @@ static int ext3_ordered_commit_write(str
struct inode *inode = page->mapping->host;
int ret = 0, ret2;
- ret = walk_page_buffers(handle, page_buffers(page),
- from, to, NULL, ext3_journal_dirty_data);
+ /*
+ * If the page is already mapped to disk, we won't have
+ * a handle - which means no metadata updates are needed.
+ * So, no need to add buffers to the transaction.
+ */
+ if (handle)
+ ret = walk_page_buffers(handle, page_buffers(page),
+ from, to, NULL, ext3_journal_dirty_data);
if (ret == 0) {
/*
@@ -1075,9 +1087,11 @@ static int ext3_ordered_commit_write(str
EXT3_I(inode)->i_disksize = new_i_size;
ret = generic_commit_write(file, page, from, to);
}
- ret2 = ext3_journal_stop(handle);
- if (!ret)
- ret = ret2;
+ if (handle) {
+ ret2 = ext3_journal_stop(handle);
+ if (!ret)
+ ret = ret2;
+ }
return ret;
}
@@ -1098,9 +1112,11 @@ static int ext3_writeback_commit_write(s
else
ret = generic_commit_write(file, page, from, to);
- ret2 = ext3_journal_stop(handle);
- if (!ret)
- ret = ret2;
+ if (handle) {
+ ret2 = ext3_journal_stop(handle);
+ if (!ret)
+ ret = ret2;
+ }
return ret;
}
@@ -1278,6 +1294,14 @@ static int ext3_ordered_writepage(struct
if (ext3_journal_current_handle())
goto out_fail;
+ /*
+ * If the page is mapped to disk, just do the IO
+ */
+ if (PageMappedToDisk(page)) {
+ ret = block_write_full_page(page, ext3_get_block, wbc);
+ goto out;
+ }
+
handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
if (IS_ERR(handle)) {
@@ -1318,6 +1342,7 @@ static int ext3_ordered_writepage(struct
err = ext3_journal_stop(handle);
if (!ret)
ret = err;
+out:
return ret;
out_fail:
@@ -1337,10 +1362,13 @@ static int ext3_writeback_writepage(stru
if (ext3_journal_current_handle())
goto out_fail;
- handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out_fail;
+ if (!PageMappedToDisk(page)) {
+ handle = ext3_journal_start(inode,
+ ext3_writepage_trans_blocks(inode));
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out_fail;
+ }
}
if (test_opt(inode->i_sb, NOBH))
@@ -1348,9 +1376,11 @@ static int ext3_writeback_writepage(stru
else
ret = block_write_full_page(page, ext3_get_block, wbc);
- err = ext3_journal_stop(handle);
- if (!ret)
- ret = err;
+ if (handle) {
+ err = ext3_journal_stop(handle);
+ if (!ret)
+ ret = err;
+ }
return ret;
out_fail:
Hi,
On Fri, 2006-03-17 at 15:32 +0000, Jamie Lokier wrote:
> That's the wrong way around for uses which check mtimes to revalidate
> information about a file's contents.
It's actually the right way for newly-allocated data: the blocks being
written early are invisible until the mtime update, because the mtime
update is an atomic part of the transaction which links the blocks into
the inode.
> Local search engines like Beagle, and also anything where "make" is
> involved, and "rsync" come to mind.
Make and rsync (when writing, that is) are not usually updating in
place, so they do in fact want the current ordered mode.
It's *only* for updating existing data blocks that there's any
justification for writing mtime first. That's the question here.
There's a significant cost in forcing the mtime to go first: it means
that the VM cannot perform any data writeback for data written by a
transaction until the transaction has first been committed. That's the
last thing you want to be happening under VM pressure, as you may not in
fact be able to close the transaction without first allocating more
memory.
--Stephen
On Fri, Mar 17, 2006 at 04:50:21PM -0500, Stephen C. Tweedie wrote:
>
> It's *only* for updating existing data blocks that there's any
> justification for writing mtime first. That's the question here.
>
> There's a significant cost in forcing the mtime to go first: it means
> that the VM cannot perform any data writeback for data written by a
> transaction until the transaction has first been committed. That's the
> last thing you want to be happening under VM pressure, as you may not in
> fact be able to close the transaction without first allocating more
> memory.
Actually, we're not even able to force the mtime to happen first in
this case. In ordered mode, we still force the data blocks out *first*,
and only later do we force the mtime update out. With Badari's
proposed change, we completely decouple when the data blocks get
written out from the mtime update; it could happen before, or after,
at the OS's convenience.
If the application cares about the precise ordering of data blocks
being written out with respect to the mtime field, I'd respectfully
suggest that the application use data journalling mode --- and note
that most applications which update existing data blocks, especially
relational databases, either don't care about mtime, have their own
data-recovery subsystems, or both.
- Ted
Hi,
On Fri, 2006-03-17 at 13:32 -0800, Badari Pulavarty wrote:
> I have a patch which eliminates adding buffers to the journal, if
> we are doing just re-write of the disk block. ...
> 2.6.16-rc6 2.6.16-rc6+patch
> real 0m6.606s 0m3.705s
OK, that's a really significant win! What exactly was the test case for
this, and does that performance edge persist for a longer-running test?
> In real world, does this ordering guarantee matter ?
Not that I am aware of. Even with the ordering guarantee, there is
still no guarantee of the order in which the writes hit disk within that
transaction, which makes it hard to depend on it.
I recall that some versions of fsync depended on ordered mode flushing
dirty data on transaction commit, but I don't think the current
ext3_sync_file() will have any problems there.
Other than that, the only thing I can think of that had definite
dependencies in this area was InterMezzo, and that's no longer in the
tree. Even then, I'm not 100% certain that InterMezzo had a dependency
for overwrites (it was certainly strongly dependent on the ordering
semantics for allocates).
It is theoretically possible to write applications that depend on that
ordering, but they would be necessarily non-portable anyway. I think
relaxing it is fine, especially for a 100% (wow) performance gain.
There is one other perspective to be aware of, though: the current
behaviour means that by default ext3 generally starts flushing pending
writeback data within 5 seconds of a write. Without that, we may end up
accumulating a lot more dirty data in memory, shifting the task of write
throttling from the filesystem to the VM.
That's not a problem per se, just a change of behaviour to keep in mind,
as it could expose different corner cases in the performance of
write-intensive workloads.
--Stephen
On Fri, 2006-03-17 at 17:22 -0500, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2006-03-17 at 13:32 -0800, Badari Pulavarty wrote:
>
> > I have a patch which eliminates adding buffers to the journal, if
> > we are doing just re-write of the disk block. ...
>
> > 2.6.16-rc6 2.6.16-rc6+patch
> > real 0m6.606s 0m3.705s
>
> OK, that's a really significant win! What exactly was the test case for
> this, and does that performance edge persist for a longer-running test?
Well, it's a micro benchmark to test the prepare_write/commit_write
code; it over-writes the same blocks again and again, thousands of
times. I am also running general filesystem tests to see the overall
benefits.
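The core of it is essentially the loop below (a reconstruction of the
sort of over-write test described, not the exact benchmark source):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd, i;

	memset(buf, 'x', sizeof(buf));
	fd = open("testfile", O_WRONLY | O_CREAT, 0644);
	if (fd < 0)
		return 1;
	/* The first pass allocates the blocks; every later pass is a
	 * pure re-write through prepare_write()/commit_write(). */
	for (i = 0; i < 100000; i++) {
		if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
			return 1;
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			return 1;
	}
	close(fd);
	return 0;
}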
>
> > In real world, does this ordering guarantee matter ?
>
> Not that I am aware of. Even with the ordering guarantee, there is
> still no guarantee of the order in which the writes hit disk within that
> transaction, which makes it hard to depend on it.
>
> I recall that some versions of fsync depended on ordered mode flushing
> dirty data on transaction commit, but I don't think the current
> ext3_sync_file() will have any problems there.
>
> Other than that, the only thing I can think of that had definite
> dependencies in this are was InterMezzo, and that's no longer in the
> tree. Even then, I'm not 100% certain that InterMezzo had a dependency
> for overwrites (it was certainly strongly dependent on the ordering
> semantics for allocates.)
>
> It is theoretically possible to write applications that depend on that
> ordering, but they would be necessarily non-portable anyway. I think
> relaxing it is fine, especially for a 100% (wow) performance gain.
>
> There is one other perspective to be aware of, though: the current
> behaviour means that by default ext3 generally starts flushing pending
> writeback data within 5 seconds of a write. Without that, we may end up
> accumulating a lot more dirty data in memory, shifting the task of write
> throttling from the filesystem to the VM.
Hmm.. You got a point there.
>
> That's not a problem per se, just a change of behaviour to keep in mind,
> as it could expose different corner cases in the performance of
> write-intensive workloads.
>
> --Stephen
>
>
Theodore Ts'o wrote:
> If the application cares about the precise ordering of data blocks
> being written out with respect to the mtime field, I'd respectfully
> suggest that the application use data journalling mode --- and note
> that most applications which update existing data blocks, especially
> relational databases, either don't care about mtime, have their own
> data recovering subsystems, or both.
I think if you're thinking this only affects "applications" or
individual programs (like databases), then you didn't think about the
example I gave.
Scenario:
- Person has two computers, A and B.
Maybe a desktop and laptop. Maybe office and home machines.
- Sometimes they do work on A, sometimes they do work on B.
Things like editing pictures or spreadsheets or whatever.
- They use "rsync" to copy their working directory from A to B, or
B to A, when they move between computers.
- They're working on A one day, and there's a power cut.
- Power comes back.
- They decide to start again on A, using "rsync" to copy from B to A
to get a good set of files.
- "rsync" is believed to mirror directories from one place to
another without problems. It's always worked for them before.
(Heck, until this thread came up, I assumed it would always work).
- ext3 is generally trusted, so no fsck or anything else special is
thought to be required after a power cut.
- So after running "rsync", they believe it's safe to work on A.
This assumption is invalid, because of ext3's data vs. mtime
write ordering when they were working on A before the power cut.
But the user doesn't expect this. It's far from obvious (except
to a very thoughtful techie) that rsync, which always works
normally and even tidies up mistakes normally, won't give correct
results this time.
- So they carry on working, with corrupted data. Maybe they won't
notice for a long time, and the corruption stays in their work
project.
No individual program or mount option is at fault in the above
scenario. The combination together creates a fault, but only after a
power cut. The usage is fine in normal use and for all other typical
errors which affect files.
Technically, using data=journal, or --checksum with rsync, would be fine.
But nobody _expects_ to have to do that. It's a surprise.
And they both imply a big performance overhead, so nobody is ever
advised to do that just to be safe for "ordinary" day to day work.
-- Jamie
On Fri, 2006-03-17 at 14:38 -0800, Badari Pulavarty wrote:
> On Fri, 2006-03-17 at 17:22 -0500, Stephen C. Tweedie wrote:
> > There is one other perspective to be aware of, though: the current
> > behaviour means that by default ext3 generally starts flushing pending
> > writeback data within 5 seconds of a write. Without that, we may end up
> > accumulating a lot more dirty data in memory, shifting the task of write
> > throttling from the filesystem to the VM.
>
> Hmm.. You got a point there.
>
> >
> > That's not a problem per se, just a change of behaviour to keep in mind,
> > as it could expose different corner cases in the performance of
> > write-intensive workloads.
> >
> > --Stephen
> >
> >
>
Current data=writeback mode already behaves like this, so the VM
subsystem should already be tested to a certain extent, shouldn't it?
Stephen C. Tweedie wrote:
> > That's the wrong way around for uses which check mtimes to revalidate
> > information about a file's contents.
>
> It's actually the right way for newly-allocated data: the blocks being
> written early are invisible until the mtime update, because the mtime
> update is an atomic part of the transaction which links the blocks into
> the inode.
Yes, I agree. It's right for that.
> > Local search engines like Beagle, and also anything where "make" is
> > involved, and "rsync" come to mind.
>
> Make and rsync (when writing, that is) are not usually updating in
> place, so they do in fact want the current ordered mode.
I'm referring to make and rsync _after_ a recovery, when _reading_ to
decide whether file data is up to date. The writing in that scenario
is by other programs.
Those are the times when the current ordering gives surprising
results, to the person who hasn't thought about this ordering, such as
rsync not synchronising a directory properly because it assumes
(incorrectly) a file's mtime is indicative of the last time data was
written to the file.
I agree that when writing data to the end of a new file, the data must
be committed before the metadata.
The weird distinction is really because the order ought to be, if they
can't all be atomic: commit mtime, then data, then size. But we
always commit size and mtime together.
> It's *only* for updating existing data blocks that there's any
> justification for writing mtime first. That's the question here.
>
> There's a significant cost in forcing the mtime to go first: it means
> that the VM cannot perform any data writeback for data written by a
> transaction until the transaction has first been committed. That's the
> last thing you want to be happening under VM pressure, as you may not in
> fact be able to close the transaction without first allocating more
> memory.
While I agree that it's not good for VM pressure, fooling programs
that rely on mtimes to decide if a file's content has changed is a
correctness issue for some applications.
I picked the example of copying a directory using rsync (or any other
program which compares mtimes) and not getting expected results as one
that's easily understood, that people actually do, and where they may
already be getting surprises that may not be noticed immediately.
Maybe the answer is to make the writeback order for in-place writes a
mount option and/or a file attribute?
-- Jamie
On Fri, Mar 17, 2006 at 05:22:13PM -0500, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2006-03-17 at 13:32 -0800, Badari Pulavarty wrote:
>
> > I have a patch which eliminates adding buffers to the journal, if
> > we are doing just re-write of the disk block. ...
>
> > 2.6.16-rc6 2.6.16-rc6+patch
> > real 0m6.606s 0m3.705s
>
> OK, that's a really significant win! What exactly was the test case for
> this, and does that performance edge persist for a longer-running test?
>
> > In real world, does this ordering guarantee matter ?
>
> Not that I am aware of. Even with the ordering guarantee, there is
> still no guarantee of the order in which the writes hit disk within that
> transaction, which makes it hard to depend on it.
>
> I recall that some versions of fsync depended on ordered mode flushing
> dirty data on transaction commit, but I don't think the current
> ext3_sync_file() will have any problems there.
>
> Other than that, the only thing I can think of that had definite
> dependencies in this are was InterMezzo, and that's no longer in the
> tree. Even then, I'm not 100% certain that InterMezzo had a dependency
> for overwrites (it was certainly strongly dependent on the ordering
> semantics for allocates.)
Besides, we seem to have already broken that guarantee for async DIO
writes in the overwrite case.
Regards
Suparna
>
> It is theoretically possible to write applications that depend on that
> ordering, but they would be necessarily non-portable anyway. I think
> relaxing it is fine, especially for a 100% (wow) performance gain.
>
> There is one other perspective to be aware of, though: the current
> behaviour means that by default ext3 generally starts flushing pending
> writeback data within 5 seconds of a write. Without that, we may end up
> accumulating a lot more dirty data in memory, shifting the task of write
> throttling from the filesystem to the VM.
>
> That's not a problem per se, just a change of behaviour to keep in mind,
> as it could expose different corner cases in the performance of
> write-intensive workloads.
>
> --Stephen
>
>
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
On Fri, Mar 17, 2006 at 01:32:21PM -0800, Badari Pulavarty wrote:
> Hi Stephen,
>
> Now that we got your attention, I am wondering whats your opinion on
> this ?
>
> I have a patch which eliminates adding buffers to the journal, if
> we are doing just re-write of the disk block. In theory, it should
> be fine - but it does change the current behavior for order mode
> writes. I guess, current code adds the buffers to the journal, so
> any metadata updates to any file in the filesystem happen in the
> journal - guarantees our buffers to be flushed out before that
> transaction completes.
>
> My patch *breaks* that guarantee. But provides significant improvement
> for re-write case. My micro benchmark shows:
>
> 2.6.16-rc6 2.6.16-rc6+patch
> real 0m6.606s 0m3.705s
> user 0m0.124s 0m0.108s
> sys 0m6.456s 0m3.600s
>
Just curious, how does this compare to the writeback case? Essentially
this change amounts to getting close to writeback-mode performance for
overwrites of existing files, doesn't it?
>
> In real world, does this ordering guarantee matter ? Waiting for your
> advise.
>
> Thanks,
> Badari
>
> Make use of PageMappedToDisk(page) to find out if we need to
> block allocation and skip the calls to it, if not needed.
> When we are not doing block allocation, also avoid calls
> to journal start and adding buffers to transaction.
>
> Signed-off-by: Badari Pulavarty <[email protected]>
> Index: linux-2.6.16-rc6/fs/buffer.c
> ===================================================================
> --- linux-2.6.16-rc6.orig/fs/buffer.c 2006-03-11 14:12:55.000000000 -0800
> +++ linux-2.6.16-rc6/fs/buffer.c 2006-03-16 08:22:37.000000000 -0800
> @@ -2029,6 +2029,7 @@ static int __block_commit_write(struct i
> int partial = 0;
> unsigned blocksize;
> struct buffer_head *bh, *head;
> + int fullymapped = 1;
>
> blocksize = 1 << inode->i_blkbits;
>
> @@ -2043,6 +2044,8 @@ static int __block_commit_write(struct i
> set_buffer_uptodate(bh);
> mark_buffer_dirty(bh);
> }
> + if (!buffer_mapped(bh))
> + fullymapped = 0;
> }
>
> /*
> @@ -2053,6 +2056,9 @@ static int __block_commit_write(struct i
> */
> if (!partial)
> SetPageUptodate(page);
> +
> + if (fullymapped)
> + SetPageMappedToDisk(page);
> return 0;
> }
>
> Index: linux-2.6.16-rc6/fs/ext3/inode.c
> ===================================================================
> --- linux-2.6.16-rc6.orig/fs/ext3/inode.c 2006-03-11 14:12:55.000000000 -0800
> +++ linux-2.6.16-rc6/fs/ext3/inode.c 2006-03-15 13:30:04.000000000 -0800
> @@ -999,6 +999,12 @@ static int ext3_prepare_write(struct fil
> handle_t *handle;
> int retries = 0;
>
> + /*
> + * If the page is already mapped to disk and we are not
> + * journalling the data - there is nothing to do.
> + */
> + if (PageMappedToDisk(page) && !ext3_should_journal_data(inode))
> + return 0;
> retry:
> handle = ext3_journal_start(inode, needed_blocks);
> if (IS_ERR(handle)) {
> @@ -1059,8 +1065,14 @@ static int ext3_ordered_commit_write(str
> struct inode *inode = page->mapping->host;
> int ret = 0, ret2;
>
> - ret = walk_page_buffers(handle, page_buffers(page),
> - from, to, NULL, ext3_journal_dirty_data);
> + /*
> + * If the page is already mapped to disk, we won't have
> + * a handle - which means no metadata updates are needed.
> + * So, no need to add buffers to the transaction.
> + */
> + if (handle)
> + ret = walk_page_buffers(handle, page_buffers(page),
> + from, to, NULL, ext3_journal_dirty_data);
>
> if (ret == 0) {
> /*
> @@ -1075,9 +1087,11 @@ static int ext3_ordered_commit_write(str
> EXT3_I(inode)->i_disksize = new_i_size;
> ret = generic_commit_write(file, page, from, to);
> }
> - ret2 = ext3_journal_stop(handle);
> - if (!ret)
> - ret = ret2;
> + if (handle) {
> + ret2 = ext3_journal_stop(handle);
> + if (!ret)
> + ret = ret2;
> + }
> return ret;
> }
>
> @@ -1098,9 +1112,11 @@ static int ext3_writeback_commit_write(s
> else
> ret = generic_commit_write(file, page, from, to);
>
> - ret2 = ext3_journal_stop(handle);
> - if (!ret)
> - ret = ret2;
> + if (handle) {
> + ret2 = ext3_journal_stop(handle);
> + if (!ret)
> + ret = ret2;
> + }
> return ret;
> }
>
> @@ -1278,6 +1294,14 @@ static int ext3_ordered_writepage(struct
> if (ext3_journal_current_handle())
> goto out_fail;
>
> + /*
> + * If the page is mapped to disk, just do the IO
> + */
> + if (PageMappedToDisk(page)) {
> + ret = block_write_full_page(page, ext3_get_block, wbc);
> + goto out;
> + }
> +
> handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
>
> if (IS_ERR(handle)) {
> @@ -1318,6 +1342,7 @@ static int ext3_ordered_writepage(struct
> err = ext3_journal_stop(handle);
> if (!ret)
> ret = err;
> +out:
> return ret;
>
> out_fail:
> @@ -1337,10 +1362,13 @@ static int ext3_writeback_writepage(stru
> if (ext3_journal_current_handle())
> goto out_fail;
>
> - handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto out_fail;
> + if (!PageMappedToDisk(page)) {
> + handle = ext3_journal_start(inode,
> + ext3_writepage_trans_blocks(inode));
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto out_fail;
> + }
> }
>
> if (test_opt(inode->i_sb, NOBH))
> @@ -1348,9 +1376,11 @@ static int ext3_writeback_writepage(stru
> else
> ret = block_write_full_page(page, ext3_get_block, wbc);
>
> - err = ext3_journal_stop(handle);
> - if (!ret)
> - ret = err;
> + if (handle) {
> + err = ext3_journal_stop(handle);
> + if (!ret)
> + ret = err;
> + }
> return ret;
>
> out_fail:
>
>
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
On Fri, Mar 17, 2006 at 10:44:39PM +0000, Jamie Lokier wrote:
> - Sometimes they do work on A, sometimes they do work on B.
> Things like editing pictures or spreadsheets or whatever.
>
> - They use "rsync" to copy their working directory from A to B, or
> B to A, when they move between computers.
>
> - They're working on A one day, and there's a power cut.
... and this is not a problem, because rsync works by building
the file to be copied via a name such as .filename.NoC10k. If there
is a power cut, there will be a stale dot file on A that might take up
some extra disk space, but it won't cause a problem with a loss of
synchronization between B and A.
Hence, in your scenario, nothing would go wrong. And since rsync
builds up a new file each time, and only deletes the old file and
moves the new file into place when it is successful, in ordered data
mode all of the data blocks will be forced to disk before the metadata
operations for the close and rename are allowed to be committed.
Hence, it's perfectly safe, even with Badari's proposed change.
I would also note that, checksum test or not, *even* if rsync didn't
use the dotfile with the uniquifier, it always did check to see whether
the file sizes matched, and since the file sizes would be different,
it would have caught it that way.
- Ted
P.S. There is a potential problem with mtimes causing confusing
results with make, but it has nothing to do with ext3 journalling
modes, and everything to do with the fact that most programs,
including the compiler and linker, do not write their output files
using the rsync technique of using a temporary filename, and then
renaming the file to its final location once the compiler or linker
operation is complete. So it doesn't matter what filesystem you use:
if you are writing out some gargantuan .o file, and the system
crashes before the .o file is fully written out, the fact that it exists
and is newer than the source files will lead make(1) to conclude
that the file is up to date, and not try to rebuild it. This has been
true for the three decades or so that make has been around, yet I don't
see people running around in histrionics about this "horrible
problem". If people did care, the right way to fix it would be to
the default .c.o rule in Makefiles to do the same thing. But in
reality, it doesn't seem to bother most developers, so no one has
really bothered.
Theodore Ts'o wrote:
> On Fri, Mar 17, 2006 at 10:44:39PM +0000, Jamie Lokier wrote:
> > - Sometimes they do work on A, sometimes they do work on B.
> > Things like editing pictures or spreadsheets or whatever.
> >
> > - They use "rsync" to copy their working directory from A to B, or
> > B to A, when they move between computers.
> >
> > - They're working on A one day, and there's a power cut.
>
> ... and this is not a problem, because rsync works works by building
> the file to be copied via a name such as .filename.NoC10k. If there
> is a power cut, there will be a stale dot file on A that might take up
> some extra disk space, but it won't cause a problem with a loss of
> synchronization between B and A.
Eh? Yikes, I'm obviously not writing clearly enough, because that's
not what I'm talking about at all.
The power cut doesn't interrupt rsync; it interrupts other programs
(unspecified ones), or happens just after they've written data but
before it has reached the disk.
It occurs in step 3 above: after "working on A", i.e. using
OpenOffice, Emacs, Gnumeric, whatever, etc.
_Those_ are the programs which save files shortly before the power cut
in that scenario.
rsync is relevant only *after* the power cut, because it checks mtimes
to see if files are modified. The method by which rsync writes files
isn't relevant to this scenario.
> Hence, in your scenario, nothing would go wrong. And since rsync
> builds up a new file each time, and only deletes the old file and
> moves the new file to replace the old file when it is successful, in
> ordered data mode all of the data blocks will be forced to disk before
> the metadata operations for the close and rename are allowed to be
> commited. Hence, it's perfectly safe, even with Badari's proposed
> change.
Not relevant; rsync is run after the power cut; how it writes files is
irrelevant. How it detects changed files is relevant.
It's other programs (OpenOffice, etc.) which are being used just
before the power cut. If the programs which run just before the power
cut do the above (writing then renaming), then they're fine, but many
programs don't do that.
Now, to be fair, most programs don't overwrite data blocks in place either.
They usually open files with O_TRUNC to write with new contents. How
does that work out with/without Badari's patch? Is that safe in the
same way as creating new files and appending to them is?
Or does O_TRUNC mean that the old data blocks might be released and
assigned to other files, before this file's metadata is committed,
opening a security hole where reading this file after a restart might
read blocks belonging to another, unrelated file?
> P.S. There is a potential problem with mtimes causing confusing
> results with make, but it has nothing to do with ext3 journalling
> modes, and everything to do with the fact that most programs,
> including the compiler and linker, do not write their output files
> using the rsync technique of using a temporary filename, and then
> renaming the file to its final location once the compiler or linker
> operation is complete. So it doesn't matter what filesystem you use,
> if you are writing out some garguantuan .o file, and the system
> crashes before the .o file is written out, the fact that it exist
> means and is newer than the source files will lead make(1) to conclude
> that the file is up to date, and not try to rebuild it. This has been
> true for three decades or so that make has been around, yet I don't
> see people running around in histronics about how this "horrible
> problem". If people did care, the right way to fix it would be to
> make the C compiler use the temp filename and rename trick, or change
> the default .c.o rule in Makefiles to do the same thing. But in
> reality, it doesn't seem to bother most developers, so no one has
> really bothered.
Again, I know about that problem. I'm referring to a _different_
problem with make, one that is less well known and has less easily
detected effects.
It's this: you edit a source file with your favourite editor, and save
it. 3 seconds later, there's a power cut. The next day, power comes
back and you've forgotten that you edited this file.
You type "make", and because the _source_ file's new data was
committed, but the mtime wasn't, "make" doesn't rebuild anything.
The result is output files which are perfectly valid, but inconsistent
with source files. No truncated output files (which tend to be easily
detected because they don't link or run).
This has nothing to do with partially written output files, and more
importantly, you can't fix it by cleverness in the Makefile. It's
insidious because whole builds may appear to be fine for a long time
(something that doesn't occur with partially written output files -
those trigger further errors when used - which is maybe why nobody
much worries about those). Bugs in behaviour may not be seen from
viewing source code, and if you don't know a power cut was involved,
it may not be obvious to think it could be due to source-object
mismatch.
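To make the failure mode concrete, here is a minimal, purely
illustrative sketch of the mtime comparison that make-style tools rely
on (the helper name and the foo.c/foo.o file names are invented for
the example; this is not make's actual code). If a crash leaves foo.c
with new contents but an old on-disk mtime, the check below wrongly
reports foo.o as up to date.

#include <stdio.h>
#include <sys/stat.h>

/* Roughly how make-style tools decide whether to rebuild: by
 * comparing timestamps, not contents. */
static int needs_rebuild(const char *src, const char *obj)
{
        struct stat s, o;

        if (stat(obj, &o) != 0)
                return 1;       /* no object file yet: rebuild */
        if (stat(src, &s) != 0)
                return 0;       /* no source file: nothing to do */

        /* Rebuild only if the source looks newer than the object.
         * A source whose data changed but whose mtime update was
         * never committed fails this test, so the stale object is
         * kept. */
        return s.st_mtime > o.st_mtime;
}

int main(void)
{
        printf("rebuild foo.o? %s\n",
               needs_rebuild("foo.c", "foo.o") ? "yes" : "no");
        return 0;
}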
Similar effects occur with automatic byte-code compilation, such as
Python's .pyc files, and with web sites where a templating system
caches generated output based on the mtimes of source files.
However, all of the above examples really depend on what happens with
O_TRUNC, because in practice the editors and similar tools that are
likely to be used either do write-then-rename or use O_TRUNC.
So what does happen with O_TRUNC? Does that commit the size and mtime
change before (or at the same time as) freeing the data blocks? Or
can the data block freeing be committed first? If the former, O_TRUNC
is as good as writing to a new file: it's fine. If the latter, it's
like writing data in-place, and can have the problems I've described.
-- Jamie
Once upon a time, Jamie Lokier <[email protected]> said:
>rsync is relevant only *after* the power cut, because it checks mtimes
>to see if files are modified. The method by which rsync writes files
>isn't relevant to this scenario.
To simplify: substitute "rsync" with "backup program".
Any backup software that maintains some type of index of what has been
backed up (for incremental type backups) or even just backs up files
modified since a particular date (e.g. "dump") can miss files modified
shortly before a crash/power cut/unexpected shutdown. The data may be
modified, but if the mtime is not updated, nothing can tell that the
data has changed.
rsync is actually a special case, in that you could always force it to
compare contents between two copies. Most backup software doesn't do
that (especially tape backups).
--
Chris Adams <[email protected]>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.
On Sun, Mar 19, 2006 at 02:36:10AM +0000, Jamie Lokier wrote:
> It's other programs (OpenOffice, etc.) which are being used just
> before the power cut. If the programs which run just before the power
> cut do the above (writing then renaming), then they're fine, but many
> programs don't do that.
>
> Now, to be fair, most programs don't overwrite data blocks in place either.
>
> They usually open files with O_TRUNC to write with new contents. How
> does that work out with/without Badari's patch? Is that safe in the
> same way as creating new files and appending to them is?
The competently written ones don't open files with O_TRUNC; they
create a temp file, write to the temp file, and then rename it over
the original file once it is fully written, just like rsync does.
This is important, given that many developers (especially kernel
developers) like to use hard-link farms to minimize space, and if you
just O_TRUNC the existing file, then the change will be visible in all
of the directories. If instead the editor (or openoffice, etc.)
writes to a temp file and then renames, it breaks the hard link,
giving COW semantics, which is what you want --- and in practice,
everyone using (or who was using) bk, git, and mercurial uses
hard-linked directories, and it works just fine.
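For reference, a minimal sketch of the write-to-temp-then-rename
pattern described above (the replace_file() helper, the ".tmp" suffix,
and the simplified error handling are inventions for illustration, not
code from rsync or any particular editor):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write "data" to path.tmp, then atomically rename it over "path".
 * The rename also breaks any hard link to the old file, which gives
 * the copy-on-write behaviour described above. */
static int replace_file(const char *path, const char *data, size_t len)
{
        char tmp[4096];
        int fd;

        snprintf(tmp, sizeof(tmp), "%s.tmp", path);

        fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd < 0)
                return -1;

        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
                close(fd);
                unlink(tmp);
                return -1;
        }
        if (close(fd) != 0) {
                unlink(tmp);
                return -1;
        }

        /* A fully careful version would also fsync the containing
         * directory so the rename itself is known to be durable. */
        if (rename(tmp, path) != 0) {
                unlink(tmp);
                return -1;
        }
        return 0;
}

int main(void)
{
        const char *text = "new contents\n";

        return replace_file("output.txt", text, strlen(text))
                ? EXIT_FAILURE : EXIT_SUCCESS;
}

The fsync() before the rename matters on filesystems that don't give
you data=ordered semantics: without it, the rename could become
durable before the new file's data does.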
But yes, using O_TRUNC works just fine with and without Badari's
patch, because allocating new data blocks to an old file that has been
truncated is exactly the same as appending new data blocks to a new
file.
> Or does O_TRUNC mean that the old data blocks might be released and
> assigned to other files, before this file's metadata is committed,
> opening a security hole where reading this file after a restart might
> read blocks belonging to another, unrelated file?
No, not in data=ordered mode, since the data blocks are forced to
disk before the metadata is committed. Opening with O_TRUNC is a
metadata operation, and allocating new blocks after O_TRUNC is also a
metadata operation, and in data=ordered mode the data blocks are
written out before the metadata is forced out.
> It's this: you edit a source file with your favourite editor, and save
> it. 3 seconds later, there's a power cut. The next day, power comes
> back and you've forgotten that you edited this file.
Again, *my* favorite editor saves the file to a newly created file,
#filename, and once it is done writing the new file, renames filename
to filename~, and finally renames #filename to filename. This means
that we don't have to worry about your power cut scenario, and it also
means that hard-link farms have the proper COW semantics.
> However, all of the above examples really depend on what happens with
> O_TRUNC, because in practice all editors etc. that are likely to be
> used, use that if they don't do write-then-rename.
O_TRUNC is a bad idea, because it means that if the editor crashes,
the computer crashes, or the fileserver crashes, the original file is
*G*O*N*E*. So only silly/stupidly written editors use O_TRUNC; if
yours does, abandon it in favor of another editor, or one day you'll
be really sorry. It's much, much safer to do write-then-rename.
- Ted
Hi,
On Sun, 2006-03-19 at 02:36 +0000, Jamie Lokier wrote:
> Now, to be fair, most programs don't overwrite data blocks in place either.
Which is the point we're trying to make: "make", for example, is
almost always used to create or fully replace whole files, not to
update existing data inside a file.
> They usually open files with O_TRUNC to write with new contents. How
> does that work out with/without Badari's patch? Is that safe in the
> same way as creating new files and appending to them is?
Yes, absolutely. We have to be extremely careful about ordering when it
comes to truncate, because we cannot allow the discarded data blocks to
be reused until the truncate has committed (otherwise a crash which
rolled back the truncate would potentially expose corruption in those
data blocks). That's all done in the allocation logic, not in the
writeback code, so it is unaffected by the writeback patches.
So the O_TRUNC is still fully safe; and the allocation of new blocks
after that is simply a special case of extend, so it is also unaffected
by the patch.
It is *only* the recovery semantics of update-in-place which are
affected.
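If it helps to picture that ordering rule, here is a toy sketch
(invented data structures, nothing like ext3's real code) of "blocks
freed by a truncate are only handed back to the allocator after the
transaction containing the truncate has committed":

#include <stdbool.h>
#include <stddef.h>

#define MAX_BLOCKS 1024

struct toy_fs {
        /* true = block may be handed out by the allocator */
        bool free_map[MAX_BLOCKS];
        /* blocks freed inside the currently running transaction */
        unsigned pending[MAX_BLOCKS];
        size_t npending;
};

/* Called while the transaction doing the truncate is still open:
 * remember the block, but do NOT make it allocatable yet. */
static void truncate_free_block(struct toy_fs *fs, unsigned blk)
{
        fs->pending[fs->npending++] = blk;
}

/* Called once the transaction (the truncate plus its metadata) has
 * committed: only now may the blocks be reused, so a crash that
 * rolls back the truncate can never leave the file pointing at
 * blocks that were already reallocated to someone else. */
static void transaction_committed(struct toy_fs *fs)
{
        for (size_t i = 0; i < fs->npending; i++)
                fs->free_map[fs->pending[i]] = true;
        fs->npending = 0;
}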
> It's this: you edit a source file with your favourite editor, and save
> it. 3 seconds later, there's a power cut. The next day, power comes
> back and you've forgotten that you edited this file.
If your editor is really opening the existing file and modifying the
contents in place, then you have got a fundamentally unsolvable problem
because the crash you worry about might happen while the editor is still
writing and the file is internally inconsistent. That's not something I
think is the filesystem's responsibility to fix!
--Stephen
Hi,
On Fri, 2006-03-17 at 15:23 -0800, Mingming Cao wrote:
> > > There is one other perspective to be aware of, though: the current
> > > behaviour means that by default ext3 generally starts flushing pending
> > > writeback data within 5 seconds of a write. Without that, we may end up
> > > accumulating a lot more dirty data in memory, shifting the task of write
> > > throttling from the filesystem to the VM.
> Current data=writeback mode already behaves like this, so the VM
> subsystem should already be exercised to a certain extent, shouldn't it?
Yes, but there are repeated reports that for many workloads,
data=writeback is actually slower than data=ordered. So there are
probably some interactions like this which may be hurting us already.
--Stephen