Hi All, I'm seeing slow symlink corruption on ext3 on linux-2.6.27,
yesterday's linux-2.6 git tree and 2.6.9 RHEL4.7. I.e. every kernel
I've tried I see this effect. To reproduce this, I need:
* 250MB + tar file in memory (tmpfs or in the buffer cache)
* long symlinks in the tar file (over 60 characters)
* umount immediately after untarring
What I see is that the symlinks are corrupted, e.g.:
# ls -l etc/vmware-vix-disklib
etc/vmware-vix-disklib -> ??f
fsck shows:
Symlink /etc/vmware-vix-disklib (inode #16454) is invalid.
Debugfs shows:
debugfs: stat <16454>
Inode: 16454 Type: symlink Mode: 0777 Flags: 0x0 Generation: 1431972005
User: 0 Group: 0 Size: 65
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4900ac69 -- Thu Oct 23 09:55:05 2008
atime: 0x4900ac84 -- Thu Oct 23 09:55:32 2008
mtime: 0x4900ac69 -- Thu Oct 23 09:55:05 2008
BLOCKS:
(0):56034
TOTAL: 1
I'm still tracking down exactly what's going on. Anyone seen
anything like this before? ext2 does not show this effect (I've
not tried ext4). It happens when the backing block device is
a SATA drive or flash.
Thanks,
Arthur
Some additional info -- and attempting to cast a wider
net on the CC:
I do not see the long symlink corruption with mount -o data=writeback
and we've now seen a couple cases where the symlink corruption
does not require a umount...
Arthur
On Fri, Oct 24, 2008 at 11:37:34AM -0700, Arthur Jones wrote:
> Hi All, I'm seeing slow symlink corruption on ext3 on linux-2.6.27,
> yesterday's linux-2.6 git tree and 2.6.9 RHEL4.7. I.e. every kernel
> I've tried I see this effect. To reproduce this, I need:
>
> * 250MB + tar file in memory (tmpfs or in the buffer cache)
> * long symlinks in the tar file (over 60 characters)
> * umount immediately after untarring
>
> What I see is that the symlinks are corrupted, e.g.:
>
> # ls -l etc/vmware-vix-disklib
> etc/vmware-vix-disklib -> ??f
>
> fsck shows:
>
> Symlink /etc/vmware-vix-disklib (inode #16454) is invalid.
>
> Debugfs shows:
>
> debugfs: stat <16454>
> Inode: 16454 Type: symlink Mode: 0777 Flags: 0x0 Generation: 1431972005
> User: 0 Group: 0 Size: 65
> File ACL: 0 Directory ACL: 0
> Links: 1 Blockcount: 8
> Fragment: Address: 0 Number: 0 Size: 0
> ctime: 0x4900ac69 -- Thu Oct 23 09:55:05 2008
> atime: 0x4900ac84 -- Thu Oct 23 09:55:32 2008
> mtime: 0x4900ac69 -- Thu Oct 23 09:55:05 2008
> BLOCKS:
> (0):56034
> TOTAL: 1
>
> I'm still tracking down exactly what's going on. Anyone seen
> anything like this before? ext2 does not show this effect (I've
> not tried ext4). It happens when the backing block device is
> a SATA drive or flash.
>
> Thanks,
>
> Arthur
This one's turning out to be a slippery fish.
I have found that the corruption appears to
be due to ->writepage() not getting called at all
for any of the long symlinks...
Ring any bells anyone? Any ideas where to look
or what to test? This is my first foray into
ext3 and I could definitely use some expert advice...
Arthur
On Mon, Oct 27, 2008 at 09:54:23AM -0700, Arthur Jones wrote:
> Some additional info -- and attempting to cast a wider
> net on the CC:
>
> I do not see the long symlink corruption with mount -o data=writeback
> and we've now seen a couple cases where the symlink corruption
> does not require a umount...
>
> Arthur
>
> On Fri, Oct 24, 2008 at 11:37:34AM -0700, Arthur Jones wrote:
> > Hi All, I'm seeing slow symlink corruption on ext3 on linux-2.6.27,
> > yesterday's linux-2.6 git tree and 2.6.9 RHEL4.7. I.e. every kernel
> > I've tried I see this effect. To reproduce this, I need:
> >
> > * 250MB + tar file in memory (tmpfs or in the buffer cache)
> > * long symlinks in the tar file (over 60 characters)
> > * umount immediately after untarring
> >
> > What I see is that the symlinks are corrupted, e.g.:
> >
> > # ls -l etc/vmware-vix-disklib
> > etc/vmware-vix-disklib -> ??f
> >
> > fsck shows:
> >
> > Symlink /etc/vmware-vix-disklib (inode #16454) is invalid.
> >
> > Debugfs shows:
> >
> > debugfs: stat <16454>
> > Inode: 16454 Type: symlink Mode: 0777 Flags: 0x0 Generation: 1431972005
> > User: 0 Group: 0 Size: 65
> > File ACL: 0 Directory ACL: 0
> > Links: 1 Blockcount: 8
> > Fragment: Address: 0 Number: 0 Size: 0
> > ctime: 0x4900ac69 -- Thu Oct 23 09:55:05 2008
> > atime: 0x4900ac84 -- Thu Oct 23 09:55:32 2008
> > mtime: 0x4900ac69 -- Thu Oct 23 09:55:05 2008
> > BLOCKS:
> > (0):56034
> > TOTAL: 1
> >
> > I'm still tracking down exactly what's going on. Anyone seen
> > anything like this before? ext2 does not show this effect (I've
> > not tried ext4). It happens when the backing block device is
> > a SATA drive or flash.
> >
> > Thanks,
> >
> > Arthur
Arthur Jones wrote:
> This one's turning out to be a slippery fish.
> I have found that the corruption appears to
> be due to ->writepage() not getting called at all
> for any of the long symlinks...
>
> Ring any bells anyone? Any ideas where to look
> or what to test? This is my first foray into
> ext3 and I could definitely use some expert advice...
>
> Arthur
Sorry for the silence, this is a nice bug you've found :)
I can reproduce this at least, with this script:
#!/bin/bash
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
rm -f /mnt/test2/*
dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
touch
/mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
ln -s
/mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
/mnt/test2/link
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
ls /mnt/test2/
umount /mnt/test2
I'll look into it ...
-Eric
On Wed, Oct 29, 2008 at 03:36:33PM -0500, Eric Sandeen wrote:
>
> Sorry for the silence, this is a nice bug you've found :)
>
> I'll look into it ...
You may want to take a quick look at this thread:
http://lkml.org/lkml/2008/10/28/413
- Ted
Theodore Tso wrote:
> On Wed, Oct 29, 2008 at 03:36:33PM -0500, Eric Sandeen wrote:
>> Sorry for the silence, this is a nice bug you've found :)
>>
>> I'll look into it ...
>
> You may want to take a quick look at this thread:
>
> http://lkml.org/lkml/2008/10/28/413
>
> - Ted
I thought about that, but at least at first glance I don't see how the
gfp mask change would cause this behavior...? At least, I don't think
we're seeing recursion back into the filesystem... but I'll ponder that.
(Also, Arthur reports seeing this as long ago as 2.6.9...)
-Eric
Hi Eric, ...
On Thu, Oct 30, 2008 at 06:38:30AM -0700, Eric Sandeen wrote:
> Theodore Tso wrote:
> > On Wed, Oct 29, 2008 at 03:36:33PM -0500, Eric Sandeen wrote:
> >> Sorry for the silence, this is a nice bug you've found :)
> >>
> >> I'll look into it ...
> >
> > You may want to take a quick look at this thread:
> >
> > http://lkml.org/lkml/2008/10/28/413
> >
> > - Ted
>
> I thought about that, but at least at first glance I don't see how the
> gfp mask change would cause this behavior...? At least, I don't think
> we're seeing recursion back into the filesystem... but I'll ponder that.
Back when I thought this was a 2.6.9 bug, I tried
that patch. There was no change in behavior...
Arthur
Hi Eric, ...
On Wed, Oct 29, 2008 at 01:36:33PM -0700, Eric Sandeen wrote:
> [...]
> I'll look into it ...
In the cases that fail, I'm seeing bdi_write_congested()
return 2 at the top of write_cache_pages(). In the working
case, bdi_write_congested() returns 0 and the inodes are
found twice in the s_io list in generic_sync_sb_inodes,
first as i_state==7, where they are skipped, then a second
time as i_state==4, where ->writepage() is then called...
Arthur
Arthur Jones wrote:
> Hi Eric, ...
>
> On Wed, Oct 29, 2008 at 01:36:33PM -0700, Eric Sandeen wrote:
>> [...]
>> I'll look into it ...
>
> In the cases that fail, I'm seeing bdi_write_congested()
> return 2 at the top of write_cache_pages(). In the working
> case, bdi_write_congested() returns 0 and the inodes are
> found twice in the s_io list in generic_sync_sb_inodes,
> first as i_state==7, where they are skipped, then a second
> time as i_state==4, where ->writepage() is then called...
>
> Arthur
Something is definitely racy here; in my simple testcase I get failures
maybe 30-50% of the time...
If I add a mark_buffer_dirty to write_end_fn, it always passes but I
need to think a bit about that.
-Eric
Hi Eric, ...
On Thu, Oct 30, 2008 at 10:40:57AM -0700, Arthur Jones wrote:
> [...]
> return 2 at the top of write_cache_pages(). In the working
> case, bdi_write_congested() returns 0 and the inodes are
> found twice in the s_io list in generic_sync_sb_inodes,
> first as i_state==7, where they are skipped, then a second
> time as i_state==4, where ->writepage() is then called...
To clarify, they are not on the list twice, rather
a separate call to generic_sync_sb_inodes sees them
again as i_state==4...
Arthur
Hi Eric, ...
On Thu, Oct 30, 2008 at 11:03:49AM -0700, Eric Sandeen wrote:
> [...]
> Something is definitely racy here; in my simple testcase I get failures
> maybe 30-50% of the time...
Some more info: in the working case, the inodes are put
back on sb->s_dirty at then next ext3_sync_fs() call:
__fsync_super -> DQUOT_SYNC -> ext3_sync_fs -> log_wait_commit
In the failing case, journal_start_commit returns 0 in ext_sync_fs
and the inodes disappear into never-never land...
Arthur
On Thu, Oct 30, 2008 at 02:34:00PM -0700, Arthur Jones wrote:
> Hi Eric, ...
>
> On Thu, Oct 30, 2008 at 11:03:49AM -0700, Eric Sandeen wrote:
> > [...]
> > Something is definitely racy here; in my simple testcase I get failures
> > maybe 30-50% of the time...
>
> Some more info: in the working case, the inodes are put
> back on sb->s_dirty at then next ext3_sync_fs() call:
>
> __fsync_super -> DQUOT_SYNC -> ext3_sync_fs -> log_wait_commit
>
> In the failing case, journal_start_commit returns 0 in ext_sync_fs
> and the inodes disappear into never-never land...
More details, these are dumps at __log_start_commit in the
call chain described above, the first column is the failing
case, the next column is working case, t_expires is the delta
from the time the dump was taken:
journal->j_flags 0x10 0x10
journal->j_tail_sequence 515 519
journal->j_transaction_sequence 517 522
journal->j_commit_sequence 514 519
journal->j_commit_request 516 520
journal->j_running_transaction->t_tid 516 521
journal->j_running_transaction->t_state 0 0
journal->j_running_transaction->t_updates 0 0
journal->j_running_transaction->t_handle_count 27305 27344
journal->j_running_transaction->t_expires -566 28
Can you tell from this whether the transactions
are messed up or whether we're just missing a
wake_up? Any other info you'd like to see?
Arthur
Arthur Jones wrote:
> On Thu, Oct 30, 2008 at 02:34:00PM -0700, Arthur Jones wrote:
>> Hi Eric, ...
>>
>> On Thu, Oct 30, 2008 at 11:03:49AM -0700, Eric Sandeen wrote:
>>> [...]
>>> Something is definitely racy here; in my simple testcase I get failures
>>> maybe 30-50% of the time...
>> Some more info: in the working case, the inodes are put
>> back on sb->s_dirty at then next ext3_sync_fs() call:
>>
>> __fsync_super -> DQUOT_SYNC -> ext3_sync_fs -> log_wait_commit
>>
>> In the failing case, journal_start_commit returns 0 in ext_sync_fs
>> and the inodes disappear into never-never land...
>
> More details, these are dumps at __log_start_commit in the
> call chain described above, the first column is the failing
> case, the next column is working case, t_expires is the delta
> from the time the dump was taken:
>
> journal->j_flags 0x10 0x10
> journal->j_tail_sequence 515 519
> journal->j_transaction_sequence 517 522
> journal->j_commit_sequence 514 519
> journal->j_commit_request 516 520
>
> journal->j_running_transaction->t_tid 516 521
> journal->j_running_transaction->t_state 0 0
> journal->j_running_transaction->t_updates 0 0
> journal->j_running_transaction->t_handle_count 27305 27344
> journal->j_running_transaction->t_expires -566 28
>
> Can you tell from this whether the transactions
> are messed up or whether we're just missing a
> wake_up? Any other info you'd like to see?
That's kind of along the lines of what I'm seeing; also, in particular,
I'm never seeing the buffer_head in question (the one for the block
which contains the slow link's data) transition from jbddirty to normal
BH_Dirty. I've had to take a break from this today, but will be back at
it a bit later... since I have a solid testcase I'm sure I'll get to the
bottom of it ... :) I'll probably hook up akpm's buffer tracing
infrastructure, just need to find a decent thing to trigger on to dump
out the history.
-Eric
On Friday 31 October 2008 00:38, Eric Sandeen wrote:
> Theodore Tso wrote:
> > On Wed, Oct 29, 2008 at 03:36:33PM -0500, Eric Sandeen wrote:
> >> Sorry for the silence, this is a nice bug you've found :)
> >>
> >> I'll look into it ...
> >
> > You may want to take a quick look at this thread:
> >
> > http://lkml.org/lkml/2008/10/28/413
> >
> > - Ted
>
> I thought about that, but at least at first glance I don't see how the
> gfp mask change would cause this behavior...? At least, I don't think
> we're seeing recursion back into the filesystem... but I'll ponder that.
That's definitely a bug which I'll have to fix for 2.6.28, but
I agree it's unlikely to recurse frequently like this (would only
happen under high memory pressure, and only when writeout from
page reclaim happens).
> (Also, Arthur reports seeing this as long ago as 2.6.9...)
OK, well that would confirm it.
Hi Eric, This patch fixes the problem for me, and
seems to put the buffers on the dirty list at the
place where they are put on the list during the working
case. Despite having rooted around in the innards of
ext3 for the last few days, I cannot say that I have
any sense of whether this patch will cause problems
elsewhere or even if this is the best place to
intercede.
I post the complete patch not because I think it
should be committed as is, but rather to try
to explain the logic that brought it about. At the
very least, this should be reviewed by the experts
here to make sure there is no collateral damage.
Arthur
-------------------
In ext3_sync_fs, we only wait for a commit to
finish if we started it, but there may be one
already in progress which will not be synced.
In the case of a data=ordered umount with pending long
symlinks which are delayed due to a long list
of other I/O on the backing block device, this
causes the buffer associated with the long symlinks
to not be moved to the inode dirty list in the second
phase of fsync_super. Then, before they can be dirtied
again, kjournald exits, seeing the UMOUNT flag and the
dirty pages are never written to the backing block device,
causing long symlink corruption and exposing new or
previously freed block data to userspace.
This can be reproduced with a script created
by Eric Sandeen <[email protected]>:
#!/bin/bash
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
rm -f /mnt/test2/*
dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
touch
/mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
ln -s
/mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
/mnt/test2/link
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
ls /mnt/test2/
umount /mnt/test2
To ensure all commits are synced, we flush
all journal commits now when sync_fs'ing ext3.
Signed-off-by: Arthur Jones <[email protected]>
---
fs/ext3/super.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 18eaa78..053659a 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2392,7 +2392,13 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
if (wait)
log_wait_commit(EXT3_SB(sb)->s_journal, target);
- }
+ } else if (wait)
+ /*
+ * We may have a commit in progress, clear it out
+ * before we go on...
+ */
+ ext3_force_commit(sb);
+
return 0;
}
--
1.5.4.3
On Mon, 3 Nov 2008 10:44:26 -0800
Arthur Jones <[email protected]> wrote:
> Hi Eric, This patch fixes the problem for me, and
> seems to put the buffers on the dirty list at the
> place where they are put on the list during the working
> case. Despite having rooted around in the innards of
> ext3 for the last few days, I cannot say that I have
> any sense of whether this patch will cause problems
> elsewhere or even if this is the best place to
> intercede.
>
> I post the complete patch not because I think it
> should be committed as is, but rather to try
> to explain the logic that brought it about. At the
> very least, this should be reviewed by the experts
> here to make sure there is no collateral damage.
>
> Arthur
>
> -------------------
> In ext3_sync_fs, we only wait for a commit to
> finish if we started it, but there may be one
> already in progress which will not be synced.
argh.
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2392,7 +2392,13 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
> if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
> if (wait)
> log_wait_commit(EXT3_SB(sb)->s_journal, target);
> - }
> + } else if (wait)
> + /*
> + * We may have a commit in progress, clear it out
> + * before we go on...
> + */
> + ext3_force_commit(sb);
> +
> return 0;
> }
Can we do
sb->s_dirt = 0;
if (wait)
ext3_force_commit(...);
else
journal_start_commit(...);
?
Also, I wonder if that `sb->s_dirt = 0' is correct if
journal_start_commit() didn't start a commit?
Arthur Jones wrote:
> Hi Eric, This patch fixes the problem for me, and
> seems to put the buffers on the dirty list at the
> place where they are put on the list during the working
> case. Despite having rooted around in the innards of
> ext3 for the last few days, I cannot say that I have
> any sense of whether this patch will cause problems
> elsewhere or even if this is the best place to
> intercede.
>
> I post the complete patch not because I think it
> should be committed as is, but rather to try
> to explain the logic that brought it about. At the
> very least, this should be reviewed by the experts
> here to make sure there is no collateral damage.
>
> Arthur
Seems like the right approach; I too was thinking that the problem was
we just weren't either kicking off, or waiting for, the log commit at
unmount time.
I've had to step away from this problem for a couple days but will
eyeball this soon, it seems like the right root cause & general approach
to a fix, to me.
Thanks!
-Eric
> -------------------
> In ext3_sync_fs, we only wait for a commit to
> finish if we started it, but there may be one
> already in progress which will not be synced.
>
> In the case of a data=ordered umount with pending long
> symlinks which are delayed due to a long list
> of other I/O on the backing block device, this
> causes the buffer associated with the long symlinks
> to not be moved to the inode dirty list in the second
> phase of fsync_super. Then, before they can be dirtied
> again, kjournald exits, seeing the UMOUNT flag and the
> dirty pages are never written to the backing block device,
> causing long symlink corruption and exposing new or
> previously freed block data to userspace.
>
> This can be reproduced with a script created
> by Eric Sandeen <[email protected]>:
>
> #!/bin/bash
>
> umount /mnt/test2
> mount /dev/sdb4 /mnt/test2
> rm -f /mnt/test2/*
> dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
> touch
> /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
> ln -s
> /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
> /mnt/test2/link
> umount /mnt/test2
> mount /dev/sdb4 /mnt/test2
> ls /mnt/test2/
> umount /mnt/test2
>
> To ensure all commits are synced, we flush
> all journal commits now when sync_fs'ing ext3.
>
> Signed-off-by: Arthur Jones <[email protected]>
> ---
> fs/ext3/super.c | 8 +++++++-
> 1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 18eaa78..053659a 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2392,7 +2392,13 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
> if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
> if (wait)
> log_wait_commit(EXT3_SB(sb)->s_journal, target);
> - }
> + } else if (wait)
> + /*
> + * We may have a commit in progress, clear it out
> + * before we go on...
> + */
> + ext3_force_commit(sb);
> +
> return 0;
> }
>
Hi Andrew, ...
On Mon, Nov 03, 2008 at 11:33:18AM -0800, Andrew Morton wrote:
> [...]
> > --- a/fs/ext3/super.c
> > +++ b/fs/ext3/super.c
> > @@ -2392,7 +2392,13 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
> > if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
> > if (wait)
> > log_wait_commit(EXT3_SB(sb)->s_journal, target);
> > - }
> > + } else if (wait)
> > + /*
> > + * We may have a commit in progress, clear it out
> > + * before we go on...
> > + */
> > + ext3_force_commit(sb);
> > +
> > return 0;
> > }
>
> Can we do
>
> sb->s_dirt = 0;
> if (wait)
> ext3_force_commit(...);
> else
> journal_start_commit(...);
I think this is what you had in mind:
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 18eaa78..2b48b85 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2386,13 +2386,12 @@ static void ext3_write_super (struct super_block * sb)
static int ext3_sync_fs(struct super_block *sb, int wait)
{
- tid_t target;
On Mon, 3 Nov 2008 12:14:28 -0800
Arthur Jones <[email protected]> wrote:
> Hi Andrew, ...
>
> On Mon, Nov 03, 2008 at 11:33:18AM -0800, Andrew Morton wrote:
> > [...]
> > > --- a/fs/ext3/super.c
> > > +++ b/fs/ext3/super.c
> > > @@ -2392,7 +2392,13 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
> > > if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
> > > if (wait)
> > > log_wait_commit(EXT3_SB(sb)->s_journal, target);
> > > - }
> > > + } else if (wait)
> > > + /*
> > > + * We may have a commit in progress, clear it out
> > > + * before we go on...
> > > + */
> > > + ext3_force_commit(sb);
> > > +
> > > return 0;
> > > }
> >
> > Can we do
> >
> > sb->s_dirt = 0;
> > if (wait)
> > ext3_force_commit(...);
> > else
> > journal_start_commit(...);
>
> I think this is what you had in mind:
yup.
>
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 18eaa78..2b48b85 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2386,13 +2386,12 @@ static void ext3_write_super (struct super_block * sb)
>
> static int ext3_sync_fs(struct super_block *sb, int wait)
> {
> - tid_t target;
> -
> sb->s_dirt = 0;
> - if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
> - if (wait)
> - log_wait_commit(EXT3_SB(sb)->s_journal, target);
> - }
> + if (wait)
> + ext3_force_commit(sb);
> + else
> + journal_start_commit(EXT3_SB(sb)->s_journal, NULL);
> +
> return 0;
> }
>
> I tried this and it too fixes the problem. FWIW I agree it
> looks better...
>
OK. But still, questions remain.
- should we clear s_dirt if log_wait_commit() didn't work?
- ext3_force_commit() clears s_dirt also
- should ext3_sync_fs() be dropping the ext3_force_commit() return
value on the floor?
This is all rather worrisome.
Hi Andrew, ...
On Mon, Nov 03, 2008 at 12:37:50PM -0800, Andrew Morton wrote:
> [...]
> OK. But still, questions remain.
>
> - should we clear s_dirt if log_wait_commit() didn't work?
>
> - ext3_force_commit() clears s_dirt also
>
> - should ext3_sync_fs() be dropping the ext3_force_commit() return
> value on the floor?
- does ext4 have a similar issue (ext4_sync_fs looks the same)?
Arthur
On Mon, 3 Nov 2008 12:58:54 -0800
Arthur Jones <[email protected]> wrote:
> Hi Andrew, ...
>
> On Mon, Nov 03, 2008 at 12:37:50PM -0800, Andrew Morton wrote:
> > [...]
> > OK. But still, questions remain.
> >
> > - should we clear s_dirt if log_wait_commit() didn't work?
> >
> > - ext3_force_commit() clears s_dirt also
> >
> > - should ext3_sync_fs() be dropping the ext3_force_commit() return
> > value on the floor?
>
> - does ext4 have a similar issue (ext4_sync_fs looks the same)?
>
I'm sure it does. Someone(tm) should prepare the ext4 patch once we've
settled on the ext3 patch.
On Mon, Nov 03, 2008 at 01:13:13PM -0800, Andrew Morton wrote:
> On Mon, 3 Nov 2008 12:58:54 -0800
> Arthur Jones <[email protected]> wrote:
>
> > Hi Andrew, ...
> >
> > On Mon, Nov 03, 2008 at 12:37:50PM -0800, Andrew Morton wrote:
> > > [...]
> > > OK. But still, questions remain.
> > >
> > > - should we clear s_dirt if log_wait_commit() didn't work?
> > >
> > > - ext3_force_commit() clears s_dirt also
> > >
> > > - should ext3_sync_fs() be dropping the ext3_force_commit() return
> > > value on the floor?
Should all the callers of ->sync_fs also be dropping the error returns
on the floor?
> >
> > - does ext4 have a similar issue (ext4_sync_fs looks the same)?
> >
>
> I'm sure it does. Someone(tm) should prepare the ext4 patch once we've
> settled on the ext3 patch.
I'll take care of it. And I would reflect the error returns up to
ext3_sync_fs's callers, although at the moment it's not clear it would
make much of a difference. :-(
- Ted
On Mon, 3 Nov 2008 16:19:29 -0500
Theodore Tso <[email protected]> wrote:
> On Mon, Nov 03, 2008 at 01:13:13PM -0800, Andrew Morton wrote:
> > On Mon, 3 Nov 2008 12:58:54 -0800
> > Arthur Jones <[email protected]> wrote:
> >
> > > Hi Andrew, ...
> > >
> > > On Mon, Nov 03, 2008 at 12:37:50PM -0800, Andrew Morton wrote:
> > > > [...]
> > > > OK. But still, questions remain.
> > > >
> > > > - should we clear s_dirt if log_wait_commit() didn't work?
> > > >
> > > > - ext3_force_commit() clears s_dirt also
> > > >
> > > > - should ext3_sync_fs() be dropping the ext3_force_commit() return
> > > > value on the floor?
>
> Should all the callers of ->sync_fs also be dropping the error returns
> on the floor?
Oh, this is sync_fs. How meaningful would it be to return an error
from sys_umount()? Would there be any point in leaving the errored fs
mounted? Dunno.
But if we're going to drop this error on the floor, we should do
that at a higher level, not on a per-fs basis ;)
Hi Andrew, ...
On Mon, Nov 03, 2008 at 12:37:50PM -0800, Andrew Morton wrote:
> [...]
> OK. But still, questions remain.
>
> - should we clear s_dirt if log_wait_commit() didn't work?
>
> - ext3_force_commit() clears s_dirt also
I'm not sure if it makes sense, but ATM I don't think it
hurts anything given that ext3_write_super is just sb->s_dirt = 0;
> - should ext3_sync_fs() be dropping the ext3_force_commit() return
> value on the floor?
Something like this (tested)?
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 2b48b85..743acf1 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2386,13 +2386,15 @@ static void ext3_write_super (struct super_block * sb)
static int ext3_sync_fs(struct super_block *sb, int wait)
{
+ int ret = 0;
+
sb->s_dirt = 0;
if (wait)
- ext3_force_commit(sb);
+ ret = ext3_force_commit(sb);
else
journal_start_commit(EXT3_SB(sb)->s_journal, NULL);
- return 0;
+ return ret;
}
/*
Arthur
On Mon, Nov 03, 2008 at 01:27:35PM -0800, Andrew Morton wrote:
> Oh, this is sync_fs. How meaningful would it be to return an error
> from sys_umount()? Would there be any point in leaving the errored fs
> mounted? Dunno.
>
> But if we're going to drop this error on the floor, we should do
> that at a higher level, not on a per-fs basis ;)
At the very least we should log it at the VFS layer, IMHO.
- Ted
Here's the ext4 version of the patch. It's slightly altered from the
ext3 version in that we do reflect the errors up to sync_fs's callers
(which then throw it on the floor, but we might as well take away some
of the fun from all of those academic researchers who like to write
papers complaining about how often Linux doesn't do appropriat eerror
checking :-).
After doing some testing, I plan to carry it in the ext4 patch queue.
I think similar changes should be made to the ext3 version of the
patch; agreed?
- Ted
commit 8106ea5364c2680a385ed240e8f898611babf661
Author: Theodore Ts'o <[email protected]>
Date: Mon Nov 3 17:01:22 2008 -0500
ext4: wait on all pending commits in ext4_sync_fs()
In ext4_sync_fs, we only wait for a commit to finish if we started it,
but there may be one already in progress which will not be synced.
In the case of a data=ordered umount with pending long symlinks which
are delayed due to a long list of other I/O on the backing block
device, this causes the buffer associated with the long symlinks to
not be moved to the inode dirty list in the second phase of
fsync_super. Then, before they can be dirtied again, kjournald exits,
seeing the UMOUNT flag and the dirty pages are never written to the
backing block device, causing long symlink corruption and exposing new
or previously freed block data to userspace.
To ensure all commits are synced, we flush all journal commits now
when sync_fs'ing ext4.
Signed-off-by: Arthur Jones <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Eric Sandeen <[email protected]>
Cc: <[email protected]>
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 97cb896..e199773 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2899,8 +2899,9 @@ int ext4_force_commit(struct super_block *sb)
return 0;
journal = EXT4_SB(sb)->s_journal;
- sb->s_dirt = 0;
ret = ext4_journal_force_commit(journal);
+ if (!ret)
+ sb->s_dirt = 0;
return ret;
}
@@ -2922,15 +2923,16 @@ static void ext4_write_super(struct super_block *sb)
static int ext4_sync_fs(struct super_block *sb, int wait)
{
- tid_t target;
+ int ret;
trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
- sb->s_dirt = 0;
- if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
- if (wait)
- jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
- }
- return 0;
+ if (wait)
+ ret = ext4_force_commit(sb);
+ else
+ ret = jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
+ if (!ret)
+ sb->s_dirt = 0;
+ return ret;
}
/*
Hi Ted, ...
On Mon, Nov 03, 2008 at 02:01:44PM -0800, Theodore Tso wrote:
> [...]
> @@ -2922,15 +2923,16 @@ static void ext4_write_super(struct super_block *sb)
>
> static int ext4_sync_fs(struct super_block *sb, int wait)
> {
> - tid_t target;
> + int ret;
>
> trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
> - sb->s_dirt = 0;
> - if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
> - if (wait)
> - jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
> - }
> - return 0;
> + if (wait)
> + ret = ext4_force_commit(sb);
> + else
> + ret = jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
> + if (!ret)
> + sb->s_dirt = 0;
> + return ret;
This bit will clear sb->s_dirt if we did _not_ start
a commit in jbd2_journal_start_commit. It will return
1 if we did start a commit which I don't think is what
we want to do here...
Arthur
On Mon, 3 Nov 2008 17:01:44 -0500
Theodore Tso <[email protected]> wrote:
> static int ext4_sync_fs(struct super_block *sb, int wait)
> {
> - tid_t target;
> + int ret;
>
> trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
> - sb->s_dirt = 0;
> - if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
> - if (wait)
> - jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
> - }
> - return 0;
> + if (wait)
> + ret = ext4_force_commit(sb);
> + else
> + ret = jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
> + if (!ret)
> + sb->s_dirt = 0;
> + return ret;
> }
It should clear s_dirt before doing the "i/o", methinks?
The usual pattern is
foo->dirty = 0;
do_io_on(foo);
because
do_io_on(foo);
modify(foo);
foo->dirty = 1;
foo->dirty = 0;
is racy.
On Mon, Nov 03, 2008 at 01:48:40PM -0800, Arthur Jones wrote:
> > - should we clear s_dirt if log_wait_commit() didn't work?
> >
> > - ext3_force_commit() clears s_dirt also
>
> I'm not sure if it makes sense, but ATM I don't think it
> hurts anything given that ext3_write_super is just sb->s_dirt = 0;
I want to look at this some more, but it's possible we should just
remove all of the references to sb->s_dirt completely. Also, the
comment just above ext[34]_write_super() is out of date since before
2.6.0.
- Ted
On Mon, Nov 03, 2008 at 02:27:06PM -0800, Andrew Morton wrote:
> It should clear s_dirt before doing the "i/o", methinks?
Yep, good point. As I mentioned earlier, though, I'm about 99% sure
that the right fix is to remove all mention of s_dirt entirely, and in
fact we can make super_operations.write_super be NULL for ext3 and
ext4. But for now we should just keep it in its usual place for now,
and save that for a cleanup commit later on.
- Ted
commit b20506dc713db1105287b691390563d2aace6d84
Author: Theodore Ts'o <[email protected]>
Date: Mon Nov 3 17:54:41 2008 -0500
ext4: wait on all pending commits in ext4_sync_fs()
In ext4_sync_fs, we only wait for a commit to finish if we started it,
but there may be one already in progress which will not be synced.
In the case of a data=ordered umount with pending long symlinks which
are delayed due to a long list of other I/O on the backing block
device, this causes the buffer associated with the long symlinks to
not be moved to the inode dirty list in the second phase of
fsync_super. Then, before they can be dirtied again, kjournald exits,
seeing the UMOUNT flag and the dirty pages are never written to the
backing block device, causing long symlink corruption and exposing new
or previously freed block data to userspace.
To ensure all commits are synced, we flush all journal commits now
when sync_fs'ing ext4.
Signed-off-by: Arthur Jones <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Eric Sandeen <[email protected]>
Cc: <[email protected]>
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 97cb896..5b5e38e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2907,12 +2907,9 @@ int ext4_force_commit(struct super_block *sb)
/*
* Ext4 always journals updates to the superblock itself, so we don't
* have to propagate any other updates to the superblock on disk at this
- * point. Just start an async writeback to get the buffers on their way
- * to the disk.
- *
- * This implicitly triggers the writebehind on sync().
+ * point. (We can probably nuke this function altogether, and remove
+ * any mention to sb->s_dirt in all of fs/ext4; eventual cleanup...)
*/
-
static void ext4_write_super(struct super_block *sb)
{
if (mutex_trylock(&sb->s_lock) != 0)
@@ -2922,15 +2919,15 @@ static void ext4_write_super(struct super_block *sb)
static int ext4_sync_fs(struct super_block *sb, int wait)
{
- tid_t target;
+ int ret;
trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
sb->s_dirt = 0;
- if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
- if (wait)
- jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
- }
- return 0;
+ if (wait)
+ ret = ext4_force_commit(sb);
+ else
+ ret = jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
+ return ret;
}
/*
Hi Ted, ...
On Mon, Nov 03, 2008 at 02:55:44PM -0800, Theodore Tso wrote:
> @@ -2922,15 +2919,15 @@ static void ext4_write_super(struct super_block *sb)
>
> static int ext4_sync_fs(struct super_block *sb, int wait)
> {
> - tid_t target;
> + int ret;
>
> trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
> sb->s_dirt = 0;
> - if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
> - if (wait)
> - jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
> - }
> - return 0;
> + if (wait)
> + ret = ext4_force_commit(sb);
> + else
> + ret = jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
> + return ret;
> }
>
> /*
Same problem as the last one, jbd2_journal_start_commit returns
true if it started a commit, not an error...
Arthur
On Mon, Nov 03, 2008 at 03:01:04PM -0800, Arthur Jones wrote:
>
> Same problem as the last one, jbd2_journal_start_commit returns
> true if it started a commit, not an error...
Ah, right. !@#! differing return conventions....
BTW, did you ever open a bugzilla entry for this the symlink
corruption problem that we should reference in the commit log?
- Ted
commit fd7384f8243a957386af3767532d035346f0d149
Author: Theodore Ts'o <[email protected]>
Date: Mon Nov 3 18:10:55 2008 -0500
ext4: wait on all pending commits in ext4_sync_fs()
In ext4_sync_fs, we only wait for a commit to finish if we started it,
but there may be one already in progress which will not be synced.
In the case of a data=ordered umount with pending long symlinks which
are delayed due to a long list of other I/O on the backing block
device, this causes the buffer associated with the long symlinks to
not be moved to the inode dirty list in the second phase of
fsync_super. Then, before they can be dirtied again, kjournald exits,
seeing the UMOUNT flag and the dirty pages are never written to the
backing block device, causing long symlink corruption and exposing new
or previously freed block data to userspace.
To ensure all commits are synced, we flush all journal commits now
when sync_fs'ing ext4.
Signed-off-by: Arthur Jones <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Eric Sandeen <[email protected]>
Cc: <[email protected]>
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 97cb896..9e7e66c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2907,12 +2907,9 @@ int ext4_force_commit(struct super_block *sb)
/*
* Ext4 always journals updates to the superblock itself, so we don't
* have to propagate any other updates to the superblock on disk at this
- * point. Just start an async writeback to get the buffers on their way
- * to the disk.
- *
- * This implicitly triggers the writebehind on sync().
+ * point. (We can probably nuke this function altogether, and remove
+ * any mention to sb->s_dirt in all of fs/ext4; eventual cleanup...)
*/
-
static void ext4_write_super(struct super_block *sb)
{
if (mutex_trylock(&sb->s_lock) != 0)
@@ -2922,15 +2919,15 @@ static void ext4_write_super(struct super_block *sb)
static int ext4_sync_fs(struct super_block *sb, int wait)
{
- tid_t target;
+ int ret = 0;
trace_mark(ext4_sync_fs, "dev %s wait %d", sb->s_id, wait);
sb->s_dirt = 0;
- if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
- if (wait)
- jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
- }
- return 0;
+ if (wait)
+ ret = ext4_force_commit(sb);
+ else
+ jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
+ return ret;
}
/*
Hi Ted, ...
On Mon, Nov 03, 2008 at 03:12:09PM -0800, Theodore Tso wrote:
> [...]
> BTW, did you ever open a bugzilla entry for this the symlink
> corruption problem that we should reference in the commit log?
Nope...
Patch looks good now -- thanks for looking into this...
Arthur
Hello,
I'm sorry I'm replying late but I got time to react to this only now...
> On Mon, Nov 03, 2008 at 11:33:18AM -0800, Andrew Morton wrote:
> > [...]
> > > --- a/fs/ext3/super.c
> > > +++ b/fs/ext3/super.c
> > > @@ -2392,7 +2392,13 @@ static int ext3_sync_fs(struct super_block *sb, int wait)
> > > if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
> > > if (wait)
> > > log_wait_commit(EXT3_SB(sb)->s_journal, target);
> > > - }
> > > + } else if (wait)
> > > + /*
> > > + * We may have a commit in progress, clear it out
> > > + * before we go on...
> > > + */
> > > + ext3_force_commit(sb);
> > > +
> > > return 0;
> > > }
> >
> > Can we do
> >
> > sb->s_dirt = 0;
> > if (wait)
> > ext3_force_commit(...);
> > else
> > journal_start_commit(...);
>
> I think this is what you had in mind:
>
>
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 18eaa78..2b48b85 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2386,13 +2386,12 @@ static void ext3_write_super (struct super_block * sb)
>
> static int ext3_sync_fs(struct super_block *sb, int wait)
> {
> - tid_t target;
> -
> sb->s_dirt = 0;
> - if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
> - if (wait)
> - log_wait_commit(EXT3_SB(sb)->s_journal, target);
> - }
> + if (wait)
> + ext3_force_commit(sb);
> + else
> + journal_start_commit(EXT3_SB(sb)->s_journal, NULL);
> +
> return 0;
> }
>
> I tried this and it too fixes the problem. FWIW I agree it
> looks better...
Well, shouldn't we rather fix what journal_start_commit() returns?
The interface which returns 1 when a transaction is already committing or
a transaction commit has just been started but 0 when we race with
somebody staring the commit is fairly unusable. Moreover
ext3_force_commit() will unnecessarily create new sync transaction and
commit it if there's no transaction running which is quite expensive
(even merging empty sync handle is not for free because of sync
transaction batching). But this is minor problem since we probably
don't care too much about sync() performance - BTW this is probably a
cause for bug 12224, isn't it?
BTW: ocfs2 would need fixing as well if done your way since it's
sync_fs function has been copied over from ext3.
To summarized I'd rather see a patch like below (untested) going in
and your patch reverted... Opinions? I can cookup a JBD2 version of
the patch in case we agree to go this way.
Honza
>From 0a578ba1b56fe655570ee6dad41748863a120dbc Mon Sep 17 00:00:00 2001
From: Jan Kara <[email protected]>
Date: Fri, 19 Dec 2008 00:05:34 +0100
Subject: [PATCH] jbd: Fix return value of journal_start_commit()
journal_start_commit() returns 1 if either a transaction is committing or the
function has queued a transaction commit. But it returns 0 if we raced with
somebody queueing the transaction commit as well. This resulted in
ext3_sync_fs() not functioning correctly (description from Arthur Jones):
In the case of a data=ordered umount with pending long symlinks which are
delayed due to a long list of other I/O on the backing block device, this
causes the buffer associated with the long symlinks to not be moved to the
inode dirty list in the second phase of fsync_super. Then, before they can be
dirtied again, kjournald exits, seeing the UMOUNT flag and the dirty pages are
never written to the backing block device, causing long symlink corruption and
exposing new or previously freed block data to userspace.
This can be reproduced with a script created by Eric Sandeen
<[email protected]>:
#!/bin/bash
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
rm -f /mnt/test2/*
dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
/mnt/test2/link
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
ls /mnt/test2/
This patch fixes journal_start_commit() to always return 1 when there's
a transaction committing or queued for commit.
Signed-off-by: Jan Kara <[email protected]>
---
fs/jbd/journal.c | 17 +++++++++++------
1 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 9e4fa52..e79c078 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -427,7 +427,7 @@ int __log_space_left(journal_t *journal)
}
/*
- * Called under j_state_lock. Returns true if a transaction was started.
+ * Called under j_state_lock. Returns true if a transaction commit was started.
*/
int __log_start_commit(journal_t *journal, tid_t target)
{
@@ -495,7 +495,8 @@ int journal_force_commit_nested(journal_t *journal)
/*
* Start a commit of the current running transaction (if any). Returns true
- * if a transaction was started, and fills its tid in at *ptid
+ * if a transaction is going to be committed (or is currently already
+ * committing), and fills its tid in at *ptid
*/
int journal_start_commit(journal_t *journal, tid_t *ptid)
{
@@ -505,15 +506,19 @@ int journal_start_commit(journal_t *journal, tid_t *ptid)
if (journal->j_running_transaction) {
tid_t tid = journal->j_running_transaction->t_tid;
- ret = __log_start_commit(journal, tid);
- if (ret && ptid)
+ __log_start_commit(journal, tid);
+ /* There's a running transaction and we've just made sure
+ * it's commit has been scheduled. */
+ if (ptid)
*ptid = tid;
- } else if (journal->j_committing_transaction && ptid) {
+ ret = 1;
+ } else if (journal->j_committing_transaction) {
/*
* If ext3_write_super() recently started a commit, then we
* have to wait for completion of that transaction
*/
- *ptid = journal->j_committing_transaction->t_tid;
+ if (ptid)
+ *ptid = journal->j_committing_transaction->t_tid;
ret = 1;
}
spin_unlock(&journal->j_state_lock);
--
1.6.0.2
Jan Kara wrote:
> Hello,
>
> I'm sorry I'm replying late but I got time to react to this only now...
>
<snip>
>> I tried this and it too fixes the problem. FWIW I agree it
>> looks better...
> Well, shouldn't we rather fix what journal_start_commit() returns?
> The interface which returns 1 when a transaction is already committing or
> a transaction commit has just been started but 0 when we race with
> somebody staring the commit is fairly unusable. Moreover
> ext3_force_commit() will unnecessarily create new sync transaction and
> commit it if there's no transaction running which is quite expensive
> (even merging empty sync handle is not for free because of sync
> transaction batching). But this is minor problem since we probably
> don't care too much about sync() performance - BTW this is probably a
> cause for bug 12224, isn't it?
Yep, it is! :)
> BTW: ocfs2 would need fixing as well if done your way since it's
> sync_fs function has been copied over from ext3.
> To summarized I'd rather see a patch like below (untested) going in
> and your patch reverted... Opinions? I can cookup a JBD2 version of
> the patch in case we agree to go this way.
Thanks, I'll look that over.
In looking at what we have today, I wonder if we can make things smarter
so that we don't commit empty transactions in any case?
-Eric
On Thu 18-12-08 17:37:23, Eric Sandeen wrote:
> Jan Kara wrote:
> > Hello,
> >
> > I'm sorry I'm replying late but I got time to react to this only now...
> >
>
> <snip>
>
> >> I tried this and it too fixes the problem. FWIW I agree it
> >> looks better...
> > Well, shouldn't we rather fix what journal_start_commit() returns?
> > The interface which returns 1 when a transaction is already committing or
> > a transaction commit has just been started but 0 when we race with
> > somebody staring the commit is fairly unusable. Moreover
> > ext3_force_commit() will unnecessarily create new sync transaction and
> > commit it if there's no transaction running which is quite expensive
> > (even merging empty sync handle is not for free because of sync
> > transaction batching). But this is minor problem since we probably
> > don't care too much about sync() performance - BTW this is probably a
> > cause for bug 12224, isn't it?
>
> Yep, it is! :)
>
> > BTW: ocfs2 would need fixing as well if done your way since it's
> > sync_fs function has been copied over from ext3.
> > To summarized I'd rather see a patch like below (untested) going in
> > and your patch reverted... Opinions? I can cookup a JBD2 version of
> > the patch in case we agree to go this way.
>
> Thanks, I'll look that over.
Thanks.
> In looking at what we have today, I wonder if we can make things smarter
> so that we don't commit empty transactions in any case?
Probably it does not make sence to commit such transactions and we might
save some time in sync paths if we do so. So yes, I think skipping empty
transaction commit might be worthwhile and it shouldn't be hard to do
either. But I'd give it serious testing just in case some unexpectedly
relies on this behaviour - wouldn't this interfere e.g. with sync
transaction batching autotuning code? Untested patch below...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
---
>From 3e1a8dc6a64fdce5c99d7edca0a2812178f7463c Mon Sep 17 00:00:00 2001
From: Jan Kara <[email protected]>
Date: Fri, 19 Dec 2008 01:20:47 +0100
Subject: [PATCH] jbd2: Skip commit of a transaction without any buffers
Signed-off-by: Jan Kara <[email protected]>
---
fs/jbd2/commit.c | 16 ++++++++++++++++
1 files changed, 16 insertions(+), 0 deletions(-)
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index ebc667b..798e021 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -401,6 +401,21 @@ void jbd2_journal_commit_transaction(journal_t *journal)
J_ASSERT (commit_transaction->t_outstanding_credits <=
journal->j_max_transaction_buffers);
+ /* Skip commit of a transaction without any buffers */
+ spin_lock(&journal->j_list_lock);
+ if (commit_transaction->t_reserved_list == NULL &&
+ commit_transaction->t_buffers == NULL &&
+ commit_transaction->t_forget == NULL &&
+ list_empty(commit_transaction->t_inode_list)) {
+ BUG_ON(commit_transaction->t_checkpoint_list);
+ BUG_ON(commit_transaction->t_checkpoint_io_list);
+ BUG_ON(commit_transaction->t_iobuf_list);
+ BUG_ON(commit_transaction->t_shadow_list);
+ BUG_ON(commit_transaction->t_log_list);
+ goto out_committed;
+ }
+ spin_unlock(&journal->j_list_lock);
+
/*
* First thing we are allowed to do is to discard any remaining
* BJ_Reserved buffers. Note, it is _not_ permissible to assume
@@ -934,6 +949,7 @@ restart_loop:
/* Done with this transaction! */
+out_committed:
jbd_debug(3, "JBD: commit phase 7\n");
J_ASSERT(commit_transaction->t_state == T_COMMIT);
--
1.6.0.2
Jan Kara wrote:
>> In looking at what we have today, I wonder if we can make things smarter
>> so that we don't commit empty transactions in any case?
> Probably it does not make sence to commit such transactions and we might
> save some time in sync paths if we do so. So yes, I think skipping empty
> transaction commit might be worthwhile and it shouldn't be hard to do
> either. But I'd give it serious testing just in case some unexpectedly
> relies on this behaviour - wouldn't this interfere e.g. with sync
> transaction batching autotuning code? Untested patch below...
> Honza
Cool, thanks! This's stop:
# sync
from spinning up disks under idle filesystems too, I think.
I was looking at something similar but was still working out how many
things to check before deciding if the transaction was in fact empty. :)
-Eric
Eric Sandeen wrote:
> Jan Kara wrote:
>
>
>>> In looking at what we have today, I wonder if we can make things smarter
>>> so that we don't commit empty transactions in any case?
>>>
>> Probably it does not make sence to commit such transactions and we might
>> save some time in sync paths if we do so. So yes, I think skipping empty
>> transaction commit might be worthwhile and it shouldn't be hard to do
>> either. But I'd give it serious testing just in case some unexpectedly
>> relies on this behaviour - wouldn't this interfere e.g. with sync
>> transaction batching autotuning code? Untested patch below...
>> Honza
>>
>
>
> Cool, thanks! This's stop:
>
> # sync
>
> from spinning up disks under idle filesystems too, I think.
>
> I was looking at something similar but was still working out how many
> things to check before deciding if the transaction was in fact empty. :)
>
> -Eric
>
Without having dived into the patch in detail, one worry I would have is
that we still might care to spin up a drive for empty transactions in
order to invalidate the drive's write cache.
For example, if we have the following sequence:
(1) user app performs series of writes to file A
(2) pages dirtied from writes to A are destaged to the disk over time
(3) user app issues fsync(file A) to make sure that the data will
survive a power outage
At this point in time, would this change prevent us from spinning up the
drive and invalidating the disk write cache for that fsync() ?
Regards,
Ric
On Dec 22, 2008 14:15 -0500, Ric Wheeler wrote:
> Without having dived into the patch in detail, one worry I would have is
> that we still might care to spin up a drive for empty transactions in
> order to invalidate the drive's write cache.
>
> For example, if we have the following sequence:
>
> (1) user app performs series of writes to file A
> (2) pages dirtied from writes to A are destaged to the disk over time
> (3) user app issues fsync(file A) to make sure that the data will
> survive a power outage
>
> At this point in time, would this change prevent us from spinning up the
> drive and invalidating the disk write cache for that fsync() ?
Well, if the writes themselves didn't spin up the drive, it is uncertain
whether the write of the journal commit block would be any more helpful
in getting that to happen.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> On Dec 22, 2008 14:15 -0500, Ric Wheeler wrote:
>
>> Without having dived into the patch in detail, one worry I would have is
>> that we still might care to spin up a drive for empty transactions in
>> order to invalidate the drive's write cache.
>>
>> For example, if we have the following sequence:
>>
>> (1) user app performs series of writes to file A
>> (2) pages dirtied from writes to A are destaged to the disk over time
>> (3) user app issues fsync(file A) to make sure that the data will
>> survive a power outage
>>
>> At this point in time, would this change prevent us from spinning up the
>> drive and invalidating the disk write cache for that fsync() ?
>>
>
> Well, if the writes themselves didn't spin up the drive, it is uncertain
> whether the write of the journal commit block would be any more helpful
> in getting that to happen.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
To make the example cleared, add a step:
(2.5) between the writes and the fsync(), the drive spins down
In this case, if the app issues an fsync(), will we still have volatile
data in its write cache that has not been flushed?
Ric
Andreas Dilger wrote:
> On Dec 22, 2008 14:15 -0500, Ric Wheeler wrote:
>> Without having dived into the patch in detail, one worry I would have is
>> that we still might care to spin up a drive for empty transactions in
>> order to invalidate the drive's write cache.
>>
>> For example, if we have the following sequence:
>>
>> (1) user app performs series of writes to file A
>> (2) pages dirtied from writes to A are destaged to the disk over time
>> (3) user app issues fsync(file A) to make sure that the data will
>> survive a power outage
>>
>> At this point in time, would this change prevent us from spinning up the
>> drive and invalidating the disk write cache for that fsync() ?
>
> Well, if the writes themselves didn't spin up the drive, it is uncertain
> whether the write of the journal commit block would be any more helpful
> in getting that to happen.
So, ext4_sync_file() calls blkdev_issue_flush() which would should do
the right thing even if the drive is spun down, I think (rather than
hoping that some other journal activity would flush this out...)
I guess I don't know for sure what blkdev_issue_flush does on a
spun-down drive but I'd hope it does the right thing.
Pretty sure I sent a patch for ext3 to do the same, but it was
ignored/dropped/forgotten along with the barriers-by-default patch.
Suppose I could try again.
-Eric
Hi Eric,
> Jan Kara wrote:
> > Hello,
> >
> > I'm sorry I'm replying late but I got time to react to this only now...
> >
>
> >> I tried this and it too fixes the problem. FWIW I agree it
> >> looks better...
> > Well, shouldn't we rather fix what journal_start_commit() returns?
> > The interface which returns 1 when a transaction is already committing or
> > a transaction commit has just been started but 0 when we race with
> > somebody staring the commit is fairly unusable. Moreover
> > ext3_force_commit() will unnecessarily create new sync transaction and
> > commit it if there's no transaction running which is quite expensive
> > (even merging empty sync handle is not for free because of sync
> > transaction batching). But this is minor problem since we probably
> > don't care too much about sync() performance - BTW this is probably a
> > cause for bug 12224, isn't it?
>
> Yep, it is! :)
>
> > BTW: ocfs2 would need fixing as well if done your way since it's
> > sync_fs function has been copied over from ext3.
> > To summarized I'd rather see a patch like below (untested) going in
> > and your patch reverted... Opinions? I can cookup a JBD2 version of
> > the patch in case we agree to go this way.
>
> Thanks, I'll look that over.
Any chance you've looked over that patch? Thanks.
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
Jan Kara wrote:
> Hi Eric,
>
>> Jan Kara wrote:
>>> Hello,
>>>
>>> I'm sorry I'm replying late but I got time to react to this only now...
>>>
>>>> I tried this and it too fixes the problem. FWIW I agree it
>>>> looks better...
>>> Well, shouldn't we rather fix what journal_start_commit() returns?
>>> The interface which returns 1 when a transaction is already committing or
>>> a transaction commit has just been started but 0 when we race with
>>> somebody staring the commit is fairly unusable. Moreover
>>> ext3_force_commit() will unnecessarily create new sync transaction and
>>> commit it if there's no transaction running which is quite expensive
>>> (even merging empty sync handle is not for free because of sync
>>> transaction batching). But this is minor problem since we probably
>>> don't care too much about sync() performance - BTW this is probably a
>>> cause for bug 12224, isn't it?
>> Yep, it is! :)
>>
>>> BTW: ocfs2 would need fixing as well if done your way since it's
>>> sync_fs function has been copied over from ext3.
>>> To summarized I'd rather see a patch like below (untested) going in
>>> and your patch reverted... Opinions? I can cookup a JBD2 version of
>>> the patch in case we agree to go this way.
>> Thanks, I'll look that over.
> Any chance you've looked over that patch? Thanks.
>
> Honza
Sorry, kind of slipped through the cracks. I'll do that and run it
through the testcase today.
-Eric
Jan Kara wrote:
> From 0a578ba1b56fe655570ee6dad41748863a120dbc Mon Sep 17 00:00:00 2001
> From: Jan Kara <[email protected]>
> Date: Fri, 19 Dec 2008 00:05:34 +0100
> Subject: [PATCH] jbd: Fix return value of journal_start_commit()
>
> journal_start_commit() returns 1 if either a transaction is committing or the
> function has queued a transaction commit. But it returns 0 if we raced with
> somebody queueing the transaction commit as well. This resulted in
> ext3_sync_fs() not functioning correctly (description from Arthur Jones):
> In the case of a data=ordered umount with pending long symlinks which are
> delayed due to a long list of other I/O on the backing block device, this
> causes the buffer associated with the long symlinks to not be moved to the
> inode dirty list in the second phase of fsync_super. Then, before they can be
> dirtied again, kjournald exits, seeing the UMOUNT flag and the dirty pages are
> never written to the backing block device, causing long symlink corruption and
> exposing new or previously freed block data to userspace.
This looks sane to me, and it does fix the below testcase.
Care to formally propose it?
Thanks,
-Eric
> This can be reproduced with a script created by Eric Sandeen
> <[email protected]>:
>
> #!/bin/bash
>
> umount /mnt/test2
> mount /dev/sdb4 /mnt/test2
> rm -f /mnt/test2/*
> dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
> touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
> ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
> /mnt/test2/link
> umount /mnt/test2
> mount /dev/sdb4 /mnt/test2
> ls /mnt/test2/
>
> This patch fixes journal_start_commit() to always return 1 when there's
> a transaction committing or queued for commit.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/jbd/journal.c | 17 +++++++++++------
> 1 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
> index 9e4fa52..e79c078 100644
> --- a/fs/jbd/journal.c
> +++ b/fs/jbd/journal.c
> @@ -427,7 +427,7 @@ int __log_space_left(journal_t *journal)
> }
>
> /*
> - * Called under j_state_lock. Returns true if a transaction was started.
> + * Called under j_state_lock. Returns true if a transaction commit was started.
> */
> int __log_start_commit(journal_t *journal, tid_t target)
> {
> @@ -495,7 +495,8 @@ int journal_force_commit_nested(journal_t *journal)
>
> /*
> * Start a commit of the current running transaction (if any). Returns true
> - * if a transaction was started, and fills its tid in at *ptid
> + * if a transaction is going to be committed (or is currently already
> + * committing), and fills its tid in at *ptid
> */
> int journal_start_commit(journal_t *journal, tid_t *ptid)
> {
> @@ -505,15 +506,19 @@ int journal_start_commit(journal_t *journal, tid_t *ptid)
> if (journal->j_running_transaction) {
> tid_t tid = journal->j_running_transaction->t_tid;
>
> - ret = __log_start_commit(journal, tid);
> - if (ret && ptid)
> + __log_start_commit(journal, tid);
> + /* There's a running transaction and we've just made sure
> + * it's commit has been scheduled. */
> + if (ptid)
> *ptid = tid;
> - } else if (journal->j_committing_transaction && ptid) {
> + ret = 1;
> + } else if (journal->j_committing_transaction) {
> /*
> * If ext3_write_super() recently started a commit, then we
> * have to wait for completion of that transaction
> */
> - *ptid = journal->j_committing_transaction->t_tid;
> + if (ptid)
> + *ptid = journal->j_committing_transaction->t_tid;
> ret = 1;
> }
> spin_unlock(&journal->j_state_lock);
On Tue, Jan 13, 2009 at 04:14:11PM -0600, Eric Sandeen wrote:
>
> This looks sane to me, and it does fix the below testcase.
>
> Care to formally propose it?
Can we confirm what is being proposed? From following this thread, I
think what folks are suggesting is:
1) Revert the current "ext3/4: wait on all pending ocmmits in ext3/4_sync_fs"
2) Apply Jan's patch "jbd[2]: Fix return value of journal_start_commit()"
3) Also apply Jan's patch "jbd2: Skip commit of a transaction without
any buffers" since it appears to be a good optimization (although it's
not clear it would happen once we revert (1), above.
- Ted
Theodore Tso wrote:
> On Tue, Jan 13, 2009 at 04:14:11PM -0600, Eric Sandeen wrote:
>> This looks sane to me, and it does fix the below testcase.
>>
>> Care to formally propose it?
>
> Can we confirm what is being proposed? From following this thread, I
> think what folks are suggesting is:
>
> 1) Revert the current "ext3/4: wait on all pending ocmmits in ext3/4_sync_fs"
>
> 2) Apply Jan's patch "jbd[2]: Fix return value of journal_start_commit()"
The above seems like the right plan to me. It should address the
kernel.org bug about IO on idle partitions.
> 3) Also apply Jan's patch "jbd2: Skip commit of a transaction without
> any buffers" since it appears to be a good optimization (although it's
> not clear it would happen once we revert (1), above.
Maybe not yet, it could probably use more testing/review/soak.
-Eric
> - Ted
On Tue 13-01-09 23:24:02, Theodore Tso wrote:
> On Tue, Jan 13, 2009 at 04:14:11PM -0600, Eric Sandeen wrote:
> >
> > This looks sane to me, and it does fix the below testcase.
> >
> > Care to formally propose it?
>
> Can we confirm what is being proposed? From following this thread, I
> think what folks are suggesting is:
>
> 1) Revert the current "ext3/4: wait on all pending ocmmits in ext3/4_sync_fs"
Yes.
> 2) Apply Jan's patch "jbd[2]: Fix return value of journal_start_commit()"
Yes.
> 3) Also apply Jan's patch "jbd2: Skip commit of a transaction without
> any buffers" since it appears to be a good optimization (although it's
> not clear it would happen once we revert (1), above.
Yes, it's an optimization but I'm still a bit afraid about something
relying on jbd2_journal_force_commit() implying a barrier which would not
always be a case after this patch... So we should probably audit all users of
ext4_force_commit() and check that this change is fine with them.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Jan 14, 2009 at 12:27 PM, Jan Kara <[email protected]> wrote:
> On Tue 13-01-09 23:24:02, Theodore Tso wrote:
>> On Tue, Jan 13, 2009 at 04:14:11PM -0600, Eric Sandeen wrote:
>> >
>> > This looks sane to me, and it does fix the below testcase.
>> >
>> > Care to formally propose it?
>>
>> Can we confirm what is being proposed? From following this thread, I
>> think what folks are suggesting is:
>>
>> 1) Revert the current "ext3/4: wait on all pending ocmmits in ext3/4_sync_fs"
> Yes.
>
>> 2) Apply Jan's patch "jbd[2]: Fix return value of journal_start_commit()"
> Yes.
>
>> 3) Also apply Jan's patch "jbd2: Skip commit of a transaction without
>> any buffers" since it appears to be a good optimization (although it's
>> not clear it would happen once we revert (1), above.
> Yes, it's an optimization but I'm still a bit afraid about something
> relying on jbd2_journal_force_commit() implying a barrier which would not
> always be a case after this patch... So we should probably audit all users of
> ext4_force_commit() and check that this change is fine with them.
Ted/Jan/Eric,
I just wanted to followup on this to see what the plan is. Items 1
and 2 haven't occurred in any of the ext4.git branches that I can see.
I could be missing something but it seems this may have slipped
through the ext[34] cracks?
Mike
Mike Snitzer wrote:
> On Wed, Jan 14, 2009 at 12:27 PM, Jan Kara <[email protected]> wrote:
>> On Tue 13-01-09 23:24:02, Theodore Tso wrote:
>>> On Tue, Jan 13, 2009 at 04:14:11PM -0600, Eric Sandeen wrote:
>>>> This looks sane to me, and it does fix the below testcase.
>>>>
>>>> Care to formally propose it?
>>> Can we confirm what is being proposed? From following this thread, I
>>> think what folks are suggesting is:
>>>
>>> 1) Revert the current "ext3/4: wait on all pending ocmmits in ext3/4_sync_fs"
>> Yes.
>>
>>> 2) Apply Jan's patch "jbd[2]: Fix return value of journal_start_commit()"
>> Yes.
>>
>>> 3) Also apply Jan's patch "jbd2: Skip commit of a transaction without
>>> any buffers" since it appears to be a good optimization (although it's
>>> not clear it would happen once we revert (1), above.
>> Yes, it's an optimization but I'm still a bit afraid about something
>> relying on jbd2_journal_force_commit() implying a barrier which would not
>> always be a case after this patch... So we should probably audit all users of
>> ext4_force_commit() and check that this change is fine with them.
>
> Ted/Jan/Eric,
>
> I just wanted to followup on this to see what the plan is. Items 1
> and 2 haven't occurred in any of the ext4.git branches that I can see.
> I could be missing something but it seems this may have slipped
> through the ext[34] cracks?
Hm, I agree.
Jan, do you want to re-send it in its own message rather than buried in
the other thread? I don't know how we technically handle a "revert"
upstream, to be honest.
-Eric