Sorry, that last send was totally stuffed up w.r.t. mail headers.
So for the benefit of proper cc:'s here it comes again...
A collection of patches to make ext3 & 4 use barriers by
default, and to call blkdev_issue_flush on fsync if they
are enabled.
I can't think of any valid reason for ext3 to not use barriers when
they are available; I believe this is necessary for filesystem
integrity in the face of a volatile write cache on storage.
An administrator who trusts that the cache is sufficiently battery-
backed (and power supplies are sufficiently redundant, etc...)
can always turn it back off again.
SuSE has carried such a patch for quite some time now.
Also document the mount option while we're at it.
Signed-off-by: Eric Sandeen <[email protected]>
Acked-by: Jan Kara <[email protected]>
---
Documentation/filesystems/ext3.txt | 12 ++++++++++--
fs/ext3/super.c | 11 +++++++++--
2 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index b45f3c1..daab1f5 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -52,8 +52,16 @@ commit=nrsec (*) Ext3 can be told to sync all its data and metadata
Setting it to very large values will improve
performance.
-barrier=1 This enables/disables barriers. barrier=0 disables
- it, barrier=1 enables it.
+barrier=<0|1(*)> This enables/disables the use of write barriers in
+ the jbd code. barrier=0 disables, barrier=1 enables.
+ This also requires an IO stack which can support
+ barriers, and if jbd gets an error on a barrier
+ write, it will disable again with a warning.
+ Write barriers enforce proper on-disk ordering
+ of journal commits, making volatile disk write caches
+ safe to use, at some performance penalty. If
+ your disks are battery-backed in one way or another,
+ disabling barriers may safely improve performance.
orlov (*) This enables the new Orlov block allocator. It is
enabled by default.
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index fe3119a..9c30dc7 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -555,6 +555,7 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
struct super_block *sb = vfs->mnt_sb;
struct ext3_sb_info *sbi = EXT3_SB(sb);
struct ext3_super_block *es = sbi->s_es;
+ journal_t *journal = sbi->s_journal;
unsigned long def_mount_opts;
def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
@@ -613,8 +614,13 @@ static int ext3_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_printf(seq, ",commit=%u",
(unsigned) (sbi->s_commit_interval / HZ));
}
- if (test_opt(sb, BARRIER))
- seq_puts(seq, ",barrier=1");
+ /*
+ * jbd inherits the barrier flag from ext3, and may actually
+ * turn off barriers if a write fails, so it's the real test.
+ */
+ if (!test_opt(sb, BARRIER) ||
+ (journal && !(journal->j_flags & JFS_BARRIER)))
+ seq_puts(seq, ",barrier=0");
if (test_opt(sb, NOBH))
seq_puts(seq, ",nobh");
@@ -1589,6 +1595,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
set_opt(sbi->s_mount_opt, RESERVATION);
+ set_opt(sbi->s_mount_opt, BARRIER);
if (!parse_options ((char *) data, sb, &journal_inum, &journal_devnum,
NULL, 0))
-- 1.5.3.6
To ensure that bits are truly on-disk after an fsync,
ext3 should call blkdev_issue_flush if barriers are supported.
Inspired by an old thread on barriers, by reiserfs & xfs
which do the same, and by a patch SuSE ships with their kernel.
Signed-off-by: Eric Sandeen <[email protected]>
---
fs/ext3/fsync.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index d336341..f6167ec 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -27,6 +27,7 @@
#include <linux/sched.h>
#include <linux/writeback.h>
#include <linux/jbd.h>
+#include <linux/blkdev.h>
#include <linux/ext3_fs.h>
#include <linux/ext3_jbd.h>
@@ -45,6 +46,7 @@
int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
int ret = 0;
J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -85,6 +87,9 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
+
+ if (journal && (journal->j_flags & JFS_BARRIER))
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
}
out:
return ret;
-- 1.5.3.6
I can't think of any valid reason for ext4 to not use barriers when
they are available; I believe this is necessary for filesystem
integrity in the face of a volatile write cache on storage.
An administrator who trusts that the cache is sufficiently battery-
backed (and power supplies are sufficiently redundant, etc...)
can always turn it back off again.
SuSE has carried such a patch for ext3 for quite some time now.
Also document the mount option while we're at it.
Signed-off-by: Eric Sandeen <[email protected]>
---
Documentation/filesystems/ext4.txt | 12 ++++++++++--
fs/ext4/super.c | 11 +++++++++--
2 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 560f88d..0c5086d 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -139,8 +139,16 @@ commit=nrsec (*) Ext4 can be told to sync all its data and metadata
Setting it to very large values will improve
performance.
-barrier=1 This enables/disables barriers. barrier=0 disables
- it, barrier=1 enables it.
+barrier=<0|1(*)> This enables/disables the use of write barriers in
+ the jbd code. barrier=0 disables, barrier=1 enables.
+ This also requires an IO stack which can support
+ barriers, and if jbd gets an error on a barrier
+ write, it will disable again with a warning.
+ Write barriers enforce proper on-disk ordering
+ of journal commits, making volatile disk write caches
+ safe to use, at some performance penalty. If
+ your disks are battery-backed in one way or another,
+ disabling barriers may safely improve performance.
orlov (*) This enables the new Orlov block allocator. It is
enabled by default.
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 52dd067..77b036a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -671,6 +671,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
unsigned long def_mount_opts;
struct super_block *sb = vfs->mnt_sb;
struct ext4_sb_info *sbi = EXT4_SB(sb);
+ journal_t *journal = sbi->s_journal;
struct ext4_super_block *es = sbi->s_es;
def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
@@ -729,8 +730,13 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_printf(seq, ",commit=%u",
(unsigned) (sbi->s_commit_interval / HZ));
}
- if (test_opt(sb, BARRIER))
- seq_puts(seq, ",barrier=1");
+ /*
+ * jbd2 inherits the barrier flag from ext4, and may actually
+ * turn off barriers if a write fails, so it's the real test.
+ */
+ if (!test_opt(sb, BARRIER) ||
+ (journal && !(journal->j_flags & JBD2_BARRIER)))
+ seq_puts(seq, ",barrier=0");
if (test_opt(sb, NOBH))
seq_puts(seq, ",nobh");
if (!test_opt(sb, EXTENTS))
@@ -1890,6 +1896,7 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
set_opt(sbi->s_mount_opt, RESERVATION);
+ set_opt(sbi->s_mount_opt, BARRIER);
/*
* turn on extents feature by default in ext4 filesystem
-- 1.5.3.6
To ensure that bits are truly on-disk after an fsync,
we should call blkdev_issue_flush if barriers are supported.
Inspired by an old thread on barriers, by reiserfs & xfs
which do the same, and by a patch SuSE ships with their kernel.
Signed-off-by: Eric Sandeen <[email protected]>
---
fs/ext4/fsync.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 1c8ba48..a45c373 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -27,6 +27,7 @@
#include <linux/sched.h>
#include <linux/writeback.h>
#include <linux/jbd2.h>
+#include <linux/blkdev.h>
#include "ext4.h"
#include "ext4_jbd2.h"
@@ -45,6 +46,7 @@
int ext4_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
int ret = 0;
J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -85,6 +87,8 @@ int ext4_sync_file(struct file * file, struct dentry *dentry, int datasync)
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
+ if (journal && (journal->j_flags & JBD2_BARRIER))
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
}
out:
return ret;
-- 1.5.3.6
On Fri, 16 May 2008 14:02:46 -0500
Eric Sandeen <[email protected]> wrote:
> A collection of patches to make ext3 & 4 use barriers by
> default, and to call blkdev_issue_flush on fsync if they
> are enabled.
Last time this came up lots of workloads slowed down by 30% so I
dropped the patches in horror.
I just don't think we can quietly go and slow everyone's machines down
by this much. The overhead of journalling is already pretty horrid.
If we were seeing a significant number of "hey, my disk got wrecked"
reports which were attributable to this then yes, perhaps we should change
the default. But I've never seen _any_, although I've seen claims that
others have seen reports.
There are no happy solutions here, and I'm inclined to let this dog
remain asleep and continue to leave it up to distributors to decide
what their default should be.
Do we know which distros are enabling barriers by default?
Andrew Morton wrote:
> On Fri, 16 May 2008 14:02:46 -0500
> Eric Sandeen <[email protected]> wrote:
>
>> A collection of patches to make ext3 & 4 use barriers by
>> default, and to call blkdev_issue_flush on fsync if they
>> are enabled.
>
> Last time this came up lots of workloads slowed down by 30% so I
> dropped the patches in horror.
I actually did a bit of research and found the old thread, honestly. I
thought this might not be a shoo-in. :) Seems worth hashing out, though.
> I just don't think we can quietly go and slow everyone's machines down
> by this much. The overhead of journalling is already pretty horrid.
But if journali[zi]ng guarantees are thrown out the window by volatile
caches on disk, why bother with the half-solution? Slower while you
run, worthless when you lose power? Sounds like the worst of both
worlds. (well, ok, experience shows that it's not worthless in practice...)
> If we were seeing a significant number of "hey, my disk got wrecked"
> reports which were attributable to this then yes, perhaps we should change
> the default. But I've never seen _any_, although I've seen claims that
> others have seen reports.
Hm, how would we know, really? What does it look like? It'd totally
depend on what got lost... When do you find out? Again depends what
you're doing, I think. I'll admit that I don't have any good evidence
of my own. I'll go off and do some plug-pull-testing and a benchmark or
two.
But, drive caches are only getting bigger, I assume this can't help. I
have a hard time seeing how speed at the cost of correctness is the
right call...
> There are no happy solutions here, and I'm inclined to let this dog
> remain asleep and continue to leave it up to distributors to decide
> what their default should be.
>
> Do we know which distros are enabling barriers by default?
SuSE does (via patch for ext3). Red Hat & Fedora don't, and install by
default on lvm which won't pass barriers anyway. So maybe it's
hypocritical to send this patch from redhat.com :)
And as another "who uses barriers" datapoint, reiserfs & xfs both have
them on by default.
I suppose alternately I could send another patch to remove "remember
that ext3/4 by default offers higher data integrity guarantees than
most." from Documentation/filesystems/ext4.txt ;)
-Eric
On Fri, 16 May 2008 15:53:31 -0500
Eric Sandeen <[email protected]> wrote:
> Andrew Morton wrote:
> > On Fri, 16 May 2008 14:02:46 -0500
> > Eric Sandeen <[email protected]> wrote:
> >
> >> A collection of patches to make ext3 & 4 use barriers by
> >> default, and to call blkdev_issue_flush on fsync if they
> >> are enabled.
> >
> > Last time this came up lots of workloads slowed down by 30% so I
> > dropped the patches in horror.
>
> I actually did a bit of research and found the old thread, honestly. I
> thought this might not be a shoo-in. :) Seems worth hashing out, though.
>
> > I just don't think we can quietly go and slow everyone's machines down
> > by this much. The overhead of journalling is already pretty horrid.
>
> But if journali[zi]ng guarantees are thrown out the window by volatile
> caches on disk, why bother with the half-solution? Slower while you
> run, worthless when you lose power? Sounds like the worst of both
> worlds. (well, ok, experience shows that it's not worthless in practice...)
>
> > If we were seeing a significant number of "hey, my disk got wrecked"
> > reports which were attributable to this then yes, perhaps we should change
> > the default. But I've never seen _any_, although I've seen claims that
> > others have seen reports.
>
> Hm, how would we know, really? What does it look like? It'd totally
> depend on what got lost... When do you find out? Again depends what
> you're doing, I think. I'll admit that I don't have any good evidence
> of my own. I'll go off and do some plug-pull-testing and a benchmark or
> two.
>
> But, drive caches are only getting bigger, I assume this can't help. I
> have a hard time seeing how speed at the cost of correctness is the
> right call...
Yeah, it's all so handwavy. The only thing which isn't handwavy is
that performance hit.
> > There are no happy solutions here, and I'm inclined to let this dog
> > remain asleep and continue to leave it up to distributors to decide
> > what their default should be.
> >
> > Do we know which distros are enabling barriers by default?
>
> SuSE does (via patch for ext3). Red Hat & Fedora don't, and install by
> default on lvm which won't pass barriers anyway. So maybe it's
> hypocritical to send this patch from redhat.com :)
>
> And as another "who uses barriers" datapoint, reiserfs & xfs both have
> them on by default.
>
> I suppose alternately I could send another patch to remove "remember
> that ext3/4 by default offers higher data integrity guarantees than
> most." from Documentation/filesystems/ext4.txt ;)
We could add a big scary printk at mount time and provide a document?
Andrew Morton wrote:
> > I suppose alternately I could send another patch to remove "remember
> > that ext3/4 by default offers higher data integrity guarantees than
> > most." from Documentation/filesystems/ext4.txt ;)
>
> We could add a big scary printk at mount time and provide a document?
Can I suggest making /proc/mounts say "barrier=0" when barriers are not
enabled, instead of omitting the option.
Boot logs are too large to pay close attention to unless it's really
obvious. (2.4 kernels _do_ have a similar message about "data
integrity not guaranteed" with USB drivers - I never understood what
it was getting at, or why it was removed for 2.6).
However, if I saw barrier=0 in /proc/mounts it would at least prompt
me to look it up and then make an informed decision.
Personally I had assumed barriers were enabled by default with ext3,
as some distros do that, the 2.4 patches did that, and:
I *have* experienced corruption following power loss without
barriers, and none with barriers.
When I mentioned that turning off write cache or using barriers is
a solution to a programmer working on the same project, she said
"oh, yes, we've had reports of disk corruption too - thanks for the
advice", and the advice worked, so I am not the only one.
(In the interests of perspective, that's with ext3 on patched 2.4
kernels on a ARM device, but still - the barriers seem to work).
On a related note, there is advice floating about the net to run with
IDE write cache turned off if you're running a database and care about
integrity. That has much worse performance than barriers.
I guess the patch which fixes fsync is particularly useful for those
database users, as it means they can run with write cache enabled and
depend on fsync() to give equivalent integrity now. (Enabling
journalling is not really relevant to this).
-- Jamie
Jamie Lokier wrote:
> Andrew Morton wrote:
>>> I suppose alternately I could send another patch to remove "remember
>>> that ext3/4 by default offers higher data integrity guarantees than
>>> most." from Documentation/filesystems/ext4.txt ;)
>> We could add a big scary printk at mount time and provide a document?
>
> Can I suggest making /proc/mounts say "barrier=0" when barriers are not
> enabled, instead of omitting the option.
>
> Boot logs are too large to pay close attention to unless it's really
> obvious. (2.4 kernels _do_ have a similar message about "data
> integrity not guaranteed" with USB drivers - I never understood what
> it was getting at, or why it was removed for 2.6).
>
> However, if I saw barrier=0 in /proc/mounts it would at least prompt
> me to look it up and then make an informed decision.
Right now, ext3_show_options has the scheme:
/*
* Show an option if
* - it's set to a non-default value OR
* - if the per-sb default is different from the global default
*/
so only non-default options are shown, so today barrier=0 is not shown. I
suppose that could be changed...
FWIW, my patch would show barrier=0 if it's manually mounted that way
(against new proposed defaults), or if we are running w/o barriers due
to a failed barrier IO even though barriers were requested.
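A sketch of what unconditionally showing the state could look like (not
the patch as posted; jbd's JFS_BARRIER flag is the authoritative state,
since jbd may have dropped barriers after a failed barrier write):

	/* inside ext3_show_options(); journal = sbi->s_journal */
	if (test_opt(sb, BARRIER) &&
	    (!journal || (journal->j_flags & JFS_BARRIER)))
		seq_puts(seq, ",barrier=1");
	else
		seq_puts(seq, ",barrier=0");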
> Personally I had assumed barriers were enabled by default with ext3,
> as some distros do that, the 2.4 patches did that, and:
>
> I *have* experienced corruption following power loss without
> barriers, and none with barriers.
>
> When I mentioned that turning off write cache or using barriers is
> a solution to a programmer working on the same project, she said
> "oh, yes, we've had reports of disk corruption too - thanks for the
> advice", and the advice worked, so I am not the only one.
>
> (In the interests of perspective, that's with ext3 on patched 2.4
> kernels on a ARM device, but still - the barriers seem to work).
>
> On a related note, there is advice floating about the net to run with
> IDE write cache turned off if you're running a database and care about
> integrity. That has much worse performance than barriers.
... and I've seen hand-waving about shortened drive life running this
way? but who really knows....
Thanks,
-Eric
> I guess the patch which fixes fsync is particularly useful for those
> database users, as it means they can run with write cache enabled and
> depend on fsync() to give equivalent integrity now. (Enabling
> journalling is not really relevant to this).
>
> -- Jamie
Eric Sandeen wrote:
> > If we were seeing a significant number of "hey, my disk got wrecked"
> > reports which were attributable to this then yes, perhaps we should change
> > the default. But I've never seen _any_, although I've seen claims that
> > others have seen reports.
>
> Hm, how would we know, really? What does it look like? It'd totally
> depend on what got lost... When do you find out? Again depends what
> you're doing, I think. I'll admit that I don't have any good evidence
> of my own. I'll go off and do some plug-pull-testing and a benchmark or
> two.
You have to pull the plug quite a lot, while there is data in write
cache, and when the data is something you will notice later.
Checking the filesystem is hard. Something systematic would be good - for
which you will want an electronically controlled power switch.
I have seen corruption which I believe is from lack of barriers, and
hasn't occurred since I implemented them (or disabled write cache).
But it's hard to be sure that was the real cause.
If you just want to test the block I/O layer and drive itself, don't
use the filesystem, but write a program which just accesses the block
device, continuously writing with/without barriers every so often, and
after power cycle read back to see what was and wasn't written.
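A minimal sketch of such a tester (assumptions of mine, not from the
thread: 512-byte sectors, fsync() on the device node as the flush, and
no per-sector magic or checksum, which a real tool would want). Run the
writer until power is cut; after reboot, run with -c to see the highest
stamp that survived, and compare against what was fsync'ed:

	#define _FILE_OFFSET_BITS 64
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	#define SECTOR 512
	#define NSECT  4096			/* sectors we keep re-stamping */

	int main(int argc, char **argv)
	{
		unsigned char buf[SECTOR];
		unsigned long seq, i, max = 0;
		int fd, check;

		if (argc < 2) {
			fprintf(stderr, "usage: %s <device> [-c]\n", argv[0]);
			return 1;
		}
		check = (argc > 2 && !strcmp(argv[2], "-c"));
		fd = open(argv[1], O_RDWR);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (!check) {
			for (seq = 0; ; seq++) {	/* run until the plug is pulled */
				memset(buf, 0, SECTOR);
				memcpy(buf, &seq, sizeof(seq));
				pwrite(fd, buf, SECTOR, (off_t)(seq % NSECT) * SECTOR);
				if (seq % 64 == 0)
					fsync(fd);	/* the "barrier": flush the cache */
			}
		}
		/* after the power cycle: find the highest stamp that survived */
		for (i = 0; i < NSECT; i++) {
			pread(fd, buf, SECTOR, (off_t)i * SECTOR);
			memcpy(&seq, buf, sizeof(seq));
			if (seq > max)
				max = seq;
		}
		printf("highest surviving stamp: %lu\n", max);
		close(fd);
		return 0;
	}

Stamps fsync'ed but missing after the cycle indicate lost or reordered
cached writes.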
I think there may be drives which won't show any effect - if they have
enough internal power (from platter inertia) to write everything in
the cache before losing it.
If you want to test, the worst case is to queue many small writes at
seek positions across the disk, so that flushing the disk's write
cache takes the longest time. A good pattern might be take numbers
0..2^N-1 (e.g. 0..255), for each number reverse the bit order (0, 128,
64, 192...) and do writes at those block positions, scaling up to the
range of the whole disk. The idea is if the disk just caches the last
few queued, they will always be quite spread out.
However, a particular disk's cache algorithm may bound the write cache
size by predicted seek time needed to flush it, rather than bytes,
defeating that.
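Generating that pattern is only a few lines; a sketch with N=8 (scale
the positions up to the size of the device):

	#include <stdio.h>

	/* reverse the low nbits of x: 1 -> 128, 2 -> 64, 3 -> 192 for nbits=8 */
	static unsigned bitrev(unsigned x, int nbits)
	{
		unsigned r = 0;
		int i;

		for (i = 0; i < nbits; i++)
			r = (r << 1) | ((x >> i) & 1);
		return r;
	}

	int main(void)
	{
		unsigned i;

		for (i = 0; i < 256; i++)	/* 0..2^N-1 with N=8 */
			printf("%u\n", bitrev(i, 8));
		return 0;
	}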
> But, drive caches are only getting bigger, I assume this can't help. I
> have a hard time seeing how speed at the cost of correctness is the
> right call...
I agree.
The MacOS X folks decided that speed is most important for fsync().
fsync() does not guarantee commit to platter. *But* they added an
fcntl() for applications to request a commit to platter, which SQLite
at least uses. I don't know if MacOS X uses barriers for filesystem
operations.
> > Do we know which distros are enabling barriers by default?
>
> SuSE does (via patch for ext3).
SuSE did the same for 2.4: they patched the kernel to add barriers to
2.4. (I'm using an improved version of that patch in my embedded
devices). It's nice they haven't weakened the behavior in 2.6.
> Red Hat & Fedora don't,
Neither did they patch for barriers in 2.4, but we can't hold that
against them :-)
> and install by default on lvm which won't pass barriers anyway.
Considering how many things depend on LVM not passing barriers, that
is scary. People use software RAID assuming integrity. They are
immune to many hardware faults. But it turns out, on Linux, that a
single disk can have higher integrity against power failure than a
RAID.
> So maybe it's hypocritical to send this patch from redhat.com :)
So send the patch to fix LVM too :-)
> And as another "who uses barriers" datapoint, reiserfs & xfs both have
> them on by default.
Are they noticeably slower than ext3? If not, it suggests ext3 can be
fixed to keep its performance with barriers enabled.
Specifically: under some workloads, batching larger changes into the
journal between commit blocks might compensate. Maybe the journal has
been tuned for barriers off because that is the default?
> I suppose alternately I could send another patch to remove "remember
> that ext3/4 by default offers higher data integrity guarantees than
> most." from Documentation/filesystems/ext4.txt ;)
It would be fair. I suspect a fair number of people are under the
impression ext3 uses barriers with no special options, prior to this
thread. It was advertised as a feature in development during the 2.5
series.
Eric Sandeen wrote:
> so only non-default options are shown, so today barrier=0 is not shown. I
> suppose that could be changed...
Yes, I suggest that, even if barrier=0 remains the default. I suggest
showing barrier=1 no matter what the default, too, so that if the
default is ever changed, a missing option won't be ambiguous in bug
reports, cut and pastes, etc.
> FWIW, my patch would show barrier=0 if it's manually mounted that way
> (against new proposed defaults), or if we are running w/o barriers due
> to a failed barrier IO even though barriers were requested.
Speaking of failed barrier I/O, it should be possible to fall back to
"disable cache, write commit block, enable cache" on old drives which
don't have the cache flush command.
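The shape of that fallback, hedged heavily (drive_set_wcache() below is
a made-up helper, not a real kernel API; the real path would go through
the ATA/SCSI layers):

	/* drive_set_wcache() is hypothetical, shown only for the shape */
	static int commit_without_flush(struct block_device *bdev,
					struct buffer_head *bh)
	{
		int ret;

		drive_set_wcache(bdev, 0);	/* disable the write cache */
		ret = sync_dirty_buffer(bh);	/* commit block hits the platter */
		drive_set_wcache(bdev, 1);	/* turn the cache back on */
		return ret;
	}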
> > On a related note, there is advice floating about the net to run with
> > IDE write cache turned off if you're running a database and care about
> > integrity. That has much worse performance than barriers.
>
> ... and I've seen hand-waving about shortened drive life running this
> way? but who really knows....
I've no idea. It makes sense: disabling write cache increases the
number of seeks for many loads.
In any case, with barriers implemented properly you get basically the
same level of integrity as disabling the write cache, without the
substantial performance hit.
-- Jamie
Eric Sandeen wrote:
> To ensure that bits are truly on-disk after an fsync,
> ext3 should call blkdev_issue_flush if barriers are supported.
>
> Inspired by an old thread on barriers, by reiserfs & xfs
> which do the same, and by a patch SuSE ships with their kernel.
blkdev_issue_flush uses BIO_RW_BARRIER, which becomes REQ_HARDBARRIER.
REQ_HARDBARRIER does _not_ mean force the data to disk.
It is a barrier request, which when optimised for the highest
performance can return a long time before the data is physically on
the disk. (It can even do nothing, correctly).
In many cases, block drivers implement a barrier using a hardware
flush command, so it turns out the same. Then your patch is effective.
(It is however sending unnecessary hardware commands sometimes,
because fsync's journal write (when there is one) will also issue a
flush command, and another is not required; it may reduce performance.)
But that is not what REQ_HARDBARRIER means: block drivers are allowed
to implement barriers in other ways which are more efficient.
I'm not sure if this problem is theoretical only, or if any block
drivers actually do take advantage of this.
A good fix would be to add REQ_HARDFLUSH, with a precise meaning:
don't return until the hardware reports the data safely committed. It
would be very little change to most (if not all) current drivers.
-- Jamie
Jamie Lokier wrote:
> Eric Sandeen wrote:
>>> If we were seeing a significant number of "hey, my disk got wrecked"
>>> reports which were attributable to this then yes, perhaps we should change
>>> the default. But I've never seen _any_, although I've seen claims that
>>> others have seen reports.
>> Hm, how would we know, really? What does it look like? It'd totally
>> depend on what got lost... When do you find out? Again depends what
>> you're doing, I think. I'll admit that I don't have any good evidence
>> of my own. I'll go off and do some plug-pull-testing and a benchmark or
>> two.
>
> You have to pull the plug quite a lot, while there is data in write
> cache, and when the data is something you will notice later.
>
> Checking the filesystem is hard. Something systematic would be good - for
> which you will want an electronically controlled power switch.
Right, that was the plan. I wasn't really going to stand there and pull
the plug. :) I'd like to get to "out of $NUMBER power-loss events
under this usage, I saw $THIS corruption $THISMANY times ..."
> I have seen corruption which I believe is from lack of barriers, and
> hasn't occurred since I implemented them (or disabled write cache).
> But it's hard to be sure that was the real cause.
>
> If you just want to test the block I/O layer and drive itself, don't
> use the filesystem, but write a program which just accesses the block
> device, continuously writing with/without barriers every so often, and
> after power cycle read back to see what was and wasn't written.
Well, I think it is worth testing through the filesystem, different
journaling mechanisms will probably react^wcorrupt in different ways.
> I think there may be drives which won't show any effect - if they have
> enough internal power (from platter inertia) to write everything in
> the cache before losing it.
... and those with flux capacitors. ;) I've heard of this mechanism
but I'm not sure I believe it is present in any modern drive. Not sure
the seagates of the world will tell us, either ....
> If you want to test, the worst case is to queue many small writes at
> seek positions acrosss the disk, so that flushing the disk's write
> cache takes the longest time. A good pattern might be take numbers
> 0..2^N-1 (e.g. 0..255), for each number reverse the bit order (0, 128,
> 64, 192...) and do writes at those block positions, scaling up to the
> range of the whole disk. The idea is if the disk just caches the last
> few queued, they will always be quite spread out.
I suppose we could go about it 2 ways; come up with something diabolical
and try very hard to break it (I think we know that we can) or do
something more realistic (like untarring & building a kernel?) and see
what happens in that case...
> The MacOS X folks decided that speed is most important for fsync().
> fsync() does not guarantee commit to platter. *But* they added an
> fcntl() for applications to request a commit to platter, which SQLite
> at least uses. I don't know if MacOS X uses barriers for filesystem
> operations.
heh, reminds me of xfs's "osyncisosync" option ;)
>> and install by default on lvm which won't pass barriers anyway.
>
> Considering how many things depend on LVM not passing barriers, that
> is scary. People use software RAID assuming integrity. They are
> immune to many hardware faults. But it turns out, on Linux, that a
> single disk can have higher integrity against power failure than a
> RAID.
FWIW... md also only does it on raid1... but lvm with a single device
or mirror underneath really *should* IMHO...
>> So maybe it's hypocritical to send this patch from redhat.com :)
>
> So send the patch to fix LVM too :-)
hehe, I'll try ... ;)
>> And as another "who uses barriers" datapoint, reiserfs & xfs both have
>> them on by default.
>
> Are they noticeably slower than ext3? If not, it suggests ext3 can be
> fixed to keep its performance with barriers enabled.
Well it all depends on what you're testing (and the hardware you're
testing it on). Between ext3 & xfs you can find plenty of tests which
will show either one or the other as faster. And most benchmark results
out there probably don't state whether barriers were in force or not.
> Specifically: under some workloads, batching larger changes into the
> journal between commit blocks might compensate. Maybe the journal has
> been tuned for barriers off because that is the default?
>
>> I suppose alternately I could send another patch to remove "remember
>> that ext3/4 by default offers higher data integrity guarantees than
>> most." from Documentation/filesystems/ext4.txt ;)
>
> It would be fair. I suspect a fair number of people are under the
> impression ext3 uses barriers with no special options, prior to this
> thread. It was advertised as a feature in development during the 2.5
> series.
It certainly does come into play in benchmarking scenarios...
-Eric
Andrew Morton wrote:
> > A collection of patches to make ext3 & 4 use barriers by
> > default, and to call blkdev_issue_flush on fsync if they
> > are enabled.
>
> Last time this came up lots of workloads slowed down by 30% so I
> dropped the patches in horror.
Apart from batching larger journal commits, I thought of another way
which might improve write performance with barriers.
Basically, right now a barrier prevents any write I/Os from being
moved before or after it in the elevator, as well as issuing commands
itself.
It may be that relaxing barriers to allow moving writes when that is
safe would let the elevator schedule writes to reduce seeking?
A clear example of when it's fine is data=ordered, overwriting data in
a file. Those data writes don't have to be ordered against other
filesystem activity which is triggering journal commits. So they
could be moved around barrier ops, if the elevator deems that would
reduce seeks or to coalesce ops.
Similar arguments apply to moving writes past fsync() flushes, but the
rules are a bit different.
-- Jamie
Eric Sandeen wrote:
> > Checking the filesystem is hard. Something systematic would be good - for
> > which you will want an electronically controlled power switch.
>
> Right, that was the plan. I wasn't really going to stand there and pull
> the plug. :) I'd like to get to "out of $NUMBER power-loss events
> under this usage, I saw $THIS corruption $THISMANY times ..."
That would be lovely.
> > If you just want to test the block I/O layer and drive itself, don't
> > use the filesystem, but write a program which just accesses the block
> > device, continuously writing with/without barriers every so often, and
> > after power cycle read back to see what was and wasn't written.
>
> Well, I think it is worth testing through the filesystem, different
> journaling mechanisms will probably react^wcorrupt in different ways.
I agree, but intentional tests on the block device will show the
drive's characteristics on power failure much sooner and more
consistently. Then you can concentrate on the worst drivers :-)
> > I think there may be drives which won't show any effect - if they have
> > enough internal power (from platter inertia) to write everything in
> > the cache before losing it.
>
> ... and those with flux capacitors. ;) I've heard of this mechanism
> but I'm not sure I believe it is present in any modern drive. Not sure
> the seagates of the world will tell us, either ....
If you do the large-seek write pattern I suggested, and timing
confirms the drive is queueing many of them in cache, it will be
really, really clear if the drive has a flux capacitor or equivalent :-)
> > If you want to test, the worst case is to queue many small writes at
> > seek positions across the disk, so that flushing the disk's write
> > cache takes the longest time. A good pattern might be take numbers
> > 0..2^N-1 (e.g. 0..255), for each number reverse the bit order (0, 128,
> > 64, 192...) and do writes at those block positions, scaling up to the
> > range of the whole disk. The idea is if the disk just caches the last
> > few queued, they will always be quite spread out.
>
> I suppose we could go about it 2 ways; come up with something diabolical
> and try very hard to break it (I think we know that we can) or do
> something more realistic (like untarring & building a kernel?) and see
> what happens in that case...
I would suggest one of the metadata intensive filesystem tests,
creating lots of files in different directories, that sort of thing.
> > The MacOS X folks decided that speed is most important for fsync().
> > fsync() does not guarantee commit to platter. *But* they added an
> > fcntl() for applications to request a commit to platter, which SQLite
> > at least uses. I don't know if MacOS X uses barriers for filesystem
> > operations.
>
> heh, reminds me of xfs's "osyncisosync" option ;)
VxFS have a few similar options ;-)
> >> and install by default on lvm which won't pass barriers anyway.
> >
> > Considering how many things depend on LVM not passing barriers, that
> > is scary. People use software RAID assuming integrity. They are
> > immune to many hardware faults. But it turns out, on Linux, that a
> > single disk can have higher integrity against power failure than a
> > RAID.
>
> FWIW... md also only does it on raid1... but lvm with a single device
> or mirror underneath really *should* IMHO...
>
> >> So maybe it's hypocritical to send this patch from redhat.com :)
> >
> > So send the patch to fix LVM too :-)
>
> hehe, I'll try ... ;)
Fwiw, if it were implemented in MD, the difference between barriers
and flushes could be worth having for performance, even when
underlying devices implement barriers with flush.
It would be good, especially for MD, to optimise away those
operations in unnecessary cases on underlying device request queues,
as well as the main MD queue. An example is WRITE-BARRIER, usually
implemented as FLUSH, WRITE, FLUSH: it can actually report completion
when the WRITE is finished, and doesn't need to issue the second FLUSH
at all for a long time, until there's a subsequent WRITE on that drive
(and on the same partition, fwiw.)
I'm increasingly thinking that decomposing WRITE-BARRIER to three
requests, FLUSH+WRITE+FLUSH, should be done at the generic I/O request
level, because that is the best place to optimise away, merge, or
delay unnecessary flushes, and to relax request ordering around them
if we ever do that. (BARRIER would remain only as an op for extra
performance on drivers which can do barriers another way; many
drivers would never see or handle it).
Since we need FLUSH for proper fsync() anyway, that would simplify
drivers too.
Then the MD patch could just implement FLUSH, which is probably
simpler than implementing WRITE-BARRIER.
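To illustrate the bookkeeping for that deferred post-flush (pseudo-C;
all names here are hypothetical stand-ins, not real block-layer API):

	extern void issue_flush(void);	/* stand-in: submit a cache flush */
	extern void issue_write(void);	/* stand-in: submit the write itself */

	struct flush_state {
		int postflush_pending;	/* barrier write done, 2nd flush still owed */
	};

	static void queue_write(struct flush_state *q, int is_barrier)
	{
		if (q->postflush_pending) {
			issue_flush();		/* pay the deferred flush now */
			q->postflush_pending = 0;
		}
		if (!is_barrier) {
			issue_write();
			return;
		}
		issue_flush();			/* pre-flush: drain earlier writes */
		issue_write();			/* the barrier write itself */
		q->postflush_pending = 1;	/* defer the post-flush */
	}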
-- Jamie
On Fri, May 16, 2008 at 11:53:04PM +0100, Jamie Lokier wrote:
> > > If you just want to test the block I/O layer and drive itself, don't
> > > use the filesystem, but write a program which just accesses the block
> > > device, continuously writing with/without barriers every so often, and
> > > after power cycle read back to see what was and wasn't written.
> >
> > Well, I think it is worth testing through the filesystem, different
> > journaling mechanisms will probably react^wcorrupt in different ways.
>
> I agree, but intentional tests on the block device will show the
> drive's characteristics on power failure much sooner and more
> consistently. Then you can concentrate on the worst drivers :-)
I suspect the real reason why we get away with it so much with ext3 is
that the journal is usually contiguous on disk, hence, when you write
to the journal, it's highly unlikely that the commit block will be written
and the blocks before the commit block have not. In addition, because
we are doing physical block journalling, it repairs a huge amount of
damage during the journal replay. So as we are writing the journal,
the disk drive sees a large contiguous write stream, followed by
singleton writes where the disk blocks end up on disk. The most
important reason, though, is that the blocks which are dirty don't get
flushed out to disk right away!
They don't have to, since they are in the journal, and the journal
replay will write the correct data to disk. Before the journal
commit, the buffer heads are pinned, so they can't be written out.
After the journal commit, the buffer heads may be written out, but
they don't get written out right away; the kernel will only write them
out when the periodic buffer cache flush takes place, *or* if the
journal needs to wrap, at which point if there are pending writes to an
old commit that haven't been superseded by another journal commit, the
jbd layer has to force them out. But the point is this is done in an
extremely lazy fashion.
As a result, it's very tough to create a situation where a hard drive
will reorder write requests aggressively enough that we would see a
potential problem.
I suspect if we want to demonstrate the problem, we would need to do a
number of things:
* create a highly fragmented journal inode, forcing the jbd
layer to seek all over the disk while writing out the
journal blocks during a commit
* make the journal small, forcing the journal to wrap very often
* run with data=journal mode, to put maximal stress on the journal
* make the workload one which creates and deletes large numbers
of files scattered all over the directory hierarchy, so that
we limit the number of blocks which are rewritten,
* forcibly crash the system while subjecting the ext3
filesystem to this torture test
Given that most ext3 filesystems have their journal created at mkfs
time, so it is contiguous, and the journal is generally nice and
large, in practice I suspect it's relatively difficult (I didn't say
impossible) for us to trigger corruption given how far away we are
from the worst case scenario described above.
- Ted
On Fri, 16 May 2008 20:20:30 -0400 Theodore Tso <[email protected]> wrote:
> On Fri, May 16, 2008 at 11:53:04PM +0100, Jamie Lokier wrote:
> > > > If you just want to test the block I/O layer and drive itself, don't
> > > > use the filesystem, but write a program which just accesses the block
> > > > device, continuously writing with/without barriers every so often, and
> > > > after power cycle read back to see what was and wasn't written.
> > >
> > > Well, I think it is worth testing through the filesystem, different
> > > journaling mechanisms will probably react^wcorrupt in different ways.
> >
> > I agree, but intentional tests on the block device will show the
> > drive's characteristics on power failure much sooner and more
> > consistently. Then you can concentrate on the worst drivers :-)
>
> I suspect the real reason why we get away with it so much with ext3 is
> that the journal is usually contiguous on disk, hence, when you write
> to the journal, it's highly unlikely that the commit block will be written
> and the blocks before the commit block have not.
yup. Plus with a commit only happening once every few seconds, the time
window for a corrupting power outage is really really small, in
relative terms. All these improbabilities multiply.
Journal wrapping could cause the commit block to get written before its
data blocks. We could improve that by changing the journal layout a bit:
wrap the entire commit rather than just its tail. Such a change might
be intrusive though.
The other possible scenario is when the fs thinks the metadata blocks
have been checkpointed, so it reuses the journal space, only they
weren't really checkpointed. But that would require that we'd gone
through so much IO that the metadata probably _has_ been checkpointed.
> In addition, because
> we are doing physical block journalling, it repairs a huge amount of
> damage during the journal replay. So as we are writing the journal,
> the disk drive sees a large contiguous write stream, followed by
> singleton writes where the disk blocks end up on disk. The most
> important reason, though, is that the blocks which are dirty don't get
> flushed out to disk right away!
>
> They don't have to, since they are in the journal, and the journal
> replay will write the correct data to disk. Before the journal
> commit, the buffer heads are pinned, so they can't be written out.
> After the journal commit, the buffer heads may be written out, but
> they don't get written out right away; the kernel will only write them
> out when the periodic buffer cache flush takes place, *or* if the
> journal needs to wrap, at which point if there are pending writes to an
> old commit that haven't been superseded by another journal commit, the
> jbd layer has to force them out. But the point is this is done in an
> extremely lazy fashion.
>
> As a result, it's very tough to create a situation where a hard drive
> will reorder write requests aggressively enough that we would see a
> potential problem.
>
> I suspect if we want to demonstrate the problem, we would need to do a
> number of things:
>
> * create a highly fragmented journal inode, forcing the jbd
> layer to seek all over the disk while writing out the
> journal blocks during a commit
> * make the journal small, forcing the journal to wrap very often
> * run with data=journal mode, to put maximal stress on the journal
> * make the workload one which creates and deletes large numbers
> of files scattered all over the directory hierarchy, so that
> we limit the number of blocks which are rewritten,
> * forcibly crash the system while subjecting the ext3
> filesystem to this torture test
>
> Given that most ext3 filesystems have their journal created at mkfs
> time, so it is contiguous, and the journal is generally nice and
> large, in practice I suspect it's relatively difficult (I didn't say
> impossible) for us to trigger corruption given how far away we are
> from the worst case scenario described above.
>
what he said.
On Fri, May 16, 2008 at 05:35:52PM -0700, Andrew Morton wrote:
> Journal wrapping could cause the commit block to get written before its
> data blocks. We could improve that by changing the journal layout a bit:
> wrap the entire commit rather than just its tail. Such a change might
> be intrusive though.
Or we backport jbd2's checksum support in the commit block to jbd;
with the checksum support, if the commit block is written out-of-order
with the rest of the blocks in the commit, the commit will simply not
be recognized as valid, and we'll use the previous commit block as the
last valid commit.
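The replay-side check is then conceptually simple. A sketch (BLOCK_SIZE
and crc32() here are stand-ins for the journal's real helpers; jbd2's
actual on-disk format checksums the descriptor and data blocks):

	#include <stdint.h>
	#include <stddef.h>

	#define BLOCK_SIZE 4096		/* stand-in for the journal blocksize */
	uint32_t crc32(uint32_t crc, const void *p, size_t len); /* assumed helper */

	/* returns nonzero if this transaction may be replayed */
	static int commit_is_valid(const void **blocks, int nblocks,
				   uint32_t stored_csum)
	{
		uint32_t csum = 0;
		int i;

		for (i = 0; i < nblocks; i++)
			csum = crc32(csum, blocks[i], BLOCK_SIZE);

		/*
		 * If the commit block hit the platter before one of the
		 * data blocks did, the checksum mismatches and replay
		 * stops at the previous commit instead of replaying a
		 * torn transaction.
		 */
		return csum == stored_csum;
	}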
So with ext4, the only problems we should actually have w/o a barrier
are:
* Writes that were supposed to happen only *after* the journal
commit is written get reordered *before* the journal commit.
(But normally these writes, while _allowed_ after a journal
commit, are not forced by the kernel.)
* In data=ordered mode, data blocks that *should* have been
written out before the journal commit, get reordered until
*after* the journal commit.
And in both cases, where the crash has to happen sometime *very*
shortly after commit record has been forced out.
Thinking about this some more, the most likely way I can think of some
problems happening would be an unexpected power failure that happened
exactly right as an unmount (or filesystem snapshot) was taking place.
That's one of the few cases I can think of where, a journal commit
write is followed immediately by the metadata writes. And in
data=ordered mode, that sequence would be data writes, followed
immediately by journal writes, followed immediately by metadata
writes.
So if you want to demonstrate that this really *could* happen in real
life, without an artificially contrived, horribly fragmented journal
inode, here's a worst case scenario I would try to arrange.
(1) Pre-fill the disk 100% with some string, such as "my secret love
letters".
(2) In data=ordered mode, unpack a very large tarball, ideally with
the files ad directory ordered maximally pessimized so that files are
created in directory #1, then directory #6, then directory #4, then
directory #1, then directory #2, etc. (AKA the anti-Reiser4 benchmark
tarball, because Hans would do kernel unpack benchmarks using
specially prepared tarballs that were in an optimal order for a
particular reiser4 filesystem hash; this is the exact opposite. :-)
(3) With lots of dirty files in the page cache (and for extra fun,
try this with ext4's unstable patch queue with delayed allocation
enabled), unmount the filesystem -- and crash the system in the
middle of the unmount.
(4) Check to see if the filesystem metadata checks out cleanly using
e2fsck -f.
(5) Check all of the files on the disk to see if any of them contain
the string, "my secret love letters".
So all of this is not to argue one way or another about whether or not
barriers are a good idea. It's really so we (and system
administrators) can make some informed decisions about choices.
One thing which we *can* definitely do is add a flag in the superblock
to change the default mount option to enable barriers on a
per-filesystem basis, settable by tune2fs/mke2fs.
Another question is whether we can do better in our implementation of
a barrier, and the way the jbd layer uses barriers. The way we do it
in the jbd layer is actually pretty bad:
	if (journal->j_flags & JFS_BARRIER) {
		set_buffer_ordered(bh);
		barrier_done = 1;
	}
	ret = sync_dirty_buffer(bh);
	if (barrier_done)
		clear_buffer_ordered(bh);
This means that while we are waiting for the commit record to be written
out, any other writes that are happening via buffer heads (which
includes directory operations) are getting done with strict ordering.
All set_buffer_ordered() does is make the submit_bh() done in
sync_dirty_buffer() be submitted with WRITE_BARRIER instead
of WRITE. So fix sync_dirty_buffer() so that there is a
_sync_dirty_buffer() which takes two arguments, letting us do
something like this instead:

	ret = _sync_dirty_buffer(bh, WRITE_BARRIER);

That should hopefully reduce the hit on the benchmarks.
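For concreteness, a sketch of that helper, modeled on fs/buffer.c's
sync_dirty_buffer() of this era (error handling trimmed; not a tested
patch):

	int _sync_dirty_buffer(struct buffer_head *bh, int rw)
	{
		int ret = 0;

		lock_buffer(bh);
		if (test_clear_buffer_dirty(bh)) {
			get_bh(bh);
			bh->b_end_io = end_buffer_write_sync;
			ret = submit_bh(rw, bh);	/* WRITE or WRITE_BARRIER */
			wait_on_buffer(bh);
			if (!ret && !buffer_uptodate(bh))
				ret = -EIO;
		} else {
			unlock_buffer(bh);
		}
		return ret;
	}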
On disks with real tagged command queuing it might be possible to do
even better by sending the hard drives the real data dependencies,
since in fact a barrier is a stronger guarantee than what we really
need. Unfortunately, TCQ seems to be getting obsoleted by the dumber
NCQ, where we don't get to make explicit write ordering requests to
the drive (and my drives ignored the ordering requests anyway).
- Ted
On May 17, 2008 09:43 -0400, Theodore Ts'o wrote:
> On Fri, May 16, 2008 at 05:35:52PM -0700, Andrew Morton wrote:
> > Journal wrapping could cause the commit block to get written before its
> > data blocks. We could improve that by changing the journal layout a bit:
> > wrap the entire commit rather than just its tail. Such a change might
> > be intrusive though.
>
> Or we backport jbd2's checksum support in the commit block to jbd;
> with the checksum support, if the commit block is written out-of-order
> with the rest of the blocks in the commit, the commit will simply not
> be recognized as valid, and we'll use the previous commit block as the
> last valid commit.
The jbd2 journal checksum support, along with the 64-bit journal block
layout, is designed to be 100% compatible with jbd, so it would be
trivial to "back port" jbd2. Just copy jbd2 into jbd, and add the
option to set (or set by default) the JBD_FEATURE_COMPAT_CHECKSUM
feature in the journal.
I was going to suggest complete removal of jbd2 at that point, but
the ordered-mode rewrite is still pending and should live in ext4
for a reasonably long while before going back to ext3.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Sat, May 17, 2008 at 09:43:44AM -0400, Theodore Tso wrote:
> Another question is whether we can do better in our implementation of
> a barrier, and the way the jbd layer uses barriers. The way we do it
> in the jbd layer is actually pretty bad:
>
> This means that while we are waiting for the commit record to be written
> out, any other writes that are happening via buffer heads (which
> includes directory operations) are getting done with strict ordering.
> All set_buffer_ordered() does is make the submit_bh() done in
> sync_dirty_buffer() be submitted with WRITE_BARRIER instead
> of WRITE.
Never mind, I was confused when I wrote this; I somehow thought we
were setting ordered mode on a per-queue basis, instead of on a
per-buffer-head basis.
Also, looking more closely at the jbd2 implementation, it looks like
using the async_commit option, which relies on the checksum for a more
efficient commit, completely disables any barrier support. That's
because the only place we go into ordered mode is if we are writing a
synchronous journal commit. If async journal commit is enabled, then
we don't write a barrier at all, which leaves us in potential trouble
if data blocks end up getting reordered with respect to the
journal commit in data=ordered mode.
I *think* what we need to do is to issue an empty barrier request
between the data blocks and the journal writes in data=ordered mode,
and still issue a WRITE_BARRIER request when writing the commit block,
but to not actually wait for the write to complete. I think if we do
that, we should be safe, and hopefully by not waiting for the commit
block to complete, the performance hit shouldn't be as bad as
previously reported.
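Roughly, in the commit path, that would look like this (a sketch only;
jbd2's real commit code is more involved, and whether submit_bh()
honors the ordered flag this way needs checking):

	/* after the data=ordered data writes have been submitted: */
	blkdev_issue_flush(journal->j_fs_dev, NULL);	/* drain data blocks */

	/* commit record: barrier write, submitted but not waited on */
	set_buffer_ordered(bh);
	bh->b_end_io = journal_end_buffer_io_sync;
	submit_bh(WRITE, bh);	/* completion handled asynchronously */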
- Ted
On Friday 16 May 2008, Andrew Morton wrote:
> On Fri, 16 May 2008 20:20:30 -0400 Theodore Tso <[email protected]> wrote:
> > On Fri, May 16, 2008 at 11:53:04PM +0100, Jamie Lokier wrote:
> > > > > If you just want to test the block I/O layer and drive itself,
> > > > > don't use the filesystem, but write a program which just accesses the
> > > > > block device, continuously writing with/without barriers every so
> > > > > often, and after power cycle read back to see what was and wasn't
> > > > > written.
> > > >
> > > > Well, I think it is worth testing through the filesystem, different
> > > > journaling mechanisms will probably react^wcorrupt in different ways.
> > >
> > > I agree, but intentional tests on the block device will show the
> > > drive's characteristics on power failure much sooner and more
> > > consistently. Then you can concentrate on the worst drivers :-)
> >
> > I suspect the real reason why we get away with it so much with ext3 is
> > that the journal is usually contiguous on disk, hence, when you write
> > to the journal, it's highly unlikely that the commit block will be written
> > and the blocks before the commit block have not.
>
> yup. Plus with a commit only happening once every few seconds, the time
> window for a corrupting power outage is really really small, in
> relative terms. All these improbabilities multiply.
Well, the barriers happen like so (even if we actually only do one barrier in
submit_bh, it turns into two)
write log blocks
flush #1
write commit block
flush #2
write metadata blocks
I'd agree with Ted, there's a fairly small chance of things getting reordered
around flush #1. flush #2 is likely to have lots of reordering though. It
should be easy to create situations where the metadata for a transaction is
written before the log blocks ever see the disk.
EMC did a ton of automated testing around this when Jens and I did the initial
barrier implementations, and they were able to trigger corruptions in fsync
heavy workloads with randomized power offs. I'll dig up the workload they
used.
-chris
On Sat, May 17, 2008 at 08:48:33PM -0400, Chris Mason wrote:
>
> Well, the barriers happen like so (even if we actually only do one
> barrier in submit_bh, it turns into two)
>
> write log blocks
> flush #1
> write commit block
> flush #2
> write metadata blocks
>
> I'd agree with Ted, there's a fairly small chance of things getting reordered
> around flush #1. flush #2 is likely to have lots of reordering though. It
> should be easy to create situations where the metadata for a transaction is
> written before the log blocks ever see the disk.
True, but even with a very heavy fsync() workload, a commit doesn't
cause the metadata blocks to be written until we have to do a journal
truncate operation. A heavy fsync() workload would increase how
quickly we would use up the journal and need to do a journal truncate,
though.
> EMC did a ton of automated testing around this when Jens and I did
> the initial barrier implementations, and they were able to trigger
> corruptions in fsync heavy workloads with randomized power offs.
> I'll dig up the workload they used.
I could imagine a mode which forces a barrier operation for commits
triggered by fsync()'s, but not commits that occur due to a natural
closing of transactions. I'm not sure it's worth it, though, since
many of the benchmarks that we care about (like Postmark) do use
fsync() fairly heavily.
The really annoying thing is that what is really needed is a way to
make write barriers cheaper; we don't need to do a synchronous flush,
but unfortunately for most drives there isn't any other way of keeping
disk writes from getting reordered.
- Ted
Theodore Tso wrote:
> On Sat, May 17, 2008 at 08:48:33PM -0400, Chris Mason wrote:
>
>> Well, the barriers happen like so (even if we actually only do one
>> barrier in submit_bh, it turns into two)
>>
>> write log blocks
>> flush #1
>> write commit block
>> flush #2
>> write metadata blocks
>>
>> I'd agree with Ted, there's a fairly small chance of things getting reordered
>> around flush #1. flush #2 is likely to have lots of reordering though. It
>> should be easy to create situations where the metadata for a transaction is
>> written before the log blocks ever see the disk.
>>
>
> True, but even with a very heavy fsync() workload, a commit doesn't
> cause the metadata blocks to be written until we have to do a journal
> truncate operation. A heavy fsync() workload would increase how
> quickly we would use up the journal and need to do a journal truncate,
> though.
>
>
>> EMC did a ton of automated testing around this when Jens and I did
>> the initial barrier implementations, and they were able to trigger
>> corruptions in fsync heavy workloads with randomized power offs.
>> I'll dig up the workload they used.
>>
>
> I could imagine a mode which forces a barrier operation for commits
> triggered by fsync()'s, but not commits that occur due to a natural
> closing of transactions. I'm not sure it's worth it, though, since
> many of the benchmarks that we care about (like Postmark) do use
> fsync() fairly heavily.
>
> The really annoying thing is that what is really needed is a way to
> make write barriers cheaper; we don't need to do a synchronous flush,
> but unfortunately for most drives there isn't any other way of keeping
> disk writes from getting reordered.
>
>
The workload we used was to run our existing Centera application on a
rack of boxes. The application is a bit special in that it does a
digital signature on each file and never returns success for the client
until an fsync is done on the server (kind of like synchronous NFS).
What we did for our test was to pound away on a rack of these boxes (say
32 boxes, each with 4 large ATA or S-ATA drives) and then drop power to
the whole rack.
All of our data file systems were reiserfs, some of the system
partitions were ext2.
The test would be marked as passed if we could reboot all of the boxes
and have the client validate that the digital signature of all files
written and ack'ed were valid. We also looked for issues seen during the
reboot (fsck grumbles, corrupted directories, etc).
I didn't run the tests personally, but I seem to recall that without
barriers we routinely saw file system corruption on that reboot.
The hard thing is to figure out how to test this kind of scenario
without dropping power. To expose the failure mode, it might be
sufficient to drop power to a drive with smartctl (or, if you have hot
swap bays, just pull them).
Just a personal note, my last day at EMC was this past Friday. Monday, I
start working for Red Hat (focused on file systems) so I will have to
figure out how to get this kind of test going without all of my big EMC toys ;-)
ric
Andrew Morton <[email protected]> writes:
> On Fri, 16 May 2008 14:02:46 -0500
> Eric Sandeen <[email protected]> wrote:
>
>> A collection of patches to make ext3 & 4 use barriers by
>> default, and to call blkdev_issue_flush on fsync if they
>> are enabled.
>
> Last time this came up lots of workloads slowed down by 30% so I
> dropped the patches in horror.
Didn't ext4 have some new checksum trick to avoid them?
> I just don't think we can quietly go and slow everyone's machines down
> by this much. The overhead of journalling is already pretty horrid.
>
> If we were seeing a significant number of "hey, my disk got wrecked"
There used to be a couple of reports of people doing:
  some write workload
  use an automatic power switch to reset
  turn on again
  repeat for 24 hours
getting consistent corruption after some time (<24 hours), and those
went away with barriers.
> Do we know which distros are enabling barriers by default?
SUSE did for some releases, although reiserfs used to be the default there
(but that one uses barriers too if the disk supports them)
BTW, for a lot of distros which default to an LVM install it won't make
any difference, because DM right now doesn't pass barriers through.
I have a patch queued for that for the one-disk case at least, but it
didn't make .26-rc.
-Andi
Ric Wheeler <[email protected]> writes:
>
> The hard thing is to figure out how to test this kind of scenario
> without dropping power. To expose the failure mode, it might be
Why not drop power? (see test scenario in previous mail)
There are lots of cheap ways available to do that.
I personally keep my test machines on cheap USB power switches and
can switch them any time.
It will probably lower the MTBF of your test boxes somewhat
though, but that's the price for good testing.
> Just a personal note, my last day at EMC was this past Friday. Monday,
> I start working for Red Hat (focused on file systems) so I will have
> to figure out how to get this kind of test going without all of my big EMC
> toys ;-)
Congratulations on the new job. I don't think you need any big toys though.
-Andi
Eric Sandeen <[email protected]> writes:
>
> Right, that was the plan. I wasn't really going to stand there and pull
> the plug. :) I'd like to get to "out of $NUMBER power-loss events
> under this usage, I saw $THIS corruption $THISMANY times ..."
I'm not sure how much good such exact numbers would do. Surely power-down
behaviour would depend on the exact disk/controller/system
combination? Some might be better at getting data out at
power loss, some might be worse.
To get a good picture, you would probably need to do such tests
over a wide range of systems and put all the data together, but
even then you couldn't be sure you had covered everything well.
For a distributor it would probably make more sense to do such
tests as part of system certification and then modify the defaults
to match all tested systems. But that doesn't really work
for good mainline defaults.
I suspect for mainline the only good default is to be very conservative.
-Andi
Andi Kleen wrote:
> Ric Wheeler <[email protected]> writes:
>
>> The hard thing is to figure out how to test this kind of scenario
>> without dropping power. To expose the failure mode, it might be
>>
>
> Why not drop power? (see test scenario in previous mail)
>
> There are lots of cheap ways available to do that.
> I personally keep my test machines on cheap USB power switches and
> can switch them any time.
>
> It will probably lower the MTBF of your test boxes somewhat
> though, but that's the price for good testing.
>
The concern, as you mentioned above, is that you wear the equipment out
with this kind of thing. Power supplies are probably the most impacted ;-)
>
>> Just a personal note, my last day at EMC was this past Friday. Monday,
>> I start working for Red Hat (focused on file systems) so I will have
>> to figure out how to get this kind of test going without all of my big EMC
>> toys ;-)
>>
>
> Congratulations on the new job. I don't think you need any big toys though.
>
> -Andi
>
Thanks!
ric
On Fri, May 16, 2008 at 11:03:15PM +0100, Jamie Lokier wrote:
> The MacOS X folks decided that speed is most important for fsync().
> fsync() does not guarantee commit to platter. *But* they added an
> fcntl() for applications to request a commit to platter, which SQLite
> at least uses. I don't know if MacOS X uses barriers for filesystem
> operations.
Out of curiosity, exactly *what* semantics did MacOS X give fsync(),
then? Did it simply start the process of staging writes to disk, but
not wait for the writes to hit the platter before returning? That's
basically the equivalent of ext3's barrier=0.
- Ted
On Sun, May 18, 2008 at 10:03:55PM +0200, Andi Kleen wrote:
> Eric Sandeen <[email protected]> writes:
> >
> > Right, that was the plan. I wasn't really going to stand there and pull
> > the plug. :) I'd like to get to "out of $NUMBER power-loss events
> > under this usage, I saw $THIS corruption $THISMANY times ..."
>
> I'm not sure how much good such exact numbers would do. Surely power-down
> behaviour would depend on the exact disk/controller/system
> combination? Some might be better at getting data out at
> power loss, some might be worse.
Given how rarely people have reported problems, I think it's a really
good idea to understand what exactly our exposure is for
$COMMON_HARDWARE. And I suspect the biggest question isn't the
hardware, but the workload. Here are the questions that I think are
worth asking:
* How often can we get corruption on a common desktop workload? Given
that we're mostly kernel developers, and kernbench is probably worst
case for desktops, that's useful.
* What is the performance hit on a common desktop workload (let's use
kernbench for consistency).
* How often can we get corruption on a hard-core enterprise
application with lots of fsync()'s? (i.e., postmark, et al.)
* What is the performance hit on an fsync()-heavy workload?
I have a feeling that the likelihood of corruption when running
kernbench is minimal, but the performance hit is probably minimal as
well. And that the potential for corruption is higher for an
fsync-heavy workload, but that's also where we are seeing the
(reported) 30% hit.
The other thing which we should consider is that I suspect we can do
much better for ext4 given that we have journal checksums. As Chris
pointed out, right now, with barriers turned on, we are doing this:
write log blocks
flush #1
write commit block
flush #2
write metadata blocks
If we don't mind mixing bh and bio functions, we could change it to
this for ext4 (when journal checksumming is enabled):
write log blocks
write commit block
flush (via submitting an empty barrier block I/O request)
write metadata blocks
This should hopefully reduce the performance hit by half, since we're
eliminating one of the flushes. Even more interesting would be delaying
the flush until right before we attempt to write the metadata blocks,
and allowing data writes which don't require metadata updates through.
That should be safe, even in data=ordered mode. The point is we
should think about ways that we can optimize barrier mode for ext4.
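For concreteness, the "empty barrier block I/O request" above is
essentially what blkdev_issue_flush() already builds internally. A
rough sketch of that mechanism against the current bio interface
(illustration only, not a tested patch; queue checks and error
handling are abbreviated):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>

static void empty_barrier_end_io(struct bio *bio, int err)
{
        complete(bio->bi_private);
}

static int submit_empty_barrier(struct block_device *bdev)
{
        DECLARE_COMPLETION_ONSTACK(done);
        struct bio *bio;
        int ret = 0;

        bio = bio_alloc(GFP_KERNEL, 0);  /* no data pages: flush only */
        bio->bi_bdev = bdev;
        bio->bi_end_io = empty_barrier_end_io;
        bio->bi_private = &done;

        submit_bio(1 << BIO_RW_BARRIER, bio);
        wait_for_completion(&done);

        /* If the device can't do barriers the bio comes back failed,
         * and the caller can fall back to running without them, much
         * as jbd reacts to -EOPNOTSUPP from submit_bh(). */
        if (!bio_flagged(bio, BIO_UPTODATE))
                ret = -EOPNOTSUPP;
        bio_put(bio);
        return ret;
}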
If we do this, then it may be that people will find it interesting to
mount ext3 filesystems using ext4, even without making any additional
changes, because of the better choices of speed/safety tradeoffs.
- Ted
Theodore Tso wrote:
...
> Given how rarely people have reported problems, I think it's a really
> good idea to understand what exactly our exposure is for
> $COMMON_HARDWARE.
I'll propose that very close to 0% of users will ever report "having
barriers off seems to have corrupted my disk on power loss!" even if
that's exactly what happened. And it'd be very tricky to identify in a
post-mortem. Instead we'd probably see other weird things caught down
the road during some later fsck or during filesystem use, and then
suggest that they go check their cables, run memtest86 or something...
Perhaps it's not the intent of this reply, Ted, but various other bits
of this thread have struck me as trying to rationalize away the problem.
If the discussion were about proper locking to avoid corruption, would
we really be saying well, gosh, it's a *really* small window, and
*most* people won't hit it very often, and proper locking would slow
things down....
So I think that as you suggest, looking for ways to make barriers less
painful is the far better route, rather than sacrificing correctness for
speed by turning them off by default when we know there is a chance for
problems. People running journaling filesystems most likely expect to
be safe from this sort of thing, not most of the time, but all of the time.
-Eric
On Sun, 18 May 2008 21:29:30 -0500 Eric Sandeen <[email protected]> wrote:
> Theodore Tso wrote:
> ...
>
> > Given how rarely people have reported problems, I think it's a really
> > good idea to understand what exactly our exposure is for
> > $COMMON_HARDWARE.
>
> I'll propose that very close to 0% of users will ever report "having
> barriers off seems to have corrupted my disk on power loss!" even if
> that's exactly what happened. And it'd be very tricky to identify in a
> post-mortem. Instead we'd probably see other weird things caught down
> the road during some later fsck or during filesystem use, and then
> suggest that they go check their cables, run memtest86 or something...
>
> Perhaps it's not the intent of this reply, Ted, but various other bits
> of this thread have struck me as trying to rationalize away the problem.
Not really. It's a matter of understanding how big the problem is. We
know what the cost of the solution is, and it's really large.
It's a tradeoff, and it is unobvious where the ideal answer lies,
especially when not all the information is available.
> If the discussion were about proper locking to avoid corruption, would
> we really be saying well, gosh, it's a *really* small window, and
> *most* people won't hit it very often, and proper locking would slow
> things down....
If it slowed really really important workloads by 30% then we'd be
running around with our hair on fire fixing that up.
But fixing this one is nowhere near as easy as fixing some locking
thing.
> So I think that as you suggest, looking for ways to make barriers less
> painful is the far better route, rather than sacrificing correctness for
> speed by turning them off by default when we know there is a chance for
> problems. People running journaling filesystems most likely expect to
> be safe from this sort of thing, not most of the time, but all of the time.
Well. Reducing the cost would of course make the decision easy.
On Fri 2008-05-16 14:05:55, Eric Sandeen wrote:
> I can't think of any valid reason for ext3 to not use barriers when
> they are available; I believe this is necessary for filesystem
> integrity in the face of a volatile write cache on storage.
>
> An administrator who trusts that the cache is sufficiently battery-
> backed (and power supplies are sufficiently redundant, etc...)
> can always turn it back off again.
>
> SuSE has carried such a patch for quite some time now.
>
> Also document the mount option while we're at it.
>
> Signed-off-by: Eric Sandeen <[email protected]>
> Acked-by: Jan Kara <[email protected]>
Yes please. Safe defaults are the way to go.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Hi!
On Fri 2008-05-16 13:05:45, Andrew Morton wrote:
> On Fri, 16 May 2008 14:02:46 -0500 Eric Sandeen <[email protected]> wrote:
>
> > A collection of patches to make ext3 & 4 use barriers by
> > default, and to call blkdev_issue_flush on fsync if they
> > are enabled.
>
> Last time this came up lots of workloads slowed down by 30% so I
> dropped the patches in horror.
>
> I just don't think we can quietly go and slow everyone's machines down
> by this much. The overhead of journalling is already pretty horrid.
>
> If we were seeing a significant number of "hey, my disk got wrecked"
> reports which attributable to this then yes, perhaps we should change
> the default. But I've never seen _any_, although I've seen claims that
> others have seen reports.
Do you remember seeing "my disk was wrecked" for reiserfs?
I believe I got some corruption, but it was
a) a long time ago
b) took half a year of hard powerdowns to provoke
c) I did not report it anywhere because ... it just was not
reproducible.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sunday 18 May 2008, Andi Kleen wrote:
> Andrew Morton <[email protected]> writes:
> > On Fri, 16 May 2008 14:02:46 -0500
> >
> > Eric Sandeen <[email protected]> wrote:
> >> A collection of patches to make ext3 & 4 use barriers by
> >> default, and to call blkdev_issue_flush on fsync if they
> >> are enabled.
> >
> > Last time this came up lots of workloads slowed down by 30% so I
> > dropped the patches in horror.
>
> Didn't ext4 have some new checksum trick to avoid them?
I didn't think checksumming avoided barriers completely. Just the barrier
before the commit block, not the barrier after.
-chris
On Mon, May 19, 2008 at 09:26:41AM -0400, Chris Mason wrote:
> >
> > Didn't ext4 have some new checksum trick to avoid them?
>
> I didn't think checksumming avoided barriers completely. Just the barrier
> before the commit block, not the barrier after.
Funny thing, I was looking into this over the weekend. It currently
avoids barriers entirely if journal_async_commit is enabled (which is
not the default); if enabled, it effectively forces barrier=0. This
is IMHO a bug.
I'm working on a patch where "barrier=1" will use a barrier before and
after, and "barrier=1,journal_async_commit" will use a barrier after.
- Ted
On Monday 19 May 2008, Andrew Morton wrote:
> On Sun, 18 May 2008 21:29:30 -0500 Eric Sandeen <[email protected]> wrote:
> > Theodore Tso wrote:
> > ...
> >
> > > Given how rarely people have reported problems, I think it's a really
> > > good idea to understand what exactly our exposure is for
> > > $COMMON_HARDWARE.
> >
> > I'll propose that very close to 0% of users will ever report "having
> > barriers off seems to have corrupted my disk on power loss!" even if
> > that's exactly what happened. And it'd be very tricky to identify in a
> > post-mortem. Instead we'd probably see other weird things caught down
> > the road during some later fsck or during filesystem use, and then
> > suggest that they go check their cables, run memtest86 or something...
> >
> > Perhaps it's not the intent of this reply, Ted, but various other bits
> > of this thread have struck me as trying to rationalize away the problem.
>
> Not really. It's a matter of understanding how big the problem is. We
> know what the cost of the solution is, and it's really large.
>
> It's a tradeoff, and it is unobvious where the ideal answer lies,
> especially when not all the information is available.
I think one mistake we (myself included) have made all along with the barrier
code is intermixing discussions about the cost of the solution with
discussions about needing barriers at all. Everyone thinks the barriers are
slow because we also think running without barriers is mostly safe.
Barriers are actually really fast, at least when you compare them to running
with the writecache off. Making them faster in general may be possible, but
they are somewhat pushed off to the side right now because so few people are
running them.
Here's a test workload that corrupts ext3 50% of the time on power fail
testing for me. The machine in this test is my poor dell desktop (3ghz, dual
core, 2GB of ram), and the power controller is me walking over and ripping
the plug out the back.
In other words, this is not a big automated setup doing randomized power fails
on 64 nodes over 16 hours and many TB of data. The data working set for this
script is 32MB, and it takes about 10 minutes per run.
The workload has 4 parts:
1) A directory tree full of empty files with very long names (160 chars)
2) A process hogging a significant percent of system ram. This must be
enough to force constant metadata writeback due to memory pressure, and is
controlled with -p size_in_mb
3) A process constantly writing, fsyncing and truncating to zero a single 64k
file
4) A process constantly renaming the files with very long names from (1)
between long-named-file.0 and long-named-file.1
The idea was to simulate a loaded mailserver, and to find the corruptions by
reading through the directory tree and finding files long-named-file.0 and
long-named-file.1 existing at the same time. In practice, it is faster to
just run fsck -f on the FS after a crash.
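For anyone who wants the flavor of the loops without the full program
(not reproduced here), below is a minimal userspace sketch of parts (3)
and (4) only; the file names, sizes and missing error handling are my
shorthand, not the actual barrier-test code:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define DATA_SIZE (64 * 1024)  /* part (3): the 64k fsync file */
#define NAME_LEN  240          /* part (4): very long file names */

int main(void)
{
        static char buf[DATA_SIZE];
        char name0[NAME_LEN + 3], name1[NAME_LEN + 3];
        unsigned long i;
        int fd, nfd;

        memset(buf, 'x', sizeof(buf));
        memset(name0, 'a', NAME_LEN);
        strcpy(name0 + NAME_LEN, ".0");
        memcpy(name1, name0, sizeof(name0));
        name1[NAME_LEN + 1] = '1';

        fd = open("fsync-file", O_CREAT | O_WRONLY, 0644);
        nfd = open(name0, O_CREAT | O_WRONLY, 0644);
        if (fd < 0 || nfd < 0) {
                perror("open");
                return 1;
        }
        close(nfd);

        for (i = 0; ; i++) {
                /* part (3): write, fsync, truncate back to zero */
                if (pwrite(fd, buf, sizeof(buf), 0) < 0)
                        perror("pwrite");
                fsync(fd);
                ftruncate(fd, 0);

                /* part (4): flip the long name between .0 and .1 */
                rename(i & 1 ? name1 : name0, i & 1 ? name0 : name1);
        }
        return 0;
}

Parts (1) and (2), the big directory and the memory pin, still have to
be provided separately for the corruption window to open up.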
In order to consistently cause corruptions, the size of the directory from
(1) needs to be at least as large as the ext3 log. This is controlled with
the -s command line option. Smaller sizes may work for the impatient, but it
is more likely to corrupt for larger ones.
The program first creates the files in a directory called barrier-test,
then it starts procs to pin ram and run the constant fsyncs. After
each phase has run long enough, they print out a statement about
being ready, along with some other debugging output:
Memory pin ready
fsyncs ready
Renames ready
Example run:
# make 500,000 inodes on a 2GB partition. This results in a 32MB log
mkfs.ext3 -N 500000 /dev/sda2
mount /dev/sda2 /mnt
cd /mnt
# my machine has 2GB of ram, -s 1500 will pin ~1.5GB
barrier-test -s 32 -p 1500
Run init, don't cut the power yet
10000 files 1 MB total
... these lines repeat for a bit
200000 files 30 MB total
Starting metadata operations now
r:1000
Memory pin ready
f:100 r:2000 f:200 r:3000 f:300
fsyncs ready
r:4000 f:400 r:5000 f:500 r:6000 f:600 r:7000 f:700 r:8000 f:800 r:9000 f:900
r:10000
Renames ready
# I pulled the plug here
# After boot:
root@opti:~# fsck -f /dev/sda2
fsck 1.40.8 (13-Mar-2008)
e2fsck 1.40.8 (13-Mar-2008)
/dev/sda2: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Problem in HTREE directory inode 281377 (/barrier-test): bad block number
13543.
Clear HTree index<y>?
< 246 other errors are here >
-chris
On Monday 19 May 2008, Chris Mason wrote:
>
> Here's a test workload that corrupts ext3 50% of the time on power fail
> testing for me. The machine in this test is my poor dell desktop (3ghz,
> dual core, 2GB of ram), and the power controller is me walking over and
> ripping the plug out the back.
Here's a new version that still gets corruptions about 50% of the time, but
does it with fewer files by using longer file names (240 chars instead of 160
chars).
I tested this one with a larger FS (40GB instead of 2GB) and larger log (128MB
instead of 32MB). barrier-test -s 32 -p 1500 was still able to get a 50%
corruption rate on the larger FS.
-chris
> On Monday 19 May 2008, Chris Mason wrote:
> >
> > Here's a test workload that corrupts ext3 50% of the time on power fail
> > testing for me. The machine in this test is my poor dell desktop (3ghz,
> > dual core, 2GB of ram), and the power controller is me walking over and
> > ripping the plug out the back.
>
> Here's a new version that still gets corruptions about 50% of the time, but
> does it with fewer files by using longer file names (240 chars instead of 160
> chars).
>
> I tested this one with a larger FS (40GB instead of 2GB) and larger log (128MB
> instead of 32MB). barrier-test -s 32 -p 1500 was still able to get a 50%
> corruption rate on the larger FS.
Hmm, this is worse than I'd have expected :( If it is that bad, I
think we should really enable them by default... I can give your script
a try on my test machine when I get back (which is next week).
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
On Monday 19 May 2008, Jan Kara wrote:
> > On Monday 19 May 2008, Chris Mason wrote:
> > > Here's a test workload that corrupts ext3 50% of the time on power fail
> > > testing for me. The machine in this test is my poor dell desktop
> > > (3ghz, dual core, 2GB of ram), and the power controller is me walking
> > > over and ripping the plug out the back.
> >
> > Here's a new version that still gets corruptions about 50% of the time,
> > but does it with fewer files by using longer file names (240 chars
> > instead of 160 chars).
> >
> > I tested this one with a larger FS (40GB instead of 2GB) and larger log
> > (128MB instead of 32MB). barrier-test -s 32 -p 1500 was still able to
> > get a 50% corruption rate on the larger FS.
>
> Hmm, this is worse than I'd have expected :( If it is that bad, I
> think we should really enable them by default... I can give your script
> a try on my test machine when I get back (which is next week).
That would be great, I'd like to confirm that my machine isn't the only one on
the planet so easily broken ;)
I was also able to trigger corruptions on XFS (one run out of two), so it is
unlikely I'm seeing bugs in the ext3 replay or logging code.
-chris
On Fri, May 16, 2008 at 02:09:56PM -0500, Eric Sandeen wrote:
> To ensure that bits are truly on-disk after an fsync,
> we should call blkdev_issue_flush if barriers are supported.
This patch isn't necessary, and in fact will cause a double flush.
When you call fsync(), it calls ext4_force_commit(), and we do the
equivalent of a blkdev_issue_flush() today (which is what happens
when you do a submit_bh(WRITE_BARRIER, bh), which is what setting
set_ordered_mode(bh) ends up causing).
- Ted
On Mon, May 19, 2008 at 10:46:54AM -0400, Theodore Tso wrote:
> Funny thing, I was looking into this over the weekend. It currently
> avoids barriers entirely if journal_async_commit is enabled (which is
> not the default); if enabled, it effectively forces barrier=0. This
> is IMHO a bug.
>
> I'm working on a patch where "barrier=1" will use a barrier before and
> after, and "barrier=1,journal_async_commit" will use a barrier after.
And here it is....
Comments, please. And Eric, if you have a chance to benchmark this to
see how this patch works out, comparing "barrier=0", "barrier=1", and
"barrier=1,journal_async_commit", as we had discussed earlier on the
ext4 list, I'd be very much obliged.
As I mentioned earlier, adding blkdev_issue_flush() to ext[34]/fsync.c
is I believe not necessary. We should be doing the right thing in the
commit.c file. In the future, if we want some extra bonus points, in
the case where writes have taken place to the inode, but no metadata
operations have taken place (the common case when a database is
writing to a pre-existing tablespace), it would be nice if fsync()
could notice this case, force out the datablocks the old-fashioned way
without forcing a journal commit, and then calling
blkdev_issue_flush(). That should give us some nice performance wins
for database workloads; unfortunately it probably won't do us any good
on mailserver workloads.
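A hand-waving sketch of what that fast path might look like, purely to
pin down the idea (this is not a patch, and
ext4_inode_in_running_transaction() is a made-up helper standing in
for whatever test jbd2 would actually need):

static int ext4_fsync_sketch(struct file *file, struct dentry *dentry,
                             int datasync)
{
        struct inode *inode = dentry->d_inode;
        int ret, err;

        /* Hypothetical check: none of this inode's metadata is
         * sitting in the running transaction, so a commit buys us
         * nothing over writing the data and flushing the cache. */
        if (!ext4_inode_in_running_transaction(inode)) {
                ret = filemap_fdatawrite(inode->i_mapping);
                err = filemap_fdatawait(inode->i_mapping);
                if (!ret)
                        ret = err;
                if (!ret)
                        ret = blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
                return ret;
        }

        /* Otherwise fall back to the existing commit-based path. */
        return ext4_force_commit(inode->i_sb);
}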
Regards,
- Ted
From ce99d82266d003e9b2284a1f46bd6f0e3a87c066 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <[email protected]>
Date: Mon, 19 May 2008 22:40:52 -0400
Subject: [PATCH] ext4: Fix use of write barrier in commit logic
We were previously using set_buffer_ordered() and then calling
submit_bh() to write the commit block, which resulted in this flush
behavior:
write log blocks
blkdev_issue_flush() // flush #1
write commit block
blkdev_issue_flush() // flush #2
write metadata blocks
And if the journal_async_commit mount option was used, the commit was
written without using set_buffer_ordered(), which resulted in no
flushes happening at all.
Change the commit code to use blkdev_issue_flush() explicitly, and to
omit flush #1 if journal_async_commit is enabled, but to always
perform the second flush if write barriers are enabled.
This change should hopefully reduce the overhead of using write
barriers when an ext4 filesystem is mounted using
"barrier=1,journal_async_commit", while still maintaining filesystem
safety. If this works out, we should probably make this the default
mode for ext4.
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd2/commit.c | 44 +++++++++++++++++---------------------------
1 files changed, 17 insertions(+), 27 deletions(-)
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 3041d75..3fe5be5 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -22,6 +22,7 @@
#include <linux/pagemap.h>
#include <linux/jiffies.h>
#include <linux/crc32.h>
+#include <linux/blkdev.h>
/*
* Default IO end handler for temporary BJ_IO buffer_heads.
@@ -111,7 +112,6 @@ static int journal_submit_commit_record(journal_t *journal,
struct commit_header *tmp;
struct buffer_head *bh;
int ret;
- int barrier_done = 0;
struct timespec now = current_kernel_time();
if (is_journal_aborted(journal))
@@ -147,34 +147,24 @@ static int journal_submit_commit_record(journal_t *journal,
if (journal->j_flags & JBD2_BARRIER &&
!JBD2_HAS_INCOMPAT_FEATURE(journal,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
- set_buffer_ordered(bh);
- barrier_done = 1;
+ ret = blkdev_issue_flush(bh->b_bdev, NULL);
+ if (ret == -EOPNOTSUPP) {
+ char b[BDEVNAME_SIZE];
+
+ printk(KERN_WARNING
+ "JBD: barrier-based sync failed on %s - "
+ "disabling barriers\n",
+ bdevname(journal->j_dev, b));
+ spin_lock(&journal->j_state_lock);
+ journal->j_flags &= ~JBD2_BARRIER;
+ spin_unlock(&journal->j_state_lock);
+ }
}
ret = submit_bh(WRITE, bh);
- if (barrier_done)
- clear_buffer_ordered(bh);
- /* is it possible for another commit to fail at roughly
- * the same time as this one? If so, we don't want to
- * trust the barrier flag in the super, but instead want
- * to remember if we sent a barrier request
- */
- if (ret == -EOPNOTSUPP && barrier_done) {
- char b[BDEVNAME_SIZE];
Hi Chris,
Chris Mason wrote:
> On Monday 19 May 2008, Jan Kara wrote:
>>> On Monday 19 May 2008, Chris Mason wrote:
>>>> Here's a test workload that corrupts ext3 50% of the time on power fail
>>>> testing for me. The machine in this test is my poor dell desktop
>>>> (3ghz, dual core, 2GB of ram), and the power controller is me walking
>>>> over and ripping the plug out the back.
>>> Here's a new version that still gets corruptions about 50% of the time,
>>> but does it with fewer files by using longer file names (240 chars
>>> instead of 160 chars).
>>>
>>> I tested this one with a larger FS (40GB instead of 2GB) and larger log
>>> (128MB instead of 32MB). barrier-test -s 32 -p 1500 was still able to
>>> get a 50% corruption rate on the larger FS.
>> Hmm, this is worse than I'd have expected :( If it is that bad, I
>> think we should really enable them by default... I can give your script
>> a try on my test machine when I get back (which is next week).
>
> That would be great, I'd like to confirm that my machine isn't the only one on
> the planet so easily broken ;)
>
> I was also able to trigger corruptions on XFS (one run out of two), so it is
> unlikely I'm seeing bugs in the ext3 replay or logging code.
>
Just to clarify,
was this test with barriers on or off for XFS?
I'm wondering if it was with barriers on, then we have a reproducible
bug in log replay.
Or whether you were just confirming the problems with barriers off on
another filesystem.
Thanks muchly,
Tim.
On Mon, May 19 2008, Chris Mason wrote:
> On Monday 19 May 2008, Chris Mason wrote:
> >
> > Here's a test workload that corrupts ext3 50% of the time on power fail
> > testing for me. The machine in this test is my poor dell desktop (3ghz,
> > dual core, 2GB of ram), and the power controller is me walking over and
> > ripping the plug out the back.
>
> Here's a new version that still gets corruptions about 50% of the
> time, but does it with fewer files by using longer file names (240
> chars instead of 160 chars).
>
> I tested this one with a larger FS (40GB instead of 2GB) and larger
> log (128MB instead of 32MB). barrier-test -s 32 -p 1500 was still
> able to get a 50% corruption rate on the larger FS.
I ran this twice, killing power after 'renames ready'. The first time it
was fine, the second time I got:
centera:~/e2fsprogs-1.40.9/e2fsck # ./e2fsck -f /dev/sdb1
e2fsck 1.40.9 (27-Apr-2008)
/dev/sdb1: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Problem in HTREE directory inode 2395569: node (281) has bad max hash
Problem in HTREE directory inode 2395569: node (849) has bad max hash
Problem in HTREE directory inode 2395569: node (1077) has bad max hash
Problem in HTREE directory inode 2395569: node (1718) has bad max hash
Problem in HTREE directory inode 2395569: node (4609) has bad max hash
Problem in HTREE directory inode 2395569: node (4864) has bad max hash
Problem in HTREE directory inode 2395569: node (5092) has bad max hash
Problem in HTREE directory inode 2395569: node (5148) has bad max hash
Problem in HTREE directory inode 2395569: node (5853) has bad max hash
Problem in HTREE directory inode 2395569: node (7588) has bad max hash
Problem in HTREE directory inode 2395569: node (8663) has bad max hash
Invalid HTREE directory inode 2395569 (/barrier-test). Clear HTree
index<y>? yes
Pass 3: Checking directory connectivity
Pass 3A: Optimizing directories
Duplicate entry
'0650419c70f3beadaa9a2c2f4999745a9c02066a0650419c70f3beadaa9a2c2f4999745a9c02066a0650419c70f3beadaa9a2c2f4999745a9c02066a0650419c70f3beadaa9a2c2f4999745a9c02066a0650419c70f3beadaa9a2c2f4999745a9c02066a0650419c70f3beadaa9a2c2f4999745a9c02066a.0'
in /barrier-test (2395569) found. Clear? yes
and tons of 'duplicate entry' errors after that. And then
Pass 4: Checking reference counts
Inode 168 ref count is 0, should be 1. Fix? yes
Unattached zero-length inode 255. Clear? yes
Inode 1221 ref count is 0, should be 1. Fix? yes
Inode 1253 ref count is 1, should be 2. Fix? yes
Inode 2692 ref count is 0, should be 1. Fix? yes
Inode 3465 ref count is 0, should be 1. Fix? yes
and lots of those too. So definitely easy to trigger. Test fs was ext3
of 40gb on a 320gb maxtor drive.
--
Jens Axboe
On Monday 19 May 2008, Timothy Shimmin wrote:
[ power fail testing with write back cache on ]
> > I was also able to trigger corruptions on XFS (one run out of two), so it
> > is unlikely I'm seeing bugs in the ext3 replay or logging code.
>
> Just to clarify,
> was this test with barriers on or off for XFS?
> I'm wondering if it was with barriers on, then we have a reproducible
> bug in log replay.
> Or whether you were just confirming the problems with barriers off on
> another filesystem.
Barriers were off on xfs, the corruptions were not xfs' fault.
-chris
On Tuesday 20 May 2008, Jens Axboe wrote:
> On Mon, May 19 2008, Chris Mason wrote:
> > On Monday 19 May 2008, Chris Mason wrote:
> > > Here's a test workload that corrupts ext3 50% of the time on power fail
> > > testing for me. The machine in this test is my poor dell desktop
> > > (3ghz, dual core, 2GB of ram), and the power controller is me walking
> > > over and ripping the plug out the back.
> >
> > Here's a new version that still gets corruptions about 50% of the
> > time, but does it with fewer files by using longer file names (240
> > chars instead of 160 chars).
> >
> > I tested this one with a larger FS (40GB instead of 2GB) and larger
> > log (128MB instead of 32MB). barrier-test -s 32 -p 1500 was still
> > able to get a 50% corruption rate on the larger FS.
>
> I ran this twice, killing power after 'renames ready'. The first time it
> was fine, the second time I got:
Great, thanks Jens.
So, one compromise may be to change the barriers on ext3 to look like the
patch Ted just sent out for ext4. It should be mostly safe to skip the
barrier between the log blocks and the commit block since the drive is likely
to do those sequentially anyway. A little extra logic could be added to
detect log wrapping and force an extra barrier in that case.
Reiserfs saw some significant performance gains when I changed the code from:
write log blocks
barrier
wait on log blocks
write commit
barrier
wait on commit
to
write log blocks
barrier
write commit
barrier
wait on all of them
Both were tested with the great big EMC power failure machine and both passed.
In the event of an IO error on log blocks, we should zero out the commit.
-chris
Ric Wheeler wrote:
> I will have to figure out how to get this kind of test going without all
> of my big EMC toys ;-)
Fwiw, you can test correctness by running a nested VM-in-VM guest
which simulates disk write cache and flush operations (which flush to
the host kernel). I believe recent QEMU/KVM has this as an option.
Data written by the innermost guest will reside in the middle kernel's
write cache.
Disk cache flush commands from the innermost kernel will cause the
middle kernel to write dirty sectors to the outer kernel. Killing the
outermost host process is roughly equivalent to pulling the plug on a
real machine, but faster and without hurting real hardware.
This might not tell you much about performance, but you should be able
to run a lot of repeatable filesystem barrier tests this way.
-- Jamie
Theodore Tso wrote:
> Also, looking more closely on the jbd2 implementation, it looks like
> using the async_commit option, which relies on the checksum for more
> efficient commit, completely disables any barrier support. That's
> because the only place we go into ordered mode is if we are writing a
> synchronous journal commit. If async journal commit is enabled, then
> we don't write a barrier at all, which leaves us in potential trouble
> if data blocks end up getting reordered with respect to the
> journal commit in data=ordered mode.
That is a great optimisation to make the filesystem safe against power
fails without barriers. Upon replay, the filesystem is consistent.
But does it break fsync()? Consistency isn't enough for fsync().
Returning from fsync() means "this data is committed now".
-- Jamie
Chris Mason wrote:
> Everyone thinks the barriers are slow because we also think running
> without barriers is mostly safe.
+1. There really hasn't been much effort at optimising barriers.
> Making them faster in general may be possible, but they are somewhat
> pushed off to the side right now because so few people are running them.
+1 too.
-- Jamie
Theodore Tso wrote:
> On Fri, May 16, 2008 at 11:03:15PM +0100, Jamie Lokier wrote:
> > The MacOS X folks decided that speed is most important for fsync().
> > fsync() does not guarantee commit to platter. *But* they added an
> > fcntl() for applications to request a commit to platter, which SQLite
> > at least uses. I don't know if MacOS X uses barriers for filesystem
> > operations.
>
> Out of curiosity, exactly *what* semantics did MacOS X give fsync(),
> then? Did it simply start the process of staging writes to disk, but
> not wait for the writes to hit the platter before returning? That's
> basically the equivalent of ext3's barrier=0.
I haven't read the code and don't use MacOS myself.
From its fcntl() man page:
Note that while fsync() will flush all data from the host to the
drive (i.e. the "permanent storage device"), the drive itself may
not physically write the data to the platters for quite some time
and it may be written in an out-of-order sequence.
Specifically, if the drive loses power or the OS crashes, the
application may find that only some or none of their data was
written. The disk drive may also re-order the data so that later
writes may be present while earlier writes are not.
This is not a theoretical edge case. This scenario is easily
reproduced with real world workloads and drive power failures.
For applications that require tighter guarantees about the
integrity of their data, MacOS X provides the F_FULLFSYNC
fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered
data to permanent storage. Applications such as databases that
require a strict ordering of writes should use F_FULLFSYNC to
ensure their data is written in the order they expect. Please see
fcntl(2) for more detail.
Some notable things:
1. Para 2 says "if the drive loses power __or the OS crashes__".
Does this mean some drives will abandon cached writes when reset
despite retaining power?
2. Para 3 to be re-read by the skeptical.
3. Para 4 perpetuates the confused idea that write ordering is what
it's all about, for things like databases. In fact, sometimes
ordering barriers are all that's needed and flush is unnecessary
performance baggage. But sometimes an fsync() which only
guarantees ordering is insufficient. An "ideal"
database-friendly block layer would offer both.
I doubt if common unix mail transports use F_FULLFSYNC on Darwin
instead of fsync(), before reporting a mail received safely, but they
probably should. I recall SQLite does use it (unless I'm confusing
it with some other database).
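For reference, the usual application-side idiom looks something like
this (a sketch of the pattern, not any particular program's code):

#include <fcntl.h>
#include <unistd.h>

int full_fsync(int fd)
{
#ifdef F_FULLFSYNC
        /* Darwin: ask the drive itself to flush its cache to the
         * platter, not just the host's buffers. */
        if (fcntl(fd, F_FULLFSYNC) == 0)
                return 0;
        /* F_FULLFSYNC can fail on filesystems that don't support it;
         * fall back to an ordinary fsync() in that case. */
#endif
        return fsync(fd);
}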
-- Jamie
Theodore Tso wrote:
> As I mentioned earlier, adding blkdev_issue_flush() to ext[34]/fsync.c
> is I believe not necessary. We should be doing the right thing in the
> commit.c file. In the future, if we want some extra bonus points, in
> the case where writes have taken place to the inode, but no metadata
> operations have taken place (the common case when a database is
> writing to a pre-existing tablespace), it would be nice if fsync()
> could notice this case, force out the datablocks the old-fashioned way
> without forcing a journal commit, and then calling
> blkdev_issue_flush(). That should give us some nice performance wins
> for database workloads; unfortunately it probably won't do us any good
> on mailserver workloads.
Last time I looked, the database write + fsync case did not result in
a journal commit in some cases, and therefore no blkdev_issue_flush().
The problem was one of correctness. Has this been fixed?
A database had no way to issue a hard disk barrier in these cases
without doing weird stuff like forcing metadata changes prior to fsync
(e.g. toggling permissions bits), which causes an intolerable two disk
seeks per fsync. That is _way_ expensive.
There is also no way in ext3 to _just_ fdatasync() (no metadata even
if it has changed), with disk barrier/flush.
Imho a good place to call blkdev_issue_flush() is in the VFS, after
it's written all the data blocks, unless the filesystem has a better
override. That would work with most filesystems automatically.
Request queue optimisations may trivially remove redundant flushes.
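Something like the following, sketched against the current
->fsync(file, dentry, datasync) convention (illustrative only; no
attempt is made to elide redundant flushes, which is exactly what the
request queue could do for us):

static int vfs_fsync_and_flush(struct file *file, struct dentry *dentry,
                               int datasync)
{
        struct inode *inode = dentry->d_inode;
        int ret;

        /* Let the filesystem do its normal fsync work (or a better
         * override) first... */
        ret = file->f_op->fsync(file, dentry, datasync);
        if (ret)
                return ret;

        /* ...then push the drive's volatile write cache out. */
        return blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
}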
-- Jamie
Chris Mason wrote:
> On Sunday 18 May 2008, Andi Kleen wrote:
> > Andrew Morton <[email protected]> writes:
> > > On Fri, 16 May 2008 14:02:46 -0500
> > >
> > > Eric Sandeen <[email protected]> wrote:
> > >> A collection of patches to make ext3 & 4 use barriers by
> > >> default, and to call blkdev_issue_flush on fsync if they
> > >> are enabled.
> > >
> > > Last time this came up lots of workloads slowed down by 30% so I
> > > dropped the patches in horror.
> >
> > Didn't ext4 have some new checksum trick to avoid them?
>
> I didn't think checksumming avoided barriers completely. Just the barrier
> before the commit block, not the barrier after.
A little optimisation note.
You don't need the barrier after in some cases, or it can be deferred
until a better time. E.g. when the disk write cache is probably empty
(some time after write-idle), barrier flushes may take the same time
as NOPs.
This sequence:
#1 write metadata to journal
#1 write commit block (checksummed)
BARRIER
#1 write metadata in place
... time passes ...
#2 write metadata to journal
#2 write commit block (checksummed)
BARRIER
#2 write metadata in place
... time passes ...
#3 write metadata to journal
#3 write commit block (checksummed)
BARRIER
#3 write metadata in place
Can be rewritten as:
#1 write metadata to journal
#1 write commit block (checksummed)
... time passes ...
#2 write metadata to journal
#2 write commit block (checksummed)
... time passes ...
#3 write metadata to journal
#3 write commit block (checksummed)
... time passes ...
BARRIER (probably instant).
#1 write metadata in place
#2 write metadata in place
#3 write metadata in place
Provided some conditions hold. All the metadata and all the journal
writes being non-overlapping I/O ranges would be sufficient.
What's more, barriers can be deferred past data=ordered in-place data
writes, although that's not always an optimisation.
-- Jamie
Theodore Tso wrote:
> On Fri, May 16, 2008 at 02:09:56PM -0500, Eric Sandeen wrote:
> > To ensure that bits are truly on-disk after an fsync,
> > we should call blkdev_issue_flush if barriers are supported.
>
> This patch isn't necessary, and in fact will cause a double flush.
> When you call fsync(), it calls ext4_force_commit(), and we do the
> equivalent of a blkdev_issue_flush() today (which is what happens
> when you do a submit_bh(WRITE_BARRIER, bh), which is what setting
> set_ordered_mode(bh) ends up causing).
ISTR fsync() on ext3 did not always force a commit, if in-place data
writes did not change any metadata. Has this been fixed in ext4 but
not ext3 then?
Does WRITE_BARRIER always cause a flush? It does not have to
according to Documentation/block/barrier.txt. There are caveats about
tagged queuing "not yet implemented" in the text, but can we rely on
that? The documentation is older than the current implementation;
those caveats might no longer apply.
-- Jamie
Jamie Lokier wrote:
> Theodore Tso wrote:
>> On Fri, May 16, 2008 at 02:09:56PM -0500, Eric Sandeen wrote:
>>> To ensure that bits are truly on-disk after an fsync,
>>> we should call blkdev_issue_flush if barriers are supported.
>> This patch isn't necessary, and in fact will cause a double flush.
>> When you call fsync(), it calls ext4_force_commit(), and we do the
>> equivalent of a blkdev_issue_flush() today (which is what happens
>> when you do a submit_bh(WRITE_BARRIER, bh), which is what setting
>> set_ordered_mode(bh) ends up causing).
>
> ISTR fsync() on ext3 did not always force a commit, if in-place data
> writes did not change any metadata.
I think that might still be true, but I'm still looking through it (in
the background...)
I tried blktrace to see what was going on but I'm not sure what an "NB"
in the RWBS field means, anyone know?
-Eric
On Tuesday 20 May 2008, Jamie Lokier wrote:
> Chris Mason wrote:
> > On Sunday 18 May 2008, Andi Kleen wrote:
> > > Andrew Morton <[email protected]> writes:
> > > > On Fri, 16 May 2008 14:02:46 -0500
> > > >
> > > > Eric Sandeen <[email protected]> wrote:
> > > >> A collection of patches to make ext3 & 4 use barriers by
> > > >> default, and to call blkdev_issue_flush on fsync if they
> > > >> are enabled.
> > > >
> > > > Last time this came up lots of workloads slowed down by 30% so I
> > > > dropped the patches in horror.
> > >
> > > Didn't ext4 have some new checksum trick to avoid them?
> >
> > I didn't think checksumming avoided barriers completely. Just the
> > barrier before the commit block, not the barrier after.
>
> A little optimisation note.
>
> You don't need the barrier after in some cases, or it can be deferred
> until a better time. E.g. when the disk write cache is probably empty
> (some time after write-idle), barrier flushes may take the same time
> as NOPs.
I hesitate to get too fancy here, if the disk is idle we probably won't notice
the performance gain.
>
> This sequence:
>
> #1 write metadata to journal
> #1 write commit block (checksummed)
> BARRIER
> #1 write metadata in place
> ... time passes ...
> #2 write metadata to journal
> #2 write commit block (checksummed)
> BARRIER
> #2 write metadata in place
> ... time passes ...
> #3 write metadata to journal
> #3 write commit block (checksummed)
> BARRIER
> #3 write metadata in place
>
> Can be rewritten as:
>
> #1 write metadata to journal
> #1 write commit block (checksummed)
> ... time passes ...
> #2 write metadata to journal
> #2 write commit block (checksummed)
> ... time passes ...
> #3 write metadata to journal
> #3 write commit block (checksummed)
> ... time passes ...
> BARRIER (probably instant).
> #1 write metadata in place
> #2 write metadata in place
> #3 write metadata in place
>
> Provided some conditions hold. All the metadata and all the journal
> writes being non-overlapping I/O ranges would be sufficient.
This is true, and would be a fairly good performance boost. It fits nicely
with the jbd trick of avoiding writes of a metadata block if a later
transaction has logged it.
But, it complicates the decision about when you're allowed to dirty a metadata
block for writeback. It used to be dirty-after-commit and it would change to
dirty-after-barrier. I suspect that is some significant surgery into jbd.
Also, since a commit isn't really done until the barrier is done, you can't
reuse blocks freed by the committing transaction until after the barrier,
which means changes in the deletion handling code.
Maybe I'm a wimp, but these are the two parts of write ahead logging I always
found the most difficult.
>
> What's more, barriers can be deferred past data=ordered in-place data
> writes, although that's not always an optimisation.
>
It might be really interesting to have an
i'm-about-to-barrier-find-some-io-to-run call. Something along the lines of
draining the dirty pages when the drive is woken up in laptop mode. There's
lots of fun with page lock vs journal lock ordering, but Jan has a handle on
that I think.
-chris
Chris Mason wrote:
> > You don't need the barrier after in some cases, or it can be deferred
> > until a better time. E.g. when the disk write cache is probably empty
> > (some time after write-idle), barrier flushes may take the same time
> > as NOPs.
>
> I hesitate to get too fancy here, if the disk is idle we probably
> won't notice the performance gain.
I think you're right, but it's hard to be sure. One of the problems
with barrier-implemented-as-flush-all is that it flushes data=ordered
data, even when that's not wanted, and there can be a lot of data in
the disk's write cache, spread over many seeks.
Then it's good to delay barrier-flushes to batch metadata commits, but
good to issue the barrier-flushes prior to large batches of
data=ordered data, so the latter can survive in the disk write
cache for seek optimisations with later requests which aren't yet
known.
All this sounds complicated at the JBD layer, and IMHO much simpler at
the request elevator layer.
> But, it complicates the decision about when you're allowed to dirty
> a metadata block for writeback. It used to be dirty-after-commit
> and it would change to dirty-after-barrier. I suspect that is some
> significant surgery into jbd.
Rather than tracking when it's "allowed" to dirty a metadata block, it
will be simpler to keep a flag saying "barrier needed", and just issue
the barrier prior to writing a metadata block, if the flag is set.
So metadata write scheduling doesn't need to be changed at all. That
will be quite simple.
You might still change the scheduling, but only as a performance
heuristic in any way which turns out to be easy.
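To make the shape of that concrete, here is a toy version of the flag
(none of these fields or helpers exist in jbd, and locking is omitted;
it only shows where the deferred flush gets paid for):

struct sketch_journal {
        struct block_device *bdev;
        int barrier_pending;    /* commit happened, flush still owed */
};

/* At commit time: don't flush now, just remember that we owe one. */
static void sketch_commit_finished(struct sketch_journal *j)
{
        j->barrier_pending = 1;
}

/* Before any write that must be ordered after the commit (metadata
 * writeback, reuse of freed blocks): settle the debt first. */
static int sketch_ordered_write(struct sketch_journal *j,
                                struct buffer_head *bh)
{
        if (j->barrier_pending) {
                int ret = blkdev_issue_flush(j->bdev, NULL);
                if (ret)
                        return ret;
                j->barrier_pending = 0;
        }
        return submit_bh(WRITE, bh);
}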
Really, that flag should live in the request elevator instead, where
it could do more good. I.e. WRITE_BARRIER wouldn't actually issue a
barrier op to disk after writing. It would just set a request
elevator flag, so a barrier op is issued prior to the next WRITE.
That road opens some nice optimisations on software RAID, which aren't
possible if it's done at the JBD layer.
> Also, since a commit isn't really done until the barrier is done, you can't
> reuse blocks freed by the committing transaction until after the barrier,
> which means changes in the deletion handling code.
Good point.
In this case, the time of re-allocation isn't the problem: actually
writing to the blocks is. Writes to recycled blocks have to be ordered
after the commits which recycled them.
As above, just issue the barrier prior to the next write which needs
to be ordered - effectively it's glued on the front of the write op.
This comes for free with no change to deletion code (wow :-) if the
only operations are WRITE_BARRIER (= flush before and after or
equivalent) and WRITE (ordered by WRITE_BARRIER).
> > What's more, barriers can be deferred past data=ordered in-place data
> > writes, although that's not always an optimisation.
> >
>
> It might be really interesting to have an
> i'm-about-to-barrier-find-some-io-to-run call. Something along the lines of
> draining the dirty pages when the drive is woken up in laptop mode. There's
> lots of fun with page lock vs journal lock ordering, but Jan has a handle on
> that I think.
I'm suspecting the opposite might be better.
I'm-about-to-barrier-please-move-the-barrier-in-front-of-unordered-writes.
The more writes you _don't_ flush synchronously, the more
opportunities you give the disk's cache to reduce seeking.
It's only a hunch though.
-- Jamie
On Tuesday 20 May 2008, Jamie Lokier wrote:
> Chris Mason wrote:
> > > You don't need the barrier after in some cases, or it can be deferred
> > > until a better time. E.g. when the disk write cache is probably empty
> > > (some time after write-idle), barrier flushes may take the same time
> > > as NOPs.
> >
> > I hesitate to get too fancy here, if the disk is idle we probably
> > won't notice the performance gain.
>
> I think you're right, but it's hard to be sure. One of the problems
> with barrier-implemented-as-flush-all is that it flushes data=ordered
> data, even when that's not wanted, and there can be a lot of data in
> the disk's write cache, spread over many seeks.
Jens and I talked about tossing the barriers completely and just doing FUA for
all metadata writes. For drives with NCQ, we'll get something close to
optimal because the higher layer elevators are already doing most of the hard
work.
Either way, you do want the flush to cover all the data=ordered writes, at
least all the ordered writes from the transaction you're about to commit.
Telling the difference between data=ordered from an old transaction or from
the running transaction gets into pushing ordering down to the lower levels
(see below).
>
> Then it's good to delay barrier-flushes to batch metadata commits, but
> good to issue the barrier-flushes prior to large batches of
> data=ordered data, so the latter can survive in the disk write
> cache for seek optimisations with later requests which aren't yet
> known.
>
> All this sounds complicated at the JBD layer, and IMHO much simpler at
> the request elevator layer.
>
> > But, it complicates the decision about when you're allowed to dirty
> > a metadata block for writeback. It used to be dirty-after-commit
> > and it would change to dirty-after-barrier. I suspect that is some
> > significant surgery into jbd.
>
> Rather than tracking when it's "allowed" to dirty a metadata block, it
> will be simpler to keep a flag saying "barrier needed", and just issue
> the barrier prior to writing a metadata block, if the flag is set.
>
Adding explicit ordering into the IO path is really interesting. We toss a
bunch of IO down to the lower layers with information about dependencies and
let the lower layers figure it out. James had a bunch of ideas here, but I'm
afraid the only people that understood it were James and the whiteboard he
was scribbling on.
The trick is to code the ordering in such a way that an IO failure breaks the
chain, and that the filesystem has some sensible chance to deal with all
these requests that have failed because an earlier write failed.
Also, once we go down the ordering road, it is really tempting to forget that
ordering does ensure consistency but doesn't tell us the write actually
happened. fsync and friends need to hook into the dependency chain to wait
for the barrier instead of waiting for the commit.
But, back to the short term for a second, what we need are some benchmarks for
barriers on and off and some guidance from the ext34 maintainers about
turning them on by default. We shouldn't be pushing this FS integrity
decision off on the distros.
My test prog is definitely a worst case, but I'm pretty confident that most
mail server workloads end up doing similar IO.
A 16MB or 32MB disk cache is common these days, and that is a very sizable
percentage of the jbd log size. I think the potential for corruptions on
power failure is only growing over time.
-chris
On Tue, May 20 2008, Jamie Lokier wrote:
> Does WRITE_BARRIER always cause a flush? It does not have to
> according to Documentation/block/barrier.txt. There are caveats about
> tagged queuing "not yet implemented" in the text, but can we rely on
> that? The documentation is older than the current implementation;
> those caveats might no longer apply.
It does; if you use ordered tags, then that assumes write-through
caching (or ordered tag + drain + flush after completion).
--
Jens Axboe
On Tue, May 20 2008, Eric Sandeen wrote:
> Jamie Lokier wrote:
> > Theodore Tso wrote:
> >> On Fri, May 16, 2008 at 02:09:56PM -0500, Eric Sandeen wrote:
> >>> To ensure that bits are truly on-disk after an fsync,
> >>> we should call blkdev_issue_flush if barriers are supported.
> >> This patch isn't necessary, and in fact will cause a double flush.
> >> When you call fsync(), it calls ext4_force_commit(), and we do the
> >> equivalent of a blkdev_issue_flush() today (which is what happens
> >> when you do a submit_bh(WRITE_BARRIER, bh), which is what setting
> >> set_ordered_mode(bh) ends up causing).
> >
> > ISTR fsync() on ext3 did not always force a commit, if in-place data
> > writes did not change any metadata.
>
> I think that might still be true, but I'm still looking through it (in
> the background...)
>
> I tried blktrace to see what was going on but I'm not sure what an "NB"
> in the RWBS field means, anyone know?
Eric already knows this now, but for the benefit of anyone else that may
be curious - it's an empty (data-less) barrier. 'N' is basically 'no
data' (eg not a read nor a write) and 'B' is barrier.
--
Jens Axboe
Jens Axboe wrote:
> On Tue, May 20 2008, Jamie Lokier wrote:
> > Does WRITE_BARRIER always cause a flush? It does not have to
> > according to Documentation/block/barrier.txt. There are caveats about
> > tagged queuing "not yet implemented" in the text, but can we rely on
> > that? The documentation is older than the current implementation;
> > those caveats might no longer apply.
>
> It does; if you use ordered tags, then that assumes write-through
> caching (or ordered tag + drain + flush after completion).
Oh. That's really unclear from the opening paragraph of barrier.txt,
which _defines_ what I/O barriers are for, and does not mention flushing:
I/O barrier requests are used to guarantee ordering around the barrier
requests. Unless you're crazy enough to use disk drives for
implementing synchronization constructs (wow, sounds interesting...),
the ordering is meaningful only for write requests for things like
journal checkpoints. All requests queued before a barrier request
must be finished (made it to the physical medium) before the barrier
request is started, and all requests queued after the barrier request
must be started only after the barrier request is finished (again,
made it to the physical medium).
So I assumed the reason flush is talked about later was only because
most devices don't offer an alternative.
Later in barrier.txt, in the section about flushing, it says:
the reason you use I/O barriers is mainly to protect filesystem
integrity when power failure or some other events abruptly stop the
drive from operating and possibly make the drive lose data in its
cache. So, I/O barriers need to guarantee that requests actually
get written to non-volatile medium in order
Woa! Nothing about flushing being important, just "to guarantee
... in order".
Thus flushing looks like an implementation detail - all we could do at
the time. It does not seem to be the _point_ of WRITE_BARRIER
(according to the text), which is to ensure journalling integrity by
ordering writes.
Really, the main reason I was confused was that I imagined some
SCSI-like devices letting you do partially ordered writes to
write-back cache - with their cache preserving ordering constraints
the same way as some CPU or database caches. (Perhaps I've been
thinking about CPUs too long!)
Anyway, moving on.... Let's admit I am wrong about that :-)
And get back to my idea. Ignoring actual disks for a moment ( ;-) ),
there are some I/O scheduling optimisations possible in the kernel
itself by distinguishing between barriers (for journalling) and
flushes (for fsync).
Basically, barriers can be moved around between ordered writes,
including postponing indefinitely (i.e. a "needs barrier" flag).
Unordered writes (in-place data?) can be reordered somewhat around
barriers and other writes. Nobody should wait for a barrier to complete.
On the other hand, flushes must be issued soon, and fdatasync/fsync
wait for the result. Reordering possibilities are different: all
writes can be moved before a flush (unless it's a barrier too), and
in-place data writes cannot be moved after one.
Both barriers and flushes can be merged if there are no intervening
writes except unordered writes. Double flushing, e.g. calling fsync
twice, or calling blkdev_issue_flush just to be sure somewhere,
shouldn't have any overhead.
The request elevator seems a good place to apply those heuristics.
I've written earlier about how to remove some barriers from ext3/4
journalling. This stuff seems to suggest even more I/O scheduling
optimisations with tree-like journalling (as in BTRFS?).
-- Jamie
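Purely as an illustration of the merging rules above - these types and
helpers are hypothetical, nothing like this exists in the elevator:

  /* Hypothetical pending-sync state for a queue. */
  struct pending_sync {
          int barrier;    /* ordering only; may be postponed, unwaited */
          int flush;      /* durability; fsync waits for completion */
  };

  /* Two flushes with only unordered writes between them merge,
   * so "fsync; fsync" or a belt-and-braces blkdev_issue_flush
   * costs one cache flush, not two. */
  static void request_flush(struct pending_sync *s)
  {
          s->flush = 1;   /* idempotent: double flushing is free */
  }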
Chris Mason wrote:
> Jens and I talked about tossing the barriers completely and just
> doing FUA for all metadata writes. For drives with NCQ, we'll get
> something close to optimal because the higher layer elevators are
> already doing most of the hard work.
Will need benchmarking, but might just work.
Still need barriers for reasonable performance + integrity on systems
without NCQ, like the kit on my desk. Without barriers I have to turn
write-cache off (I do see those power-off errors). But that's too slow.
> Either way, you do want the flush to cover all the data=ordered writes, at
> least all the ordered writes from the transaction you're about to commit.
> Telling the difference between data=ordered from an old transaction or from
> the running transaction gets into pushing ordering down to the lower levels
> (see below)
I grant you, it is all about pushing ordering down as far as reasonable.
> > > But, it complicates the decision about when you're allowed to dirty
> > > a metadata block for writeback. It used to be dirty-after-commit
> > > and it would change to dirty-after-barrier. I suspect that is some
> > > significant surgery into jbd.
> >
> > Rather than tracking when it's "allowed" to dirty a metadata block, it
> > will be simpler to keep a flag saying "barrier needed", and just issue
> > the barrier prior to writing a metadata block, if the flag is set.
>
> Adding explicit ordering into the IO path is really interesting. We
> toss a bunch of IO down to the lower layers with information about
> dependencies and let the lower layers figure it out.
That's right. You can keep it quite simple: just distinguish barriers
from flushes. Or make it more detailed, distinguishing different
block sets and partial barriers between them.
> James had a bunch of ideas here, but I'm afraid the only people that
> understood it were James and the whiteboard he was scribbling on.
I think simply distinguishing barrier from flush is relatively
understandable, isn't it?
> The trick is to code the ordering in such a way that an IO failure breaks the
> chain, and that the filesystem has some sensible chance to deal with all
> these requests that have failed because an earlier write failed.
Yeah, if you're going to do it _properly_ that's the way :-)
But the I/O failure strategy is really a different problem, and shouldn't
necessarily distract from barrier I/O elevator scheduling.
The I/O failure strategy applies just the same even with no barriers
and no disk cache at all. Even there, you want a failed write to
block subsequent dependent writes.
> Also, once we go down the ordering road, it is really tempting to forget that
> ordering does ensure consistency but doesn't tell us the write actually
> happened. fsync and friends need to hook into the dependency chain to wait
> for the barrier instead of waiting for the commit.
True, but fortunately it's quite a simple thing. fsync et al need to
ask for flush, nothing else does. You simply don't need durability
from journalling writes without fsync. Journal writes without fsync
happen at times determined by arcane kernel daemons. Applications and
even users don't have any say, and therefore no particular
expectation.
> My test prog is definitely a worst case, but I'm pretty confident that most
> mail server workloads end up doing similar IO.
Mail servers seem like the most vulnerable systems to me too.
They are often designed to relay a successful fsync() result across
the network. RAID doesn't protect against power-fails with flushless
fsync, and SMTP does not redundantly hold the same email at different
locations for recovery. (The most solid email services will use a
distributed database which is not vulnerable to these problems, but
the most common mail systems are.)
> A 16MB or 32MB disk cache is common these days, and that is a very sizable
> percentage of the jbd log size. I think the potential for corruptions on
> power failure is only growing over time.
That depends on the disk implementation. Another use of the 32MB is
to hold write-through NCQ commands in flight while they are sorted.
If the trend is towards that, perhaps it's quite resistant to power
failure.
Still, I'm on the side of safety - having witnessed these power-fail
corruptions myself. The type I've seen will get more common, as the
trend in home "appliances" is towards pulling the plug to turn them
off, and to have these larger storage devices in them.
-- Jamie
Eric Sandeen wrote:
> I'll propose that very close to 0% of users will ever report "having
> barriers off seems to have corrupted my disk on power loss!" even if
> that's exactly what happened. And it'd be very tricky to identify in a
> post-mortem.
Ooh.
Might it be possible for fsck to identify certain problems as likely
caused by power loss in misordered writes?
If metadata sectors had an update generation number, to be compared
with journal entries, perhaps that would be more feasible.
-- Jamie
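Purely illustrative - ext3 metadata has no such fields, this is just
the shape of the idea:

  /* Hypothetical tail appended to each metadata block so fsck
   * could compare a block's last-writer generation against the
   * journal and flag "older than the journal claims" blocks as
   * likely write-reordering damage from a power loss. */
  struct meta_block_tail {
          __le32 generation;      /* commit sequence that wrote this */
          __le32 checksum;        /* over the rest of the block */
  };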
Chris Mason wrote:
> write log blocks
> flush #1
> write commit block
> flush #2
> write metadata blocks
>
> I'd agree with Ted, there's a fairly small chance of things getting
> reordered around flush #1.
Except when it crosses an MD disk boundary. Then it's really likely.
We could also ask if there's _any_ possibility, when they are a merged
single I/O, of them not getting written in the expected order?
What about when FUA is set, does that imply any order?
But it's all moot: Checksumming is the way forward here, no doubt.
Checksumming makes the multi-sector write "atomic or corrupt". That's
the same expectation as a commit sector provides by itself, but
generalised to the whole journal entry.
-- Jamie
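A rough sketch of the checksummed-commit idea (generic illustration;
jbd2's checksumming support differs in layout and detail):

  #include <linux/crc32.h>

  /* Fold every block of the transaction into one CRC stored in
   * the commit block. Recovery recomputes it; on mismatch the
   * whole transaction is treated as never having committed, so a
   * torn multi-sector journal write is "atomic or discarded". */
  static __u32 transaction_csum(void **blocks, int nr, int blocksize)
  {
          __u32 crc = ~0;
          int i;

          for (i = 0; i < nr; i++)
                  crc = crc32(crc, blocks[i], blocksize);
          return crc;
  }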
Ric Wheeler wrote:
> The hard thing is to figure out how to test this kind of scenario
> without dropping power.
Apart from using virtual machines, you mean?
> To expose the failure mode, it might be
> sufficient to drop power to a drive with smartctl (or, if you have hot
> swap bays, just pull them).
I wonder, does sending any kind of reset command to a drive cause it
to discard its write cache?
If the hot swap power switch is accessible from software, that would
be a nice trick :-)
-- Jamie
On Tue, May 20 2008, Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, May 20 2008, Jamie Lokier wrote:
> > > Does WRITE_BARRIER always cause a flush? It does not have to
> > > according to Documentation/block/barrier.txt. There are caveats about
> > > tagged queuing "not yet implemented" in the text, but can we rely on
> > > that? The documentation is older than the current implementation;
> > > those caveats might no longer apply.
> >
> > It does; if you use ordered tags, then that assumes write-through
> > caching (or ordered tag + drain + flush after completion).
>
> Oh. That's really unclear from the opening paragraph of barrier.txt,
> which _defines_ what I/O barriers are for, and does not mention flushing:
>
> I/O barrier requests are used to guarantee ordering around the barrier
> requests. Unless you're crazy enough to use disk drives for
> implementing synchronization constructs (wow, sounds interesting...),
> the ordering is meaningful only for write requests for things like
> journal checkpoints. All requests queued before a barrier request
> must be finished (made it to the physical medium) before the barrier
> request is started, and all requests queued after the barrier request
> must be started only after the barrier request is finished (again,
> made it to the physical medium).
>
> So I assumed the reason flush is talked about later was only because
> most devices don't offer an alternative.
It may not mention flushing explicitly, but it does not have to, since
flushing is one way to implement what the above describes. Note how
it says that requests must have made it to the physical medium before
allowing others to continue? That means you have to either write through
or flush caches, otherwise you cannot make that guarantee.
> Later in barrier.txt, in the section about flushing, it says:
>
> the reason you use I/O barriers is mainly to protect filesystem
> integrity when power failure or some other events abruptly stop the
> drive from operating and possibly make the drive lose data in its
> cache. So, I/O barriers need to guarantee that requests actually
> get written to non-volatile medium in order
>
> Whoa! Nothing about flushing being important, just "to guarantee
> ... in order".
>
> Thus flushing looks like an implementation detail - all we could do at
> the time. It does not seem to be the _point_ of WRITE_BARRIER
> (according to the text), which is to ensure journalling integrity by
> ordering writes.
Yeah, that is precisely what it is and why it does not mention flushing
explicitly!
> Really, the main reason I was confused was that I imagined some
> SCSI-like devices letting you do partially ordered writes to
> write-back cache - with their cache preserving ordering constraints
> the same way as some CPU or database caches. (Perhaps I've been
> thinking about CPUs too long!)
Right, that is what ordered tags give you. But if you have write back
caching enabled, then you get a completion event before the data is
actually on disk. Perhaps that is OK for some cases, perhaps not. The
Linux barrier implementation has always guaranteed that the data is
actually on platter before considering a barrier write done, as
described in the text you quote.
> Anyway, moving on.... Let's admit I am wrong about that :-)
>
> And get back to my idea. Ignoring actual disks for a moment ( ;-) ),
> there are some I/O scheduling optimisations possible in the kernel
> itself by distinguishing between barriers (for journalling) and
> flushes (for fsync).
>
> Basically, barriers can be moved around between ordered writes,
> including postponing indefinitely (i.e. a "needs barrier" flag).
> Unordered writes (in-place data?) can be reordered somewhat around
> barriers and other writes. Nobody should wait for a barrier to complete.
>
> On the other hand, flushes must be issued soon, and fdatasync/fsync
> wait for the result. Reordering possibilities are different: all
> writes can be moved before a flush (unless it's a barrier too), and
> in-place data writes cannot be moved after one.
>
> Both barriers and flushes can be merged if there are no intervening
> writes except unordered writes. Double flushing, e.g. calling fsync
> twice, or calling blkdev_issue_flush just to be sure somewhere,
> shouldn't have any overhead.
>
> The request elevator seems a good place to apply those heuristics.
> I've written earlier about how to remove some barriers from ext3/4
> journalling. This stuff seems to suggest even more I/O scheduling
> optimisations with tree-like journalling (as in BTRFS?).
There are some good ideas in there; I'll punt that to the fs people.
Ignoring actual disk drives makes things easier :-). While SCSI has
ordered IO with write back caching, SATA does not. You basically have to
drain and flush there, or use one of the other variants for getting data
onto disk - see blkdev.h:
/*
* Hardbarrier is supported with one of the following methods.
*
* NONE : hardbarrier unsupported
* DRAIN : ordering by draining is enough
* DRAIN_FLUSH : ordering by draining w/ pre and post flushes
* DRAIN_FUA : ordering by draining w/ pre flush and FUA write
* TAG : ordering by tag is enough
* TAG_FLUSH : ordering by tag w/ pre and post flushes
* TAG_FUA : ordering by tag w/ pre flush and FUA write
--
Jens Axboe
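For example, a driver for a cache-equipped SATA-style device would pick
the drain-and-flush variant when it sets up its queue - a sketch, with
the driver name and the prepare_flush callback left as assumptions:

  /* Sketch: advertise how this queue implements barriers. */
  static void mydrv_init_queue(struct request_queue *q)
  {
          /* drain outstanding requests, flush the write cache
           * before and after the barrier write itself */
          blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH,
                            mydrv_prepare_flush);  /* assumed callback */
  }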
Hi!
> >
> > Here's a test workload that corrupts ext3 50% of the time on power fail
> > testing for me. The machine in this test is my poor dell desktop (3ghz,
> > dual core, 2GB of ram), and the power controller is me walking over and
> > ripping the plug out the back.
>
> Here's a new version that still gets corruptions about 50% of the time, but
> does it with fewer files by using longer file names (240 chars instead of 160
> chars).
>
> I tested this one with a larger FS (40GB instead of 2GB) and larger log (128MB
> instead of 32MB). barrier-test -s 32 -p 1500 was still able to get a 50%
> corruption rate on the larger FS.
Ok, Andrew, is this enough to get barrier patch applied and stop
corrupting data in default config, or do you want some more testing?
I guess a 20% benchmark regression is bad, but rare, impossible-to-debug
data corruption is worse...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
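For anyone who wants to reproduce something similar, the shape of the
workload is simple. A minimal sketch - this is not Chris's actual
barrier-test program, just the pattern it exercises:

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Create a stream of fsync'd files with very long names so the
   * directory (and its htree index) churns on every commit; power
   * is cut externally while this runs. */
  int main(void)
  {
          char name[256];
          int i, fd;

          memset(name, 'x', 234);         /* 234 x's + 6 digits = 240 */
          for (i = 0; ; i++) {
                  snprintf(name + 234, sizeof(name) - 234, "%06d", i);
                  fd = open(name, O_CREAT | O_WRONLY, 0644);
                  if (fd < 0)
                          break;
                  fsync(fd);      /* force a journal commit per file */
                  close(fd);
          }
          return 0;
  }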
On Wed, May 21, 2008 at 01:22:25PM +0200, Pavel Machek wrote:
>
> Ok, Andrew, is this enough to get barrier patch applied and stop
> corrupting data in default config, or do you want some more testing?
It is for me; I think we have to enable barriers for ext3/4, and then
work to improve the overhead in ext4/jbd2. It's probably true that
the vast majority of systems don't run under conditions similar to
what Chris used to demonstrate the problem, but the default has to be
filesystem safety. People who are sure they are extremely unlikely to
run into this problem can turn barriers off (and I suspect they won't
see that much difference in the most common workloads anyway).
- Ted
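For completeness, opting out is a one-liner once barriers default to on
(device and mount point here are illustrative):

  mount -o remount,barrier=0 /dev/sdXN /mnt/data

or the equivalent barrier=0 option in fstab.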
On Wed, 21 May 2008 13:22:25 +0200 Pavel Machek <[email protected]> wrote:
> Hi!
> > >
> > > Here's a test workload that corrupts ext3 50% of the time on power fail
> > > testing for me. The machine in this test is my poor dell desktop (3ghz,
> > > dual core, 2GB of ram), and the power controller is me walking over and
> > > ripping the plug out the back.
> >
> > Here's a new version that still gets corruptions about 50% of the time, but
> > does it with fewer files by using longer file names (240 chars instead of 160
> > chars).
> >
> > I tested this one with a larger FS (40GB instead of 2GB) and larger log (128MB
> > instead of 32MB). barrier-test -s 32 -p 1500 was still able to get a 50%
> > corruption rate on the larger FS.
>
> Ok, Andrew, is this enough to get barrier patch applied and stop
> corrupting data in default config, or do you want some more testing?
>
> I guess a 20% benchmark regression is bad, but rare, impossible-to-debug
> data corruption is worse...
Is it 20%? I recall 30% from a few years ago, but that's vague and it
might have changed. Has much quantitative testing been done recently?
I might have missed it.
If we do make this change I think it should be accompanied by noisy
printks so that as many people as possible know about the decision
which we just made for them.
afaik there is no need to enable this feature if the machine (actually
the disks) are on a UPS, yes?
Andrew Morton wrote:
> On Wed, 21 May 2008 13:22:25 +0200 Pavel Machek <[email protected]> wrote:
>>> I tested this one with a larger FS (40GB instead of 2GB) and larger log (128MB
>>> instead of 32MB). barrier-test -s 32 -p 1500 was still able to get a 50%
>>> corruption rate on the larger FS.
>> Ok, Andrew, is this enough to get barrier patch applied and stop
>> corrupting data in default config, or do you want some more testing?
>>
>> I guess a 20% benchmark regression is bad, but rare, impossible-to-debug
>> data corruption is worse...
>
> Is it 20%? I recall 30% from a few years ago, but that's vague and it
> might have changed. Has much quantitative testing been done recently?
> I might have missed it.
>
> If we do make this change I think it should be accompanied by noisy
> printks so that as many people as possible know about the decision
> which we just made for them.
>
> afaik there is no need to enable this feature if the machine (actually
> the disks) are on a UPS, yes?
As long as your power supply (or your UPS) doesn't go boom, I suppose so.
It is too bad that there is no way to determine no-barrier safety from
software. (maybe apcupsd could do something... ;)
I guess it's levels of confidence. I agree that a user education
campaign is probably in order... maybe if this thread is long enough to
make LWN it'll raise some awareness. :)
-Eric
On Wed, May 21, 2008 at 11:03:24AM -0700, Andrew Morton wrote:
>
> afaik there is no need to enable this feature if the machine (actually
> the disks) are on a UPS, yes?
Yes, as long as you're confident that there won't be a kernel
bug/regression causing a lockup while the server is under severe
memory pressure while doing lots of fsync's, file creations, renames,
etc. And as long as you're 100% confident that UPS's will never fail,
janitors will never accidentally hit the Emergency Power Off
switch, and so on.
- Ted
On Wed, 21 May 2008 14:29:49 -0400 Theodore Tso <[email protected]> wrote:
> On Wed, May 21, 2008 at 11:03:24AM -0700, Andrew Morton wrote:
> >
> > afaik there is no need to enable this feature if the machine (actually
> > the disks) are on a UPS, yes?
>
> Yes, as long as you're confident that there won't be a kernel
> bug/regression causing a lockup while the server is under severe
> memory pressure while doing lots of fsync's, file creations, renames,
> etc.
That won't cause corruption, will it? The problem is purely one of
"data which the disk acked didn't make it to platter"?
> And as long as you're 100% confident that UPS's will never fail,
> janitors will never accidentally hit the Emergency Power Off
> switch, and so on.
Well. The product of two very small numbers is...
Theodore Tso wrote:
> On Wed, May 21, 2008 at 11:03:24AM -0700, Andrew Morton wrote:
> >
> > afaik there is no need to enable this feature if the machine (actually
> > the disks) are on a UPS, yes?
>
> Yes, as long as you're confident that there won't be a kernel
> bug/regression causing a lockup while the server is under severe
> memory pressure while doing lots of fsync's, file creations, renames,
> etc. And as long as you're 100% confident that UPS's will never fail,
> janitors will never accidentally hit the Emergency Power Off
Can a kernel lockup cause this kind of corruption?
Will a system reboot wipe the disk's write cache?
I had imagined only power loss would prevent the disk from
writing its cache eventually; is that wrong?
-- Jamie
On Wednesday 21 May 2008, Jamie Lokier wrote:
> Theodore Tso wrote:
> > On Wed, May 21, 2008 at 11:03:24AM -0700, Andrew Morton wrote:
> > > afaik there is no need to enable this feature if the machine (actually
> > > the disks) are on a UPS, yes?
> >
> > Yes, as long as you're confident that there won't be a kernel
> > bug/regression causing a lockup while the server is under severe
> > memory pressure while doing lots of fsync's, file creations, renames,
> > etc. And as long as you're 100% confident that UPS's will never fail,
> > janitors will never accidentally hit the Emergency Power Off
>
> Can a kernel lockup cause this kind of corruption?
No
>
> Will a system reboot wipe the disk's write cache?
>
Only if you reboot with the power switch (some test rigs do)
> I had imagined only power loss would prevent the disk from
> writing its cache eventually; is that wrong?
>
-chris
Andrew Morton wrote:
> > And as long as you're 100% confident that UPS's will never fail,
> > janitors will never accidentally hit the Emergency Power Off
> > switch, and so on.
>
> Well. The product of two very small numbers is...
Another unquantified very small number.
You have a UPS, but the first time you discover it's not wired up
properly is when it fails.
This happened recently in a customer's office... The UPS power wiring
melted, leaving a big stinking hole, and all the servers went down -
quickly.
Another is misconfiguring the UPS daemon so it continues writing to
disk for 30 minutes until the battery runs down...
Like the chance of two disks going wrong in a RAID at the same
time... that would *never* happen, would it?
*tip-toes out of the room*
-- Jamie
Eric Sandeen wrote:
> I guess it's levels of confidence. I agree that a user education
> campaign is probably in order... maybe if this thread is long enough to
> make LWN it'll raise some awareness. :)
My work here is done ;-)
-- Jamie
Andrew Morton wrote:
> If we do make this change I think it should be accompanied by noisy
> printks so that as many people as possible know about the decision
> which we just made for them.
I favour saying barrier=1 in /proc/mounts, even though it will become
the new default. At least for a while.
-- Jamie
On Tue, 20 May 2008, Jamie Lokier wrote:
> I doubt if common unix mail transports use F_FULLFSYNC on Darwin
> instead of fsync(), before reporting a mail received safely, but they
> probably should. I recall SQLite does use it (unless I'm confusing
> it with some other database).
As far as I know that made its way first into MySQL 4.1.9 (2005-01), then
into SQLite (2005-02), then into PostgreSQL (2005-04), so the database
coders all picked that up around the same time and have been using it for
years now.
http://www.mail-archive.com/[email protected]/msg06502.html shows
part of that timeline, and even includes some useful comments on the
underlying Darwin implementation from that mailing list. That suggests
one reason Apple is able to make this work is that their drive selection
process requires that a drive honor the cache-flush requests it is sent.
The implied caveat there is that even their F_FULLFSYNC won't
necessarily work with external Firewire/USB drives, where that's out of
their control.
--
* Greg Smith [email protected] http://www.gregsmith.com Baltimore, MD
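For reference, the Darwin idiom those databases adopted looks like this
- a sketch with a fallback for volumes where the fcntl is unsupported
(e.g. those external enclosures):

  #include <fcntl.h>
  #include <unistd.h>

  /* Ask the drive itself to flush its cache, not just the OS. */
  static int full_sync(int fd)
  {
  #ifdef F_FULLFSYNC
          if (fcntl(fd, F_FULLFSYNC, 0) == 0)
                  return 0;
          /* unsupported here (e.g. some Firewire/USB enclosures):
           * fall back to an ordinary fsync */
  #endif
          return fsync(fd);
  }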
On Monday 19 May 2008 10:16, Chris Mason wrote:
> root@opti:~# fsck -f /dev/sda2
> fsck 1.40.8 (13-Mar-2008)
> e2fsck 1.40.8 (13-Mar-2008)
> /dev/sda2: recovering journal
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Problem in HTREE directory inode 281377 (/barrier-test): bad block number
> 13543.
> Clear HTree index<y>?
Nice, htree as a canary for disk corruption. This makes sense since
directory data is the only verifiable structure at the logical data
level and htree offers the only large-scale, verifiable structure.
Thanks for the lovely test methodology example. Let me additionally
offer this tool:
http://code.google.com/p/zumastor/source/browse/trunk/ddsnap/tests/devspam.c?r=1564
devspam
The idea is to write an efficiently verifiable pattern across a range
of a file, including a mix of position-dependent codes and a
user-supplied code. In read mode, devspam will check that the
position-dependent codes are correct and match the user-supplied code.
This can be easily extended to a "check that all the user-supplied codes
are the same" mode, which would help detect consistency failure in
regular data files much as htree does with directories. Hmm, this
probably wants to incorporate a sequence number as well, to detect
corruption under a random block update load as you have triggered
with htree.
I used this tool to exorcise the majority of bugs in ddsnap. It is
a wonderful canary, not only catching bugs early but showing where
they occurred.
From what I have seen, Sun seems to rely mostly on MD5 checksums for
detecting corruption in ZFS. We should do more of that too.
Regards,
Daniel
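A minimal sketch of that verifiable-pattern idea - the cell layout is
hypothetical, not devspam's actual format:

  #include <stdint.h>

  /* Each cell encodes its own offset plus the user code and a
   * sequence number, so a reader can spot stale, torn, or
   * misplaced writes with no external state. */
  struct cell {
          uint64_t offset;        /* position-dependent code */
          uint32_t user_code;     /* supplied by the tester */
          uint32_t sequence;      /* bumped each rewrite pass */
  };

  static void fill(struct cell *buf, int n, uint64_t base,
                   uint32_t code, uint32_t seq)
  {
          int i;
          for (i = 0; i < n; i++) {
                  buf[i].offset = base + i * sizeof(*buf);
                  buf[i].user_code = code;
                  buf[i].sequence = seq;
          }
  }

  /* Returns the index of the first bad cell, or -1 if all match. */
  static int verify(const struct cell *buf, int n, uint64_t base,
                    uint32_t code)
  {
          int i;
          for (i = 0; i < n; i++)
                  if (buf[i].offset != base + i * sizeof(*buf) ||
                      buf[i].user_code != code)
                          return i;
          return -1;
  }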
Just a couple of comments about barriers that might be worth throwing in
the mix.
From what I have seen, running with barriers is almost always a win
over running with write cache disabled for medium to large files (large
files with an S-ATA drive go about twice as fast in my testing).
For very small files, running with the barrier or disabling the write
cache are a lot closer in performance.
When we are looking for ways to batch, it is also critical to keep in
mind the latency of the storage. The current ext3 transaction batching
code makes multi-threaded workloads on high-speed media (like
non-volatile disk arrays) run much slower. (Josef had some patches
which helped fix this in a similar way to how XFS deals with this.)
One other note is that moving to a FLASH-based device will not make the
issue of barriers go away since (most? all?) FLASH parts you are likely
to see have their own write cache which is used to buffer up writes in
order to make the erase cycle less painful. That means that we have a
fairly large *volatile* write cache which needs to be flushed/controlled
just like we do with S-ATA/SCSI/etc ;-)
ric
Andrew Morton wrote:
> On Fri, 16 May 2008 14:02:46 -0500
> Eric Sandeen <[email protected]> wrote:
>
>> A collection of patches to make ext3 & 4 use barriers by
>> default, and to call blkdev_issue_flush on fsync if they
>> are enabled.
>
> Last time this came up lots of workloads slowed down by 30% so I
> dropped the patches in horror.
>
> I just don't think we can quietly go and slow everyone's machines down
> by this much. The overhead of journalling is already pretty horrid.
>
> If we were seeing a significant number of "hey, my disk got wrecked"
> reports which were attributable to this then yes, perhaps we should change
> the default. But I've never seen _any_, although I've seen claims that
> others have seen reports.
Here's a recent one, just FYI.
http://marc.info/?l=ext3-users&m=121205834608161&w=2
-Eric