Hello, guys.
It seems like the Linux-fountation was not able to find a mentor for
my project. If somebody is willing to mentor this project through the
Google Summer of Code, please contact Rik and me now, as little
time is left.
A link to the application:
http://rom.etherboot.org/share/xl0/gsoc2008/application-linux-foundation.txt
P.S:
I've also applied for two other project for this summer of code, and one of
them I would prefer over this one. But I'm still not sure I'll get accepted
there, through I see my chances as good.
On Fri, Apr 18, 2008 at 06:20:14PM +0400, Alexey Zaytsev wrote:
> Hello, guys.
>
> It seems like the Linux-fountation was not able to find a mentor for
> my project. If somebody is willing to mentor this project through the
> Google Summer of Code, please contact Rik and me now, as little
> time is left.
>
> A link to the application:
> http://rom.etherboot.org/share/xl0/gsoc2008/application-linux-foundation.txt
Hi Alexey,
I really don't think your project is likely to be successful given the
3 month timeframe of a GSoC. At least not without a mentor spending
vast amounts of time educating you about how things works within ext2
and e2fsck. Even given some broad hints about problems that you need
to address, you still have not addressed how you will solve
fundamental race conditions resulting from trying to read the multiple
blocks scattered all over the disk which comprise allocation bitmap
blocks while allocations might be taking place, for example.
Your approach of monitoring writes to the buffer cache for metadata
writes is completely busted; suppose the kernel modifies block #12345
in the filesystem; how do you know what that means? Could that be an
indirect block? If so, to which inode does it belong? If all you are
doing is monitoring metadata blocks, you would have no idea! The fact
that it apparently didn't even occur to you that this might be a
show-stopping problem scares the heck out of me. It leads me to
believe that this project is very likely to fail, and/or will require
vast amounts of time from the mentor. Unfortunately, the former is
something that I just don't have this summer.
Regards,
- Ted
On Sat, Apr 19, 2008 at 5:29 AM, Theodore Tso <[email protected]> wrote:
> On Fri, Apr 18, 2008 at 06:20:14PM +0400, Alexey Zaytsev wrote:
> > Hello, guys.
> >
> > It seems like the Linux-fountation was not able to find a mentor for
> > my project. If somebody is willing to mentor this project through the
> > Google Summer of Code, please contact Rik and me now, as little
> > time is left.
> >
> > A link to the application:
> > http://rom.etherboot.org/share/xl0/gsoc2008/application-linux-foundation.txt
>
> Hi Alexey,
>
> I really don't think your project is likely to be successful given the
> 3 month timeframe of a GSoC. At least not without a mentor spending
> vast amounts of time educating you about how things works within ext2
> and e2fsck. Even given some broad hints about problems that you need
> to address, you still have not addressed how you will solve
> fundamental race conditions resulting from trying to read the multiple
> blocks scattered all over the disk which comprise allocation bitmap
> blocks while allocations might be taking place, for example.
>
> Your approach of monitoring writes to the buffer cache for metadata
> writes is completely busted; suppose the kernel modifies block #12345
> in the filesystem; how do you know what that means? Could that be an
> indirect block? If so, to which inode does it belong?
Sorry, I still don't understand where the problem is.
If it is a block containing a metadata object fsck has already read, than we
already know what kind of object it is (there must be a way to quickly find all
cached objects derived from a given block), and can update the cached
version. And if fsck has not yet read the block, it can just be ignored, no
matter what kind of data it contains. If it contains metadata and fsck is
intrested in it, it will read it sooner or later anyway. If it
contains file data, why
should fsck even care?
And you are wrong if you think this problem never came to me. This is in fact
what motivated the design, and there is no coincidence it is not affected.
(well, at least I think it is not affected).
But you are probably right, this project may be not doable in just three
months. The changes on the kernel side probably are, but there is a
huge e2fsck work.
> If all you are
> doing is monitoring metadata blocks, you would have no idea! The fact
> that it apparently didn't even occur to you that this might be a
> show-stopping problem scares the heck out of me. It leads me to
> believe that this project is very likely to fail, and/or will require
> vast amounts of time from the mentor. Unfortunately, the former is
> something that I just don't have this summer.
>
> Regards,
>
> - Ted
>
On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote:
> If it is a block containing a metadata object fsck has already read,
> than we already know what kind of object it is (there must be a way
> to quickly find all cached objects derived from a given block), and
> can update the cached version. And if fsck has not yet read the
> block, it can just be ignored, no matter what kind of data it
> contains. If it contains metadata and fsck is intrested in it, it
> will read it sooner or later anyway. If it contains file data, why
> should fsck even care?
The problem is that e2fsck makes calculations on the filesystem data
read out from the disk and stores that in a highly compressed format.
So it doesn't remember that block #12345 was an indirect block for
inode #123, and that it contained data block numbers 17, 42, and 45.
Instead it just marks blocks #12345, #17, #42, and #45 as in use, and
then moves on.
If you are going to store all of the cached objects then you will need
to effectively store *all* of the filesystem metatdata in memory at
the same time. For a large filesystem, you won't have enough *room*
in memory store all of the cached objects. That's one of the reasons
why e2fsck has a lot of very clever design so that summary information
can be stored in a very compressed form in memory so that things can
be fast (by avoid re-reading objects from disk) as well as not
requiring vast amounts of memory.
Even if you *do* store all of the cached objects, it still takes time
to examine all of the objects and in the mean time, more changes will
have come rolling in, and you will either need to add a huge amount of
dependency to figure out what internal data structures need to be
updated based on the changes in some of the cached objects --- or you
will end up restarting the e2fsck checking process from scratch.
In either case, there is still the issue of knowing exactly whether a
particular read happened before or after some change in the
filesystem. This race condition is a really hard one to deal with,
especially on a multiple CPU system and the filesystem checker is
running in userspace.
> But you are probably right, this project may be not doable in just three
> months. The changes on the kernel side probably are, but there is a
> huge e2fsck work.
Yes, that is the concern. And without implementing the user-space
side, you'll never besure whether you completely got the kernel side
changes right!
Regards,
- Ted
Theodore Tso wrote:
> On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote:
>> If it is a block containing a metadata object fsck has already read,
>> than we already know what kind of object it is (there must be a way
>> to quickly find all cached objects derived from a given block), and
>> can update the cached version. And if fsck has not yet read the
>> block, it can just be ignored, no matter what kind of data it
>> contains. If it contains metadata and fsck is intrested in it, it
>> will read it sooner or later anyway. If it contains file data, why
>> should fsck even care?
It seems to me that what the proposed project really does, in essence,
is a read-only check of a filesystem snapshot. It's just that the
snapshot is proposed to be constructed in a complex and non-generic (and
maybe impossible) way.
If you really just want to verify a snapshot of the fs at a point in
time, surely there are simpler ways. If the device is on lvm, there's
already a script floating around to do it in automated fasion. (I'd
pondered the idea of introducing META_WRITE (to go with META_READ) and
maybe lvm could do a "metadata-only" snapshot to be lighter weight?)
-Eric
On Sat, Apr 19, 2008 at 02:07:34PM -0500, Eric Sandeen wrote:
>
> It seems to me that what the proposed project really does, in essence,
> is a read-only check of a filesystem snapshot. It's just that the
> snapshot is proposed to be constructed in a complex and non-generic (and
> maybe impossible) way.
That's not a bad way of thinking about it; except that the snapshot is
being maintained in userspace, without any discussion of some kind of
filesystem-level freeze (which would be hard because the freeze, in
the best case, would take as long as e2image -r would take --- which
is roughly time required for e2fsck's pass1, which is in general
approximately 70% of the e2fsck run-time.)
> If you really just want to verify a snapshot of the fs at a point in
> time, surely there are simpler ways. If the device is on lvm, there's
> already a script floating around to do it in automated fasion. (I'd
> pondered the idea of introducing META_WRITE (to go with META_READ) and
> maybe lvm could do a "metadata-only" snapshot to be lighter weight?)
That would be great, although I think the major issue is not
necessarily the performance problems of using an LVM snapshot on a
very busy filesystem (althouh I could imagine for some users this
might be an issue), but rather for filesystem devices that aren't
using LVM at all. (I've heard some complaints that LVM imposes a
performance penalty even if you aren't using a snapshot; has anyone
done any benchmarks of a filesystem with and without LVM to see
whether or not there really is a significant performance penalty;
whether or not there really is one, the perception is definitely out
there that it does.)
If we could do a lightweight snapshot that didn't require an LVM, that
would be really great. But that's probably not an ext4 project, and
I'm not sure the it would be considered politically correct in the
LKML community.
- Ted
Theodore Tso wrote:
> On Sat, Apr 19, 2008 at 02:07:34PM -0500, Eric Sandeen wrote:
>> If you really just want to verify a snapshot of the fs at a point in
>> time, surely there are simpler ways. If the device is on lvm, there's
>> already a script floating around to do it in automated fasion. (I'd
>> pondered the idea of introducing META_WRITE (to go with META_READ) and
>> maybe lvm could do a "metadata-only" snapshot to be lighter weight?)
>
> That would be great, although I think the major issue is not
> necessarily the performance problems of using an LVM snapshot on a
> very busy filesystem
well, backing space for the snapshot could be an issue too. Basically,
if you're only using it for this purpose, why COW all the post-snapshot
data if you just don't care...
> (althouh I could imagine for some users this
> might be an issue), but rather for filesystem devices that aren't
> using LVM at all. (I've heard some complaints that LVM imposes a
> performance penalty even if you aren't using a snapshot; has anyone
> done any benchmarks of a filesystem with and without LVM to see
> whether or not there really is a significant performance penalty;
> whether or not there really is one, the perception is definitely out
> there that it does.)
I've heard from someone who did some testing about a minor penalty, but
I can't point to any published test so I guess that's just more hearsay.
It's intuitive that putting lvm on top of a block device might not be
absolutely, 100% free, though.... Adds to stack, too.
> If we could do a lightweight snapshot that didn't require an LVM, that
> would be really great. But that's probably not an ext4 project, and
> I'm not sure the it would be considered politically correct in the
> LKML community.
Yep; my original reply originally wished something about non-lvm
snapshots but... while yes, it'd be nice for this purpose, ponies for
everyone would be nice too... :) But I didn't mention it because... how
do you do a generic non-lvm snapshot of, say, /dev/sda3 without some
sort of volume manager...?
If there's some clever idea that could be implemented cleanly, I'd be
all ears. :)
-Eric
Theodore Tso <[email protected]> writes:
> That would be great, although I think the major issue is not
> necessarily the performance problems of using an LVM snapshot on a
> very busy filesystem (althouh I could imagine for some users this
> might be an issue), but rather for filesystem devices that aren't
> using LVM at all. (I've heard some complaints that LVM imposes a
> performance penalty even if you aren't using a snapshot; has anyone
It always disables barriers if you don't apply a so far unmerged patch
that enables them in some special circumstances (only single
backing device)
Not having barriers sometimes makes your workloads faster (and less
safe) and in other cases slower.
-Andi
Theodore Tso <[email protected]> writes:
>
> If you are going to store all of the cached objects then you will need
> to effectively store *all* of the filesystem metatdata in memory at
> the same time.
Are you sure about all data? I think he would just need some lookup table from
metadata block numbers to inode numbers and then when a hit occurs on a block
in the table somehow invalidate all data related to that inode
and restart that part. And the same thing for bitmap blocks. That lookup
table should be much smaller than the full metadata.
Anyways my favourite fsck wish list feature would be a way to record the
changes a read-only fsck would want to do and then some quick way
to apply them to a writable version of the file system without
doing a full rescan. Then you could regularly do a background check
and if it finds something wrong just remount and apply the changes
quickly.
Or perhaps just tell the kernel which objects is suspicious and
should be EIOed.
-Andi
Andi Kleen wrote:
> [LVM] always disables barriers if you don't apply a so far unmerged
> patch that enables them in some special circumstances (only single
> backing device)
(I continue to be surprised at the un-safety of Linux fsync)
> Not having barriers sometimes makes your workloads faster (and less
> safe) and in other cases slower.
I'm curious, how does it make them slower? Merely not issuing barrier
calls seems like it will always be the same speed or faster.
Thanks,
-- Jamie
On Sat, Apr 19, 2008 at 10:56 PM, Theodore Tso <[email protected]> wrote:
> On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote:
> > If it is a block containing a metadata object fsck has already read,
> > than we already know what kind of object it is (there must be a way
> > to quickly find all cached objects derived from a given block), and
> > can update the cached version. And if fsck has not yet read the
> > block, it can just be ignored, no matter what kind of data it
> > contains. If it contains metadata and fsck is intrested in it, it
> > will read it sooner or later anyway. If it contains file data, why
> > should fsck even care?
>
> The problem is that e2fsck makes calculations on the filesystem data
> read out from the disk and stores that in a highly compressed format.
> So it doesn't remember that block #12345 was an indirect block for
> inode #123, and that it contained data block numbers 17, 42, and 45.
> Instead it just marks blocks #12345, #17, #42, and #45 as in use, and
> then moves on.
>
> If you are going to store all of the cached objects then you will need
> to effectively store *all* of the filesystem metatdata in memory at
> the same time. For a large filesystem, you won't have enough *room*
> in memory store all of the cached objects. That's one of the reasons
> why e2fsck has a lot of very clever design so that summary information
> can be stored in a very compressed form in memory so that things can
> be fast (by avoid re-reading objects from disk) as well as not
> requiring vast amounts of memory.
>
Yes, I agree on this problem. Do you have any estimates on how
much RAM the current e2fsck uses in some test cases? I hope
my approach will not add much to this. The only big thing I see
is the data needed to associate each inode/dir entry with the parent
block. Probably one radix tree to enumerate the blocks and a
pointer added to the ext2_inode and ext2_dir_entry structures
to form a linked list of objects belonging to the same block.
Still no idea how much RAM the whole thing would consume.
> Even if you *do* store all of the cached objects, it still takes time
> to examine all of the objects and in the mean time, more changes will
> have come rolling in, and you will either need to add a huge amount of
> dependency to figure out what internal data structures need to be
> updated based on the changes in some of the cached objects --- or you
> will end up restarting the e2fsck checking process from scratch.
>
Not really. In my application I propose some changes to the fsck pass
order to avoid the need to rerun it. And I don't get what dependency you
are talking about. The only one I see is between the directory entries and
the directory inode. Should not be hard to solve.
(Or do I miss something? Could you give more examples maybe?)
> In either case, there is still the issue of knowing exactly whether a
> particular read happened before or after some change in the
> filesystem. This race condition is a really hard one to deal with,
> especially on a multiple CPU system and the filesystem checker is
> running in userspace.
I don't see why should fsck care about this. The notification is always sent
after the write happened, so fsck should just re-read the data. No problem
if it already read the (half-)updated version just before the notification.
Btw, how about an even simplyer method: just watch the journal commits
(changes to jbd needed). This way we can get all actual metadata updates,
without being flooded by the file data updates.
>
> > But you are probably right, this project may be not doable in just three
> > months. The changes on the kernel side probably are, but there is a
> > huge e2fsck work.
>
> Yes, that is the concern. And without implementing the user-space
> side, you'll never besure whether you completely got the kernel side
> changes right!
>
> Regards,
>
> - Ted
>
On Sat, Apr 19, 2008 at 11:07 PM, Eric Sandeen <[email protected]> wrote:
> Theodore Tso wrote:
> > On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote:
> >> If it is a block containing a metadata object fsck has already read,
> >> than we already know what kind of object it is (there must be a way
> >> to quickly find all cached objects derived from a given block), and
> >> can update the cached version. And if fsck has not yet read the
> >> block, it can just be ignored, no matter what kind of data it
> >> contains. If it contains metadata and fsck is intrested in it, it
> >> will read it sooner or later anyway. If it contains file data, why
> >> should fsck even care?
>
> It seems to me that what the proposed project really does, in essence,
> is a read-only check of a filesystem snapshot. It's just that the
> snapshot is proposed to be constructed in a complex and non-generic (and
> maybe impossible) way.
Maybe complex and non-generic, but also quite efficient. Only the actually used
matadata is cached, and everything is done in userspace.
>
> If you really just want to verify a snapshot of the fs at a point in
> time, surely there are simpler ways. If the device is on lvm, there's
> already a script floating around to do it in automated fasion. (I'd
> pondered the idea of introducing META_WRITE (to go with META_READ) and
> maybe lvm could do a "metadata-only" snapshot to be lighter weight?)
How do you tell data from metadata on this level?
>
> -Eric
>
On Mon, Apr 21, 2008 at 12:42:42AM +0100, Jamie Lokier wrote:
> Andi Kleen wrote:
> > [LVM] always disables barriers if you don't apply a so far unmerged
> > patch that enables them in some special circumstances (only single
> > backing device)
>
> (I continue to be surprised at the un-safety of Linux fsync)
Note barrier less does not necessarily always mean unsafe fsync,
it just often means that.
Also surprisingly lot more syncs or write cache off tend to lower the MTBF
of your disk significantly, so "unsafer" fsync might actually be more safe
for your unbackuped data.
> > Not having barriers sometimes makes your workloads faster (and less
> > safe) and in other cases slower.
>
> I'm curious, how does it make them slower? Merely not issuing barrier
> calls seems like it will always be the same speed or faster.
Some setups detect the no barrier case and switch to full sync + wait
(or write cache off) which depending on the disk supporting NCQ can be slower.
-Andi
"Alexey Zaytsev" <[email protected]> writes:
>
> How do you tell data from metadata on this level?
You could always change the file system to pass down hints through
the block layer.
-Andi
Andi Kleen wrote:
> On Mon, Apr 21, 2008 at 12:42:42AM +0100, Jamie Lokier wrote:
> > Andi Kleen wrote:
> > > [LVM] always disables barriers if you don't apply a so far unmerged
> > > patch that enables them in some special circumstances (only single
> > > backing device)
> >
> > (I continue to be surprised at the un-safety of Linux fsync)
>
> Note barrier less does not necessarily always mean unsafe fsync,
> it just often means that.
>
> Also surprisingly lot more syncs or write cache off tend to lower the MTBF
> of your disk significantly, so "unsafer" fsync might actually be more safe
> for your unbackuped data.
That's really interesting, thanks. Do you have something to cite
about syncs reducing the MTBF?
( I'm really glad I added barriers instead of write cache off to my
2.4.26 based disk using devices now ;-) )
> > > Not having barriers sometimes makes your workloads faster (and less
> > > safe) and in other cases slower.
> >
> > I'm curious, how does it make them slower? Merely not issuing barrier
> > calls seems like it will always be the same speed or faster.
>
> Some setups detect the no barrier case and switch to full sync +
> wait (or write cache off) which depending on the disk supporting NCQ
> can be slower.
But to issue full syncs, that's implemented as barrier calls in the
block request layers isn't it? The filesystem isn't given a facility
to request the block device do full syncs or disable the write cache.
So when a blockdev doesn't offer barriers to the filesystem, it means
the driver doesn't support full syncs or cache disabling either, since
if it did, the request layer would expose them to the fs as barriers.
What am I missing from this picture? Do you mean that manual setup
(such as by a DBA) tends to disable the write cache?
Thanks,
-- Jamie
On Mon, Apr 21, 2008 at 04:23:42AM +0400, Alexey Zaytsev wrote:
> Not really. In my application I propose some changes to the fsck pass
> order to avoid the need to rerun it. And I don't get what dependency you
> are talking about. The only one I see is between the directory entries and
> the directory inode. Should not be hard to solve.
> (Or do I miss something? Could you give more examples maybe?)
And *this* is why I ultimately decided I didn't have the time to
mentor you. There are large numbers of other dependencies.
For example, between the direct and indirect blocks in the inode, and
the block allocation bitmaps. (Note that e2fsck keeps up to 3
different block bitmaps and 6 different inofr bitmaps.)
You need to know which inodes are directories and which inodes are
regular files. E2fsck currently keeps these bitmaps so we don't have
the cache the entire 128 byte inode for all inodes. (Instead, we
cache a single bit for every single inode. There's a ***reason*** for
all of these bitmaps.)
You also need to know which blocks are being used to store extended
attributes, which may potentially be shared across multiple inodes.
That's just *three* additional dependencis, and there are many more.
If you can't think of them, how much time would it take for me as
mentor to explain all of this to you?
> > In either case, there is still the issue of knowing exactly whether a
> > particular read happened before or after some change in the
> > filesystem. This race condition is a really hard one to deal with,
> > especially on a multiple CPU system and the filesystem checker is
> > running in userspace.
>
> I don't see why should fsck care about this. The notification is always sent
> after the write happened, so fsck should just re-read the data. No problem
> if it already read the (half-)updated version just before the notification.
Keep in mind that when a file gets deleted, a *large* number of
metadata blocks will potentially get updated. So while e2fsck is
handling these reads, a bunch more can start coming in from other
filesystem transactions, and since the kernel doesn't know what
userspace has already cached, it will have to send them again... and
again...
In fact if the filesystem is being very quickly updated, the
notifications could easily overrun whatever buffers has been set up to
transfer this information from userspace to the kernel side. Worse
yet, unless you also send down transaction boundaries, the userspace
won't know when the filesystem has reached a "stable state" which
would be internally consistent.
There are ways that this could be solved, but at the end of the day,
the $1,000,000 question is why not just do a kernel-side snapshot?
Then you don't have to completely rewrite e2fsck --- and given that
you've claimed the e2fsck code is "hard to understand", it seems
especially audacious that you would have thought you could do this in
3 months. If you really don't want to use LVM, you could have
proposed a snapshot solution which didn't involve devicemapper. It's
not clear it would have entered mainline, but at least there would
have been some non-zero chance that you would complete the project
successfully.
Regards,
- Ted
On Mon, Apr 21, 2008 at 01:37:37AM +0200, Andi Kleen wrote:
> Are you sure about all data? I think he would just need some lookup table from
> metadata block numbers to inode numbers and then when a hit occurs on a block
> in the table somehow invalidate all data related to that inode
> and restart that part. And the same thing for bitmap blocks. That lookup
> table should be much smaller than the full metadata.
Yeah, unfortunately it's close to all of the metadata. Consider that
e2fsck also has to deal with changes in the directory, and there can
be multiple hard links in a directory, so it's not just a simple
lookup table. You could try to condense the directory into a list of
inodes numbers and the number of times they were counted in a
directory, but then any time the directory changed, you'd have to
rescan the *entire* directory.
Also, consider that the lookup table might not be enough, if the
filesystem is actually corrupted, and there are multiple blocks
claimed by an inode. How you "invalidate all data" in that case
becomes less obvious.
It would be possible to condense the metdata somewhat by taking the
omitting unused inodes, and storing the indirect blocks as extents.
But there would still be a huge amount of metadata that would have to
be stored in memory. If you're willing to completely rewrite e2fsck
(which the on-line resize would need anyway, because the updated data
could invalidate the previously done work at any point anywhere in the
e2fsck processing), maybe the extra cached data structures won't be on
completely additive on top of the other intermediate data kept by
e2fsck, but it once again points out it would be insane for a student
to try to do this in 3 months.
> Anyways my favourite fsck wish list feature would be a way to record the
> changes a read-only fsck would want to do and then some quick way
> to apply them to a writable version of the file system without
> doing a full rescan. Then you could regularly do a background check
> and if it finds something wrong just remount and apply the changes
> quickly.
This is a read-only fsck while the filesystem is changing out from
underneath it, and the hope is that you can take the instructions
gathered from the read-only fsck (presumably run on a snapshot) and
then apply them to filesystem that has since been modified after the
snaphot was taken. Even if it has been remounted read-only at this
point, this gets really dicey. Consider that with certain types of
corruption, if the filesystem continues to get modified, the
corruption can get worse.
> Or perhaps just tell the kernel which objects is suspicious and
> should be EIOed.
Yeah; you could do that, as long as it's not a guarantee that all of
the objects which were suspicious were found. It would also be
possible to isolate the objects, perhaps with some potential inode and
block leakage that would get fixed at the next off-line fsck. Still,
it would be a lot of work. Let me know if someone is willing to pay
for this, and I could probably work with someone like Val to execute
this. But otherwise, it probably falls in the "we'd all like a pony"
sort of wishlist.....
- Ted
> snaphot was taken. Even if it has been remounted read-only at this
> point, this gets really dicey. Consider that with certain types of
> corruption, if the filesystem continues to get modified, the
> corruption can get worse.
I see, but perhaps you could do that on at least some common
type of corruptions and only give up in the extreme cases?
Mind you I don't have a good feeling what common and uncommon
types are.
>
> > Or perhaps just tell the kernel which objects is suspicious and
> > should be EIOed.
>
> Yeah; you could do that, as long as it's not a guarantee that all of
> the objects which were suspicious were found. It would also be
Ok to do the 100% job you probably need metadata checksums and always
validate on initial read.
-Andi
(sorry if this is a duplicate, my previous email was rejected)
Hi Andi,
On Seg, 2008-04-21 at 10:01 +0200, Andi Kleen wrote:
> > (I continue to be surprised at the un-safety of Linux fsync)
>
> Note barrier less does not necessarily always mean unsafe fsync,
> it just often means that.
Am I correct that the Linux fsync(), when used (from userspace)
directly on file descriptors associated with block devices doesn't
actually flush the disk write cache and wait for the data to reach the
disk before returning?
Is there a reason why this isn't being done other than performance?
I would imagine that the only reason a process is using fsync() is
because it is worried about data loss, and therefore is perfectly
willing to lose some performance if necessary..
Regards,
Ricardo
--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email [email protected]
> Am I correct that the Linux fsync(), when used (from userspace)
> directly on file descriptors associated with block devices doesn't
> actually flush the disk write cache and wait for the data to reach the
> disk before returning?
Not quite. It depends. Sometimes it does this and sometimes it doesn't,
depending on the disk and the controller and the file system and the
kernel version and the distribution default.
For details search the archives of linux-kernel/linux-fsdevel. This
has been discussed many times.
> Is there a reason why this isn't being done other than performance?
One reason against it is that in many (but not all) setups to guarantee
reaching the platter you have to disable the write cache, and at least
for consumer level hard disks disk vendors generally do not recommend
doing this because it significantly lowers the MTBF of the disk.
-Andi
Andi Kleen wrote:
> On Mon, Apr 21, 2008 at 12:42:42AM +0100, Jamie Lokier wrote:
>> Andi Kleen wrote:
>>> [LVM] always disables barriers if you don't apply a so far unmerged
>>> patch that enables them in some special circumstances (only single
>>> backing device)
>> (I continue to be surprised at the un-safety of Linux fsync)
>
> Note barrier less does not necessarily always mean unsafe fsync,
> it just often means that.
>
> Also surprisingly lot more syncs or write cache off tend to lower the MTBF
> of your disk significantly, so "unsafer" fsync might actually be more safe
> for your unbackuped data.
>
Hi Andi,
Where did you get this data?
I have never heard that using more barrier operations lowers the reliability or
the MTBF of a drive and I look at a fairly huge population when doing this ;-)
ric
On Seg, 2008-04-21 at 19:40 +0200, Andi Kleen wrote:
> > Is there a reason why this isn't being done other than performance?
>
> One reason against it is that in many (but not all) setups to guarantee
> reaching the platter you have to disable the write cache, and at least
> for consumer level hard disks disk vendors generally do not recommend
> doing this because it significantly lowers the MTBF of the disk.
I understand that, but if the disk/storage doesn't support flushing the
cache, I would expect fsync() to return EIO or ENOTSUP, I wouldn't
expect it to ignore my request and risk losing data without my
knowledge..
I know fsync() also flushes dirty buffers, but IMHO even if it flushes
the buffers it'd be better to return an error if a full sync wasn't
being done rather than returning success and misleading the application.
Anyway, sorry if this has been discussed before, I should take a look at
the archives..
Thanks,
Ricardo
--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
Phone +351.214134023 / x58723
Mobile +351.912590825
Email [email protected]
Ric Wheeler wrote:
>
> Andi Kleen wrote:
>> On Mon, Apr 21, 2008 at 12:42:42AM +0100, Jamie Lokier wrote:
>>> Andi Kleen wrote:
>>>> [LVM] always disables barriers if you don't apply a so far unmerged
>>>> patch that enables them in some special circumstances (only single
>>>> backing device)
>>> (I continue to be surprised at the un-safety of Linux fsync)
>> Note barrier less does not necessarily always mean unsafe fsync,
>> it just often means that.
>>
>> Also surprisingly lot more syncs or write cache off tend to lower the MTBF
>> of your disk significantly, so "unsafer" fsync might actually be more safe
>> for your unbackuped data.
>>
>
> Hi Andi,
>
> Where did you get this data?
>
> I have never heard that using more barrier operations lowers the reliability or
> the MTBF of a drive and I look at a fairly huge population when doing this ;-)
Ric, what about the other part - turning write cache off? I've also
heard it suggested that this might hurt drive lifespan, and it sorta
makes sense, I assume it keeps the head working harder...
-Eric
Eric Sandeen wrote:
> Ric Wheeler wrote:
>> Andi Kleen wrote:
>>> On Mon, Apr 21, 2008 at 12:42:42AM +0100, Jamie Lokier wrote:
>>>> Andi Kleen wrote:
>>>>> [LVM] always disables barriers if you don't apply a so far unmerged
>>>>> patch that enables them in some special circumstances (only single
>>>>> backing device)
>>>> (I continue to be surprised at the un-safety of Linux fsync)
>>> Note barrier less does not necessarily always mean unsafe fsync,
>>> it just often means that.
>>>
>>> Also surprisingly lot more syncs or write cache off tend to lower the MTBF
>>> of your disk significantly, so "unsafer" fsync might actually be more safe
>>> for your unbackuped data.
>>>
>> Hi Andi,
>>
>> Where did you get this data?
>>
>> I have never heard that using more barrier operations lowers the reliability or
>> the MTBF of a drive and I look at a fairly huge population when doing this ;-)
>
> Ric, what about the other part - turning write cache off? I've also
> heard it suggested that this might hurt drive lifespan, and it sorta
> makes sense, I assume it keeps the head working harder...
>
> -Eric
Turning the drive write cache off is the default case for most RAID products
(including our mid and high end arrays).
I have not seen an issue with drives wearing out with either setting (cache
disabled or enabled with barriers).
The theory does make some sense, but does not map into my experience ;-)
ric
On Mon, Apr 21, 2008 at 02:44:45PM -0400, Ric Wheeler wrote:
> Turning the drive write cache off is the default case for most RAID
> products (including our mid and high end arrays).
>
> I have not seen an issue with drives wearing out with either setting (cache
> disabled or enabled with barriers).
>
> The theory does make some sense, but does not map into my experience ;-)
To be fair though, the gigabytes of NVRAM on the array perform the job
that the drive's cache would do on a lower-end system.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Matthew Wilcox wrote:
> On Mon, Apr 21, 2008 at 02:44:45PM -0400, Ric Wheeler wrote:
>> Turning the drive write cache off is the default case for most RAID
>> products (including our mid and high end arrays).
>>
>> I have not seen an issue with drives wearing out with either setting (cache
>> disabled or enabled with barriers).
>>
>> The theory does make some sense, but does not map into my experience ;-)
>
> To be fair though, the gigabytes of NVRAM on the array perform the job
> that the drive's cache would do on a lower-end system.
The population I deal with personally is a huge number of 1U Centera nodes, each
of which has 4 high capacity ATA or S-ATA drives (no NVRAM). We run with
barriers (and write cache) enabled and I have not seen anything that leads me to
think that this is an issue.
One way to think about this is that even with barriers, relatively few
operations actually turn into cache flushes (fsync's, journal syncs, unmounts?).
Another thing to keep in mind is that drives are constantly writing and moving
heads - disabling write cache or doing a flush just adds an incremental number
of writes/head movements.
Using barriers or disabling write cache matters only when you are doing a write
intensive load, read intensive loads are not impacted (and random, cache miss
reads will move the heads often).
I just don't see it being an issue for any normal user (laptop user or desktop
user) since the write workload more people have is a small fraction of what we
run into in production data centers.
Running your drives in a moderate way will probably help them last longer, but I
am just not convinced that the write cache/barrier load makes much of a
difference...
ric
Andi Kleen wrote:
> > Is there a reason why this isn't being done other than performance?
>
> One reason against it is that in many (but not all) setups to guarantee
> reaching the platter you have to disable the write cache, and at least
> for consumer level hard disks disk vendors generally do not recommend
> doing this because it significantly lowers the MTBF of the disk.
I think the MTBF argument is a bit spurious, because guaranteeing it
reaches the platter with all modern disks is possible, with the
appropriate kernel changes, and does not require the write cache to be
disabled.
TBH, I think the reason is it's simply never been implemented. There
are other strategies for mitigating data loss, after all, and
filesystem structure is not at risk; barriers are fine for that.
Right now, you have the choice of 'disable write cache' or 'fsync
flushes sometimes but not always, depending on lots of factors'.
The option 'fsync flushes always, write cache enabled' isn't
implemented, though most hardware supports it.
Btw, on Darwin (Mac OS X) it _is_ because of performance that fsync()
doesn't issue a flush to platter. It has an fcntl(F_FULLSYNC) which
is documented to do the latter.
-- Jamie
On Sun, Apr 20, 2008 at 3:07 AM, Eric Sandeen <[email protected]> wrote:
> Theodore Tso wrote:
> > On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote:
> >> If it is a block containing a metadata object fsck has already read,
> >> than we already know what kind of object it is (there must be a way
> >> to quickly find all cached objects derived from a given block), and
> >> can update the cached version. And if fsck has not yet read the
> >> block, it can just be ignored, no matter what kind of data it
> >> contains. If it contains metadata and fsck is intrested in it, it
> >> will read it sooner or later anyway. If it contains file data, why
> >> should fsck even care?
>
> It seems to me that what the proposed project really does, in essence,
> is a read-only check of a filesystem snapshot. It's just that the
> snapshot is proposed to be constructed in a complex and non-generic (and
> maybe impossible) way.
>
> If you really just want to verify a snapshot of the fs at a point in
> time, surely there are simpler ways. If the device is on lvm, there's
> already a script floating around to do it in automated fasion. (I'd
> pondered the idea of introducing META_WRITE (to go with META_READ) and
> maybe lvm could do a "metadata-only" snapshot to be lighter weight?)
>
Can I know where is this script? Or if u cannot locate it, does it
have any resemblance to all the stuff mentioned below?.
Apologizing for the regression of discussion back to this part again,
(and pardon my superficial knowledge of filesystem, just brainstorming
and eager to learn :-)), I think the idea of "online checker" can be
developed further, taking into consideration all that have been said
in this threads - morphing into "semi-online" (real online is not
feasible eg what have been fscked can be immediately be invalidated by
another subsequent corrupted writes, so the idea of fsck on read-only
snapshot is best we could achieved, and then mark the fsck results
with the timestamp, so that all writes beyond this timestamp may
invalidate the earlier fsck results. This idea has its equivalence
in the Oracle database world - "online datafile backup" feature, where
all transactions goes to memory + journal logs (a physical file
itself), and datafile is frozen for writing, enabling it to be
physically copied):
a. First, integrity of the filesystem must be treated as a WHOLE,
and therefore, all WRITES must somehow be frozen at THE SAME TIME,
and, after that point in time, all writes will then go direct to
memory only. So the permanent storage will be readonly. This I
guessed is the readonly snapshot part, correct?
b. Concerning all the different infinite combination of race
condition that can happened, it should not happen here. This is
because now the entire filesystem's integrity is maintained as a
whole.
c. The only difficulty i can see is that updates to the journal logs
- can this part of online updates just go to memory temporarily, while
the frozen image is being fsck?
d. When ALL fsck is done, everything in memory will get resync with
the filesystem. and during this short period of resyncing, all
writing should be completely frozen - no writing to disk nor memory,
as race condition may arise. after syncing, all read/writing to go
direct to the disk.
Complexity of cache interaction is beyond my understanding. Some are
rephrasing or adaptation of what I have read in this thread, so is my
understanding correct?
Thank you for sharing.
--
Regards,
Peter Teoh
Peter Teoh wrote:
> On Sun, Apr 20, 2008 at 3:07 AM, Eric Sandeen <[email protected]> wrote:
>> Theodore Tso wrote:
>> > On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote:
>> >> If it is a block containing a metadata object fsck has already read,
>> >> than we already know what kind of object it is (there must be a way
>> >> to quickly find all cached objects derived from a given block), and
>> >> can update the cached version. And if fsck has not yet read the
>> >> block, it can just be ignored, no matter what kind of data it
>> >> contains. If it contains metadata and fsck is intrested in it, it
>> >> will read it sooner or later anyway. If it contains file data, why
>> >> should fsck even care?
>>
>> It seems to me that what the proposed project really does, in essence,
>> is a read-only check of a filesystem snapshot. It's just that the
>> snapshot is proposed to be constructed in a complex and non-generic (and
>> maybe impossible) way.
>>
>> If you really just want to verify a snapshot of the fs at a point in
>> time, surely there are simpler ways. If the device is on lvm, there's
>> already a script floating around to do it in automated fasion. (I'd
>> pondered the idea of introducing META_WRITE (to go with META_READ) and
>> maybe lvm could do a "metadata-only" snapshot to be lighter weight?)
>>
>
> Can I know where is this script? Or if u cannot locate it, does it
> have any resemblance to all the stuff mentioned below?.
Google for "lvcheck" and find it buried in a thread "forced fsck
(again?)" on the ext3-users list - I'm not sure if it has an upstream
home anywhere yet...
-Eric
On Apr 22, 2008 12:02 -0500, Eric Sandeen wrote:
> Peter Teoh wrote:
> > On Sun, Apr 20, 2008 at 3:07 AM, Eric Sandeen <[email protected]> wrote:
> >> If you really just want to verify a snapshot of the fs at a point in
> >> time, surely there are simpler ways. If the device is on lvm, there's
> >> already a script floating around to do it in automated fasion. (I'd
> >> pondered the idea of introducing META_WRITE (to go with META_READ) and
> >> maybe lvm could do a "metadata-only" snapshot to be lighter weight?)
> >
> > Can I know where is this script? Or if u cannot locate it, does it
> > have any resemblance to all the stuff mentioned below?.
>
> Google for "lvcheck" and find it buried in a thread "forced fsck
> (again?)" on the ext3-users list - I'm not sure if it has an upstream
> home anywhere yet...
We thought the best place to put it would be in the lvm2 utilities, since
it is tied to LVM snapshots (and not really a particular filesystem).
Eric, any chance you could pass the script over to the LVM folks at RH?
AFAIK, they are the official LVM/DM maintainers still (adopted as part of
Sistina and GFS).
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> On Apr 22, 2008 12:02 -0500, Eric Sandeen wrote:
>> Peter Teoh wrote:
>>> On Sun, Apr 20, 2008 at 3:07 AM, Eric Sandeen <[email protected]> wrote:
>>>> If you really just want to verify a snapshot of the fs at a point in
>>>> time, surely there are simpler ways. If the device is on lvm, there's
>>>> already a script floating around to do it in automated fasion. (I'd
>>>> pondered the idea of introducing META_WRITE (to go with META_READ) and
>>>> maybe lvm could do a "metadata-only" snapshot to be lighter weight?)
>>> Can I know where is this script? Or if u cannot locate it, does it
>>> have any resemblance to all the stuff mentioned below?.
>> Google for "lvcheck" and find it buried in a thread "forced fsck
>> (again?)" on the ext3-users list - I'm not sure if it has an upstream
>> home anywhere yet...
>
> We thought the best place to put it would be in the lvm2 utilities, since
> it is tied to LVM snapshots (and not really a particular filesystem).
>
> Eric, any chance you could pass the script over to the LVM folks at RH?
> AFAIK, they are the official LVM/DM maintainers still (adopted as part of
> Sistina and GFS).
Sure, I'll see who I can bug :)
-Eric