2009-11-19 21:23:49

by Jody McIntyre

Subject: [patch 0/5] Journal guided resync and support

This is an updated implementation of journal guided resync, intended to be
suitable for production systems. This feature addresses the problem with RAID
arrays that take too long to resync - similar to the existing MD write-intent
bitmap feature, we resync only the stripes that were undergoing writes at the
time of the crash. Unlike write-intent bitmaps, our testing shows very little
performance degradation as a result of the feature - around 3-5% vs around 30%
for bitmaps.

This feature is based on work described in this paper:
http://www.usenix.org/events/fast05/tech/denehy.html

As a summary, we introduce a new data write mode known as declared mode. This
is based on ordered mode except that a list of blocks to be written during the
current transaction is added to the journal before the blocks themselves are
written to the disk. Then, if the system crashes, we can resync only those
blocks during journal replay and skip the rest of the resync of the RAID array.
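
To make the idea concrete, here is a minimal toy sketch of declared mode in
Python. It is not the jbd/ext3 code: the class, the method names and the
STRIPE_BLOCKS value are all hypothetical, and it assumes that only blocks
declared by a transaction with no commit record need resyncing. The real
patches may cover more cases than this.

```python
# Toy model of "declared mode": hypothetical names, not the jbd/ext3 API.
STRIPE_BLOCKS = 4  # assumed number of data blocks per RAID stripe


class DeclaredJournal:
    def __init__(self):
        self.records = []  # append-only list of journal records

    def begin(self, tid, blocks):
        # Declare the blocks this transaction will write BEFORE any data I/O.
        self.records.append(("declare", tid, tuple(blocks)))

    def commit(self, tid):
        # Written only after all declared blocks have reached the disk.
        self.records.append(("commit", tid))

    def stripes_to_resync(self):
        # On replay, a declared-but-uncommitted transaction may have left
        # stripes with stale parity; only those stripes need a resync.
        committed = {rec[1] for rec in self.records if rec[0] == "commit"}
        stripes = set()
        for rec in self.records:
            if rec[0] == "declare" and rec[1] not in committed:
                stripes.update(block // STRIPE_BLOCKS for block in rec[2])
        return sorted(stripes)


journal = DeclaredJournal()
journal.begin(1, [0, 1, 2])
journal.commit(1)
journal.begin(2, [9, 10, 17])  # simulated crash: transaction 2 never commits
print(journal.stripes_to_resync())
```

In this toy run only the stripes touched by transaction 2 are reported, so
the rest of the array would be skipped.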

The changes consist of patches to ext3, jbd, MD, and the raid456 personality.
These patches are currently against the RHEL 5 kernel 2.6.18-128.7.1. Porting
to ext4/jbd2 and a more modern kernel is a TODO item.

Changes since the previous set of patches: I have addressed all review comments
received. Notable is a design change based on Neil Brown's suggestions: the
filesystem now sets a buffer flag (fs_raidsync) to inform MD that the
filesystem is taking responsibility for resyncing parity on this stripe in
the event of a system crash. For RAID 4/5/6, setting this flag causes the
write-intent bitmap NOT to be updated for the write in question. There is
also a buffer flag (syncraid) used by jbd to resync parity. Together these
eliminate most of the need for ioctls, though one is still needed for e2fsck.
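
As a rough illustration of the effect of fs_raidsync, here is a toy model in
Python. ToyArray and its write() method are made-up names, not MD or
buffer_head code, and the model assumes the flag does nothing except
suppress the bitmap update.

```python
# Toy model: hypothetical names, not the MD or buffer_head API.
class ToyArray:
    def __init__(self):
        self.dirty_bitmap = set()  # stripes tracked by the write-intent bitmap

    def write(self, stripe, fs_raidsync=False):
        # If the filesystem has taken responsibility for resyncing this
        # stripe after a crash, skip the bitmap update for this write.
        if not fs_raidsync:
            self.dirty_bitmap.add(stripe)
        # ... the data and parity write itself would happen here ...


array = ToyArray()
array.write(3)                    # ordinary write: bitmap marks stripe 3
array.write(7, fs_raidsync=True)  # flagged write: bitmap left untouched
print(sorted(array.dirty_bitmap))
```

Stripe 7 never shows up in the bitmap, so after a crash its parity would
have to be resynced by the filesystem during journal replay instead.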

Unfortunately, we have determined that these patches are NOT useful to Lustre.
Therefore I will not be doing any more work on them. I am sending them now in
case they are useful as a starting point for someone else's work.

Cheers,
Jody


2009-11-24 12:02:24

by Pavel Machek

Subject: Re: [patch 0/5] Journal guided resync and support

Hi!

> This is an updated implementation of journal guided resync, intended to be
> suitable for production systems. This feature addresses the problem with RAID
> arrays that take too long to resync - similar to the existing MD write-intent
> bitmap feature, we resync only the stripes that were undergoing writes at the
> time of the crash. Unlike write-intent bitmaps, our testing shows very little
> performance degradation as a result of the feature - around 3-5% vs around 30%
> for bitmaps.

Good. Now that the fs knows about raid and vice versa... perhaps it is time
to journal surrounding data on the stripe so that power failures do not
destroy data on degraded raid5?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-11-24 18:51:37

by Andreas Dilger

Subject: Re: [patch 0/5] Journal guided resync and support

On 2009-11-24, at 04:43, Pavel Machek wrote:
>> This is an updated implementation of journal guided resync,
>> intended to be suitable for production systems. This feature
>> addresses the problem with RAID arrays that take too long to resync
>> - similar to the existing MD write-intent bitmap feature, we resync
>> only the stripes that were undergoing writes at the time of the
>> crash. Unlike write-intent bitmaps, our testing shows very little
>> performance degradation as a result of the feature - around 3-5% vs
>> around 30% for bitmaps.
>
> Good. Now that the fs knows about raid and vice versa... perhaps it is time
> to journal surrounding data on the stripe so that power failures do not
> destroy data on degraded raid5?


That's an interesting idea. I suspect this could be done more
efficiently by only journaling the parity block update, but there is
no way for the journal to address this block.
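
For anyone unfamiliar with the degraded raid5 problem being discussed, a toy
Python example of it follows. It models a single stripe with two data blocks
and XOR parity; the layout and names are illustrative only and are not taken
from the MD code.

```python
# Toy 3-disk RAID5 stripe: two data blocks plus XOR parity (illustrative only).
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)

# The disk holding d1 has failed: the array is degraded, and d1 now exists
# only implicitly as d0 XOR parity.
assert xor(d0, parity) == d1

# A write to d0 must update both d0 and parity. Power fails in between:
d0 = b"CCCC"  # the new data block reached the disk
# parity = xor(d0, d1)  <- this update was never written

recovered_d1 = xor(d0, parity)
print(recovered_d1 == b"BBBB")
```

The block that was never written to (d1) can no longer be reconstructed
correctly, which is why journaling either the surrounding data or the parity
update would help here.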

That said, unfortunately Jody has no more time to work on these
patches (due to data ordering requirements Lustre can't use them), and
while they are functionally complete for RHEL5 + ext3, they need to be
ported to 2.6.current and ext4 by someone or they will die a silent
death.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.