2001-07-06 15:17:35

by Andrew Morton

Subject: ext3-2.4-0.9.0

An update of the ext3 journalling filesystem for 2.4 kernels
is available at

http://www.uow.edu.au/~andrewm/linux/ext3/

Patches are against 2.4.6-ac1 and 2.4.6.

Changes since 0.0.8 include:

- Multiplied the version numbering by ten to cater for bugfix
releases against the 0.9.0 stream.

- The main thrust has been the removal of a number of changes in
the core kernel which were required to support the journalling
of data. This has caused some duplication of core code within
ext3, but it's not too bad.

- A number of cleanups and resyncs with latest ext2. (Thanks, Al).

- Reorganised and optimised ext3_write_inode() and the handling
of files which were opened O_SYNC.

- Moved quota operations outside lock_super() - fixes the last known
source of quota deadlocks in -ac kernels.

- Deleted large chunks of debug/development support code.

- Improved handling of corner-case errors.

- Improved robustness in out-of-memory situations.

The last change is probably the most significant - it prevents
possible crashes and fs corruption under extreme workloads.



2001-07-07 22:16:04

by NeilBrown

Subject: Re: ext3-2.4-0.9.0

On Saturday July 7, [email protected] wrote:
> An update of the ext3 journalling filesystem for 2.4 kernels
> is available at
>
> http://www.uow.edu.au/~andrewm/linux/ext3/
>
> Patches are against 2.4.6-ac1 and 2.4.6.

I thought it was time to try out ext3 between nfsd and raid5, so I
built 2.4.6 plus this patch, and an ext3 filesystem on a largish
raid5 volume, exported it (with the "sync" flag), mounted it from
another machine with NFSv2, and ran "dbench 4".

This produces a live-lock (I think that is the right term).
Throughput would drop to zero (determined by watching the counts in
/proc/nfs/rpc/nfsd), but could be coaxed along by generating other
filesystem activity.

I tried nfs over ext3 on a plain ide disc and it worked fine.
I tried dbench directly on ext3/raid5 and it worked fine.
I tried dbench/nfs/ext2/raid5 and it worked fine.

So I think it is some interaction between ext3fs and raid5 triggered
by the high rate of "fsync" calls made by nfsd. Naturally I blame
ext3 because I know more about raid5 and nfsd :-)

One particular aspect of raid5 that *could* be related is that it is
very reticent to schedule write requests. It tries to hang on to them
as long as possible in the hope of getting more write requests in the
same stripe. My guess as to what is happening is that a write
request is submitted and then waited for without an intervening
run_task_queue(&tq_disk);

When the system is livelocked, all I can tell at the moment (I am at
home and the console is at work so I cannot use alt-sysrq) is that
kjournal is waiting in wait_on_buffer and an nfsd thread is waiting on
the journal.

I will try to explore it more deeply next time I am at work, but if
there are any suggestions as to what it might be, or how I might more
easily find out what is going on, I am all ears.

NeilBrown

2001-07-08 01:04:34

by Andrew Morton

Subject: Re: ext3-2.4-0.9.0

Neil Brown wrote:
>
> On Saturday July 7, [email protected] wrote:
> > An update of the ext3 journalling filesystem for 2.4 kernels
> > is available at
> >
> > http://www.uow.edu.au/~andrewm/linux/ext3/
> >
> > Patches are against 2.4.6-ac1 and 2.4.6.
>
> I thought it was time to try out ext3 between nfsd and raid5, so I
> built 2.4.6 plus this patch, and an ext3 filesystem on a largish
> raid5 volume, exported it (with the "sync" flag), mounted it from
> another machine with NFSv2, and ran "dbench 4".
>
> This produces a live-lock (I think that is the right term).
> Throughput would drop to zero (determined by watching the counts in
> /proc/nfs/rpc/nfsd), but could be coaxed along by generating other
> filesystem activity.
>
> I tried nfs over ext3 on a plain ide disc and it worked fine.
> I tried dbench directly on ext3/raid5 and it worked fine.
> I tried dbench/nfs/ext2/raid5 and it worked fine.
>
> So I think it is some interaction between ext3fs and raid5 triggered
> by the high rate of "fsync" calls made by nfsd. Naturally I blame
> ext3 because I know more about raid5 and nfsd :-)

fsync will cause ext3 to commit the current transaction once all
handles against it close - so that will produce rapid bursts
of small numbers of writes.
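
As a rough, single-threaded sketch of that ordering (invented names only,
nothing like the real jbd structures or API): each write opens and closes
a handle against the running transaction, and an fsync-requested commit
proceeds once no handles remain open, writing out whatever few buffers
have accumulated by then.

/*
 * Toy model of "commit once all handles close" - invented names, not
 * the real jbd structures or API.
 */
#include <stdio.h>

struct toy_transaction {
    int open_handles;      /* updates still running against this tx */
    int dirty_buffers;     /* buffers this tx will write at commit  */
    int commit_requested;  /* someone called fsync                  */
};

static void try_commit(struct toy_transaction *tx)
{
    if (tx->commit_requested && tx->open_handles == 0) {
        printf("commit: %d buffer(s) written\n", tx->dirty_buffers);
        tx->dirty_buffers = 0;
        tx->commit_requested = 0;
    }
}

static void handle_start(struct toy_transaction *tx)
{
    tx->open_handles++;
}

static void handle_stop(struct toy_transaction *tx)
{
    tx->open_handles--;
    try_commit(tx);  /* a requested commit proceeds when the last handle closes */
}

static void do_write(struct toy_transaction *tx, int nr_buffers)
{
    handle_start(tx);
    tx->dirty_buffers += nr_buffers;
    handle_stop(tx);
}

static void do_fsync(struct toy_transaction *tx)
{
    tx->commit_requested = 1;
    try_commit(tx);
}

int main(void)
{
    struct toy_transaction tx = { 0, 0, 0 };
    int i;

    /* nfsd on a "sync" export: fsync after every operation,
     * so every commit is tiny */
    for (i = 0; i < 5; i++) {
        do_write(&tx, 2);
        do_fsync(&tx);
    }

    /* the same writes without the rapid fsyncs batch into one commit */
    for (i = 0; i < 5; i++)
        do_write(&tx, 2);
    do_fsync(&tx);

    return 0;
}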

> One particular aspect of raid5 that *could* be related is that it is
> very reticent to schedule write requests. It tries to hang on to them
> as long as possible in the hope of getting more write requests in the
> same stripe. My guess as to what is happening is that a write
> request is submitted and then waited for without an intervening
> run_task_queue(&tq_disk);

Could well be. ext3 will happily feed 2,000 buffers into submit_bh()
prior to running tq_disk. Everything else is happy with this, so I blame
nfsd and raid5 :) Rapid fsyncs will break this up, however.

Does this patch help?

--- fs/jbd/commit.c 2001/07/01 04:24:42 1.40
+++ fs/jbd/commit.c 2001/07/08 00:53:42
@@ -202,6 +202,7 @@
spin_unlock(&journal_datalist_lock);
unlock_journal(journal);
ll_rw_block(WRITE, bufs, wbuf);
+ run_task_queue(&tq_disk);
lock_journal(journal);
journal_brelse_array(wbuf, bufs);
goto write_out_data;
@@ -410,6 +411,7 @@
bh->b_end_io = end_buffer_io_sync;
submit_bh(WRITE, bh);
}
+ run_task_queue(&tq_disk);
lock_journal(journal);

/* Force a new descriptor to be generated next

> When the system is livelocked, all I can tell at the moment (I am at
> home and the console is at work so I cannot use alt-sysrq) is that
> kjournal is waiting in wait_on_buffer and an nfsd thread is waiting on
> the journal.

That sounds like Something Weird is going on. wait_on_buffer will
unplug and the disks should be going hell-for-leather.

> I will try to explore it more deeply next time I am at work, but if
> there are any suggestions as to what it might be, or how I might more
> easily find out what is going on, I am all ears.
>

I'll see if I can get it to happen here. Thanks.


2001-07-08 06:03:03

by NeilBrown

Subject: Re: ext3-2.4-0.9.0

On Sunday July 8, [email protected] wrote:
>
> Could well be. ext3 will happily feed 2,000 buffers into submit_bh()
> prior to running tq_disk. Everything else is happy with this, so I blame
> nfsd and raid5 :) Rapid fsyncs will break this up, however.
>

raid5 is definitely happy with large sequences of requests between
tq_disk runs (in fact, that is best), but I think I have found a situation
where lots of small requests can confuse it. It seems that your
intuition about the direction of blame is better than mine :-)

When a write request arrives at raid5, the queue is (potentially)
plugged, and then the request is (potentially) queued, and there is a
window between the two where the queue can be unplugged by another
process. If this happens, the tq_disk run that follows the write
request will not wake up raid5d, so the raid5 queue will not be
run, and the request will just sit there until something else causes
raid5d to run.
I'm guessing that ext3 imposes more sequencing on requests than ext2
does, and so it is easier for one stalled request to stall the
whole filesystem.
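
A toy, single-threaded walk-through of that window (the names are
invented for illustration; only the plug / unplug / queue ordering
matters, not the real raid5.c structures):

/*
 * Toy walk-through of the lost-wakeup window described above.
 */
#include <stdio.h>

struct toy_conf {
    int plugged;          /* queue plugged, hoping for more writes */
    int delayed_requests; /* writes parked on the delayed list     */
};

/* stand-in for md_wakeup_thread(): raid5d runs and drains the delayed list */
static void raid5d_runs(struct toy_conf *c)
{
    if (c->delayed_requests) {
        printf("raid5d: pushed out %d delayed request(s)\n",
               c->delayed_requests);
        c->delayed_requests = 0;
    }
    /* otherwise nothing to do, back to sleep */
}

static void plug_queue(struct toy_conf *c)
{
    c->plugged = 1;
}

static void queue_delayed_write(struct toy_conf *c)
{
    c->delayed_requests++;
}

/* pre-patch unplug path: only wakes raid5d if the queue is still plugged */
static void unplug_queue(struct toy_conf *c)
{
    if (c->plugged) {
        c->plugged = 0;
        raid5d_runs(c);
    }
}

int main(void)
{
    struct toy_conf c = { 0, 0 };

    plug_queue(&c);          /* writer: raid5 plugs its queue */
    unplug_queue(&c);        /* another process unplugs inside the window;
                                raid5d runs but finds nothing to do */
    queue_delayed_write(&c); /* writer: the request lands on the delayed list */
    unplug_queue(&c);        /* writer's run_task_queue(&tq_disk): the queue is
                                already unplugged, so raid5d is never woken */

    printf("stranded delayed requests: %d\n", c.delayed_requests);
    return 0;
}

The patch that follows closes the window from both ends: adding a stripe
to the delayed list now wakes the thread directly, and the unplug path
wakes the thread whether or not the queue was still marked plugged.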

In any case, the following patch against raid5 seems to have relieved the
situation, but more testing is underway.

So thank you to ext3 for helping to find a bug in raid5 :-)

NeilBrown

--- drivers/md/raid5.c 2001/07/07 06:23:02 1.1
+++ drivers/md/raid5.c 2001/07/08 00:22:52
@@ -66,9 +66,10 @@
BUG();
if (atomic_read(&conf->active_stripes)==0)
BUG();
- if (test_bit(STRIPE_DELAYED, &sh->state))
+ if (test_bit(STRIPE_DELAYED, &sh->state)) {
list_add_tail(&sh->lru, &conf->delayed_list);
- else if (test_bit(STRIPE_HANDLE, &sh->state)) {
+ md_wakeup_thread(conf->thread);
+ } else if (test_bit(STRIPE_HANDLE, &sh->state)) {
list_add_tail(&sh->lru, &conf->handle_list);
md_wakeup_thread(conf->thread);
} else {
@@ -1167,10 +1168,9 @@

raid5_activate_delayed(conf);

- if (conf->plugged) {
+ if (conf->plugged)
conf->plugged = 0;
- md_wakeup_thread(conf->thread);
- }
+ md_wakeup_thread(conf->thread);
spin_unlock_irqrestore(&conf->device_lock, flags);
}