2001-07-10 09:47:49

by Mike Black

Subject: 2.4.6 and ext3-2.4-0.9.1-246

I started testing 2.4.6 with ext3-2.4-0.9.1-246 yesterday morning and
immediately hit a wall.

Testing on an SMP kernel -- dual IDE RAID1 set -- the system temporarily
locked up (telnet window stops until disk I/O is complete).
I'm using tiobench-0.3.2 and do have unmaskirq turned on, so it
shouldn't be IRQ contention.
/dev/hda:
multcount = 0 (off)
I/O support = 1 (32-bit)
unmaskirq = 1 (on)
using_dma = 1 (on)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)
/dev/hdc:
multcount = 0 (off)
I/O support = 1 (32-bit)
unmaskirq = 1 (on)
using_dma = 1 (on)
keepsettings = 0 (off)
nowerr = 0 (off)
readonly = 0 (off)
readahead = 8 (on)

Investigating this some, I noticed that kswapd was taking a LOT of CPU time
(although there was only 10Meg in swap). The swap files are located on the
RAID1 IDE set.
So...I moved the swapfiles to my SCSI subsystem (also EXT3 at this point)
and tested again.
Smoother, although there was still quite a bit of jerkiness in the telnet
window.
So...swap on IDE/RAID1/EXT3 was a bad idea...I'd say 80% better when swap was
moved off the IDE system to SCSI.

Here's my RAID1/IDE benchmark with EXT3
...oops...spoke too soon.
The tiobench.pl locked up on 8 threads (after doing 1, 2, & 4). I had to do
an ALT-SYSRQ-B as all windows were dead, although I could get a login prompt.

It really looks like tiobench is a good stress tester for ext3.
________________________________________
Michael D. Black Principal Engineer
[email protected] 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355


2001-07-10 17:53:16

by Andreas Dilger

Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246

Mike Black writes:
> I started testing 2.4.6 with ext3-2.4-0.9.1-246 yesterday morning and
> immediately hit a wall.
>
> Testing on an SMP kernel -- dual IDE RAID1 set -- the system temporarily
> locked up (telnet window stops until disk I/O is complete).
> Investigating this some, I noticed that kswapd was taking a LOT of CPU time
> (although there was only 10Meg in swap). The swap files are located on the
> RAID1 IDE set.

Are you saying you have swap _files_, or is that a typo? Not to say that this
is illegal or anything, but it sure is a waste of CPU/disk performance. If
you are swapping to a file on a journaled filesystem, you have a huge amount
of unnecessary overhead. Rather, have a swap partition and avoid the fs
altogether.

It is also possible that there are still problems with the core kernel swap
code, and they are just more noticeable when swapping on ext3. What form of
journaling are you using? Ordered, writeback, or full data journaling?

> Here's my RAID1/IDE benchmark with EXT3
> ..ooops...spoke too soon.
> The tiobench.pl locked up on 8 threads (after doing 1, 2, & 4). I had to do
> an ALT-SYSRQ-B as all windows were dead, although I could get a login prompt.

I've CC'd this to ext2-devel, where the core ext3 developers are more likely
to see it.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-07-10 18:18:16

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246

Hi,

On Tue, Jul 10, 2001 at 01:59:40PM -0400, Mike Black wrote:
> Yep -- I said __files__ -- I'm less concerned about performance than
> reliability -- I don't think you can RAID1 a swap partition, can you?

You can on 2.4. 2.2 would let you do it but it was unsafe --- swap
could interact badly with raid reconstruction. 2.4 should be OK.

> Also,
> having it in files allows me to easily add more swap as needed.
> As far as journalling mode I just used tune2fs to put a journal on with
> default parameters so I assume that's full journaling.

The swap code bypasses filesystem writes: all it does is to ask the
filesystem where on disk the data resides, then it performs IO
straight to those disk blocks. The data journaling mode doesn't
really matter there.
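
As an illustration of that lookup (a userspace sketch of ours, not code
from the thread): the FIBMAP ioctl exposes the same per-block bmap query,
mapping a file's logical blocks to physical blocks on the underlying
device. Needs root; the 16-block limit is arbitrary.

/*
 * fibmap.c - sketch of the "ask the filesystem where on disk the
 * data resides" step.  Build with: cc -o fibmap fibmap.c
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FIBMAP, FIGETBSZ */

int main(int argc, char **argv)
{
	int fd, blksz, i;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
		fprintf(stderr, "usage: fibmap <file>\n");
		return 1;
	}
	if (ioctl(fd, FIGETBSZ, &blksz) < 0) {	/* fs block size */
		perror("FIGETBSZ");
		return 1;
	}
	for (i = 0; i < 16; i++) {	/* first 16 logical blocks */
		int blk = i;		/* in: logical block number */
		if (ioctl(fd, FIBMAP, &blk) < 0) {	/* out: physical */
			perror("FIBMAP");
			return 1;
		}
		/* 0 means a hole - not allowed in a swap file */
		printf("logical %3d -> physical %d (%d-byte blocks)\n",
		       i, blk, blksz);
	}
	close(fd);
	return 0;
}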

Cheers,
Stephen

2001-07-10 18:27:26

by Mike Black

Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246

So it sounds like there's no advantage, then, to a swap partition vs. a file?

----- Original Message -----
From: "Stephen C. Tweedie" <[email protected]>
To: "Mike Black" <[email protected]>
Cc: "Andreas Dilger" <[email protected]>; "[email protected]"
<[email protected]>; "Ext2 development mailing list"
<[email protected]>
Sent: Tuesday, July 10, 2001 2:17 PM
Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246


> Hi,
>
> On Tue, Jul 10, 2001 at 01:59:40PM -0400, Mike Black wrote:
> > Yep -- I said __files__ -- I'm less concerned about performance than
> > reliability -- I don't think you can RAID1 a swap partition can you?
>
> You can on 2.4. 2.2 would let you do it but it was unsafe --- swap
> could interact badly with raid reconstruction. 2.4 should be OK.
>
> > Also,
> > having it in files allows me to easily add more swap as needed.
> > As far as journalling mode I just used tune2fs to put a journal on with
> > default parameters so I assume that's full journaling.
>
> The swap code bypasses filesystem writes: all it does is to ask the
> filesystem where on disk the data resides, then it performs IO
> straight to those disk blocks. The data journaling mode doesn't
> really matter there.
>
> Cheers,
> Stephen

2001-07-10 18:30:16

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246

Hi,

On Tue, Jul 10, 2001 at 02:27:00PM -0400, Mike Black wrote:
> So it sounds like there's no advantage, then, to a swap partition vs. a file?

There are --- the cost of accessing the metadata to do the file
block-number lookup, and the fragmentation you can get in files, both add
to the cost of swap files compared to partitions.

Cheers,
Stephen

2001-07-11 04:07:27

by Andrew Morton

Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246

Mike Black wrote:
>
> I started testing 2.4.6 with ext3-2.4-0.9.1-246 yesterday morning and
> immediately hit a wall.
>
> Testing on an SMP kernel -- dual IDE RAID1 set -- the system temporarily
> locked up (telnet window stops until disk I/O is complete).

Mike, we're going to need a lot more detail to reproduce this.

Let me describe how I didn't reproduce it and perhaps
you can point out any differences:

- Kernel 2.4.6+ext3-2.4-0.9.1.

- Two 4gig IDE partitions on separate disks combined into a
RAID1 device.

- 64 megs of memory (32meg lowmem, 32meg highmem)

- 1 gig swapfile on the ext3 raid device.

- Ran ./tiobench.pl --threads 16

That's a *lot* more aggressive than your setup, yet
it ran to completion quite happily.

I'd be particularly interested in knowing how much memory
you're using. It certainly sounds like you're experiencing
memory exhaustion. ext3's ability to recover from out-of-memory
situations was weakened recently so as to reduce our impact
on core kernel code. I'll be generating an incremental patch
which puts that code back in.

In the meantime, could you please retest with this somewhat lame
alternative?



--- linux-2.4.6/mm/vmscan.c Wed Jul 4 18:21:32 2001
+++ lk-ext3/mm/vmscan.c Wed Jul 11 14:03:10 2001
@@ -852,6 +870,9 @@ static int do_try_to_free_pages(unsigned
* list, so this is a relatively cheap operation.
*/
if (free_shortage()) {
+ extern void shrink_journal_memory(void);
+
+ shrink_journal_memory();
ret += page_launder(gfp_mask, user);
shrink_dcache_memory(DEF_PRIORITY, gfp_mask);
shrink_icache_memory(DEF_PRIORITY, gfp_mask);

2001-07-11 12:18:19

by Mike Black

Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246

My system is:
Dual 1Ghz PIII
2G RAM
2x2G swapfiles
And I ran tiobench as tiobench.pl --size 4000 (twice memory)

Me thinkst SMP is probably the biggest difference in this list.

I ran this on another "smaller" memory (still dual CPU though) machine and
noticed this on top:

12983 root 15 0 548 544 448 D 73.6 0.2 0:11 tiotest
3 root 18 0 0 0 0 SW 72.6 0.0 0:52 kswapd

kswapd is taking an awful lot of CPU time. Not sure why it should be
hitting swap at all.
I noticed a similar behavior even with NO swap -- kswapd still chewing up
time.

________________________________________
Michael D. Black Principal Engineer
[email protected] 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355
----- Original Message -----
From: "Andrew Morton" <[email protected]>
To: "Mike Black" <[email protected]>
Cc: "[email protected]" <[email protected]>;
<[email protected]>
Sent: Wednesday, July 11, 2001 12:08 AM
Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246


Mike Black wrote:
>
> I started testing 2.4.6 with ext3-2.4-0.9.1-246 yesterday morning and
> immediately hit a wall.
>
> Testing on an SMP kernel -- dual IDE RAID1 set -- the system temporarily
> locked up (telnet window stops until disk I/O is complete).

Mike, we're going to need a lot more detail to reproduce this.

Let me describe how I didn't reproduce it and perhaps
you can point out any differences:

- Kernel 2.4.6+ext3-2.4-0.9.1.

- Two 4gig IDE partitions on separate disks combined into a
RAID1 device.

- 64 megs of memory (32meg lowmem, 32meg highmem)

- 1 gig swapfile on the ext3 raid device.

- Ran ./tiobench.pl --threads 16

That's a *lot* more aggressive than your setup, yet
it ran to completion quite happily.

I'd be particularly interested in knowing how much memory
you're using. It certainly sounds like you're experiencing
memory exhaustion. ext3's ability to recover from out-of-memory
situations was weakened recently so as to reduce our impact
on core kernel code. I'll be generating an incremental patch
which puts that code back in.

In the meantime, could you please retest with this somewhat lame
alternative?



[patch snipped]

2001-07-11 15:36:09

by Andrew Morton

Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246

Mike Black wrote:
>
> My system is:
> Dual 1Ghz PIII
> 2G RAM
> 2x2G swapfiles
> And I ran tiobench as tiobench.pl --size 4000 (twice memory)
>
> Me thinkst SMP is probably the biggest difference in this list.

No, the problem is in RAID1. The buffer allocation in there is nowhere
near strong enough for these loads.

> I ran this on another "smaller" memory (still dual CPU though) machine and
> noticed this on top:
>
> 12983 root 15 0 548 544 448 D 73.6 0.2 0:11 tiotest
> 3 root 18 0 0 0 0 SW 72.6 0.0 0:52 kswapd
>
> kswapd is taking an awful lot of CPU time. Not sure why it should be
> hitting swap at all.

It's not trying to swap stuff out - it's trying to find pages
to recycle. kswapd often goes berserk like this. I think it
was a design objective.



For me, RAID1 works OK with tiobench, but it is trivially deadlockable
with other workloads. The usual failure mode is for bdflush to be
stuck in raid1_alloc_r1bh() - can't allocate any more r1bh's, can't
move dirty buffers to disk. Dead.

The below patch increases the size of the reserved r1bh pool, scales it
by PAGE_CACHE_SIZE and introduces a reservation policy for PF_FLUSH
callers (ie: bdflush). That fixes the raid1_alloc_r1bh() deadlocks.

bdflush can also deadlock in raid1_alloc_bh(), trying to allocate
buffer_heads. So we do the same thing there.

Putting swap on RAID1 would definitely have exacerbated the problem.
The last thing we want to do when we're trying to push stuff out
of memory is to have to allocate more of it. So I allowed PF_MEMALLOC
tasks to bite into the reserves as well.


Please, if you have time, apply and retest.

--- linux-2.4.6/include/linux/sched.h Wed May 2 22:00:07 2001
+++ lk-ext3/include/linux/sched.h Thu Jul 12 01:03:20 2001
@@ -413,7 +418,7 @@ struct task_struct {
#define PF_SIGNALED 0x00000400 /* killed by a signal */
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
#define PF_VFORK 0x00001000 /* Wake up parent in mm_release */
-
+#define PF_FLUSH 0x00002000 /* Flushes buffers to disk */
#define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */

/*
--- linux-2.4.6/include/linux/raid/raid1.h Tue Dec 12 08:20:08 2000
+++ lk-ext3/include/linux/raid/raid1.h Thu Jul 12 01:15:39 2001
@@ -37,12 +37,12 @@ struct raid1_private_data {
/* buffer pool */
/* buffer_heads that we have pre-allocated have b_pprev -> &freebh
* and are linked into a stack using b_next
- * raid1_bh that are pre-allocated have R1BH_PreAlloc set.
* All these variable are protected by device_lock
*/
struct buffer_head *freebh;
int freebh_cnt; /* how many are on the list */
struct raid1_bh *freer1;
+ unsigned freer1_cnt;
struct raid1_bh *freebuf; /* each bh_req has a page allocated */
md_wait_queue_head_t wait_buffer;

@@ -87,5 +87,4 @@ struct raid1_bh {
/* bits for raid1_bh.state */
#define R1BH_Uptodate 1
#define R1BH_SyncPhase 2
-#define R1BH_PreAlloc 3 /* this was pre-allocated, add to free list */
#endif
--- linux-2.4.6/fs/buffer.c Wed Jul 4 18:21:31 2001
+++ lk-ext3/fs/buffer.c Thu Jul 12 01:03:57 2001
@@ -2685,6 +2748,7 @@ int bdflush(void *sem)
sigfillset(&tsk->blocked);
recalc_sigpending(tsk);
spin_unlock_irq(&tsk->sigmask_lock);
+ current->flags |= PF_FLUSH;

up((struct semaphore *)sem);

@@ -2726,6 +2790,7 @@ int kupdate(void *sem)
siginitsetinv(&current->blocked, sigmask(SIGCONT) | sigmask(SIGSTOP));
recalc_sigpending(tsk);
spin_unlock_irq(&tsk->sigmask_lock);
+ current->flags |= PF_FLUSH;

up((struct semaphore *)sem);

--- linux-2.4.6/drivers/md/raid1.c Wed Jul 4 18:21:26 2001
+++ lk-ext3/drivers/md/raid1.c Thu Jul 12 01:28:58 2001
@@ -51,6 +51,28 @@ static mdk_personality_t raid1_personali
static md_spinlock_t retry_list_lock = MD_SPIN_LOCK_UNLOCKED;
struct raid1_bh *raid1_retry_list = NULL, **raid1_retry_tail;

+/*
+ * We need to scale the number of reserved buffers by the page size
+ * to make writepage()s successful. --akpm
+ */
+#define R1_BLOCKS_PP (PAGE_CACHE_SIZE / 1024)
+#define FREER1_MEMALLOC_RESERVED (16 * R1_BLOCKS_PP)
+
+/*
+ * Return true if the caller may take a bh from the list.
+ * PF_FLUSH and PF_MEMALLOC tasks are allowed to use the reserves, because
+ * they're trying to *free* some memory.
+ *
+ * Requires that conf->device_lock be held.
+ */
+static int may_take_bh(raid1_conf_t *conf, int cnt)
+{
+ int min_free = (current->flags & (PF_FLUSH|PF_MEMALLOC)) ?
+ cnt :
+ (cnt + FREER1_MEMALLOC_RESERVED * conf->raid_disks);
+ return conf->freebh_cnt >= min_free;
+}
+
static struct buffer_head *raid1_alloc_bh(raid1_conf_t *conf, int cnt)
{
/* return a linked list of "cnt" struct buffer_heads.
@@ -62,7 +84,7 @@ static struct buffer_head *raid1_alloc_b
while(cnt) {
struct buffer_head *t;
md_spin_lock_irq(&conf->device_lock);
- if (conf->freebh_cnt >= cnt)
+ if (may_take_bh(conf, cnt))
while (cnt) {
t = conf->freebh;
conf->freebh = t->b_next;
@@ -83,7 +105,7 @@ static struct buffer_head *raid1_alloc_b
cnt--;
} else {
PRINTK("raid1: waiting for %d bh\n", cnt);
- wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
+ wait_event(conf->wait_buffer, may_take_bh(conf, cnt));
}
}
return bh;
@@ -96,9 +118,9 @@ static inline void raid1_free_bh(raid1_c
while (bh) {
struct buffer_head *t = bh;
bh=bh->b_next;
- if (t->b_pprev == NULL)
+ if (conf->freebh_cnt >= FREER1_MEMALLOC_RESERVED) {
kfree(t);
- else {
+ } else {
t->b_next= conf->freebh;
conf->freebh = t;
conf->freebh_cnt++;
@@ -108,29 +130,6 @@ static inline void raid1_free_bh(raid1_c
wake_up(&conf->wait_buffer);
}

-static int raid1_grow_bh(raid1_conf_t *conf, int cnt)
-{
- /* allocate cnt buffer_heads, possibly less if kalloc fails */
- int i = 0;
-
- while (i < cnt) {
- struct buffer_head *bh;
- bh = kmalloc(sizeof(*bh), GFP_KERNEL);
- if (!bh) break;
- memset(bh, 0, sizeof(*bh));
-
- md_spin_lock_irq(&conf->device_lock);
- bh->b_pprev = &conf->freebh;
- bh->b_next = conf->freebh;
- conf->freebh = bh;
- conf->freebh_cnt++;
- md_spin_unlock_irq(&conf->device_lock);
-
- i++;
- }
- return i;
-}
-
static int raid1_shrink_bh(raid1_conf_t *conf, int cnt)
{
/* discard cnt buffer_heads, if we can find them */
@@ -147,7 +146,16 @@ static int raid1_shrink_bh(raid1_conf_t
md_spin_unlock_irq(&conf->device_lock);
return i;
}
-
+
+/*
+ * Return true if the caller may take a raid1_bh from the list.
+ * Requires that conf->device_lock be held.
+ */
+static int may_take_r1bh(raid1_conf_t *conf)
+{
+ return ((conf->freer1_cnt > FREER1_MEMALLOC_RESERVED) ||
+ (current->flags & (PF_FLUSH|PF_MEMALLOC))) && conf->freer1;
+}

static struct raid1_bh *raid1_alloc_r1bh(raid1_conf_t *conf)
{
@@ -155,8 +163,9 @@ static struct raid1_bh *raid1_alloc_r1bh

do {
md_spin_lock_irq(&conf->device_lock);
- if (conf->freer1) {
+ if (may_take_r1bh(conf)) {
r1_bh = conf->freer1;
+ conf->freer1_cnt--;
conf->freer1 = r1_bh->next_r1;
r1_bh->next_r1 = NULL;
r1_bh->state = 0;
@@ -170,7 +179,7 @@ static struct raid1_bh *raid1_alloc_r1bh
memset(r1_bh, 0, sizeof(*r1_bh));
return r1_bh;
}
- wait_event(conf->wait_buffer, conf->freer1);
+ wait_event(conf->wait_buffer, may_take_r1bh(conf));
} while (1);
}

@@ -178,49 +187,30 @@ static inline void raid1_free_r1bh(struc
{
struct buffer_head *bh = r1_bh->mirror_bh_list;
raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);
+ unsigned long flags;

r1_bh->mirror_bh_list = NULL;

- if (test_bit(R1BH_PreAlloc, &r1_bh->state)) {
- unsigned long flags;
- spin_lock_irqsave(&conf->device_lock, flags);
+ spin_lock_irqsave(&conf->device_lock, flags);
+ if (conf->freer1_cnt < FREER1_MEMALLOC_RESERVED) {
r1_bh->next_r1 = conf->freer1;
conf->freer1 = r1_bh;
+ conf->freer1_cnt++;
spin_unlock_irqrestore(&conf->device_lock, flags);
} else {
+ spin_unlock_irqrestore(&conf->device_lock, flags);
kfree(r1_bh);
}
raid1_free_bh(conf, bh);
}

-static int raid1_grow_r1bh (raid1_conf_t *conf, int cnt)
-{
- int i = 0;
-
- while (i < cnt) {
- struct raid1_bh *r1_bh;
- r1_bh = (struct raid1_bh*)kmalloc(sizeof(*r1_bh), GFP_KERNEL);
- if (!r1_bh)
- break;
- memset(r1_bh, 0, sizeof(*r1_bh));
-
- md_spin_lock_irq(&conf->device_lock);
- set_bit(R1BH_PreAlloc, &r1_bh->state);
- r1_bh->next_r1 = conf->freer1;
- conf->freer1 = r1_bh;
- md_spin_unlock_irq(&conf->device_lock);
-
- i++;
- }
- return i;
-}
-
static void raid1_shrink_r1bh(raid1_conf_t *conf)
{
md_spin_lock_irq(&conf->device_lock);
while (conf->freer1) {
struct raid1_bh *r1_bh = conf->freer1;
conf->freer1 = r1_bh->next_r1;
+ conf->freer1_cnt--; /* pedantry */
kfree(r1_bh);
}
md_spin_unlock_irq(&conf->device_lock);
@@ -1610,21 +1600,6 @@ static int raid1_run (mddev_t *mddev)
goto out_free_conf;
}

-
- /* pre-allocate some buffer_head structures.
- * As a minimum, 1 r1bh and raid_disks buffer_heads
- * would probably get us by in tight memory situations,
- * but a few more is probably a good idea.
- * For now, try 16 r1bh and 16*raid_disks bufferheads
- * This will allow at least 16 concurrent reads or writes
- * even if kmalloc starts failing
- */
- if (raid1_grow_r1bh(conf, 16) < 16 ||
- raid1_grow_bh(conf, 16*conf->raid_disks)< 16*conf->raid_disks) {
- printk(MEM_ERROR, mdidx(mddev));
- goto out_free_conf;
- }
-
for (i = 0; i < MD_SB_DISKS; i++) {

descriptor = sb->disks+i;
@@ -1713,6 +1688,8 @@ out_free_conf:
raid1_shrink_r1bh(conf);
raid1_shrink_bh(conf, conf->freebh_cnt);
raid1_shrink_buffers(conf);
+ if (conf->freer1_cnt != 0)
+ BUG();
kfree(conf);
mddev->private = NULL;
out:

2001-07-12 10:55:38

by Mike Black

Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246

Nope -- still locked up on 8 threads....however...it's apparently not RAID1
causing this.
I'm repeating this now on my SCSI 7x36G RAID5 set and seeing similar
behavior. It's a little better, though, since it's SCSI.
Since IDE hits the CPU harder, the system appeared to lock up for a lot
longer -- it might have finished but I couldn't afford to wait that long.
The CPU is hitting 100% system usage, which makes it appear as though it is
locked up.
I've got a vmstat running in a window and it pauses a lot. When I was
testing the IDE RAID1 it paused (locked?) for a LONG time.
But it is recovering from the 100% system usage, and here is what it has so
far:
tiobench.pl --size 4000
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec

File   Block  Num  Seq Read    Rand Read   Seq Write   Rand Write
Dir    Size   Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------ ------ ---- --- ----------- ----------- ----------- -----------
.      4000   4096   1 64.71 51.4% 0.826 2.00% 21.78 32.7% 1.218 0.85%
.      4000   4096   2 23.28 21.7% 0.935 1.76% 7.374 39.1% 1.261 0.96%
.      4000   4096   4 20.74 20.7% 1.087 2.50% 5.399 46.8% 1.278 1.09%

It's banging like crazy on the 8-thread run and I'm trying to let it finish
but it's really slow and non-responsive.
Here's the latest vmstat (10 second increments):
procs                     memory      swap          io     system        cpu
 r  b  w swpd  free  buff   cache  si  so  bi    bo   in    cs  us  sy  id
 4 82  6    0  3604  4244 1902520   0   0   6  3342  386   347   0 100   0
 2 84  6    0  3620  4252 1902488   0   0   1  1506  173    17   0  99   1
 4 82  6    0  3636  4228 1902472   0   0   1  3749  448   237   0  84  16
 8 79  7    0  3620  4236 1902448   0   0   2   966  199    56   0  98   2
 4 80  6    0  3620  4252 1902336   0   0   1  1040  330   557   0 100   0
 0 86  5    0  3624  4252 1902332   0   0   0   627  335   725   0  98   2
15 75  5    0  3624  4264 1902636   0   0   1   953  494   182   0  90  10
16 76  6    0  3564  4280 1902748   0   0   1  1581  595   354   0  87  13
11 80  6    0  3564  4292 1902740   0   0   1  1337  174    67   0 100   0
18 74  6    0  3560  4308 1902716   0   0   0   703  313   353   0 100   0
 7 78  7    0  3560  4324 1902632   0   0   5  2181  301   626   0 100   0
 7 79  7    0  3560  4332 1902628   0   0   1   732  351   163   0 100   0
11 81  8    0  3224  4324 1902968   0   0   0     1  280   214   0 100   0
 9 76  7    0  3560  4332 1902624   0   0   0   569  270    83   0 100   0
 6 77  6    0  2832  4336 1903340   0   0   0   910  281   268   0 100   0
 3 83  7    0  3564  4336 1902604   0   0   0   487  281   130   0 100   0
17 77  7    0  3560  4344 1902600   0   0   0  1056  377   102   0 100   0
 9 76  7    0  3560  4364 1902256   0   0   1  3030  517   696   0 100   0
11 75  6    0  3560  4384 1902328   0   0   1  2145  230   131   0 100   0
12 72  7    0  3560  4416 1902296   0   0  16  2487  493    82   0  99   1
 9 76  6    0  3560  4424 1902084   0   0  82  1938  423  1124   0 100   0

______________________
Michael D. Black Principal Engineer
[email protected] 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355
----- Original Message -----
From: "Andrew Morton" <[email protected]>
To: "Mike Black" <[email protected]>
Cc: "[email protected]" <[email protected]>;
<[email protected]>
Sent: Wednesday, July 11, 2001 11:36 AM
Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246


Mike Black wrote:
>
> My system is:
> Dual 1Ghz PIII
> 2G RAM
> 2x2G swapfiles
> And I ran tiobench as tiobench.pl --size 4000 (twice memory)
>
> Me thinkst SMP is probably the biggest difference in this list.

No, the problem is in RAID1. The buffer allocation in there is nowhere
near strong enough for these loads.

> I ran this on another "smaller" memory (still dual CPU though) machine and
> noticed this on top:
>
> 12983 root 15 0 548 544 448 D 73.6 0.2 0:11 tiotest
> 3 root 18 0 0 0 0 SW 72.6 0.0 0:52 kswapd
>
> kswapd is taking an awful lot of CPU time. Not sure why it should be
> hitting swap at all.

It's not trying to swap stuff out - it's trying to find pages
to recycle. kswapd often goes berserk like this. I think it
was a design objective.



For me, RAID1 works OK with tiobench, but it is trivially deadlockable
with other workloads. The usual failure mode is for bdflush to be
stuck in raid1_alloc_r1bh() - can't allocate any more r1bh's, can't
move dirty buffers to disk. Dead.

The below patch increases the size of the reserved r1bh pool, scales it
by PAGE_CACHE_SIZE and introduces a reservation policy for PF_FLUSH
callers (ie: bdflush). That fixes the raid1_alloc_r1bh() deadlocks.

bdflush can also deadlock in raid1_alloc_bh(), trying to allocate
buffer_heads. So we do the same thing there.

Putting swap on RAID1 would definitely have exacerbated the problem.
The last thing we want to do when we're trying to push stuff out
of memory is to have to allocate more of it. So I allowed PF_MEMALLOC
tasks to bite into the reserves as well.


Please, if you have time, apply and retest.

[patch snipped]

2001-07-12 11:33:59

by Andrew Morton

Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246

Mike Black wrote:
>
> Nope -- still locked up on 8 threads....however...it's apparently not RAID1
> causing this.


Well, aside from the RAID problems which we're triggering, you're
seeing interactions between RAID, ext3 and the VM. There's
another raid1 patch below - please test it.

> I'm repeating this now on my SCSI 7x36G RAID5 set and seeing similar
> behavior. It's a little better though since its SCSI.

RAID5 had a bug which would cause long stalls - ext3 triggered
it. It's fixed in 2.4.7-pre. I include that diff here, although
it'd be surprising if you were hitting it with that workload.

> ...
> I've got a vmstat running in a window and it pauses a lot. When I was
> testing the IDE RAID1 it paused (locked?) for a LONG time.

That's typical behaviour for an out-of-memory condition.

--- linux-2.4.6/drivers/md/raid1.c Wed Jul 4 18:21:26 2001
+++ lk-ext3/drivers/md/raid1.c Thu Jul 12 15:27:09 2001
@@ -46,6 +46,30 @@
#define PRINTK(x...) do { } while (0)
#endif

+#define __raid1_wait_event(wq, condition) \
+do { \
+ wait_queue_t __wait; \
+ init_waitqueue_entry(&__wait, current); \
+ \
+ add_wait_queue(&wq, &__wait); \
+ for (;;) { \
+ set_current_state(TASK_UNINTERRUPTIBLE); \
+ if (condition) \
+ break; \
+ run_task_queue(&tq_disk); \
+ schedule(); \
+ } \
+ current->state = TASK_RUNNING; \
+ remove_wait_queue(&wq, &__wait); \
+} while (0)
+
+#define raid1_wait_event(wq, condition) \
+do { \
+ if (condition) \
+ break; \
+ __raid1_wait_event(wq, condition); \
+} while (0)
+

static mdk_personality_t raid1_personality;
static md_spinlock_t retry_list_lock = MD_SPIN_LOCK_UNLOCKED;
@@ -83,7 +107,7 @@ static struct buffer_head *raid1_alloc_b
cnt--;
} else {
PRINTK("raid1: waiting for %d bh\n", cnt);
- wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
+ raid1_wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
}
}
return bh;
@@ -170,7 +194,7 @@ static struct raid1_bh *raid1_alloc_r1bh
memset(r1_bh, 0, sizeof(*r1_bh));
return r1_bh;
}
- wait_event(conf->wait_buffer, conf->freer1);
+ raid1_wait_event(conf->wait_buffer, conf->freer1);
} while (1);
}

--- linux-2.4.6/drivers/md/raid5.c Wed Jul 4 18:21:26 2001
+++ lk-ext3/drivers/md/raid5.c Thu Jul 12 21:31:55 2001
@@ -66,10 +66,11 @@ static inline void __release_stripe(raid
BUG();
if (atomic_read(&conf->active_stripes)==0)
BUG();
- if (test_bit(STRIPE_DELAYED, &sh->state))
- list_add_tail(&sh->lru, &conf->delayed_list);
- else if (test_bit(STRIPE_HANDLE, &sh->state)) {
- list_add_tail(&sh->lru, &conf->handle_list);
+ if (test_bit(STRIPE_HANDLE, &sh->state)) {
+ if (test_bit(STRIPE_DELAYED, &sh->state))
+ list_add_tail(&sh->lru, &conf->delayed_list);
+ else
+ list_add_tail(&sh->lru, &conf->handle_list);
md_wakeup_thread(conf->thread);
} else {
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
@@ -1167,10 +1168,9 @@ static void raid5_unplug_device(void *da

raid5_activate_delayed(conf);

- if (conf->plugged) {
- conf->plugged = 0;
- md_wakeup_thread(conf->thread);
- }
+ conf->plugged = 0;
+ md_wakeup_thread(conf->thread);
+
spin_unlock_irqrestore(&conf->device_lock, flags);
}

2001-07-13 12:23:03

by Mike Black

Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246

I haven't done the RAID5 patch yet, but I think one big problem is ext3's
interaction with kswapd.
My tiobench finally completed:

tiobench.pl --size 4000
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec

File   Block  Num  Seq Read    Rand Read   Seq Write   Rand Write
Dir    Size   Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------ ------ ---- --- ----------- ----------- ----------- -----------
.      4000   4096   1 64.71 51.4% 0.826 2.00% 21.78 32.7% 1.218 0.85%
.      4000   4096   2 23.28 21.7% 0.935 1.76% 7.374 39.1% 1.261 0.96%
.      4000   4096   4 20.74 20.7% 1.087 2.50% 5.399 46.8% 1.278 1.09%
.      4000   4096   8 18.60 19.1% 1.265 2.67% 3.106 63.6% 1.286 1.17%

The CPU culprit is kswapd...this is apparently why the system appears to
lock up.
I don't even have swap turned on.
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
3 root 14 0 0 0 0 RW 85.6 0.0 39:50 kswapd

And...when I switch back to ext2 and do the same test, kswapd barely gets
used at all:
tiobench.pl --size 4000
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec

File   Block  Num  Seq Read    Rand Read   Seq Write   Rand Write
Dir    Size   Size Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------ ------ ---- --- ----------- ----------- ----------- -----------
.      4000   4096   1 62.54 46.0% 0.806 2.27% 29.97 27.7% 1.343 0.94%
.      4000   4096   2 56.10 46.9% 1.030 3.03% 28.18 26.7% 1.320 1.30%
.      4000   4096   4 39.46 35.0% 1.204 3.34% 17.16 16.2% 1.309 1.28%
.      4000   4096   8 33.80 31.0% 1.384 3.74% 14.26 13.7% 1.309 1.21%


So...my question is: why does ext3 cause kswapd to go nuts?

________________________________________
Michael D. Black Principal Engineer
[email protected] 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355
----- Original Message -----
From: "Andrew Morton" <[email protected]>
To: "Mike Black" <[email protected]>
Cc: "[email protected]" <[email protected]>;
<[email protected]>
Sent: Thursday, July 12, 2001 7:34 AM
Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246


Mike Black wrote:
>
> Nope -- still locked up on 8 threads....however...it's apparently not RAID1
> causing this.


Well, aside from the RAID problems which we're triggering, you're
seeing interactions between RAID, ext3 and the VM. There's
another raid1 patch here, please.

> I'm repeating this now on my SCSI 7x36G RAID5 set and seeing similar
> behavior. It's a little better though since its SCSI.

RAID5 had a bug which would cause long stalls - ext3 triggered
it. It's fixed in 2.4.7-pre. I include that diff here, although
it'd be surprising if you were hitting it with that workload.

> ...
> I've got a vmstat running in a window and it pauses a lot. When I was
> testing the IDE RAID1 it paused (locked?) for a LONG time.

That's typical behaviour for an out-of-memory condition.

[patch snipped]

2001-07-13 13:55:25

by Mike Black

Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246

I give up! I'm getting file system corruption now on the ext3 partition...
and I've got a kernel oops (soon to be decoded). This is the worst file
corruption I've ever seen other than having a disk go bad.
I'm removing ext3 for now.
________________________________________
Michael D. Black Principal Engineer
[email protected] 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355

2001-07-13 14:14:30

by Andrew Morton

Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246

Mike Black wrote:
>
> I give up! I'm getting file system corruption now on the ext3 partition...
> and I've got a kernel oops (soon to be decoded). This is the worst file
> corruption I've ever seen other than having a disk go bad.

There was a truncate-related bug fixed in 0.9.2. What workload
were you using at the time?

2001-07-13 16:30:52

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246

Hi,

On Fri, Jul 13, 2001 at 09:54:56AM -0400, Mike Black wrote:
> I give up! I'm getting file system corruption now on the ext3 partition...
> and I've got a kernel oops (soon to be decoded)

Please, do send details. We already know that the VM has a hard job
under load, and journaling exacerbates that --- ext3 cannot always
write to disk without first allocating more memory, and the VM simply
doesn't have a mechanism for dealing with that reliably. It seems to
be compounded by (a) 2.4 having less write throttling than 2.2 had,
and (b) the zoned allocator getting confused about which zones
actually need to be recycled.

It's not just ext3 --- highmem bounce buffering and soft raid buffers
have the same problem, and work around it by doing their own internal
preallocation of emergency buffers. Loop devices and nbd will have a
similar problem if you use those for swap or writable mmaps, as will
NFS.
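
To make that preallocation pattern concrete, here is a simplified sketch
(ours, not the actual raid1 or bounce-buffer code, which also takes a
spinlock around the pool): fill a private pool up front, try the normal
allocator first, and dip into the reserve only when it fails, so the
writeout path can always make progress.

/*
 * reserve.c - toy model of an emergency buffer reserve.
 */
#include <stdio.h>
#include <stdlib.h>

#define RESERVE  16
#define BUF_SIZE 4096

static void *reserve[RESERVE];
static int nreserve;

static void pool_init(void)
{
	while (nreserve < RESERVE)
		reserve[nreserve++] = malloc(BUF_SIZE);
}

static void *buf_alloc(void)
{
	void *p = malloc(BUF_SIZE);	/* normal allocation first */

	if (p == NULL && nreserve > 0)	/* fall back to the reserve */
		p = reserve[--nreserve];
	return p;			/* NULL only if both fail */
}

static void buf_free(void *p)
{
	if (nreserve < RESERVE)		/* refill the reserve first */
		reserve[nreserve++] = p;
	else
		free(p);
}

int main(void)
{
	void *b;

	pool_init();
	b = buf_alloc();
	printf("got %p, %d buffers still in reserve\n", b, nreserve);
	buf_free(b);
	return 0;
}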

One proposal is to do per-zone memory reservations for the
VM's use: Ben LaHaise has prototype code for that, and we'll be testing
to see if it makes for an improvement when used with ext3.

Cheers,
Stephen

2001-07-13 17:31:14

by Mike Black

Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246

Here's the oops:
Message on console:
yeti kernel: EXT3-fs error (device md(9,0)): ext3_new_inode: reserved inode
or inode > inodes count - block_group = 0,inode=1

Here's line 575:
J_ASSERT_JH(jh, !buffer_locked(jh2bh(jh)));

Kernel BUG at transaction.c:575!
invalid operand: 0000
CPU: 1
EIP: 0010:[<c015b21d>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010282
eax: 00000021 ebx: df83e850 ecx: 00000001 edx: 00000001
esi: d13fa880 edi: cf83e850 ebp: f7856600 esp: f73e5cac
ds: 0018 es: 0018 ss: 0018
Process syslogd (pid: 57, stackpage=f73e5000)
Stack: c0245fcb c0246140 0000023f f7856600 d13fa880 cf83e850 f78576694 c3217a58
       00000000 00000000 00000000 d73f42a0 c015b689 d13fa880 cf83e850 00000000
       00000912 f7856800 00000913 f73e5d34 c01529e9 d13fa880 f784eec0 d13fa880
Call Trace: c015b689 c01529e9 c01540ee c01543eb c01546ac c0154952 c015b62a
       c0135694 c0154b46 c0135c88 c01364d6 c0154f96 c0154ad4 c01270b2
       c0154f96 c0154ad4 c01270b2 c01531be c01331b6 c01531a4 c01332c5 c0106c7b
Code: 0f 0b 83 c4 0c f0 fe 0d a0 aa 28 c0 0f 88 35 f5 0c 00 8b 53

>>EIP; c015b21d <do_get_write_access+205/638> <=====
Trace; c015b689 <journal_get_write_access+39/5c>
Trace; c01529e9 <ext3_new_block+349/55c>
Trace; c01540ee <ext3_alloc_block+1e/24>
Trace; c01543eb <ext3_alloc_branch+3f/24c>
Trace; c01546ac <ext3_splice_branch+b4/130>
Trace; c0154952 <ext3_get_block_handle+22a/3ac>
Trace; c015b62a <do_get_write_access+612/638>
Code; c015b21d <do_get_write_access+205/638>
00000000 <_EIP>:
Code; c015b21d <do_get_write_access+205/638> <=====
0: 0f 0b ud2a <=====
Code; c015b21f <do_get_write_access+207/638>
2: 83 c4 0c add $0xc,%esp
Code; c015b222 <do_get_write_access+20a/638>
5: f0 fe 0d a0 aa 28 c0 lock decb 0xc028aaa0
Code; c015b229 <do_get_write_access+211/638>
c: 0f 88 35 f5 0c 00 js cf547 <_EIP+0xcf547> c022a764 <stext_lock+33bc/92c6>
Code; c015b22f <do_get_write_access+217/638>
12: 8b 53 00 mov 0x0(%ebx),%edx


________________________________________
Michael D. Black Principal Engineer
[email protected] 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355
----- Original Message -----
From: "Andrew Morton" <[email protected]>
To: "Mike Black" <[email protected]>
Cc: "[email protected]" <[email protected]>;
<[email protected]>
Sent: Friday, July 13, 2001 10:15 AM
Subject: Re: 2.4.6 and ext3-2.4-0.9.1-246


Mike Black wrote:
>
> I give up! I'm getting file system corruption now on the ext3 partition...
> and I've got a kernel oops (soon to be decoded). This is the worst file
> corruption I've ever seen other than having a disk go bad.

There was a truncate-related bug fixed in 0.9.2. What workload
were you using at the time?

2001-07-13 17:38:25

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246

Hi,

On Fri, Jul 13, 2001 at 01:30:34PM -0400, Mike Black wrote:
> Here's the oops:
> Message on console:
> yeti kernel: EXT3-fs error (device md(9,0)): ext3_new_inode: reserved inode
> or inode > inodes count - block_group = 0,inode=1
>
> Here line 575:
> J_ASSERT_JH(jh, !buffer_locked(jh2bh(jh)));

Many thanks. Were there any other log messages at all?

Cheers,
Stephen

2001-07-14 10:43:22

by Mike Black

Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246

Only when I rebooted and fsck ran :-(

----- Original Message -----
From: "Stephen C. Tweedie" <[email protected]>
To: "Mike Black" <[email protected]>
Cc: "Andrew Morton" <[email protected]>; "[email protected]"
<[email protected]>; <[email protected]>
Sent: Friday, July 13, 2001 1:38 PM
Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246


> Hi,
>
> On Fri, Jul 13, 2001 at 01:30:34PM -0400, Mike Black wrote:
> > Here's the oops:
> > Message on console:
> > yeti kernel: EXT3-fs error (device md(9,0)): ext3_new_inode: reserved inode
> > or inode > inodes count - block_group = 0,inode=1
> >
> > Here line 575:
> > J_ASSERT_JH(jh, !buffer_locked(jh2bh(jh)));
>
> Many thanks. Were there any other log messages at all?
>
> Cheers,
> Stephen

2001-07-14 10:52:52

by Andrew Morton

Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246

Mike Black wrote:
>
> Only when I rebooted and fsck ran :-(
>

What version of ext3 was it?

It's quite easy to reproduce the raid5/VM problems here - the
system slows to a crawl with the disk only using about 1/10th
of its bandwidth. Much worse if highmem is enabled.

Does this match your observations?

-

2001-07-14 11:57:39

by Andrew Morton

Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246

Mike Black wrote:
>
> Ummm...that would be the version(s) mentioned in the subject line???? :-)

doh.

OK, there was a nasty bug in 0.9.1 which I was not able to trigger
in a solid month's testing. But others with more worthy hardware
were able to find it quite quickly. Stephen fixed it in 0.9.2.
I don't know if it explains the failure you saw. This:

EXT3-fs error (device md(9,0)): ext3_new_inode: reserved
inode or inode > inodes count - block_group = 0,inode=1

is nasty. The LRU cache of inode bitmaps got wrecked. Ugly.

Maybe one more try?

> My .config has
> # CONFIG_NOHIGHMEM is not set
> CONFIG_HIGHMEM4G=y
> # CONFIG_HIGHMEM64G is not set
> CONFIG_HIGHMEM=y
> I've got 2G of RAM
>
> And the main thing I noticed was kswapd going nuts -- this was NOT observed
> with the same tiobench on ext2 (same filesystem). The performance with ext3
> reduced by about 66% on two threads -- and I think that is due to kswapd
> hogging CPU time.

Yup. I've nailed this one - it's lovely.

I'll be back.

-

2001-07-16 18:23:40

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: 2.4.6 and ext3-2.4-0.9.1-246

Hi,

On Sat, Jul 14, 2001 at 09:58:42PM +1000, Andrew Morton wrote:

> OK, there was a nasty bug in 0.9.1 which I was not able to trigger
> in a solid month's testing. But others with more worthy hardware
> were able to find it quite quickly.

It would depend very much on the workload. The problem would only
occur if you had two tasks collide when trying to allocate a block at
the same time, which essentially means doing mmap writes in the middle
of a sparse file. Most workloads would not ever trigger that no
matter how much you tried.
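
For concreteness, the kind of workload being described might look like
this userspace sketch (our construction; the file name and sizes are
arbitrary): two processes taking write faults in the holes of a shared,
sparse mapping at the same time, so both can try to allocate blocks
concurrently.

/*
 * sparse-collide.c - two tasks doing mmap writes into the middle
 * of a sparse file simultaneously.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define FILE_SIZE (64 * 1024 * 1024)	/* 64MB, all holes initially */

int main(void)
{
	int fd = open("sparse.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
	char *map;
	long off;

	if (fd < 0 || ftruncate(fd, FILE_SIZE) < 0) {
		perror("setup");
		return 1;
	}
	map = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (fork() == 0) {
		/* child: dirty the odd pages */
		for (off = 4096; off < FILE_SIZE; off += 8192)
			map[off] = 1;	/* dirty a page over a hole */
		_exit(0);
	}
	/* parent: dirty the even pages concurrently */
	for (off = 0; off < FILE_SIZE; off += 8192)
		map[off] = 1;
	wait(NULL);
	msync(map, FILE_SIZE, MS_SYNC);	/* push dirty pages to disk */
	return 0;
}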

> Stephen fixed it in 0.9.2.
> I don't know if it explains the failure you saw.

Me neither, but it could conceivably do so. The worst case scenario
as an immediate result of that bug would be corruption in the middle
of an indirect block. We used to see that on ext2 on kernels before
2.4.3 as a result of a similar bug there, and the side effects of the
bug were often severe --- if an indirect block is corrupted this way,
then on subsequent delete, you can end up freeing arbitrary parts of
the fs and all bets are off beyond that.

With the 0.9.2 fix in place, I've seen no such problems with any
stress tests, although the VM problems being discussed elsewhere do
still sometimes cause things to stall for a while or lock up totally
after a few hours.

Cheers,
Stephen