Hello linux-kernel,
Does anyone have information on this subject? I'm having constant
failures with system swapping on RAID1, and I just wanted to be sure
whether this may be the problem or not. It works without any problems
with the 2.2 kernel.
--
Best regards,
Peter mailto:[email protected]
On Thursday July 5, [email protected] wrote:
> Hello linux-kernel,
>
> Does anyone have information on this subject? I'm having constant
> failures with system swapping on RAID1, and I just wanted to be sure
> whether this may be the problem or not. It works without any problems
> with the 2.2 kernel.
It certainly should work in 2.4. What sort of "constant failures" are
you experiencing?
Though it does appear to work in 2.2, there is a possibility of data
corruption if you swap onto a raid1 array that is resyncing. This
possibility does not exist in 2.4.
NeilBrown
Hello Neil,
Thursday, July 05, 2001, 4:13:00 PM, you wrote:
NB> On Thursday July 5, [email protected] wrote:
>> Hello linux-kernel,
>>
>> Does anyone have information on this subject? I'm having constant
>> failures with system swapping on RAID1, and I just wanted to be sure
>> whether this may be the problem or not. It works without any problems
>> with the 2.2 kernel.
NB> It certainly should work in 2.4. What sort of "constant failures" are
NB> you experiencing?
NB> Though it does appear to work in 2.2, there is a possibility of data
NB> corruption if you swap onto a raid1 array that is resyncing. This
NB> possibility does not exist in 2.4.
The problem is that I'm constantly getting these X-order allocation
errors in the kernel log, after which the system becomes unstable and
often hangs or leaves processes that cannot be killed even by a "-9"
signal. The debugging patches I installed produce the following
allocation paths:
> Jun 20 05:56:14 tor kernel: Call Trace: [__get_free_pages+20/36]
> [__get_free_pages+20/36] [kmem_cache_grow+187/520] [kmalloc+183/224]
> [raid1_alloc_r1bh+105/256] [raid1_make_request+832/852]
> [raid1_make_request+80/852]
> Jun 20 05:56:14 tor kernel: [md_make_request+79/124]
> [generic_make_request+293/308] [submit_bh+87/116] [brw_page+143/160]
> [rw_swap_page_base+336/428] [rw_swap_page+112/184] [swap_writepage+120/128]
> [page_launder+644/2132]
> Jun 20 05:56:14 tor kernel: [do_try_to_free_pages+52/124]
> [kswapd+89/228] [kernel_thread+40/56]
>
one more trace:
SR>>Jun 19 09:50:08 garnet kernel: __alloc_pages: 0-order allocation failed.
SR>>Jun 19 09:50:08 garnet kernel: __alloc_pages: 0-order allocation failed from
SR>>c01Jun 19 09:50:08 garnet kernel: ^M^Mf4a2bc74 c024ac20 00000000 c012ca09
SR>>c024abe0
SR>>Jun 19 09:50:08 garnet kernel: 00000008 c03225e0 00000003 00000001
SR>>c029c9Jun 19 09:50:08 garnet kernel: f0ebb760 00000001 00000008
SR>>c03225e0 c0197bJun 19 09:50:08 garnet kernel: Call Trace:
SR>>[alloc_bounce_page+13/140] [alloc_bouJun 19 09:50:08 garnet kernel:
SR>>[raid1_make_request+832/852] [md_make_requJun 19 09:50:08 garnet kernel:
SR>>[swap_writepage+120/128] [page_launder+644Jun 19 09:50:08 garnet kernel:
SR>>[sock_poll+35/40] [do_select+230/476] [sysJun 19 10:21:27 garnet kernel:
SR>>sending pkt_too_big to self
SR>>Jun 19 10:21:55 garnet kernel: sending pkt_too_big to self
SR>>Jun 19 10:34:36 garnet kernel: sending pkt_too_big to self
SR>>Jun 19 10:35:33 garnet last message repeated 2 times
SR>>Jun 19 10:36:50 garnet kernel: sending pkt_too_big to self
That's why I thought this problem is related to the RAID1 swapping
I'm using.
Well, of course I'm speaking about a synced RAID1.
--
Best regards,
Peter mailto:[email protected]
Peter Zaitsev wrote:
>
> That's why I thought this problem is related to the RAID1 swapping
> I'm using.
Well, there is the potential problem that RAID1 can't avoid allocating
memory on some occasions, for the 2nd bufferhead. ATARAID raid0 has the
same problem for now, and there is no real solution to this. You can
pre-allocate a bunch of bufferheads, but under high load you will run
out of those, no matter how many you pre-allocate.
Of course you can then wait for the "in flight" ones to become
available again, and that is the best thing I've come up with so far.
It would be nice if the 3 subsystems that need such bufferheads now
(MD RAID1, ATARAID RAID0 and the bouncebuffer(head) code) could share
their pool.
Greetings,
Arjan van de Ven
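To make this concrete, here is a minimal user-space sketch of the
pre-allocate-and-wait scheme Arjan describes. Pthreads stand in for the
kernel's spinlock and wait queue, and all names are illustrative; this
is a model of the idea, not the md code itself:

    /* model of a reserved buffer-head pool */
    #include <pthread.h>
    #include <stdlib.h>

    struct bh { struct bh *next; };

    static struct bh *pool;        /* free list of reserved entries */
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t pool_freed = PTHREAD_COND_INITIALIZER;

    /* pre-allocate n entries while memory is still plentiful */
    int pool_init(int n)
    {
        while (n--) {
            struct bh *b = malloc(sizeof(*b));
            if (!b)
                return -1;
            b->next = pool;
            pool = b;
        }
        return 0;
    }

    /* try the allocator first, then the reserve; as a last resort
     * sleep until an "in flight" entry comes back */
    struct bh *bh_get(void)
    {
        struct bh *b = malloc(sizeof(*b));
        if (b)
            return b;
        pthread_mutex_lock(&pool_lock);
        while (!pool)
            pthread_cond_wait(&pool_freed, &pool_lock);
        b = pool;
        pool = b->next;
        pthread_mutex_unlock(&pool_lock);
        return b;
    }

    /* completed I/O returns its entry: refill the reserve and wake
     * one sleeper (a real version would cap the pool and free the
     * excess) */
    void bh_put(struct bh *b)
    {
        pthread_mutex_lock(&pool_lock);
        b->next = pool;
        pool = b;
        pthread_cond_signal(&pool_freed);
        pthread_mutex_unlock(&pool_lock);
    }

The deadlock Arjan alludes to shows up when the task that must return
entries is itself stuck in bh_get() -- which is exactly the
kswapd/bdflush situation discussed below.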
Just out of curiosity, what are the advantages of having a RAID1 swap
partition? Setting the swap priority to 0 (pri=0) in the fstab for all
the swap partitions on your system should have the same effect as doing
it with RAID but without the overhead, right? RAID1 would also mirror
your swap. Why would you want that?
Regards,
-Nick
Peter Zaitsev wrote:
>
> Hello linux-kernel,
>
> Does anyone have information on this subject? I'm having constant
> failures with system swapping on RAID1, and I just wanted to be sure
> whether this may be the problem or not. It works without any problems
> with the 2.2 kernel.
>
> --
> Best regards,
> Peter mailto:[email protected]
>
--
Nicholas DeClario
Systems Engineer Guardian Digital, Inc.
(201) 934-9230 Pioneering. Open Source. Security.
[email protected] http://www.guardiandigital.com
Nick DeClario wrote:
>
> Just out of curiosity, what are the advantages of having a RAID1 swap
> partition? Setting the swap priority to 0 (pri=0) in the fstab for all
> the swap partitions on your system should have the same effect as doing
> it with RAID but without the overhead, right? RAID1 would also mirror
> your swap. Why would you want that?
>
> Regards,
> -Nick
>
Hi,
Setting swap priority to 0 is equivalent to RAID0 (striping), not RAID1 (mirroring).
Mirroring your swap partition is important because if the disk containing
your swap fails, your system is dead. If you want to keep your system running
even if one disk fails, you need to mirror ALL your active partitions, including
swap.
If you only mirror your data partitions, you are only protected against data
loss in case of a disk crash (and that assumes you shut down gracefully before
the system panics while trying to read/write a crashed swap partition and leaves
your data in some inconsistent state).
Regards
--
Joseph Bueno
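For concreteness, the difference Joseph describes looks roughly like this
in /etc/fstab (device names are only examples):

    # two plain partitions at equal priority: the kernel round-robins
    # between them, RAID0-style -- lose either disk and swap is gone
    /dev/sda2   none   swap   sw,pri=0   0 0
    /dev/sdb2   none   swap   sw,pri=0   0 0

    # swap on a RAID1 array built from the same two partitions: every
    # swap page is mirrored, so the box survives a single dead disk
    /dev/md1    none   swap   sw         0 0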
In linux-kernel, you wrote:
> Peter Zaitsev wrote:
> >
> > That's why I thought this problem is related to the RAID1 swapping
> > I'm using.
>
> Well, there is the potential problem that RAID1 can't avoid allocating
> memory on some occasions, for the 2nd bufferhead. ATARAID raid0 has the same
> problem for now, and there is no real solution to this. You can pre-allocate
> a bunch of bufferheads, but under high load you will run out of those, no
> matter how many you pre-allocate.
Arjan, why doesn't it sleep instead (GFP_KERNEL)?
-- Pete
Hello Nick,
Thursday, July 05, 2001, 6:54:37 PM, you wrote:
Well, the idea is simple: I want my system to survive if one of the
disks fails, so I store all of my data, including swap, on RAID
partitions.
ND> Just out of curiosity, what are the advantages of having a RAID1 swap
ND> partition? Setting the swap priority to 0 (pri=0) in the fstab for all
ND> the swap partitions on your system should have the same effect as doing
ND> it with RAID but without the overhead, right? RAID1 would also mirror
ND> your swap. Why would you want that?
ND> Regards,
ND> -Nick
ND> Peter Zaitsev wrote:
>>
>> Hello linux-kernel,
>>
>> Does anyone have information on this subject? I'm having constant
>> failures with system swapping on RAID1, and I just wanted to be sure
>> whether this may be the problem or not. It works without any problems
>> with the 2.2 kernel.
>>
>> --
>> Best regards,
>> Peter mailto:[email protected]
>>
--
Best regards,
Peter mailto:[email protected]
On Thu, 5 Jul 2001, Nick DeClario wrote:
> RAID1 would also mirror your swap. Why would you want that?
redundancy. no point having your data redundant if your swap isn't -
1 drive failure will take out the box the moment it tries to access
swap on the failed drive.
PS: i have 2 boxes deployed running RH's 2.4.2, with swap on top of
LVM on top of RAID1. no problems so far, even during resync.
> Regards,
> -Nick
--paulj
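For reference, one plausible way to build the swap-on-LVM-on-RAID1 stack
Paul mentions, assuming the RAID1 array /dev/md0 already exists (volume
names and sizes are made up):

    pvcreate /dev/md0              # make the mirrored array an LVM PV
    vgcreate vg0 /dev/md0          # put it in a volume group
    lvcreate -L 256M -n swap vg0   # carve out a logical volume
    mkswap /dev/vg0/swap           # format it as swap
    swapon /dev/vg0/swap           # and enable it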
On Thursday July 5, [email protected] wrote:
> Hello Neil,
>
> Thursday, July 05, 2001, 4:13:00 PM, you wrote:
>
> NB> On Thursday July 5, [email protected] wrote:
> >> Hello linux-kernel,
> >>
> >> Does anyone have information on this subject? I'm having constant
> >> failures with system swapping on RAID1, and I just wanted to be sure
> >> whether this may be the problem or not. It works without any problems
> >> with the 2.2 kernel.
>
> NB> It certainly should work in 2.4. What sort of "constant failures" are
> NB> you experiencing?
>
> NB> Though it does appear to work in 2.2, there is a possibility of data
> NB> corruption if you swap onto a raid1 array that is resyncing. This
> NB> possibility does not exist in 2.4.
>
>
>
> The problem is that I'm constantly getting these X-order allocation
> errors in the kernel log, after which the system becomes unstable and
> often hangs or leaves processes that cannot be killed even by a "-9"
> signal. The debugging patches I installed produce the following
> allocation paths:
These "X-order-allocation" failures are just an indication that you
are running out or memory. raid1 is explicitly written to cope.
If memory allocation fails it waits for some to be free, and it has
made sure in advance that there is some memory that it will get
first-dibs on when it becomes free, so there is no risk of deadlock.
However this does not explain why you are getting unkillable
processes.
Can you try putting swap on just one of the partitions that you raid1
together, instead of on the raid1 array, and see whether you can still
get processes to become unkillable?
Also, can you find out what that process is doing when it is
unkillable.
If you compile with alt-sysrq support, then alt-sysrq-t should print
the process table. If you can get this out of dmesg and run it through
ksymoops it might be most interesting.
NeilBrown
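For anyone following along, capturing that trace would go roughly like
this, assuming CONFIG_MAGIC_SYSRQ is compiled in (exact paths and flags
from memory, so treat this as a sketch):

    echo 1 > /proc/sys/kernel/sysrq   # make sure the magic key is enabled
    # press Alt-SysRq-T on the console to dump all task states
    dmesg -s 128000 > tasks.txt       # pull the dump from the ring buffer
    ksymoops < tasks.txt              # resolve addresses to symbol names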
Neil Brown wrote:
>
> Also, can you find out what that process is doing when it is
> unkillable.
> If you compile with alt-sysrq support, then alt-sysrq-t should print
> the process table. If you can get this out of dmesg and run it through
> ksymoops it might be most interesting.
Neil, he showed us a trace the other day - kswapd was
stuck in raid1_alloc_r1bh(). This is basically the
same situation as I had yesterday, where bdflush was stuck
in the same place.
It is completely fatal to the VM for these two processes to
get stuck in this way. The approach I took was to beef up
the reserved bh queues and to keep a number of them
reserved *only* for the swapout and dirty buffer flush functions.
That way, we have at hand the memory we need to be able to
free up memory.
It was necessary to define a new task_struct.flags bit so we
can identify when the caller is a `buffer flusher' - I expect
we'll need that in other places as well.
An easy way to demonstrate the problem is to put ext3 on RAID1,
boot with `mem=64m' and run `dd if=/dev/zero of=foo bs=1024k count=1k'.
The machine wedges on the first run. This is due to a bdflush deadlock.
Once swap is on RAID1, there will be kswapd deadlocks as well. The
patch *should* fix those, but I haven't tested that.
Could you please review these changes?
BTW: I removed the initial buffer_head reservation code. It's
not necessary with the modified reservation algorithm - as soon
as we start to use the device the reserve pools will build
up. There will be a deadlock opportunity if the machine is totally
and utterly oom when the RAID device initially starts up, but it's
really not worth the code space to even bother about this.
--- linux-2.4.6/include/linux/sched.h Wed May 2 22:00:07 2001
+++ lk-ext3/include/linux/sched.h Thu Jul 12 01:03:20 2001
@@ -413,7 +418,7 @@ struct task_struct {
#define PF_SIGNALED 0x00000400 /* killed by a signal */
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
#define PF_VFORK 0x00001000 /* Wake up parent in mm_release */
-
+#define PF_FLUSH 0x00002000 /* Flushes buffers to disk */
#define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */
/*
--- linux-2.4.6/include/linux/raid/raid1.h Tue Dec 12 08:20:08 2000
+++ lk-ext3/include/linux/raid/raid1.h Thu Jul 12 01:15:39 2001
@@ -37,12 +37,12 @@ struct raid1_private_data {
/* buffer pool */
/* buffer_heads that we have pre-allocated have b_pprev -> &freebh
* and are linked into a stack using b_next
- * raid1_bh that are pre-allocated have R1BH_PreAlloc set.
* All these variable are protected by device_lock
*/
struct buffer_head *freebh;
int freebh_cnt; /* how many are on the list */
struct raid1_bh *freer1;
+ unsigned freer1_cnt;
struct raid1_bh *freebuf; /* each bh_req has a page allocated */
md_wait_queue_head_t wait_buffer;
@@ -87,5 +87,4 @@ struct raid1_bh {
/* bits for raid1_bh.state */
#define R1BH_Uptodate 1
#define R1BH_SyncPhase 2
-#define R1BH_PreAlloc 3 /* this was pre-allocated, add to free list */
#endif
--- linux-2.4.6/fs/buffer.c Wed Jul 4 18:21:31 2001
+++ lk-ext3/fs/buffer.c Thu Jul 12 01:03:57 2001
@@ -2685,6 +2748,7 @@ int bdflush(void *sem)
sigfillset(&tsk->blocked);
recalc_sigpending(tsk);
spin_unlock_irq(&tsk->sigmask_lock);
+ current->flags |= PF_FLUSH;
up((struct semaphore *)sem);
@@ -2726,6 +2790,7 @@ int kupdate(void *sem)
siginitsetinv(&current->blocked, sigmask(SIGCONT) | sigmask(SIGSTOP));
recalc_sigpending(tsk);
spin_unlock_irq(&tsk->sigmask_lock);
+ current->flags |= PF_FLUSH;
up((struct semaphore *)sem);
--- linux-2.4.6/drivers/md/raid1.c Wed Jul 4 18:21:26 2001
+++ lk-ext3/drivers/md/raid1.c Thu Jul 12 01:28:58 2001
@@ -51,6 +51,28 @@ static mdk_personality_t raid1_personali
static md_spinlock_t retry_list_lock = MD_SPIN_LOCK_UNLOCKED;
struct raid1_bh *raid1_retry_list = NULL, **raid1_retry_tail;
+/*
+ * We need to scale the number of reserved buffers by the page size
+ * to make writepage()s successful. --akpm
+ */
+#define R1_BLOCKS_PP (PAGE_CACHE_SIZE / 1024)
+#define FREER1_MEMALLOC_RESERVED (16 * R1_BLOCKS_PP)
+
+/*
+ * Return true if the caller make take a bh from the list.
+ * PF_FLUSH and PF_MEMALLOC tasks are allowed to use the reserves, because
+ * they're trying to *free* some memory.
+ *
+ * Requires that conf->device_lock be held.
+ */
+static int may_take_bh(raid1_conf_t *conf, int cnt)
+{
+ int min_free = (current->flags & (PF_FLUSH|PF_MEMALLOC)) ?
+ cnt :
+ (cnt + FREER1_MEMALLOC_RESERVED * conf->raid_disks);
+ return conf->freebh_cnt >= min_free;
+}
+
static struct buffer_head *raid1_alloc_bh(raid1_conf_t *conf, int cnt)
{
/* return a linked list of "cnt" struct buffer_heads.
@@ -62,7 +84,7 @@ static struct buffer_head *raid1_alloc_b
while(cnt) {
struct buffer_head *t;
md_spin_lock_irq(&conf->device_lock);
- if (conf->freebh_cnt >= cnt)
+ if (may_take_bh(conf, cnt))
while (cnt) {
t = conf->freebh;
conf->freebh = t->b_next;
@@ -83,7 +105,7 @@ static struct buffer_head *raid1_alloc_b
cnt--;
} else {
PRINTK("raid1: waiting for %d bh\n", cnt);
- wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
+ wait_event(conf->wait_buffer, may_take_bh(conf, cnt));
}
}
return bh;
@@ -96,9 +118,9 @@ static inline void raid1_free_bh(raid1_c
while (bh) {
struct buffer_head *t = bh;
bh=bh->b_next;
- if (t->b_pprev == NULL)
+ if (conf->freebh_cnt >= FREER1_MEMALLOC_RESERVED) {
kfree(t);
- else {
+ } else {
t->b_next= conf->freebh;
conf->freebh = t;
conf->freebh_cnt++;
@@ -108,29 +130,6 @@ static inline void raid1_free_bh(raid1_c
wake_up(&conf->wait_buffer);
}
-static int raid1_grow_bh(raid1_conf_t *conf, int cnt)
-{
- /* allocate cnt buffer_heads, possibly less if kalloc fails */
- int i = 0;
-
- while (i < cnt) {
- struct buffer_head *bh;
- bh = kmalloc(sizeof(*bh), GFP_KERNEL);
- if (!bh) break;
- memset(bh, 0, sizeof(*bh));
-
- md_spin_lock_irq(&conf->device_lock);
- bh->b_pprev = &conf->freebh;
- bh->b_next = conf->freebh;
- conf->freebh = bh;
- conf->freebh_cnt++;
- md_spin_unlock_irq(&conf->device_lock);
-
- i++;
- }
- return i;
-}
-
static int raid1_shrink_bh(raid1_conf_t *conf, int cnt)
{
/* discard cnt buffer_heads, if we can find them */
@@ -147,7 +146,16 @@ static int raid1_shrink_bh(raid1_conf_t
md_spin_unlock_irq(&conf->device_lock);
return i;
}
-
+
+/*
+ * Return true if the caller make take a raid1_bh from the list.
+ * Requires that conf->device_lock be held.
+ */
+static int may_take_r1bh(raid1_conf_t *conf)
+{
+ return ((conf->freer1_cnt > FREER1_MEMALLOC_RESERVED) ||
+ (current->flags & (PF_FLUSH|PF_MEMALLOC))) && conf->freer1;
+}
static struct raid1_bh *raid1_alloc_r1bh(raid1_conf_t *conf)
{
@@ -155,8 +163,9 @@ static struct raid1_bh *raid1_alloc_r1bh
do {
md_spin_lock_irq(&conf->device_lock);
- if (conf->freer1) {
+ if (may_take_r1bh(conf)) {
r1_bh = conf->freer1;
+ conf->freer1_cnt--;
conf->freer1 = r1_bh->next_r1;
r1_bh->next_r1 = NULL;
r1_bh->state = 0;
@@ -170,7 +179,7 @@ static struct raid1_bh *raid1_alloc_r1bh
memset(r1_bh, 0, sizeof(*r1_bh));
return r1_bh;
}
- wait_event(conf->wait_buffer, conf->freer1);
+ wait_event(conf->wait_buffer, may_take_r1bh(conf));
} while (1);
}
@@ -178,49 +187,30 @@ static inline void raid1_free_r1bh(struc
{
struct buffer_head *bh = r1_bh->mirror_bh_list;
raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);
+ unsigned long flags;
r1_bh->mirror_bh_list = NULL;
- if (test_bit(R1BH_PreAlloc, &r1_bh->state)) {
- unsigned long flags;
- spin_lock_irqsave(&conf->device_lock, flags);
+ spin_lock_irqsave(&conf->device_lock, flags);
+ if (conf->freer1_cnt < FREER1_MEMALLOC_RESERVED) {
r1_bh->next_r1 = conf->freer1;
conf->freer1 = r1_bh;
+ conf->freer1_cnt++;
spin_unlock_irqrestore(&conf->device_lock, flags);
} else {
+ spin_unlock_irqrestore(&conf->device_lock, flags);
kfree(r1_bh);
}
raid1_free_bh(conf, bh);
}
-static int raid1_grow_r1bh (raid1_conf_t *conf, int cnt)
-{
- int i = 0;
-
- while (i < cnt) {
- struct raid1_bh *r1_bh;
- r1_bh = (struct raid1_bh*)kmalloc(sizeof(*r1_bh), GFP_KERNEL);
- if (!r1_bh)
- break;
- memset(r1_bh, 0, sizeof(*r1_bh));
-
- md_spin_lock_irq(&conf->device_lock);
- set_bit(R1BH_PreAlloc, &r1_bh->state);
- r1_bh->next_r1 = conf->freer1;
- conf->freer1 = r1_bh;
- md_spin_unlock_irq(&conf->device_lock);
-
- i++;
- }
- return i;
-}
-
static void raid1_shrink_r1bh(raid1_conf_t *conf)
{
md_spin_lock_irq(&conf->device_lock);
while (conf->freer1) {
struct raid1_bh *r1_bh = conf->freer1;
conf->freer1 = r1_bh->next_r1;
+ conf->freer1_cnt--; /* pedantry */
kfree(r1_bh);
}
md_spin_unlock_irq(&conf->device_lock);
@@ -1610,21 +1600,6 @@ static int raid1_run (mddev_t *mddev)
goto out_free_conf;
}
-
- /* pre-allocate some buffer_head structures.
- * As a minimum, 1 r1bh and raid_disks buffer_heads
- * would probably get us by in tight memory situations,
- * but a few more is probably a good idea.
- * For now, try 16 r1bh and 16*raid_disks bufferheads
- * This will allow at least 16 concurrent reads or writes
- * even if kmalloc starts failing
- */
- if (raid1_grow_r1bh(conf, 16) < 16 ||
- raid1_grow_bh(conf, 16*conf->raid_disks)< 16*conf->raid_disks) {
- printk(MEM_ERROR, mdidx(mddev));
- goto out_free_conf;
- }
-
for (i = 0; i < MD_SB_DISKS; i++) {
descriptor = sb->disks+i;
@@ -1713,6 +1688,8 @@ out_free_conf:
raid1_shrink_r1bh(conf);
raid1_shrink_bh(conf, conf->freebh_cnt);
raid1_shrink_buffers(conf);
+ if (conf->freer1_cnt != 0)
+ BUG();
kfree(conf);
mddev->private = NULL;
out:
On Thursday July 12, [email protected] wrote:
>
> Could you please review these changes?
I think I see what you are trying to do, and there is nothing
obviously wrong except this comment :-)
> + * Return true if the caller make take a raid1_bh from the list.
^^^^
but now that I see what the problem is, I think a simpler patch would
be
--- drivers/md/raid1.c 2001/07/12 02:00:35 1.1
+++ drivers/md/raid1.c 2001/07/12 02:01:42
@@ -83,6 +83,7 @@
cnt--;
} else {
PRINTK("raid1: waiting for %d bh\n", cnt);
+ run_task_queue(&tq_disk);
wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
}
}
@@ -170,6 +171,7 @@
memset(r1_bh, 0, sizeof(*r1_bh));
return r1_bh;
}
+ run_task_queue(&tq_disk);
wait_event(conf->wait_buffer, conf->freer1);
} while (1);
}
This is needed anyway to be "correct", as you should always unplug
the queues before waiting for IO to complete.
On the issue of whether to pre-allocate some reserved structures or
not, I think it's "6-of-one-half-a-dozen-of-the-other". My rationale
for pre-allocating was that the buffers that we hold on to would have
been allocated together and so probably are fairly dense within their
pages, and so there is no risk of hogging excess memory that isn't
actually being used. Mind you, if I was really serious about being
gentle on the memory allocation, I would use
kmem_cache_alloc(bh_cachep,SLAB_whatever)
instead of
kmalloc(sizeof(struct buffer_head), GFP_whatever)
but I hadn't 'got' the slab stuff properly when I was writing that
code.
Peter, does the above little patch help your problem?
NeilBrown
Neil Brown wrote:
>
> --- drivers/md/raid1.c 2001/07/12 02:00:35 1.1
> +++ drivers/md/raid1.c 2001/07/12 02:01:42
> @@ -83,6 +83,7 @@
> cnt--;
> } else {
> PRINTK("raid1: waiting for %d bh\n", cnt);
> + run_task_queue(&tq_disk);
> wait_event(conf->wait_buffer, conf->freebh_cnt >= cnt);
> }
> }
> @@ -170,6 +171,7 @@
> memset(r1_bh, 0, sizeof(*r1_bh));
> return r1_bh;
> }
> + run_task_queue(&tq_disk);
> wait_event(conf->wait_buffer, conf->freer1);
> } while (1);
> }
>
> This is needed anyway to be "correct", as you should always unplug
> the queues before waiting for IO to complete.
The problem with this approach is the waitqueue - you get several
tasks on the waitqueue, and bdflush loses the race - some other
thread steals the r1bh and bdflush goes back to sleep.
Replacing the wait_event() with a special raid1_wait_event()
which unplugs *each time* the caller is woken does help - but
it is still easy to deadlock the system.
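A raid1_wait_event() along those lines could be patterned on 2.4's
__wait_event(), with the unplug moved inside the loop so it runs on
every wakeup -- the following is a sketch of the idea, not the exact
code that was tested:

    /* like wait_event(), but unplug the disk queues each time we are
     * about to sleep, so the I/O that will free a buffer is actually
     * submitted */
    #define raid1_wait_event(wq, condition)                     \
    do {                                                        \
        wait_queue_t __wait;                                    \
        init_waitqueue_entry(&__wait, current);                 \
        add_wait_queue(&(wq), &__wait);                         \
        for (;;) {                                              \
            set_current_state(TASK_UNINTERRUPTIBLE);            \
            if (condition)                                      \
                break;                                          \
            run_task_queue(&tq_disk);  /* unplug every pass */  \
            schedule();                                         \
        }                                                       \
        current->state = TASK_RUNNING;                          \
        remove_wait_queue(&(wq), &__wait);                      \
    } while (0)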
Clearly this approach is racy: it assumes that the reserved buffers have
actually been submitted when we unplug - they may not yet have been.
But the lockup is too easy to trigger for that to be a satisfactory
explanation.
The most effective, aggressive, successful and grotty fix for this
problem is to remove the wait_event altogether and replace it with:
run_task_queue(&tq_disk);
current->policy |= SCHED_YIELD;
__set_current_state(TASK_RUNNING);
schedule();
This can still deadlock in bad OOM situations, but I think we're
dead anyway. A combination of this approach plus the PF_FLUSH
reservations would work even better, but I found the PF_FLUSH
stuff was sufficient.
> Mind you, if I was really serious about being
> gentle on the memory allocation, I would use
> kmem_cache_alloc(bh_cachep,SLAB_whatever)
> instead of
> kmalloc(sizeof(struct buffer_head), GFP_whatever)
get/put_unused_buffer_head() should be exported API functions.