2001-02-16 00:39:00

by David D.W. Downey

Subject: [OTP] SMP board recommendations?


Anyone have a recommendation for a motherboard for a home-based SMP box?

I've tried the Abit VP6 and the MSI 6321 (694D Pro). Both give me the APIC
errors with system lockups on heavy I/O using the 2.4.1-ac1# and the
2.4.2-pre# kernels. (The ac-## line doesn't die ANYWHERE near as often as
the other board.)

I'm looking into the i810 server board with the onboard SCSI controllers.
I plan on installing either the Promise PDC20267 ATA100 controller or a
Promise FastTrak RAID card (if they come in ATA100) since the only SCSI I
have is the Yamaha 8424S SCSI CDR-W.

Since this IS off topic of sorts, please reply to me privately. Thanks


--
David D.W. Downey - RHCE
Consulting Engineer
Ensim Corporation - Sunnyvale, CA


2001-02-16 01:43:50

by Samuel Flory

Subject: mke2fs and kernel VM issues

What is believed to be the current status of the typical mke2fs
crashes/hangs due to vm issues? I can reliably reproduce the issue on a
heavily modified VA kernel based on 2.2.18. Is there a kernel which is
believed to be a known good kernel? (both 2.2.x and 2.4.x)

Failure pattern:

System:
mylex raid 5 array 8 x 9G drives (not really all that big)
>=512M of RAM (1G of RAM works)
no swap (Not sure if this makes a difference.)

The system is attempting to create a single partition covering most of
the RAID array.

errors:
buffy: Installing with LIVE AMMO
Creating partitions...
Initializing filesystems...
Out of Memory: Killed process 106 (portmap), saved process 2165
(mke2fs).<3>Out
of Memory: Killed process 2123 (buffy), saved process 2165
(mke2fs).willow: LOAD
FAILED
<3>Out of Memory: Killed process 195 (sisyphus_upload), saved process
2165 (mke2
fs).<3>Out of Memory: Killed process 2165 (mke2fs).

(Note that most of the above processes were dialog interfaces waiting
for user input or perl scripts waiting for mke2fs or buffy to exit.)


PS- Conversations with various VA employees indicate that others within
VA, and at least one vendor, are seeing hangs while creating really large
filesystems on RAID arrays (mostly 1/4 TB or larger). These issues
appear to come and go, and are endemic to the 2.2.x kernel line. Both
lnz and tytso seem to believe the issues to be entirely VM related.

--
Solving people's computer problems always
requires more hardware be given to you.
(The Second Rule of Hardware Acquisition)
Samuel J. Flory <[email protected]>

2001-02-16 05:03:52

by David D.W. Downey

Subject: Re: [OTP] SMP board recommendations?

Thank you all for your responses.

Andre (ASL), thanks for the assist. Laurie and Janine took care of me.
Asus CUV4X-D mobo with 1GB of buffered ECC RAM. I'm in the process of
transferring all the hardware to the new board. I'll let you know if this
new board solves the APIC errors and the random lockups under heavy I/O
problems.

I do have one more problem that I just can NOT track down.

2.4.1-ac10 kernel on the old Abit VP6 mobo. I'm getting curious errors
from the 2.4.1, 2.4.1-ac10, and 2.4.2-pre[#] kernels.

I've been using

dd if=/dev/zero of=/tmp/testdd.img bs=1024k count=1500

for testing of I/O on the various boards I have here. Now, the funny part
is that I get "file size limit exceeded" at around 1.0GB. I was getting
this under the 2.4.2-pre# kernels so I switched to straight 2.4.1 and got
the same problem. I switched to the 2.4.1-ac# line and the problem
disappeared. Guess what? It's baaacckk!

So, I did a strace of the dd command and got the following from it

execve("/bin/dd", ["dd", "if=/dev/zero", "of=/tmp/testing.img", "bs=1024k", "count=1500"], [/* 22 vars */]) = 0
brk(0) = 0x804e7b8
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=7852, ...}) = 0
old_mmap(NULL, 7852, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40015000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0755, st_size=1183326, ...}) = 0
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200\215"..., 4096) = 4096
old_mmap(NULL, 947548, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40017000
mprotect(0x400f7000, 30044, PROT_NONE) = 0
old_mmap(0x400f7000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0xdf000) = 0x400f7000
old_mmap(0x400fb000, 13660, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x400fb000
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x400ff000
mprotect(0x40017000, 917504, PROT_READ|PROT_WRITE) = 0
mprotect(0x40017000, 917504, PROT_READ|PROT_EXEC) = 0
munmap(0x40015000, 7852) = 0
personality(PER_LINUX) = 0
getpid() = 195
brk(0) = 0x804e7b8
brk(0x804e7f0) = 0x804e7f0
brk(0x804f000) = 0x804f000
open("/dev/zero", O_RDONLY|O_LARGEFILE) = 3
open("/tmp/testing.img", O_RDWR|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 4
rt_sigaction(SIGINT, NULL, {SIG_DFL}, 8) = 0
rt_sigaction(SIGINT, {0x804ada8, [], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGQUIT, NULL, {SIG_DFL}, 8) = 0
rt_sigaction(SIGQUIT, {0x804ada8, [], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGPIPE, NULL, {SIG_DFL}, 8) = 0
rt_sigaction(SIGPIPE, {0x804ada8, [], 0x4000000}, NULL, 8) = 0
rt_sigaction(SIGUSR1, NULL, {SIG_DFL}, 8) = 0
rt_sigaction(SIGUSR1, {0x804ae70, [], 0x4000000}, NULL, 8) = 0
old_mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40100000
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576

********* BIG ASS SNIP **********

read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = -1 EFBIG (File too large)
--- SIGXFSZ (File size limit exceeded) ---
+++ killed by SIGXFSZ +++



Now, notice the beginning file creation call. It starts out with
O_LARGEFILE but ends with EFBIG. Since I'm not totally familiar with the
kernel code I could be wrong on my next statement and if I am, please tell
me, but it looks like it changes the file creation call from LARGEFILE to
EFBIG (or is this just the error call itself?)

Now, the kernel is supposed to be able to handle creating a 4TB file(?),
so 1.0GB should be nothing to it. NOTHING changed between it working and
not working. No hardware changes, no software additions, no recompiles of
existing applications/daemons.. nothing.

So, my question is now one of "What gives?" Any clues on how I can check
to see what's going wrong? Is my gut feeling that it's changing the file
type wrong? (IIUC, there are different open() calls for different size
files? No, I have nothing to base this on, just something I flashed on
and thought might explain the problem.)
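
For readers following the trace above: EFBIG is the error returned by the failing write(), not a change to the flags passed to open(), and SIGXFSZ is normally raised when a write would push a file past the process's file size resource limit (RLIMIT_FSIZE, the value `ulimit -f` reports). One thing worth checking is therefore the limit that the shell running dd actually inherited. The following is a minimal sketch in Python, not something from the original post; the helper names are made up for illustration, and it only reports the limit, it does not explain why the limit might be set.

```python
import resource

MIB = 1024 * 1024


def dd_output_bytes(block_size, count):
    """Total bytes dd writes for a given bs= (in bytes) and count=."""
    return block_size * count


def describe_limit(limit):
    """Render an RLIMIT_FSIZE value (in bytes) as readable text."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit} bytes ({limit / MIB:.1f} MiB)"


def fits_under_limit(size, limit):
    """True if a file of `size` bytes is allowed by `limit`."""
    return limit == resource.RLIM_INFINITY or size <= limit


if __name__ == "__main__":
    # Limits are inherited, so run this from the same shell that runs dd.
    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    wanted = dd_output_bytes(1024 * 1024, 1500)  # bs=1024k count=1500
    print("soft limit:", describe_limit(soft))
    print("hard limit:", describe_limit(hard))
    print("dd wants  :", describe_limit(wanted))
    print("fits      :", fits_under_limit(wanted, soft))
```

If the soft limit turns out to be finite and close to 1 GB, raising it in that shell (`ulimit -f unlimited`) before rerunning dd would confirm or rule out this explanation.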

I'm learning here guys, so please be gentle. You folks are the only ones I
have with the experience to tell me when I'm just fscked in the head and
when I'm bang on.

--
David D.W. Downey - RHCE
Consulting Engineer
Ensim Corporation - Sunnyvale, CA

2001-02-16 06:28:08

by Andre Hedrick

Subject: Re: [OTP] SMP board recommendations?


Hi David,

Just to let you and the rest of the world in on a secret, 'ASL, Inc.' is
the premier ATA server system builder. Jeff Nguyen is the only person
I knew two years ago who was a pioneer, and I have shared some
information with him in the past, but here is ATA, and it is here to
stay.

Cheers,

Andre Hedrick
Linux ATA Development
ASL Kernel Development
-----------------------------------------------------------------------------
ASL, Inc. Toll free: 1-877-ASL-3535
1757 Houret Court Fax: 1-408-941-2071
Milpitas, CA 95035 Web: http://www.aslab.com

******* shameless toys of creation to challenge the GB/$$ *******
http://www.aslab.com/contents/servers/Sovereign-3400T.html
http://www.aslab.com/contents/servers/Sovereign-3450T.html


On Thu, 15 Feb 2001, David D.W. Downey wrote:

> Thank you all for your response.
>
> Andre (ASL), thanks for the assist. Laurie and Janine took care of me.
> Asus CUV4X-D mobo with 1GB of buffered ECC RAM. I'm in the process of
> transferring all the hardware to the new board. I'll let you know if this
> new board solves the APIC errors and the random lockups under heavy I/O
> problems.


2001-02-16 09:50:09

by Alan

Subject: Re: mke2fs and kernel VM issues

> heavily modified VA kernel based on 2.2.18. Is there a kernel which is
> believed to be a known good kernel? (both 2.2.x and 2.4.x)

I've not seen the problem on unmodified 2.2.18. The 2.2.17/18 VM does have
its problems but not these. 2.2.19pre3 and higher have the Andrea VM fixes which
have worked wonders for everyone so far.

2001-02-16 10:49:54

by Roeland Th. Jansen

Subject: Re: [OTP] SMP board recommendations?

On Thu, Feb 15, 2001 at 04:38:37PM -0800, David D.W. Downey wrote:
> I've tried the Abit VP6 and the MSI 6321 (694D Pro). Both give me the APIC
> errors with system lockups on heavy I/O using the 2.4.1-ac1# and the
> 2.4.2-pre# kernels. (The ac-## line doesn't die ANYWHERE near as often as
> the other board.)



The APIC code has been modified quite a bit, and with Maciej's fixes so
far my BP6 stays alive where even the -ac kernels died. I'd suggest you
try his patches and see if that works for you.

IIRC, this is the one :

patch-2.4.1-io_apic-46
diff -up --recursive --new-file linux-2.4.1.macro/arch/i386/kernel/apic.c linux-2.4.1/arch/i386/kernel/apic.c
--- linux-2.4.1.macro/arch/i386/kernel/apic.c Wed Dec 13 23:54:27 2000
+++ linux-2.4.1/arch/i386/kernel/apic.c Mon Feb 12 16:11:15 2001
@@ -23,6 +23,7 @@
#include <linux/mc146818rtc.h>
#include <linux/kernel_stat.h>

+#include <asm/atomic.h>
#include <asm/smp.h>
#include <asm/mtrr.h>
#include <asm/mpspec.h>
@@ -270,7 +271,13 @@ void __init setup_local_APIC (void)
* PCI Ne2000 networking cards and PII/PIII processors, dual
* BX chipset. ]
*/
-#if 0
+ /*
+ * Actually disabling the focus CPU check just makes the hang less
+ * frequent as it makes the interrupt distributon model be more
+ * like LRU than MRU (the short-term load is more even across CPUs).
+ * See also the comment in end_level_ioapic_irq(). --macro
+ */
+#if 1
/* Enable focus processor (bit==0) */
value &= ~(1<<9);
#else
@@ -764,7 +771,7 @@ asmlinkage void smp_error_interrupt(void
apic_write(APIC_ESR, 0);
v1 = apic_read(APIC_ESR);
ack_APIC_irq();
- irq_err_count++;
+ atomic_inc(&irq_err_count);

/* Here is what the APIC error bits mean:
0: Send CS error
diff -up --recursive --new-file linux-2.4.1.macro/arch/i386/kernel/i8259.c linux-2.4.1/arch/i386/kernel/i8259.c
--- linux-2.4.1.macro/arch/i386/kernel/i8259.c Mon Nov 20 18:01:58 2000
+++ linux-2.4.1/arch/i386/kernel/i8259.c Sun Feb 11 19:54:33 2001
@@ -12,6 +12,7 @@
#include <linux/init.h>
#include <linux/kernel_stat.h>

+#include <asm/atomic.h>
#include <asm/system.h>
#include <asm/io.h>
#include <asm/irq.h>
@@ -321,7 +322,7 @@ spurious_8259A_irq:
printk("spurious 8259A interrupt: IRQ%d.\n", irq);
spurious_irq_mask |= irqmask;
}
- irq_err_count++;
+ atomic_inc(&irq_err_count);
/*
* Theoretically we do not have to handle this IRQ,
* but in Linux this does not cause problems and is
diff -up --recursive --new-file linux-2.4.1.macro/arch/i386/kernel/io_apic.c linux-2.4.1/arch/i386/kernel/io_apic.c
--- linux-2.4.1.macro/arch/i386/kernel/io_apic.c Sat Feb 3 12:05:49 2001
+++ linux-2.4.1/arch/i386/kernel/io_apic.c Tue Feb 13 19:59:55 2001
@@ -33,6 +33,8 @@
#include <asm/smp.h>
#include <asm/desc.h>

+#define APIC_LOCKUP_DEBUG
+
static spinlock_t ioapic_lock = SPIN_LOCK_UNLOCKED;

/*
@@ -122,8 +124,14 @@ static void add_pin_to_irq(unsigned int
static void name##_IO_APIC_irq (unsigned int irq) \
__DO_ACTION(R, ACTION, FINAL)

-DO_ACTION( __mask, 0, |= 0x00010000, io_apic_sync(entry->apic))/* mask = 1 */
-DO_ACTION( __unmask, 0, &= 0xfffeffff, ) /* mask = 0 */
+DO_ACTION( __mask, 0, |= 0x00010000, io_apic_sync(entry->apic) )
+ /* mask = 1 */
+DO_ACTION( __unmask, 0, &= 0xfffeffff, )
+ /* mask = 0 */
+DO_ACTION( __mask_and_edge, 0, = (reg & 0xffff7fff) | 0x00010000, )
+ /* mask = 1, trigger = 0 */
+DO_ACTION( __unmask_and_level, 0, = (reg & 0xfffeffff) | 0x00008000, )
+ /* mask = 0, trigger = 1 */

static void mask_IO_APIC_irq (unsigned int irq)
{
@@ -847,6 +855,8 @@ void /*__init*/ print_local_APIC(void *

v = apic_read(APIC_EOI);
printk(KERN_DEBUG "... APIC EOI: %08x\n", v);
+ v = apic_read(APIC_RRR);
+ printk(KERN_DEBUG "... APIC RRR: %08x\n", v);
v = apic_read(APIC_LDR);
printk(KERN_DEBUG "... APIC LDR: %08x\n", v);
v = apic_read(APIC_DFR);
@@ -1191,12 +1201,61 @@ static unsigned int startup_level_ioapic
#define enable_level_ioapic_irq unmask_IO_APIC_irq
#define disable_level_ioapic_irq mask_IO_APIC_irq

-static void end_level_ioapic_irq (unsigned int i)
+static void end_level_ioapic_irq (unsigned int irq)
{
+ unsigned long v;
+
+/*
+ * It appears there is an erratum which affects at least version 0x11
+ * of I/O APIC (that's the 82093AA and cores integrated into various
+ * chipsets). Under certain conditions a level-triggered interrupt is
+ * erroneously delivered as edge-triggered one but the respective IRR
+ * bit gets set nevertheless. As a result the I/O unit expects an EOI
+ * message but it will never arrive and further interrupts are blocked
+ * from the source. The exact reason is so far unknown, but the
+ * phenomenon was observed when two consecutive interrupt requests
+ * from a given source get delivered to the same CPU and the source is
+ * temporarily disabled in between.
+ *
+ * A workaround is to simulate an EOI message manually. We achieve it
+ * by setting the trigger mode to edge and then to level when the edge
+ * trigger mode gets detected in the TMR of a local APIC for a
+ * level-triggered interrupt. We mask the source for the time of the
+ * operation to prevent an edge-triggered interrupt escaping meanwhile.
+ * The idea is from Manfred Spraul. --macro
+ */
+ v = apic_read(APIC_TMR + ((IO_APIC_VECTOR(irq) & ~0x1f) >> 1));
+
ack_APIC_irq();
+
+ if (!(v & (1 << (IO_APIC_VECTOR(irq) & 0x1f)))) {
+#ifdef APIC_MISMATCH_DEBUG
+ atomic_inc(&irq_mis_count);
+#endif
+ spin_lock(&ioapic_lock);
+ __mask_and_edge_IO_APIC_irq(irq);
+#ifdef APIC_LOCKUP_DEBUG
+ for (;;) {
+ struct irq_pin_list *entry = irq_2_pin + irq;
+ unsigned int reg;
+
+ if (entry->pin == -1)
+ break;
+ reg = io_apic_read(entry->apic, 0x10 + entry->pin * 2);
+ if (reg & 0x00004000)
+ printk(KERN_CRIT "Aieee!!! Remote IRR"
+ " still set after unlock!\n");
+ if (!entry->next)
+ break;
+ entry = irq_2_pin + entry->next;
+ }
+#endif
+ __unmask_and_level_IO_APIC_irq(irq);
+ spin_unlock(&ioapic_lock);
+ }
}

-static void mask_and_ack_level_ioapic_irq (unsigned int i) { /* nothing */ }
+static void mask_and_ack_level_ioapic_irq (unsigned int irq) { /* nothing */ }

static void set_ioapic_affinity (unsigned int irq, unsigned long mask)
{
diff -up --recursive --new-file linux-2.4.1.macro/arch/i386/kernel/irq.c linux-2.4.1/arch/i386/kernel/irq.c
--- linux-2.4.1.macro/arch/i386/kernel/irq.c Wed Dec 13 23:54:27 2000
+++ linux-2.4.1/arch/i386/kernel/irq.c Mon Feb 12 13:37:37 2001
@@ -33,6 +33,7 @@
#include <linux/irq.h>
#include <linux/proc_fs.h>

+#include <asm/atomic.h>
#include <asm/io.h>
#include <asm/smp.h>
#include <asm/system.h>
@@ -119,7 +120,12 @@ struct hw_interrupt_type no_irq_type = {
end_none
};

-volatile unsigned long irq_err_count;
+atomic_t irq_err_count;
+#ifdef CONFIG_X86_IO_APIC
+#ifdef APIC_MISMATCH_DEBUG
+atomic_t irq_mis_count;
+#endif
+#endif

/*
* Generic, controller-independent functions:
@@ -167,7 +173,12 @@ int get_irq_list(char *buf)
apic_timer_irqs[cpu_logical_map(j)]);
p += sprintf(p, "\n");
#endif
- p += sprintf(p, "ERR: %10lu\n", irq_err_count);
+ p += sprintf(p, "ERR: %10u\n", atomic_read(&irq_err_count));
+#ifdef CONFIG_X86_IO_APIC
+#ifdef APIC_MISMATCH_DEBUG
+ p += sprintf(p, "MIS: %10u\n", atomic_read(&irq_mis_count));
+#endif
+#endif
return p - buf;
}

diff -up --recursive --new-file linux-2.4.1.macro/include/asm-i386/hw_irq.h linux-2.4.1/include/asm-i386/hw_irq.h
--- linux-2.4.1.macro/include/asm-i386/hw_irq.h Sat Feb 3 13:12:29 2001
+++ linux-2.4.1/include/asm-i386/hw_irq.h Sun Feb 11 20:02:57 2001
@@ -13,6 +13,7 @@
*/

#include <linux/config.h>
+#include <asm/atomic.h>
#include <asm/irq.h>

/*
@@ -83,7 +84,9 @@ extern int IO_APIC_get_PCI_irq_vector(in
extern void send_IPI(int dest, int vector);

extern unsigned long io_apic_irqs;
-extern volatile unsigned long irq_err_count;
+
+extern atomic_t irq_err_count;
+extern atomic_t irq_mis_count;

extern char _stext, _etext;

diff -up --recursive --new-file linux-2.4.1.macro/include/asm-i386/io_apic.h linux-2.4.1/include/asm-i386/io_apic.h
--- linux-2.4.1.macro/include/asm-i386/io_apic.h Wed Nov 22 21:34:56 2000
+++ linux-2.4.1/include/asm-i386/io_apic.h Mon Feb 12 13:41:02 2001
@@ -12,6 +12,8 @@

#ifdef CONFIG_X86_IO_APIC

+#define APIC_MISMATCH_DEBUG
+
#define IO_APIC_BASE(idx) \
((volatile int *)__fix_to_virt(FIX_IO_APIC_BASE_0 + idx))

diff -up --recursive --new-file linux-2.4.1.macro/include/linux/irq.h linux-2.4.1/include/linux/irq.h
--- linux-2.4.1.macro/include/linux/irq.h Sat Feb 3 13:12:29 2001
+++ linux-2.4.1/include/linux/irq.h Sun Feb 11 20:08:41 2001
@@ -62,7 +62,4 @@ extern int setup_irq(unsigned int , stru
extern hw_irq_controller no_irq_type; /* needed in every arch ? */
extern void no_action(int cpl, void *dev_id, struct pt_regs *regs);

-extern volatile unsigned long irq_err_count;
-
#endif /* __asm_h */
-
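
The index arithmetic in end_level_ioapic_irq() above is compact, so here is a small illustrative sketch (in Python, not part of Maciej's patch) of what it computes: which 32-bit TMR register and which bit correspond to a given vector, and what the two new DO_ACTION variants do to the low word of a redirection entry. The TMR base offset of 0x180 is an assumption about the kernel's APIC_TMR definition, and the function names are made up for this note.

```python
APIC_TMR = 0x180  # assumed base offset of the local APIC TMR registers


def tmr_location(vector):
    """Return (register offset, bit number) of `vector` in the TMR.

    The TMR holds one bit per vector in 32-bit registers spaced 0x10
    apart, so (vector & ~0x1f) >> 1 equals (vector // 32) * 0x10.
    """
    return APIC_TMR + ((vector & ~0x1F) >> 1), vector & 0x1F


def mask_and_edge(reg):
    """Clear the trigger bit (15) and set the mask bit (16)."""
    return (reg & 0xFFFF7FFF) | 0x00010000


def unmask_and_level(reg):
    """Clear the mask bit (16) and set the trigger bit (15)."""
    return (reg & 0xFFFEFFFF) | 0x00008000


if __name__ == "__main__":
    offset, bit = tmr_location(0x31)
    print(f"vector 0x31 -> TMR register {offset:#x}, bit {bit}")
    entry = 0x0000A031  # example low word: level-triggered, unmasked
    masked = mask_and_edge(entry)
    print(f"{entry:#010x} -> {masked:#010x} -> {unmask_and_level(masked):#010x}")
```

Flipping the entry to masked/edge and back to unmasked/level is the "simulated EOI" the patch comment describes; the sketch only shows the bit manipulation, not the hardware side effect.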

--
Grobbebol's Home | Don't give in to spammers. -o)
http://www.xs4all.nl/~bengel | Use your real e-mail address /\
Linux 2.2.16 SMP 2x466MHz / 256 MB | on Usenet. _\_v

2001-02-16 11:03:17

by Tigran Aivazian

Subject: Re: mke2fs and kernel VM issues

On Thu, 15 Feb 2001, Samuel Flory wrote:

> What is believed to be the current status of the typical mke2fs
> crashes/hangs due to vm issues? I can reliably reproduce the issue on a
> heavily modified VA kernel based on 2.2.18. Is there a kernel which is
> believed to be a known good kernel? (both 2.2.x and 2.4.x)

I can mke2fs (successfully) on a 270G block device. Yes, of course, I also
get various page allocation failures while this happens but they are not
deadly, i.e. the thing (our volume manager) just retries until it works
and after a while I have a valid (and a very big) ext2 filesystem with 0
processes killed.

The kernel I use is 2.4.2-pre3. The machine has 6G RAM with the 3G given
to kernel virtual. The amount of swap is massive (2G) but it is never
used.

Regards,
Tigran

2001-02-16 12:46:09

by Theodore Tso

Subject: Re: mke2fs and kernel VM issues

Date: Fri, 16 Feb 2001 09:48:17 +0000 (GMT)
From: Alan Cox <[email protected]>

> > heavily modified VA kernel based on 2.2.18. Is there a kernel which is
> > believed to be a known good kernel? (both 2.2.x and 2.4.x)
>
> I've not seen the problem on unmodified 2.2.18. The 2.2.17/18 VM does have
> its problems but not these. 2.2.19pre3 and higher have the Andrea VM fixes which
> have worked wonders for everyone so far.

Note that this only shows up when using mke2fs to create very large
filesystems, and you have relatively little memory. In this particular
case, for example, we saw it with a system that had "only" 256 megs of
memory, and creating a 72 gigabyte filesystem using an 8x9GB RAID
configuration.

Some folks at IBM (in the Mylex controller group) have found this
problem with 2.2.16, 2.2.18, and with some 2.2.19pre patch (they didn't
say exactly which level of the 2.2.19pre patch they were dealing with).
Some folks are claiming that the problem exists under 2.2.18, and we've seen
it with our kernel, which is a 2.2.18 plus some set of 2.2.19pre*
patches.

The problem is that mke2fs issues a *lot* of writes when it is writing
the inode table, and apparently the write throttling isn't completely
working right under those circumstances. There is a workaround which
easily fixes the problem; if you set the MKE2FS_SYNC environment
variable to some value such as 5 or 10, then after writing every 5 or 10
block groups' worth of inode tables, mke2fs will call sync(). This
workaround did fix IBM's problem, which lends credence to the theory
that the problem is a VM bug related to a lack of sufficient write
throttling.

I've in the past considered making MKE2FS_SYNC=10 be the default, but
Stephen has requested that I not do this, since it's the best way of
showing off this particular VM bug.
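
As an illustration of the workaround described above, the sketch below prepares an environment with MKE2FS_SYNC set and builds the mke2fs command line. It is written in Python purely for illustration; the device path is a placeholder and the helper names are hypothetical. Nothing is executed unless the last line is uncommented, since mke2fs destroys whatever is on the target.

```python
import os
import subprocess


def mke2fs_env(sync_every, base_env=None):
    """Copy an environment and set MKE2FS_SYNC to `sync_every`."""
    if sync_every <= 0:
        raise ValueError("sync_every must be a positive integer")
    env = dict(os.environ if base_env is None else base_env)
    env["MKE2FS_SYNC"] = str(sync_every)
    return env


def mke2fs_command(device):
    """Argument list for a plain mke2fs run on `device`."""
    return ["mke2fs", device]


if __name__ == "__main__":
    env = mke2fs_env(10)
    cmd = mke2fs_command("/dev/EXAMPLE")  # placeholder, not a real device
    print("would run:", " ".join(cmd), "with MKE2FS_SYNC =", env["MKE2FS_SYNC"])
    # Destructive: uncomment only after pointing cmd at the intended device.
    # subprocess.run(cmd, env=env, check=True)
```

From a shell, the equivalent is to prefix the mke2fs invocation with the variable assignment (MKE2FS_SYNC=10) so that it applies to that one run only.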

- Ted

From IBM/Mylex's bug report:

>The system I used for these tests is a Chardonnay with an AR160
>installed. The FW is 6.00-07 and the BIOS is 6.01-08. The system
>has several various kernel/DAC driver boot configurations set up.
>The RAID drive under test is a 3-drive RAID 5 with 8.6GB drives,
>for a total of 17GB. When one maximum sized partition is created,
>the total number of logical cylinders is 2209.
>
>I also obtained the 2.2.19 patch and upgraded kernel 2.2.18 to 2.2.19.
>
>Note: The first 4 tests fail to at least some extent. Please read each one.
>
>1. Kernel 2.2.16 and DAC driver 2.2.9 - 128MB of main memory.
> The system fails to complete the creation of an ext2 file system.
> The process mke2fs (or mkfs) gets terminated instead. Subsequent
> attempts to create an ext2 file system (without rerunning fdisk) fail.
>
>2. Kernel 2.2.18 and DAC driver 2.2.9 - 128 MB of main memory.
> The ext2 file system is created, but hundreds of VM error messages
> scroll up on the screen. Expanding the swap space to exceed the
> memory size does not help (I think it might even be worse).
>
> I also tried running a copy compare script that loads the drives with
> heavy I/O. This failed after approximately 20 hours. The system was
> effectively locked up with VM error messages scrolling up the screen
> and all alternate terminal screens.
>
>3. Kernel 2.2.16 and DAC driver 2.2.10 - 128 MB of main memory.
> The system fails to complete the creation of an ext2 file system.
> The process mke2fs (or mkfs) gets terminated instead. Subsequent
> attempts to create an ext2 file system (without rerunning fdisk) fail.
>
>4. Kernel 2.2.19 and DAC driver 2.2.9 - 128 MB of main memory.
> The system fails to complete the creation of an ext2 file system.
> The process mke2fs (or mkfs) gets terminated instead. Subsequent
> attempts to create an ext2 file system (without rerunning fdisk) DO NOT
> FAIL.
>
>5. Kernel 2.2.16 and DAC driver 2.2.9 - 512 MB of main memory.
> The system completes the creation of an ext2 file system without
> any errors.
>
>6. Kernel 2.2.16 and DAC driver 2.2.10 - 512 MB of main memory.
> The system completes the creation of an ext2 file system without
> any errors.
>
>7. Kernel 2.2.18 and DAC driver 2.2.9 - 512 MB of main memory.
> The system completes the creation of an ext2 file system without
> any errors.
>
>8. Kernel 2.2.19 and DAC driver 2.2.9 - 512 MB of main memory.
> The system completes the creation of an ext2 file system without
> any errors.
>

2001-02-16 12:49:49

by Alan

Subject: Re: mke2fs and kernel VM issues

> case, for example, we saw it with a system that had "only" 256 megs of
> memory, and creating a 72 gigabyte filesystem using a 8x9gb RAID
> configuration.

OK, I've only tested 90GB on 2.2.19pre3, nothing larger than that.

> workaround did fix IBM's problem, which lends credence to the theory
> that the problem is a VM bug related to a lack of sufficient write
> throttling.

Yep. I think 2.4.1 is about the first kernel that gets this right



2001-02-16 19:01:38

by Samuel Flory

Subject: Re: mke2fs and kernel VM issues

Alan Cox wrote:
>
> > heavily modified VA kernel based on 2.2.18. Is there a kernel which is
> > believed to be a known good kernel? (both 2.2.x and 2.4.x)
>
> I've not seen the problem on unmodified 2.2.18. The 2.2.17/18 VM does have
> its problems but not these. 2.2.19pre3 and higher have the Andrea VM fixes which
> have worked wonders for everyone so far.


Hmm I believe Chip got his VM patches from Andrea. So the behavior
may be more 2.2.19pre3ish than 2.2.18ish.

--
Solving people's computer problems always
requires more hardware be given to you.
(The Second Rule of Hardware Acquisition)
Samuel J. Flory <[email protected]>

2001-02-16 19:02:26

by Samuel Flory

Subject: Re: mke2fs and kernel VM issues

Tigran Aivazian wrote:
>
> On Thu, 15 Feb 2001, Samuel Flory wrote:
>
> > What is believed to be the current status of the typical mke2fs
> > crashes/hangs due to vm issues? I can reliably reproduce the issue on a
> > heavily modified VA kernel based on 2.2.18. Is there a kernel which is
> > believed to be a known good kernel? (both 2.2.x and 2.4.x)
>
> I can mke2fs (successfully) on a 270G block device. Yes, of course, I also
> get various page allocation failures while this happens but they are not
> deadly, i.e. the thing (our volume manager) just retries until it works
> and after a while I have a valid (and a very big) ext2 filesystem with 0
> processes killed.
>
> The kernel I use is 2.4.2-pre3. The machine has 6G RAM with the 3G given
> to kernel virtual. The amount of swap is massive (2G) but it is never
> used.

I've never been able to reliably reproduce any sort of mke2fs hang on
systems with more than 512M of RAM. It would be interesting to know if
other people are seeing this under SW-RAID and with other controllers.
(Currently everyone in direct contact with me uses a Mylex controller.)
The key seems to be 512M or less of RAM and an 80G or larger
logical drive.

--
Solving people's computer problems always
requires more hardware be given to you.
(The Second Rule of Hardware Acquisition)
Samuel J. Flory <[email protected]>

2001-02-19 15:05:55

by Andrea Arcangeli

Subject: Re: mke2fs and kernel VM issues

On Fri, Feb 16, 2001 at 04:44:48AM -0800, Theodore Y. Ts'o wrote:
> Note that this only shows up when using mke2fs to create very large
> filesystems, and you have relatively little memory. In this particular

If you can reproduce the OOM of mke2fs on recent 2.2.19pre, could you try
again after applying this additional VM patch?

--- VM-locked/fs/buffer.c.~1~ Sun Feb 18 04:01:32 2001
+++ VM-locked/fs/buffer.c Sun Feb 18 23:03:32 2001
@@ -1530,9 +1530,13 @@
struct buffer_head *p = tmp;
tmp = tmp->b_this_page;

- if (buffer_dirty(p))
- if (test_and_set_bit(BH_Wait_IO, &p->b_state))
- ll_rw_block(WRITE, 1, &p);
+ if (buffer_dirty(p) || buffer_locked(p))
+ if (test_and_set_bit(BH_Wait_IO, &p->b_state)) {
+ if (buffer_dirty(p))
+ ll_rw_block(WRITE, 1, &p);
+ else if (buffer_locked(p))
+ wait_on_buffer(p);
+ }
} while (tmp != bh);

/* Restore the visibility of the page before returning. */
--- VM-locked/include/linux/fs.h.~1~ Sun Feb 18 04:01:32 2001
+++ VM-locked/include/linux/fs.h Sun Feb 18 22:59:00 2001
@@ -810,7 +810,6 @@
if (test_and_clear_bit(BH_Dirty, &bh->b_state)) {
if (bh->b_list == BUF_DIRTY)
refile_buffer(bh);
- clear_bit(BH_Wait_IO, &bh->b_state);
}
}

--- VM-locked/include/linux/locks.h.~1~ Sun Feb 18 06:31:15 2001
+++ VM-locked/include/linux/locks.h Sun Feb 18 22:59:09 2001
@@ -29,6 +29,7 @@
extern inline void unlock_buffer(struct buffer_head *bh)
{
clear_bit(BH_Lock, &bh->b_state);
+ clear_bit(BH_Wait_IO, &bh->b_state);
wake_up(&bh->b_wait);
}

Andrea