2003-02-13 20:02:57

by Patrick Mansfield

Subject: 2.5.60 and current bk oops in file_ra_state_init

Hi -

Using the scsi-misc-2.5 tree (2.5.60 plus a few changes) or the current
bk tree (as of this morning), I'm hitting an oops in file_ra_state_init.

I was trying to run a bunch of raw I/Os (/dev/raw/rawN) at once to several
(well, 25) disks, trying to maximize I/Os per second by repeatedly reading
the same block of a disk, and got the following oops. I hit this on both a
Netfinity and a NUMA-Q box.

The oops does not always occur, but it is fairly easy to hit on the NUMA-Q
(8 CPUs).

The program I'm using reads one block, seeks to the start, and repeats.

I have not tried earlier kernels.

[patman@elm3b79 patman]$ <1>Unable to handle kernel paging request at virtual address 6b6b6b6b
printing eip:
c013a389
*pde = 5a5a5a5a
Oops: 0000
CPU: 4
EIP: 0060:[<c013a389>] Not tainted
EFLAGS: 00010246
EIP is at file_ra_state_init+0x19/0x30
eax: 6b6b6b6b ebx: f329f374 ecx: 00000000 edx: f329f3c4
esi: f3f8c158 edi: f329f3e0 ebp: f3249f28 esp: f3249f24
ds: 007b es: 007b ss: 0068
Process dd (pid: 2069, threadinfo=f3248000 task=f4660680)
Stack: ffffffe9 f3249f54 c0152d2e f329f3c4 f34a5988 c0110b65 00001000 f35b5000
00008000 f51f87a8 00000000 f35b5000 f3249f9c c0152cbc f322a0c4 f51f87a8
00008000 f322a0c4 f51f87a8 0000000e bffff6b0 000003f9 00000001 00000001
Call Trace:
[<c0152d2e>] dentry_open+0x5e/0x1d0
[<c0110b65>] old_mmap+0x125/0x140
[<c0152cbc>] filp_open+0x4c/0x60
[<c0153125>] sys_open+0x35/0x80
[<c01094db>] syscall_call+0x7/0xb

Code: 8b 00 89 42 18 5f 5d c3 eb 0d 90 90 90 90 90 90 90 90 90 90
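
For what it's worth, 0x6b6b6b6b is the slab poison pattern written over
freed objects when slab debugging is enabled, so the faulting load is
chasing a pointer that was fetched out of already-freed memory. A rough
sketch of what the dying instructions are doing -- reconstructed from the
disassembly above, not quoted from the 2.5.60 tree:

/*
 * mm/readahead.c, approximately. "Code: 8b 00 89 42 18" decodes to:
 *
 *     mov (%eax),%eax      # eax = mapping->backing_dev_info->ra_pages
 *     mov %eax,0x18(%edx)  # ra->ra_pages = eax
 *
 * The first mov faults with eax == 0x6b6b6b6b, i.e. the
 * backing_dev_info pointer itself came from poisoned (freed) memory.
 */
void file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
{
	memset(ra, 0, sizeof(*ra));
	ra->ra_pages = mapping->backing_dev_info->ra_pages;	/* oopses here */
}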


Here's the first set of garbled oopses. Several CPUs oopsed at once, so
their banners came out interleaved character by character on the console;
deinterleaved as best I can, the first one reads:


[patman@elm3b79 patman]$ Unable to handle kernel paging request at virtual address 6b6b6b6b
printing eip:
c013a389
*pde = <lost in the interleave>
Oops: 0000
CPU: <lost in the interleave>
EIP: 0060:[<c013a389>] Not tainted
EFLAGS: 00010246
EIP is at file_ra_state_init+0x19/0x30
eax: 6b6b6b6b ebx: f3057ac4 ecx: 00000000 edx: f3057b14
esi: f3f8c518 edi: f3057b30 ebp: f3149f28 esp: f3149f24
ds: 007b es: 007b ss: 0068
Process reread_loop (pid: 1985, threadinfo=f3148000 task=f4509360)
Stack: ffffffe9 f3149f54 c0152d2e f3057b14 f34a5af4 f3149f44 00001000 f3018000
00000000 f51f87a8 00000003 f3018000 f3149f9c c0152cbc f322b34c f51f87a8
00000000 f322b34c f51f87a8 0000000e 40016000 00000402 00000001 00000001
Call Trace:
[<c0152d2e>] dentry_open+0x5e/0x1d0
[<c0152cbc>] filp_open+0x4c/0x60
[<c0153125>] sys_open+0x35/0x80
[<c01094db>] syscall_call+0x7/0xb

Code: 8b 00 89 42 18 5f 5d c3 eb 0d 90 90 90 90 90 90 90 90 90 90
Oops: 0000
CPU: 4
EIP: 0060:[<c013a389>] Not tainted
EFLAGS: 00010246
EIP is at file_ra_state_init+0x19/0x30
eax: 6b6b6b6b ebx: f3353be4 ecx: 00000000 edx: f3353c34
esi: f32ccbf8 edi: f3353c50 ebp: f3377f28 esp: f3377f24
ds: 007b es: 007b ss: 0068
Process reread_loop (pid: 1987, threadinfo=f3376000 task=f31f5320)
Stack: ffffffe9 f3377f54 c0152d2e f3353c34 f34a581c f3377f44 00001000 f2fcc000
00000000 f51f87a8 00000003 f2fcc000 f3377f9c c0152cbc f322a1d4 f51f87a8
00000000 f322a1d4 f51f87a8 0000000e 40016000 00000402 00000001 00000001
Call Trace:
[<c0152d2e>] dentry_open+0x5e/0x1d0
[<c0152cbc>] filp_open+0x4c/0x60
[<c0153125>] sys_open+0x35/0x80
[<c01094db>] syscall_call+0x7/0xb

Code: 8b 00 89 42 18 5f 5d c3 eb 0d 90 90 90 90 90 90 90 90 90 90
Oops: 0000
CPU: 2
EIP: 0060:[<c013a389>] Not tainted
EFLAGS: 00010246
EIP is at file_ra_state_init+0x19/0x30
eax: 6b6b6b6b ebx: f31a2d04 ecx: 00000000 edx: f31a2d54
esi: f32cc838 edi: f31a2d70 ebp: f3321f28 esp: f3321f24
ds: 007b es: 007b ss: 0068
Process reread_loop (pid: 1988, threadinfo=f3320000 task=f31f4d00)
Stack: ffffffe9 f3321f54 c0152d2e f31a2d54 f34a56b0 f3321f44 00001000 f34a8000
00000000 f51f87a8 00000003 f34a8000 f3321f9c c0152cbc f322a2e4 f51f87a8
00000000 f322a2e4 f51f87a8 0000000e 40016000 00000402 00000001 00000001
Call Trace:
[<c0152d2e>] dentry_open+0x5e/0x1d0
[<c0152cbc>] filp_open+0x4c/0x60
[<c0153125>] sys_open+0x35/0x80
[<c01094db>] syscall_call+0x7/0xb

Code: 8b 00 89 42 18 5f 5d c3 eb 0d 90 90 90 90 90 90 90 90 90 90

-- Patrick Mansfield


2003-02-13 21:06:30

by Andrew Morton

Subject: Re: 2.5.60 and current bk oops in file_ra_state_init

Patrick Mansfield <[email protected]> wrote:
>
> Hi -
>
> Using the scsi-misc-2.5 tree (2.5.60 plus a few changes) or the current
> bk tree (as of this morning), I'm hitting an oops in file_ra_state_init.

Oh lovely. Looks like the blockdev/gendisk refcounting has gone wrong and a
request queue was freed under your app's feet.

> I was trying to run a bunch of raw I/Os (/dev/raw/rawN) at once to several
> (well, 25) disks, trying to maximize I/Os per second by repeatedly reading
> the same block of a disk, and got the following oops. I hit this on both a
> Netfinity and a NUMA-Q box.

Was this test frequently opening and closing device nodes, or does it just
open them once and hold them?

Can you please prepare a testcase which I can use to reproduce this?

Be aware that there are some fairly significant problems with direct I/O at
present: direct I/O can generate multiple outstanding requests against the
same sector, and that makes the I/O scheduler keel over. But it would be
strange for that to be related to this failure.

BTW, you don't need to use the raw driver any more. Just open /dev/sda1 with
O_DIRECT. In fact, I'd be interested in seeing if this change makes the oops
go away.
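
Something along these lines is all it takes (device name just an example;
note that with glibc you need _GNU_SOURCE defined to get O_DIRECT, and the
usual O_DIRECT buffer-alignment rules still apply):

	/* hypothetical replacement for the raw-device open */
	int fd = open("/dev/sda1", O_RDONLY | O_DIRECT);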

Thanks.

2003-02-13 22:28:17

by Patrick Mansfield

Subject: Re: 2.5.60 and current bk oops in file_ra_state_init

On Thu, Feb 13, 2003 at 01:13:44PM -0800, Andrew Morton wrote:
> Was this test frequently opening and closing device nodes, or does it just
> open them once and hold them?

Opens once only.

> Can you please prepare a testcase which I can use to reproduce this?

I have only been able to reproduce it on the qla2300, running the latest
qla2300 driver, so maybe it is a qla-only problem (blah).

As per previous oops reports by wli and Martin, running with the qlogicisp
driver on the isp1020 gives an oops in isp1020_intr_handler :(

I'm rebooting again.

I was running this program simultaneously, one instance per drive, up to 25
at once, so I should have had only one I/O outstanding per disk (I've added
the O_DIRECT):

#define _GNU_SOURCE		/* glibc only exposes O_DIRECT with this */
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	int fd;
	int loop_cnt;
	char *dev_name;
	int block_size = 512;
	char buff[8192], *bufp;
	int res;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <device> <iterations>\n", argv[0]);
		exit(1);
	}
	dev_name = argv[1];
	loop_cnt = atoi(argv[2]);
	fprintf(stderr, "Re-reading device %s with %d iterations\n",
		dev_name, loop_cnt);
	fd = open(dev_name, O_RDONLY | O_DIRECT);
	if (fd == -1) {
		perror("open");
		exit(1);
	}

	/*
	 * O_DIRECT wants an aligned buffer; round buff up to the next
	 * page boundary (the array is oversized so the aligned pointer
	 * plus block_size always stays in bounds).
	 */
	bufp = (char *)(((unsigned long)buff + 4095) & ~4095UL);
	fprintf(stderr, "bufp is %p; buff is %p\n", (void *)bufp, (void *)buff);
	while (loop_cnt-- > 0) {
		res = read(fd, bufp, block_size);
		if (res != block_size) {
			fprintf(stderr, "Read only %d bytes of %d, cnt %d\n",
				res, block_size, loop_cnt);
			exit(1);
		}
		/* seek back to the start and re-read the same block */
		res = lseek(fd, 0, SEEK_SET);
		if (res == -1) {
			perror("lseek");
			exit(1);
		}
	}

	close(fd);
	return 0;
}
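
For reference, I start one instance per device along these lines (device
path and iteration count here are just illustrative):

	./reread_loop /dev/raw/raw1 1000000 &

backgrounding one of those for each of the 25 disks.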

-- Patrick Mansfield