2007-10-23 15:56:34

by Kamalesh Babulal

Subject: [BUG] 2.6.23-git18 Kernel oops in sg helpers

Hi,

A kernel oops is triggered while running the fsx-linux test, followed by a CPU
soft lockup, on the AMD box.

Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
[<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
PGD 10185b067 PUD 10075b067 PMD 0
Oops: 0002 [1] SMP
CPU 3
Modules linked in:
Pid: 18676, comm: fsx-linux Not tainted 2.6.23-git18-autokern1 #1
RIP: 0010:[<ffffffff8021f2f6>] [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
RSP: 0000:ffff810181edf948 EFLAGS: 00010002
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000004 RSI: 0000000000000002 RDI: ffffffff80573dac
RBP: ffff81018ca9a020 R08: 0000000000000004 R09: ffff810181edf8d4
R10: 00000000000000db R11: ffffffff8041926c R12: ffff81018ca9a040
R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff81018071e380(0063) knlGS:00000000f7f9a900
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000018 CR3: 0000000101281000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process fsx-linux (pid: 18676, threadinfo ffff810181ede000, task ffff810181fc0720)
Stack: 0000000300000001 ffff810100000000 ffff81018ca9a040 0000000000000001
0000000200000002 ffff81018ca9a000 ffff81010079d870 ffff810002903c40
ffff810082692000 ffff810180712bd0 ffff810002903c70 0000000002000000
Call Trace:
[<ffffffff803ecb6b>] scsi_dma_map+0x3f/0x4e
[<ffffffff803fd3fd>] mptscsih_qcmd+0x1bc/0x4af
[<ffffffff803e6b41>] scsi_dispatch_cmd+0x1e7/0x277
[<ffffffff803ec0b8>] scsi_request_fn+0x2df/0x369
[<ffffffff80350e4c>] cfq_insert_request+0x2a6/0x2ae
[<ffffffff80346b91>] elv_insert+0xcf/0x18a
[<ffffffff8034a3d6>] __make_request+0x550/0x58b
[<ffffffff8034a62e>] generic_make_request+0x1bb/0x1f0
[<ffffffff8034a737>] submit_bio+0xd4/0xdf
[<ffffffff802a13f7>] dio_bio_submit+0x52/0x66
[<ffffffff802a2107>] __blockdev_direct_IO+0x813/0xa1c
[<ffffffff80260f14>] pagevec_lookup_tag+0x1a/0x21
[<ffffffff802df355>] ext3_direct_IO+0x107/0x19e
[<ffffffff802dfd8c>] ext3_get_block+0x0/0xe2
[<ffffffff8025a7b7>] generic_file_direct_IO+0xcb/0x111
[<ffffffff8025aebb>] generic_file_aio_read+0x86/0x160
[<ffffffff8027e7a6>] do_sync_read+0xc8/0x10b
[<ffffffff80298141>] __mark_inode_dirty+0x29/0x17d
[<ffffffff80245f75>] autoremove_wake_function+0x0/0x2e
[<ffffffff80290ce3>] notify_change+0x255/0x26a
[<ffffffff802813dd>] vfs_getattr+0x2b/0x2f
[<ffffffff802814c5>] vfs_fstat+0x33/0x3a
[<ffffffff8027e894>] vfs_read+0xab/0x12e
[<ffffffff8027eb98>] sys_read+0x45/0x6e
[<ffffffff80222922>] ia32_sysret+0x0/0xa
Code: c7 41 18 00 00 00 00 8b 44 24 20 e9 7b 01 00 00 e8 27 f8 ff
RIP [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
RSP <ffff810181edf948>
CR2: 0000000000000018
BUG: soft lockup - CPU#3 stuck for 11s! [swapper:0]
CPU 3:
Modules linked in:
Pid: 0, comm: swapper Tainted: G D 2.6.23-git18-autokern1 #1
RIP: 0010:[<ffffffff8048b971>] [<ffffffff8048b971>] _spin_lock_irqsave+0x15/0x24
RSP: 0000:ffff81000177fe98 EFLAGS: 00000286
RAX: 0000000000000282 RBX: ffff81018f6f0000 RCX: ffff81018f6f0068
RDX: ffff810082692800 RSI: 0000000000000001 RDI: ffff810082692850
RBP: ffff81000177fe10 R08: 0000000000000028 R09: 0000000000000086
R10: 0000000000000001 R11: 0000000000000028 R12: ffffffff8020c256
R13: 0000000000000001 R14: ffff810082692800 R15: 0000000000000028
FS: 0000000000000000(0000) GS:ffff81018071e380(0000) knlGS:00000000f7caf080
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000810e708 CR3: 000000018195f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
<IRQ> [<ffffffff803e8c5d>] scsi_eh_scmd_add+0x2c/0x9c
[<ffffffff803e8dab>] scsi_times_out+0x0/0x87
[<ffffffff803e8e19>] scsi_times_out+0x6e/0x87
[<ffffffff8023bc46>] run_timer_softirq+0x14f/0x1a0
[<ffffffff8022d6cf>] scheduler_tick+0xff/0x10b
[<ffffffff802384e1>] __do_softirq+0x50/0xbb
[<ffffffff8020c7ac>] call_softirq+0x1c/0x28
[<ffffffff8020e54f>] do_softirq+0x2e/0x97
[<ffffffff8021c684>] smp_apic_timer_interrupt+0x3e/0x51
[<ffffffff802099ee>] default_idle+0x0/0x3d
[<ffffffff8020c256>] apic_timer_interrupt+0x66/0x70
<EOI> [<ffffffff80209a17>] default_idle+0x29/0x3d
[<ffffffff80209bc4>] cpu_idle+0x8b/0xae



--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.


2007-10-23 18:44:32

by Jens Axboe

Subject: Re: [BUG] 2.6.23-git18 Kernel oops in sg helpers

On Tue, Oct 23 2007, Kamalesh Babulal wrote:
> Hi,
>
> Kernel oops is triggered while running fsx-linux test, followed by cpu softlock
> over the AMD box
>
> Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
> [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> PGD 10185b067 PUD 10075b067 PMD 0
> Oops: 0002 [1] SMP
> CPU 3
> Modules linked in:
> Pid: 18676, comm: fsx-linux Not tainted 2.6.23-git18-autokern1 #1
> RIP: 0010:[<ffffffff8021f2f6>] [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> RSP: 0000:ffff810181edf948 EFLAGS: 00010002

Can you check where gart_map_sg+0x26c is at? Make sure you have
CONFIG_DEBUG_INFO defined, then do:

$ gdb vmlinux
(gdb) l *gart_map_sg+0x26c

Thanks!

--
Jens Axboe

2007-10-23 22:47:46

by FUJITA Tomonori

Subject: Re: [BUG] 2.6.23-git18 Kernel oops in sg helpers

On Tue, 23 Oct 2007 20:49:40 +0530
Kamalesh Babulal <[email protected]> wrote:

> Hi,
>
> Kernel oops is triggered while running fsx-linux test, followed by cpu softlock
> over the AMD box
>
> Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
> [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> PGD 10185b067 PUD 10075b067 PMD 0

Does this work?


diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
index c56e9ee..ae7e016 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -338,7 +338,6 @@ static int __dma_map_cont(struct scatterlist *start, int nelems,
 
 		BUG_ON(s != start && s->offset);
 		if (s == start) {
-			*sout = *s;
 			sout->dma_address = iommu_bus_base;
 			sout->dma_address += iommu_page*PAGE_SIZE + s->offset;
 			sout->dma_length = s->length;
@@ -365,7 +364,7 @@ static inline int dma_map_cont(struct scatterlist *start, int nelems,
 {
 	if (!need) {
 		BUG_ON(nelems != 1);
-		*sout = *start;
+		sout->dma_address = start->dma_address;
 		sout->dma_length = start->length;
 		return 0;
 	}
--
1.5.2.4

2007-10-24 08:32:44

by Jens Axboe

Subject: Re: [BUG] 2.6.23-git18 Kernel oops in sg helpers

On Wed, Oct 24 2007, FUJITA Tomonori wrote:
> On Tue, 23 Oct 2007 20:49:40 +0530
> Kamalesh Babulal <[email protected]> wrote:
>
> > Hi,
> >
> > Kernel oops is triggered while running fsx-linux test, followed by cpu softlock
> > over the AMD box
> >
> > Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
> > [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> > PGD 10185b067 PUD 10075b067 PMD 0
>
> Does this work?
>
>
> diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
> index c56e9ee..ae7e016 100644
> --- a/arch/x86/kernel/pci-gart_64.c
> +++ b/arch/x86/kernel/pci-gart_64.c
> @@ -338,7 +338,6 @@ static int __dma_map_cont(struct scatterlist *start, int nelems,
>
> BUG_ON(s != start && s->offset);
> if (s == start) {
> - *sout = *s;
> sout->dma_address = iommu_bus_base;
> sout->dma_address += iommu_page*PAGE_SIZE + s->offset;
> sout->dma_length = s->length;
> @@ -365,7 +364,7 @@ static inline int dma_map_cont(struct scatterlist *start, int nelems,
> {
> if (!need) {
> BUG_ON(nelems != 1);
> - *sout = *start;
> + sout->dma_address = start->dma_address;
> sout->dma_length = start->length;
> return 0;
> }
> --
> 1.5.2.4

Care to write up a proper changelog?

--
Jens Axboe

2007-10-24 08:51:31

by Benny Halevy

Subject: Re: [BUG] 2.6.23-git18 Kernel oops in sg helpers

On Oct. 24, 2007, 10:32 +0200, Jens Axboe <[email protected]> wrote:
> On Wed, Oct 24 2007, FUJITA Tomonori wrote:
>> On Tue, 23 Oct 2007 20:49:40 +0530
>> Kamalesh Babulal <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Kernel oops is triggered while running fsx-linux test, followed by cpu softlock
>>> over the AMD box
>>>
>>> Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
>>> [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
>>> PGD 10185b067 PUD 10075b067 PMD 0
>> Does this work?
>>
>>
>> diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
>> index c56e9ee..ae7e016 100644
>> --- a/arch/x86/kernel/pci-gart_64.c
>> +++ b/arch/x86/kernel/pci-gart_64.c
>> @@ -338,7 +338,6 @@ static int __dma_map_cont(struct scatterlist *start, int nelems,
>>
>> BUG_ON(s != start && s->offset);
>> if (s == start) {
>> - *sout = *s;
>> sout->dma_address = iommu_bus_base;
>> sout->dma_address += iommu_page*PAGE_SIZE + s->offset;
>> sout->dma_length = s->length;
>> @@ -365,7 +364,7 @@ static inline int dma_map_cont(struct scatterlist *start, int nelems,
>> {
>> if (!need) {
>> BUG_ON(nelems != 1);
>> - *sout = *start;
>> + sout->dma_address = start->dma_address;

I don't see how this could fix anything, since "s" above and "start" here are
still dereferenced. Also, this makes sout->dma_address inconsistent with
sout->page_link and with the end marker.

Benny

>> sout->dma_length = start->length;
>> return 0;
>> }
>> --
>> 1.5.2.4
>
> Care to write up a proper changelog?
>

2007-10-24 11:55:27

by Andy Whitcroft

Subject: Re: [BUG] 2.6.23-git18 Kernel oops in sg helpers

On Tue, Oct 23, 2007 at 08:44:20PM +0200, Jens Axboe wrote:
> On Tue, Oct 23 2007, Kamalesh Babulal wrote:
> > Hi,
> >
> > Kernel oops is triggered while running fsx-linux test, followed by cpu softlock
> > over the AMD box
> >
> > Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
> > [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> > PGD 10185b067 PUD 10075b067 PMD 0
> > Oops: 0002 [1] SMP
> > CPU 3
> > Modules linked in:
> > Pid: 18676, comm: fsx-linux Not tainted 2.6.23-git18-autokern1 #1
> > RIP: 0010:[<ffffffff8021f2f6>] [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> > RSP: 0000:ffff810181edf948 EFLAGS: 00010002
>
> Can you check where gart_map_sg+0x26c is at? Make sure you have
> CONFIG_DEBUG_INFO defined, then do:
>
> $ gdb vmlinux
> $ l *gart_map_sg+0x26c

Ok, this problem still seems to be about in 2.6.24-rc1. Here is the gdb
output from that version, the panic (also below) seems the same:

(gdb) l *gart_map_sg+0x26c
0xffffffff8022011e is in gart_map_sg (arch/x86/kernel/pci-gart_64.c:433).
428 goto error;
429 out++;
430 flush_gart();
431 if (out < nents) {
432 sgmap = sg_next(sgmap);
433 sgmap->dma_length = 0;
434 }
435 return out;
436
437 error:

So it seems sg_next has returned 0.

-apw

elm3b6 login: -- 0:conmux-control -- time-stamp -- Oct/24/07 3:31:05 --
-- 0:conmux-control -- time-stamp -- Oct/24/07 3:46:40 --
Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
[<ffffffff8022011e>] gart_map_sg+0x26c/0x406
PGD 101a8f067 PUD 10193c067 PMD 0
Oops: 0002 [1] SMP
CPU 3
Modules linked in:
Pid: 18339, comm: fsx-linux Not tainted 2.6.24-rc1-autokern1 #1
RIP: 0010:[<ffffffff8022011e>] [<ffffffff8022011e>] gart_map_sg+0x26c/0x406
RSP: 0000:ffff810181e03948 EFLAGS: 00010002
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000004 RSI: 0000000000000002 RDI: ffffffff8057918c
RBP: ffff810181d0d820 R08: 0000000000000004 R09: ffff810181e038d4
R10: 00000000000000db R11: ffffffff804198f0 R12: ffff810181d0d840
R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff81018071e380(0063) knlGS:00000000f7fb9900
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000018 CR3: 0000000101a39000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process fsx-linux (pid: 18339, threadinfo ffff810181e02000, task ffff810181f2f560)
Stack: 0000000300000001 ffff810100000000 ffff810181d0d840 0000000000000001
0000000200000002 ffff810181d0d800 ffff810100773870 ffff810002905da0
ffff8100022d6000 ffff8101807082f0 ffff810002905dd0 0000000002000000
Call Trace:
[<ffffffff803ed20b>] scsi_dma_map+0x3f/0x4e
[<ffffffff803fda81>] mptscsih_qcmd+0x1bc/0x4af
[<ffffffff803e71ad>] scsi_dispatch_cmd+0x1e7/0x277
[<ffffffff803ec758>] scsi_request_fn+0x2df/0x369
[<ffffffff803514a8>] cfq_insert_request+0x2a6/0x2ae
[<ffffffff803471f5>] elv_insert+0xcf/0x18a
[<ffffffff8034aa31>] __make_request+0x550/0x58b
[<ffffffff8034ac89>] generic_make_request+0x1bb/0x1f0
[<ffffffff8034ad92>] submit_bio+0xd4/0xdf
[<ffffffff802a15fb>] dio_bio_submit+0x52/0x66
[<ffffffff802a230b>] __blockdev_direct_IO+0x813/0xa1c
[<ffffffff80261108>] pagevec_lookup_tag+0x1a/0x21
[<ffffffff802df9b9>] ext3_direct_IO+0x107/0x19e
[<ffffffff802e03f0>] ext3_get_block+0x0/0xe2
[<ffffffff8025a9ab>] generic_file_direct_IO+0xcb/0x111
[<ffffffff8025b0af>] generic_file_aio_read+0x86/0x160
[<ffffffff8027e9a2>] do_sync_read+0xc8/0x10b
[<ffffffff80298345>] __mark_inode_dirty+0x29/0x17d
[<ffffffff80246141>] autoremove_wake_function+0x0/0x2e
[<ffffffff80290ee7>] notify_change+0x255/0x26a
[<ffffffff802815d9>] vfs_getattr+0x2b/0x2f
[<ffffffff802816c1>] vfs_fstat+0x33/0x3a
[<ffffffff8027ea90>] vfs_read+0xab/0x12e
[<ffffffff8027ed94>] sys_read+0x45/0x6e
[<ffffffff802229d2>] ia32_sysret+0x0/0xa


Code: c7 41 18 00 00 00 00 8b 44 24 20 e9 7b 01 00 00 e8 27 f8 ff
RIP [<ffffffff8022011e>] gart_map_sg+0x26c/0x406
RSP <ffff810181e03948>
CR2: 0000000000000018

2007-10-24 12:26:41

by Jens Axboe

Subject: Re: [BUG] 2.6.23-git18 Kernel oops in sg helpers

On Wed, Oct 24 2007, Andy Whitcroft wrote:
> On Tue, Oct 23, 2007 at 08:44:20PM +0200, Jens Axboe wrote:
> > On Tue, Oct 23 2007, Kamalesh Babulal wrote:
> > > Hi,
> > >
> > > Kernel oops is triggered while running fsx-linux test, followed by cpu softlock
> > > over the AMD box
> > >
> > > Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
> > > [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> > > PGD 10185b067 PUD 10075b067 PMD 0
> > > Oops: 0002 [1] SMP
> > > CPU 3
> > > Modules linked in:
> > > Pid: 18676, comm: fsx-linux Not tainted 2.6.23-git18-autokern1 #1
> > > RIP: 0010:[<ffffffff8021f2f6>] [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> > > RSP: 0000:ffff810181edf948 EFLAGS: 00010002
> >
> > Can you check where gart_map_sg+0x26c is at? Make sure you have
> > CONFIG_DEBUG_INFO defined, then do:
> >
> > $ gdb vmlinux
> > $ l *gart_map_sg+0x26c
>
> Ok, this problem still seems to be about in 2.6.24-rc1. Here is the gdb
> output from that version, the panic (also below) seems the same:
>
> (gdb) l *gart_map_sg+0x26c
> 0xffffffff8022011e is in gart_map_sg (arch/x86/kernel/pci-gart_64.c:433).
> 428 goto error;
> 429 out++;
> 430 flush_gart();
> 431 if (out < nents) {
> 432 sgmap = sg_next(sgmap);
> 433 sgmap->dma_length = 0;
> 434 }
> 435 return out;
> 436
> 437 error:
>
> So it seems sg_next has returned 0.

Interesting. Can you add a

printk("mapped %d of %d\n", out, nents);

prior to that sg_next() call and reproduce?

--
Jens Axboe

2007-10-24 12:46:52

by FUJITA Tomonori

Subject: Re: [BUG] 2.6.23-git18 Kernel oops in sg helpers

On Wed, 24 Oct 2007 12:54:36 +0100
Andy Whitcroft <[email protected]> wrote:

> On Tue, Oct 23, 2007 at 08:44:20PM +0200, Jens Axboe wrote:
> > On Tue, Oct 23 2007, Kamalesh Babulal wrote:
> > > Hi,
> > >
> > > Kernel oops is triggered while running fsx-linux test, followed by cpu softlock
> > > over the AMD box
> > >
> > > Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
> > > [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> > > PGD 10185b067 PUD 10075b067 PMD 0
> > > Oops: 0002 [1] SMP
> > > CPU 3
> > > Modules linked in:
> > > Pid: 18676, comm: fsx-linux Not tainted 2.6.23-git18-autokern1 #1
> > > RIP: 0010:[<ffffffff8021f2f6>] [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> > > RSP: 0000:ffff810181edf948 EFLAGS: 00010002
> >
> > Can you check where gart_map_sg+0x26c is at? Make sure you have
> > CONFIG_DEBUG_INFO defined, then do:
> >
> > $ gdb vmlinux
> > $ l *gart_map_sg+0x26c
>
> Ok, this problem still seems to be about in 2.6.24-rc1. Here is the gdb
> output from that version, the panic (also below) seems the same:
>
> (gdb) l *gart_map_sg+0x26c
> 0xffffffff8022011e is in gart_map_sg (arch/x86/kernel/pci-gart_64.c:433).
> 428 goto error;
> 429 out++;
> 430 flush_gart();
> 431 if (out < nents) {
> 432 sgmap = sg_next(sgmap);
> 433 sgmap->dma_length = 0;
> 434 }
> 435 return out;
> 436
> 437 error:
>
> So it seems sg_next has returned 0.

Have you tried this?

http://marc.info/?l=linux-kernel&m=119317981406073&w=2

2007-10-24 16:08:54

by Kamalesh Babulal

Subject: Re: [BUG] 2.6.23-git18 Kernel oops in sg helpers

FUJITA Tomonori wrote:
> On Wed, 24 Oct 2007 12:54:36 +0100
> Andy Whitcroft <[email protected]> wrote:
>
>> On Tue, Oct 23, 2007 at 08:44:20PM +0200, Jens Axboe wrote:
>>> On Tue, Oct 23 2007, Kamalesh Babulal wrote:
>>>> Hi,
>>>>
>>>> Kernel oops is triggered while running fsx-linux test, followed by cpu softlock
>>>> over the AMD box
>>>>
>>>> Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
>>>> [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
>>>> PGD 10185b067 PUD 10075b067 PMD 0
>>>> Oops: 0002 [1] SMP
>>>> CPU 3
>>>> Modules linked in:
>>>> Pid: 18676, comm: fsx-linux Not tainted 2.6.23-git18-autokern1 #1
>>>> RIP: 0010:[<ffffffff8021f2f6>] [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
>>>> RSP: 0000:ffff810181edf948 EFLAGS: 00010002
>>> Can you check where gart_map_sg+0x26c is at? Make sure you have
>>> CONFIG_DEBUG_INFO defined, then do:
>>>
>>> $ gdb vmlinux
>>> $ l *gart_map_sg+0x26c
>> Ok, this problem still seems to be about in 2.6.24-rc1. Here is the gdb
>> output from that version, the panic (also below) seems the same:
>>
>> (gdb) l *gart_map_sg+0x26c
>> 0xffffffff8022011e is in gart_map_sg (arch/x86/kernel/pci-gart_64.c:433).
>> 428 goto error;
>> 429 out++;
>> 430 flush_gart();
>> 431 if (out < nents) {
>> 432 sgmap = sg_next(sgmap);
>> 433 sgmap->dma_length = 0;
>> 434 }
>> 435 return out;
>> 436
>> 437 error:
>>
>> So it seems sg_next has returned 0.
>
> Have you tried this?
>
> http://marc.info/?l=linux-kernel&m=119317981406073&w=2
> -
Hi,
Thanks, this patch solves the kernel oops.
--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.

2007-10-24 18:07:26

by Jens Axboe

Subject: Re: [BUG] 2.6.23-git18 Kernel oops in sg helpers

On Wed, Oct 24 2007, Kamalesh Babulal wrote:
> FUJITA Tomonori wrote:
> > On Wed, 24 Oct 2007 12:54:36 +0100
> > Andy Whitcroft <[email protected]> wrote:
> >
> >> On Tue, Oct 23, 2007 at 08:44:20PM +0200, Jens Axboe wrote:
> >>> On Tue, Oct 23 2007, Kamalesh Babulal wrote:
> >>>> Hi,
> >>>>
> >>>> Kernel oops is triggered while running fsx-linux test, followed by cpu softlock
> >>>> over the AMD box
> >>>>
> >>>> Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
> >>>> [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> >>>> PGD 10185b067 PUD 10075b067 PMD 0
> >>>> Oops: 0002 [1] SMP
> >>>> CPU 3
> >>>> Modules linked in:
> >>>> Pid: 18676, comm: fsx-linux Not tainted 2.6.23-git18-autokern1 #1
> >>>> RIP: 0010:[<ffffffff8021f2f6>] [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> >>>> RSP: 0000:ffff810181edf948 EFLAGS: 00010002
> >>> Can you check where gart_map_sg+0x26c is at? Make sure you have
> >>> CONFIG_DEBUG_INFO defined, then do:
> >>>
> >>> $ gdb vmlinux
> >>> $ l *gart_map_sg+0x26c
> >> Ok, this problem still seems to be about in 2.6.24-rc1. Here is the gdb
> >> output from that version, the panic (also below) seems the same:
> >>
> >> (gdb) l *gart_map_sg+0x26c
> >> 0xffffffff8022011e is in gart_map_sg (arch/x86/kernel/pci-gart_64.c:433).
> >> 428 goto error;
> >> 429 out++;
> >> 430 flush_gart();
> >> 431 if (out < nents) {
> >> 432 sgmap = sg_next(sgmap);
> >> 433 sgmap->dma_length = 0;
> >> 434 }
> >> 435 return out;
> >> 436
> >> 437 error:
> >>
> >> So it seems sg_next has returned 0.
> >
> > Have you tried this?
> >
> > http://marc.info/?l=linux-kernel&m=119317981406073&w=2
> > -
> Hi,
> Thanks, this patch solves the kernel oops.

Tomo, please do write the proper changelog so we can get this upstream.

--
Jens Axboe

2007-10-24 22:22:59

by FUJITA Tomonori

Subject: Re: [BUG] 2.6.23-git18 Kernel oops in sg helpers

On Wed, 24 Oct 2007 21:38:30 +0530
Kamalesh Babulal <[email protected]> wrote:

> FUJITA Tomonori wrote:
> > On Wed, 24 Oct 2007 12:54:36 +0100
> > Andy Whitcroft <[email protected]> wrote:
> >
> >> On Tue, Oct 23, 2007 at 08:44:20PM +0200, Jens Axboe wrote:
> >>> On Tue, Oct 23 2007, Kamalesh Babulal wrote:
> >>>> Hi,
> >>>>
> >>>> Kernel oops is triggered while running fsx-linux test, followed by cpu softlock
> >>>> over the AMD box
> >>>>
> >>>> Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
> >>>> [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> >>>> PGD 10185b067 PUD 10075b067 PMD 0
> >>>> Oops: 0002 [1] SMP
> >>>> CPU 3
> >>>> Modules linked in:
> >>>> Pid: 18676, comm: fsx-linux Not tainted 2.6.23-git18-autokern1 #1
> >>>> RIP: 0010:[<ffffffff8021f2f6>] [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> >>>> RSP: 0000:ffff810181edf948 EFLAGS: 00010002
> >>> Can you check where gart_map_sg+0x26c is at? Make sure you have
> >>> CONFIG_DEBUG_INFO defined, then do:
> >>>
> >>> $ gdb vmlinux
> >>> $ l *gart_map_sg+0x26c
> >> Ok, this problem still seems to be about in 2.6.24-rc1. Here is the gdb
> >> output from that version, the panic (also below) seems the same:
> >>
> >> (gdb) l *gart_map_sg+0x26c
> >> 0xffffffff8022011e is in gart_map_sg (arch/x86/kernel/pci-gart_64.c:433).
> >> 428 goto error;
> >> 429 out++;
> >> 430 flush_gart();
> >> 431 if (out < nents) {
> >> 432 sgmap = sg_next(sgmap);
> >> 433 sgmap->dma_length = 0;
> >> 434 }
> >> 435 return out;
> >> 436
> >> 437 error:
> >>
> >> So it seems sg_next has returned 0.
> >
> > Have you tried this?
> >
> > http://marc.info/?l=linux-kernel&m=119317981406073&w=2
> > -
> Hi,
> Thanks, this patch solves the kernel oops.

Thanks for testing!

Jens, here's the proper changelog.

-
From: FUJITA Tomonori <[email protected]>
Subject: [PATCH] x86: pci-gart fix

map_sg can copy the last sg element to another position when it merges
elements. That also copies the end-of-list marker, which breaks sg
chaining. Copy only dma_address/dma_length instead of the whole sg element.

Signed-off-by: FUJITA Tomonori <[email protected]>
---
arch/x86/kernel/pci-gart_64.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
index c56e9ee..ae7e016 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -338,7 +338,6 @@ static int __dma_map_cont(struct scatterlist *start, int nelems,
 
 		BUG_ON(s != start && s->offset);
 		if (s == start) {
-			*sout = *s;
 			sout->dma_address = iommu_bus_base;
 			sout->dma_address += iommu_page*PAGE_SIZE + s->offset;
 			sout->dma_length = s->length;
@@ -365,7 +364,7 @@ static inline int dma_map_cont(struct scatterlist *start, int nelems,
 {
 	if (!need) {
 		BUG_ON(nelems != 1);
-		*sout = *start;
+		sout->dma_address = start->dma_address;
 		sout->dma_length = start->length;
 		return 0;
 	}
--
1.5.2.4

2007-10-25 05:34:28

by Jens Axboe

Subject: Re: [BUG] 2.6.23-git18 Kernel oops in sg helpers

On Thu, Oct 25 2007, FUJITA Tomonori wrote:
> On Wed, 24 Oct 2007 21:38:30 +0530
> Kamalesh Babulal <[email protected]> wrote:
>
> > FUJITA Tomonori wrote:
> > > On Wed, 24 Oct 2007 12:54:36 +0100
> > > Andy Whitcroft <[email protected]> wrote:
> > >
> > >> On Tue, Oct 23, 2007 at 08:44:20PM +0200, Jens Axboe wrote:
> > >>> On Tue, Oct 23 2007, Kamalesh Babulal wrote:
> > >>>> Hi,
> > >>>>
> > >>>> Kernel oops is triggered while running fsx-linux test, followed by cpu softlock
> > >>>> over the AMD box
> > >>>>
> > >>>> Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
> > >>>> [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> > >>>> PGD 10185b067 PUD 10075b067 PMD 0
> > >>>> Oops: 0002 [1] SMP
> > >>>> CPU 3
> > >>>> Modules linked in:
> > >>>> Pid: 18676, comm: fsx-linux Not tainted 2.6.23-git18-autokern1 #1
> > >>>> RIP: 0010:[<ffffffff8021f2f6>] [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
> > >>>> RSP: 0000:ffff810181edf948 EFLAGS: 00010002
> > >>> Can you check where gart_map_sg+0x26c is at? Make sure you have
> > >>> CONFIG_DEBUG_INFO defined, then do:
> > >>>
> > >>> $ gdb vmlinux
> > >>> $ l *gart_map_sg+0x26c
> > >> Ok, this problem still seems to be about in 2.6.24-rc1. Here is the gdb
> > >> output from that version, the panic (also below) seems the same:
> > >>
> > >> (gdb) l *gart_map_sg+0x26c
> > >> 0xffffffff8022011e is in gart_map_sg (arch/x86/kernel/pci-gart_64.c:433).
> > >> 428 goto error;
> > >> 429 out++;
> > >> 430 flush_gart();
> > >> 431 if (out < nents) {
> > >> 432 sgmap = sg_next(sgmap);
> > >> 433 sgmap->dma_length = 0;
> > >> 434 }
> > >> 435 return out;
> > >> 436
> > >> 437 error:
> > >>
> > >> So it seems sg_next has returned 0.
> > >
> > > Have you tried this?
> > >
> > > http://marc.info/?l=linux-kernel&m=119317981406073&w=2
> > > -
> > Hi,
> > Thanks, this patch solves the kernel oops.
>
> Thanks for testing!
>
> Jens, here's the proper changelog.

Thanks, applied!

--
Jens Axboe

2007-10-25 08:53:25

by Benny Halevy

Subject: Re: [BUG] 2.6.23-git18 Kernel oops in sg helpers

On Oct. 24, 2007, 10:50 +0200, Benny Halevy <[email protected]> wrote:
> On Oct. 24, 2007, 10:32 +0200, Jens Axboe <[email protected]> wrote:
>> On Wed, Oct 24 2007, FUJITA Tomonori wrote:
>>> On Tue, 23 Oct 2007 20:49:40 +0530
>>> Kamalesh Babulal <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Kernel oops is triggered while running fsx-linux test, followed by cpu softlock
>>>> over the AMD box
>>>>
>>>> Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
>>>> [<ffffffff8021f2f6>] gart_map_sg+0x26c/0x406
>>>> PGD 10185b067 PUD 10075b067 PMD 0
>>> Does this work?
>>>
>>>
>>> diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
>>> index c56e9ee..ae7e016 100644
>>> --- a/arch/x86/kernel/pci-gart_64.c
>>> +++ b/arch/x86/kernel/pci-gart_64.c
>>> @@ -338,7 +338,6 @@ static int __dma_map_cont(struct scatterlist *start, int nelems,
>>>
>>> BUG_ON(s != start && s->offset);
>>> if (s == start) {
>>> - *sout = *s;
>>> sout->dma_address = iommu_bus_base;
>>> sout->dma_address += iommu_page*PAGE_SIZE + s->offset;
>>> sout->dma_length = s->length;
>>> @@ -365,7 +364,7 @@ static inline int dma_map_cont(struct scatterlist *start, int nelems,
>>> {
>>> if (!need) {
>>> BUG_ON(nelems != 1);
>>> - *sout = *start;
>>> + sout->dma_address = start->dma_address;
>
> I don't see this could fix anything since "s" above and "start" here are still
> dereferenced. Also, this makes sout->dma_address inconsistent with sout->page_link
> and with the end marker.

OK, it took me a day to figure out why the fix is working :)
The end of list marker was copied into sout and later, in line 432
sg_next(sgmap) returned NULL since sgmap became the last entry in the list
(which is strangely correct in the dma mapped vector).

431: if (out < nents) {
432: sgmap = sg_next(sgmap);
433: sgmap->dma_length = 0;
434: }

Alas, the dma mapping convention apparently requires dma_length == 0
as a terminator if the "compressed" list for dma mapping is shorter than
the sg list.

Although this change does not keep each sg->dma_address in sync with each
sg->page_link, previously there was nothing to keep sg->length in sync with
sg->dma_length so I actually think that keeping the dma mapping and the
page mappings orthogonal and independent may be even better since the
original sg list can still be reused safely even after dma mapping.

>
> Benny
>
>>> sout->dma_length = start->length;
>>> return 0;
>>> }
>>> --
>>> 1.5.2.4
>> Care to write up a proper changelog?
>>