2006-05-19 20:49:53

by Christian Kujau

Subject: SCSI ABORT with 2.6.17-rc4-mm1

[sorry for repost, local MTA problems here...]

Hi list, Hi Andrew,

I cannot boot 2.6.17-rc4-mm1 because my rootdisk is a scsi disk and upon
scsi-init (SYM53C8XX_2) I'm getting:

May 19 15:39:55 prinz sym0: <895> rev 0x1 at pci 0000:02:09.0 irq 161
May 19 15:39:55 prinz sym0: Tekram NVRAM, ID 7, Fast-40, LVD, parity checking
May 19 15:39:55 prinz sym0: SCSI BUS has been reset.
May 19 15:39:55 prinz scsi0 : sym-2.2.3
May 19 15:40:08 prinz 0:0:0:0: ABORT operation started.
May 19 15:40:13 prinz 0:0:0:0: ABORT operation timed-out.
May 19 15:40:13 prinz 0:0:0:0: DEVICE RESET operation started.
May 19 15:40:18 prinz 0:0:0:0: DEVICE RESET operation timed-out.
May 19 15:40:18 prinz 0:0:0:0: BUS RESET operation started.
May 19 15:40:23 prinz 0:0:0:0: BUS RESET operation timed-out.
May 19 15:40:23 prinz 0:0:0:0: HOST RESET operation started.
May 19 15:40:23 prinz sym0: SCSI BUS has been reset.
May 19 15:40:28 prinz 0:0:0:0: HOST RESET operation timed-out.
May 19 15:40:28 prinz 0:0:0:0: scsi: Device offlined - not ready after
error recovery
May 19 15:40:33 prinz 0:0:1:0: ABORT operation started.
May 19 15:40:38 prinz 0:0:1:0: ABORT operation timed-out.
May 19 15:40:38 prinz 0:0:1:0: DEVICE RESET operation started.
May 19 15:40:43 prinz 0:0:1:0: DEVICE RESET operation timed-out.
May 19 15:40:43 prinz 0:0:1:0: BUS RESET operation started.

I have backed out drivers-scsi-use-array_size-macro.patch, but to no
avail. There are other scsi-related patches in the broken-out
mm-directory, any hint which one to try first? Sometimes they're dependent
on each other, so I find it not easy to just "patch -R" all "*scsi*.patch"
files.
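
A minimal sketch of one way to pick a back-out order, assuming the
broken-out directory ships a quilt-style `series` file that lists the
patches in the order they were applied (the function name and the keyword
filter are hypothetical, not part of any -mm tooling). Reverting in
reverse apply order avoids most rejects between the scsi patches
themselves, though it does not help with dependencies on non-scsi patches:

```python
def revert_order(series_lines, keyword="scsi"):
    """Return patches whose name contains `keyword`, last applied first.

    `series_lines` is the content of the (assumed) series file, one patch
    name per line; blank lines and '#' comments are skipped.
    """
    names = [
        line.strip()
        for line in series_lines
        if line.strip() and not line.lstrip().startswith("#")
    ]
    # "patch -R" has to undo the most recently applied patch first.
    return [name for name in reversed(names) if keyword in name]
```

Each returned name would then be fed to "patch -R -p1" in turn.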

Please see http://www.nerdbynature.de/bits/2.6.17-rc4-mm1/ for
netconsole dmesg logs of 2.6.17-rc4 (working fine) and of -mm1.

I've tried different .configs for -mm1, created with:

- yes '' | make oldconfig (config-2.6-mm.2.6.17-rc4-mm1.oldconfig_default)
- yes 'N' | make oldconfig (config-2.6-mm.2.6.17-rc4-mm1.oldconfig_no)
- make oldconfig (interactive, config-2.6-mm.2.6.17-rc4-mm1.oldconfig_my)

Thanks,
Christian.
--
BOFH excuse #442:

Trojan horse ran out of hay


2006-05-19 21:07:56

by Andrew Morton

Subject: Re: SCSI ABORT with 2.6.17-rc4-mm1

"Christian Kujau" <[email protected]> wrote:
>
> [sorry for repost, local MTA problems here...]
>
> Hi list, Hi Andrew,
>
> I cannot boot 2.6.17-rc4-mm1 because my rootdisk is a scsi disk and upon
> scsi-init (SYM53C8XX_2) I'm getting:
>
> May 19 15:39:55 prinz sym0: <895> rev 0x1 at pci 0000:02:09.0 irq 161
> May 19 15:39:55 prinz sym0: Tekram NVRAM, ID 7, Fast-40, LVD, parity checking
> May 19 15:39:55 prinz sym0: SCSI BUS has been reset.
> May 19 15:39:55 prinz scsi0 : sym-2.2.3
> May 19 15:40:08 prinz 0:0:0:0: ABORT operation started.
> May 19 15:40:13 prinz 0:0:0:0: ABORT operation timed-out.
> May 19 15:40:13 prinz 0:0:0:0: DEVICE RESET operation started.
> May 19 15:40:18 prinz 0:0:0:0: DEVICE RESET operation timed-out.
> May 19 15:40:18 prinz 0:0:0:0: BUS RESET operation started.
> May 19 15:40:23 prinz 0:0:0:0: BUS RESET operation timed-out.
> May 19 15:40:23 prinz 0:0:0:0: HOST RESET operation started.
> May 19 15:40:23 prinz sym0: SCSI BUS has been reset.
> May 19 15:40:28 prinz 0:0:0:0: HOST RESET operation timed-out.
> May 19 15:40:28 prinz 0:0:0:0: scsi: Device offlined - not ready after
> error recovery
> May 19 15:40:33 prinz 0:0:1:0: ABORT operation started.
> May 19 15:40:38 prinz 0:0:1:0: ABORT operation timed-out.
> May 19 15:40:38 prinz 0:0:1:0: DEVICE RESET operation started.
> May 19 15:40:43 prinz 0:0:1:0: DEVICE RESET operation timed-out.
> May 19 15:40:43 prinz 0:0:1:0: BUS RESET operation started.
>
> I have backed out drivers-scsi-use-array_size-macro.patch, but to no
> avail. There are other scsi-related patches in the broken-out
> mm-directory, any hint which one to try first? Sometimes they're dependent
> on each other, so I find it not easy to just "patch -R" all "*scsi*.patch"
> files.
>
> Please see http://www.nerdbynature.de/bits/2.6.17-rc4-mm1/ for
> netconsole dmesg logs of 2.6.17-rc4 (working fine) and of -mm1.
>
> I've tried different .configs for -mm1, created with:
>
> - yes '' | make oldconfig (config-2.6-mm.2.6.17-rc4-mm1.oldconfig_default)
> - yes 'N' | make oldconfig (config-2.6-mm.2.6.17-rc4-mm1.oldconfig_no)
> - make oldconfig (interactive, config-2.6-mm.2.6.17-rc4-mm1.oldconfig_my)
>

Thanks for the report, and thanks for testing. The full dmesg output
really helps.


It goes pear-shaped very early:

--- prinz64-nc.2.6.17-rc4.log Fri May 19 13:56:34 2006
+++ prinz64-nc.2.6.17-rc4-mm1.log Fri May 19 13:56:58 2006
@@ -12,20 +12,17 @@
BIOS-e820: 00000000fefffc00 - 00000000ff000000 (reserved)
BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
DMI 2.2 present.
+ACPI: Unable to map RSDT header
+node 0 zone Normal missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
+node 0 zone HighMem missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES


And from then on, ACPI is kaput. So your interrupts are kaput, as is the
disk controller.
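
For anyone wanting to repeat the comparison, here is a rough sketch using
Python's standard difflib module. It assumes both boot logs have already
been read into lists of lines with the syslog timestamps stripped; the
function names and default file names are made up for illustration:

```python
import difflib


def diff_boot_logs(good_lines, bad_lines, good_name="good.log", bad_name="bad.log"):
    """Return a unified diff (list of lines) between two boot logs."""
    return list(
        difflib.unified_diff(
            good_lines, bad_lines, fromfile=good_name, tofile=bad_name, lineterm=""
        )
    )


def new_messages(diff_lines):
    """Return messages that only appear in the second (failing) log."""
    # The first two diff lines are the '---' / '+++' file headers.
    return [line[1:] for line in diff_lines[2:] if line.startswith("+")]
```

Messages that show up only in the failing boot, such as the ACPI line
above, are usually the first place to look.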

I had some of this happening too - it's due to some of the MM patches from
Mel and/or Andy. I also managed to provoke "Too many memory regions,
truncating" out of it.


I hope that's all sorted out now. Please test next -mm (hopefully
tomorrow) and let us know?

Or, if you're super-keen,
http://www.zip.com.au/~akpm/linux/patches/stuff/x.bz2 is my current rollup
(against 2.6.17-rc4). It was compilable this morning, but I've since
merged stuff ;) It would be interesting to know if that has fixed the bug.

2006-05-19 22:57:51

by Mel Gorman

Subject: Re: SCSI ABORT with 2.6.17-rc4-mm1

> "Christian Kujau" <[email protected]> wrote:
>>
>> [sorry for repost, local MTA problems here...]
>>
>> Hi list, Hi Andrew,
>>
>> I cannot boot 2.6.17-rc4-mm1 because my rootdisk is a scsi disk and upon
>> scsi-init (SYM53C8XX_2) I'm getting:
>>
>> May 19 15:39:55 prinz sym0: <895> rev 0x1 at pci 0000:02:09.0 irq 161
>> May 19 15:39:55 prinz sym0: Tekram NVRAM, ID 7, Fast-40, LVD, parity checking
>> May 19 15:39:55 prinz sym0: SCSI BUS has been reset.
>> May 19 15:39:55 prinz scsi0 : sym-2.2.3
>> May 19 15:40:08 prinz 0:0:0:0: ABORT operation started.
>> May 19 15:40:13 prinz 0:0:0:0: ABORT operation timed-out.
>> May 19 15:40:13 prinz 0:0:0:0: DEVICE RESET operation started.
>> May 19 15:40:18 prinz 0:0:0:0: DEVICE RESET operation timed-out.
>> May 19 15:40:18 prinz 0:0:0:0: BUS RESET operation started.
>> May 19 15:40:23 prinz 0:0:0:0: BUS RESET operation timed-out.
>> May 19 15:40:23 prinz 0:0:0:0: HOST RESET operation started.
>> May 19 15:40:23 prinz sym0: SCSI BUS has been reset.
>> May 19 15:40:28 prinz 0:0:0:0: HOST RESET operation timed-out.
>> May 19 15:40:28 prinz 0:0:0:0: scsi: Device offlined - not ready after
>> error recovery
>> May 19 15:40:33 prinz 0:0:1:0: ABORT operation started.
>> May 19 15:40:38 prinz 0:0:1:0: ABORT operation timed-out.
>> May 19 15:40:38 prinz 0:0:1:0: DEVICE RESET operation started.
>> May 19 15:40:43 prinz 0:0:1:0: DEVICE RESET operation timed-out.
>> May 19 15:40:43 prinz 0:0:1:0: BUS RESET operation started.
>>
>> I have backed out drivers-scsi-use-array_size-macro.patch, but to no
>> avail. There are other scsi-related patches in the broken-out
>> mm-directory, any hint which one to try first? Sometimes they're dependent
>> on each other, so I find it not easy to just "patch -R" all "*scsi*.patch"
>> files.
>>
>> Please see http://www.nerdbynature.de/bits/2.6.17-rc4-mm1/ for
>> netconsole dmesg logs of 2.6.17-rc4 (working fine) and of -mm1.
>>
>> I've tried different .configs for -mm1, created with:
>>
>> - yes '' | make oldconfig (config-2.6-mm.2.6.17-rc4-mm1.oldconfig_default)
>> - yes 'N' | make oldconfig (config-2.6-mm.2.6.17-rc4-mm1.oldconfig_no)
>> - make oldconfig (interactive, config-2.6-mm.2.6.17-rc4-mm1.oldconfig_my)
>>
>
> Thanks for the report, and thanks for testing. The full dmesg output
> really helps.
>
>
> It goes pear-shaped very early:
>
> --- prinz64-nc.2.6.17-rc4.log Fri May 19 13:56:34 2006
> +++ prinz64-nc.2.6.17-rc4-mm1.log Fri May 19 13:56:58 2006
> @@ -12,20 +12,17 @@
> BIOS-e820: 00000000fefffc00 - 00000000ff000000 (reserved)
> BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
> DMI 2.2 present.
> +ACPI: Unable to map RSDT header
> +node 0 zone Normal missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
> +node 0 zone HighMem missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
>
>
> And from then on, ACPI is kaput. So your interrupts are kaput, as is the
> disk controller.
>
> I had some of this happening too - it's due to some of the MM patches from
> Mel and/or Andy.

The warnings in this case are valid but, I would think, harmless. ZONE_NORMAL
on x86_64 begins at MAX_DMA32_PFN on the 4GiB boundary, which is MAX_ORDER
aligned. From the e820 map, I am guessing the machine has 1GiB of memory,
so the normal and highmem zones are empty. Andy's latest patches should
catch that.

The places where I now expect to see zone alignment error messages are
those where the lowest PFN in a node is not aligned, so the zone appears
to start unaligned. As the node_mem_map is aligned to the MAX_ORDER
boundary, we will see the warning, but it will again be harmless.
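
To illustrate what "MAX_ORDER aligned" means here, a rough sketch (this is
not the kernel code; it assumes 4KiB pages and the default MAX_ORDER of 11,
so the largest buddy block is 2^(MAX_ORDER-1) = 1024 pages, and the helper
names are made up):

```python
PAGE_SHIFT = 12  # assumed: 4 KiB pages
MAX_ORDER = 11   # assumed default; largest block is 2**(MAX_ORDER - 1) pages


def max_order_nr_pages(max_order=MAX_ORDER):
    """Number of pages in the largest buddy allocator block."""
    return 1 << (max_order - 1)


def pfn_of(phys_addr):
    """Page frame number for a physical address."""
    return phys_addr >> PAGE_SHIFT


def is_max_order_aligned(pfn, max_order=MAX_ORDER):
    """True if pfn sits on a boundary of the largest buddy block."""
    return (pfn & (max_order_nr_pages(max_order) - 1)) == 0
```

Under those assumptions the 4GiB boundary is PFN 0x100000, which passes
the check, while any start PFN that is not a multiple of 1024 would
trigger the warning.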

I am struggling to see how the alignment patches or
arch-independent-zone-sizing would clobber the mapping of the ACPI table :(

> I also managed to provoke "Too many memory regions,
> truncating" out of it.
>

"Too many memory regions, truncating" is of concern because memory will be
effectively lost. Is this on x86_64 as well? If so, I need to submit a
patch that sets CONFIG_MAX_ACTIVE_REGIONS to 128 on x86_64, which is the
same value as E820MAX. This is similar to what PPC64 does for LMB regions
(see MAX_ACTIVE_REGIONS in arch/powerpc/Kconfig for example). If it's not
x86_64, what arch does it occur on?
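
Why the truncation matters, as a simplified model (not the actual
implementation; the function name and the (start_pfn, end_pfn) tuple
layout are assumptions for illustration): once the fixed-size table is
full, every further range is dropped and its pages never reach the
allocator.

```python
MAX_ACTIVE_REGIONS = 128  # the value proposed above, matching E820MAX


def record_regions(ranges, limit=MAX_ACTIVE_REGIONS):
    """Keep at most `limit` (start_pfn, end_pfn) ranges.

    Returns the ranges that fit and the number of pages lost to truncation.
    """
    kept = list(ranges[:limit])
    dropped = ranges[limit:]
    lost_pages = sum(end - start for start, end in dropped)
    return kept, lost_pages
```

With the limit at least as large as the number of e820 entries the
firmware can report, nothing should ever land in the dropped part.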

> I hope that's all sorted out now. Please test next -mm (hopefully
> tomorrow) and let us know?
>
> Or, if you're super-keen,
> http://www.zip.com.au/~akpm/linux/patches/stuff/x.bz2 is my current rollup
> (against 2.6.17-rc4). It was compilable this morning, but I've since
> merged stuff ;) It would be interesting to know if that has fixed the bug.
>

2006-05-19 23:30:58

by Andrew Morton

Subject: Re: SCSI ABORT with 2.6.17-rc4-mm1

[email protected] (Mel Gorman) wrote:
>
> I am struggling to see how the alignment patches or
> arch-independent-zone-sizing would clobber the mapping of the ACPI table :(

hm. Well something did it ;)

> > I also managed to provoke "Too many memory regions,
> > truncating" out of it.
> >
>
> "Too many memory regions, truncating" is of concern because memory will be
> effectively lost. Is this on x86_64 as well? If so, I need to submit a
> patch that sets CONFIG_MAX_ACTIVE_REGIONS to 128 on x86_64, which is the
> same value as E820MAX. This is similar to what PPC64 does for LMB regions
> (see MAX_ACTIVE_REGIONS in arch/powerpc/Kconfig for example). If it's not
> x86_64, what arch does it occur on?

Yes, it's x86_64. It kind of went away though. I seem to have been
finding various .config combinations which cause x86_64 to die horridly -
that was one.

2006-05-20 00:09:08

by Christian Kujau

Subject: Re: SCSI ABORT with 2.6.17-rc4-mm1

Hi there,

On Fri, 19 May 2006, Andrew Morton wrote:
> DMI 2.2 present.
> +ACPI: Unable to map RSDT header
> +node 0 zone Normal missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
> +node 0 zone HighMem missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES

gah, diff(1) is actually not new to me, but I forgot to use it :(
Thanks for spotting this!

> Or, if you're super-keen,
> http://www.zip.com.au/~akpm/linux/patches/stuff/x.bz2 is my current rollup
> (against 2.6.17-rc4). It was compilable this morning, but I've since
> merged stuff ;) It would be interesting to know if that has fixed the bug.

I tried to be "super-keen" and applied x.bz2 to pristine 2.6.17-rc4, but
the scsi error persists (logs, .config coming in a few minutes.)

Furthermore, I had to do 2 more things to get rc4-mm* compiling:

1) apply the attached patch, as the compile breaks with:

CC drivers/pci/msi-apic.o
In file included from include/asm/msi.h:11,
from drivers/pci/msi.h:71,
from drivers/pci/msi-apic.c:8:
include/asm/smp.h:103: error: syntax error before '->' token
make[2]: *** [drivers/pci/msi-apic.o] Error 1
make[1]: *** [drivers/pci] Error 2
make: *** [drivers] Error 2

(this has been reported with 2.6.17-rc3-mm1, but was not fixed?)

2) disable CONFIG_ROOT_NFS=y, as the compile breaks with:

GEN .version
CHK include/linux/compile.h
UPD include/linux/compile.h
CC init/version.o
LD init/built-in.o
LD .tmp_vmlinux1
fs/built-in.o: In function `nfs_root_setup':nfsroot.c:(.init.text+0x1809):
undefined reference to `root_nfs_parse_addr'
:nfsroot.c:(.init.text+0x1810): undefined reference to `root_server_addr'
fs/built-in.o: In function `nfs_root_data': undefined reference to
`root_server_path'
fs/built-in.o: In function `nfs_root_data': undefined reference to
`root_server_addr'

As said before, .config and dmesg for rc4-mm2 in a moment, netconsole is
not working...hm.

Thank you!
Christian.
--
"No one talks peace unless he's ready to back it up with war."
"He talks of peace if it is the only way to live."
-- Colonel Green and Surak of Vulcan, "The Savage Curtain",
stardate 5906.5.


Attachments:
msi-apic.c_2.6.17-rc4-mm1.diff (277.00 B)

2006-05-20 00:29:05

by Christian Kujau

Subject: Re: SCSI ABORT with 2.6.17-rc4-mm1

Hi Mel,

On Fri, 19 May 2006, Mel Gorman wrote:
> The warnings in this case are valid but, I would think, harmless. ZONE_NORMAL
> on x86_64 begins at MAX_DMA32_PFN on the 4GiB boundary, which is MAX_ORDER
> aligned. From the e820 map, I am guessing the machine has 1GiB of memory

yes, this (x86_64) box has 1GB of memory, non-ECC.

> I am struggling to see how the alignment patches or
> arch-independent-zone-sizing would clobber the mapping of the ACPI table :(

I'll try to disable ACPI in the next testing runs...

Thanks,
Christian.
--
"The combination of a number of things to make existence worthwhile."
"Yes, the philosophy of 'none,' meaning 'all.'"
-- Spock and Lincoln, "The Savage Curtain", stardate 5906.4

2006-05-20 00:35:29

by Mel Gorman

Subject: Re: SCSI ABORT with 2.6.17-rc4-mm1

On Fri, 19 May 2006, Andrew Morton wrote:

> [email protected] (Mel Gorman) wrote:
>>
>> I am struggling to see how the alignment patches or
>> arch-independent-zone-sizing would clobber the mapping of the ACPI table :(
>
> hm. Well something did it ;)
>

Obviously. One option is to back out
have-x86_64-use-add_active_range-and-free_area_init_nodes.patch and see
what happens on Christian's machine.

>> > I also managed to provoke "Too many memory regions,
>> > truncating" out of it.
>> >
>>
>> "Too many memory regions, truncating" is of concern because memory will be
>> effectively lost. Is this on x86_64 as well? If so, I need to submit a
>> patch that sets CONFIG_MAX_ACTIVE_REGIONS to 128 on x86_64, which is the
>> same value as E820MAX. This is similar to what PPC64 does for LMB regions
>> (see MAX_ACTIVE_REGIONS in arch/powerpc/Kconfig for example). If it's not
>> x86_64, what arch does it occur on?
>
> Yes, it's x86_64. It kind of went away though. I seem to have been
> finding various .config combinations which cause x86_64 to die horridly -
> that was one.
>

Could you post some of the configs, please? I'll see whether I can
reproduce it locally.

--
Mel Gorman
Part-time Phd Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab

2006-05-20 00:41:11

by Christian Kujau

Subject: Re: SCSI ABORT with 2.6.17-rc4-mm1

On Sat, 20 May 2006, Christian Kujau wrote:
> I tried to be "super-keen" and applied x.bz2 to pristine 2.6.17-rc4, but the
> scsi error persists (logs, .config coming in a few minutes.)

Please see .config and dmesgs here:

http://www.nerdbynature.de/bits/2.6.17-rc4-mm2.x/

I'll try with ACPI disabled later on and let you know. If you have more
patches to test/back-out I'll be happy to test. What puzzles me: sym53c8xx
does not seem *too* exotic but I seem to be the only one whining...

Thanks,
Christian.
--
"The combination of a number of things to make existence worthwhile."
"Yes, the philosophy of 'none,' meaning 'all.'"
-- Spock and Lincoln, "The Savage Curtain", stardate 5906.4

2006-05-20 03:55:16

by Christian Kujau

Subject: Re: SCSI ABORT with 2.6.17-rc4-mm1

On Sat, 20 May 2006, Mel Gorman wrote:
> Obviously. One option is to back out
> have-x86_64-use-add_active_range-and-free_area_init_nodes.patch and see what
> happens on Christian's machine.

I've disabled CONFIG_PM and backed out above patch (from -rc4-mm1), but
sadly, the error persists:

http://nerdbynature.de/bits/2.6.17-rc4-mm1/no-CONFIG_PM/
http://nerdbynature.de/bits/2.6.17-rc4-mm2.x/no-CONFIG_PM/
(the first one with the said patch backed out)

Thanks for your ideas,
Christian.
--
There's another way to survive. Mutual trust -- and help.
-- Kirk, "Day of the Dove", stardate unknown

2006-05-26 18:26:31

by Christian Kujau

Subject: Re: SCSI ABORT with 2.6.17-rc4-mm1

Just in case the news didn't get through: the issue has been fixed in
-mm3. I'm not sure what the real fix was, since

- rc4 is working
- rc4-mm1 is not working
- rc4-mm2 is not working
- rc4-mm3 is working

Mel Gorman sent me the zonesizing-v13 patch for -mm3 (thanks again!),
which also works; results are here:

http://nerdbynature.de/bits/2.6.17-rc4-mm3/

Thanks to all involved,
Christian.
--
BOFH excuse #435:

Internet shut down due to maintenance