The patchset contains mainly scalability and NUMA stuff, and anything
else that stops things from irritating me. It's meant to be pretty stable,
not so much a testing ground for new stuff.
I'd be very interested in feedback from anyone willing to test on any
platform, however large or small.
ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/2.5.62/patch-2.5.62-mjb3.bz2
additional:
http://www.aracnet.com/~fletch/linux/2.5.59/pidmaps_nodepages
Since 2.5.62-mjb2 (~ = changed, + = added, - = dropped)
Notes: Fixes some critical scheduler hangs.
- discontig_x440 Pat Gaughen / IBM NUMA team
+ early_ioremap Dave Hansen
+ x440disco_A0 Pat Gaughen / IBM NUMA team
+ fix_was_sched Ingo / wli / Rick Lindsley
+ no_kirq Martin J. Bligh
+ auto_disable_tsc John Stultz
+ cleaner_inodes Andrew Morton
Pending:
scheduler callers profiling (Anton)
PPC64 NUMA patches (Anton)
Child runs first (akpm)
Kexec
e1000 fixes
Non-PAE aligned kernel splits (Dave Hansen)
Update the lost timer ticks code
Ingo scheduler updates
Present in this patch:
early_printk Dave Hansen et al.
Allow printk before console_init
confighz Andrew Morton / Dave Hansen
Make HZ a config option of 100 Hz or 1000 Hz
config_page_offset Dave Hansen / Andrea
Make PAGE_OFFSET a config option
vmalloc_stats Dave Hansen
Expose useful vmalloc statistics
local_pgdat William Lee Irwin
Move the pgdat structure into the remapped space with lmem_map
numameminfo Martin Bligh / Keith Mannthey
Expose NUMA meminfo information under /proc/meminfo.numa
notsc Martin Bligh
Enable notsc option for NUMA-Q (new version for new config system)
mpc_apic_id Martin J. Bligh
Fix null ptr dereference (optimised away, but ...)
doaction Martin J. Bligh
Fix cruel torture of macros and small furry animals in io_apic.c
kgdb Andrew Morton / Various People
The older version of kgdb, synched with 2.5.54-mm1
noframeptr Martin Bligh
Disable -fomit-frame-pointer
ingosched Ingo Molnar
Modify NUMA scheduler to have independent tick basis.
schedstat Rick Lindsley
Provide stats about the scheduler under /proc/stat
sched_tunables Robert Love
Provide tunable parameters for the scheduler (+ NUMA scheduler)
early_ioremap Dave Hansen
Provide ioremap in very early boot when we only have 8MB of address space
x440disco_A0 Pat Gaughen / IBM NUMA team
SLIT/SRAT parsing for x440 discontigmem
acpi_x440_hack Anonymous Coward
Stops x440 crashing, but owner is ashamed of it ;-)
numa_pci_fix Dave Hansen
Fix a potential error in the numa pci code from Stanford Checker
pfn_to_nid William Lee Irwin
Turn pfn_to_nid into a macro
kprobes Vamsi Krishna S
Add kernel probes hooks to the kernel
dmc_exit1 Dave McCracken
Speed up the exit path, pt 1.
dmc_exit2 Dave McCracken
Speed up the exit path, pt 2.
shpte Dave McCracken
Shared pagetables (as a config option)
thread_info_cleanup (4K stacks pt 1) Dave Hansen / Ben LaHaise
Prep work to reduce kernel stacks to 4K
interrupt_stacks (4K stacks pt 2) Dave Hansen / Ben LaHaise
Create a per-cpu interrupt stack.
stack_usage_check (4K stacks pt 3) Dave Hansen / Ben LaHaise
Check for kernel stack overflows.
4k_stack (4K stacks pt 4) Dave Hansen
Config option to reduce kernel stacks to 4K
fix_kgdb Dave Hansen
Fix interaction between kgdb and 4K stacks
stacks_from_slab William Lee Irwin
Take kernel stacks from the slab cache, not page allocation.
thread_under_page William Lee Irwin
Fix THREAD_SIZE < PAGE_SIZE case
lkcd LKCD team
Linux kernel crash dump support
percpu_loadavg Martin J. Bligh
Provide per-cpu load averages, and real load averages
irq_affinity Martin J. Bligh
Workaround for irq_affinity on clustered apic mode systems (eg x440)
kirq_clustered_fix Dave Hansen / Martin J. Bligh
Fix kirq for clustered apic systems (eg x440)
fix_was_sched Ingo / wli / Rick Lindsley
Fix scheduler hangs from deadlocks
no_kirq Martin J. Bligh
Allow disabling of kirq to work properly
auto_disable_tsc John Stultz
Automatically disable the TSC for NUMA-Q
cleaner_inodes Andrew Morton
Make noatime filesystems more efficient
-mjb Martin J. Bligh
Add a tag to the makefile
On Mon, 2003-02-24 at 10:08, Martin J. Bligh wrote:
> The patchset contains mainly scalability and NUMA stuff, and anything
> else that stops things from irritating me. It's meant to be pretty stable,
> not so much a testing ground for new stuff.
>
> I'd be very interested in feedback from anyone willing to test on any
> platform, however large or small.
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/2.5.62/patch-2.5.62-mjb3.bz2
>
Martin,
I have been seeing system hangs on my 16 processor numaq while running
contest. The system will hang within a few seconds to half an hour.
Unfortunately there is no stack trace or any other indication on the
system console. I have been running your 2.5.62-mjb2 without problems
previously. Any ideas what I can do to narrow this down?
Mark.
--
Mark Haverkamp <[email protected]>
>> The patchset contains mainly scalability and NUMA stuff, and anything
>> else that stops things from irritating me. It's meant to be pretty
>> stable, not so much a testing ground for new stuff.
>>
>> I'd be very interested in feedback from anyone willing to test on any
>> platform, however large or small.
>>
>> ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/2.5.62/patch-2.5.62-mjb3.bz2
>>
>
> Martin,
>
> I have been seeing system hangs on my 16 processor numaq while running
> contest. The system will hang within a few seconds to half an hour.
> Unfortunately there is no stack trace or any other indication on the
> system console. I have been running your 2.5.62-mjb2 without problems
> previously. Any ideas what I can do to narrow this down?
Humpf. Can you try backing out this patch (it caused me similar problems on
59, but seemed fine in 62). I suspect it's just changing timing enough that
we hit some other bug ... if you could, would be nice to try the ALT+SYSRQ
stuff, or turn on NMI watchdogs and get a backtrace ... I've not been able
to reproduce this on recent kernels.
Thanks,
M.
diff -urpN -X /home/fletch/.diff.exclude 330-no_kirq/include/asm-i386/mach-numaq/mach_mpparse.h 340-auto_disable_tsc/include/asm-i386/mach-numaq/mach_mpparse.h
--- 330-no_kirq/include/asm-i386/mach-numaq/mach_mpparse.h	Fri Jan 17 09:18:31 2003
+++ 340-auto_disable_tsc/include/asm-i386/mach-numaq/mach_mpparse.h	Mon Feb 24 08:14:42 2003
@@ -32,6 +32,7 @@ static inline void mps_oem_check(struct
 	if (mpc->mpc_oemptr)
 		smp_read_mpc_oem((struct mp_config_oemtable *) mpc->mpc_oemptr,
 				mpc->mpc_oemsize);
+	tsc_disable=1;
 }
 
 /* Hook from generic ACPI tables.c */
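The effect of the one-line change above can be sketched in standalone C. This is only an illustration under stated assumptions, not the kernel code: the struct layout, the stub body of smp_read_mpc_oem() and the oem_tables_read counter are invented so the sketch compiles by itself, and the single-argument signature is a guess because the hunk header cuts the parameter list off. Only the names mps_oem_check, mpc_oemptr, mpc_oemsize and tsc_disable come from the diff.

```c
/* Hypothetical stand-in for the MP config table; only the two fields
 * touched by the diff are modelled here. */
struct mp_config_table {
	unsigned long mpc_oemptr;	/* address of the OEM table, 0 if absent */
	unsigned short mpc_oemsize;	/* size of the OEM table */
};

/* Flag named in the diff; nonzero is taken to mean "do not use the TSC". */
static int tsc_disable;

/* Invented for this sketch: counts calls to the stub below. */
static int oem_tables_read;

/* Stub standing in for the real OEM table parser. */
void smp_read_mpc_oem(unsigned long oemptr, unsigned short oemsize)
{
	(void)oemptr;
	(void)oemsize;
	oem_tables_read++;
}

/* Mirrors the patched hook: parse the OEM table if there is one, then
 * set tsc_disable unconditionally, whether or not a table was found. */
void mps_oem_check(const struct mp_config_table *mpc)
{
	if (mpc->mpc_oemptr)
		smp_read_mpc_oem(mpc->mpc_oemptr, mpc->mpc_oemsize);
	tsc_disable = 1;
}
```

The point to notice is that the assignment sits outside the if, so on this sub-architecture the flag is set on every call to the hook, regardless of whether an OEM table is present.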
On Wed, 2003-02-26 at 07:55, Martin J. Bligh wrote:
> >> The patchset contains mainly scalability and NUMA stuff, and anything
> >> else that stops things from irritating me. It's meant to be pretty
> >> stable, not so much a testing ground for new stuff.
> >>
> >> I'd be very interested in feedback from anyone willing to test on any
> >> platform, however large or small.
> >>
> >> ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/2.5.62/patch-2.5.62-mjb3.bz2
> >>
> >
> > Martin,
> >
> > I have been seeing system hangs on my 16 processor numaq while running
> > contest. The system will hang within a few seconds to half an hour.
> > Unfortunately there is no stack trace or any other indication on the
> > system console. I have been running your 2.5.62-mjb2 without problems
> > previously. Any ideas what I can do to narrow this down?
>
> Humpf. Can you try backing out this patch (it caused me similar problems on
> 59, but seemed fine in 62). I suspect it's just changing timing enough that
> we hit some other bug ...
OK, I'll try this.
> if you could, would be nice to try the ALT+SYSRQ
> stuff, or turn on NMI watchdogs and get a backtrace ... I've not been able
> to reproduce this on recent kernels.
I'll try these first and see what happens.
Mark.
--
Mark Haverkamp <[email protected]>
On Wed, 2003-02-26 at 07:55, Martin J. Bligh wrote:
> >> The patchset contains mainly scalability and NUMA stuff, and anything
> >> else that stops things from irritating me. It's meant to be pretty
> >> stable, not so much a testing ground for new stuff.
> >>
> >> I'd be very interested in feedback from anyone willing to test on any
> >> platform, however large or small.
> >>
> >> ftp://ftp.kernel.org/pub/linux/kernel/people/mbligh/2.5.62/patch-2.5.62-mjb3.bz2
> >>
> >
> > Martin,
> >
> > I have been seeing system hangs on my 16 processor numaq while running
> > contest. The system will hang within a few seconds to half an hour.
> > Unfortunately there is no stack trace or any other indication on the
> > system console. I have been running your 2.5.62-mjb2 without problems
> > previously. Any ideas what I can do to narrow this down?
>
> Humpf. Can you try backing out this patch (it caused me similar problems on
> 59, but seemed fine in 62). I suspect it's just changing timing enough that
> we hit some other bug ... if you could, would be nice to try the ALT+SYSRQ
> stuff, or turn on NMI watchdogs and get a backtrace ... I've not been able
> to reproduce this on recent kernels.
>
> Thanks,
>
> M.
>
> diff -urpN -X /home/fletch/.diff.exclude 330-no_kirq/include/asm-i386/mach-numaq/mach_mpparse.h 340-auto_disable_tsc/include/asm-i386/mach-numaq/mach_mpparse.h
> --- 330-no_kirq/include/asm-i386/mach-numaq/mach_mpparse.h	Fri Jan 17 09:18:31 2003
> +++ 340-auto_disable_tsc/include/asm-i386/mach-numaq/mach_mpparse.h	Mon Feb 24 08:14:42 2003
> @@ -32,6 +32,7 @@ static inline void mps_oem_check(struct
>  	if (mpc->mpc_oemptr)
>  		smp_read_mpc_oem((struct mp_config_oemtable *) mpc->mpc_oemptr,
>  				mpc->mpc_oemsize);
> +	tsc_disable=1;
>  }
>
>  /* Hook from generic ACPI tables.c */
>
I turned on NMI watchdogs and when the system hung, I saw no output. My
serial console is through a terminal server that isn't set up to pass
along the sysrq, so I need to get this fixed. In any case I backed out
the patch that you suggested and I have had no system hangs since.
Mark.
--
Mark Haverkamp <[email protected]>
|
| I turned on NMI watchdogs and when the system hung, I saw no output. My
| serial console is through a terminal server that isn't set up to pass
| along the sysrq, so I need to get this fixed. In any case I backed out
| the patch that you suggested and I have had no system hangs since.
|
| Mark.
| --
| Mark Haverkamp <[email protected]>
Mark,
You can also use my "echo key > sysrq" patch.
It was updated to 2.5.62 by Zwane M.
It's available at http://www.osdl.org/archive/rddunlap/patches/magickey_2562.patch
(after a possible 15-minute rsync delay).
--
~Randy
> I turned on NMI watchdogs and when the system hung, I saw no output. My
> serial console is through a terminal server that isn't set up to pass
> along the sysrq, so I need to get this fixed. In any case I backed out
> the patch that you suggested and I have had no system hangs since.
OK, I'll back out that patch for now, but it seems to indicate underlying
crud. What parameter did you set for NMI watchdog?
M.
>> > I turned on NMI watchdogs and when the system hung, I saw no output.
>> > My serial console is through a terminal server that isn't set up to
>> > pass along the sysrq, so I need to get this fixed. In any case I
>> > backed out the patch that you suggested and I have had no system hangs
>> > since.
>>
>> OK, I'll back out that patch for now, but it seems to indicate underlying
>> crud. What parameter did you set for NMI watchdog?
>
> I set it to 1. In Documentation/nmi_watchdog.txt this looked like the
> only option. Now that I look at apic.h, I see that I could set it to 2
> also. If you like I can try this also.
2 is what we used successfully last time, but I can't remember the
difference off the top of my head ... if you could try that, would be most
useful.
M.
On Wed, 2003-02-26 at 14:53, Martin J. Bligh wrote:
> > I turned on NMI watchdogs and when the system hung, I saw no output. My
> > serial console is through a terminal server that isn't set up to pass
> > along the sysrq, so I need to get this fixed. In any case I backed out
> > the patch that you suggested and I have had no system hangs since.
>
> OK, I'll back out that patch for now, but it seems to indicate underlying
> crud. What parameter did you set for NMI watchdog?
I set it to 1. In Documentation/nmi_watchdog.txt this looked like the
only option. Now that I look at apic.h, I see that I could set it to 2
also. If you like I can try this also.
Mark.
--
Mark Haverkamp <[email protected]>
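On the difference between the two values: going by include/asm-i386/apic.h in kernels of this era, nmi_watchdog=1 drives the watchdog through the IO-APIC and nmi_watchdog=2 uses the local APIC. A minimal sketch of that mapping follows. The constant names are believed to match apic.h but should be checked against the tree in use, and the helper nmi_watchdog_name() is made up for illustration.

```c
/* Values accepted by the nmi_watchdog= boot parameter, as believed to be
 * defined in include/asm-i386/apic.h (assumption: verify in your tree). */
#define NMI_NONE	0
#define NMI_IO_APIC	1
#define NMI_LOCAL_APIC	2
#define NMI_INVALID	3

/* Hypothetical helper: describe which NMI source a given value selects. */
const char *nmi_watchdog_name(int mode)
{
	switch (mode) {
	case NMI_NONE:
		return "disabled";
	case NMI_IO_APIC:
		return "IO-APIC";
	case NMI_LOCAL_APIC:
		return "local APIC";
	default:
		return "invalid";
	}
}
```

Under that reading, switching from 1 to 2 changes where the periodic NMI comes from, which may explain why one setting produces output on a given machine when the other does not.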
On Wed, 2003-02-26 at 15:05, Martin J. Bligh wrote:
> >> > I turned on NMI watchdogs and when the system hung, I saw no output.
> >> > My serial console is through a terminal server that isn't set up to
> >> > pass along the sysrq, so I need to get this fixed. In any case I
> >> > backed out the patch that you suggested and I have had no system hangs
> >> > since.
> >>
> >> OK, I'll back out that patch for now, but it seems to indicate underlying
> >> crud. What parameter did you set for NMI watchdog?
> >
> > I set it to 1. In Documentation/nmi_watchdog.txt this looked like the
> > only option. Now that I look at apic.h, I see that I could set it to 2
> > also. If you like I can try this also.
>
> 2 is what we used successfully last time, but I can't remember the
> difference off the top of my head ... if you could try that, would be most
> useful.
Still no luck getting a stack trace. With nmi_watchdog=2, I get this
kind of message on occasion:
Uhhuh. NMI received for unknown reason 35 on CPU 11.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
But when the system finally froze, there was nothing.
Mark.
--
Mark Haverkamp <[email protected]>
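One way to read the "unknown reason 35" message, assuming the i386 NMI handler of this era: the number is the status byte read from I/O port 0x61, printed in hex, and the handler only recognises bit 0x80 (memory parity error) and bit 0x40 (I/O check error). Neither bit is set in 0x35, so the NMI is reported as unknown. The sketch below encodes that assumption. The enum and function names are invented for illustration and are not kernel identifiers, and unlike the kernel it reports only the first matching cause.

```c
/* Hypothetical classification of the NMI status byte (assumed to be the
 * value read from I/O port 0x61 and printed in hex in the message). */
enum nmi_cause {
	NMI_CAUSE_UNKNOWN,
	NMI_CAUSE_MEM_PARITY,
	NMI_CAUSE_IO_CHECK
};

/* Return the first recognised cause, or NMI_CAUSE_UNKNOWN if neither
 * of the two status bits is set. */
enum nmi_cause classify_nmi_reason(unsigned char reason)
{
	if (reason & 0x80)
		return NMI_CAUSE_MEM_PARITY;
	if (reason & 0x40)
		return NMI_CAUSE_IO_CHECK;
	return NMI_CAUSE_UNKNOWN;
}
```

With reason 0x35 the function returns NMI_CAUSE_UNKNOWN, which matches the "Dazed and confused" path in the log above.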