Hello,
[Please CC me if you reply, for I am not subscribed to LKML.]
This is my first time posting to LKML.
I am a Debian user. The sources for 2.6.26 recently became available
in the Debian unstable repositories. Trying them out by building
custom kernels (think 'make oldconfig'), I found that one machine
worked while another froze early in boot. No oops, no error msg of
any kind, just a hard freeze without even Magic SysRq working!
I suspected a dumb config error on my part, but found that the Debian
stock kernel exhibited the same problem. So I filed a bug report in
the Debian BTS:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=493479
There is much info about my hardware and configs there, but I can
repost them here if that is helpful. The machine that works with
2.6.26 has a Gigabyte GA-M59SLI-S5 mboard; the broken machine has an
ECS AMD690GM-M2 mboard.
After much experimenting with various configs and rebuilds, I was
finally able to discover that a kernel boot parameter,
"hpet=disabled", allowed me to boot on the troublesome machine.
Both custom and Debian stock kernels of version 2.6.25 (most recently
based on 2.6.25.10) work fine on this machine, no problem with HPET.
A member of the Debian kernel team (Bastian Blank) tried to help, but
ended up suggesting bisecting using 'git'. I am not (yet) a developer
so I was not really thinking of getting that deeply involved, but I
spent so much time trying to track this problem on Saturday night and
all day Sunday, that I decided to give it a try!
Starting with Linus' instructions here,
http://lkml.org/lkml/2007/7/10/248
I ran:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
and:
git checkout v2.6.25
I built a kernel on the ECS machine and it worked (as expected), so I ran:
git bisect good
then:
git checkout v2.6.26-rc4
hoping maybe to save some iterations by not starting with the 2.6.26 release.
This 2.6.26-rc4 kernel froze early in boot, so I ran:
git bisect bad
Here is a summary of my first git bisecting experiment:
======================================================
Iteration  ID                                        status
---------  ----------------------------------------  ------
 1         2.6.25                                    good
 2         2.6.26-rc4                                bad
 3         10c993a6b5418cb1026775765ba4c70ffb70853d  bad
 4         334d094504c2fe1c44211ecb49146ae6bca8c321  bad
 5         eddeb0e2d863e3941d8768e70cb50c6120e61fa0  bad
 6         77ad386e596c6b0930cc2e09e3cce485e3ee7f72  bad
 7         ede1389f8ab4f3a1343e567133fa9720a054a3aa  bad
 8         c048fdfe6178e082be918d4062c86d9764979112  bad
 9         f73920cd63d316008738427a0df2caab6cc88ad7  bad
10         04aaa7ba096c707a8df337b29303f1a5a65f0462  good
11         8fa6878ffc6366f490e99a1ab31127fb599657c9  good
12         1180e01de50c0c7683c6648251f32957bc2d7850  good
13         1e934dda0c77c8ad13fdda02074f2cfcea118a56  bad
14         322850af8d93735f67b8ebf84bb1350639be3f34  good
15         3def3d6ddf43dbe20c00c3cbc38dfacc8586998f  bad
16         700efc1b9f6afe34caae231b87d129ad8ffb559f  good
First commit causing failure:
commit 3def3d6ddf43dbe20c00c3cbc38dfacc8586998f
Author: Yinghai Lu <[email protected]>
Date: Fri Feb 22 17:07:16 2008 -0800
x86: clean up e820_reserve_resources on 64-bit
e820_resource_resources could use insert_resource instead of request_resource
also move code_resource, data_resource, bss_resource, and crashk_res
out of e820_reserve_resources.
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
======================================================
So, it seems that this commit made a change that works on some
(most?) systems, like my Gigabyte mboard machine, but causes
others, like my ECS mboard machine, to freeze early in boot
unless HPET is disabled.
I don't know how important the High Precision Event Timer really
is to the health of my machine, but for the sake of principle I
would really like to see it working again, like with 2.6.25 and
before! ;)
For me this is a "regression," but I have found a workaround. I'm
not sure what sort of problem is important enough to Linux kernel
developers to qualify as a true regression, so I brought my problem
here in case it's something that should be reported and/or fixed.
I work as a programming tutor at a community college, so I'm willing
to make code changes and build test kernels, if anyone can make
suggestions. I looked at the diff between the last working commit
and the first broken (for me) commit, and found that I did not have
a clue about the hardware issues involved:
git diff 700efc1b9f6afe34caae231b87d129ad8ffb559f 3def3d6ddf43dbe20c00c3cbc38dfacc8586998f
There are only 3 files involved,
arch/x86/kernel/e820_64.c
arch/x86/kernel/setup_64.c
include/asm-x86/e820_64.h
and I could see that 'setup_64.c' is not implicated in my freeze
because the code change is in an #ifdef block depending on
CONFIG_KEXEC, which is not enabled in my custom kernels (though it
is in the Debian stock kernels).
If what I am describing is considered a regression, as I believe it is, then I
am willing to try code changes to get 2.6.26 working on BOTH of my
machines.
Thx (and please CC replies to me),
Dave Witbrodt
[ Let's CC people, so that they'll at least see this mail
when they're back from holidays ]
On Mon, 2008-08-04 at 16:57 -0700, David Witbrodt wrote:
> I ran:
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
>
> and:
> git checkout v2.6.25
Since you have that git tree, could you try to see if the latest -git
still has this problem?
> Since you have that git tree, could you try to see if the latest -git
> still has this problem?
I had forgotten to do that at the time I posted.
Last night, after the post, I _did_ try building "master". The commit
date was from Aug. 1 -- I am at work right now, so I can't provide exact
info until I get home in about 6 hours. I checked 'Makefile', and the
version there was 2.6.27-rc1, IIRC.
It built fine, but the same freeze was seen.
A reversion of the commit I mentioned _will_ solve the problem, but
looking at the code I saw that it was an attempt to provide better
functionality. I'm willing to help test modifications to make the new
code work!
Thanks,
Dave W.
On Tue, Aug 5, 2008 at 7:14 AM, David Witbrodt <[email protected]> wrote:
>
>
>> Since you have that git tree, could you try to see if the latest -git
>> still has this problem?
>
> I had forgotten to do that at the time I posted.
>
> Last night, after the post, I _did_ try building "master". The commit
> date was from Aug. 1 -- I am at work right now, so I can't provide exact
> info until I get home in about 6 hours. I checked 'Makefile', and the
> version there was 2.6.27-rc1, IIRC.
> It built fine, but same freeze was seen.
>
> A reversion of the commit I mentioned _will_ solve the problem, but
> looking at the code I saw that it was an attempt to provide better
> functionality. I'm willing to help test modifications to make the new
> code work!
please boot with "debug apic=verbose initcall_debug" to check exactly
where it hangs...
YH
> Since you have that git tree, could you try to see if the latest -git
> still has this problem?
In a previous msg I mentioned that I had tried compiling the HEAD of
my git repository, but only after I had posted to LKML. I was at work
when I wrote the prev msg, so I could not provide details except from
memory.
OK, now I'm home:
======================================
$ git-show
commit 2b12a4c524812fb3f6ee590a02e65b95c8c32229
Merge: 4744b43... 7f30491...
Author: Linus Torvalds <[email protected]>
Date: Fri Aug 1 14:59:11 2008 -0700
Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
[IA64] Move include/asm-ia64 to arch/ia64/include/asm
$ head Makefile
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 27
EXTRAVERSION = -rc1
NAME = Rotary Wombat
# *DOCUMENTATION*
# To see a list of typical targets execute "make help"
# More info can be located in ./README
# Comments in this file are targeted only to the developer, do not
======================================
This kernel built fine, but froze at boot just like the 2.6.26 kernels,
unless using "hpet=disabled".
Sorry that I forgot to do this before my orig. post.
Dave W.
> please boot with "debug apic=verbose initcall_debug" to check exactly
> where it hangs...
In my OP, I mentioned that I submitted a bug report to the Debian BTS
before coming to LKML. I hoped to keep the bug an internal Debian matter,
since the kernels I compile were always from Debianized kernel sources.
Below I comply with your request, booting the kernel built from the HEAD
of the git tree I downloaded yesterday, dated
Fri Aug 1 14:59:11 2008 -0700
and with commit ID
2b12a4c524812fb3f6ee590a02e65b95c8c32229
Before continuing, I would like to mention that in my original post to
the Debian BTS, I reported the last lines on the screen for several
kernels booted with "debug earlyprintk=vga initcall_debug loglevel=7".
I originally thought I was to blame -- some error in my '.config' --
so, unfortunately, I made a lot of irrelevant noise in the Debian BTS
thread as I scrambled to determine the cause of the freeze. So maybe
the info there is not useful at all, but here is the link again,
Just In Case:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=493479
OK, now booting 2.6.27-rc1 with "ro debug apic=verbose initcall_debug"...
Here is the last visible output before the freeze:
=========================================
calling chr_dev_init+0x0/0xa2
initcall chr_dev_init returned 0 after 0 msecs
calling firmware_class_init+0x0/0x71
initcall firmware_class_init returned 0 after 0 msecs
calling loopback_init+0x0/0xc
initcall loopback_init returned 0 after 0 msecs
calling cpufreq_gov_performance_init+0x0/0xc
initcall cpufreq_gov_performance_init returned 0 after 0 msecs
calling init_acpi_pm_clocksource+0x0/0xb4
initcall init_acpi_pm_clocksource returned 0 after 0 msecs
calling pci_bios_assign_resources+0x0/0x8b
pci 0000:00:01.0: PCI bridge, secondary bus 0000:01
pci 0000:00:01.0: IO window: 0xe000-0xefff
pci 0000:00:01.0: MEM window: 0xfdd00000-0xfdefffff
pci 0000:00:01.0: PREFETCH window: 0x000000d8000000-0x000000dfffffff
pci 0000:00:14.4: PCI bridge, secondary bus 0000:02
pci 0000:00:14.4: IO window: 0xd000-0xdfff
pci 0000:00:14.4: MEM window: 0xfdc00000-0xfdcfffff
pci 0000:00:14.4: PREFETCH window: 0x000000fdf00000-0x000000fdffffff
initcall pci_bios_assign_resources returned 0 after 285696 msecs
calling inet_init+0x0/0x250
NET: Registered protocol family 2
=========================================
I can tell you that the "285696" figure is way off if "msecs" is
supposed to mean milliseconds. It might be accurate if microseconds
are intended, but the entire process from GRUB handing off to the
kernel until the freeze occurs is just a few moments: 3 seconds at
the most, probably less.
This info was copied by hand. I had no other way to transfer the info
into this post, so I apologize in advance for any errors. I did
double check it, but some of those hex values are typos waiting to
happen.... (I'm pretty sure I got them right, though ;)
Only 3 files were impacted by the commit that is causing the freeze
for my machine with the ECS mboard. If you would like to give me
some code to insert in those files (or other files) that would
print more helpful output during the boot, I would be more than
happy to give it a try.
Dave W.
Had a bit of a scare tonight, about possibly wasting the time of you
good folks.
My desktop machine (mboard = Gigabyte GA-M59SLI-S5) was built last
year, in May 2007. It runs 2.6.26 with no HPET regression, as mentioned.
The troublesome machine (mboard = ECS AMD690GM-M2) was built this year,
in May or June. I actually bought two identical motherboards, which
were on sale at a very nice price, so I could make 2 "servers" for my
home network.
One machine (call it "fileserver") is in working order, and is the
machine I've been using for all of the testing I've done in this bug
thread. The other ECS machine (call it "webserver") was not really
in working condition -- actually, it is OK, just not hooked up while
I've been backing up files from an older Pentium 4 machine and a
Pentium 3 machine.
The "scare" has to do with the CPU/BIOS situation. The webserver uses
an AMD Athlon 64 X2 3600+, fully recognized by the mboard BIOS. The
fileserver uses a very new model: AMD X2 4850e. The ECS mboard runs
this CPU fine, but the BIOS does not "recognize" it.
I asked ECS about the possibility of a BIOS update in early June. The
response:
=========================
ECS Support(USA) Posted : GMT 2008/06/14 00:19:14
Thank for your question. It is hard to say if there will be a BIOS
version to support your CPU. But for sure we will pass this along to
the engineering department in Taipei. Thanks.
=========================
The last update for this mboard I know of was from Dec. 2007:
http://www.ecsusa.com/ECSWebSite/Products/ProductsDetail.aspx?detailid=789&DetailName=Bios&MenuID=46&LanID=9
Tonight, fearing that some peculiarity of the CPU might be causing the
problem instead of the motherboard hardware itself, I got the other
machine (ECS mboard + Athlon 64 X2 3600) running and tested the
2.6.27-rc1 kernel on it: froze on boot, but ran with "hpet=disabled".
Well, at least I'm glad I didn't waste everybody's time on some weird
exception. Of course, this bug is not really a problem for me at all
at the present: I can easily run 2.6.25 kernels on these two boxes, and
even 2.6.26+ kernels with "hpet=disabled" if need be. I just would like
to see this issue fixed on the hardware I own, thinking in terms of the
future. The Debian Developers are trying to get 2.6.26 into the next
stable release, but right now it looks like anyone with this ECS
motherboard who would try to install Linux from media with 2.6.26 would
have a seizure... their machine, that is. ;)
Dave W.
OK, even though I am not a developer, I _hate_ feeling powerless to help...
so I went looking for more details about where the freeze is occurring.
First, I updated my git tree:
===== BEGIN INFO ========================
$ head Makefile
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 27
EXTRAVERSION = -rc2
NAME = Rotary Wombat
# *DOCUMENTATION*
# To see a list of typical targets execute "make help"
# More info can be located in ./README
# Comments in this file are targeted only to the developer, do not
$ git show |head
commit 685d87f7ccc649ab92b55e18e507a65d0e694eb9
Author: Linus Torvalds <[email protected]>
Date: Wed Aug 6 19:24:47 2008 -0700
Revert "pcm_native.c: remove unused label"
This reverts commit 680db0136e0778a0d7e025af7572c6a8d82279e2. The label
is actually used, but hidden behind CONFIG_SND_DEBUG and the horrible
snd_assert() macro.
===== END INFO ========================
Since the last initcall function listed before the freeze was inet_init(),
I decided to try to locate the files involved using 'grep -R'. I found
that inet_init() is defined in 'net/ipv4/af_inet.c'. So, I thought it
wouldn't hurt to print some more info during the kernel boot process, and
I added some debugging printk() calls to af_inet.c:
===== BEGIN DIFF ========================
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8a3ac1f..8e98094 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1421,14 +1421,17 @@ static int __init inet_init(void)
BUILD_BUG_ON(sizeof(struct inet_skb_parm) > sizeof(dummy_skb->cb));
+ printk(" Calling proto_register(&tcp_prot, 1)\n");
rc = proto_register(&tcp_prot, 1);
if (rc)
goto out;
+ printk(" Calling proto_register(&udp_prot, 1)\n");
rc = proto_register(&udp_prot, 1);
if (rc)
goto out_unregister_tcp_proto;
+ printk(" Calling proto_register(&raw_prot, 1)\n");
rc = proto_register(&raw_prot, 1);
if (rc)
goto out_unregister_udp_proto;
@@ -1437,15 +1440,18 @@ static int __init inet_init(void)
* Tell SOCKET that we are alive...
*/
+ printk(" Calling sock_register()\n");
(void)sock_register(&inet_family_ops);
#ifdef CONFIG_SYSCTL
+ printk(" Calling ip_static_sysctl_init()\n");
ip_static_sysctl_init();
#endif
/*
* Add all the base protocols.
*/
+ printk(" Adding base protocols\n");
if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
printk(KERN_CRIT "inet_init: Cannot add ICMP protocol\n");
@@ -1459,6 +1465,7 @@ static int __init inet_init(void)
#endif
/* Register the socket-side information for inet_create. */
+ printk(" Initializing lists for inet_create\n");
for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
INIT_LIST_HEAD(r);
@@ -1469,23 +1476,31 @@ static int __init inet_init(void)
* Set the ARP module up
*/
+ printk(" Calling arp_init()\n");
arp_init();
/*
* Set the IP module up
*/
+ printk(" Calling\n");
+
+ printk(" Calling ip_init()\n");
ip_init();
+ printk(" Calling tcp_v4_init()\n");
tcp_v4_init();
/* Setup TCP slab cache for open requests. */
+ printk(" Calling tcp_init()\n");
tcp_init();
/* Setup UDP memory threshold */
+ printk(" Calling udp_init()\n");
udp_init();
/* Add UDP-Lite (RFC 3828) */
+ printk(" Calling udplite4_register()\n");
udplite4_register();
/*
@@ -1509,10 +1524,13 @@ static int __init inet_init(void)
if (init_ipv4_mibs())
printk(KERN_CRIT "inet_init: Cannot init ipv4 mibs\n");
+ printk(" Calling ipv4_proc_init()\n");
ipv4_proc_init();
+ printk(" Calling ipfrag_init()\n");
ipfrag_init();
+ printk(" Calling dev_add_pack(&ip_packet_type)\n");
dev_add_pack(&ip_packet_type);
rc = 0;
===== END DIFF ========================
[Hopefully the text formatting is preserved in the emails. The
archived messages via the web interface have their whitespace
formatting totally destroyed!]
After building and running the kernel, the last line on the terminal was:
Initializing lists for inet_create
So the freeze occurs in this "for" loop (or the loop immediately
following it):
for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
INIT_LIST_HEAD(r);
HTH,
Dave W.
OK, suffering from insomnia this morning, I added printk()'s to
net/ipv4/af_inet.c in order to find the code where the freeze
happens. One of 2 loops was the culprit:
===== BEGIN CODE =======================
#ifdef CONFIG_IP_MULTICAST
if (inet_add_protocol(&igmp_protocol, IPPROTO_IGMP) < 0)
printk(KERN_CRIT "inet_init: Cannot add IGMP protocol\n");
#endif
/* Register the socket-side information for inet_create. */
for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
INIT_LIST_HEAD(r);
for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
inet_register_protosw(q);
/*
* Set the ARP module up
*/
===== END CODE =======================
Feeling better, I tried to get a few hours of sleep before I had to
go to work.
Knowing where to focus more attention, I restored the original version
of af_inet.c from the git tree with
git show HEAD:net/ipv4/af_inet.c
and then made the following changes to discover which loop was
the problem:
===== BEGIN DIFF ========================
#ifdef CONFIG_IP_MULTICAST
if (inet_add_protocol(&igmp_protocol, IPPROTO_IGMP) < 0)
printk(KERN_CRIT "inet_init: Cannot add IGMP protocol\n");
#endif
+ printk(" First loop:\n");
+ printk(" SOCK_MAX = %d\n", SOCK_MAX);
+ int dwindex=0;
/* Register the socket-side information for inet_create. */
for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
+ {
+ printk(" initializing: &inetsw[%d] = %p\n", dwindex, r);
INIT_LIST_HEAD(r);
+ ++dwindex;
+ }
+ printk(" Second loop:\n");
+ printk(" INETSW_ARRAY_LEN = %d\n", INETSW_ARRAY_LEN);
+ printk(" Initial q = %p\n", inetsw_array);
+ printk(" Final q = %p\n", &inetsw_array[INETSW_ARRAY_LEN]);
+ dwindex=0;
for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
+ {
+ printk(" initializing: &q[%d]\n", dwindex);
inet_register_protosw(q);
+ ++dwindex;
+ }
/*
* Set the ARP module up
*/
===== END DIFF ========================
I then built the kernel, installed it, and rebooted. The following
output was observed:
===== BEGIN OUTPUT ========================
...
NET: Registered protocol family 2
First loop:
SOCK_MAX = 11
initializing: &inetsw[0] = ffffffff809c8460
initializing: &inetsw[1] = ffffffff809c8470
initializing: &inetsw[2] = ffffffff809c8480
initializing: &inetsw[3] = ffffffff809c8490
initializing: &inetsw[4] = ffffffff809c84a0
initializing: &inetsw[5] = ffffffff809c84b0
initializing: &inetsw[6] = ffffffff809c84c0
initializing: &inetsw[7] = ffffffff809c84d0
initializing: &inetsw[8] = ffffffff809c84e0
initializing: &inetsw[9] = ffffffff809c84f0
initializing: &inetsw[10] = ffffffff809c8500
Second loop:
INETSW_ARRAY_LEN = 3
Initial q = ffffffff806f8a20
Final q = ffffffff806f8a60
initializing: &q[0]
===== END OUTPUT ========================
This is where my kernels (2.6.26* and 2.6.27*) are freezing, in
the call of inet_register_protosw().
As I find time, I will keep trying to dig deeper. Hopefully one
of you on the LKML has an idea of what's wrong, because even
though I am familiar with C and C++ I have no background at all
with Linux kernel code itself.
Dave W.