On Thu, 2006-01-12 at 16:07 -0600, Clark Williams wrote:
> Ingo/Steve,
>
> Did I miss a "Don't turn on latency tracing for x86_64" message
> somewhere?
>
> I'm seeing a failure on my Athlon64 3000+ where, when I turn on
> CONFIG_LATENCY_TRACE, the init program segfaults. Doesn't happen with a
> statically linked shell like sash (e.g. init=/sbin/sash boots ok), but
> if /sbin/init or /bin/sh is used, the init program segfaults. Presumably
> any dynamically linked program will fail.
>
> Attached is console output for a boot failure (segfault messages
> truncated after four lines, since they're all the same) as well as the
> config files for both working and failing kernels.
>
> I'm not sure where to start looking on this one. Barring any advice, I'm
> going to look for occurrences of CONFIG_LATENCY_TRACE, especially in
> proximity to exec.
OK, I'm actually sending you this email from an x86_64 box running
2.6.15-rt4-sr2, with latency tracing on. Unfortunately, I have an
AMD X2 where each core has its own TSC, and the two are not in sync;
since the latency tracer uses the TSC, I get garbage. But beware, the TSC
does slow down when the CPU idles, so it gives bad results even on
non-X2 systems.
I finally was able to boot this using the PM timer, but the
beginning of my dmesg is still filled with:
read_tsc: ACK! TSC went backward! Unsynced TSCs?
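
To illustrate why unsynchronized per-core counters produce that message, here is a minimal simulation sketch. It is not the kernel's read_tsc code; the Core class, the 500-cycle offset, the migration schedule and the helper names are all hypothetical, chosen only to show a task seeing time run backward when it moves between cores, and how a simple monotonic clamp hides (but does not fix) the problem.

```python
# Hypothetical model: each core's counter runs at the same rate but
# started from a different offset, as on an unsynchronized dual-core part.
class Core:
    def __init__(self, offset):
        self.offset = offset  # cycles this core's counter is off by

    def read_tsc(self, true_cycles):
        return true_cycles + self.offset


def find_backward_steps(samples):
    """Return the indices where a reading is smaller than the previous one."""
    return [i for i in range(1, len(samples)) if samples[i] < samples[i - 1]]


def clamp_monotonic(samples):
    """Never report a value below the largest value already seen."""
    out, last = [], None
    for s in samples:
        last = s if last is None else max(last, s)
        out.append(last)
    return out


cores = [Core(offset=0), Core(offset=-500)]  # core 1 lags by 500 cycles
# The task samples every 100 cycles and migrates between the two cores.
schedule = [0, 0, 1, 1, 0, 1]
raw = [cores[cpu].read_tsc(1000 + 100 * i) for i, cpu in enumerate(schedule)]
backward = find_backward_steps(raw)
clamped = clamp_monotonic(raw)

print(raw)       # readings as seen by the migrating task
print(backward)  # sample indices where the counter appeared to go backward
print(clamped)   # same readings with a monotonic guard applied
```

The clamp only keeps timestamps from decreasing; intervals measured across a migration are still wrong, which is consistent with the tracer output being garbage on such hardware.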
Have you tried booting with idle=poll? I wonder if that would help?
-- Steve
On Thu, 2006-01-12 at 22:18 -0500, Steven Rostedt wrote:
> OK, I'm actually sending you this email from an x86_64 box running
> 2.6.15-rt4-sr2, with latency tracing on. Unfortunately, I have an
> AMD X2 where each core has its own TSC, and the two are not in sync;
> since the latency tracer uses the TSC, I get garbage. But beware, the TSC
> does slow down when the CPU idles, so it gives bad results even on
> non-X2 systems.
>
Hmm, I didn't realize that (I'm running on a uni-processor system). I
just pulled your rt4-sr2 patch and will apply/rebuild/test.
> I finally was able to boot this using the PM timer, but the
> beginning of my dmesg is still filled with:
>
> read_tsc: ACK! TSC went backward! Unsynced TSCs?
>
> Have you tried booting with idle=poll? I wonder if that would help?
No, I thought that was strictly an SMP issue. I'll try it as well.
Thanks,
Clark
--
Clark Williams <[email protected]>
On Fri, 2006-01-13 at 09:06 -0600, Clark Williams wrote:
> On Thu, 2006-01-12 at 22:18 -0500, Steven Rostedt wrote:
> > OK, I'm actually sending you this email from an x86_64 box running
> > 2.6.15-rt4-sr2, with latency tracing on. Unfortunately, I have an
> > AMD X2 where each core has its own TSC, and the two are not in sync;
> > since the latency tracer uses the TSC, I get garbage. But beware, the TSC
> > does slow down when the CPU idles, so it gives bad results even on
> > non-X2 systems.
> >
> Hmm, I didn't realize that (I'm running on a uni-processor system). I
> just pulled your rt4-sr2 patch and will apply/rebuild/test.
>
> > I finally was able to boot this using the PM timer, but the
> > beginning of my dmesg is still filled with:
> >
> > read_tsc: ACK! TSC went backward! Unsynced TSCs?
> >
> > Have you tried booting with idle=poll? I wonder if that would help?
>
> No, I thought that was strictly an SMP issue. I'll try it as well.
It affects SMP mostly. I doubt that it will affect UP, but the TSC is
still not consistent.
-- Steve
On Fri, 2006-01-13 at 10:54 -0500, Steven Rostedt wrote:
>
> > > Have you tried booting with idle=poll? I wonder if that would help?
> >
> > No, I thought that was strictly an SMP issue. I'll try it as well.
>
> It affects SMP mostly. I doubt that it will affect UP, but the TSC is
> still not consistent.
I tried rt4-sr2 and got the same results (also with idle=poll for
completeness :).
I'm beginning to think that it's some sort of interaction with the BIOS
(either that or just weird hardware). When I compared the two sets of
boot outputs, the boot that succeeded (dmesg-rt4-sr2.good.txt, with
WAKEUP_TIMING, LATENCY_TIMING and LATENCY_TRACE turned off) shows APIC
probes and PCI probes that don't have corresponding prints in the failed
boot (dmesg-rt4-sr2.fail.txt).
I tried booting variously with:
pci=routeirq
acpi=noirq
lapic
but saw no change in behavior.
Have you tried booting your system with a UP kernel?
Clark
--
Clark Williams <[email protected]>
On Fri, 13 Jan 2006, Clark Williams wrote:
>
> Have you tried booting your system with a up kernel?
>
Not an x86_64 UP. But several UP i386 boxes.
-- Steve
Steven Rostedt wrote:
> On Fri, 13 Jan 2006, Clark Williams wrote:
>>Have you tried booting your system with a up kernel?
>>
> Not an x86_64 UP. But several UP i386 boxes.
>
I had a very similar problem on an x86_64 UP. I got a segfault in init
with LATENCY_TRACE enabled on 2.6.15-rt2.
I get it at ffffffff8010fe30, which should be mcount according to my
System.map [1]. It seems a bit weird because I have tried to alter
mcount somewhat, initially by removing the initial comparison, but later
I tried a few other things as well. Nothing had any effect at all.
AFAIK glibc also has an mcount symbol, and it's almost as if ld.so had
linked the glibc mcount symbol to the kernel symbol mcount. That
would naturally lead to a page fault :)
And it would be consistent with the fact that statically linked shells
work.
It's probably something completely different, because that would be
really weird. OTOH, it was really weird that I could change the asm for
mcount in entry.S without any effect whatsoever.
I'll verify that it hasn't gone away in the latest 2.6.15-rt tomorrow.
[1]
ffffffff8010fd68 T machine_check
ffffffff8010fdf0 T call_debug
ffffffff8010fe00 T call_softirq
ffffffff8010fe30 T mcount
ffffffff8010fe65 t skiptrace
ffffffff8010fe7c t out
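
For reference, the lookup from a faulting address to a symbol can be sketched as below. This is only an illustration using the six System.map lines quoted above; parse_system_map and resolve are hypothetical helper names, and a real lookup would read the full System.map of the running kernel.

```python
import bisect

# The excerpt from System.map quoted in this message.
SYSTEM_MAP = """\
ffffffff8010fd68 T machine_check
ffffffff8010fdf0 T call_debug
ffffffff8010fe00 T call_softirq
ffffffff8010fe30 T mcount
ffffffff8010fe65 t skiptrace
ffffffff8010fe7c t out
"""


def parse_system_map(text):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        addr, _symtype, name = parts
        entries.append((int(addr, 16), name))
    entries.sort()
    return entries


def resolve(entries, addr):
    """Return (symbol, offset) for the last symbol at or below addr, else None."""
    addrs = [a for a, _ in entries]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None  # address is below the first known symbol
    base, name = entries[i]
    return name, addr - base


entries = parse_system_map(SYSTEM_MAP)
print(resolve(entries, 0xFFFFFFFF8010FE30))  # the faulting address from init
```

Note that with only an excerpt loaded, any address above the last entry resolves to that last symbol, so the offset should be checked for plausibility.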
/Mikael
On Sat, 2006-01-14 at 06:12 +0100, Mikael Andersson wrote:
> Steven Rostedt wrote:
> > On Fri, 13 Jan 2006, Clark Williams wrote:
> >>Have you tried booting your system with a up kernel?
> >>
> > Not an x86_64 UP. But several UP i386 boxes.
> >
>
> I had a very similar problem on an x86_64 UP. I got a segfault in init
> with LATENCY_TRACE enabled on 2.6.15-rt2.
> I get it at ffffffff8010fe30, which should be mcount according to my
> System.map [1]. It seems a bit weird because I have tried to alter
> mcount somewhat, initially by removing the initial comparison, but later
> I tried a few other things as well. Nothing had any effect at all.
Glad to see that I'm not completely insane :).
>
> AFAIK glibc also has an mcount symbol, and it's almost as if ld.so had
> linked the glibc mcount symbol to the kernel symbol mcount. That
> would naturally lead to a page fault :)
> And it would be consistent with the fact that statically linked shells
> work.
>
> It's probably something completely different, because that would be
> really weird. OTOH, it was really weird that I could change the asm for
> mcount in entry.S without any effect whatsoever.
>
Yeah, I don't really see how ld.so would know about the kernel mcount.
AFAIK it only knows about symbols exported from ELF libraries that it
has loaded in user space.
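
As a quick sanity check on the reported fault address, here is a small sketch that classifies an x86_64 virtual address as user-half or kernel-half. It assumes the 48-bit canonical addressing used by x86_64 hardware of this generation; the classify function and constant names are hypothetical.

```python
# Assumption: 48-bit virtual addresses, so canonical addresses fall in
# a lower (user) half and an upper (kernel) half with a hole in between.
USER_MAX = 0x0000_7FFF_FFFF_FFFF    # top of the lower canonical half
KERNEL_MIN = 0xFFFF_8000_0000_0000  # bottom of the upper canonical half
ADDR_MAX = 0xFFFF_FFFF_FFFF_FFFF


def classify(addr):
    """Return 'user', 'kernel' or 'non-canonical' for a 64-bit address."""
    if 0 <= addr <= USER_MAX:
        return "user"
    if KERNEL_MIN <= addr <= ADDR_MAX:
        return "kernel"
    return "non-canonical"


print(classify(0xFFFFFFFF8010FE30))  # the address where init faulted
```

The address from the segfault lands in the kernel half, so a user-space process that jumps there will fault regardless of how the symbol got resolved.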
I'm still trying to figure out if LATENCY_TRACE could affect
do_execve(), since that's the routine that starts the init program. I'm
wondering if do_execve() is doing something that aggravates ld.so, so
that it loads the program and/or libraries incorrectly.
Guess that's what I'll be doing Monday morning...
Clark
--
Clark Williams <[email protected]>