2008-08-15 03:40:54

by David Witbrodt

Subject: Re: HPET regression in 2.6.26 versus 2.6.25 -- retried 2.6.27-rc3 patch (and patch method)



> > I used 'git apply --check ' first, and got no errors, so
> > I applied it, built, installed, and rebooted.
>
> could apply with
> cat revert_*.patch | patch -p1

Yinghai,

When I got home from work tonight, I decided to make sure that I applied
your patch correctly. I considered trying it once with cat/patch and
once with 'git apply', and then comparing the results with 'diff'. I
ended up becoming convinced that 'git apply' worked just fine, and
never tried the cat/patch method:

=================================
$ git show |head
commit 30a2f3c60a84092c8084dfe788b710f8d0768cd4
Author: Linus Torvalds <[email protected]>
Date: Tue Aug 12 18:55:39 2008 -0700

Linux 2.6.27-rc3

diff --git a/Makefile b/Makefile
index fd3ca6e..53bf6ec 100644
--- a/Makefile
+++ b/Makefile

$ git apply --check --verbose ../yh-patch1-2.6.27-rc3.diff
Checking patch arch/x86/kernel/e820.c...
Checking patch arch/x86/kernel/setup.c...
Checking patch include/asm-x86/e820.h...

$ git apply --verbose ../yh-patch1-2.6.27-rc3.diff
Checking patch arch/x86/kernel/e820.c...
Checking patch arch/x86/kernel/setup.c...
Checking patch include/asm-x86/e820.h...
Applied patch arch/x86/kernel/e820.c cleanly.
Applied patch arch/x86/kernel/setup.c cleanly.
Applied patch include/asm-x86/e820.h cleanly.

dawitbro@fileserver:~/sandbox/git-kernel/linux-2.6$ git status
# Not currently on any branch.
# Changed but not updated:
# (use "git add <file>..." to update what will be committed)
#
# modified: arch/x86/kernel/e820.c
# modified: arch/x86/kernel/setup.c
# modified: include/asm-x86/e820.h
#
no changes added to commit (use "git add" and/or "git commit -a")

$ git diff > ../yh-patch1-2.6.27-rc3.git-apply.diff
=================================

As you can see, I renamed your patch file to 'yh-patch1-2.6.27-rc3.diff', and
after applying your patch I created a diff against 2.6.27-rc3 called
'yh-patch1-2.6.27-rc3.git-apply.diff'.

I have attached the output of the following command so that you can see that
your patch applied correctly:

diff -y -W 200 yh-patch1-2.6.27-rc3.diff yh-patch1-2.6.27-rc3.git-apply.diff
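
For a quick yes/no answer alongside the side-by-side listing, a rough
sketch like the one below compares only the file headers and the
added/removed lines of the two diffs, ignoring 'index' lines and hunk
offsets. The function name same_patch is hypothetical, and this is only
one way of doing the comparison:

```shell
# Sketch only: same_patch is a hypothetical helper name.
# Succeeds if both diffs contain the same '+' and '-' lines
# (file headers included); context, 'index' and '@@' lines are ignored.
same_patch() {
    a=$(mktemp)
    b=$(mktemp)
    grep '^[-+]' "$1" > "$a"
    grep '^[-+]' "$2" > "$b"
    cmp -s "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}
```

Usage would be along the lines of:
same_patch yh-patch1-2.6.27-rc3.diff yh-patch1-2.6.27-rc3.git-apply.diff && echo identical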

At this point, I retried a kernel with your patch...


> can you try enabling kexec and kdump in your .config?
>
> it should work. my .config has CONFIG_KEXEC

I ran 'make oldconfig' to get my .config up to date, then 'make menuconfig'
to make sure I had enabled CONFIG_KEXEC and CONFIG_CRASH_DUMP:

$ egrep 'KEXEC|DUMP' .config
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y


I took these steps, and posted this info, for the sake of our individual and
collective sanity! ;-) I want you all to be sure that I applied the patch
correctly, and adjusted the .config as requested, before building the kernel.

The kernel built without error, so I installed and rebooted. It locked up in
the usual way. I rebooted, using the "hpet=disable" parameter, and it booted
just fine, just like all the others since 3def3d6d...


>> I do not know how to bisect with your patch if I have a "bad" but no "good"
>> to start with. Can you explain how I should proceed when I _do_ get home?
>> (I can just enable those config options and try the patch again, but I am
>> confused about the bisect you are asking me to perform.)
>
> just like the old way of doing git-bisect, but before compiling, apply
> the patch, and before git-bisect good or bad, revert the patch.

Since I am still confused about how I should perform the bisect you are
asking for, I will wait until you can clarify. I responded earlier from
work, where I had no access to a Linux machine, so I could not quote the
git documentation I have read... to explain better why I am confused. Here
are the sections of 'git help bisect' that I had in mind:

===== BEGIN QUOTED SECTIONS =================
[...]
This command uses git-rev-list --bisect option to help drive the binary
search process to find which change introduced a bug, given an old
"good" commit object name and a later "bad" commit object name.

[...]
Basic bisect commands: start, bad, good

The way you use it is:


$ git bisect start
$ git bisect bad # Current version is bad
$ git bisect good v2.6.13-rc2 # v2.6.13-rc2 was the last version
# tested that was good
When you give at least one bad and one good versions, it will bisect
the revision tree and say something like:

Bisecting: 675 revisions left to test after this

[...]
and you continue along, compiling that one, testing it, and depending
on whether it is good or bad, you say "git bisect good" or "git bisect
bad", and ask for the next bisection.
===== END QUOTED SECTIONS =================

If there is a way to use 'git bisect' beginning with a "bad" version
but no "good" version, then it is an advanced usage that I have not
read about and do not understand how to use. As soon as you tell me
how to carry out the process, I will do so and report the results.
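
For reference, the quoted suggestion (apply the patch before compiling,
revert it before marking the revision) might be sketched as the helpers
below. This is only an illustration: the function names and the PATCH
default are hypothetical, and it still assumes a known "good" revision
is available to start from, which is exactly the open question above.

```shell
# Sketch only: PATCH and the function names are hypothetical.
PATCH=${PATCH:-../yh-patch1-2.6.27-rc3.diff}

# Apply the patch on top of whatever revision git-bisect checked out.
bisect_apply_patch() {
    git apply --check "$PATCH" && git apply "$PATCH"
}

# Put the working tree back to the pristine revision.
bisect_revert_patch() {
    git apply -R "$PATCH"
}

# Revert the patch, then record the verdict ("good" or "bad").
bisect_mark() {
    bisect_revert_patch && git bisect "$1"
}

# Typical session (GOOD_REV and BAD_REV are placeholders):
#   git bisect start
#   git bisect bad  BAD_REV
#   git bisect good GOOD_REV
#   bisect_apply_patch    # then build, install, reboot, test
#   bisect_mark bad       # or: bisect_mark good
```

If the patch does not apply at some intermediate revision, the
'git apply --check' step fails and that revision needs a manual
decision, for example 'git bisect skip'.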


In the meantime, can you comment on the bisection I did last night?
I found something very interesting about the commit that first causes
the lockup (3def3d6d...), and the very next commit (1e934dda...) -- if
I checkout 1e93... and try to revert the changes made in 3def..., the
kernel freezes in spite of the revert.

Because of this, I would conclude that your patch for 2.6.27-rc3 was
doomed before you began, and we should look more carefully at the
commits from February instead of trying to revert at the 2.6.27 HEAD.

I am not a kernel developer, so my opinions are probably safe to ignore,
but I think we should be trying to extract information from my (faulty)
machines about what is happening differently between the "bad" commit
3def3d6d... and the "good" commit before it.


I am too tired to continue my experiments tonight -- it is summer
semester final exam time at the college where I tutor, and answering
questions about calculus 2, linear algebra, nursing math, etc., has
worn me right out -- but while I wait to find out the next step I
should take with the 2.6.27-rc3 patch, I am going to try to get some
useful info printed to the console before the kernel locks up back
in those Feb. revisions.


Thanks,
Dave W.


Attachments:
yh-patch1-2.6.27-rc3.compare.txt (7.98 kB)

2008-08-15 09:08:18

by Ingo Molnar

Subject: Re: HPET regression in 2.6.26 versus 2.6.25 -- retried 2.6.27-rc3 patch (and patch method)


* David Witbrodt <[email protected]> wrote:

> I found something very interesting about the commit that first causes
> the lockup (3def3d6d...), and the very next commit (1e934dda...) -- if
> I checkout 1e93... and try to revert the changes made in 3def..., the
> kernel freezes in spite of the revert.
>
> Because of this, I would conclude that your patch for 2.6.27-rc3 was
> doomed before you began, and we should look more carefully at the
> commits from February instead of trying to revert at the 2.6.27 HEAD.

i'm still wondering whether we could try to figure out something about
the nature of the hard lockup itself.

Have you tried to activate the NMI watchdog? It _usually_ works fine if
you use a boot option along the lines of:

"lapic nmi_watchdog=2 idle=poll"

The best test would be to first boot the broken kernel with both
hpet=disable and the above options, and check in /proc/interrupts
whether the NMI count is increasing. If the NMI watchdog is working, you
should see a steady trickle of NMI irqs:

rhea:~> while sleep 1; do grep NMI /proc/interrupts ; done
NMI: 4395 Non-maskable interrupts
NMI: 4396 Non-maskable interrupts
NMI: 4397 Non-maskable interrupts
NMI: 4398 Non-maskable interrupts
^C

if it does not work, you'll see:

pluto:~> while sleep 1; do grep NMI /proc/interrupts ; done
NMI: 0 Non-maskable interrupts
NMI: 0 Non-maskable interrupts
NMI: 0 Non-maskable interrupts
NMI: 0 Non-maskable interrupts
^C

NOTE: the NMI watchdog disables high-res timers so it might change your
test enough to make the lockup go away. Hopefully it won't :-)

So, in the ideal situation, your test of the NMI watchdog will show a
steady trickle of watchdog NMIs. Then i'd suggest removing
hpet=disable, to provoke the lockup. Hopefully it occurs, _and_ after
the hard lockup has happened, you should see a nice stack backtrace
printed out by the NMI watchdog. That gives us the exact location of
the lockup.
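
The "is the NMI count increasing?" check above could also be scripted
rather than eyeballed. The following is a minimal sketch, assuming the
/proc/interrupts layout shown in the examples (an "NMI:" label followed
by one counter per CPU); the function names are hypothetical:

```shell
# Sketch only: nmi_count and nmi_watchdog_alive are hypothetical names.

# Print the sum of the per-CPU counters on the NMI line.
# $1 is the file to read; it defaults to /proc/interrupts.
nmi_count() {
    awk '$1 == "NMI:" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) n += $i
         }
         END { print n + 0 }' "${1:-/proc/interrupts}"
}

# Succeed if the NMI count grows over $1 seconds (default 5).
nmi_watchdog_alive() {
    before=$(nmi_count)
    sleep "${1:-5}"
    after=$(nmi_count)
    [ "$after" -gt "$before" ]
}
```

Running 'nmi_watchdog_alive && echo working' on the hpet=disable boot
would then report whether it is worth removing hpet=disable for the
next boot.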

One theory is that the changed resource allocations are buggy in certain
circumstances and cause us to stomp over key kernel data structures. We
could for example overwrite a networking lock - that's why you lock up
in the networking code. hpet=disable deactivates those resource
allocations and works around the symptoms of the bug.

Ingo