2010-07-20 17:35:21

by dann frazier

Subject: ia64 hang/mca running gdb 'make check'

Debian's ia64 autobuilders have been experiencing system crashes while
trying to run the gdb test suite:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=588574

I was able to reproduce this w/ the latest git tree, and bisected it
down to this commit, introduced in 2.6.32:

commit 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1
Author: Hugh Dickins <[email protected]>
Date: Mon Sep 21 17:03:34 2009 -0700

mm: ZERO_PAGE without PTE_SPECIAL

Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.

Contrary to how I'd imagined it, there's nothing ugly about this, just a
zero_pfn test built into one or another block of vm_normal_page().

But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
that: it would have to take vm_flags to weed out some cases.

fyi, I found this to not be reproducible on SLES11 SP1 (which is
2.6.32-based). I compared the .configs and found that the relevant
difference is the PAGE_SIZE. It does not fail w/ 64KB pages, but
reliably fails w/ 16KB pages.
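
Since the failure tracks the kernel's configured page size, it helps to confirm which page size a given machine is actually running with. A minimal sketch in C, assuming a POSIX system where sysconf(_SC_PAGESIZE) is available; the helper name page_size_kb is made up for illustration and is not part of the report:

```c
#include <assert.h>
#include <unistd.h>

/*
 * Return the page size of the running kernel in KB, as reported by
 * sysconf(). Hypothetical helper, for illustration only.
 */
static long page_size_kb(void)
{
    long bytes = sysconf(_SC_PAGESIZE);

    /* sysconf() returns -1 on error; a valid page size is positive. */
    assert(bytes > 0);
    return bytes / 1024;
}
```

Called from a small main(), this would be expected to report 16 on the failing Debian configuration and 64 on the SLES11 SP1 one, going by the page sizes stated above.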


2010-07-21 01:56:24

by Kamezawa Hiroyuki

Subject: Re: ia64 hang/mca running gdb 'make check'

On Tue, 20 Jul 2010 11:35:12 -0600
dann frazier <[email protected]> wrote:

> Debian's ia64 autobuilders have been experiencing system crashes while
> trying to run the gdb test suite:
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=588574
>
> I was able to reproduce this w/ the latest git tree, and bisected it
> down to this commit, introduced in 2.6.32:
>
> commit 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1
> Author: Hugh Dickins <[email protected]>
> Date: Mon Sep 21 17:03:34 2009 -0700
>
> mm: ZERO_PAGE without PTE_SPECIAL
>
> Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
> those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.
>
> Contrary to how I'd imagined it, there's nothing ugly about this, just a
> zero_pfn test built into one or another block of vm_normal_page().
>
> But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
> my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
> ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
> that: it would have to take vm_flags to weed out some cases.
>
> fyi, I found this to not be reproducible on SLES11 SP1 (which is
> 2.6.32-based). I compared the .configs and found that the relevant
> difference is the PAGE_SIZE. It does not fail w/ 64KB pages, but
> reliably fails w/ 16KB pages.
>

Sorry, I have no idea...
Hmm, what is the address of empty_zero_page[] on your Debian (16KB-page) kernel?

Thanks,
-Kame



2010-07-21 03:06:42

by dann frazier

Subject: Re: ia64 hang/mca running gdb 'make check'

On Wed, Jul 21, 2010 at 10:51:36AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 20 Jul 2010 11:35:12 -0600
> dann frazier <[email protected]> wrote:
>
> > Debian's ia64 autobuilders have been experiencing system crashes while
> > trying to run the gdb test suite:
> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=588574
> >
> > I was able to reproduce this w/ the latest git tree, and bisected it
> > down to this commit, introduced in 2.6.32:
> >
> > commit 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1
> > Author: Hugh Dickins <[email protected]>
> > Date: Mon Sep 21 17:03:34 2009 -0700
> >
> > mm: ZERO_PAGE without PTE_SPECIAL
> >
> > Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
> > those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.
> >
> > Contrary to how I'd imagined it, there's nothing ugly about this, just a
> > zero_pfn test built into one or another block of vm_normal_page().
> >
> > But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
> > my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
> > ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
> > that: it would have to take vm_flags to weed out some cases.
> >
> > fyi, I found this to not be reproducible on SLES11 SP1 (which is
> > 2.6.32-based). I compared the .configs and found that the relevant
> > difference is the PAGE_SIZE. It does not fail w/ 64KB pages, but
> > reliably fails w/ 16KB pages.
> >
>
> Sorry, I have no idea...
> Hmm, what is the address of empty_zero_page[] on your debian(16kb-page) ?


dannf@krebs:~$ grep empty_zero_page /boot/System.map-2.6.32-5-mckinley
a0000001008784c0 d __ksymtab_empty_zero_page
a000000100882688 d __kcrctab_empty_zero_page
a000000100884ca4 r __kstrtab_empty_zero_page
a000000100974000 D empty_zero_page
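
The thread does not say what the address was wanted for. One simple thing that can be read off it is its alignment relative to the two page sizes under discussion. A minimal sketch, assuming only the address printed above; the helper name is_page_aligned is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Address of empty_zero_page, copied from the System.map output above. */
static const uint64_t EMPTY_ZERO_PAGE_ADDR = 0xa000000100974000ULL;

/*
 * Return 1 if addr is a multiple of page_size, else 0.
 * page_size must be a non-zero power of two.
 */
static int is_page_aligned(uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (addr & (page_size - 1)) == 0;
}
```

The low 16 bits of the address are 0x4000, so it is a multiple of 16KB (0x4000) but not of 64KB (0x10000). That is what one would expect from a kernel built with 16KB pages, so on its own it does not explain the crash.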

2010-07-21 04:20:21

by Hugh Dickins

Subject: Re: ia64 hang/mca running gdb 'make check'

On Tue, 20 Jul 2010, dann frazier wrote:
> On Wed, Jul 21, 2010 at 10:51:36AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 20 Jul 2010 11:35:12 -0600
> > dann frazier <[email protected]> wrote:
> >
> > > Debian's ia64 autobuilders have been experiencing system crashes while
> > > trying to run the gdb test suite:
> > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=588574
> > >
> > > I was able to reproduce this w/ the latest git tree, and bisected it
> > > down to this commit, introduced in 2.6.32:
> > >
> > > commit 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1
> > > Author: Hugh Dickins <[email protected]>
> > > Date: Mon Sep 21 17:03:34 2009 -0700
> > >
> > > mm: ZERO_PAGE without PTE_SPECIAL
> > >
> > > Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
> > > those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.
> > >
> > > Contrary to how I'd imagined it, there's nothing ugly about this, just a
> > > zero_pfn test built into one or another block of vm_normal_page().
> > >
> > > But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
> > > my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
> > > ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
> > > that: it would have to take vm_flags to weed out some cases.
> > >
> > > fyi, I found this to not be reproducible on SLES11 SP1 (which is
> > > 2.6.32-based). I compared the .configs and found that the relevant
> > > difference is the PAGE_SIZE. It does not fail w/ 64KB pages, but
> > > reliably fails w/ 16KB pages.
> > >
> >
> > Sorry, I have no idea...
> > Hmm, what is the address of empty_zero_page[] on your debian(16kb-page) ?
>
>
> dannf@krebs:~$ grep empty_zero_page /boot/System.map-2.6.32-5-mckinley
> a0000001008784c0 d __ksymtab_empty_zero_page
> a000000100882688 d __kcrctab_empty_zero_page
> a000000100884ca4 r __kstrtab_empty_zero_page
> a000000100974000 D empty_zero_page

Thanks a lot for reporting this, but I too have no idea yet.

It is likely that the bug is not to be found in that 62eede62, but
rather in one of the preceding patches to mm/memory.c which 62eede62
was extending to ia64 and other architectures without PTE_SPECIAL.

I wonder, from looking at that gdb testsuite log, is it plausible
that all these hangs/crashes occurred when writing out a coredump?
Is that something you could check for us, or else rule out?

I was rather proud of the get_dump_page() simplification,
but perhaps there's something nasty lurking in there.

Hugh

2010-07-21 12:54:44

by KOSAKI Motohiro

Subject: Re: ia64 hang/mca running gdb 'make check'

> On Tue, 20 Jul 2010, dann frazier wrote:
> > On Wed, Jul 21, 2010 at 10:51:36AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 20 Jul 2010 11:35:12 -0600
> > > dann frazier <[email protected]> wrote:
> > >
> > > > Debian's ia64 autobuilders have been experiencing system crashes while
> > > > trying to run the gdb test suite:
> > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=588574
> > > >
> > > > I was able to reproduce this w/ the latest git tree, and bisected it
> > > > down to this commit, introduced in 2.6.32:
> > > >
> > > > commit 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1
> > > > Author: Hugh Dickins <[email protected]>
> > > > Date: Mon Sep 21 17:03:34 2009 -0700
> > > >
> > > > mm: ZERO_PAGE without PTE_SPECIAL
> > > >
> > > > Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
> > > > those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.
> > > >
> > > > Contrary to how I'd imagined it, there's nothing ugly about this, just a
> > > > zero_pfn test built into one or another block of vm_normal_page().
> > > >
> > > > But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
> > > > my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
> > > > ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
> > > > that: it would have to take vm_flags to weed out some cases.
> > > >
> > > > fyi, I found this to not be reproducible on SLES11 SP1 (which is
> > > > 2.6.32-based). I compared the .configs and found that the relevant
> > > > difference is the PAGE_SIZE. It does not fail w/ 64KB pages, but
> > > > reliably fails w/ 16KB pages.
> > > >
> > >
> > > Sorry, I have no idea...
> > > Hmm, what is the address of empty_zero_page[] on your debian(16kb-page) ?
> >
> >
> > dannf@krebs:~$ grep empty_zero_page /boot/System.map-2.6.32-5-mckinley
> > a0000001008784c0 d __ksymtab_empty_zero_page
> > a000000100882688 d __kcrctab_empty_zero_page
> > a000000100884ca4 r __kstrtab_empty_zero_page
> > a000000100974000 D empty_zero_page
>
> Thanks a lot for reporting this, but I too have no idea yet.
>
> It is likely that the bug is not to be found in that 62eede62, but
> rather in one of the preceding patches to mm/memory.c which 62eede62
> was extending to ia64 and other architectures without PTE_SPECIAL.
>
> I wonder, from looking at that gdb testsuite log, is it plausible
> that all these hangs/crashes occurred when writing out a coredump?
> Is that something you could check for us? or rule out the possibility.
>
> I was rather proud of the get_dump_page() simplification,
> but perhaps there's something nasty lurking in there.

Ugh. I did test some zero-page cases on ia64 while 62eede62 was being
developed, but unfortunately I have since lost my ia64 test environment
to a physical machine crash, and I don't remember which page size I
tested with ;)

Umm... I also have no idea. Sorry.


2010-07-27 07:19:28

by dann frazier

Subject: Re: ia64 hang/mca running gdb 'make check'

On Tue, Jul 20, 2010 at 09:19:50PM -0700, Hugh Dickins wrote:
> On Tue, 20 Jul 2010, dann frazier wrote:
> > On Wed, Jul 21, 2010 at 10:51:36AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 20 Jul 2010 11:35:12 -0600
> > > dann frazier <[email protected]> wrote:
> > >
> > > > Debian's ia64 autobuilders have been experiencing system crashes while
> > > > trying to run the gdb test suite:
> > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=588574
> > > >
> > > > I was able to reproduce this w/ the latest git tree, and bisected it
> > > > down to this commit, introduced in 2.6.32:
> > > >
> > > > commit 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1
> > > > Author: Hugh Dickins <[email protected]>
> > > > Date: Mon Sep 21 17:03:34 2009 -0700
> > > >
> > > > mm: ZERO_PAGE without PTE_SPECIAL
> > > >
> > > > Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
> > > > those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.
> > > >
> > > > Contrary to how I'd imagined it, there's nothing ugly about this, just a
> > > > zero_pfn test built into one or another block of vm_normal_page().
> > > >
> > > > But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
> > > > my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
> > > > ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
> > > > that: it would have to take vm_flags to weed out some cases.
> > > >
> > > > fyi, I found this to not be reproducible on SLES11 SP1 (which is
> > > > 2.6.32-based). I compared the .configs and found that the relevant
> > > > difference is the PAGE_SIZE. It does not fail w/ 64KB pages, but
> > > > reliably fails w/ 16KB pages.
> > > >
> > >
> > > Sorry, I have no idea...
> > > Hmm, what is the address of empty_zero_page[] on your debian(16kb-page) ?
> >
> >
> > dannf@krebs:~$ grep empty_zero_page /boot/System.map-2.6.32-5-mckinley
> > a0000001008784c0 d __ksymtab_empty_zero_page
> > a000000100882688 d __kcrctab_empty_zero_page
> > a000000100884ca4 r __kstrtab_empty_zero_page
> > a000000100974000 D empty_zero_page
>
> Thanks a lot for reporting this, but I too have no idea yet.
>
> It is likely that the bug is not to be found in that 62eede62, but
> rather in one of the preceding patches to mm/memory.c which 62eede62
> was extending to ia64 and other architectures without PTE_SPECIAL.
>
> I wonder, from looking at that gdb testsuite log, is it plausible
> that all these hangs/crashes occurred when writing out a coredump?
> Is that something you could check for us? or rule out the possibility.

Yep, seems so. I've reduced it down to this test case:

dannf@rx2600:~> cat > foo.c
int leaf(void) {
  return 0;
}

int main(void) {
  leaf();
}
dannf@rx2600:~> gcc -g foo.c -o foo
dannf@rx2600:~> gdb ./foo
GNU gdb (GDB) SUSE (7.0-0.4.16)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "ia64-suse-linux".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/dannf/foo...done.
(gdb) break leaf
Breakpoint 1 at 0x40000000000005c1: file foo.c, line 2.
(gdb) run
Starting program: /home/dannf/foo
Missing separate debuginfo for /lib/ld-linux-ia64.so.2
Try: zypper install -C "debuginfo(build-id)=d5bfb8b5940e174d54b978ca515dc0df76c7618c"
Missing separate debuginfo for /lib/libc.so.6.1
Try: zypper install -C "debuginfo(build-id)=ca78657bd9173653d95f8504a313d2b6db8cb1d6"

Breakpoint 1, leaf () at foo.c:2
2 return 0;
(gdb) gcore /tmp/save

[bang]

> I was rather proud of the get_dump_page() simplification,
> but perhaps there's something nasty lurking in there.
>
> Hugh
>

--
dann frazier

2010-07-27 09:08:24

by Kamezawa Hiroyuki

Subject: Re: ia64 hang/mca running gdb 'make check'

On Tue, 27 Jul 2010 01:19:15 -0600
dann frazier <[email protected]> wrote:

> On Tue, Jul 20, 2010 at 09:19:50PM -0700, Hugh Dickins wrote:
> > On Tue, 20 Jul 2010, dann frazier wrote:
> > > On Wed, Jul 21, 2010 at 10:51:36AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > On Tue, 20 Jul 2010 11:35:12 -0600
> > > > dann frazier <[email protected]> wrote:
> > > >
> > > > > Debian's ia64 autobuilders have been experiencing system crashes while
> > > > > trying to run the gdb test suite:
> > > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=588574
> > > > >
> > > > > I was able to reproduce this w/ the latest git tree, and bisected it
> > > > > down to this commit, introduced in 2.6.32:
> > > > >
> > > > > commit 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1
> > > > > Author: Hugh Dickins <[email protected]>
> > > > > Date: Mon Sep 21 17:03:34 2009 -0700
> > > > >
> > > > > mm: ZERO_PAGE without PTE_SPECIAL
> > > > >
> > > > > Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
> > > > > those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.
> > > > >
> > > > > Contrary to how I'd imagined it, there's nothing ugly about this, just a
> > > > > zero_pfn test built into one or another block of vm_normal_page().
> > > > >
> > > > > But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
> > > > > my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
> > > > > ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
> > > > > that: it would have to take vm_flags to weed out some cases.
> > > > >
> > > > > fyi, I found this to not be reproducible on SLES11 SP1 (which is
> > > > > 2.6.32-based). I compared the .configs and found that the relevant
> > > > > difference is the PAGE_SIZE. It does not fail w/ 64KB pages, but
> > > > > reliably fails w/ 16KB pages.
> > > > >
> > > >
> > > > Sorry, I have no idea...
> > > > Hmm, what is the address of empty_zero_page[] on your debian(16kb-page) ?
> > >
> > >
> > > dannf@krebs:~$ grep empty_zero_page /boot/System.map-2.6.32-5-mckinley
> > > a0000001008784c0 d __ksymtab_empty_zero_page
> > > a000000100882688 d __kcrctab_empty_zero_page
> > > a000000100884ca4 r __kstrtab_empty_zero_page
> > > a000000100974000 D empty_zero_page
> >
> > Thanks a lot for reporting this, but I too have no idea yet.
> >
> > It is likely that the bug is not to be found in that 62eede62, but
> > rather in one of the preceding patches to mm/memory.c which 62eede62
> > was extending to ia64 and other architectures without PTE_SPECIAL.
> >
> > I wonder, from looking at that gdb testsuite log, is it plausible
> > that all these hangs/crashes occurred when writing out a coredump?
> > Is that something you could check for us? or rule out the possibility.
>
> Yep, seems so. I've reduced it down to this test case:
>
> dannf@rx2600:~> cat > foo.c
> int leaf(void) {
> return 0;
> }
>
> int main(void) {
> leaf();
> }
> dannf@rx2600:~> gcc -g foo.c -o foo
> dannf@rx2600:~> gdb ./foo
> GNU gdb (GDB) SUSE (7.0-0.4.16)
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "ia64-suse-linux".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /home/dannf/foo...done.
> (gdb) break leaf
> Breakpoint 1 at 0x40000000000005c1: file foo.c, line 2.
> (gdb) run
> Starting program: /home/dannf/foo
> Missing separate debuginfo for /lib/ld-linux-ia64.so.2
> Try: zypper install -C "debuginfo(build-id)=d5bfb8b5940e174d54b978ca515dc0df76c7618c"
> Missing separate debuginfo for /lib/libc.so.6.1
> Try: zypper install -C "debuginfo(build-id)=ca78657bd9173653d95f8504a313d2b6db8cb1d6"
>
> Breakpoint 1, leaf () at foo.c:2
> 2 return 0;
> (gdb) gcore /tmp/save
>
> [bang]
>

Does this also happen on a 2.6.34 or 2.6.35-rc kernel?

Thanks,
-Kame

2010-07-27 14:43:40

by dann frazier

Subject: Re: ia64 hang/mca running gdb 'make check'

On Tue, Jul 27, 2010 at 06:03:30PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 27 Jul 2010 01:19:15 -0600
> dann frazier <[email protected]> wrote:
>
> > On Tue, Jul 20, 2010 at 09:19:50PM -0700, Hugh Dickins wrote:
> > > On Tue, 20 Jul 2010, dann frazier wrote:
> > > > On Wed, Jul 21, 2010 at 10:51:36AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > On Tue, 20 Jul 2010 11:35:12 -0600
> > > > > dann frazier <[email protected]> wrote:
> > > > >
> > > > > > Debian's ia64 autobuilders have been experiencing system crashes while
> > > > > > trying to run the gdb test suite:
> > > > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=588574
> > > > > >
> > > > > > I was able to reproduce this w/ the latest git tree, and bisected it
> > > > > > down to this commit, introduced in 2.6.32:
> > > > > >
> > > > > > commit 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1
> > > > > > Author: Hugh Dickins <[email protected]>
> > > > > > Date: Mon Sep 21 17:03:34 2009 -0700
> > > > > >
> > > > > > mm: ZERO_PAGE without PTE_SPECIAL
> > > > > >
> > > > > > Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
> > > > > > those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.
> > > > > >
> > > > > > Contrary to how I'd imagined it, there's nothing ugly about this, just a
> > > > > > zero_pfn test built into one or another block of vm_normal_page().
> > > > > >
> > > > > > But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
> > > > > > my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
> > > > > > ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
> > > > > > that: it would have to take vm_flags to weed out some cases.
> > > > > >
> > > > > > fyi, I found this to not be reproducible on SLES11 SP1 (which is
> > > > > > 2.6.32-based). I compared the .configs and found that the relevant
> > > > > > difference is the PAGE_SIZE. It does not fail w/ 64KB pages, but
> > > > > > reliably fails w/ 16KB pages.
> > > > > >
> > > > >
> > > > > Sorry, I have no idea...
> > > > > Hmm, what is the address of empty_zero_page[] on your debian(16kb-page) ?
> > > >
> > > >
> > > > dannf@krebs:~$ grep empty_zero_page /boot/System.map-2.6.32-5-mckinley
> > > > a0000001008784c0 d __ksymtab_empty_zero_page
> > > > a000000100882688 d __kcrctab_empty_zero_page
> > > > a000000100884ca4 r __kstrtab_empty_zero_page
> > > > a000000100974000 D empty_zero_page
> > >
> > > Thanks a lot for reporting this, but I too have no idea yet.
> > >
> > > It is likely that the bug is not to be found in that 62eede62, but
> > > rather in one of the preceding patches to mm/memory.c which 62eede62
> > > was extending to ia64 and other architectures without PTE_SPECIAL.
> > >
> > > I wonder, from looking at that gdb testsuite log, is it plausible
> > > that all these hangs/crashes occurred when writing out a coredump?
> > > Is that something you could check for us? or rule out the possibility.
> >
> > Yep, seems so. I've reduced it down to this test case:
> >
> > dannf@rx2600:~> cat > foo.c
> > int leaf(void) {
> > return 0;
> > }
> >
> > int main(void) {
> > leaf();
> > }
> > dannf@rx2600:~> gcc -g foo.c -o foo
> > dannf@rx2600:~> gdb ./foo
> > GNU gdb (GDB) SUSE (7.0-0.4.16)
> > Copyright (C) 2009 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> > and "show warranty" for details.
> > This GDB was configured as "ia64-suse-linux".
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>...
> > Reading symbols from /home/dannf/foo...done.
> > (gdb) break leaf
> > Breakpoint 1 at 0x40000000000005c1: file foo.c, line 2.
> > (gdb) run
> > Starting program: /home/dannf/foo
> > Missing separate debuginfo for /lib/ld-linux-ia64.so.2
> > Try: zypper install -C "debuginfo(build-id)=d5bfb8b5940e174d54b978ca515dc0df76c7618c"
> > Missing separate debuginfo for /lib/libc.so.6.1
> > Try: zypper install -C "debuginfo(build-id)=ca78657bd9173653d95f8504a313d2b6db8cb1d6"
> >
> > Breakpoint 1, leaf () at foo.c:2
> > 2 return 0;
> > (gdb) gcore /tmp/save
> >
> > [bang]
> >
>
> Does this happen on 2.6.34 or 2.6.35-rc kernel ?

I've been testing w/ a 2.6.35-rc4+, though it was originally reported
on a 2.6.32.

--
dann frazier

2010-07-29 06:38:12

by Hugh Dickins

Subject: Re: ia64 hang/mca running gdb 'make check'

On Tue, 27 Jul 2010, dann frazier wrote:
> On Tue, Jul 27, 2010 at 06:03:30PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 27 Jul 2010 01:19:15 -0600
> > dann frazier <[email protected]> wrote:
> > > On Tue, Jul 20, 2010 at 09:19:50PM -0700, Hugh Dickins wrote:
> > > > On Tue, 20 Jul 2010, dann frazier wrote:
> > > > > On Wed, Jul 21, 2010 at 10:51:36AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > > On Tue, 20 Jul 2010 11:35:12 -0600
> > > > > > dann frazier <[email protected]> wrote:
> > > > > >
> > > > > > > Debian's ia64 autobuilders have been experiencing system crashes while
> > > > > > > trying to run the gdb test suite:
> > > > > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=588574
> > > > > > >
> > > > > > > I was able to reproduce this w/ the latest git tree, and bisected it
> > > > > > > down to this commit, introduced in 2.6.32:
> > > > > > >
> > > > > > > commit 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1
> > > > > > > Author: Hugh Dickins <[email protected]>
> > > > > > > Date: Mon Sep 21 17:03:34 2009 -0700
> > > > > > >
> > > > > > > mm: ZERO_PAGE without PTE_SPECIAL
> > > > > > >
> > > > > > > Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
> > > > > > > those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.
> > > > > > >
> > > > > > > Contrary to how I'd imagined it, there's nothing ugly about this, just a
> > > > > > > zero_pfn test built into one or another block of vm_normal_page().
> > > > > > >
> > > > > > > But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
> > > > > > > my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
> > > > > > > ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
> > > > > > > that: it would have to take vm_flags to weed out some cases.
> > > > > > >
> > > > > > > fyi, I found this to not be reproducible on SLES11 SP1 (which is
> > > > > > > 2.6.32-based). I compared the .configs and found that the relevant
> > > > > > > difference is the PAGE_SIZE. It does not fail w/ 64KB pages, but
> > > > > > > reliably fails w/ 16KB pages.
> > > > > > >
> > > > > >
> > > > > > Sorry, I have no idea...
> > > > > > Hmm, what is the address of empty_zero_page[] on your debian(16kb-page) ?
> > > > >
> > > > >
> > > > > dannf@krebs:~$ grep empty_zero_page /boot/System.map-2.6.32-5-mckinley
> > > > > a0000001008784c0 d __ksymtab_empty_zero_page
> > > > > a000000100882688 d __kcrctab_empty_zero_page
> > > > > a000000100884ca4 r __kstrtab_empty_zero_page
> > > > > a000000100974000 D empty_zero_page
> > > >
> > > > Thanks a lot for reporting this, but I too have no idea yet.
> > > >
> > > > It is likely that the bug is not to be found in that 62eede62, but
> > > > rather in one of the preceding patches to mm/memory.c which 62eede62
> > > > was extending to ia64 and other architectures without PTE_SPECIAL.
> > > >
> > > > I wonder, from looking at that gdb testsuite log, is it plausible
> > > > that all these hangs/crashes occurred when writing out a coredump?
> > > > Is that something you could check for us? or rule out the possibility.
> > >
> > > Yep, seems so. I've reduced it down to this test case:
> > >
> > > dannf@rx2600:~> cat > foo.c
> > > int leaf(void) {
> > > return 0;
> > > }
> > >
> > > int main(void) {
> > > leaf();
> > > }
> > > dannf@rx2600:~> gcc -g foo.c -o foo
> > > dannf@rx2600:~> gdb ./foo
> > > GNU gdb (GDB) SUSE (7.0-0.4.16)
> > > Copyright (C) 2009 Free Software Foundation, Inc.
> > > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > > This is free software: you are free to change and redistribute it.
> > > There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> > > and "show warranty" for details.
> > > This GDB was configured as "ia64-suse-linux".
> > > For bug reporting instructions, please see:
> > > <http://www.gnu.org/software/gdb/bugs/>...
> > > Reading symbols from /home/dannf/foo...done.
> > > (gdb) break leaf
> > > Breakpoint 1 at 0x40000000000005c1: file foo.c, line 2.
> > > (gdb) run
> > > Starting program: /home/dannf/foo
> > > Missing separate debuginfo for /lib/ld-linux-ia64.so.2
> > > Try: zypper install -C "debuginfo(build-id)=d5bfb8b5940e174d54b978ca515dc0df76c7618c"
> > > Missing separate debuginfo for /lib/libc.so.6.1
> > > Try: zypper install -C "debuginfo(build-id)=ca78657bd9173653d95f8504a313d2b6db8cb1d6"
> > >
> > > Breakpoint 1, leaf () at foo.c:2
> > > 2 return 0;
> > > (gdb) gcore /tmp/save
> > >
> > > [bang]
> > >
> >
> > Does this happen on 2.6.34 or 2.6.35-rc kernel ?
>
> I've been testing w/ a 2.6.35-rc4+, though it was originally reported
> on a 2.6.32.

Thanks a lot for narrowing down to that simple testcase, and
thanks a lot for checking it's just as bad on recent kernels.

I'm sorry to say that I'm still just as baffled.

Let's note that gdb's gcore is building up its own version of a
coredump, not going through the get_dump_page() code I was wondering
about. If I read gcore correctly (possibly not!), it will be reading
selected areas from /proc/<pid>/mem i.e. using access_process_vm().

But why the (16kB but not 64kB!) zero page should make that freeze
or reboot, I have no idea.
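
For anyone who wants to exercise that access path without gdb, a rough sketch in C follows. It maps anonymous memory, never writes to it (so a read is expected to be satisfied by the zero page on kernels that use it for anonymous mappings), and reads it back through /proc/self/mem. This is an assumption-laden illustration, not a known reproducer: it reads the calling process's own memory rather than a traced child's, and the function name is hypothetical.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map len bytes of anonymous memory, leave it untouched, and read it
 * back through /proc/self/mem. Returns the number of bytes read, or -1
 * if any setup step or the read itself failed. Every byte read is
 * checked to be zero.
 */
static ssize_t read_untouched_anon_via_proc_mem(size_t len)
{
    ssize_t nread = -1;
    char *buf = malloc(len);
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int fd = open("/proc/self/mem", O_RDONLY);

    if (buf != NULL && map != MAP_FAILED && fd >= 0) {
        /* The file offset in /proc/<pid>/mem is the virtual address. */
        nread = pread(fd, buf, len, (off_t)(uintptr_t)map);
        for (ssize_t i = 0; i < nread; i++)
            assert(buf[i] == 0);
    }

    if (fd >= 0)
        close(fd);
    if (map != MAP_FAILED)
        munmap(map, len);
    free(buf);
    return nread;
}
```

Calling it with a length of a few pages from a small main() on a 16KB-page ia64 kernel would show whether a plain read of untouched anonymous memory through /proc is enough to trigger the problem, or whether something more specific to gdb's gcore is involved.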

What would I be doing if I had an Itanium? I think I'd be trying to
narrow down exactly where it goes bad (tedious when the penalty is
a freeze or reboot).

As it is, I'm hoping that someone with an ia64 can investigate...

Hugh

2010-07-29 07:38:11

by Luming Yu

Subject: Re: ia64 hang/mca running gdb 'make check'

On Tue, Jul 27, 2010 at 5:03 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 27 Jul 2010 01:19:15 -0600
> dann frazier <[email protected]> wrote:
>
>> On Tue, Jul 20, 2010 at 09:19:50PM -0700, Hugh Dickins wrote:
>> > On Tue, 20 Jul 2010, dann frazier wrote:
>> > > On Wed, Jul 21, 2010 at 10:51:36AM +0900, KAMEZAWA Hiroyuki wrote:
>> > > > On Tue, 20 Jul 2010 11:35:12 -0600
>> > > > dann frazier <[email protected]> wrote:
>> > > >
>> > > > > Debian's ia64 autobuilders have been experiencing system crashes while
>> > > > > trying to run the gdb test suite:
>> > > > >   http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=588574
>> > > > >
>> > > > > I was able to reproduce this w/ the latest git tree, and bisected it
>> > > > > down to this commit, introduced in 2.6.32:
>> > > > >
>> > > > >   commit 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1
>> > > > >   Author: Hugh Dickins <[email protected]>
>> > > > >   Date:   Mon Sep 21 17:03:34 2009 -0700
>> > > > >
>> > > > >     mm: ZERO_PAGE without PTE_SPECIAL
>> > > > >
>> > > > >     Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
>> > > > >     those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.
>> > > > >
>> > > > >     Contrary to how I'd imagined it, there's nothing ugly about this, just a
>> > > > >     zero_pfn test built into one or another block of vm_normal_page().
>> > > > >
>> > > > >     But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
>> > > > >     my_zero_pfn() inlines.  Reinstate its mremap move_pte() shuffling of
>> > > > >     ZERO_PAGEs we did from 2.6.17 to 2.6.19?  Not unless someone shouts for
>> > > > >     that: it would have to take vm_flags to weed out some cases.
>> > > > >
>> > > > > fyi, I found this to not be reproducible on SLES11 SP1 (which is
>> > > > > 2.6.32-based). I compared the .configs and found that the relevant
>> > > > > difference is the PAGE_SIZE. It does not fail w/ 64KB pages, but
>> > > > > reliably fails w/ 16KB pages.
>> > > > >
>> > > >
>> > > > Sorry, I have no idea...
>> > > > Hmm, what is the address of empty_zero_page[] on your debian(16kb-page) ?
>> > >
>> > >
>> > > dannf@krebs:~$ grep empty_zero_page /boot/System.map-2.6.32-5-mckinley
>> > > a0000001008784c0 d __ksymtab_empty_zero_page
>> > > a000000100882688 d __kcrctab_empty_zero_page
>> > > a000000100884ca4 r __kstrtab_empty_zero_page
>> > > a000000100974000 D empty_zero_page
>> >
>> > Thanks a lot for reporting this, but I too have no idea yet.
>> >
>> > It is likely that the bug is not to be found in that 62eede62, but
>> > rather in one of the preceding patches to mm/memory.c which 62eede62
>> > was extending to ia64 and other architectures without PTE_SPECIAL.
>> >
>> > I wonder, from looking at that gdb testsuite log, is it plausible
>> > that all these hangs/crashes occurred when writing out a coredump?
>> > Is that something you could check for us? or rule out the possibility.
>>
>> Yep, seems so. I've reduced it down to this test case:
>>
>> dannf@rx2600:~> cat > foo.c
>> int leaf(void) {
>>   return 0;
>> }
>>
>> int main(void) {
>>   leaf();
>> }
>> dannf@rx2600:~> gcc -g foo.c -o foo
>> dannf@rx2600:~> gdb ./foo
>> GNU gdb (GDB) SUSE (7.0-0.4.16)
>> Copyright (C) 2009 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "ia64-suse-linux".
>> For bug reporting instructions, please see:
>> <http://www.gnu.org/software/gdb/bugs/>...
>> Reading symbols from /home/dannf/foo...done.
>> (gdb) break leaf
>> Breakpoint 1 at 0x40000000000005c1: file foo.c, line 2.
>> (gdb) run
>> Starting program: /home/dannf/foo
>> Missing separate debuginfo for /lib/ld-linux-ia64.so.2
>> Try: zypper install -C "debuginfo(build-id)=d5bfb8b5940e174d54b978ca515dc0df76c7618c"
>> Missing separate debuginfo for /lib/libc.so.6.1
>> Try: zypper install -C "debuginfo(build-id)=ca78657bd9173653d95f8504a313d2b6db8cb1d6"
>>
>> Breakpoint 1, leaf () at foo.c:2
>> 2          return 0;
>> (gdb) gcore /tmp/save
>>
>> [bang]
>>
>
> Does this happen on 2.6.34 or 2.6.35-rc kernel ?


# gdb ./foo
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "ia64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/foo...done.
(gdb) break leaf
Breakpoint 1 at 0x40000000000005a1: file foo.c, line 2.
(gdb) run
Starting program: /root/foo

Breakpoint 1, leaf () at foo.c:2
2 }
(gdb) gcore /tmp/save
Segmentation fault
# cat /proc/version
Linux version 2.6.35-rc3+ ...


Does this "Segmentation fault" count as reproducing the problem?

>
> Thanks,
> -Kame
>
>

2010-07-29 08:03:03

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: ia64 hang/mca running gdb 'make check'

On Thu, 29 Jul 2010 15:38:06 +0800
Luming Yu <[email protected]> wrote:

> On Tue, Jul 27, 2010 at 5:03 PM, KAMEZAWA Hiroyuki

> # gdb ./foo
> GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "ia64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /root/foo...done.
> (gdb) break leaf
> Breakpoint 1 at 0x40000000000005a1: file foo.c, line 2.
> (gdb) run
> Starting program: /root/foo
>
> Breakpoint 1, leaf () at foo.c:2
> 2 }
> (gdb) gcore /tmp/save
> Segmentation fault
> # cat /proc/version
> Linux version 2.6.35-rc3+ ...
>
>

Hmm. What is EXEC_PAGESIZE set to in the installed /usr/include/asm-generic/param.h?
And what happens if you change it to 16k, if it's 64k?

Thanks
-Kame
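
One way to answer the EXEC_PAGESIZE question without rebuilding anything is to compare the value the installed headers provide at compile time with the page size the running kernel reports. The following is a minimal sketch, assuming a Linux system where <sys/param.h> may or may not define EXEC_PAGESIZE; the helper names compiled_exec_pagesize() and runtime_page_size() are hypothetical and not part of gdb or the kernel.

```c
#include <assert.h>
#include <sys/param.h>  /* may define EXEC_PAGESIZE, depending on the libc */
#include <unistd.h>     /* sysconf(), _SC_PAGESIZE */

/* Hypothetical helper: the page size baked into the installed headers,
 * or -1 if this libc does not define EXEC_PAGESIZE at all. */
static long compiled_exec_pagesize(void)
{
#ifdef EXEC_PAGESIZE
    return (long)EXEC_PAGESIZE;
#else
    return -1;
#endif
}

/* Hypothetical helper: the page size the running kernel actually uses. */
static long runtime_page_size(void)
{
    long size = sysconf(_SC_PAGESIZE);

    assert(size > 0);   /* sysconf() returns -1 on error */
    return size;
}
```

If the two values differ (for example 64k in the headers against a 16k kernel), any userspace tool built against those headers would be working with a stale page size. Whether gdb actually depends on EXEC_PAGESIZE is not established in this thread.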



2010-07-29 08:40:57

by Luming Yu

[permalink] [raw]
Subject: Re: ia64 hang/mca running gdb 'make check'

On Thu, Jul 29, 2010 at 3:58 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Thu, 29 Jul 2010 15:38:06 +0800
> Luming Yu <[email protected]> wrote:
>
>> On Tue, Jul 27, 2010 at 5:03 PM, KAMEZAWA Hiroyuki
>
>> # gdb ./foo
>> GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
>> Copyright (C) 2009 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "ia64-redhat-linux-gnu".
>> For bug reporting instructions, please see:
>> <http://www.gnu.org/software/gdb/bugs/>...
>> Reading symbols from /root/foo...done.
>> (gdb) break leaf
>> Breakpoint 1 at 0x40000000000005a1: file foo.c, line 2.
>> (gdb) run
>> Starting program: /root/foo
>>
>> Breakpoint 1, leaf () at foo.c:2
>> 2       }
>> (gdb) gcore /tmp/save
>> Segmentation fault
>> # cat /proc/version
>> Linux version 2.6.35-rc3+ ...
>>
>>
>
> Hmm. What is EXEC_PAGESIZE set to in the installed /usr/include/asm-generic/param.h?

I use stock gdb shipped with RHEL 5.5.

> And what happens if you change it to 16k, if it's 64k?

Do you want me to rebuild gdb with this modification?

>
> Thanks
> -Kame
>
>
>
>
>

2010-07-29 08:49:45

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: ia64 hang/mca running gdb 'make check'

On Thu, 29 Jul 2010 16:40:50 +0800
Luming Yu <[email protected]> wrote:

> On Thu, Jul 29, 2010 at 3:58 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Thu, 29 Jul 2010 15:38:06 +0800
> > Luming Yu <[email protected]> wrote:
> >
> >> On Tue, Jul 27, 2010 at 5:03 PM, KAMEZAWA Hiroyuki
> >
> >> # gdb ./foo
> >> GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
> >> Copyright (C) 2009 Free Software Foundation, Inc.
> >> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> >> This is free software: you are free to change and redistribute it.
> >> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> >> and "show warranty" for details.
> >> This GDB was configured as "ia64-redhat-linux-gnu".
> >> For bug reporting instructions, please see:
> >> <http://www.gnu.org/software/gdb/bugs/>...
> >> Reading symbols from /root/foo...done.
> >> (gdb) break leaf
> >> Breakpoint 1 at 0x40000000000005a1: file foo.c, line 2.
> >> (gdb) run
> >> Starting program: /root/foo
> >>
> >> Breakpoint 1, leaf () at foo.c:2
> >> 2       }
> >> (gdb) gcore /tmp/save
> >> Segmentation fault
> >> # cat /proc/version
> >> Linux version 2.6.35-rc3+ ...
> >>
> >>
> >
> > Hmm. What is EXEC_PAGESIZE set to in the installed /usr/include/asm-generic/param.h?
>
> I use stock gdb shipped with RHEL 5.5.
>
Hmm. RHEL5.5's EXEC_PAGESIZE is 64k, right ?
(And your kernel is 16k.)

> > And what happens if you change it to 16k, if it's 64k?
>
> Do you want me to rebuild gdb with this modification?
>
Ah, yes, that will be needed, but please do it only when you have free time.
I don't think that difference can cause an MCA or a hang...

Thanks,
-Kame

2010-07-29 19:22:30

by dann frazier

[permalink] [raw]
Subject: Re: ia64 hang/mca running gdb 'make check'

On Wed, Jul 28, 2010 at 08:50:18PM -0700, Hugh Dickins wrote:
> On Tue, 27 Jul 2010, dann frazier wrote:
> > On Tue, Jul 27, 2010 at 06:03:30PM +0900, KAMEZAWA Hiroyuki wrote:
> > > Does this happen on 2.6.34 or 2.6.35-rc kernel ?
> >
> > I've been testing w/ a 2.6.35-rc4+, though it was originally reported
> > on a 2.6.32.
>
> Thanks a lot for narrowing down to that simple testcase, and
> thanks a lot for checking it's just as bad on recent kernels.
>
> I'm sorry to say that I'm still just as baffled.
>
> Let's note that gdb's gcore is building up its own version of a
> coredump, not going through the get_dump_page() code I was wondering
> about. If I read gcore correctly (possibly not!), it will be reading
> selected areas from /proc/<pid>/mem i.e. using access_process_vm().

This appears to be correct. I was able to collect the following
stacktrace using INIT:

[ 2535.074197] Backtrace of pid 4605 (gdb)
[ 2535.074197]
[ 2535.074197] Call Trace:
[ 2535.074197] [<a00000010000bb00>] ia64_native_leave_kernel+0x0/0x270
[ 2535.074197] sp=e000004081c77c40 bsp=e000004081c71018
[ 2535.074197] [<a000000100334720>] __copy_user+0x160/0x960
[ 2535.074197] sp=e000004081c77e10 bsp=e000004081c71018
[ 2535.074197] [<a000000100176b00>] access_process_vm+0x2c0/0x380
[ 2535.074197] sp=e000004081c77e10 bsp=e000004081c70f60
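
The read path in this trace can be exercised from userspace on a small scale. The sketch below assumes a Linux system with /proc mounted and reads a few bytes of the calling process's own memory through /proc/self/mem, which is the same kind of access gcore is described as making against the target's /proc/<pid>/mem. The helper name read_own_memory() is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>      /* open(), O_RDONLY */
#include <stdint.h>     /* uintptr_t */
#include <sys/types.h>  /* ssize_t, off_t */
#include <unistd.h>     /* pread(), close() */

/* Hypothetical helper: copy len bytes starting at addr in this process's
 * address space into buf, going through /proc/self/mem instead of a plain
 * pointer dereference. Returns the byte count read, or -1 on failure. */
static ssize_t read_own_memory(const void *addr, void *buf, size_t len)
{
    int fd;
    ssize_t n;

    assert(buf != NULL && len > 0);

    fd = open("/proc/self/mem", O_RDONLY);
    if (fd < 0)
        return -1;

    /* The file offset is interpreted as a virtual address. */
    n = pread(fd, buf, len, (off_t)(uintptr_t)addr);
    close(fd);
    return n;
}
```

Reading ordinary data this way is harmless. The crash in this thread needed a read that reached the zero pages padding the ia64 gate area, which this sketch does not attempt.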

> But why the (16kB but not 64kB!) zero page should make that freeze
> or reboot, I have no idea.
>
> What would I be doing if I had an Itanium? I think I'd be trying to
> narrow down exactly where it goes bad (tedious when the penalty is
> a freeze or reboot).
>
> As it is, I'm hoping that someone with an ia64 can investigate...
>
> Hugh
>

--
dann frazier

2010-07-30 00:46:40

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: ia64 hang/mca running gdb 'make check'

On Thu, 29 Jul 2010 13:22:16 -0600
dann frazier <[email protected]> wrote:

> On Wed, Jul 28, 2010 at 08:50:18PM -0700, Hugh Dickins wrote:
> > On Tue, 27 Jul 2010, dann frazier wrote:
> > > On Tue, Jul 27, 2010 at 06:03:30PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > Does this happen on 2.6.34 or 2.6.35-rc kernel ?
> > >
> > > I've been testing w/ a 2.6.35-rc4+, though it was originally reported
> > > on a 2.6.32.
> >
> > Thanks a lot for narrowing down to that simple testcase, and
> > thanks a lot for checking it's just as bad on recent kernels.
> >
> > I'm sorry to say that I'm still just as baffled.
> >
> > Let's note that gdb's gcore is building up its own version of a
> > coredump, not going through the get_dump_page() code I was wondering
> > about. If I read gcore correctly (possibly not!), it will be reading
> > selected areas from /proc/<pid>/mem i.e. using access_process_vm().
>
> This appears to be correct. I was able to collect the following
> stacktrace using INIT:
>
> [ 2535.074197] Backtrace of pid 4605 (gdb)
> [ 2535.074197]
> [ 2535.074197] Call Trace:
> [ 2535.074197] [<a00000010000bb00>] ia64_native_leave_kernel+0x0/0x270
> [ 2535.074197] sp=e000004081c77c40 bsp=e000004081c71018
> [ 2535.074197] [<a000000100334720>] __copy_user+0x160/0x960
> [ 2535.074197] sp=e000004081c77e10 bsp=e000004081c71018
> [ 2535.074197] [<a000000100176b00>] access_process_vm+0x2c0/0x380
> [ 2535.074197] sp=e000004081c77e10 bsp=e000004081c70f60
>

Could you show the full stack? IIUC, ia64's gdb has to use both ptrace(PEEK) and
/proc/pid/mem to check the hidden register stack.

Thanks,
-Kame




2010-07-30 02:02:13

by Hugh Dickins

[permalink] [raw]
Subject: Re: ia64 hang/mca running gdb 'make check'

On Thu, 29 Jul 2010, dann frazier wrote:
> On Wed, Jul 28, 2010 at 08:50:18PM -0700, Hugh Dickins wrote:
> >
> > Let's note that gdb's gcore is building up its own version of a
> > coredump, not going through the get_dump_page() code I was wondering
> > about. If I read gcore correctly (possibly not!), it will be reading
> > selected areas from /proc/<pid>/mem i.e. using access_process_vm().
>
> This appears to be correct. I was able to collect the following
> stacktrace using INIT:
>
> [ 2535.074197] Backtrace of pid 4605 (gdb)
> [ 2535.074197]
> [ 2535.074197] Call Trace:
> [ 2535.074197] [<a00000010000bb00>] ia64_native_leave_kernel+0x0/0x270
> [ 2535.074197] sp=e000004081c77c40 bsp=e000004081c71018
> [ 2535.074197] [<a000000100334720>] __copy_user+0x160/0x960
> [ 2535.074197] sp=e000004081c77e10 bsp=e000004081c71018
> [ 2535.074197] [<a000000100176b00>] access_process_vm+0x2c0/0x380
> [ 2535.074197] sp=e000004081c77e10 bsp=e000004081c70f60

Thanks a lot, dann. But it was the [vdso] line in foo's /proc/<pid>/maps
which you sent me privately, that set me thinking on the right track.
Here's what I believe is the appropriate patch: please give it a try
and let us know...

[PATCH] mm: fix ia64 crash when gcore reads gate area

Debian's ia64 autobuilders have been seeing kernel freeze or reboot
when running the gdb testsuite (Debian bug 588574): dannf bisected to
2.6.32 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1 "mm: ZERO_PAGE without
PTE_SPECIAL"; and reproduced it with gdb's gcore on a simple target.

I'd missed updating the gate_vma handling in __get_user_pages(): that
happens to use vm_normal_page() (nowadays failing on the zero page),
yet reported success even when it failed to get a page - boom when
access_process_vm() tried to copy that to its intermediate buffer.

Fix this, resisting cleanups: in particular, leave it for now reporting
success when not asked to get any pages - very probably safe to change,
but let's not risk it without testing exposure.

Why did ia64 crash with 16kB pages, but succeed with 64kB pages?
Because setup_gate() pads each 64kB of its gate area with zero pages.

Reported-by: Andreas Barth <[email protected]>
Bisected-by: dann frazier <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected]
---

mm/memory.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

--- 2.6.35-rc6/mm/memory.c	2010-05-30 17:58:57.000000000 -0700
+++ linux/mm/memory.c	2010-07-29 17:57:29.000000000 -0700
@@ -1394,10 +1394,20 @@ int __get_user_pages(struct task_struct
 				return i ? : -EFAULT;
 			}
 			if (pages) {
-				struct page *page = vm_normal_page(gate_vma, start, *pte);
+				struct page *page;
+
+				page = vm_normal_page(gate_vma, start, *pte);
+				if (!page) {
+					if (!(gup_flags & FOLL_DUMP) &&
+					     is_zero_pfn(pte_pfn(*pte)))
+						page = pte_page(*pte);
+					else {
+						pte_unmap(pte);
+						return i ? : -EFAULT;
+					}
+				}
 				pages[i] = page;
-				if (page)
-					get_page(page);
+				get_page(page);
 			}
 			pte_unmap(pte);
 			if (vmas)
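
The failure mode described in the patch can be modelled in plain userspace C. This is only a sketch under stated assumptions: struct page, struct pte and every model_* function below are hypothetical stand-ins rather than kernel APIs, and only the single-page case (i == 0) is modelled. It shows the old gate-area path reporting success while handing back a NULL page for a zero-page PTE, and the patched path either substituting the zero page or failing with -EFAULT.

```c
#include <assert.h>
#include <errno.h>    /* EFAULT */
#include <stdbool.h>
#include <stddef.h>   /* NULL */

/* Hypothetical stand-ins for kernel structures. */
struct page { int refcount; };
struct pte  { struct page *target; };

static struct page zero_page;   /* plays the role of ZERO_PAGE(0) */
static struct page gate_page;   /* plays the role of the real gate page */

/* Models vm_normal_page() as described in the thread: since 62eede62 it
 * returns NULL when the PTE maps the zero page. */
static struct page *model_vm_normal_page(const struct pte *pte)
{
    return pte->target == &zero_page ? NULL : pte->target;
}

/* Old gate_vma path: stores whatever came back and still claims that
 * one page was obtained. */
static int model_gup_gate_old(const struct pte *pte, struct page **pages)
{
    struct page *page = model_vm_normal_page(pte);

    pages[0] = page;            /* may be NULL */
    if (page)
        page->refcount++;       /* models get_page() */
    return 1;                   /* "success" even with a NULL page */
}

/* Patched gate_vma path: fall back to the zero page unless this is a
 * dump-style lookup, otherwise report failure. */
static int model_gup_gate_fixed(const struct pte *pte, struct page **pages,
                                bool foll_dump)
{
    struct page *page = model_vm_normal_page(pte);

    if (!page) {
        if (!foll_dump && pte->target == &zero_page)
            page = pte->target; /* models pte_page(*pte) */
        else
            return -EFAULT;
    }
    pages[0] = page;
    page->refcount++;           /* page is never NULL here */
    return 1;
}
```

In the old variant, a caller that trusts the return value goes on to copy from pages[0], which is the "boom" in access_process_vm() that the commit message describes.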

2010-07-30 04:34:38

by dann frazier

[permalink] [raw]
Subject: Re: ia64 hang/mca running gdb 'make check'

On Thu, Jul 29, 2010 at 07:01:56PM -0700, Hugh Dickins wrote:
> On Thu, 29 Jul 2010, dann frazier wrote:
> > On Wed, Jul 28, 2010 at 08:50:18PM -0700, Hugh Dickins wrote:
> > >
> > > Let's note that gdb's gcore is building up its own version of a
> > > coredump, not going through the get_dump_page() code I was wondering
> > > about. If I read gcore correctly (possibly not!), it will be reading
> > > selected areas from /proc/<pid>/mem i.e. using access_process_vm().
> >
> > This appears to be correct. I was able to collect the following
> > stacktrace using INIT:
> >
> > [ 2535.074197] Backtrace of pid 4605 (gdb)
> > [ 2535.074197]
> > [ 2535.074197] Call Trace:
> > [ 2535.074197] [<a00000010000bb00>] ia64_native_leave_kernel+0x0/0x270
> > [ 2535.074197] sp=e000004081c77c40 bsp=e000004081c71018
> > [ 2535.074197] [<a000000100334720>] __copy_user+0x160/0x960
> > [ 2535.074197] sp=e000004081c77e10 bsp=e000004081c71018
> > [ 2535.074197] [<a000000100176b00>] access_process_vm+0x2c0/0x380
> > [ 2535.074197] sp=e000004081c77e10 bsp=e000004081c70f60
>
> Thanks a lot, dann. But it was the [vdso] line in foo's /proc/<pid>/maps
> which you sent me privately, that set me thinking on the right track.
> Here's what I believe is the appropriate patch: please give it a try
> and let us know...

dannf@rx2600:~> gdb foo
GNU gdb (GDB) SUSE (7.0-0.4.16)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "ia64-suse-linux".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/dannf/foo...done.
(gdb) break leaf
Breakpoint 1 at 0x4000000000000401: file foo.c, line 2.
(gdb) run
Starting program: /home/dannf/foo

Breakpoint 1, leaf () at foo.c:2
2 return 0;
(gdb) gcore
Saved corefile core.3952
(gdb)

good work Hugh!

-dann


--
dann frazier

2010-07-30 17:53:00

by Hugh Dickins

[permalink] [raw]
Subject: Re: ia64 hang/mca running gdb 'make check'

On Thu, 29 Jul 2010, dann frazier wrote:
>
> dannf@rx2600:~> gdb foo
> GNU gdb (GDB) SUSE (7.0-0.4.16)
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "ia64-suse-linux".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /home/dannf/foo...done.
> (gdb) break leaf
> Breakpoint 1 at 0x4000000000000401: file foo.c, line 2.
> (gdb) run
> Starting program: /home/dannf/foo
>
> Breakpoint 1, leaf () at foo.c:2
> 2 return 0;
> (gdb) gcore
> Saved corefile core.3952
> (gdb)

Many thanks for pursuing this and reporting back, dann.
Patch to Linus follows in a few moments.

Hugh

2010-07-30 18:00:46

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH] mm: fix ia64 crash when gcore reads gate area

Debian's ia64 autobuilders have been seeing kernel freeze or reboot
when running the gdb testsuite (Debian bug 588574): dannf bisected to
2.6.32 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1 "mm: ZERO_PAGE without
PTE_SPECIAL"; and reproduced it with gdb's gcore on a simple target.

I'd missed updating the gate_vma handling in __get_user_pages(): that
happens to use vm_normal_page() (nowadays failing on the zero page),
yet reported success even when it failed to get a page - boom when
access_process_vm() tried to copy that to its intermediate buffer.

Fix this, resisting cleanups: in particular, leave it for now reporting
success when not asked to get any pages - very probably safe to change,
but let's not risk it without testing exposure.

Why did ia64 crash with 16kB pages, but succeed with 64kB pages?
Because setup_gate() pads each 64kB of its gate area with zero pages.

Reported-by: Andreas Barth <[email protected]>
Bisected-by: dann frazier <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Tested-by: dann frazier <[email protected]>
Cc: [email protected]
---
Please add into 2.6.32-stable, 2.6.33-stable, 2.6.34-stable.

mm/memory.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

--- 2.6.35-rc6/mm/memory.c	2010-05-30 17:58:57.000000000 -0700
+++ linux/mm/memory.c	2010-07-29 17:57:29.000000000 -0700
@@ -1394,10 +1394,20 @@ int __get_user_pages(struct task_struct
 				return i ? : -EFAULT;
 			}
 			if (pages) {
-				struct page *page = vm_normal_page(gate_vma, start, *pte);
+				struct page *page;
+
+				page = vm_normal_page(gate_vma, start, *pte);
+				if (!page) {
+					if (!(gup_flags & FOLL_DUMP) &&
+					     is_zero_pfn(pte_pfn(*pte)))
+						page = pte_page(*pte);
+					else {
+						pte_unmap(pte);
+						return i ? : -EFAULT;
+					}
+				}
 				pages[i] = page;
-				if (page)
-					get_page(page);
+				get_page(page);
 			}
 			pte_unmap(pte);
 			if (vmas)
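
The 16kB versus 64kB explanation in the commit message comes down to simple arithmetic. The sketch below assumes that each 64kB stride of the gate area begins with exactly one real gate page and that the remainder is filled with zero pages. The 64kB figure comes from the commit message; the one-real-page layout, the GATE_STRIDE_BYTES macro and the function name are illustrative assumptions.

```c
#include <assert.h>

/* Assumption: the gate area is laid out in 64kB strides, as the commit
 * message says of setup_gate(). */
#define GATE_STRIDE_BYTES (64UL * 1024)

/* Hypothetical helper: the number of zero pages needed to pad one stride,
 * assuming the stride starts with a single real gate page. */
static unsigned long zero_pad_pages_per_stride(unsigned long page_size)
{
    assert(page_size > 0 && GATE_STRIDE_BYTES % page_size == 0);
    return GATE_STRIDE_BYTES / page_size - 1;
}
```

Under these assumptions a 16kB-page kernel has three zero pages per stride for gcore to stumble over, while a 64kB-page kernel has none, which matches the observation that SLES11 SP1 with 64kB pages did not fail.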