LinuxLists.cc - 2.6.15-rc4 panic in __nr_to_section() with CONFIG

2005-12-07 22:37:26

Subject: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

Hi Andy,

I getting a panic while doing "cat /proc/<pid>/smaps" on
a process. I debugged a little to find out that faulting
IP is in _nr_to_section() - seems to be getting somehow
called by pte_offset_map_lock() from smaps_pte_range
(which show_smaps) calls.

Any ideas on why or how to debug further ?

Thanks,
Badari

> dis -l 0xc000000000108380 20

/usr/src/linux-2.6.15-rc4/fs/proc/task_mmu.c: 210 <<<<<<<<<<<<<<
0xc000000000108390 <.show_smap+344>: rldicr r0,r0,32,31
0xc000000000108394 <.show_smap+348>: add r0,r10,r0
include/linux/mmzone.h: 529
0xc000000000108398 <.show_smap+352>: rldicl r9,r0,31,33
/usr/src/linux-2.6.15-rc4/fs/proc/task_mmu.c: 210
0xc00000000010839c <.show_smap+356>: rldicl r8,r0,52,12
include/linux/mmzone.h: 528
0xc0000000001083a0 <.show_smap+360>: rldicl r0,r0,40,24
include/linux/mmzone.h: 529 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
0xc0000000001083a4 <.show_smap+364>: rldicr r9,r9,3,60
0xc0000000001083a8 <.show_smap+368>: ldx r11,r9,r11 <<<<<<

<1>Unable to handle kernel paging request for data at address
0xc0000001006d3698
Faulting instruction address: 0xc0000000001083a8
Oops: Kernel access of bad area, sig: 11 [#3]
SMP NR_CPUS=128 NUMA PSERIES LPAR
Modules linked in: evdev joydev st usbserial ehci_hcd ohci_hcd ipv6
usbcore dm_mod
NIP: C0000000001083A8 LR: C0000000001082A8 CTR: 0000000000000000
REGS: c0000000e17a7860 TRAP: 0300 Not tainted (2.6.15-rc4)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 24000424 XER: 20000010
DAR: C0000001006D3698, DSISR: 0000000040010000
TASK = c0000000d92c07e0[11538] 'cat' THREAD: c0000000e17a4000 CPU: 0
GPR00: 0000004000000600 C0000000E17A7AE0 C000000000635AF8
C0000000E17A7B50
GPR04: 0000000000000000 0000000000000000 C0000000E17A7B78
0000000000000000
GPR08: 0004000000600000 0000000100000018 0000000600000793
C0000000006D3680
GPR12: 0000000022000428 C0000000004DD000 000001003FFFFFFF
C00000000F5FECC0
GPR16: C0000000DA912E00 0000010040000000 000001003FFFFFFF
C0000000D984E100
GPR20: 0000000000000038 0000010040000000 000001000FFFFFFF
C0000000E2A68000
GPR24: 000000001001740A 0000010010000000 C0000000E17A7B50
C0000000E2A68400
GPR28: 0000010000200000 0000010000000000 C00000000055ED50
0000010000000000
NIP [C0000000001083A8] .show_smap+0x170/0x39c
LR [C0000000001082A8] .show_smap+0x70/0x39c
Call Trace:
[C0000000E17A7AE0] [C000000000108544] .show_smap+0x30c/0x39c
(unreliable)
[C0000000E17A7C10] [C0000000000EC2BC] .seq_read+0x4bc/0x510
[C0000000E17A7CF0] [C0000000000BB400] .vfs_read+0x174/0x254
[C0000000E17A7D90] [C0000000000BB5F0] .sys_read+0x54/0x9c
[C0000000E17A7E30] [C000000000008600] syscall_exit+0x0/0x18
Instruction dump:
2faa0000 419e0188 3c004000 e97e8038 7fbfeb78 38e00000 780007c6 7c0a0214
7809f860 7808a302 78004602 79291f24 <7d69582a> 2fab0000 419e000c
78001d28

2005-12-07 22:52:41

by Dave Hansen

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Wed, 2005-12-07 at 14:37 -0800, Badari Pulavarty wrote:
> Hi Andy,
>
> I getting a panic while doing "cat /proc/<pid>/smaps" on
> a process. I debugged a little to find out that faulting
> IP is in _nr_to_section() - seems to be getting somehow
> called by pte_offset_map_lock() from smaps_pte_range
> (which show_smaps) calls.
>
> Any ideas on why or how to debug further ?

You're sure it's inside of the pte_offset_map_lock()?

It's probably this call chain:

pte_offset_map_lock()
pte_offset_map()
pmd_page()
pfn_to_page()
__pfn_to_section()
__nr_to_section()

I'd probably take a hard look at the PMD first to make sure it looks
good. Then, maybe go through some of the conversions in
pte_offset_map_lock() from that chain and print out each step inside of
smaps_pte_range(). Can you trigger it easily?

-- Dave

2005-12-07 23:05:06

by Andy Whitcroft

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

Badari Pulavarty wrote:
> Hi Andy,
>
> I getting a panic while doing "cat /proc/<pid>/smaps" on
> a process. I debugged a little to find out that faulting
> IP is in _nr_to_section() - seems to be getting somehow
> called by pte_offset_map_lock() from smaps_pte_range
> (which show_smaps) calls.
>
> Any ideas on why or how to debug further ?

>From dave's call graph I'd ask the question whether we should be calling
pfn_valid() before pfn_to_page(). When I reviewed the proposed
pfn_to_page() implementation I only recall one use and that already had
the pfn_valid() in it. I'll review -rc4 in the morning.

-apw

2005-12-07 23:22:37

by Badari Pulavarty

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Wed, 2005-12-07 at 23:05 +0000, Andy Whitcroft wrote:
> Badari Pulavarty wrote:
> > Hi Andy,
> >
> > I getting a panic while doing "cat /proc/<pid>/smaps" on
> > a process. I debugged a little to find out that faulting
> > IP is in _nr_to_section() - seems to be getting somehow
> > called by pte_offset_map_lock() from smaps_pte_range
> > (which show_smaps) calls.
> >
> > Any ideas on why or how to debug further ?
>
> From dave's call graph I'd ask the question whether we should be calling
> pfn_valid() before pfn_to_page(). When I reviewed the proposed
> pfn_to_page() implementation I only recall one use and that already had
> the pfn_valid() in it. I'll review -rc4 in the morning.

BTW, the problem seems to be while dealing with shared memory areas
that are backed by largepages.

db2inst1@elm3b157:~> cat /proc/12030/maps
...
100000000-101598000 rw-s 00000000 00:07
589826 /SYSV2a079461 (deleted)
...

db2inst1@elm3b157:~> cat /proc/12030/smaps
...
100000000-101598000 rw-s 00000000 00:07 5Segmentation fault

Thanks,
Badari

2005-12-07 23:34:38

by Dave Hansen

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Wed, 2005-12-07 at 15:22 -0800, Badari Pulavarty wrote:
> On Wed, 2005-12-07 at 23:05 +0000, Andy Whitcroft wrote:
> > Badari Pulavarty wrote:
> > > Hi Andy,
> > >
> > > I getting a panic while doing "cat /proc/<pid>/smaps" on
> > > a process. I debugged a little to find out that faulting
> > > IP is in _nr_to_section() - seems to be getting somehow
> > > called by pte_offset_map_lock() from smaps_pte_range
> > > (which show_smaps) calls.
> > >
> > > Any ideas on why or how to debug further ?
> >
> > From dave's call graph I'd ask the question whether we should be calling
> > pfn_valid() before pfn_to_page(). When I reviewed the proposed
> > pfn_to_page() implementation I only recall one use and that already had
> > the pfn_valid() in it. I'll review -rc4 in the morning.
>
> BTW, the problem seems to be while dealing with shared memory areas
> that are backed by largepages.

Should you even be making it into the pte function with large pages?
Don't they just stop at the pmd level?

Maybe smaps_pmd_range() needs a pmd_huge() check.

-- Dave

2005-12-08 00:48:31

by Dave Hansen

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Wed, 2005-12-07 at 15:22 -0800, Badari Pulavarty wrote:
> BTW, the problem seems to be while dealing with shared memory areas
> that are backed by largepages.

I think this is likely not directly a sparsemem problem. It probably
just shows symptoms earlier.

See the attached patch. It attempts to detect and handle hugetlb pages
in the smaps code. However, I think one of the root issues here is that
bad_pmd() triggers for hugetlb pmds. I audited a few places where it is
called, and at least a couple of them can't have hugepages handed into
them, like fork().

-- Dave

---

proc-dups-dave/fs/proc/task_mmu.c | 22 ++++++++++++++++++++++
1 files changed, 22 insertions(+)

diff -puN fs/proc/task_mmu.c~task_mmu_fix fs/proc/task_mmu.c
--- proc-dups/fs/proc/task_mmu.c~task_mmu_fix 2005-12-07 16:34:38.000000000 -0800
+++ proc-dups-dave/fs/proc/task_mmu.c 2005-12-07 16:34:47.000000000 -0800
@@ -245,6 +245,28 @@ static inline void smaps_pmd_range(struc
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
+
+ if (pmd_huge(*pmd)) {
+ struct page *page;
+
+ page = follow_huge_pmd(vma->vm_mm, addr, pmd, 0);
+ if (!page)
+ continue;
+
+ mss->resident += HPAGE_SIZE;
+ if (page_count(page) >= 2) {
+ if (pte_dirty(*(pte_t *)pmd))
+ mss->shared_dirty += HPAGE_SIZE;
+ else
+ mss->shared_clean += HPAGE_SIZE;
+ } else {
+ if (pte_dirty(*(pte_t *)pmd))
+ mss->private_dirty += HPAGE_SIZE;
+ else
+ mss->private_clean += HPAGE_SIZE;
+ }
+ continue;
+ }
if (pmd_none_or_clear_bad(pmd))
continue;
smaps_pte_range(vma, pmd, addr, next, mss);
_

2005-12-08 16:07:21

by Badari Pulavarty

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Wed, 2005-12-07 at 16:48 -0800, Dave Hansen wrote:
> On Wed, 2005-12-07 at 15:22 -0800, Badari Pulavarty wrote:
> > BTW, the problem seems to be while dealing with shared memory areas
> > that are backed by largepages.
>
> I think this is likely not directly a sparsemem problem. It probably
> just shows symptoms earlier.
>
> See the attached patch. It attempts to detect and handle hugetlb pages
> in the smaps code. However, I think one of the root issues here is that
> bad_pmd() triggers for hugetlb pmds. I audited a few places where it is
> called, and at least a couple of them can't have hugepages handed into
> them, like fork().
>
> -- Dave
>
>
>
> ---
>
> proc-dups-dave/fs/proc/task_mmu.c | 22 ++++++++++++++++++++++
> 1 files changed, 22 insertions(+)
>
> diff -puN fs/proc/task_mmu.c~task_mmu_fix fs/proc/task_mmu.c
> --- proc-dups/fs/proc/task_mmu.c~task_mmu_fix 2005-12-07 16:34:38.000000000 -0800
> +++ proc-dups-dave/fs/proc/task_mmu.c 2005-12-07 16:34:47.000000000 -0800
> @@ -245,6 +245,28 @@ static inline void smaps_pmd_range(struc
> pmd = pmd_offset(pud, addr);
> do {
> next = pmd_addr_end(addr, end);
> +
> + if (pmd_huge(*pmd)) {
> + struct page *page;
> +
> + page = follow_huge_pmd(vma->vm_mm, addr, pmd, 0);
> + if (!page)
> + continue;
> +
> + mss->resident += HPAGE_SIZE;
> + if (page_count(page) >= 2) {
> + if (pte_dirty(*(pte_t *)pmd))
> + mss->shared_dirty += HPAGE_SIZE;
> + else
> + mss->shared_clean += HPAGE_SIZE;
> + } else {
> + if (pte_dirty(*(pte_t *)pmd))
> + mss->private_dirty += HPAGE_SIZE;
> + else
> + mss->private_clean += HPAGE_SIZE;
> + }
> + continue;
> + }
> if (pmd_none_or_clear_bad(pmd))
> continue;
> smaps_pte_range(vma, pmd, addr, next, mss);

No. It doesn't help. It looks like ppc pmd_huge() always returns 0.
Don't know why ? :(

Thanks,
Badari

2005-12-08 19:15:39

by Dave Hansen

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Thu, 2005-12-08 at 08:07 -0800, Badari Pulavarty wrote:
> No. It doesn't help. It looks like ppc pmd_huge() always returns 0.
> Don't know why ? :(

The ppc64 hugetlb pages don't line up on PMD boundaries like they do on
i386. The entries are stored in regular old PTEs.

I really don't like coding the two different hugetlb cases, but I can't
think of a better way to do it. Anyone care to test on ppc64?

-- Dave

proc-dups-dave/fs/proc/task_mmu.c | 53 +++++++++++++++++++++++++++-----------
1 files changed, 39 insertions(+), 14 deletions(-)

diff -puN fs/proc/task_mmu.c~task_mmu_fix fs/proc/task_mmu.c
--- proc-dups/fs/proc/task_mmu.c~task_mmu_fix 2005-12-07 16:34:38.000000000 -0800
+++ proc-dups-dave/fs/proc/task_mmu.c 2005-12-08 11:14:09.000000000 -0800
@@ -198,6 +198,24 @@ static int show_map(struct seq_file *m,
return show_map_internal(m, v, NULL);
}

+static void smaps_account_for_page(pte_t ptent, struct page *page,
+ unsigned long page_size,
+ struct mem_size_stats *mss)
+{
+ mss->resident += page_size;
+ if (page_count(page) >= 2) {
+ if (pte_dirty(ptent))
+ mss->shared_dirty += page_size;
+ else
+ mss->shared_clean += page_size;
+ } else {
+ if (pte_dirty(ptent))
+ mss->private_dirty += page_size;
+ else
+ mss->private_clean += page_size;
+ }
+}
+
static void smaps_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
struct mem_size_stats *mss)
@@ -205,32 +223,38 @@ static void smaps_pte_range(struct vm_ar
pte_t *pte, ptent;
spinlock_t *ptl;
unsigned long pfn;
+ unsigned long page_size;
struct page *page;

+ if (pmd_huge(*pmd)) {
+ struct page *page;
+
+ page = follow_huge_pmd(vma->vm_mm, addr, pmd, 0);
+ if (!page)
+ return;
+
+ ptent = *(pte_t *)pmd;
+ smaps_account_for_page(ptent, page, HPAGE_SIZE, mss);
+ return;
+ }
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
do {
ptent = *pte;
if (!pte_present(ptent))
continue;

- mss->resident += PAGE_SIZE;
+ page_size = PAGE_SIZE;
+ if (is_hugepage_only_range(vma->vm_mm, addr, end - addr))
+ page_size = HPAGE_SIZE;
+
+ mss->resident += page_size;
pfn = pte_pfn(ptent);
if (!pfn_valid(pfn))
continue;

page = pfn_to_page(pfn);
- if (page_count(page) >= 2) {
- if (pte_dirty(ptent))
- mss->shared_dirty += PAGE_SIZE;
- else
- mss->shared_clean += PAGE_SIZE;
- } else {
- if (pte_dirty(ptent))
- mss->private_dirty += PAGE_SIZE;
- else
- mss->private_clean += PAGE_SIZE;
- }
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ smaps_account_for_page(ptent, page, page_size, mss);
+ } while (pte++, addr += page_size, addr != end);
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
}
@@ -245,7 +269,8 @@ static inline void smaps_pmd_range(struc
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_none_or_clear_bad(pmd))
+
+ if (!pmd_huge(*pmd) && pmd_none_or_clear_bad(pmd))
continue;
smaps_pte_range(vma, pmd, addr, next, mss);
} while (pmd++, addr = next, addr != end);
_

2005-12-08 19:33:33

by Hugh Dickins

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Thu, 8 Dec 2005, Dave Hansen wrote:
> On Thu, 2005-12-08 at 08:07 -0800, Badari Pulavarty wrote:
> > No. It doesn't help. It looks like ppc pmd_huge() always returns 0.
> > Don't know why ? :(
>
> The ppc64 hugetlb pages don't line up on PMD boundaries like they do on
> i386. The entries are stored in regular old PTEs.
>
> I really don't like coding the two different hugetlb cases, but I can't
> think of a better way to do it. Anyone care to test on ppc64?

Oh, it isn't worth that effort, just test is_vm_hugetlb_page(vma)
in show_smap, and skip it if so - or make up appropriate numbers
from (vm_end - vm_start) in that case if you like.

Hugh

2005-12-08 19:47:52

by Dave Hansen

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Thu, 2005-12-08 at 19:33 +0000, Hugh Dickins wrote:
> Oh, it isn't worth that effort, just test is_vm_hugetlb_page(vma)
> in show_smap, and skip it if so - or make up appropriate numbers
> from (vm_end - vm_start) in that case if you like.

I think the reason Badari was looking at it was that some DB guys want
to get some statistics out of there. They'll certainly care about
HugeTLB pages.

With the HugeTLB prefault mechanism having gone away, we can't assume
that the pages are resident. So, I don't think just using the VMA's
size will work.

-- Dave

2005-12-08 19:52:51

by Badari Pulavarty

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Thu, 2005-12-08 at 11:47 -0800, Dave Hansen wrote:
> On Thu, 2005-12-08 at 19:33 +0000, Hugh Dickins wrote:
> > Oh, it isn't worth that effort, just test is_vm_hugetlb_page(vma)
> > in show_smap, and skip it if so - or make up appropriate numbers
> > from (vm_end - vm_start) in that case if you like.
>
> I think the reason Badari was looking at it was that some DB guys want
> to get some statistics out of there. They'll certainly care about
> HugeTLB pages.
>
> With the HugeTLB prefault mechanism having gone away, we can't assume
> that the pages are resident. So, I don't think just using the VMA's
> size will work.

Yep. You replied before I can :)

Thanks,
Badari

2005-12-08 23:55:06

by David Gibson

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Thu, Dec 08, 2005 at 11:15:35AM -0800, Dave Hansen wrote:
> On Thu, 2005-12-08 at 08:07 -0800, Badari Pulavarty wrote:
> > No. It doesn't help. It looks like ppc pmd_huge() always returns 0.
> > Don't know why ? :(
>
> The ppc64 hugetlb pages don't line up on PMD boundaries like they do on
> i386. The entries are stored in regular old PTEs.
>
> I really don't like coding the two different hugetlb cases, but I can't
> think of a better way to do it. Anyone care to test on ppc64?

> - mss->resident += PAGE_SIZE;
> + page_size = PAGE_SIZE;
> + if (is_hugepage_only_range(vma->vm_mm, addr, end - addr))
> + page_size = HPAGE_SIZE;

This is an incorrect usage of is_hugepage_only_range(). Although it
will get the right answer by accident here, that function should
*only* be used for testing whether a range is suitable for normal
pages, never for determining if hugepages are actually in use here.
You have the VMA here, so test its flag instead here.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

2005-12-09 00:16:39

by Dave Hansen

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Fri, 2005-12-09 at 10:48 +1100, David Gibson wrote:
> On Thu, Dec 08, 2005 at 11:15:35AM -0800, Dave Hansen wrote:
> > On Thu, 2005-12-08 at 08:07 -0800, Badari Pulavarty wrote:
> > > No. It doesn't help. It looks like ppc pmd_huge() always returns 0.
> > > Don't know why ? :(
> >
> > The ppc64 hugetlb pages don't line up on PMD boundaries like they do on
> > i386. The entries are stored in regular old PTEs.
> >
> > I really don't like coding the two different hugetlb cases, but I can't
> > think of a better way to do it. Anyone care to test on ppc64?
>
> > - mss->resident += PAGE_SIZE;
> > + page_size = PAGE_SIZE;
> > + if (is_hugepage_only_range(vma->vm_mm, addr, end - addr))
> > + page_size = HPAGE_SIZE;
>
> This is an incorrect usage of is_hugepage_only_range(). Although it
> will get the right answer by accident here, that function should
> *only* be used for testing whether a range is suitable for normal
> pages, never for determining if hugepages are actually in use here.
> You have the VMA here, so test its flag instead here.

Sounds good to me. The flag should be faster to test as well. Updated
patch attached. Shall I cook another one up to comment
is_hugepage_only_range() so poor suckers like me don't make the mistake
again?

-- Dave

proc-dups-dave/fs/proc/task_mmu.c | 53 +++++++++++++++++++++++++++-----------
2 files changed, 39 insertions(+), 14 deletions(-)

diff -puN fs/proc/task_mmu.c~task_mmu_fix fs/proc/task_mmu.c
--- proc-dups/fs/proc/task_mmu.c~task_mmu_fix 2005-12-07 16:34:38.000000000 -0800
+++ proc-dups-dave/fs/proc/task_mmu.c 2005-12-08 16:13:55.000000000 -0800
@@ -198,6 +198,24 @@ static int show_map(struct seq_file *m,
return show_map_internal(m, v, NULL);
}

+static void smaps_account_for_page(pte_t ptent, struct page *page,
+ unsigned long page_size,
+ struct mem_size_stats *mss)
+{
+ mss->resident += page_size;
+ if (page_count(page) >= 2) {
+ if (pte_dirty(ptent))
+ mss->shared_dirty += page_size;
+ else
+ mss->shared_clean += page_size;
+ } else {
+ if (pte_dirty(ptent))
+ mss->private_dirty += page_size;
+ else
+ mss->private_clean += page_size;
+ }
+}
+
static void smaps_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
struct mem_size_stats *mss)
@@ -205,32 +223,38 @@ static void smaps_pte_range(struct vm_ar
pte_t *pte, ptent;
spinlock_t *ptl;
unsigned long pfn;
+ unsigned long page_size;
struct page *page;

+ if (pmd_huge(*pmd)) {
+ struct page *page;
+
+ page = follow_huge_pmd(vma->vm_mm, addr, pmd, 0);
+ if (!page)
+ return;
+
+ ptent = *(pte_t *)pmd;
+ smaps_account_for_page(ptent, page, HPAGE_SIZE, mss);
+ return;
+ }
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
do {
ptent = *pte;
if (!pte_present(ptent))
continue;

- mss->resident += PAGE_SIZE;
+ page_size = PAGE_SIZE;
+ if (is_vm_hugetlb_page(vma))
+ page_size = HPAGE_SIZE;
+
+ mss->resident += page_size;
pfn = pte_pfn(ptent);
if (!pfn_valid(pfn))
continue;

page = pfn_to_page(pfn);
- if (page_count(page) >= 2) {
- if (pte_dirty(ptent))
- mss->shared_dirty += PAGE_SIZE;
- else
- mss->shared_clean += PAGE_SIZE;
- } else {
- if (pte_dirty(ptent))
- mss->private_dirty += PAGE_SIZE;
- else
- mss->private_clean += PAGE_SIZE;
- }
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ smaps_account_for_page(ptent, page, page_size, mss);
+ } while (pte++, addr += page_size, addr != end);
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
}
@@ -245,7 +269,8 @@ static inline void smaps_pmd_range(struc
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (pmd_none_or_clear_bad(pmd))
+
+ if (!pmd_huge(*pmd) && pmd_none_or_clear_bad(pmd))
continue;
smaps_pte_range(vma, pmd, addr, next, mss);
} while (pmd++, addr = next, addr != end);
diff -puN include/linux/hugetlb.h~task_mmu_fix include/linux/hugetlb.h
_

2005-12-12 16:19:06

by Badari Pulavarty

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Thu, 2005-12-08 at 16:16 -0800, Dave Hansen wrote:
> On Fri, 2005-12-09 at 10:48 +1100, David Gibson wrote:
> > On Thu, Dec 08, 2005 at 11:15:35AM -0800, Dave Hansen wrote:
> > > On Thu, 2005-12-08 at 08:07 -0800, Badari Pulavarty wrote:
> > > > No. It doesn't help. It looks like ppc pmd_huge() always returns 0.
> > > > Don't know why ? :(
> > >
> > > The ppc64 hugetlb pages don't line up on PMD boundaries like they do on
> > > i386. The entries are stored in regular old PTEs.
> > >
> > > I really don't like coding the two different hugetlb cases, but I can't
> > > think of a better way to do it. Anyone care to test on ppc64?
> >
> > > - mss->resident += PAGE_SIZE;
> > > + page_size = PAGE_SIZE;
> > > + if (is_hugepage_only_range(vma->vm_mm, addr, end - addr))
> > > + page_size = HPAGE_SIZE;
> >
> > This is an incorrect usage of is_hugepage_only_range(). Although it
> > will get the right answer by accident here, that function should
> > *only* be used for testing whether a range is suitable for normal
> > pages, never for determining if hugepages are actually in use here.
> > You have the VMA here, so test its flag instead here.
>
> Sounds good to me. The flag should be faster to test as well. Updated
> patch attached. Shall I cook another one up to comment
> is_hugepage_only_range() so poor suckers like me don't make the mistake
> again?
>
> -- Dave
>
> proc-dups-dave/fs/proc/task_mmu.c | 53 +++++++++++++++++++++++++++-----------
> 2 files changed, 39 insertions(+), 14 deletions(-)
>
> diff -puN fs/proc/task_mmu.c~task_mmu_fix fs/proc/task_mmu.c
> --- proc-dups/fs/proc/task_mmu.c~task_mmu_fix 2005-12-07 16:34:38.000000000 -0800
> +++ proc-dups-dave/fs/proc/task_mmu.c 2005-12-08 16:13:55.000000000 -0800
> @@ -198,6 +198,24 @@ static int show_map(struct seq_file *m,
> return show_map_internal(m, v, NULL);
> }
>
> +static void smaps_account_for_page(pte_t ptent, struct page *page,
> + unsigned long page_size,
> + struct mem_size_stats *mss)
> +{
> + mss->resident += page_size;
> + if (page_count(page) >= 2) {
> + if (pte_dirty(ptent))
> + mss->shared_dirty += page_size;
> + else
> + mss->shared_clean += page_size;
> + } else {
> + if (pte_dirty(ptent))
> + mss->private_dirty += page_size;
> + else
> + mss->private_clean += page_size;
> + }
> +}
> +
> static void smaps_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> struct mem_size_stats *mss)
> @@ -205,32 +223,38 @@ static void smaps_pte_range(struct vm_ar
> pte_t *pte, ptent;
> spinlock_t *ptl;
> unsigned long pfn;
> + unsigned long page_size;
> struct page *page;
>
> + if (pmd_huge(*pmd)) {
> + struct page *page;
> +
> + page = follow_huge_pmd(vma->vm_mm, addr, pmd, 0);
> + if (!page)
> + return;
> +
> + ptent = *(pte_t *)pmd;
> + smaps_account_for_page(ptent, page, HPAGE_SIZE, mss);
> + return;
> + }
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> do {
> ptent = *pte;
> if (!pte_present(ptent))
> continue;
>
> - mss->resident += PAGE_SIZE;
> + page_size = PAGE_SIZE;
> + if (is_vm_hugetlb_page(vma))
> + page_size = HPAGE_SIZE;
> +
> + mss->resident += page_size;
> pfn = pte_pfn(ptent);
> if (!pfn_valid(pfn))
> continue;
>
> page = pfn_to_page(pfn);
> - if (page_count(page) >= 2) {
> - if (pte_dirty(ptent))
> - mss->shared_dirty += PAGE_SIZE;
> - else
> - mss->shared_clean += PAGE_SIZE;
> - } else {
> - if (pte_dirty(ptent))
> - mss->private_dirty += PAGE_SIZE;
> - else
> - mss->private_clean += PAGE_SIZE;
> - }
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> + smaps_account_for_page(ptent, page, page_size, mss);
> + } while (pte++, addr += page_size, addr != end);
> pte_unmap_unlock(pte - 1, ptl);
> cond_resched();
> }
> @@ -245,7 +269,8 @@ static inline void smaps_pmd_range(struc
> pmd = pmd_offset(pud, addr);
> do {
> next = pmd_addr_end(addr, end);
> - if (pmd_none_or_clear_bad(pmd))
> +
> + if (!pmd_huge(*pmd) && pmd_none_or_clear_bad(pmd))
> continue;
> smaps_pte_range(vma, pmd, addr, next, mss);
> } while (pmd++, addr = next, addr != end);
> diff -puN include/linux/hugetlb.h~task_mmu_fix include/linux/hugetlb.h
> _

NACK.

Like I mentioned to you privately, I still get Oops with this patch for
largepages. Please let me know, if you have a new version to try.

Thanks,
Badari

2005-12-12 21:28:26

by Dave Hansen

[permalink] [raw]

Subject: Re: 2.6.15-rc4 panic in __nr_to_section() with CONFIG_SPARSEMEM

On Mon, 2005-12-12 at 08:19 -0800, Badari Pulavarty wrote:
> NACK.
>
> Like I mentioned to you privately, I still get Oops with this patch for
> largepages. Please let me know, if you have a new version to try.

I'm going to throw this back over the wall at Adam and Dave Gibson. It
is going to take a bit more than a cosmetic fix. The complexity comes
because there are two _distinct_ hugetlb cases here. The first is when
the HPTE is condensed into the PMD like on normal i386 or in the !
64K_PAGES case on ppc64. The second is when the HPTE is stored like a
normal PTE in a PTE page like on the 64K_PAGES ppc64 case.

These need to be handled in two different places in the smaps pagetable
walk. Add that into the normal small PTE case and we end up having
_three_ cases to handle for the end of the pagetable walk.

Before we did faulting for these PTEs, it would have been easy to have a
if() to hack it in at the top of the pagetable walk, but we can't do
that any more.

The big question is: do we ever need to deal with large pages in a
normal pagetable walk? If not, we can probably just hack the two extra
cases in. How do we tell in generic code whether we're looking at a
real "huge PMD", or one that points to a PTE page with "huge PTE"s?
That seems to be a ppc64-specific question for now.

Badari, appended is a patch that doesn't fix the accounting in smaps,
but will simply skip the huge page vmas. It will at least keep you from
oopsing.

-- Dave

smaps-fix-dave/fs/proc/task_mmu.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

diff -puN fs/proc/task_mmu.c~smaps_fix fs/proc/task_mmu.c
--- smaps-fix/fs/proc/task_mmu.c~smaps_fix 2005-12-12 13:19:05.000000000 -0800
+++ smaps-fix-dave/fs/proc/task_mmu.c 2005-12-12 13:21:07.000000000 -0800
@@ -289,7 +289,7 @@ static int show_smap(struct seq_file *m,
struct mem_size_stats mss;

memset(&mss, 0, sizeof mss);
- if (vma->vm_mm)
+ if (vma->vm_mm && !is_vm_hugetlb_page(vma))
smaps_pgd_range(vma, vma->vm_start, vma->vm_end, &mss);
return show_map_internal(m, v, &mss);
}
_