2008-03-01 18:57:25

by Arjan van de Ven

Subject: bisected boot regression post 2.6.25-rc3.. please revert

Hi Linus, Ingo, Hans,

Please revert commit cded932b75ab0a5f9181ee3da34a0a488d1a14fd
( x86: fix pmd_bad and pud_bad to support huge pages )
since it prevents the kernel from finishing booting on my (Penryn-based)
laptop. The boot stops right after freeing the init memory.
Took a while to bisect (due to it touching page*.h, which forces a
full recompile), but it definitely is caused by this commit...

Please revert.



[arjan@localhost linux.trees.git]$ git-bisect good
cded932b75ab0a5f9181ee3da34a0a488d1a14fd is first bad commit
commit cded932b75ab0a5f9181ee3da34a0a488d1a14fd
Author: Hans Rosenfeld <[email protected]>
Date: Mon Feb 18 18:10:47 2008 +0100

x86: fix pmd_bad and pud_bad to support huge pages

I recently stumbled upon a problem in the support for huge pages. If a
program using huge pages does not explicitly unmap them, they remain
mapped (and therefore, are lost) after the program exits.

I observed that the free huge page count in /proc/meminfo decreased when
running my program, and it did not increase after the program exited.
After running the program a few times, no more huge pages could be
allocated.

The reason for this seems to be that the x86 pmd_bad and pud_bad
consider pmd/pud entries having the PSE bit set invalid. I think there
is nothing wrong with this bit being set, it just indicates that the
lowest level of translation has been reached. This bit has to be (and
is) checked after the basic validity of the entry has been checked, like
in this fragment from follow_page() in mm/memory.c:

        if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
                goto no_page_table;

        if (pmd_huge(*pmd)) {
                BUG_ON(flags & FOLL_GET);
                page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
                goto out;
        }

Note that this code currently doesn't work as intended if the pmd refers
to a huge page: the pmd_huge() check cannot be reached if the page is
huge.

Extending pmd_bad() (and, for future 1GB page support, pud_bad()) to
allow for the PSE bit being set fixes this. For similar reasons,
allowing the NX bit being set is necessary, too. I have seen huge pages
having the NX bit set in their pmd entry, which would cause the same
problem.

Signed-Off-By: Hans Rosenfeld <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

:040000 040000 1051a11e76304c0fa9229af65cbbd288a8125651 fd91df38defe41df09dde3c0955fb32a4476467a M include


--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org


2008-03-03 07:46:57

by Ingo Molnar

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert


* Arjan van de Ven <[email protected]> wrote:

> Hi Linus, Ingo, Hans,
>
> Please revert commit cded932b75ab0a5f9181ee3da34a0a488d1a14fd ( x86:
> fix pmd_bad and pud_bad to support huge pages ) since it prevents the
> kernel from finishing booting on my (Penryn-based) laptop. The boot stops
> right after freeing the init memory. Took a while to bisect (due to it
> touching page*.h, which forces a full recompile), but it definitely is
> caused by this commit...

hm, lets figure out why this patch breaks your box, ok? We obviously
have to revert it if we cannot figure it out, but lets at least try -
because the patch itself fixes a real regression and it's not obviously
wrong either. I think there might be some bug hiding somewhere that we
really want to fix instead of this revert.

Could you try to hack up a debug patch perhaps? Uninline the function(s),
then add two versions of the same pmd_bad()/pud_bad() functions (one that
breaks on your box and one that works on your box) and do something like
this (pseudocode):

pmd_bad()
{
        if (pmd_bad_working(x) != pmd_bad_broken(x))
                panic_timeout++;

        return pmd_bad_working(x);
}

i.e. we actually use the working function so your box should boot up
just fine - but we instrument things with the broken function as well
and detect the cases where the two values differ. It is an anomaly if
either function ever returns true instead of false.

if after bootup you have a non-zero panic_timeout then there is a
material difference somewhere. In that case try to stick a dump_stack()
into the above case as well - and if the box still boots, send us the
stackdumps. (if the box doesnt boot then perhaps the printout itself
hangs - in that case try a save_stack_trace() hack and print it out
later)

Ingo

2008-03-03 09:13:31

by Ingo Molnar

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert


* Ingo Molnar <[email protected]> wrote:

> Could you try to hack up a debug patch perhaps? Uninline the
> function(s), then add two versions of the same pmd_bad()/pud_bad()
> functions (one that breaks on your box and one that works on your box)
> and do something like this (pseudocode):

i.e. something like the (tested) patch below. It is not clear whether
your testcase is on 32-bit or 64-bit - so i went for the more likely
32-bit case first.

Ingo

------------------>
Subject: x86: patches/x86-debug-bad-page.patch
From: Ingo Molnar <[email protected]>
Date: Mon Mar 03 09:53:17 CET 2008

Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable_32.c | 7 +++++++
include/asm-x86/pgtable_32.h | 6 +++++-
2 files changed, 12 insertions(+), 1 deletion(-)

Index: linux-x86.q/arch/x86/mm/pgtable_32.c
===================================================================
--- linux-x86.q.orig/arch/x86/mm/pgtable_32.c
+++ linux-x86.q/arch/x86/mm/pgtable_32.c
@@ -381,3 +381,10 @@ void __pmd_free_tlb(struct mmu_gather *t
}

#endif
+
+int pmd_bad(pmd_t pmd)
+{
+        WARN_ON_ONCE(pmd_bad_v1(pmd) != pmd_bad_v2(pmd));
+
+        return pmd_bad_v1(pmd);
+}
Index: linux-x86.q/include/asm-x86/pgtable_32.h
===================================================================
--- linux-x86.q.orig/include/asm-x86/pgtable_32.h
+++ linux-x86.q/include/asm-x86/pgtable_32.h
@@ -92,7 +92,11 @@ extern unsigned long pg0[];
/* To avoid harmful races, pmd_none(x) should check only the lower when PAE */
#define pmd_none(x) (!(unsigned long)pmd_val(x))
#define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
-#define pmd_bad(x)	((pmd_val(x) \
+
+extern int pmd_bad(pmd_t pmd);
+
+#define pmd_bad_v1(x)	((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
+#define pmd_bad_v2(x)	((pmd_val(x) \
 			& ~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX)) \
 			!= _KERNPG_TABLE)

2008-03-03 17:00:48

by Arjan van de Ven

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> Could you try to hack up a debug patch perhaps? Uninline the
>> function(s), then add two versions of the same pmd_bad()/pud_bad()
>> functions (one that breaks on your box and one that works on your box)
>> and do something like this (pseudocode):
>
> i.e. something like the (tested) patch below. It is not clear whether
> your testcase is on 32-bit or 64-bit - so i went for the more likely
> 32-bit case first.


------------[ cut here ]------------
WARNING: at arch/x86/mm/pgtable_32.c:387 pmd_bad+0x44/0x53()
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.25-rc3 #14
[<c0424ba5>] warn_on_slowpath+0x41/0x67
[<c0408c5c>] ? native_sched_clock+0x94/0xa6
[<c043f432>] ? lock_release_holdtime+0x1a/0x115
[<c04702d4>] ? handle_mm_fault+0x297/0x7e2
[<c063eee6>] ? _spin_unlock+0x1d/0x20
[<c04707f0>] ? handle_mm_fault+0x7b3/0x7e2
[<c04851c1>] ? do_sync_read+0xab/0xe9
[<c0417223>] pmd_bad+0x44/0x53
[<c046f37f>] follow_page+0x8b/0x1f2
[<c0470aa0>] get_user_pages+0x281/0x2ef
[<c0488dec>] get_arg_page+0x2d/0x79
[<c04f63cd>] ? strnlen_user+0x2f/0x4d
[<c0488efb>] copy_strings+0xc3/0x160
[<c0488fb4>] copy_strings_kernel+0x1c/0x2b
[<c048a109>] do_execve+0x11a/0x1f0
[<c0403351>] sys_execve+0x29/0x51
[<c0404ac6>] syscall_call+0x7/0xb
[<c0407697>] ? kernel_execve+0x17/0x1c
[<c04021d1>] ? _stext+0x69/0x12c
[<c040574f>] ? kernel_thread_helper+0x7/0x10
=======================
---[ end trace 63b854b89f6f375d ]---


2008-03-03 17:24:16

by Nish Aravamudan

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

On 3/1/08, Arjan van de Ven <[email protected]> wrote:
> Hi Linus, Ingo, Hans,
>
> Please revert commit cded932b75ab0a5f9181ee3da34a0a488d1a14fd
> ( x86: fix pmd_bad and pud_bad to support huge pages )
> since it prevents the kernel from finishing booting on my (Penryn-based)
> laptop. The boot stops right after freeing the init memory.
> Took a while to bisect (due to it touching page*.h, which forces a
> full recompile), but it definitely is caused by this commit...
>
> Please revert.
>
>
>
> [arjan@localhost linux.trees.git]$ git-bisect good
> cded932b75ab0a5f9181ee3da34a0a488d1a14fd is first bad commit
> commit cded932b75ab0a5f9181ee3da34a0a488d1a14fd
> Author: Hans Rosenfeld <[email protected]>
> Date: Mon Feb 18 18:10:47 2008 +0100
>
> x86: fix pmd_bad and pud_bad to support huge pages
>
> I recently stumbled upon a problem in the support for huge pages. If a
> program using huge pages does not explicitly unmap them, they remain
> mapped (and therefore, are lost) after the program exits.
>
> I observed that the free huge page count in /proc/meminfo decreased when
> running my program, and it did not increase after the program exited.
> After running the program a few times, no more huge pages could be
> allocated.
>
> The reason for this seems to be that the x86 pmd_bad and pud_bad
> consider pmd/pud entries having the PSE bit set invalid. I think there
> is nothing wrong with this bit being set, it just indicates that the
> lowest level of translation has been reached. This bit has to be (and
> is) checked after the basic validity of the entry has been checked, like
> in this fragment from follow_page() in mm/memory.c:
>
>         if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
>                 goto no_page_table;
>
>         if (pmd_huge(*pmd)) {
>                 BUG_ON(flags & FOLL_GET);
>                 page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
>                 goto out;
>         }
>
> Note that this code currently doesn't work as intended if the pmd refers
> to a huge page: the pmd_huge() check cannot be reached if the page is
> huge.
>
> Extending pmd_bad() (and, for future 1GB page support, pud_bad()) to
> allow for the PSE bit being set fixes this. For similar reasons,
> allowing the NX bit being set is necessary, too. I have seen huge pages
> having the NX bit set in their pmd entry, which would cause the same
> problem.
>
> Signed-Off-By: Hans Rosenfeld <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>

Hrm, not that I necessarily have the right to ask for this, but
shouldn't hugepage-related patches at least be cc'd to linux-mm? It
seems this went via LKML to the x86 tree? There are a few of us who
work on hugepages and would have seen this there, but it got lost in
the noise (for me) on LKML.

Thanks,
Nish

2008-03-03 17:43:12

by Ingo Molnar

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert


* Arjan van de Ven <[email protected]> wrote:

> ------------[ cut here ]------------
> WARNING: at arch/x86/mm/pgtable_32.c:387 pmd_bad+0x44/0x53()
> Modules linked in:
> Pid: 1, comm: swapper Not tainted 2.6.25-rc3 #14
> [<c0424ba5>] warn_on_slowpath+0x41/0x67
> [<c0408c5c>] ? native_sched_clock+0x94/0xa6
> [<c043f432>] ? lock_release_holdtime+0x1a/0x115
> [<c04702d4>] ? handle_mm_fault+0x297/0x7e2
> [<c063eee6>] ? _spin_unlock+0x1d/0x20
> [<c04707f0>] ? handle_mm_fault+0x7b3/0x7e2
> [<c04851c1>] ? do_sync_read+0xab/0xe9
> [<c0417223>] pmd_bad+0x44/0x53
> [<c046f37f>] follow_page+0x8b/0x1f2
> [<c0470aa0>] get_user_pages+0x281/0x2ef

hm. I suspect some gcc-related difference in the handling of
this masking:

pmd_val(x) & ~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX)

versus:

pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)

perhaps it will work if you change it to:

pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX)

?

in any case, the commit has to be reverted as it clearly isnt a NOP on
your box as it was intended to be. (it should only have made a
difference in a rare hugetlbfs case)

Ingo

2008-03-03 17:51:43

by Nish Aravamudan

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

On 3/3/08, Ingo Molnar <[email protected]> wrote:
>
> * Arjan van de Ven <[email protected]> wrote:
>
>
> > ------------[ cut here ]------------
> > WARNING: at arch/x86/mm/pgtable_32.c:387 pmd_bad+0x44/0x53()
> > Modules linked in:
> > Pid: 1, comm: swapper Not tainted 2.6.25-rc3 #14
> > [<c0424ba5>] warn_on_slowpath+0x41/0x67
> > [<c0408c5c>] ? native_sched_clock+0x94/0xa6
> > [<c043f432>] ? lock_release_holdtime+0x1a/0x115
> > [<c04702d4>] ? handle_mm_fault+0x297/0x7e2
> > [<c063eee6>] ? _spin_unlock+0x1d/0x20
> > [<c04707f0>] ? handle_mm_fault+0x7b3/0x7e2
> > [<c04851c1>] ? do_sync_read+0xab/0xe9
> > [<c0417223>] pmd_bad+0x44/0x53
> > [<c046f37f>] follow_page+0x8b/0x1f2
> > [<c0470aa0>] get_user_pages+0x281/0x2ef
>
>
> hm. I suspect some gcc-related difference in the handling of
> this masking:
>
>
> pmd_val(x) & ~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX)
>
>
> versus:
>
>
> pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)
>
>
> perhaps it will work if you change it to:
>
>
> pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX)
>
>
> ?
>
> in any case, the commit has to be reverted as it clearly isnt a NOP on
> your box as it was intended to be. (it should only have made a
> difference in a rare hugetlbfs case)

On x86/{,_64}, _PAGE_PSE and _PAGE_PROTNONE are the same bit. Would
that have any effect here? We encountered that collision when adding
mprotect() support for hugepages.

Thanks,
Nish

2008-03-03 17:56:38

by Ingo Molnar

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert


* Nish Aravamudan <[email protected]> wrote:

> > in any case, the commit has to be reverted as it clearly isnt a NOP
> > on your box as it was intended to be. (it should only have made a
> > difference in a rare hugetlbfs case)
>
> On x86/{,_64}, _PAGE_PSE and _PAGE_PROTNONE are the same bit. Would
> that have any effect here? We encountered that collision when adding
> mprotect() support for hugepages.

well, this is the PMD, and PROTNONE is a pte property.

Ingo

2008-03-03 17:59:24

by H. Peter Anvin

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

Ingo Molnar wrote:
> * Nish Aravamudan <[email protected]> wrote:
>
>>> in any case, the commit has to be reverted as it clearly isnt a NOP
>>> on your box as it was intended to be. (it should only have made a
>>> difference in a rare hugetlbfs case)
>> On x86/{,_64}, _PAGE_PSE and _PAGE_PROTNONE are the same bit. Would
>> that have any effect here? We encountered that collision when adding
>> mprotect() support for hugepages.
>
> well, this is the PMD, and PROTNONE is a pte property.

The PSE bit in the PTE becomes the PAT bit. It shouldn't ever be set
in Linux.

-hpa

2008-03-03 18:38:17

by Arjan van de Ven

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

Ingo Molnar wrote:
> * Arjan van de Ven <[email protected]> wrote:
>
>> ------------[ cut here ]------------
>> WARNING: at arch/x86/mm/pgtable_32.c:387 pmd_bad+0x44/0x53()
>> Modules linked in:
>> Pid: 1, comm: swapper Not tainted 2.6.25-rc3 #14
>> [<c0424ba5>] warn_on_slowpath+0x41/0x67
>> [<c0408c5c>] ? native_sched_clock+0x94/0xa6
>> [<c043f432>] ? lock_release_holdtime+0x1a/0x115
>> [<c04702d4>] ? handle_mm_fault+0x297/0x7e2
>> [<c063eee6>] ? _spin_unlock+0x1d/0x20
>> [<c04707f0>] ? handle_mm_fault+0x7b3/0x7e2
>> [<c04851c1>] ? do_sync_read+0xab/0xe9
>> [<c0417223>] pmd_bad+0x44/0x53
>> [<c046f37f>] follow_page+0x8b/0x1f2
>> [<c0470aa0>] get_user_pages+0x281/0x2ef
>
> hm. I suspect some gcc-related difference in the handling of
> this masking:
>
> pmd_val(x) & ~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX)
>
> versus:
>
> pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)
>
> perhaps it will work if you change it to:
>
> pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX)
>
> ?
>
> in any case, the commit has to be reverted as it clearly isnt a NOP on
> your box as it was intended to be. (it should only have made a
> difference in a rare hugetlbfs case)

interesting observation: if I turn the macros into inlines... the difference goes away.

I'm half tempted to just do it as inline period ... any objections ?

2008-03-03 18:45:17

by Linus Torvalds

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert



On Mon, 3 Mar 2008, Arjan van de Ven wrote:
>
> interesting observation: if I turn the macros into inlines... the difference
> goes away.
>
> I'm half tempted to just do it as inline period ... any objections ?

Yes, I object. I want to understand why it would matter. If this is a
compiler bug, it's a really rather bad one. And if it's just some stupid
bug in our pmd_bad() macro, I still want to know what the problem was.

Can you compile both ways and look at what changed at the offending site
(which is apparently "follow_page()")?

And do you have some odd compiler version?

Linus

2008-03-03 21:14:33

by Segher Boessenkool

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

>> hm. I suspect some gcc-related difference in the handling of
>> this masking:
>> pmd_val(x) & ~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX)
>> versus:
>> pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)
>> perhaps it will work if you change it to:
>> pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX)
>> ?
>> in any case, the commit has to be reverted as it clearly isnt a NOP
>> on your box as it was intended to be. (it should only have made a
>> difference in a rare hugetlbfs case)
>
> interesting observation: if I turn the macros into inlines... the
> difference goes away.

include/asm-x86/pgtable.h has

#define _PAGE_BIT_PSE 7

#define _PAGE_PSE (_AC(1, L)<<_PAGE_BIT_PSE)

and

#define _PAGE_BIT_NX 63

#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AC(1, ULL) << _PAGE_BIT_NX)

so (on 32-bit) ~_PAGE_PSE is ~0x80L is 0xffffff7f, which when cast to
64-bit is 0x00000000ffffff7f, so in

(~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX)

all the high bits are lost, while the original

~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX)

works as intended, since the bit inversion is done on a 64-bit number.


Maybe all those pagetable bit definitions should use 64-bit (ULL or a
cast); as it is now, some dangerous detail is hidden behind the macros.
Using inline functions for simple constants seems like overkill to me,
but would also work of course.


Segher

2008-03-03 21:25:27

by Segher Boessenkool

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

> so (on 32-bit) ~_PAGE_PSE is ~0x80L is 0xffffff7f, which when cast to
> 64-bit is 0x00000000ffffff7f,

Actually, it is signed, so this isn't true. Comments about unsafeness
still apply.


Segher

2008-03-03 22:03:41

by Arjan van de Ven

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

Linus Torvalds wrote:
>
> On Mon, 3 Mar 2008, Arjan van de Ven wrote:
>> interesting observation: if I turn the macros into inlines... the difference
>> goes away.
>>
>> I'm half tempted to just do it as inline period ... any objections ?
>
> Yes, I object. I want to understand why it would matter. If this is a
> compiler bug, it's a really rather bad one. And if it's just some stupid
> bug in our pmd_bad() macro, I still want to know what the problem was.
>
> Can you compile both ways and look at what changed at the offending site
> (which is apparently "follow_page()")?
>
> And do you have some odd compiler version?

it's the F9 gcc, so yeah it's really new.

I'm staring at the disassembly and it's quite different but follow_page() is rather large;
trying to make a smaller testcase now

2008-03-03 22:35:31

by Segher Boessenkool

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

>> so (on 32-bit) ~_PAGE_PSE is ~0x80L is 0xffffff7f, which when cast to
>> 64-bit is 0x00000000ffffff7f,
>
> Actually, it is signed, so this isn't true. Comments about unsafeness
> still apply.

It turns out that PAGE_SIZE is unsigned. So this gives us for

(~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX)

the types UL, L, L, ULL resp.

The associativity of & is left-to-right, so this in turn becomes

UL, L, ULL

UL, ULL

ULL

and that cast from UL to ULL doesn't sign-extend.


Segher
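
To make the integer promotions described above concrete, here is a minimal
user-space sketch (not kernel code: the macros are simplified stand-ins for
the 2.6.25 definitions, and it assumes a 32-bit unsigned long, e.g. built
with gcc -m32):

#include <stdio.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1UL << PAGE_SHIFT)     /* unsigned long */
#define PAGE_MASK   (~(PAGE_SIZE - 1))      /* unsigned long */
#define _PAGE_USER  (1L << 2)               /* signed long */
#define _PAGE_PSE   (1L << 7)               /* signed long */
#define _PAGE_NX    (1ULL << 63)            /* unsigned long long */

int main(void)
{
        /*
         * OR first, invert once: the inversion happens on a 64-bit value,
         * so bit 63 (NX) is cleared, but bits 32..62 stay set because the
         * 32-bit PAGE_MASK never covered them.
         */
        unsigned long long not_of_or =
                ~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX);

        /*
         * Chain of 32-bit ANDs, zero-extended only when _PAGE_NX joins in:
         * everything above bit 31 is lost.
         */
        unsigned long long and_of_nots =
                ~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX;

        printf("~(a|b|c|d)  = %016llx\n", not_of_or);   /* 7fffffff00000f7b */
        printf("~a&~b&~c&~d = %016llx\n", and_of_nots); /* 0000000000000f7b */
        return 0;
}

So the single-inversion form keeps bits 32..62 in the inverted mask, and any
physical address bits above 4GB in a PAE pmd will make that check fire.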

2008-03-03 22:59:48

by H. Peter Anvin

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

Segher Boessenkool wrote:
>>> so (on 32-bit) ~_PAGE_PSE is ~0x80L is 0xffffff7f, which when cast to
>>> 64-bit is 0x00000000ffffff7f,
>>
>> Actually, it is signed, so this isn't true. Comments about unsafeness
>> still apply.
>
> It turns out that PAGE_SIZE is unsigned. So this gives us for
>
> (~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX)
>
> the types UL, L, L, ULL resp.
>
> The associativity of & is left-to-right, so this in turn becomes
>
> UL, L, ULL
>
> UL, ULL
>
> ULL
>
> and that cast from UL to ULL doesn't sign-extend.
>

All the masks either need to be the proper size or signed.

-hpa

2008-03-03 23:01:30

by Jeremy Fitzhardinge

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

Segher Boessenkool wrote:
>>> so (on 32-bit) ~_PAGE_PSE is ~0x80L is 0xffffff7f, which when cast to
>>> 64-bit is 0x00000000ffffff7f,
>>
>> Actually, it is signed, so this isn't true. Comments about unsafeness
>> still apply.
>
> It turns out that PAGE_SIZE is unsigned. So this gives us for
>
> (~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX)
>
> the types UL, L, L, ULL resp.
>
> The associativity of & is left-to-right, so this in turn becomes
>
> UL, L, ULL
>
> UL, ULL
>
> ULL
>
> and that cast from UL to ULL doesn't sign-extend.

I went through and made a bunch of these flags signed so that they would
sign-extend properly in 32 vs 64 bit. We could make them
unconditionally 64-bit, but I'm worried that will have unforeseen
consequences too.

In this case, PAGE_MASK should probably be signed too, so the sign
extension happens properly.

Alternatively, they could be cast to pteval_t and/or phys_addr_t so that
they will have an inherent size which matches the pagetable format
(using _AT() rather than _AC()).

J

2008-03-03 23:11:46

by H. Peter Anvin

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

Jeremy Fitzhardinge wrote:
>
> I went through and made a bunch of these flags signed so that they would
> sign-extend properly in 32 vs 64 bit. We could make them
> unconditionally 64-bit, but I'm worried that will have unforeseen
> consequences too.
>
> In this case, PAGE_MASK should probably be signed too, so the sign
> extension happens properly.
>

Yes, it should.

> Alternatively, they could be cast to pteval_t and/or phys_addr_t so that
> they will have an inherent size which matches the pagetable format
> (using _AT() rather than _AC()).

Either that, or we could use a custom constructor macro (PTEVAL_C()).
This is arguably The Right Thing, but the sign-extending version is OK, too.

-hpa

2008-03-04 01:05:27

by Arjan van de Ven

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

Arjan van de Ven wrote:
> Linus Torvalds wrote:
>>
>> On Mon, 3 Mar 2008, Arjan van de Ven wrote:
>>> interesting observation: if I turn the macros into inlines... the
>>> difference
>>> goes away.
>>>
>>> I'm half tempted to just do it as inline period ... any objections ?
>>
>> Yes, I object. I want to understand why it would matter. If this is a
>> compiler bug, it's a really rather bad one. And if it's just some
>> stupid bug in our pmd_bad() macro, I still want to know what the
>> problem was.
>>
>> Can you compile both ways and look at what changed at the offending
>> site (which is apparently "follow_page()")?
>>
>> And do you have some odd compiler version?
>
> it's the F9 gcc, so yeah it's really new.
>
> I'm staring at the disassembly and it's quite different but
> follow_page() is rather large;
> trying to make a smaller testcase now

sadly a more isolated testcase doesn't show the same behavior yet;
so it's some fun interaction ;(

more staring at the assembly for me

2008-03-04 06:53:53

by Ingo Molnar

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert


* Arjan van de Ven <[email protected]> wrote:

>> I'm staring at the disassembly and it's quite different but
>> follow_page() is rather large; trying to make a smaller testcase now
>
> sadly a more isolated testcase doesn't show the same behavior yet; so
> it's some fun interaction ;(
>
> more staring at the assembly for me

could you post the "bad" and the "good" disassembled functions, and the
specific gcc version string?

Ingo

2008-03-04 06:59:20

by Ingo Molnar

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert


* Segher Boessenkool <[email protected]> wrote:

> so (on 32-bit) ~_PAGE_PSE is ~0x80L is 0xffffff7f, which when cast to
> 64-bit is 0x00000000ffffff7f, so in
>
> (~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX)
>
> all the high bits are lost, while the original
>
> ~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX)
>
> works as intended, since the bit inversion is done on a 64-bit number.

but we really are interested in the low bits here (lets ignore the NX
bit for now) and the patch has been in the queue for a long time (more
than a month), so if there was a trivial mask mixup problem it would
have shown on the first day. So i suspect some gcc bug instead - and
certainly the colorful mixture of types and signs in this expression
might have surprised a new version of gcc somewhere.

Ingo

2008-03-05 15:36:42

by Arjan van de Ven

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

Ingo Molnar wrote:
> * Arjan van de Ven <[email protected]> wrote:
>
>>> I'm staring at the disassembly and it's quite different but
>>> follow_page() is rather large; trying to make a smaller testcase now
>> sadly a more isolated testcase doesn't show the same behavior yet; so
>> it's some fun interaction ;(
>>
>> more staring at the assembly for me
>
> could you post the "bad" and the "good" disassembled functions, and the
> specific gcc version string?
>

http://www.fenrus.org/memory.asm.macro <-- failing one
http://www.fenrus.org/memory.asm.inline <-- working one

2008-03-09 11:56:34

by Ingo Molnar

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert


* Linus Torvalds <[email protected]> wrote:

> > I'm half tempted to just do it as inline period ... any objections ?
>
> Yes, I object. I want to understand why it would matter. If this is a
> compiler bug, it's a really rather bad one. And if it's just some
> stupid bug in our pmd_bad() macro, I still want to know what the
> problem was.

ok, i think i figured it out: on PAE 32-bit we dont properly sign-extend
to a 64-bit pmd value in the new pmd_bad() macro - so if any physical
RAM is above 4GB (Arjan's laptop had 4GB of RAM in it?) then we'll start
seeing those high bits. This definitely needs PAE and more than 4GB of
RAM to trigger.

The best fix is the one below (it should solve Arjan's regression with
that now-reverted patch redone), as it is the right thing to do [that
way sign auto-extend trickles over into PAGE_MASK as well].

It boots fine on a >4GB box of mine but changing the type of PAGE_SIZE
affects _everything_ so i'll keep testing it and i'd suggest delaying
this fix until shortly after -rc5 is released instead of risking -rc5 with
such a late commit. I'll send this and the re-done hugetlbfs fix
together early next week.

Ingo

Signed-off-by: Ingo Molnar <[email protected]>

Index: linux/include/asm-x86/page.h
===================================================================
--- linux.orig/include/asm-x86/page.h
+++ linux/include/asm-x86/page.h
@@ -5,7 +5,7 @@

/* PAGE_SHIFT determines the page size */
#define PAGE_SHIFT 12
-#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)
+#define PAGE_SIZE (_AC(1,L) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))

#ifdef __KERNEL__
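
As a user-space illustration of the effect the one-liner above relies on
(simplified stand-ins again, assuming a 32-bit long, e.g. gcc -m32): once
PAGE_SIZE is a signed constant, the PAGE_MASK derived from it sign-extends
when it is widened to the 64-bit pmd type:

#include <stdio.h>

#define PAGE_SHIFT          12
#define PAGE_SIZE_UNSIGNED  (1UL << PAGE_SHIFT)  /* current definition */
#define PAGE_SIZE_SIGNED    (1L << PAGE_SHIFT)   /* what the patch changes it to */

int main(void)
{
        /* unsigned 32-bit PAGE_MASK is zero-extended to 64 bits */
        unsigned long long mask_unsigned = ~(PAGE_SIZE_UNSIGNED - 1);

        /* signed 32-bit PAGE_MASK is sign-extended to 64 bits */
        unsigned long long mask_signed = ~(PAGE_SIZE_SIGNED - 1);

        printf("unsigned PAGE_MASK, widened: %016llx\n", mask_unsigned); /* 00000000fffff000 */
        printf("signed PAGE_MASK, widened:   %016llx\n", mask_signed);   /* fffffffffffff000 */
        return 0;
}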

2008-03-09 17:28:28

by Linus Torvalds

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert



On Sun, 9 Mar 2008, Ingo Molnar wrote:
>
> The best fix is the one below (it should solve Arjan's regression with
> that now-reverted patch redone), as it is the right thing to do [that
> way sign auto-extend trickles over into PAGE_MASK as well].

This *really* makes me worry.

I do agree that there are good reasons that "PAGE_MASK" should be signed
(since we do want the top bit to extend), but changing "PAGE_SIZE" to
signed seems to be a rather big change, considering that it's used
*everywhere*.

In particular, it's quite possibly used for things like

offset = something % PAGE_SIZE;

etc, where a signed divide is positively wrong.

But even for PAGE_MASK, we literally have code like this:

if ((size_avail & PAGE_MASK) < rg.size) {

where it so _happens_ that "size_avail" is unsigned, but what if it
wasn't? It could turn an unsigned comparison into a signed one, and
introduce any number of security bugs etc.

Same goes even more for PAGE_SIZE. At least there are only about a thousand
users of PAGE_MASK in the kernel; PAGE_SIZE is used about six times as
often, and just a _trivial_ grep like

git grep 'PAGE_SIZE.* [<>][= ]'

finds a lot of cases where I'm not at all sure that it's safe to change
PAGE_SIZE to a signed value.

In other words, there are lots of things like

if (x < PAGE_SIZE)
...

where we currently get an unsigned comparison, and where for all I know a
signed PAGE_SIZE means that we should use

if (x >= 0 && x < PAGE_SIZE)

instead.

In short, I refuse to apply this patch after an -rc1 release. I suspect
that I shouldn't apply something like this even *before* an -rc1, because
I think it's just a really bad idea to make these types signed even if it
were to give you magically easier sign extensions to "unsigned long long".

So I would *very* strongly instead argue:

- "unsigned long" is the native kernel type for all address manipulation,
and thus "PAGE_SIZE" and "PAGE_MASK" should continue to have that type.

- anything that uses any other type without explicitly making sure it's
safe is mis-using those macros. IOW, PAGE_MASK was *never* a type that
had anything whatsoever to do with page table entry bits, and this is
purely a page table entry issue!

So my suggested patch would:

- make the page table code use a specific mask that it builds up itself,
and makes sure it's of the right type and has the right value in
whatever type "struct pte_entry" is. The fact that "pte_val()" is
larger than "unsigned long" on x86-32 is very clearly a PTE issue,
*not* an issue for PAGE_SIZE or PAGE_MASK.

Btw, just one look at your other patch should have convinced you of that
anyway. Do you really think this is a readable patch or that the result is
clean:

+#define pmd_bad_v1(x)	((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
+#define pmd_bad_v2(x)	((pmd_val(x) \
			& ~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX)) \
			!= _KERNPG_TABLE)

when the real problem is that the mask you build up here isn't safe or
pretty to begin with!

So make that whole "~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX)"
expression a nice *clean* expression that is about page table entries
instead, and make *that* one be the right type and have the right bits.
And suddenly the problem just goes away.

In fact, if you look at that expression, you suddenly realize that
PAGE_MASK was *totally* the wrong value to use in the first place, whether
sign-extended or not! Notice how

PAGE_MASK | _PAGE_NX

is already a totally senseless operation if PAGE_MASK has all high bits
set!

So I think your whole argument and the patch is UTTER AND UNBELIEVABLE
CRAP!

Blaming it on PAGE_MASK was totally incorrect. It has nothing to do with
PAGE_MASK, and everything to do with the fact that the page table checking
patch was an utter failure and pure shit.

Linus
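
The silent change in comparison semantics described above can be seen in a
small stand-alone sketch (illustrative only; the two macros are local
stand-ins, not the kernel definitions):

#include <stdio.h>

#define PAGE_SIZE_UNSIGNED  (1UL << 12)  /* today: unsigned long */
#define PAGE_SIZE_SIGNED    (1L << 12)   /* the proposed signed variant */

int main(void)
{
        int x = -1;  /* e.g. an error value or attacker-controlled offset */

        /* x is converted to unsigned long: huge value, bounds check fails */
        if (x < PAGE_SIZE_UNSIGNED)
                printf("unsigned PAGE_SIZE: in range\n");
        else
                printf("unsigned PAGE_SIZE: out of range\n");

        /* both operands are signed: -1 < 4096, the bounds check now passes */
        if (x < PAGE_SIZE_SIGNED)
                printf("signed PAGE_SIZE:   in range\n");
        else
                printf("signed PAGE_SIZE:   out of range\n");

        return 0;
}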

2008-03-09 18:58:53

by Ingo Molnar

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert


* Linus Torvalds <[email protected]> wrote:

> So I would *very* strongly instead argue:
>
> - "unsigned long" is the native kernel type for all address manipulation,
> and thus "PAGE_SIZE" and "PAGE_MASK" should continue to have that type.
>
> - anything that uses any other type without explicitly making sure it's
> safe is mis-using those macros. IOW, PAGE_MASK was *never* a type that
> had anything whatsoever to do with page table entry bits, and this is
> purely a page table entry issue!
>
> So my suggested patch would:
>
> - make the page table code use a specific mask that it builds up itself,
> and makes sure it's of the right type and has the right value in
> whatever type "struct pte_entry" is. The fact that "pte_val()" is
> larger than "unsigned long" on x86-32 is very clearly a PTE issue,
> *not* an issue for PAGE_SIZE or PAGE_MASK.

yeah, indeed my patch was sloppy, i didnt think it through - i fell for
the lure of the easy-looking 'PAGE_SIZE is small, sign-extend it' hack.

Will do it cleanly and will also clean up all the pte/address/pgprot
type mixing that currently goes on in this maze of macros.

Ingo

2008-03-10 02:45:59

by Jeremy Fitzhardinge

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

Linus Torvalds wrote:
> So I would *very* strongly instead argue:
>
> - "unsigned long" is the native kernel type for all address manipulation,
> and thus "PAGE_SIZE" and "PAGE_MASK" should continue to have that type.
>
> - anything that uses any other type without explicitly making sure it's
> safe is mis-using those macros. IOW, PAGE_MASK was *never* a type that
> had anything whatsoever to do with page table entry bits, and this is
> purely a page table entry issue!
>

Yes. PTE_MASK already exists for masking the address bits out of a
pte. It should have the type pteval_t, which will deal with the PAE
case cleanly. It should also only have the middle pfn bits set, so that
_PAGE_NX is also not included.

It's currently defined in terms of PHYSICAL_PAGE_MASK, which 1) is only
used to define PTE_MASK, and 2) is defined (wrongly) in terms of PAGE_MASK
(though __PHYSICAL_MASK is correctly defined).

J
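
A rough user-space sketch of the direction described here (pteval_t and
_AT() do exist in the x86 headers, but the mask width, the helper name and
the sample pmd values below are illustrative stand-ins, not the real
definitions):

#include <stdio.h>

typedef unsigned long long pteval_t;  /* the PAE / 64-bit pte word */

#define _AT(T, X)      ((T)(X))

/* pfn bits only (36-bit PAE physical address here, purely illustrative) */
#define PTE_PFN_MASK   _AT(pteval_t, 0x0000000ffffff000ULL)
#define _PAGE_USER     _AT(pteval_t, 1 << 2)
#define _PAGE_PSE      _AT(pteval_t, 1 << 7)
#define _PAGE_NX       (_AT(pteval_t, 1) << 63)
#define _KERNPG_TABLE  _AT(pteval_t, 0x063)  /* PRESENT|RW|ACCESSED|DIRTY */

/*
 * Everything outside the pfn and the explicitly allowed flags must be
 * exactly _KERNPG_TABLE.  Every operand already has type pteval_t, so no
 * C integer promotion can silently drop or invent high bits.
 */
static int pmd_bad_sketch(pteval_t pmd)
{
        return (pmd & ~(PTE_PFN_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX))
                != _KERNPG_TABLE;
}

int main(void)
{
        pteval_t above_4g = 0x100000063ULL;  /* page table above 4GB */
        pteval_t huge_nx  = _PAGE_NX | 0x200000ULL | _PAGE_PSE | 0x063ULL;

        printf("pmd_bad(above_4g) = %d\n", pmd_bad_sketch(above_4g));  /* 0 */
        printf("pmd_bad(huge_nx)  = %d\n", pmd_bad_sketch(huge_nx));   /* 0 */
        return 0;
}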

2008-03-10 04:35:58

by Paul Mackerras

Subject: Re: bisected boot regression post 2.6.25-rc3.. please revert

Linus Torvalds writes:

> But even for PAGE_MASK, we literally have code like this:
>
> if ((size_avail & PAGE_MASK) < rg.size) {
>
> where it so _happens_ that "size_avail" is unsigned, but what if it
> wasn't? It could turn an unsigned comparison into a signed one, and
> introduce any number of security bugs etc.

We have had PAGE_MASK being signed on powerpc for ages, so if you do
find any such bugs, please let me know. :) I'm not aware of any at
present, though.

The expression you quoted will be ok as long as either size_avail or
rg.size is unsigned, as far as I can see.

Our PAGE_SIZE is unsigned long.

Paul.