2006-02-22 21:45:51

by H. Peter Anvin

Subject: sys_mmap2 on different architectures

I've looked through the code for sys_mmap2 on several architectures, and
it looks like some architectures play by the "shift is always 12" rule,
e.g. SPARC, and some expect userspace to actually obtain the page
size, e.g. PowerPC and MIPS. On some architectures, e.g. x86 and ARM,
the point is moot since PAGE_SIZE is always 2^12.

a. Is this correct, or have I misunderstood the code?

b. If so, is this right, or is this a bug? Right now both klibc and
uClibc consider the latter a bug.

c. Which architectures are affected which way?

-hpa
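To make the question concrete: mmap2 takes its file offset in units of 2^shift bytes rather than in bytes, and the disagreement is over which shift applies. The following is a minimal sketch of the userspace side, assuming a hypothetical helper name (mmap2_pgoff) and a 32-bit offset argument; it is not taken from klibc, uClibc or the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a 64-bit byte offset into the 32-bit
 * offset argument of mmap2 for a given shift (assumed < 64).
 * Returns 0 on success, -1 if the offset is not a multiple of
 * 2^shift or does not fit in 32 bits after shifting.
 */
int mmap2_pgoff(uint64_t byte_offset, unsigned shift, uint32_t *pgoff)
{
    uint64_t unit = (uint64_t)1 << shift;
    uint64_t units;

    if (byte_offset & (unit - 1))
        return -1;              /* not aligned to the unit */

    units = byte_offset >> shift;
    if (units > UINT32_MAX)
        return -1;              /* beyond what mmap2 can express */

    *pgoff = (uint32_t)units;
    return 0;
}
```

With shift 12 a byte offset of 0x10000 becomes 16; with shift 13 the same offset becomes 8, which is why the two conventions cannot be mixed.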


2006-02-22 21:54:30

by David Miller

Subject: Re: sys_mmap2 on different architectures

From: "H. Peter Anvin" <[email protected]>
Date: Wed, 22 Feb 2006 13:45:46 -0800

> I've looked through the code for sys_mmap2 on several architectures, and
> it looks like some architectures play by the "shift is always 12" rule,
> e.g. SPARC, and some expect userspace to actually obtain the page
> size, e.g. PowerPC and MIPS. On some architectures, e.g. x86 and ARM,
> the point is moot since PAGE_SIZE is always 2^12.
>
> a. Is this correct, or have I misunderstood the code?
>
> b. If so, is this right, or is this a bug? Right now both klibc and
> uClibc consider the latter a bug.
>
> c. Which architectures are affected which way?

Right.

On sparc32 we had the issue where we had a 8K page size
platform (sun4) and the rest were using 4K page size.

I can't even think why we do that fixed shift actually. I think Jakub
Jelinek thought this might be a way to make applications assuming
4K page size work on the 8K page size machines.

I'm going to say that you can feel free to fix this to use PAGE_SHIFT
correctly all the time. Applications should be calling getpagesize()
and not assume what that value might be.

Please double check that we report the correct page size to userspace
and not a fixed 4K value :-)
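As an illustration of asking the system rather than assuming 4K, here is a small sketch that derives a shift from the runtime page size. It uses sysconf(_SC_PAGESIZE), the POSIX equivalent of getpagesize(); the function name runtime_page_shift is an assumption made for this example.

```c
#include <unistd.h>

/*
 * Hypothetical helper: return log2 of the page size reported at
 * run time, or -1 if the value is unavailable or not a power of two.
 */
int runtime_page_shift(void)
{
    long size = sysconf(_SC_PAGESIZE);
    int shift = 0;

    if (size <= 0 || (size & (size - 1)))
        return -1;

    while ((1L << shift) < size)
        shift++;

    return shift;
}
```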

2006-02-22 22:00:15

by H. Peter Anvin

Subject: Re: sys_mmap2 on different architectures

David S. Miller wrote:
> Please double check that we report the correct page size to userspace
> and not a fixed 4K value :-)

I haven't found any platforms yet which don't use the AT_PAGESZ entry in
the ELF area correctly. This is obviously a Good Thing. The klibc
"getpagesize" test tests this explicitly.

-hpa
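For reference, the AT_PAGESZ entry can be read directly on a current Linux system with getauxval(), which glibc has provided since 2.16 (it did not exist when this thread was written). The sketch below is not the klibc test; the function names are made up for illustration, and it only cross-checks the auxiliary vector against sysconf().

```c
#include <sys/auxv.h>   /* getauxval(), AT_PAGESZ (Linux) */
#include <unistd.h>     /* sysconf() */

/* Page size as passed by the kernel in the ELF auxiliary vector. */
unsigned long auxv_page_size(void)
{
    return getauxval(AT_PAGESZ);
}

/* Returns 1 if AT_PAGESZ and sysconf(_SC_PAGESIZE) agree, 0 otherwise. */
int page_size_sources_agree(void)
{
    long sc = sysconf(_SC_PAGESIZE);

    return sc > 0 && auxv_page_size() == (unsigned long)sc;
}
```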

2006-02-23 00:05:44

by H. Peter Anvin

Subject: Re: sys_mmap2 on different architectures

David S. Miller wrote:
>
> Right.
>
> On sparc32 we had the issue where we had a 8K page size
> platform (sun4) and the rest were using 4K page size.
>
> I can't even think why we do that fixed shift actually. I think Jakub
> Jelinek thought this might be a way to make applications assuming
> 4K page size work on the 8K page size machines.
>
> I'm going to say that you can feel free to fix this to use PAGE_SHIFT
> correctly all the time. Applications should be calling getpagesize()
> and not assume what that value might be.
>

Okay, what I'll do is that I'll hard-code 12 on i386, SPARC and ARM; on
other architectures I'll use getpagesize(). Of course, on 64-bit
architectures this is not an issue; there I just call sys_mmap.

-hpa

2006-02-23 00:19:22

by Benjamin LaHaise

Subject: Re: sys_mmap2 on different architectures

On Wed, Feb 22, 2006 at 01:45:46PM -0800, H. Peter Anvin wrote:
> I've looked through the code for sys_mmap2 on several architectures, and
> it looks like some architectures play by the "shift is always 12" rule,
> e.g. SPARC, and some expect userspace to actually obtain the page
> size, e.g. PowerPC and MIPS. On some architectures, e.g. x86 and ARM,
> the point is moot since PAGE_SIZE is always 2^12.

The sys_mmap2() ABI is that the page shift is always fixed to whatever
page size is reasonable for the architecture, typically 2^12. The syscall
should never be exposed as mmap2(), only as the large file size version
of mmap() (aka mmap64()). The other consideration is that it should not
be implemented in 64 bit ABIs, as those machines should be using a 64 bit
native mmap(). Does that clear things up a bit? Cheers,

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-23 00:22:52

by H. Peter Anvin

Subject: Re: sys_mmap2 on different architectures

Benjamin LaHaise wrote:
>
> The sys_mmap2() ABI is that the page shift is always fixed to whatever
> page size is reasonable for the architecture, typically 2^12. The syscall
> should never be exposed as mmap2(), only as the large file size version
> of mmap() (aka mmap64()). The other consideration is that it should not
> be implemented in 64 bit ABIs, as those machines should be using a 64 bit
> native mmap(). Does that clear things up a bit? Cheers,
>

That was the theory, but it doesn't seem to be what's actually
implemented. At least on MIPS and PPC, where page size is variable (to
the best of my knowledge), the shift seems to be whatever PAGE_SHIFT the
kernel was compiled with. On SPARC, on the other hand, the theory does
appear to be what's implemented (with the fixed shift of 12.)

-hpa

2006-02-23 00:40:52

by David Miller

Subject: Re: sys_mmap2 on different architectures

From: "H. Peter Anvin" <[email protected]>
Date: Wed, 22 Feb 2006 16:05:39 -0800

> Okay, what I'll do is that I'll hard-code 12 on i386, SPARC and ARM; on
> other architectures I'll use getpagesize(). Of course, on 64-bit
> architectures this is not an issue; there I just call sys_mmap.

Please just use getpagesize(), even on sparc, that sys_mmap2() fixed
shift of 12 is a bug.

2006-02-23 00:41:27

by David Miller

Subject: Re: sys_mmap2 on different architectures

From: "H. Peter Anvin" <[email protected]>
Date: Wed, 22 Feb 2006 16:05:39 -0800

> Okay, what I'll do is that I'll hard-code 12 on i386, SPARC and ARM; on
> other architectures I'll use getpagesize(). Of course, on 64-bit
> architectures this is not an issue; there I just call sys_mmap.

Oh and BTW if you use 12 it will break when executing on a
64-bit kernel, where PAGE_SHIFT is variable and starting at
13.
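The failure mode described here can be sketched as simple arithmetic: if userspace encodes the offset with one shift and the kernel decodes it with another, the mapping lands at the wrong place in the file. The function below is purely illustrative (decoded_offset is a made-up name), and whether a given kernel really decodes with 13 is exactly what the thread is trying to establish.

```c
#include <stdint.h>

/*
 * Illustrative only: the byte offset a kernel would end up using if
 * userspace encoded byte_offset with user_shift while the kernel
 * decodes the argument with kernel_shift.
 */
uint64_t decoded_offset(uint64_t byte_offset,
                        unsigned user_shift, unsigned kernel_shift)
{
    uint64_t pgoff = byte_offset >> user_shift;   /* what userspace passes */

    return pgoff << kernel_shift;                 /* what the kernel maps */
}
```

When both sides use 12, an offset of 0x10000 round-trips unchanged; with 12 in userspace and 13 in the kernel it comes out as 0x20000.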

2006-02-23 00:43:48

by David Miller

Subject: Re: sys_mmap2 on different architectures

From: Benjamin LaHaise <[email protected]>
Date: Wed, 22 Feb 2006 19:14:11 -0500

> The sys_mmap2() ABI is that the page shift is always fixed to whatever
> page size is reasonable for the architecture, typically 2^12. The syscall
> should never be exposed as mmap2(), only as the large file size version
> of mmap() (aka mmap64()). The other consideration is that it should not
> be implemented in 64 bit ABIs, as those machines should be using a 64 bit
> native mmap(). Does that clear things up a bit? Cheers,

Aha, that part I didn't catch. Thanks for the clarification
Ben.

I wonder why we did things that way with a fixed shift...

2006-02-23 00:59:12

by H. Peter Anvin

Subject: Re: sys_mmap2 on different architectures

David S. Miller wrote:
> From: Benjamin LaHaise <[email protected]>
> Date: Wed, 22 Feb 2006 19:14:11 -0500
>
>
>>The sys_mmap2() ABI is that the page shift is always fixed to whatever
>>page size is reasonable for the architecture, typically 2^12. The syscall
>>should never be exposed as mmap2(), only as the large file size version
>>of mmap() (aka mmap64()). The other consideration is that it should not
>>be implemented in 64 bit ABIs, as those machines should be using a 64 bit
>>native mmap(). Does that clear things up a bit? Cheers,
>
>
> Aha, that part I didn't catch. Thanks for the clarification
> Ben.
>
> I wonder why we did things that way with a fixed shift...

Except the above doesn't seem to match reality on anything other than
SPARC, and the architectures where the shift is 12 anyway because that's
the only pagesize supported.

On the other hand, sys32_mmap2 on SPARC64 matches the SPARC32 sys_mmap2
in that the shift is hard-coded to 12:

	.globl	sys32_mmap2
sys32_mmap2:
	sethi	%hi(sys_mmap), %g1
	jmpl	%g1 + %lo(sys_mmap), %g0	! tail-jump to sys_mmap
	 sllx	%o5, 12, %o5			! delay slot: offset <<= 12


At this point, I'm more than willing to treat SPARC as a special case,
but I really want to know what the rules actually _ARE_ as opposed to
what they are supposed to be (which they clearly are not.)

-hpa

2006-02-23 01:03:48

by David Miller

Subject: Re: sys_mmap2 on different architectures

From: "H. Peter Anvin" <[email protected]>
Date: Wed, 22 Feb 2006 16:59:04 -0800

> On the other hand, sys32_mmap2 on SPARC64 matches the SPARC32 sys_mmap2
> in that the shift is hard-coded to 12:
>
> .globl sys32_mmap2
> sys32_mmap2:
> sethi %hi(sys_mmap), %g1
> jmpl %g1 + %lo(sys_mmap), %g0
> sllx %o5, 12, %o5

Another good catch...

> At this point, I'm more than willing to treat SPARC as a special case,
> but I really want to know what the rules actually _ARE_ as opposed to
> what they are supposed to be (which they clearly are not.)

I have to admit I'm totally stumped...

Why are you invoking mmap2() instead of mmap64() btw?

2006-02-23 01:07:08

by H. Peter Anvin

Subject: Re: sys_mmap2 on different architectures

David S. Miller wrote:
>
>>At this point, I'm more than willing to treat SPARC as a special case,
>>but I really want to know what the rules actually _ARE_ as opposed to
>>what they are supposed to be (which they clearly are not.)
>
> I have to admit I'm totally stumped...
>
> Why are you invoking mmap2() instead of mmap64() btw?

Most 32-bit architectures don't have sys_mmap64; in fact the only one
that seems to is parisc, for HPUX compatibility. I'm trying to keep the
differences between architectures as small as possible.

-hpa

2006-02-23 02:56:53

by Paul Mackerras

Subject: Re: sys_mmap2 on different architectures

H. Peter Anvin writes:

> I've looked through the code for sys_mmap2 on several architectures, and
> it looks like some architectures play by the "shift is always 12" rule,
> e.g. SPARC, and some expect userspace to actually obtain the page
> size, e.g. PowerPC and MIPS. On some architectures, e.g. x86 and ARM,
> the point is moot since PAGE_SIZE is always 2^12.
>
> a. Is this correct, or have I misunderstood the code?

PowerPC always uses 12, even if PAGE_SHIFT is 16 (i.e. for 64k
pages).

> b. If so, is this right, or is this a bug? Right now both klibc and
> uClibc consider the latter a bug.

Glibc seems to expect it to always be 12, according to this excerpt
from sysdeps/unix/sysv/linux/mmap64.c:

/* This is always 12, even on architectures where PAGE_SHIFT != 12. */
# ifndef MMAP2_PAGE_SHIFT
# define MMAP2_PAGE_SHIFT 12
# endif

I would be very reluctant to change the shift to be PAGE_SHIFT, since
that would be a change in the user/kernel ABI. Of course, userspace
is still expected to make sure addresses and offsets are multiples of
the page size (and thus the offset argument to mmap2 has to be a
multiple of 16 if the page size is 64k).

Regards,
Paul.
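The "multiple of 16" remark follows from 2^16 / 2^12 = 16: with a fixed shift of 12, one 64k page spans sixteen mmap2 offset units. A minimal sketch of that check, assuming a fixed unit of 4096 bytes, a page shift of at least 12, and a made-up function name:

```c
/*
 * Hypothetical check: with mmap2 offsets counted in 4096-byte units,
 * is pgoff aligned to a page of 2^page_shift bytes?
 * Assumes page_shift >= 12. Returns 1 if aligned, 0 otherwise.
 */
int mmap2_pgoff_aligned(unsigned long pgoff, unsigned page_shift)
{
    unsigned long units_per_page = 1UL << (page_shift - 12);

    return (pgoff & (units_per_page - 1)) == 0;
}
```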

2006-02-23 03:36:15

by H. Peter Anvin

Subject: Re: sys_mmap2 on different architectures

Paul Mackerras wrote:
>
>>I've looked through the code for sys_mmap2 on several architectures, and
>>it looks like some architectures play by the "shift is always 12" rule,
>> e.g. SPARC, and some expect userspace to actually obtain the page
>>size, e.g. PowerPC and MIPS. On some architectures, e.g. x86 and ARM,
>>the point is moot since PAGE_SIZE is always 2^12.
>>
>>a. Is this correct, or have I misunderstood the code?
>
> PowerPC always uses 12, even if PAGE_SHIFT is 16 (i.e. for 64k
> pages).
>

ACK on that. I was looking at old kernel sources (2.6.14-rc timeframe),
and I guess that one only supported 4K pages.

>>b. If so, is this right, or is this a bug? Right now both klibc and
>>uClibc consider the latter a bug.
>
> Glibc seems to expect it to always be 12, according to this excerpt
> from sysdeps/unix/sysv/linux/mmap64.c:

That's what I thought, too, but it doesn't seem to match reality.

This is what I've found so far: (64-bit architectures excluded)

arm - N/A (PAGE_SHIFT == 12)
arm26 - MMAP2_PAGE_SHIFT == 12
cris - MMAP2_PAGE_SHIFT == PAGE_SHIFT (13)
frv - MMAP2_PAGE_SHIFT == 12
h8300 - N/A (PAGE_SHIFT == 12)
i386 - N/A (PAGE_SHIFT == 12)
m32r - N/A (PAGE_SHIFT == 12)
m68k - MMAP2_PAGE_SHIFT == PAGE_SHIFT (variable)
mips - MMAP2_PAGE_SHIFT == PAGE_SHIFT (variable)
parisc - MMAP2_PAGE_SHIFT == 12
ppc - MMAP2_PAGE_SHIFT == 12
s390 - N/A (PAGE_SHIFT == 12)
sh - N/A (PAGE_SHIFT == 12)
sparc - MMAP2_PAGE_SHIFT == 12
v850 - N/A (PAGE_SHIFT == 12)
xtensa - N/A (PAGE_SHIFT == 12)

So, excluding 64-bit architectures, we have 3 architectures which expect
getpagesize() to be used, 5 which expect the constant value 12, and 8
which get the same result either way. In effect, we have a system call
with subtly different semantics across architectures, and there isn't
any clear distinction each way.

This is something I don't enjoy about Linux :-/
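One way a libc could carry the survey above is as a lookup table. The sketch below simply transcribes the list as posted; the type and function names are invented for this example, and the N/A rows are recorded as "fixed 12" because both readings give 12 there.

```c
#include <stddef.h>
#include <string.h>

enum mmap2_unit { UNIT_FIXED_12, UNIT_PAGE_SHIFT, UNIT_UNKNOWN };

struct arch_unit {
    const char *arch;
    enum mmap2_unit unit;
};

/* Transcribed from the list above (32-bit architectures only). */
static const struct arch_unit arch_units[] = {
    { "arm",    UNIT_FIXED_12 },   /* N/A: PAGE_SHIFT == 12 */
    { "arm26",  UNIT_FIXED_12 },
    { "cris",   UNIT_PAGE_SHIFT }, /* PAGE_SHIFT == 13 */
    { "frv",    UNIT_FIXED_12 },
    { "h8300",  UNIT_FIXED_12 },   /* N/A */
    { "i386",   UNIT_FIXED_12 },   /* N/A */
    { "m32r",   UNIT_FIXED_12 },   /* N/A */
    { "m68k",   UNIT_PAGE_SHIFT },
    { "mips",   UNIT_PAGE_SHIFT },
    { "parisc", UNIT_FIXED_12 },
    { "ppc",    UNIT_FIXED_12 },
    { "s390",   UNIT_FIXED_12 },   /* N/A */
    { "sh",     UNIT_FIXED_12 },   /* N/A */
    { "sparc",  UNIT_FIXED_12 },
    { "v850",   UNIT_FIXED_12 },   /* N/A */
    { "xtensa", UNIT_FIXED_12 },   /* N/A */
};

/* Look up the convention for an architecture name from the list. */
enum mmap2_unit mmap2_unit_for(const char *arch)
{
    size_t i;

    for (i = 0; i < sizeof arch_units / sizeof arch_units[0]; i++)
        if (strcmp(arch_units[i].arch, arch) == 0)
            return arch_units[i].unit;

    return UNIT_UNKNOWN;
}
```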

> /* This is always 12, even on architectures where PAGE_SHIFT != 12. */
> # ifndef MMAP2_PAGE_SHIFT
> # define MMAP2_PAGE_SHIFT 12
> # endif
>
> I would be very reluctant to change the shift to be PAGE_SHIFT, since
> that would be a change in the user/kernel ABI. Of course, userspace
> is still expected to make sure addresses and offsets are multiples of
> the page size (and thus the offset argument to mmap2 has to be a
> multiple of 16 if the page size is 64k).

Changing the user-kernel ABI is bad. I'm just trying to get to the
bottom of what the API actually *IS*.

-hpa

2006-02-23 17:33:15

by Ralf Baechle

Subject: Re: sys_mmap2 on different architectures

On Wed, Feb 22, 2006 at 07:35:50PM -0800, H. Peter Anvin wrote:

> This is what I've found so far: (64-bit architectures excluded)
>
> arm - N/A (PAGE_SHIFT == 12)
> arm26 - MMAP2_PAGE_SHIFT == 12
> cris - MMAP2_PAGE_SHIFT == PAGE_SHIFT (13)
> frv - MMAP2_PAGE_SHIFT == 12
> h8300 - N/A (PAGE_SHIFT == 12)
> i386 - N/A (PAGE_SHIFT == 12)
> m32r - N/A (PAGE_SHIFT == 12)
> m68k - MMAP2_PAGE_SHIFT == PAGE_SHIFT (variable)
> mips - MMAP2_PAGE_SHIFT == PAGE_SHIFT (variable)

A variable which happens to be fixed to 12 in practice. As explained by
Ben the API is only relevant to 32-bit kernels and afaik PAGE_SHIFT
other than 12 has only been successfully tested on 64-bit kernels.

Ralf

2006-02-23 17:43:59

by H. Peter Anvin

Subject: Re: sys_mmap2 on different architectures

Ralf Baechle wrote:
>
> A variable which happens to be fixed to 12 in practice. As explained by
> Ben the API is only relevant to 32-bit kernels and afaik PAGE_SHIFT
> other than 12 has only been successfully tested on 64-bit kernels.
>

No, that's not correct. This API is relevant to 32-bit *USERSPACE*. If
you support 32-bit userspace on a 64-bit kernel, it applies to 64-bit
kernels, too.

-hpa

2006-02-23 17:46:33

by Benjamin LaHaise

Subject: Re: sys_mmap2 on different architectures

On Wed, Feb 22, 2006 at 04:43:47PM -0800, David S. Miller wrote:
> Aha, that part I didn't catch. Thanks for the clarification
> Ben.
>
> I wonder why we did things that way with a fixed shift...

Without that trick, we'd have needed an extra parameter for the syscall
on x86, which is already at the maximum number of registers with 6
arguments. This was easier than changing the syscall ABI. =-)

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-23 17:47:32

by H. Peter Anvin

Subject: Re: sys_mmap2 on different architectures

Benjamin LaHaise wrote:
> On Wed, Feb 22, 2006 at 04:43:47PM -0800, David S. Miller wrote:
>
>>Aha, that part I didn't catch. Thanks for the clarification
>>Ben.
>>
>>I wonder why we did things that way with a fixed shift...
>
>
> Without that trick, we'd have needed an extra parameter for the syscall
> on x86, which is already at the maximum number of registers with 6
> arguments. This was easier than changing the syscall ABI. =-)
>

Well, there is always the trick of making it a pointer. It was needed
for pselect() anyway. A real sys_mmap64 would definitely have been
cleaner, and will be needed to deal with the 16 TB barrier anyway :)
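The 16 TB figure is just the reach of a 32-bit offset counted in 4096-byte units: 2^32 * 2^12 = 2^44 bytes. A tiny sketch of that arithmetic, with a made-up function name and assuming a 32-bit offset argument:

```c
#include <stdint.h>

/*
 * Number of bytes addressable through a 32-bit mmap2 offset argument
 * counted in units of 2^shift bytes (shift assumed < 32).
 */
uint64_t mmap2_addressable_bytes(unsigned shift)
{
    return ((uint64_t)1 << 32) << shift;
}
```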

I personally think the S390 people had the right idea... once you run
out of registers, just make it a defined part of the ABI that we pass in
a single pointer to all the arguments.

-hpa
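A rough sketch of what the "single pointer to all the arguments" idea could look like for a 64-bit mmap follows. The structure and the check function are hypothetical and do not correspond to an existing kernel interface; the point is only that a full 64-bit byte offset fits without any shift convention.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical argument block; not an existing kernel structure. */
struct mmap64_args {
    void    *addr;
    size_t   len;
    int      prot;
    int      flags;
    int      fd;
    uint64_t offset;    /* full byte offset, no shift needed */
};

/*
 * Hypothetical validation a receiver might perform: the length must
 * be non-zero and the offset a multiple of the page size.
 * Returns 0 if the block looks acceptable, -1 otherwise.
 */
int mmap64_args_check(const struct mmap64_args *a, unsigned page_shift)
{
    uint64_t mask = ((uint64_t)1 << page_shift) - 1;

    if (a == NULL || a->len == 0)
        return -1;

    return (a->offset & mask) ? -1 : 0;
}
```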