2012-10-19 16:48:30

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

From: Andi Kleen <[email protected]>

There was some desire in large applications using MAP_HUGETLB/SHM_HUGETLB
to use 1GB huge pages on some mappings, and stay with 2MB on others. This
is useful together with NUMA policy: use 2MB interleaving on some mappings,
but 1GB on local mappings.

This patch extends the IPC/SHM syscall interfaces slightly to allow specifying
the page size.

It borrows some upper bits in the existing flag arguments and allows encoding
the log of the desired page size in addition to the *_HUGETLB flag.
When 0 is specified the default size is used, this makes the change fully
compatible.

Extending the internal hugetlb code to handle this is straight forward. Instead
of a single mount it just keeps an array of them and selects the right
mount based on the specified page size. When no page size is specified
it uses the mount of the default page size.

The change is not visible in /proc/mounts because internal mounts
don't appear there. It also has very little overhead: the additional
mounts just consume a super block, but not more memory when not used.

I also exported the new flags to the user headers
(they were previously under __KERNEL__). Right now only symbols
for x86 and some other architecture for 1GB and 2MB are defined.
The interface should already work for all other architectures
though. Only architectures that define multiple hugetlb sizes
actually need it (that is currently x86, tile, powerpc). However
tile and powerpc have user configurable hugetlb sizes, so it's
not easy to add defines. A program on those architectures would
need to query sysfs and use the appropiate log2.

v2: Port to new tree. Fix unmount.
v3: Ported to latest tree.
v4: Ported to latest tree. Minor changes for review feedback. Updated
description.
v5: Remove unnecessary prototypes to fix merge error (Hillf Danton)
v6: Rebased. Fix some unlikely error paths (Hillf Danton)
Acked-by: Rik van Riel <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/mman.h | 3 ++
fs/hugetlbfs/inode.c | 66 ++++++++++++++++++++++++++++++--------
include/linux/hugetlb.h | 7 +++-
include/linux/shm.h | 19 +++++++++++
include/uapi/asm-generic/mman.h | 13 ++++++++
ipc/shm.c | 3 +-
mm/mmap.c | 5 ++-
7 files changed, 97 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/mman.h b/arch/x86/include/asm/mman.h
index 593e51d..513b05f 100644
--- a/arch/x86/include/asm/mman.h
+++ b/arch/x86/include/asm/mman.h
@@ -3,6 +3,9 @@

#define MAP_32BIT 0x40 /* only give out 32bit addresses */

+#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
+#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
+
#include <asm-generic/mman.h>

#endif /* _ASM_X86_MMAN_H */
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index c5bc355..f332275 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -923,7 +923,7 @@ static struct file_system_type hugetlbfs_fs_type = {
.kill_sb = kill_litter_super,
};

-static struct vfsmount *hugetlbfs_vfsmount;
+static struct vfsmount *hugetlbfs_vfsmount[HUGE_MAX_HSTATE];

static int can_do_hugetlb_shm(void)
{
@@ -932,9 +932,22 @@ static int can_do_hugetlb_shm(void)
return capable(CAP_IPC_LOCK) || in_group_p(shm_group);
}

+static int get_hstate_idx(int page_size_log)
+{
+ struct hstate *h;
+
+ if (!page_size_log)
+ return default_hstate_idx;
+ h = size_to_hstate(1 << page_size_log);
+ if (!h)
+ return -1;
+ return h - hstates;
+}
+
struct file *hugetlb_file_setup(const char *name, unsigned long addr,
size_t size, vm_flags_t acctflag,
- struct user_struct **user, int creat_flags)
+ struct user_struct **user,
+ int creat_flags, int page_size_log)
{
int error = -ENOMEM;
struct file *file;
@@ -944,9 +957,14 @@ struct file *hugetlb_file_setup(const char *name, unsigned long addr,
struct qstr quick_string;
struct hstate *hstate;
unsigned long num_pages;
+ int hstate_idx;
+
+ hstate_idx = get_hstate_idx(page_size_log);
+ if (hstate_idx < 0)
+ return ERR_PTR(-ENODEV);

*user = NULL;
- if (!hugetlbfs_vfsmount)
+ if (!hugetlbfs_vfsmount[hstate_idx])
return ERR_PTR(-ENOENT);

if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
@@ -963,7 +981,7 @@ struct file *hugetlb_file_setup(const char *name, unsigned long addr,
}
}

- root = hugetlbfs_vfsmount->mnt_root;
+ root = hugetlbfs_vfsmount[hstate_idx]->mnt_root;
quick_string.name = name;
quick_string.len = strlen(quick_string.name);
quick_string.hash = 0;
@@ -971,7 +989,7 @@ struct file *hugetlb_file_setup(const char *name, unsigned long addr,
if (!path.dentry)
goto out_shm_unlock;

- path.mnt = mntget(hugetlbfs_vfsmount);
+ path.mnt = mntget(hugetlbfs_vfsmount[hstate_idx]);
error = -ENOSPC;
inode = hugetlbfs_get_inode(root->d_sb, NULL, S_IFREG | S_IRWXUGO, 0);
if (!inode)
@@ -1011,8 +1029,9 @@ out_shm_unlock:

static int __init init_hugetlbfs_fs(void)
{
+ struct hstate *h;
int error;
- struct vfsmount *vfsmount;
+ int i;

error = bdi_init(&hugetlbfs_backing_dev_info);
if (error)
@@ -1029,14 +1048,27 @@ static int __init init_hugetlbfs_fs(void)
if (error)
goto out;

- vfsmount = kern_mount(&hugetlbfs_fs_type);
-
- if (!IS_ERR(vfsmount)) {
- hugetlbfs_vfsmount = vfsmount;
- return 0;
+ i = 0;
+ for_each_hstate (h) {
+ char buf[50];
+ unsigned ps_kb = 1U << (h->order + PAGE_SHIFT - 10);
+
+ snprintf(buf, sizeof buf, "pagesize=%uK", ps_kb);
+ hugetlbfs_vfsmount[i] = kern_mount_data(&hugetlbfs_fs_type,
+ buf);
+
+ if (IS_ERR(hugetlbfs_vfsmount[i])) {
+ pr_err(
+ "hugetlb: Cannot mount internal hugetlbfs for page size %uK",
+ ps_kb);
+ error = PTR_ERR(hugetlbfs_vfsmount[i]);
+ hugetlbfs_vfsmount[i] = NULL;
+ }
+ i++;
}
-
- error = PTR_ERR(vfsmount);
+ /* Non default hstates are optional */
+ if (!IS_ERR_OR_NULL(hugetlbfs_vfsmount[default_hstate_idx]))
+ return 0;

out:
kmem_cache_destroy(hugetlbfs_inode_cachep);
@@ -1047,13 +1079,19 @@ static int __init init_hugetlbfs_fs(void)

static void __exit exit_hugetlbfs_fs(void)
{
+ struct hstate *h;
+ int i;
+
+
/*
* Make sure all delayed rcu free inodes are flushed before we
* destroy cache.
*/
rcu_barrier();
kmem_cache_destroy(hugetlbfs_inode_cachep);
- kern_unmount(hugetlbfs_vfsmount);
+ i = 0;
+ for_each_hstate (h)
+ kern_unmount(hugetlbfs_vfsmount[i++]);
unregister_filesystem(&hugetlbfs_fs_type);
bdi_destroy(&hugetlbfs_backing_dev_info);
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2251648..3e7fa1a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -183,7 +183,8 @@ extern const struct file_operations hugetlbfs_file_operations;
extern const struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_file_setup(const char *name, unsigned long addr,
size_t size, vm_flags_t acct,
- struct user_struct **user, int creat_flags);
+ struct user_struct **user, int creat_flags,
+ int page_size_log);

static inline int is_file_hugepages(struct file *file)
{
@@ -195,12 +196,14 @@ static inline int is_file_hugepages(struct file *file)
return 0;
}

+
#else /* !CONFIG_HUGETLBFS */

#define is_file_hugepages(file) 0
static inline struct file *
hugetlb_file_setup(const char *name, unsigned long addr, size_t size,
- vm_flags_t acctflag, struct user_struct **user, int creat_flags)
+ vm_flags_t acctflag, struct user_struct **user, int creat_flags,
+ int page_size_log)
{
return ERR_PTR(-ENOSYS);
}
diff --git a/include/linux/shm.h b/include/linux/shm.h
index bcf8a6a..7bbab55 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -23,12 +23,31 @@ struct shmid_kernel /* private to the kernel */
struct task_struct *shm_creator;
};

+#endif
+
/* shm_mode upper byte flags */
#define SHM_DEST 01000 /* segment will be destroyed on last detach */
#define SHM_LOCKED 02000 /* segment will not be swapped */
#define SHM_HUGETLB 04000 /* segment will use huge TLB pages */
#define SHM_NORESERVE 010000 /* don't check for reservations */

+/* Bits [26:31] are reserved */
+
+/*
+ * When SHM_HUGETLB is set bits [26:31] encode the log2 of the huge page size.
+ * This gives us 6 bits, which is enough until someone invents 128 bit address
+ * spaces.
+ *
+ * Assume these are all power of twos.
+ * When 0 use the default page size.
+ */
+#define SHM_HUGE_SHIFT 26
+#define SHM_HUGE_MASK 0x3f
+#define SHM_HUGE_2MB (21 << SHM_HUGE_SHIFT)
+#define SHM_HUGE_1GB (30 << SHM_HUGE_SHIFT)
+
+#ifdef __KERNEL__
+
#ifdef CONFIG_SYSVIPC
long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr,
unsigned long shmlba);
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 32c8bd6..d2f35d8 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -13,6 +13,19 @@
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */

+/* Bits [26:31] are reserved */
+
+/*
+ * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size.
+ * This gives us 6 bits, which is enough until someone invents 128 bit address
+ * spaces.
+ *
+ * Assume these are all power of twos.
+ * When 0 use the default page size.
+ */
+#define MAP_HUGE_SHIFT 26
+#define MAP_HUGE_MASK 0x3f
+
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */

diff --git a/ipc/shm.c b/ipc/shm.c
index dff40c9..4fa6d8f 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -495,7 +495,8 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
if (shmflg & SHM_NORESERVE)
acctflag = VM_NORESERVE;
file = hugetlb_file_setup(name, 0, size, acctflag,
- &shp->mlock_user, HUGETLB_SHMFS_INODE);
+ &shp->mlock_user, HUGETLB_SHMFS_INODE,
+ (shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
} else {
/*
* Do not allow no accounting for OVERCOMMIT_NEVER, even
diff --git a/mm/mmap.c b/mm/mmap.c
index 2d94235..04e990b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1151,8 +1151,9 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
* memory so no accounting is necessary
*/
file = hugetlb_file_setup(HUGETLB_ANON_FILE, addr, len,
- VM_NORESERVE, &user,
- HUGETLB_ANONHUGE_INODE);
+ VM_NORESERVE,
+ &user, HUGETLB_ANONHUGE_INODE,
+ (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
if (IS_ERR(file))
return PTR_ERR(file);
}
--
1.7.7.6


2012-10-20 03:39:54

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

On Sat, Oct 20, 2012 at 12:48 AM, Andi Kleen <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
>
> There was some desire in large applications using MAP_HUGETLB/SHM_HUGETLB
> to use 1GB huge pages on some mappings, and stay with 2MB on others. This
> is useful together with NUMA policy: use 2MB interleaving on some mappings,
> but 1GB on local mappings.
>
> This patch extends the IPC/SHM syscall interfaces slightly to allow specifying
> the page size.
>
> It borrows some upper bits in the existing flag arguments and allows encoding
> the log of the desired page size in addition to the *_HUGETLB flag.
> When 0 is specified the default size is used, this makes the change fully
> compatible.
>
> Extending the internal hugetlb code to handle this is straight forward. Instead
> of a single mount it just keeps an array of them and selects the right
> mount based on the specified page size. When no page size is specified
> it uses the mount of the default page size.
>
> The change is not visible in /proc/mounts because internal mounts
> don't appear there. It also has very little overhead: the additional
> mounts just consume a super block, but not more memory when not used.
>
> I also exported the new flags to the user headers
> (they were previously under __KERNEL__). Right now only symbols
> for x86 and some other architecture for 1GB and 2MB are defined.
> The interface should already work for all other architectures
> though. Only architectures that define multiple hugetlb sizes
> actually need it (that is currently x86, tile, powerpc). However
> tile and powerpc have user configurable hugetlb sizes, so it's
> not easy to add defines. A program on those architectures would
> need to query sysfs and use the appropiate log2.
>
> v2: Port to new tree. Fix unmount.
> v3: Ported to latest tree.
> v4: Ported to latest tree. Minor changes for review feedback. Updated
> description.
> v5: Remove unnecessary prototypes to fix merge error (Hillf Danton)
> v6: Rebased. Fix some unlikely error paths (Hillf Danton)
> Acked-by: Rik van Riel <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Hillf Danton <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>
> ---

Thanks:)

Acked-by: Hillf Danton <[email protected]>

Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

Andi,

> +#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
> +#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
> +#define SHM_HUGE_SHIFT 26
> +#define SHM_HUGE_MASK 0x3f
> +#define SHM_HUGE_2MB (21 << SHM_HUGE_SHIFT)
> +#define SHM_HUGE_1GB (30 << SHM_HUGE_SHIFT)

Maybe I am missing something obvious, but does this not conflict with
include/uapi/asm-generic/mman-common.h:

#ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
# define MAP_UNINITIALIZED 0x4000000
...

0x4000000 == (1 << 26)
?

Thanks,

Michael

--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

2012-10-22 13:27:41

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

> Maybe I am missing something obvious, but does this not conflict with
> include/uapi/asm-generic/mman-common.h:
>
> #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
> # define MAP_UNINITIALIZED 0x4000000
> ...
>
> 0x4000000 == (1 << 26
>

You're right. Someone added that since I wrote the patch originally.
I owned them when originally submitted @) Thanks for catching.

Have to move my bits two up, which will still work, but limit the
maximum page size slightly more, but will likely still work for most
people.

I'll send a new patch.

BTW seriously MAP_UNINITIALIZED? Who came up with that?
MAP_COMPLETELY_INSECURE or MAP_INSANE would have been more appropiate.
Maybe I should list as an advantage that my patch will prohibit
too many further such "innovations".

-Andi
--
[email protected] -- Speaking for myself only.

2012-10-22 13:35:37

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

On Mon, Oct 22, 2012 at 03:27:33PM +0200, Andi Kleen wrote:
> > Maybe I am missing something obvious, but does this not conflict with
> > include/uapi/asm-generic/mman-common.h:
> >
> > #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
> > # define MAP_UNINITIALIZED 0x4000000
> > ...
> >
> > 0x4000000 == (1 << 26
> >
>
> You're right. Someone added that since I wrote the patch originally.
> I owned them when originally submitted @) Thanks for catching.
>
> Have to move my bits two up, which will still work, but limit the

Two up won't work, need one up.

32..28 = 16 is too small for 2^30 = 1GB pages
32..27 = 32 max 4GB pages

So this will use up all remaining flag bits now.

-Andi

Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

On Mon, Oct 22, 2012 at 3:35 PM, Andi Kleen <[email protected]> wrote:
> On Mon, Oct 22, 2012 at 03:27:33PM +0200, Andi Kleen wrote:
>> > Maybe I am missing something obvious, but does this not conflict with
>> > include/uapi/asm-generic/mman-common.h:
>> >
>> > #ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
>> > # define MAP_UNINITIALIZED 0x4000000
>> > ...
>> >
>> > 0x4000000 == (1 << 26
>> >
>>
>> You're right. Someone added that since I wrote the patch originally.
>> I owned them when originally submitted @) Thanks for catching.
>>
>> Have to move my bits two up, which will still work, but limit the
>
> Two up won't work, need one up.
>
> 32..28 = 16 is too small for 2^30 = 1GB pages
> 32..27 = 32 max 4GB pages

Not sure of your notation there. I assume 31..27 means 5 bits (32
through to 28 inclusive, 27 excluded). That gives you just 2^31 ==
2GB, not 4GB (unless your planning to always add 1 to the value in
those bits, since a value of zero has little meaning).

But there seems an obvious solution here: given your value in those
bits (call it 'n'), the why not apply a multiplier. I mean, certainly
you never want a value <= 12 for n, and I suspect that the reasonable
minimum could be much larger (e.g., 2^16). Call that minimum M. Then
you could interpret the value in your bits as meaning a page size of

(2^n) * M

> So this will use up all remaining flag bits now.

On the other hand, that seems really bad. It looks like that kills the
ability to further extend the mmap() API with new flags in the future.
It doesn't sound like we should be doing that.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

2012-10-22 15:36:36

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

> Not sure of your notation there. I assume 31..27 means 5 bits (32
> through to 28 inclusive, 27 excluded). That gives you just 2^31 ==

[27...31]

You're right it's only 5 bits, so just 2GB.

Thinking about it more PowerPC has a 16GB page, so we probably
need to move this to prot.

However I'm not sure if any architectures use let's say the high
8 bits of prot.

>
> But there seems an obvious solution here: given your value in those
> bits (call it 'n'), the why not apply a multiplier. I mean, certainly
> you never want a value <= 12 for n, and I suspect that the reasonable
> minimum could be much larger (e.g., 2^16). Call that minimum M. Then
> you could interpret the value in your bits as meaning a page size of
>
> (2^n) * M

I considered that, but it would seem ugly and does not add that
many bits.

>
> > So this will use up all remaining flag bits now.
>
> On the other hand, that seems really bad. It looks like that kills the
> ability to further extend the mmap() API with new flags in the future.
> It doesn't sound like we should be doing that.

You can always add flags to PROT or add a mmap3(). Has been done before.
Or just don't do any new MAP_SECURITY_HOLEs

-Andi

--
[email protected] -- Speaking for myself only

Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

On Mon, Oct 22, 2012 at 5:36 PM, Andi Kleen <[email protected]> wrote:
>> Not sure of your notation there. I assume 31..27 means 5 bits (32
>> through to 28 inclusive, 27 excluded). That gives you just 2^31 ==
> [27...31]

(Hmm -- sleeping as I wrote that, but I at least got the end point right.)

> You're right it's only 5 bits, so just 2GB.
>
> Thinking about it more PowerPC has a 16GB page, so we probably
> need to move this to prot.
>
> However I'm not sure if any architectures use let's say the high
> 8 bits of prot.

This is all seems to make an awful muck of the API...

I'm not sure whether anything is using the high 8 bits of prot, bun
passing I note that there seems to be no check that the unused bits
are zeroed so there's a small chance existing apps are passing random
garbage there. (Of course, mmap() is hardly the only API to have that
fault, and it hasn't stopped us from reusing bits in those APIs,
though sometimes we've gotten bitten by apps that did pass in random
garbage).

>> But there seems an obvious solution here: given your value in those
>> bits (call it 'n'), the why not apply a multiplier. I mean, certainly
>> you never want a value <= 12 for n, and I suspect that the reasonable
>> minimum could be much larger (e.g., 2^16). Call that minimum M. Then
>> you could interpret the value in your bits as meaning a page size of
>>
>> (2^n) * M
>
> I considered that, but it would seem ugly and does not add that
> many bits.
>
>>
>> > So this will use up all remaining flag bits now.
>>
>> On the other hand, that seems really bad. It looks like that kills the
>> ability to further extend the mmap() API with new flags in the future.
>> It doesn't sound like we should be doing that.
>
> You can always add flags to PROT or add a mmap3(). Has been done before.
> Or just don't do any new MAP_SECURITY_HOLEs

There seems to be a reasonable argument here for an mmap3() with a
64-bit flags argument...

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

2012-10-22 16:11:58

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

On Mon, Oct 22, 2012 at 05:53:45PM +0200, Michael Kerrisk (man-pages) wrote:
> I'm not sure whether anything is using the high 8 bits of prot, bun
> passing I note that there seems to be no check that the unused bits
> are zeroed so there's a small chance existing apps are passing random
> garbage there. (Of course, mmap() is hardly the only API to have that
> fault, and it hasn't stopped us from reusing bits in those APIs,
> though sometimes we've gotten bitten by apps that did pass in random
> garbage).

Ok.

>
> >> But there seems an obvious solution here: given your value in those
> >> bits (call it 'n'), the why not apply a multiplier. I mean, certainly
> >> you never want a value <= 12 for n, and I suspect that the reasonable
> >> minimum could be much larger (e.g., 2^16). Call that minimum M. Then
> >> you could interpret the value in your bits as meaning a page size of
> >>
> >> (2^n) * M
> >
> > I considered that, but it would seem ugly and does not add that
> > many bits.
> >
> >>
> >> > So this will use up all remaining flag bits now.
> >>
> >> On the other hand, that seems really bad. It looks like that kills the
> >> ability to further extend the mmap() API with new flags in the future.
> >> It doesn't sound like we should be doing that.
> >
> > You can always add flags to PROT or add a mmap3(). Has been done before.
> > Or just don't do any new MAP_SECURITY_HOLEs
>
> There seems to be a reasonable argument here for an mmap3() with a
> 64-bit flags argument...

It's just a pain to deploy.

I think I would rather do the offset then. That could still handle
PowerPC

14 + 31 = 44 = 16GB (minimum size 16K)

-Andi
--
[email protected] -- Speaking for myself only.

Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

>> >> But there seems an obvious solution here: given your value in those
>> >> bits (call it 'n'), the why not apply a multiplier. I mean, certainly
>> >> you never want a value <= 12 for n, and I suspect that the reasonable
>> >> minimum could be much larger (e.g., 2^16). Call that minimum M. Then
>> >> you could interpret the value in your bits as meaning a page size of
>> >>
>> >> (2^n) * M
>> >
>> > I considered that, but it would seem ugly and does not add that
>> > many bits.
>> >
>> >>
>> >> > So this will use up all remaining flag bits now.
>> >>
>> >> On the other hand, that seems really bad. It looks like that kills the
>> >> ability to further extend the mmap() API with new flags in the future.
>> >> It doesn't sound like we should be doing that.
>> >
>> > You can always add flags to PROT or add a mmap3(). Has been done before.
>> > Or just don't do any new MAP_SECURITY_HOLEs
>>
>> There seems to be a reasonable argument here for an mmap3() with a
>> 64-bit flags argument...
>
> It's just a pain to deploy.
>
> I think I would rather do the offset then. That could still handle
> PowerPC
>
> 14 + 31 = 44 = 16GB (minimum size 16K)

Since PowerPC already allows 16GB page sizes, doesn't there need to be
allowance for the possibility of future expansion? Choosing a larger
minimum size (like 2^16) would allow that. Does the minimum size need
to be 16k? (Surely, if you want a HUGEPAGE, you want a bigger page
than that? I am not sure.)

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

On Mon, Oct 22, 2012 at 6:29 PM, Andi Kleen <[email protected]> wrote:
>> Since PowerPC already allows 16GB page sizes, doesn't there need to be
>> allowance for the possibility of future expansion? Choosing a larger
>> minimum size (like 2^16) would allow that. Does the minimum size need
>> to be 16k? (Surely, if you want a HUGEPAGE, you want a bigger page
>> than that? I am not sure.)
>
> Some architectures have configurable huge page sizes, so it depends on
> the user. I thought 16K is reasonable. Can make it larger too.
>
> But I personally consider even 16GB pages somewhat too big.

I do not know the answer course ;-). Just thought that it was worth
emphasizing that some system already allows the upper limit you
propose. It seems inevitable that some other system will allow
something even bigger.

Anyway, I got distracted from my earlier more important point. This
proposed change will chew up most (all?) of the remaining bit-space in
'flags'. This seems like a mistake from a future extensibility point
of view... It sounds a lot like you'll force someone else to write and
deploy mmap3()...

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

2012-10-22 16:59:43

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

> Since PowerPC already allows 16GB page sizes, doesn't there need to be
> allowance for the possibility of future expansion? Choosing a larger
> minimum size (like 2^16) would allow that. Does the minimum size need
> to be 16k? (Surely, if you want a HUGEPAGE, you want a bigger page
> than that? I am not sure.)

Some architectures have configurable huge page sizes, so it depends on
the user. I thought 16K is reasonable. Can make it larger too.

But I personally consider even 16GB pages somewhat too big.

-Andi
--
[email protected] -- Speaking for myself only

2012-10-22 21:39:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

On Mon, 22 Oct 2012 15:27:33 +0200
Andi Kleen <[email protected]> wrote:

> BTW seriously MAP_UNINITIALIZED? Who came up with that?
> MAP_COMPLETELY_INSECURE or MAP_INSANE would have been more appropiate.

heh. It's a NOMMU-only thing.


config MMAP_ALLOW_UNINITIALIZED
bool "Allow mmapped anonymous memory to be uninitialized"
depends on EXPERT && !MMU
default n
help
Normally, and according to the Linux spec, anonymous memory obtained
from mmap() has it's contents cleared before it is passed to
userspace. Enabling this config option allows you to request that
mmap() skip that if it is given an MAP_UNINITIALIZED flag, thus
providing a huge performance boost. If this option is not enabled,
then the flag will be ignored.

This is taken advantage of by uClibc's malloc(), and also by
ELF-FDPIC binfmt's brk and stack allocator.

Because of the obvious security issues, this option should only be
enabled on embedded devices where you control what is run in
userspace. Since that isn't generally a problem on no-MMU systems,
it is normally safe to say Y here.

See Documentation/nommu-mmap.txt for more information.

2012-10-23 01:44:54

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

On Mon, 2012-10-22 at 17:53 +0200, Michael Kerrisk (man-pages) wrote:

> This is all seems to make an awful muck of the API...

.../...

> There seems to be a reasonable argument here for an mmap3() with a
> 64-bit flags argument...

I tend to agree. There's a similar issue happening when we try to shovel
things into protection bits, like we do with SAO (strong access
ordering) and want to do with per-page endian on embedded.

It think we need a pile of additional flag space.

Cheers,
Ben.

2012-10-23 02:28:48

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

On Tue, Oct 23, 2012 at 12:44:24PM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2012-10-22 at 17:53 +0200, Michael Kerrisk (man-pages) wrote:
>
> > This is all seems to make an awful muck of the API...
>
> .../...
>
> > There seems to be a reasonable argument here for an mmap3() with a
> > 64-bit flags argument...
>
> I tend to agree. There's a similar issue happening when we try to shovel

Could you comment on the expect range of page sizes on PPC?

I looked at this again and I don't think we have anywhere near true 28 flags
so far. The man page currently only lists 16 (including MAP_UNUS^INITIALIZED)

So I don't see why I can't have 6 bits from that.

I have no idea why the MAP_UNINITIALIZED flag was put into this strange
location anyways instead of directly after the existing flags or just
into one of the unused slots.

I suppose I could put my bits before it, there's plenty of space.

Existing flags on x86:

#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */

4 unused
8 unused

#define MAP_FIXED 0x10 /* Interpret addr exactly */
#define MAP_ANONYMOUS 0x20 /* don't use a file */

0x40 unused

#define MAP_GROWSDOWN 0x0100 /* stack-like segment */

0x200 unused
0x400 unused

#define MAP_DENYWRITE 0x0800 /* ETXTBSY */
#define MAP_EXECUTABLE 0x1000 /* mark it as an executable */
#define MAP_LOCKED 0x2000 /* pages are locked */
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */

/* all free here: 6 bits for me? 0x80000..0x1000000 */

# define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be uninitialized */

/* more free bits. */

Overall it seems there's no real shortage of bits.

> things into protection bits, like we do with SAO (strong access
> ordering) and want to do with per-page endian on embedded.

mprotect already does this.

Unless someone finds a good reason why this can't work I'll just move
the range to 0x80000..0x1000000.

-Andi
--
[email protected] -- Speaking for myself only

2012-10-23 03:01:52

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

On Mon, 2012-10-22 at 18:23 +0200, Michael Kerrisk (man-pages) wrote:
> Since PowerPC already allows 16GB page sizes, doesn't there need to be
> allowance for the possibility of future expansion? Choosing a larger
> minimum size (like 2^16) would allow that. Does the minimum size need
> to be 16k? (Surely, if you want a HUGEPAGE, you want a bigger page
> than that? I am not sure.)

I can't say for sure what we're going to come up with in 5 years but so
far, I am not aware of plans to go beyond 16G just yet :-)

Cheers,
Ben.

Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

On Tue, Oct 23, 2012 at 4:28 AM, Andi Kleen <[email protected]> wrote:
> On Tue, Oct 23, 2012 at 12:44:24PM +1100, Benjamin Herrenschmidt wrote:
>> On Mon, 2012-10-22 at 17:53 +0200, Michael Kerrisk (man-pages) wrote:
>>
>> > This is all seems to make an awful muck of the API...
>>
>> .../...
>>
>> > There seems to be a reasonable argument here for an mmap3() with a
>> > 64-bit flags argument...
>>
>> I tend to agree. There's a similar issue happening when we try to shovel
>
> Could you comment on the expect range of page sizes on PPC?
>
> I looked at this again and I don't think we have anywhere near true 28 flags
> so far. The man page currently only lists 16 (including MAP_UNUS^INITIALIZED)

As we know, man-pages are seldom complete ;:-}

> So I don't see why I can't have 6 bits from that.
>
> I have no idea why the MAP_UNINITIALIZED flag was put into this strange
> location anyways instead of directly after the existing flags or just
> into one of the unused slots.

The reason why you perhaps can't have six bits from that is quite
likely the same as the why MAP_UNINITIALIZED went to a strange place.
It's the unfortunate smearing of individual MAP_* values across
different bits on different architectures:

$ grep 'MAP_' $(find /home/mtk/linux-3.7-rc1 | grep mman) | awk -F':'
'{print $2}' |
grep '[0-9]' | grep -v MAP_TYPE | grep ' MAP_' | grep define |
sed 's/ *# *define *//' | sed 's/[ ]*\/\*.*//' | sort -k1 |
awk '$2 != "0" && $2 != "0x0"' | tr '\011' ' '| sed 's/ */ /;
s/0x0*/0x/' | sort -u
MAP_32BIT 0x40
MAP_ANONYMOUS 0x10
MAP_ANONYMOUS 0x20
MAP_ANONYMOUS 0x800
MAP_AUTOGROW 0x40
MAP_AUTORSRV 0x100
MAP_DENYWRITE 0x2000
MAP_DENYWRITE 0x800
MAP_EXECUTABLE 0x1000
MAP_EXECUTABLE 0x4000
MAP_FIXED 0x10
MAP_FIXED 0x100
MAP_FIXED 0x4
MAP_GROWSDOWN 0x100
MAP_GROWSDOWN 0x1000
MAP_GROWSDOWN 0x200
MAP_GROWSDOWN 0x8000
MAP_GROWSUP 0x200
MAP_HUGETLB 0x100000
MAP_HUGETLB 0x4000
MAP_HUGETLB 0x40000
MAP_HUGETLB 0x80000
MAP_INHERIT 0x80
MAP_LOCAL 0x80
MAP_LOCKED 0x100
MAP_LOCKED 0x200
MAP_LOCKED 0x2000
MAP_LOCKED 0x80
MAP_LOCKED 0x8000
MAP_NONBLOCK 0x10000
MAP_NONBLOCK 0x20000
MAP_NONBLOCK 0x40000
MAP_NONBLOCK 0x80
MAP_NORESERVE 0x10000
MAP_NORESERVE 0x40
MAP_NORESERVE 0x400
MAP_NORESERVE 0x4000
MAP_POPULATE 0x10000
MAP_POPULATE 0x20000
MAP_POPULATE 0x40
MAP_POPULATE 0x8000
MAP_PRIVATE 0x2
MAP_RENAME 0x20
MAP_SHARED 0x1
MAP_STACK 0x20000
MAP_STACK 0x40000
MAP_STACK 0x80000
MAP_UNINITIALIZED 0x4000000

> I suppose I could put my bits before it, there's plenty of space.

Only on x86...

> Existing flags on x86:
>
> #define MAP_SHARED 0x01 /* Share changes */
> #define MAP_PRIVATE 0x02 /* Changes are private */
>
> 4 unused
> 8 unused
>
> #define MAP_FIXED 0x10 /* Interpret addr exactly */
> #define MAP_ANONYMOUS 0x20 /* don't use a file */
>
> 0x40 unused
>
> #define MAP_GROWSDOWN 0x0100 /* stack-like segment */
>
> 0x200 unused
> 0x400 unused
>
> #define MAP_DENYWRITE 0x0800 /* ETXTBSY */
> #define MAP_EXECUTABLE 0x1000 /* mark it as an executable */
> #define MAP_LOCKED 0x2000 /* pages are locked */
> #define MAP_NORESERVE 0x4000 /* don't check for reservations */
> #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
> #define MAP_NONBLOCK 0x10000 /* do not block on IO */
> #define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
> #define MAP_HUGETLB 0x40000 /* create a huge page mapping */
>
> /* all free here: 6 bits for me? 0x80000..0x1000000 */
>
> # define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be uninitialized */
>
> /* more free bits. */
>
> Overall it seems there's no real shortage of bits.

Across architectures, there is, unless you plan to just further
increase the mess...

Out of the mess shown above, the free bits across all architectures look to be
0xfbe00008 === 11111011111000000000000000001000
(Note: no 6 adjacent bits.)

Now, my scripting above may not have captured all of the bits, and
some of the MAP_* constants may actually be completely unused, so the
above may not be completely accurate, so don't rely on it. Anyway, the
point is, there are not so many spare bits, really.

>> things into protection bits, like we do with SAO (strong access
>> ordering) and want to do with per-page endian on embedded.
>
> mprotect already does this.
>
> Unless someone finds a good reason why this can't work I'll just move
> the range to 0x80000..0x1000000.

IMO, the output of the script above is a plausible reason not to do this.

The mmap2() API has become a crufty mess. *Maybe* you can do what you
want, at the cost of further extending the cruft (i.e., by finding
different groups of 6 bits on different architectures). But at some
point the mess is going to be bad enough that someone will need to do
mmap3(), and consuming 6 bits of the bit-space is bringing us a big
step closer to that point.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

2012-10-23 13:39:54

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6


I thought further about this and I think the whole issue is a non issue
anyways: MAP_UNINITIALIZED is NOMMU only and HUGETLBFS is MMU only.
My flags only make sense with HUGETLBFS.

So they can never coexist anyways. So there is no reason to not overlap.

So I think the original patch is ok and doesn't need any change at all.

Sorry guys, if you really want to invent new syscalls please do it
yourself.

-Andi

2012-10-23 22:58:27

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH] MM: Support more pagesizes for MAP_HUGETLB/SHM_HUGETLB v6

On Mon, 22 Oct 2012 08:36:33 -0700, Andi Kleen said:

> Thinking about it more PowerPC has a 16GB page, so we probably
> need to move this to prot.

Gaak - is that a typo? If not, what is the use case - allowing a small number of
pages to cover all memory, with big wins on TLB hit ratios? I certainly can't see
it as a sane primitive for a virtual memory design that does paging/swapping?


Attachments:
(No filename) (865.00 B)