2013-06-18 20:17:12

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 0/2] hugetlb fixes

As everyone knows, hugetlbfs sucks. But it is also necessary for
large memory machines, so we should make it suck less. Top of my list
were lack of rss accounting and refusing mmap with MAP_HUGETLB when
using hugetlbfs. The latter generally created a know in every brain I
explained this to.

Test program below is failing before these two patches and passing
after.

Joern Engel (2):
hugetlb: properly account rss
mmap: allow MAP_HUGETLB for hugetlbfs files

mm/hugetlb.c | 4 ++++
mm/mmap.c | 12 ++++++++++--
2 files changed, 14 insertions(+), 2 deletions(-)

--
1.7.10.4

#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

typedef unsigned long long u64;

static size_t length = 1 << 24;

static u64 read_rss(void)
{
char buf[4096], *s = buf;
int i, fd;
u64 rss;

fd = open("/proc/self/statm", O_RDONLY);
assert(fd > 2);
memset(buf, 0, sizeof(buf));
read(fd, buf, sizeof(buf) - 1);
for (i = 0; i < 1; i++)
s = strchr(s, ' ') + 1;
rss = strtoull(s, NULL, 10);
return rss << 12; /* assumes 4k pagesize */
}

static void do_mmap(int fd, int extra_flags, int unmap)
{
int *p;
int flags = MAP_PRIVATE | MAP_POPULATE | extra_flags;
u64 before, after;

before = read_rss();
p = mmap(NULL, length, PROT_READ | PROT_WRITE, flags, fd, 0);
assert(p != MAP_FAILED ||
!"mmap returned an unexpected error");
after = read_rss();
assert(llabs(after - before - length) < 0x40000 ||
!"rss didn't grow as expected");
if (!unmap)
return;
munmap(p, length);
after = read_rss();
assert(llabs(after - before) < 0x40000 ||
!"rss didn't shrink as expected");
}

static int open_file(const char *path)
{
int fd, err;

unlink(path);
fd = open(path, O_CREAT | O_RDWR | O_TRUNC | O_EXCL
| O_LARGEFILE | O_CLOEXEC, 0600);
assert(fd > 2);
unlink(path);
err = ftruncate(fd, length);
assert(!err);
return fd;
}

int main(void)
{
int hugefd, fd;

fd = open_file("/dev/shm/hugetlbhog");
hugefd = open_file("/hugepages/hugetlbhog");

system("echo 100 > /proc/sys/vm/nr_hugepages");
do_mmap(-1, MAP_ANONYMOUS, 1);
do_mmap(fd, 0, 1);
do_mmap(-1, MAP_ANONYMOUS | MAP_HUGETLB, 1);
do_mmap(hugefd, 0, 1);
do_mmap(hugefd, MAP_HUGETLB, 1);
/* Leak the last one to test do_exit() */
do_mmap(-1, MAP_ANONYMOUS | MAP_HUGETLB, 0);
printf("oll korrekt.\n");
return 0;
}


2013-06-18 20:17:21

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 2/2] mmap: allow MAP_HUGETLB for hugetlbfs files

It is counterintuitive at best that mmap'ing a hugetlbfs file with
MAP_HUGETLB fails, while mmap'ing it without will a) succeed and b)
return huge pages.

Signed-off-by: Joern Engel <[email protected]>
---
mm/mmap.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 2a594246..76eb6df 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -33,6 +33,7 @@
#include <linux/uprobes.h>
#include <linux/rbtree_augmented.h>
#include <linux/sched/sysctl.h>
+#include <linux/magic.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -1313,6 +1314,11 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
return addr;
}

+static inline int is_hugetlb_file(struct file *file)
+{
+ return file->f_inode->i_sb->s_magic == HUGETLBFS_MAGIC;
+}
+
SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
unsigned long, prot, unsigned long, flags,
unsigned long, fd, unsigned long, pgoff)
@@ -1322,11 +1328,12 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,

if (!(flags & MAP_ANONYMOUS)) {
audit_mmap_fd(fd, flags);
- if (unlikely(flags & MAP_HUGETLB))
- return -EINVAL;
file = fget(fd);
if (!file)
goto out;
+ retval = -EINVAL;
+ if (unlikely(flags & MAP_HUGETLB && !is_hugetlb_file(file)))
+ goto out_fput;
} else if (flags & MAP_HUGETLB) {
struct user_struct *user = NULL;
/*
@@ -1346,6 +1353,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);

retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+out_fput:
if (file)
fput(file);
out:
--
1.7.10.4

2013-06-18 20:17:32

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 1/2] hugetlb: properly account rss

When moving a program from mmap'ing small pages to mmap'ing huge pages,
a remarkable drop in rss ensues. For some reason hugepages were never
accounted for in rss, which in my book is a clear bug. Sadly this bug
has been present in hugetlbfs since it was merged back in 2002. There
is every chance existing programs depend on hugepages not being counted
as rss.

I think the correct solution is to fix the bug and wait for someone to
complain. It is just as likely that noone cares - as evidenced by the
fact that noone seems to have noticed for ten years.

Signed-off-by: Joern Engel <[email protected]>
---
mm/hugetlb.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1a12f5b..705036c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1174,6 +1174,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
set_page_private(page, (unsigned long)spool);

vma_commit_reservation(h, vma, addr);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, pages_per_huge_page(h));
return page;
}

@@ -2406,6 +2407,9 @@ again:
if (pte_dirty(pte))
set_page_dirty(page);

+ /* -pages_per_huge_page(h) wouldn't get sign-extended */
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1 << h->order);
+
page_remove_rmap(page);
force_flush = !__tlb_remove_page(tlb, page);
if (force_flush)
--
1.7.10.4

2013-06-18 20:20:46

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 0/2] hugetlb fixes

On Tue, 18 June 2013 14:47:03 -0400, Joern Engel wrote:
>
> Test program below is failing before these two patches and passing
> after.

Actually, do we have a place to stuff kernel tests? And if not,
should we have one?

Jörn

--
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra

2013-06-18 20:27:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/2] hugetlb fixes

On Tue, 18 Jun 2013 14:50:55 -0400 J__rn Engel <[email protected]> wrote:

> On Tue, 18 June 2013 14:47:03 -0400, Joern Engel wrote:
> >
> > Test program below is failing before these two patches and passing
> > after.
>
> Actually, do we have a place to stuff kernel tests? And if not,
> should we have one?

Yep, tools/testing/selftests/vm. It's pretty simple and stupid at
present - it anything about the framework irritates you, please fix it!

General guidelines for tools/testing/selftests: the tool should execute
quickly and shouldn't break the overall selftests run at either compile
time or runtime if kernel features are absent, Kconfig is unexpected,
etc.

It's more a "place to accumulate and maintain selftest programs" than a
serious self-testing framework.

2013-06-18 21:34:25

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 0/2] hugetlb fixes

On Tue, 18 June 2013 13:27:05 -0700, Andrew Morton wrote:
> On Tue, 18 Jun 2013 14:50:55 -0400 J__rn Engel <[email protected]> wrote:
>
> > On Tue, 18 June 2013 14:47:03 -0400, Joern Engel wrote:
> > >
> > > Test program below is failing before these two patches and passing
> > > after.
> >
> > Actually, do we have a place to stuff kernel tests? And if not,
> > should we have one?
>
> Yep, tools/testing/selftests/vm. It's pretty simple and stupid at
> present - it anything about the framework irritates you, please fix it!

Just did. :)

Jörn

--
Maintenance in other professions and of other articles is concerned with
the return of the item to its original state; in Software, maintenance
is concerned with moving an item away from its original state.
-- Les Belady

2013-06-19 01:23:25

by Jianguo Wu

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap: allow MAP_HUGETLB for hugetlbfs files

Hi Joern,

On 2013/6/19 2:47, Joern Engel wrote:

> It is counterintuitive at best that mmap'ing a hugetlbfs file with
> MAP_HUGETLB fails, while mmap'ing it without will a) succeed and b)
> return huge pages.
>
> Signed-off-by: Joern Engel <[email protected]>
> ---
> mm/mmap.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2a594246..76eb6df 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -33,6 +33,7 @@
> #include <linux/uprobes.h>
> #include <linux/rbtree_augmented.h>
> #include <linux/sched/sysctl.h>
> +#include <linux/magic.h>
>
> #include <asm/uaccess.h>
> #include <asm/cacheflush.h>
> @@ -1313,6 +1314,11 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> return addr;
> }
>
> +static inline int is_hugetlb_file(struct file *file)
> +{
> + return file->f_inode->i_sb->s_magic == HUGETLBFS_MAGIC;
> +}
> +
> SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
> unsigned long, prot, unsigned long, flags,
> unsigned long, fd, unsigned long, pgoff)
> @@ -1322,11 +1328,12 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
>
> if (!(flags & MAP_ANONYMOUS)) {
> audit_mmap_fd(fd, flags);
> - if (unlikely(flags & MAP_HUGETLB))
> - return -EINVAL;
> file = fget(fd);
> if (!file)
> goto out;
> + retval = -EINVAL;
> + if (unlikely(flags & MAP_HUGETLB && !is_hugetlb_file(file)))

We already have is_file_hugepages().

Thanks,
Jianguo Wu

> + goto out_fput;
> } else if (flags & MAP_HUGETLB) {
> struct user_struct *user = NULL;
> /*
> @@ -1346,6 +1353,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
> flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
>
> retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
> +out_fput:
> if (file)
> fput(file);
> out:


2013-06-19 17:55:04

by Jörn Engel

[permalink] [raw]
Subject: [PATCH] mmap: allow MAP_HUGETLB for hugetlbfs files v2

It is counterintuitive at best that mmap'ing a hugetlbfs file with
MAP_HUGETLB fails, while mmap'ing it without will a) succeed and b)
return huge pages.
v2: use is_file_hugepages(), as suggested by Jianguo

Signed-off-by: Joern Engel <[email protected]>
Cc: Jianguo Wu <[email protected]>
---
mm/mmap.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 2a594246..cdc8e7a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1322,11 +1322,12 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,

if (!(flags & MAP_ANONYMOUS)) {
audit_mmap_fd(fd, flags);
- if (unlikely(flags & MAP_HUGETLB))
- return -EINVAL;
file = fget(fd);
if (!file)
goto out;
+ retval = -EINVAL;
+ if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
+ goto out_fput;
} else if (flags & MAP_HUGETLB) {
struct user_struct *user = NULL;
/*
@@ -1346,6 +1347,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);

retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+out_fput:
if (file)
fput(file);
out:
--
1.7.10.4

2013-06-19 17:55:32

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 2/2] mmap: allow MAP_HUGETLB for hugetlbfs files

On Wed, 19 June 2013 09:22:49 +0800, Jianguo Wu wrote:
>
> We already have is_file_hugepages().

Indeed. Much nicer now. Thanks!

Jörn

--
The grand essentials of happiness are: something to do, something to
love, and something to hope for.
-- Allan K. Chalmers

2013-06-20 01:32:20

by Jianguo Wu

[permalink] [raw]
Subject: Re: [PATCH] mmap: allow MAP_HUGETLB for hugetlbfs files v2

Hi Joern,

I cannot apply this patch to Linus tree, as the code has been modified since
commit af73e4d9506d ("hugetlbfs: fix mmap failure in unaligned size request").

Thanks,
Jianguo Wu

On 2013/6/20 0:25, Jörn Engel wrote:

> It is counterintuitive at best that mmap'ing a hugetlbfs file with
> MAP_HUGETLB fails, while mmap'ing it without will a) succeed and b)
> return huge pages.
> v2: use is_file_hugepages(), as suggested by Jianguo
>
> Signed-off-by: Joern Engel <[email protected]>
> Cc: Jianguo Wu <[email protected]>
> ---
> mm/mmap.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2a594246..cdc8e7a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1322,11 +1322,12 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
>
> if (!(flags & MAP_ANONYMOUS)) {
> audit_mmap_fd(fd, flags);
> - if (unlikely(flags & MAP_HUGETLB))
> - return -EINVAL;
> file = fget(fd);
> if (!file)
> goto out;
> + retval = -EINVAL;
> + if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
> + goto out_fput;
> } else if (flags & MAP_HUGETLB) {
> struct user_struct *user = NULL;
> /*
> @@ -1346,6 +1347,7 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
> flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
>
> retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
> +out_fput:
> if (file)
> fput(file);
> out:


2013-06-28 22:03:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/2] hugetlb: properly account rss

On Tue, 18 Jun 2013 14:47:04 -0400 Joern Engel <[email protected]> wrote:

> When moving a program from mmap'ing small pages to mmap'ing huge pages,
> a remarkable drop in rss ensues. For some reason hugepages were never
> accounted for in rss, which in my book is a clear bug. Sadly this bug
> has been present in hugetlbfs since it was merged back in 2002. There
> is every chance existing programs depend on hugepages not being counted
> as rss.
>
> I think the correct solution is to fix the bug and wait for someone to
> complain. It is just as likely that noone cares - as evidenced by the
> fact that noone seems to have noticed for ten years.
>

Yes, that is a concern. I'll toss it in there so we can see what
happens, but I fear that any problems will take a long time to be
discovered.

2013-07-03 13:41:28

by Steve Capper

[permalink] [raw]
Subject: Re: [PATCH 1/2] hugetlb: properly account rss

On Tue, Jun 18, 2013 at 02:47:04PM -0400, Joern Engel wrote:
> When moving a program from mmap'ing small pages to mmap'ing huge pages,
> a remarkable drop in rss ensues. For some reason hugepages were never
> accounted for in rss, which in my book is a clear bug. Sadly this bug
> has been present in hugetlbfs since it was merged back in 2002. There
> is every chance existing programs depend on hugepages not being counted
> as rss.
>
> I think the correct solution is to fix the bug and wait for someone to
> complain. It is just as likely that noone cares - as evidenced by the
> fact that noone seems to have noticed for ten years.
>
> Signed-off-by: Joern Engel <[email protected]>
> ---
> mm/hugetlb.c | 4 ++++
> 1 file changed, 4 insertions(+)
>

Hi,
This patch has caused a few warnings for me today when it was integrated into
linux-next. The libhugetlbfs test suite gave me:
[ 94.320661] BUG: Bad rss-counter state mm:ffff880119461040 idx:1 val:-512
[ 94.330346] BUG: Bad rss-counter state mm:ffff880119460680 idx:1 val:-2560
[ 94.341746] BUG: Bad rss-counter state mm:ffff880119460d00 idx:1 val:-512
[ 94.347518] BUG: Bad rss-counter state mm:ffff880119460d00 idx:1 val:-512
[ 94.415203] BUG: Bad rss-counter state mm:ffff8801194f9040 idx:1 val:-1024

[ ...]

I think I've found the cause; MAP_SHARED mappings.
alloc_huge_page and __unmap_hugepage_range are called for shared pages. Also,
__unmap_hugepage_range is called more times than alloc_huge_page (which makes
sense as multiple views of a shared mapping are unmapped) leading to negative
counter values.

Excluding VM_SHARED VMAs from the counter increment/decrement stopped the
warnings for me. Although this may not be the best way to address the issue.

Cheers,
--
Steve

2013-07-03 14:56:05

by Steve Capper

[permalink] [raw]
Subject: Re: [PATCH 1/2] hugetlb: properly account rss

On Wed, Jul 03, 2013 at 02:41:19PM +0100, Steve Capper wrote:

[ ... ]

> Excluding VM_SHARED VMAs from the counter increment/decrement stopped the
> warnings for me.

Whoops sorry, the fork copy on write test flagged a BUG, hugetlb_cow will need
to be examined too.

Cheers,
--
Steve