LinuxLists.cc - [RFC PATCH 0/1] Address hugetlbfs mmap behavior

2024-05-03 01:09:54

Subject: [RFC PATCH 0/1] Address hugetlbfs mmap behavior

This patch proposes to fix hugetlbfs mmap behavior so that the
file size does not get updated in the mmap call.

The current behavior is that hugetlbfs file size will get extended by a
PROT_WRITE mmap(2) call if mmap size is greater then file size. This is
not normal filesystem behavior.

There seem to have been very little discussion about this. There was a
patch discussion[1] a while back, implying hugetlbfs file size needs
extending because of the hugetlb page reservations. Looks like this was
not merged.

It appears there is no correlation between file size and hugetlb page
reservations. Take the case of PROT_READ mmap, where the file size is
not extended even though hugetlb pages are reserved.

On the other hand ftruncate(2) to increase a file size does not reserve
hugetlb pages. Also, mmap with MAP_NORESERVE flag extends the file size
even though hugetlb pages are not reserved.

Hugetlb pages get reserved(if MAP_NORESERVE is not specified) when the
hugeltbfs file is mmapped, and it only covers the file's offset,length
range specified in the mmap call.

Issue:

Some applications would prefer to manage hugetlb page allocations explicity
with use of fallocate(2). The hugetlbfs file would be PROT_WRITE mapped with
MAP_NORESERVE flag, which is accessed only after allocating necessary pages
using fallocate(2) and release the pages by truncating the file size. Any stray
access beyond file size is expected to generate a signal. This does not
work properly due to current behavior which extends file size in mmap call.

To address this issue, hugetlbfs behavior needs to be fixed to not extend
file size in mmap(2) call.

However changing current hugetlbfs behavior could potentially break some
applications. Therefore this patch proposes a mount option to hugetlbfs
to choose the mmap behavior of not extending file size.
Use of a mount option was suggested by Matthew Wilcox,

This patch adds a 'nommapfilesz' mount option to hugetlbfs mount option. The
mount option name can be changed if there is a better name suggested.

Submitting this patch as a RFC to get feedback on the approach and if there
is any reason that requires file size to be extended by mmap in hugetlbfs case.

[1] https://lore.kernel.org/lkml/[email protected]/

Prakash Sangappa (1):
hugetlbfs: Add mount option to choose normal mmap behavior

fs/hugetlbfs/inode.c | 19 ++++++++++++++++++-
include/linux/hugetlb.h | 1 +
2 files changed, 19 insertions(+), 1 deletion(-)

--
2.7.4

2024-05-03 01:09:58

by Prakash Sangappa

[permalink] [raw]

Subject: [RFC PATCH 1/1] hugetlbfs: Add mount option to choose normal mmap behavior

Currently hugetlbfs file size gets extended in a PROT_WRITE mmap call,
if mmap size exceeds the file size. This is not normal filesystem behavior.

Some applications expect normal file system behavior for mmap, such that
the process would receive a signal when accessing mapped address beyond
file size. This will not happen if the file size gets extended by mmap(2).

Changing current behavior for hugetlbfs can potentially break existing
applications. Therefore provide a mount option "nommapfilesz" to choose
normal mmap behavior.

Suggested-by: Matthew Wilcox <[email protected]>
Signed-off-by: Prakash Sangappa <[email protected]>
---
fs/hugetlbfs/inode.c | 19 ++++++++++++++++++-
include/linux/hugetlb.h | 1 +
2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 6502c7e..e37867a 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -58,6 +58,7 @@ struct hugetlbfs_fs_context {
kuid_t uid;
kgid_t gid;
umode_t mode;
+ ushort nommapfilesz;
};

int sysctl_hugetlb_shm_group;
@@ -70,6 +71,7 @@ enum hugetlb_param {
Opt_pagesize,
Opt_size,
Opt_uid,
+ Opt_nommapfilesz,
};

static const struct fs_parameter_spec hugetlb_fs_parameters[] = {
@@ -80,6 +82,7 @@ static const struct fs_parameter_spec hugetlb_fs_parameters[] = {
fsparam_string("pagesize", Opt_pagesize),
fsparam_string("size", Opt_size),
fsparam_u32 ("uid", Opt_uid),
+ fsparam_flag ("nommapfilesz", Opt_nommapfilesz),
{}
};

@@ -159,7 +162,12 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
goto out;

ret = 0;
- if (vma->vm_flags & VM_WRITE && inode->i_size < len)
+ /*
+ * If filesystem is mounted with 'nommapfilesz' option, then
+ * don't update file size.
+ */
+ if (!(HUGETLBFS_SB(inode->i_sb)->nommapfilesz)
+ && vma->vm_flags & VM_WRITE && inode->i_size < len)
i_size_write(inode, len);
out:
inode_unlock(inode);
@@ -1169,6 +1177,8 @@ static int hugetlbfs_show_options(struct seq_file *m, struct dentry *root)
seq_printf(m, ",mode=%o", sbinfo->mode);
if (sbinfo->max_inodes != -1)
seq_printf(m, ",nr_inodes=%lu", sbinfo->max_inodes);
+ if (sbinfo->nommapfilesz)
+ seq_puts(m, ",nommapfilesz");

hpage_size /= 1024;
mod = 'K';
@@ -1430,6 +1440,11 @@ static int hugetlbfs_parse_param(struct fs_context *fc, struct fs_parameter *par
ctx->min_val_type = SIZE_PERCENT;
return 0;

+ case Opt_nommapfilesz:
+ /* don't update file size in mmap call */
+ ctx->nommapfilesz = 1;
+ return 0;
+
default:
return -EINVAL;
}
@@ -1487,6 +1502,7 @@ hugetlbfs_fill_super(struct super_block *sb, struct fs_context *fc)
sbinfo->uid = ctx->uid;
sbinfo->gid = ctx->gid;
sbinfo->mode = ctx->mode;
+ sbinfo->nommapfilesz = ctx->nommapfilesz;

/*
* Allocate and initialize subpool if maximum or minimum size is
@@ -1558,6 +1574,7 @@ static int hugetlbfs_init_fs_context(struct fs_context *fc)
ctx->min_hpages = -1; /* No default minimum size */
ctx->max_val_type = NO_SIZE;
ctx->min_val_type = NO_SIZE;
+ ctx->nommapfilesz = 0; /* allows file size update in mmap call */
fc->fs_private = ctx;
fc->ops = &hugetlbfs_fs_context_ops;
return 0;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 77b30a8..282cab5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -537,6 +537,7 @@ struct hugetlbfs_sb_info {
kuid_t uid;
kgid_t gid;
umode_t mode;
+ ushort nommapfilesz;
};

static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
--
2.7.4

2024-05-07 12:07:29

by David Hildenbrand

[permalink] [raw]

Subject: Re: [RFC PATCH 0/1] Address hugetlbfs mmap behavior

On 03.05.24 03:21, Prakash Sangappa wrote:
> This patch proposes to fix hugetlbfs mmap behavior so that the
> file size does not get updated in the mmap call.
>
> The current behavior is that hugetlbfs file size will get extended by a
> PROT_WRITE mmap(2) call if mmap size is greater then file size. This is
> not normal filesystem behavior.
>
> There seem to have been very little discussion about this. There was a
> patch discussion[1] a while back, implying hugetlbfs file size needs
> extending because of the hugetlb page reservations. Looks like this was
> not merged.
>
> It appears there is no correlation between file size and hugetlb page
> reservations. Take the case of PROT_READ mmap, where the file size is
> not extended even though hugetlb pages are reserved.
>
> On the other hand ftruncate(2) to increase a file size does not reserve
> hugetlb pages. Also, mmap with MAP_NORESERVE flag extends the file size
> even though hugetlb pages are not reserved.
>
> Hugetlb pages get reserved(if MAP_NORESERVE is not specified) when the
> hugeltbfs file is mmapped, and it only covers the file's offset,length
> range specified in the mmap call.
>
> Issue:
>
> Some applications would prefer to manage hugetlb page allocations explicity
> with use of fallocate(2). The hugetlbfs file would be PROT_WRITE mapped with
> MAP_NORESERVE flag, which is accessed only after allocating necessary pages
> using fallocate(2) and release the pages by truncating the file size. Any stray
> access beyond file size is expected to generate a signal. This does not
> work properly due to current behavior which extends file size in mmap call.

Would a simple workaround be to mmap(PROT_READ) and then
mprotect(PROT_READ|PROT_WRITE)?

I know, not perfect, but certainly better than mount options?

--
Cheers,

David / dhildenb

2024-05-08 17:00:50

by Prakash Sangappa

[permalink] [raw]

Subject: Re: [RFC PATCH 0/1] Address hugetlbfs mmap behavior

> On May 7, 2024, at 5:00 AM, David Hildenbrand <[email protected]> wrote:
>
> On 03.05.24 03:21, Prakash Sangappa wrote:
>> This patch proposes to fix hugetlbfs mmap behavior so that the
>> file size does not get updated in the mmap call.
>> The current behavior is that hugetlbfs file size will get extended by a
>> PROT_WRITE mmap(2) call if mmap size is greater then file size. This is
>> not normal filesystem behavior.
>> There seem to have been very little discussion about this. There was a
>> patch discussion[1] a while back, implying hugetlbfs file size needs
>> extending because of the hugetlb page reservations. Looks like this was
>> not merged.
>> It appears there is no correlation between file size and hugetlb page
>> reservations. Take the case of PROT_READ mmap, where the file size is
>> not extended even though hugetlb pages are reserved.
>> On the other hand ftruncate(2) to increase a file size does not reserve
>> hugetlb pages. Also, mmap with MAP_NORESERVE flag extends the file size
>> even though hugetlb pages are not reserved.
>> Hugetlb pages get reserved(if MAP_NORESERVE is not specified) when the
>> hugeltbfs file is mmapped, and it only covers the file's offset,length
>> range specified in the mmap call.
>> Issue:
>> Some applications would prefer to manage hugetlb page allocations explicity
>> with use of fallocate(2). The hugetlbfs file would be PROT_WRITE mapped with
>> MAP_NORESERVE flag, which is accessed only after allocating necessary pages
>> using fallocate(2) and release the pages by truncating the file size. Any stray
>> access beyond file size is expected to generate a signal. This does not
>> work properly due to current behavior which extends file size in mmap call.
>
> Would a simple workaround be to mmap(PROT_READ) and then mprotect(PROT_READ|PROT_WRITE)?

Another workaround could be to ftruncate(2) the file after mmap(PROT_READ|PROT_WRITE), if MAP_NORESERVE is used. But these will require application changes as a special case for hugetlbfs that can be considered.

However, should this mmap behavior be addressed? Why mmap(PROT_WRITE) has to extend the file size needs clarification.

Thanks,
-Prakash

>
> I know, not perfect, but certainly better than mount options?
>
> --
> Cheers,
>
> David / dhildenb

2024-05-08 17:28:56

by David Hildenbrand

[permalink] [raw]

Subject: Re: [RFC PATCH 0/1] Address hugetlbfs mmap behavior

On 08.05.24 19:00, Prakash Sangappa wrote:
>
>
>> On May 7, 2024, at 5:00 AM, David Hildenbrand <[email protected]> wrote:
>>
>> On 03.05.24 03:21, Prakash Sangappa wrote:
>>> This patch proposes to fix hugetlbfs mmap behavior so that the
>>> file size does not get updated in the mmap call.
>>> The current behavior is that hugetlbfs file size will get extended by a
>>> PROT_WRITE mmap(2) call if mmap size is greater then file size. This is
>>> not normal filesystem behavior.
>>> There seem to have been very little discussion about this. There was a
>>> patch discussion[1] a while back, implying hugetlbfs file size needs
>>> extending because of the hugetlb page reservations. Looks like this was
>>> not merged.
>>> It appears there is no correlation between file size and hugetlb page
>>> reservations. Take the case of PROT_READ mmap, where the file size is
>>> not extended even though hugetlb pages are reserved.
>>> On the other hand ftruncate(2) to increase a file size does not reserve
>>> hugetlb pages. Also, mmap with MAP_NORESERVE flag extends the file size
>>> even though hugetlb pages are not reserved.
>>> Hugetlb pages get reserved(if MAP_NORESERVE is not specified) when the
>>> hugeltbfs file is mmapped, and it only covers the file's offset,length
>>> range specified in the mmap call.
>>> Issue:
>>> Some applications would prefer to manage hugetlb page allocations explicity
>>> with use of fallocate(2). The hugetlbfs file would be PROT_WRITE mapped with
>>> MAP_NORESERVE flag, which is accessed only after allocating necessary pages
>>> using fallocate(2) and release the pages by truncating the file size. Any stray
>>> access beyond file size is expected to generate a signal. This does not
>>> work properly due to current behavior which extends file size in mmap call.
>>
>> Would a simple workaround be to mmap(PROT_READ) and then mprotect(PROT_READ|PROT_WRITE)?
>
> Another workaround could be to ftruncate(2) the file after mmap(PROT_READ|PROT_WRITE), if MAP_NORESERVE is used. But these will require application changes as a special case for hugetlbfs that can be considered.

I'd assume that most applications that mmap() hugetlb files need to
special-case hugetlb because of the different logical page size
granularity already. But yes, it's all unfortunate.

>
> However, should this mmap behavior be addressed? Why mmap(PROT_WRITE) has to extend the file size needs clarification.

The issue is, as you write, that it's existing behavior and changing it
could cause harm to other apps that rely on that. But I do wonder if really
anybody relies on that ...

Let's explore the history:

The current VM_WRITE check was added in:

commit b6174df5eec9cdfd598c03d6d0807e344e109213
Author: Zhang, Yanmin <[email protected]>
Date: Mon Jul 10 04:44:49 2006 -0700

[PATCH] mmap zero-length hugetlb file with PROT_NONE to protect a hugetlb virtual area

Sometimes, applications need below call to be successful although
"/mnt/hugepages/file1" doesn't exist.

fd = open("/mnt/hugepages/file1", O_CREAT|O_RDWR, 0755);
*addr = mmap(NULL, 0x1024*1024*256, PROT_NONE, 0, fd, 0);

As for regular pages (or files), above call does work, but as for huge
pages, above call would fail because hugetlbfs_file_mmap would fail if
(!(vma->vm_flags & VM_WRITE) && len > inode->i_size).

This capability on huge page is useful on ia64 when the process wants to
protect one area on region 4, so other threads couldn't read/write this
area. A famous JVM (Java Virtual Machine) implementation on IA64 needs the
capability.

But it was only moved.

Before that patch:
* mmap(PROT_WRITE) would have failed if the file size would be exceeded
* mmap(PROT_READ/PROT_NONE) would have extended the file

After that patch
* mmap(PROT_WRITE) will extend the file
* mmap(PROT_READ/PROT_NONE) do not extend the file

The code before that predates git times.

Having a mount option to change that really is suboptimal IMHO ... we shouldn't add mount options to work
around all hugetlbfs quirks.

I suggest either

(a) Document it, along with the workaround
(b) Change it an cross fingers.

In QEMU source code is a very interesting comment:

* ftruncate is not supported by hugetlbfs in older
* hosts, so don't bother bailing out on errors.
* If anything goes wrong with it under other filesystems,
* mmap will fail.

So, was mmap() maybe the way to easily grow a hugetlbfs file before ftruncate() support
was added?

QEMU will only call ftruncate() if the file size is empty, though. So if you'd have a
smaller file QEMU would not try growing it, and mmap() would succeed and grow it. That's
a rare case to happen, though, and likely also undesired here: we want it to behave just
like ordinary files!

--
Cheers,

David / dhildenb

2024-05-10 16:29:05

by Prakash Sangappa

[permalink] [raw]

Subject: Re: [RFC PATCH 0/1] Address hugetlbfs mmap behavior

> On May 8, 2024, at 10:28 AM, David Hildenbrand <[email protected]> wrote:
>
> On 08.05.24 19:00, Prakash Sangappa wrote:
>>> On May 7, 2024, at 5:00 AM, David Hildenbrand <[email protected]> wrote:
>>>
>>> On 03.05.24 03:21, Prakash Sangappa wrote:
>>>> This patch proposes to fix hugetlbfs mmap behavior so that the
>>>> file size does not get updated in the mmap call.
>>>> The current behavior is that hugetlbfs file size will get extended by a
>>>> PROT_WRITE mmap(2) call if mmap size is greater then file size. This is
>>>> not normal filesystem behavior.
>>>> There seem to have been very little discussion about this. There was a
>>>> patch discussion[1] a while back, implying hugetlbfs file size needs
>>>> extending because of the hugetlb page reservations. Looks like this was
>>>> not merged.
>>>> It appears there is no correlation between file size and hugetlb page
>>>> reservations. Take the case of PROT_READ mmap, where the file size is
>>>> not extended even though hugetlb pages are reserved.
>>>> On the other hand ftruncate(2) to increase a file size does not reserve
>>>> hugetlb pages. Also, mmap with MAP_NORESERVE flag extends the file size
>>>> even though hugetlb pages are not reserved.
>>>> Hugetlb pages get reserved(if MAP_NORESERVE is not specified) when the
>>>> hugeltbfs file is mmapped, and it only covers the file's offset,length
>>>> range specified in the mmap call.
>>>> Issue:
>>>> Some applications would prefer to manage hugetlb page allocations explicity
>>>> with use of fallocate(2). The hugetlbfs file would be PROT_WRITE mapped with
>>>> MAP_NORESERVE flag, which is accessed only after allocating necessary pages
>>>> using fallocate(2) and release the pages by truncating the file size. Any stray
>>>> access beyond file size is expected to generate a signal. This does not
>>>> work properly due to current behavior which extends file size in mmap call.
>>>
>>> Would a simple workaround be to mmap(PROT_READ) and then mprotect(PROT_READ|PROT_WRITE)?
>> Another workaround could be to ftruncate(2) the file after mmap(PROT_READ|PROT_WRITE), if MAP_NORESERVE is used. But these will require application changes as a special case for hugetlbfs that can be considered.
>
> I'd assume that most applications that mmap() hugetlb files need to
> special-case hugetlb because of the different logical page size
> granularity already. But yes, it's all unfortunate.

Will run this by out application/Database team regarding implementing workarounds.

>
>> However, should this mmap behavior be addressed? Why mmap(PROT_WRITE) has to extend the file size needs clarification.
>
> The issue is, as you write, that it's existing behavior and changing it
> could cause harm to other apps that rely on that. But I do wonder if really
> anybody relies on that ...
>
> Let's explore the history:
>
> The current VM_WRITE check was added in:
>
> commit b6174df5eec9cdfd598c03d6d0807e344e109213
> Author: Zhang, Yanmin <[email protected]>
> Date: Mon Jul 10 04:44:49 2006 -0700
>
> [PATCH] mmap zero-length hugetlb file with PROT_NONE to protect a hugetlb virtual area
> Sometimes, applications need below call to be successful although
> "/mnt/hugepages/file1" doesn't exist.
> fd = open("/mnt/hugepages/file1", O_CREAT|O_RDWR, 0755);
> *addr = mmap(NULL, 0x1024*1024*256, PROT_NONE, 0, fd, 0);
> As for regular pages (or files), above call does work, but as for huge
> pages, above call would fail because hugetlbfs_file_mmap would fail if
> (!(vma->vm_flags & VM_WRITE) && len > inode->i_size).
> This capability on huge page is useful on ia64 when the process wants to
> protect one area on region 4, so other threads couldn't read/write this
> area. A famous JVM (Java Virtual Machine) implementation on IA64 needs the
> capability.
>
> But it was only moved.
>
> Before that patch:
> * mmap(PROT_WRITE) would have failed if the file size would be exceeded
> * mmap(PROT_READ/PROT_NONE) would have extended the file
>
> After that patch
> * mmap(PROT_WRITE) will extend the file
> * mmap(PROT_READ/PROT_NONE) do not extend the file
>
> The code before that predates git times.
>
> Having a mount option to change that really is suboptimal IMHO ... we shouldn't add mount options to work
> around all hugetlbfs quirks.
>
> I suggest either
>
> (a) Document it, along with the workaround

At least needs documentation.

> (b) Change it an cross fingers.
>
>
> In QEMU source code is a very interesting comment:
>
> * ftruncate is not supported by hugetlbfs in older
> * hosts, so don't bother bailing out on errors.
> * If anything goes wrong with it under other filesystems,
> * mmap will fail.
>
> So, was mmap() maybe the way to easily grow a hugetlbfs file before ftruncate() support
> was added?
>
> QEMU will only call ftruncate() if the file size is empty, though. So if you'd have a
> smaller file QEMU would not try growing it, and mmap() would succeed and grow it. That's
> a rare case to happen, though, and likely also undesired here: we want it to behave just
> like ordinary files!

Ideally yes.

Thanks for your feedback.
-Prakash.

>
> --
> Cheers,
>
> David / dhildenb