2016-11-04 13:14:57

by Christopher Covington

Subject: [PATCH] procfs: Add mem_end to /proc/<pid>/stat

Applications such as Just-In-Time (JIT) compilers, Checkpoint/Restore In
Userspace (CRIU), and User Mode Linux (UML) need to know the highest
virtual address, TASK_SIZE, to implement pointer tagging or make a first
educated guess at where to find a large, unused region of memory.
Unfortunately the currently available mechanisms for determining TASK_SIZE
are either convoluted and potentially error-prone, such as making repeated
munmap() calls and checking the return code, or make use of hard-coded
assumptions that limit an application's portability across kernels with
different Kconfig options and multiple architectures.

Therefore, expose TASK_SIZE to userspace. While PAGE_SIZE is exposed to
userspace via an auxiliary vector, that approach is not used for TASK_SIZE
in case run-time alterations to the usable virtual address range are one
day implemented, such as through an extension to prctl(PR_SET_MM) or a flag
to clone. There is no prctl(PR_GET_MM). Instead such information is
expected to come from /proc/<pid>/stat[m]. For the same extendability
reason, use a per-pid proc entry rather than a system-wide entry like
/proc/sys/vm/mmap_min_addr.
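
For comparison, the auxiliary-vector route mentioned above for PAGE_SIZE can be sketched as follows. This is only an illustration, assuming glibc's getauxval() from <sys/auxv.h>; the helper name page_size_from_auxv() is made up for this example and is not part of the patch.

```c
#include <assert.h>
#include <sys/auxv.h>	/* getauxval(), AT_PAGESZ */
#include <unistd.h>	/* sysconf(), _SC_PAGESIZE */

/*
 * Hypothetical helper: return the page size that the kernel placed in the
 * auxiliary vector. getauxval() returns 0 when the entry is missing, in
 * which case fall back to sysconf().
 */
static unsigned long page_size_from_auxv(void)
{
	unsigned long sz = getauxval(AT_PAGESZ);

	if (sz == 0) {
		long fallback = sysconf(_SC_PAGESIZE);

		sz = fallback > 0 ? (unsigned long)fallback : 0;
	}

	/* Page sizes are powers of two; 0 means "could not determine". */
	assert(sz == 0 || (sz & (sz - 1)) == 0);
	return sz;
}
```

The value is fixed at exec time, which is why this mechanism would not fit a TASK_SIZE that might later become adjustable at run time.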

Signed-off-by: Christopher Covington <[email protected]>
---
 Documentation/filesystems/proc.txt | 1 +
 fs/proc/array.c                    | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 74329fd..b9c19cf 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -343,6 +343,7 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
   env_start     address above which program environment is placed
   env_end       address below which program environment is placed
   exit_code     the thread's exit_code in the form reported by the waitpid system call
+  end_mem       address below which all regular program parts are placed (TASK_SIZE)
 ..............................................................................

The /proc/PID/maps file containing the currently mapped memory regions and
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 9a3ca9e..32b5002 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -561,6 +561,11 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	else
 		seq_puts(m, " 0");

+	if (mm && permitted)
+		seq_put_decimal_ull(m, " ", mm->task_size);
+	else
+		seq_puts(m, " 0");
+
 	seq_putc(m, '\n');
 	if (mm)
 		mmput(mm);
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project.
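
A userspace consumer of the proposed field might look roughly like the sketch below. It assumes this patch is applied, so that the new value is the last field on the /proc/<pid>/stat line; on a kernel without the patch the last field is something else (exit_code in the table above). The function name stat_last_field() is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the last space-separated field of a
 * /proc/<pid>/stat line, or 0 if there is none or it is not a number.
 * Scanning from the end of the line avoids the comm field (field 2),
 * which is parenthesized and may itself contain spaces.
 */
static unsigned long long stat_last_field(const char *line)
{
	size_t len, start;

	assert(line != NULL);

	/* Ignore the trailing newline written by seq_putc(m, '\n'). */
	len = strlen(line);
	while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == ' '))
		len--;

	/* Walk back to the first character of the last field. */
	start = len;
	while (start > 0 && line[start - 1] != ' ')
		start--;

	if (start == len)
		return 0;
	return strtoull(line + start, NULL, 10);
}
```

A caller would read one line from /proc/self/stat with fgets() and pass it in. A result of 0 has to be treated as "not available", since the patch prints " 0" when the reader is not permitted or the task has no mm.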


2016-11-04 14:21:43

by Andy Lutomirski

Subject: Re: [PATCH] procfs: Add mem_end to /proc/<pid>/stat

On Fri, Nov 4, 2016 at 6:14 AM, Christopher Covington
<[email protected]> wrote:
> Applications such as Just-In-Time (JIT) compilers, Checkpoint/Restore In
> Userspace (CRIU), and User Mode Linux (UML) need to know the highest
> virtual address, TASK_SIZE, to implement pointer tagging or make a first
> educated guess at where to find a large, unused region of memory.
> Unfortunately the currently available mechanisms for determining TASK_SIZE
> are either convoluted and potentially error-prone, such as making repeated
> munmap() calls and checking the return code,

Oh boy -- if you do this you are just asking to segfault.

> or make use of hard-coded
> assumptions that limit an application's portability across kernels with
> different Kconfig options and multiple architectures.
>
> Therefore, expose TASK_SIZE to userspace. While PAGE_SIZE is exposed to
> userspace via an auxiliary vector, that approach is not used for TASK_SIZE
> in case run-time alterations to the usable virtual address range are one
> day implemented, such as through an extension to prctl(PR_SET_MM) or a flag
> to clone. There is no prctl(PR_GET_MM). Instead such information is
> expected to come from /proc/<pid>/stat[m]. For the same extendability
> reason, use a per-pid proc entry rather than a system-wide entry like
> /proc/sys/vm/mmap_min_addr.

First, this should be in status, not stat, but that's moot because
TASK_SIZE is nonsensical as a task property on x86. And, as was
nicely covered yesterday at LPC, we already have too much of a mess in
/proc where per-mm properties are mixed up with per-task properties.
Can we make a point of not adding any new mm-related things to files
that are about the task?

But also, NAK for x86 if you look at TASK_SIZE:

TASK_SIZE is a mess and needs to go away completely -- only
TASK_SIZE_MAX makes any sense. If you want to ask "what is the largest
address that could possibly be mapped in this mm", the answer is
2^47-1-PAGE_SIZE [1] on present CPUs. If you want a prctl to return
that, then adding one *might* make sense. OTOH it's a bit unclear
what happens if your task is migrated to a hypothetical future CPU
with more address bits.

If you're a 32-bit process on x86, you have zero high bits free
because the address limit is above 2^31-1.
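
The arithmetic in the two paragraphs above can be worked through in a short sketch. The helper names are made up for illustration, PAGE_SIZE is assumed to be 4096, and 0xbfffffff is only an example of a 32-bit limit above 2^31-1, not a value taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Highest mappable address in the form given above: 2^va_bits - 1 - PAGE_SIZE. */
static uint64_t highest_user_addr(unsigned va_bits, uint64_t page_size)
{
	assert(va_bits > 0 && va_bits < 64);
	return (UINT64_C(1) << va_bits) - 1 - page_size;
}

/*
 * Number of high bits that are zero in every address up to highest_addr,
 * for a pointer that is ptr_bits wide. These are the bits a pointer-tagging
 * scheme could borrow.
 */
static unsigned free_high_bits(uint64_t highest_addr, unsigned ptr_bits)
{
	unsigned used = 0;

	/* Count the bits needed to represent highest_addr. */
	while (highest_addr != 0) {
		used++;
		highest_addr >>= 1;
	}
	return used >= ptr_bits ? 0 : ptr_bits - used;
}
```

With 47 address bits and 4 KiB pages this gives 0x7fffffffefff as the highest address and 17 free high bits in a 64-bit pointer. Any 32-bit limit above 2^31-1 uses bit 31, so free_high_bits() returns 0, matching the point about 32-bit processes.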

If you're an x32 process, then (a) I'm surprised and (b) there might
be room for "what is the highest address that an mmap call done
without trickery would return". That could be added as well with a
suitably scary name in prctl. But this is still rather odd: x32
pointers are exactly 32 bits unless you write weird asm code to use
64-bit pointers, and you wouldn't do that because it defeats the whole
point of x32 which is to treat all pointers as exactly 32 bits. So an
x32 application should just hard-code 32 as the number of bits.

[1] That PAGE_SIZE offset has an interesting backstory involving some
overly clever Intel hardware designers and a root hole that, as far as
I know, affected every single x86_64 operating system.

--Andy