2020-10-17 05:32:48

by Jann Horn

[permalink] [raw]
Subject: [RFC PATCH resend 0/6] mm and ptrace: Track dumpability until task is freed

[sorry, had to resend - it was pointed out to me that when I sent
this series the first time, DKIM got broken by the kvack list
rewriting 8-bit into quoted-printable]

At the moment, there is a lifetime issue (no, not the UAF kind) around
__ptrace_may_access():

__ptrace_may_access() wants to check mm->flags and mm->user_ns to figure
out whether the caller should be allowed to access some target task.
__ptrace_may_access() can be called as long as __put_task_struct()
hasn't happened yet; but __put_task_struct() happens when the task is
about to be freed, which is much later than exit_mm() (which happens
pretty early during task exit).
So we can have a situation where we need to consult the mm for a
security check, but we don't have an mm anymore.

At the moment, this is solved by failing open: If the mm is gone, we
pretend that it was dumpable. That's dubious from a security
perspective - as one example, we drop the mm_struct before the file
descriptor table, so someone might be able to steal file descriptors
from an exiting tasks when dumpability was supposed to prevent that.

The easy fix would be to let __ptrace_may_access() instead always refuse
access to tasks that have lost their mm; but then that would e.g. mean
that the ability to inspect dead tasks in procfs would be restricted.
So while that might work in practice, it'd be a bit ugly, too.

Another option would be to move the dumpability information elsewhere -
but that would have to be the task_struct (the signal_struct can be
shared with dead pre-execve threads, so we can't use it here). So we'd
have to keep dumpability information in sync across threads - that'd
probably be pretty ugly.


So I think the proper fix is to let the task_struct hold a reference on
the mm_struct until the task goes away completely. This is implemented
in patch 1/6, which is also the only patch in this series that I
actually care about (and the only one with a stable backport marking);
the rest of the series are some tweaks in case people dislike the idea
of constantly freeing mm_structs from workqueue context.
Those tweaks should also reduce the memory usage of dead tasks, by
ensuring that they don't keep their PGDs alive.


Patch 1/6 is not particularly pretty, but I can't think of any better
way to do it.

So: Does this series (and in particular patch 1/6) look vaguely sane?
And if not, does anyone have a better approach?


Jann Horn (6):
ptrace: Keep mm around after exit_mm() for __ptrace_may_access()
refcount: Move refcount_t definition into linux/types.h
mm: Add refcount for preserving mm_struct without pgd
mm, oom: Use mm_ref()/mm_unref() and avoid mmdrop_async()
ptrace: Use mm_ref() for ->exit_mm
mm: remove now-unused mmdrop_async()

arch/x86/kernel/tboot.c | 2 +
drivers/firmware/efi/efi.c | 2 +
include/linux/mm_types.h | 15 ++++++-
include/linux/refcount.h | 13 +-----
include/linux/sched.h | 8 ++++
include/linux/sched/mm.h | 13 ++++++
include/linux/types.h | 12 +++++
kernel/exit.c | 2 +
kernel/fork.c | 90 +++++++++++++++++---------------------
kernel/ptrace.c | 10 +++++
mm/init-mm.c | 2 +
mm/oom_kill.c | 2 +-
12 files changed, 105 insertions(+), 66 deletions(-)


base-commit: bbf5c979011a099af5dc76498918ed7df445635b
--
2.29.0.rc1.297.gfa9743e501-goog


2020-10-17 05:34:30

by Jann Horn

[permalink] [raw]
Subject: [RFC PATCH resend 3/6] mm: Add refcount for preserving mm_struct without pgd

Currently, mm_struct has two refcounts:

- mm_users: preserves everything - the mm_struct, the page tables, the
memory mappings, and so on
- mm_count: preserves the mm_struct and pgd

However, there are three types of users of mm_struct:

1. users that want page tables, memory mappings and so on
2. users that want to preserve the pgd (for lazy TLB)
3. users that just want to keep the mm_struct itself around (e.g. for
mmget_not_zero() or __ptrace_may_access())

Dropping mm_count references can be messy because dropping mm_count to
zero deletes the pgd, which takes the pgd_lock on x86, meaning it doesn't
work from RCU callbacks (which run in IRQ context). In those cases,
mmdrop_async() must be used to punt the invocation of __mmdrop() to
workqueue context.

That's fine when mmdrop_async() is a rare case, but the preceding patch
"ptrace: Keep mm around after exit_mm() for __ptrace_may_access()" makes it
the common case; we should probably avoid punting freeing to workqueue
context all the time if we can avoid it?

To resolve this, add a third refcount that just protects the mm_struct and
the user_ns it points to, and which can be dropped with synchronous freeing
from (almost) any context.

Signed-off-by: Jann Horn <[email protected]>
---
arch/x86/kernel/tboot.c | 2 ++
drivers/firmware/efi/efi.c | 2 ++
include/linux/mm_types.h | 13 +++++++++++--
include/linux/sched/mm.h | 13 +++++++++++++
kernel/fork.c | 14 ++++++++++----
mm/init-mm.c | 2 ++
6 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 992fb1415c0f..b92ea1bb3bb9 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -19,6 +19,7 @@
#include <linux/mm.h>
#include <linux/tboot.h>
#include <linux/debugfs.h>
+#include <linux/refcount.h>

#include <asm/realmode.h>
#include <asm/processor.h>
@@ -93,6 +94,7 @@ static struct mm_struct tboot_mm = {
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
+ .mm_bare_refs = REFCOUNT_INIT(1),
MMAP_LOCK_INITIALIZER(init_mm)
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 3aa07c3b5136..3b73a0717c6e 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -26,6 +26,7 @@
#include <linux/platform_device.h>
#include <linux/random.h>
#include <linux/reboot.h>
+#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/acpi.h>
#include <linux/ucs2_string.h>
@@ -54,6 +55,7 @@ struct mm_struct efi_mm = {
.mm_rb = RB_ROOT,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
+ .mm_bare_refs = REFCOUNT_INIT(1),
MMAP_LOCK_INITIALIZER(efi_mm)
.page_table_lock = __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(efi_mm.mmlist),
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ed028af3cb19..764d251966c7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -429,13 +429,22 @@ struct mm_struct {

/**
* @mm_count: The number of references to &struct mm_struct
- * (@mm_users count as 1).
+ * including its pgd (@mm_users count as 1).
*
* Use mmgrab()/mmdrop() to modify. When this drops to 0, the
- * &struct mm_struct is freed.
+ * pgd is freed.
*/
atomic_t mm_count;

+ /**
+ * @mm_bare_refs: The number of references to &struct mm_struct
+ * that preserve no page table state whatsoever (@mm_count
+ * counts as 1).
+ *
+ * When this drops to 0, the &struct mm_struct is freed.
+ */
+ refcount_t mm_bare_refs;
+
/**
* @has_pinned: Whether this mm has pinned any pages. This can
* be either replaced in the future by @pinned_vm when it
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index f889e332912f..e5b236e15757 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -109,6 +109,19 @@ extern void mmput(struct mm_struct *);
void mmput_async(struct mm_struct *);
#endif

+static inline void mm_ref(struct mm_struct *mm)
+{
+ refcount_inc(&mm->mm_bare_refs);
+}
+
+void __mm_unref(struct mm_struct *mm);
+
+static inline void mm_unref(struct mm_struct *mm)
+{
+ if (refcount_dec_and_test(&mm->mm_bare_refs))
+ __mm_unref(mm);
+}
+
/* Grab a reference to a task's mm, if it is not already going away */
extern struct mm_struct *get_task_mm(struct task_struct *task);
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 4942428a217c..fcdd1ace79e4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -642,10 +642,16 @@ static void check_mm(struct mm_struct *mm)
#define allocate_mm() (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
#define free_mm(mm) (kmem_cache_free(mm_cachep, (mm)))

+void __mm_unref(struct mm_struct *mm)
+{
+ put_user_ns(mm->user_ns);
+ free_mm(mm);
+}
+
/*
- * Called when the last reference to the mm
+ * Called when the last PGD-preserving reference to the mm
* is dropped: either by a lazy thread or by
- * mmput. Free the page directory and the mm.
+ * mmput. Free the page directory.
*/
void __mmdrop(struct mm_struct *mm)
{
@@ -656,8 +662,7 @@ void __mmdrop(struct mm_struct *mm)
destroy_context(mm);
mmu_notifier_subscriptions_destroy(mm);
check_mm(mm);
- put_user_ns(mm->user_ns);
- free_mm(mm);
+ mm_unref(mm);
}
EXPORT_SYMBOL_GPL(__mmdrop);

@@ -1007,6 +1012,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->vmacache_seqnum = 0;
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
+ refcount_set(&mm->mm_bare_refs, 1);
mmap_init_lock(mm);
INIT_LIST_HEAD(&mm->mmlist);
mm->core_state = NULL;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 3a613c85f9ed..3c3cd35236fd 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -7,6 +7,7 @@
#include <linux/cpumask.h>
#include <linux/mman.h>
#include <linux/pgtable.h>
+#include <linux/refcount.h>

#include <linux/atomic.h>
#include <linux/user_namespace.h>
@@ -31,6 +32,7 @@ struct mm_struct init_mm = {
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
+ .mm_bare_refs = REFCOUNT_INIT(1),
MMAP_LOCK_INITIALIZER(init_mm)
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-17 05:34:38

by Jann Horn

[permalink] [raw]
Subject: [RFC PATCH resend 5/6] ptrace: Use mm_ref() for ->exit_mm

We only use ->exit_mm to look up dumpability and the ->user_mm; we don't
need to keep the PGD alive for this.
mmgrab() is also inconvenient here, because it means that we need to use
mmdrop_async() when dropping the reference to the mm from an RCU callback.
Use mm_ref() instead of mmgrab() to make things neater.

Signed-off-by: Jann Horn <[email protected]>
---
kernel/exit.c | 2 +-
kernel/fork.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 97253ef33486..03ba6d13ef1e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -476,7 +476,7 @@ static void exit_mm(void)
/* more a memory barrier than a real lock */
task_lock(current);
current->mm = NULL;
- mmgrab(mm); /* for current->exit_mm */
+ mm_ref(mm); /* for current->exit_mm */
current->exit_mm = mm;
mmap_read_unlock(mm);
enter_lazy_tlb(mm, current);
diff --git a/kernel/fork.c b/kernel/fork.c
index 59c119b03351..4383bf055b40 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -720,7 +720,7 @@ void free_task(struct task_struct *tsk)
if (tsk->flags & PF_KTHREAD)
free_kthread_struct(tsk);
if (tsk->exit_mm)
- mmdrop_async(tsk->exit_mm);
+ mm_unref(tsk->exit_mm);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-17 05:35:23

by Jann Horn

[permalink] [raw]
Subject: [RFC PATCH resend 4/6] mm, oom: Use mm_ref()/mm_unref() and avoid mmdrop_async()

The OOM killer uses MMF_OOM_SKIP to avoid running on an mm that has started
__mmput(); it only uses the mmgrab() reference to ensure that the mm_struct
itself stays alive.

This means that we don't need a full mmgrab() reference, which will keep
the pgd (and potentially also some pmd pages) alive and can't be cleaned up
from RCU callback context; we can use an mm_ref() reference instead.

Signed-off-by: Jann Horn <[email protected]>
---
kernel/fork.c | 6 +-----
mm/oom_kill.c | 2 +-
2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index fcdd1ace79e4..59c119b03351 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,12 +686,8 @@ static inline void free_signal_struct(struct signal_struct *sig)
{
taskstats_tgid_free(sig);
sched_autogroup_exit(sig);
- /*
- * __mmdrop is not safe to call from softirq context on x86 due to
- * pgd_dtor so postpone it to the async context
- */
if (sig->oom_mm)
- mmdrop_async(sig->oom_mm);
+ mm_unref(sig->oom_mm);
kmem_cache_free(signal_cachep, sig);
}

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e90f25d6385d..12967f54fbcf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -704,7 +704,7 @@ static void mark_oom_victim(struct task_struct *tsk)

/* oom_mm is bound to the signal struct life time. */
if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
- mmgrab(tsk->signal->oom_mm);
+ mm_ref(tsk->signal->oom_mm);
set_bit(MMF_OOM_VICTIM, &mm->flags);
}

--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-17 07:04:23

by Jann Horn

[permalink] [raw]
Subject: [RFC PATCH resend 1/6] ptrace: Keep mm around after exit_mm() for __ptrace_may_access()

__ptrace_may_access() checks can happen on target tasks that are in the
middle of do_exit(), past exit_mm(). At that point, the ->mm pointer has
been NULLed out, and the mm_struct has been mmput().

Unfortunately, the mm_struct contains the dumpability and the user_ns in
which the task last went through execve(), and we need those for
__ptrace_may_access(). Currently, that problem is handled by failing open:
If the ->mm is gone, we assume that the task was dumpable. In some edge
cases, this could potentially expose access to things like
/proc/$pid/fd/$fd of originally non-dumpable processes.
(exit_files() comes after exit_mm(), so the file descriptor table is still
there when we've gone through exit_mm().)

One way to fix this would be to move mm->user_ns and the dumpability state
over into the task_struct. However, that gets quite ugly if we want to
preserve existing semantics because e.g. PR_SET_DUMPABLE and commit_creds()
would then have to scan through all tasks sharing the mm_struct and keep
them in sync manually - that'd be a bit error-prone and overcomplicated.

(Moving these things into the signal_struct is not an option because that
is kept across executions, and pre-execve co-threads will share the
signal_struct that is also used by the task that has gone through
execve().)

I believe that this patch may be the least bad option to fix this - keep
the mm_struct (but not process memory) around with an mmgrab() reference
from exit_mm() until the task goes away completely.

Note that this moves free_task() down in order to make mmdrop_async()
available without a forward declaration.

Cc: [email protected]
Fixes: bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks")
Signed-off-by: Jann Horn <[email protected]>
---
include/linux/sched.h | 8 +++++++
kernel/exit.c | 2 ++
kernel/fork.c | 54 ++++++++++++++++++++++---------------------
kernel/ptrace.c | 10 ++++++++
4 files changed, 48 insertions(+), 26 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..55bec6ff5626 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -747,6 +747,14 @@ struct task_struct {

struct mm_struct *mm;
struct mm_struct *active_mm;
+ /*
+ * When we exit and ->mm (the reference pinning ->mm's address space)
+ * goes away, we stash a reference to the mm_struct itself (counted via
+ * exit_mm->mm_count) in this member.
+ * This allows us to continue using the mm_struct for security checks
+ * and such even after the task has started exiting.
+ */
+ struct mm_struct *exit_mm;

/* Per-thread vma caching: */
struct vmacache vmacache;
diff --git a/kernel/exit.c b/kernel/exit.c
index 733e80f334e7..97253ef33486 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -476,6 +476,8 @@ static void exit_mm(void)
/* more a memory barrier than a real lock */
task_lock(current);
current->mm = NULL;
+ mmgrab(mm); /* for current->exit_mm */
+ current->exit_mm = mm;
mmap_read_unlock(mm);
enter_lazy_tlb(mm, current);
task_unlock(current);
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..4942428a217c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -438,32 +438,6 @@ void put_task_stack(struct task_struct *tsk)
}
#endif

-void free_task(struct task_struct *tsk)
-{
- scs_release(tsk);
-
-#ifndef CONFIG_THREAD_INFO_IN_TASK
- /*
- * The task is finally done with both the stack and thread_info,
- * so free both.
- */
- release_task_stack(tsk);
-#else
- /*
- * If the task had a separate stack allocation, it should be gone
- * by now.
- */
- WARN_ON_ONCE(refcount_read(&tsk->stack_refcount) != 0);
-#endif
- rt_mutex_debug_task_free(tsk);
- ftrace_graph_exit_task(tsk);
- arch_release_task_struct(tsk);
- if (tsk->flags & PF_KTHREAD)
- free_kthread_struct(tsk);
- free_task_struct(tsk);
-}
-EXPORT_SYMBOL(free_task);
-
#ifdef CONFIG_MMU
static __latent_entropy int dup_mmap(struct mm_struct *mm,
struct mm_struct *oldmm)
@@ -722,6 +696,34 @@ static inline void put_signal_struct(struct signal_struct *sig)
free_signal_struct(sig);
}

+void free_task(struct task_struct *tsk)
+{
+ scs_release(tsk);
+
+#ifndef CONFIG_THREAD_INFO_IN_TASK
+ /*
+ * The task is finally done with both the stack and thread_info,
+ * so free both.
+ */
+ release_task_stack(tsk);
+#else
+ /*
+ * If the task had a separate stack allocation, it should be gone
+ * by now.
+ */
+ WARN_ON_ONCE(refcount_read(&tsk->stack_refcount) != 0);
+#endif
+ rt_mutex_debug_task_free(tsk);
+ ftrace_graph_exit_task(tsk);
+ arch_release_task_struct(tsk);
+ if (tsk->flags & PF_KTHREAD)
+ free_kthread_struct(tsk);
+ if (tsk->exit_mm)
+ mmdrop_async(tsk->exit_mm);
+ free_task_struct(tsk);
+}
+EXPORT_SYMBOL(free_task);
+
void __put_task_struct(struct task_struct *tsk)
{
WARN_ON(!tsk->exit_state);
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179508d6..0aedc6cf5bdc 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -342,7 +342,17 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
* Pairs with a write barrier in commit_creds().
*/
smp_rmb();
+ /*
+ * Look up the target task's mm_struct. If it fails because the task is
+ * exiting and has gone through exit_mm(), we can instead use ->exit_mm
+ * as long as we only use members that are preserved by an mmgrab()
+ * reference.
+ * The only case in which both ->mm and ->exit_mm can be NULL should be
+ * kernel threads.
+ */
mm = task->mm;
+ if (!mm)
+ mm = task->exit_mm;
if (mm &&
((get_dumpable(mm) != SUID_DUMP_USER) &&
!ptrace_has_cap(cred, mm->user_ns, mode)))
--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-17 07:04:35

by Jann Horn

[permalink] [raw]
Subject: [RFC PATCH resend 2/6] refcount: Move refcount_t definition into linux/types.h

I want to use refcount_t in mm_struct, but if refcount_t is defined in
linux/refcount.h, that header would have to be included in
linux/mm_types.h; that would be wasteful.

Let's move refcount_t over into linux/types.h so that includes can be
written less wastefully.

Signed-off-by: Jann Horn <[email protected]>
---
include/linux/refcount.h | 13 +------------
include/linux/types.h | 12 ++++++++++++
2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 0e3ee25eb156..fd8cf65e4e2f 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -96,22 +96,11 @@
#include <linux/bug.h>
#include <linux/compiler.h>
#include <linux/limits.h>
+#include <linux/types.h> /* refcount_t is defined here */
#include <linux/spinlock_types.h>

struct mutex;

-/**
- * struct refcount_t - variant of atomic_t specialized for reference counts
- * @refs: atomic_t counter field
- *
- * The counter saturates at REFCOUNT_SATURATED and will not move once
- * there. This avoids wrapping the counter and causing 'spurious'
- * use-after-free bugs.
- */
-typedef struct refcount_struct {
- atomic_t refs;
-} refcount_t;
-
#define REFCOUNT_INIT(n) { .refs = ATOMIC_INIT(n), }
#define REFCOUNT_MAX INT_MAX
#define REFCOUNT_SATURATED (INT_MIN / 2)
diff --git a/include/linux/types.h b/include/linux/types.h
index a147977602b5..34e4e779e767 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -175,6 +175,18 @@ typedef struct {
} atomic64_t;
#endif

+/**
+ * struct refcount_t - variant of atomic_t specialized for reference counts
+ * @refs: atomic_t counter field
+ *
+ * The counter saturates at REFCOUNT_SATURATED and will not move once
+ * there. This avoids wrapping the counter and causing 'spurious'
+ * use-after-free bugs.
+ */
+typedef struct refcount_struct {
+ atomic_t refs;
+} refcount_t;
+
struct list_head {
struct list_head *next, *prev;
};
--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-17 07:08:22

by Jann Horn

[permalink] [raw]
Subject: [RFC PATCH resend 6/6] mm: remove now-unused mmdrop_async()

The preceding patches have removed all users of mmdrop_async(); get rid of
it.

Note that on MMU, we still need async_put_work because mmput_async() uses
it, which in turn is used by binder's shrinker callback. We could claw back
those 4 words per mm if we made mmput_async() depend on
CONFIG_ANDROID_BINDER_IPC.

Signed-off-by: Jann Horn <[email protected]>
---
include/linux/mm_types.h | 2 ++
kernel/fork.c | 16 ----------------
2 files changed, 2 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 764d251966c7..8fde2068bde1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -560,7 +560,9 @@ struct mm_struct {
#ifdef CONFIG_HUGETLB_PAGE
atomic_long_t hugetlb_usage;
#endif
+#ifdef CONFIG_MMU
struct work_struct async_put_work;
+#endif
} __randomize_layout;

/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 4383bf055b40..c5f2ec544933 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -666,22 +666,6 @@ void __mmdrop(struct mm_struct *mm)
}
EXPORT_SYMBOL_GPL(__mmdrop);

-static void mmdrop_async_fn(struct work_struct *work)
-{
- struct mm_struct *mm;
-
- mm = container_of(work, struct mm_struct, async_put_work);
- __mmdrop(mm);
-}
-
-static void mmdrop_async(struct mm_struct *mm)
-{
- if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
- INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
- schedule_work(&mm->async_put_work);
- }
-}
-
static inline void free_signal_struct(struct signal_struct *sig)
{
taskstats_tgid_free(sig);
--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-17 07:48:37

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH resend 3/6] mm: Add refcount for preserving mm_struct without pgd

On Sat, Oct 17, 2020 at 01:09:12AM +0200, Jann Horn wrote:
> Currently, mm_struct has two refcounts:
>
> - mm_users: preserves everything - the mm_struct, the page tables, the
> memory mappings, and so on
> - mm_count: preserves the mm_struct and pgd
>
> However, there are three types of users of mm_struct:
>
> 1. users that want page tables, memory mappings and so on
> 2. users that want to preserve the pgd (for lazy TLB)
> 3. users that just want to keep the mm_struct itself around (e.g. for
> mmget_not_zero() or __ptrace_may_access())
>
> Dropping mm_count references can be messy because dropping mm_count to
> zero deletes the pgd, which takes the pgd_lock on x86, meaning it doesn't
> work from RCU callbacks (which run in IRQ context). In those cases,
> mmdrop_async() must be used to punt the invocation of __mmdrop() to
> workqueue context.
>
> That's fine when mmdrop_async() is a rare case, but the preceding patch
> "ptrace: Keep mm around after exit_mm() for __ptrace_may_access()" makes it
> the common case; we should probably avoid punting freeing to workqueue
> context all the time if we can avoid it?
>
> To resolve this, add a third refcount that just protects the mm_struct and
> the user_ns it points to, and which can be dropped with synchronous freeing
> from (almost) any context.
>
> Signed-off-by: Jann Horn <[email protected]>
> ---
> arch/x86/kernel/tboot.c | 2 ++
> drivers/firmware/efi/efi.c | 2 ++
> include/linux/mm_types.h | 13 +++++++++++--
> include/linux/sched/mm.h | 13 +++++++++++++
> kernel/fork.c | 14 ++++++++++----
> mm/init-mm.c | 2 ++
> 6 files changed, 40 insertions(+), 6 deletions(-)

I think mmu notifiers and the stuff in drivers/infiniband/core/ can be
converted to this as well..

Actually I kind of wonder if you should go the reverse and find the
few callers that care about the pgd and give them a new api with a
better name (mmget_pgd?).

Jason

2020-10-17 11:31:47

by Jann Horn

[permalink] [raw]
Subject: Re: [RFC PATCH resend 3/6] mm: Add refcount for preserving mm_struct without pgd

On Sat, Oct 17, 2020 at 1:21 AM Jason Gunthorpe <[email protected]> wrote:
> On Sat, Oct 17, 2020 at 01:09:12AM +0200, Jann Horn wrote:
> > Currently, mm_struct has two refcounts:
> >
> > - mm_users: preserves everything - the mm_struct, the page tables, the
> > memory mappings, and so on
> > - mm_count: preserves the mm_struct and pgd
> >
> > However, there are three types of users of mm_struct:
> >
> > 1. users that want page tables, memory mappings and so on
> > 2. users that want to preserve the pgd (for lazy TLB)
> > 3. users that just want to keep the mm_struct itself around (e.g. for
> > mmget_not_zero() or __ptrace_may_access())
> >
> > Dropping mm_count references can be messy because dropping mm_count to
> > zero deletes the pgd, which takes the pgd_lock on x86, meaning it doesn't
> > work from RCU callbacks (which run in IRQ context). In those cases,
> > mmdrop_async() must be used to punt the invocation of __mmdrop() to
> > workqueue context.
> >
> > That's fine when mmdrop_async() is a rare case, but the preceding patch
> > "ptrace: Keep mm around after exit_mm() for __ptrace_may_access()" makes it
> > the common case; we should probably avoid punting freeing to workqueue
> > context all the time if we can avoid it?
> >
> > To resolve this, add a third refcount that just protects the mm_struct and
> > the user_ns it points to, and which can be dropped with synchronous freeing
> > from (almost) any context.
> >
> > Signed-off-by: Jann Horn <[email protected]>
> > ---
> > arch/x86/kernel/tboot.c | 2 ++
> > drivers/firmware/efi/efi.c | 2 ++
> > include/linux/mm_types.h | 13 +++++++++++--
> > include/linux/sched/mm.h | 13 +++++++++++++
> > kernel/fork.c | 14 ++++++++++----
> > mm/init-mm.c | 2 ++
> > 6 files changed, 40 insertions(+), 6 deletions(-)
>
> I think mmu notifiers and the stuff in drivers/infiniband/core/ can be
> converted to this as well..
>
> Actually I kind of wonder if you should go the reverse and find the
> few callers that care about the pgd and give them a new api with a
> better name (mmget_pgd?).

Yeah, that might make more sense... as long as I'm really sure about
all the places I haven't changed. ^^

I'll try to change it as you suggested for v2.

2020-11-03 02:14:09

by Jann Horn

[permalink] [raw]
Subject: Re: [RFC PATCH resend 3/6] mm: Add refcount for preserving mm_struct without pgd

On Sat, Oct 17, 2020 at 2:30 AM Jann Horn <[email protected]> wrote:
> On Sat, Oct 17, 2020 at 1:21 AM Jason Gunthorpe <[email protected]> wrote:
> > On Sat, Oct 17, 2020 at 01:09:12AM +0200, Jann Horn wrote:
> > > Currently, mm_struct has two refcounts:
> > >
> > > - mm_users: preserves everything - the mm_struct, the page tables, the
> > > memory mappings, and so on
> > > - mm_count: preserves the mm_struct and pgd
> > >
> > > However, there are three types of users of mm_struct:
> > >
> > > 1. users that want page tables, memory mappings and so on
> > > 2. users that want to preserve the pgd (for lazy TLB)
> > > 3. users that just want to keep the mm_struct itself around (e.g. for
> > > mmget_not_zero() or __ptrace_may_access())
> > >
> > > Dropping mm_count references can be messy because dropping mm_count to
> > > zero deletes the pgd, which takes the pgd_lock on x86, meaning it doesn't
> > > work from RCU callbacks (which run in IRQ context). In those cases,
> > > mmdrop_async() must be used to punt the invocation of __mmdrop() to
> > > workqueue context.
> > >
> > > That's fine when mmdrop_async() is a rare case, but the preceding patch
> > > "ptrace: Keep mm around after exit_mm() for __ptrace_may_access()" makes it
> > > the common case; we should probably avoid punting freeing to workqueue
> > > context all the time if we can avoid it?
> > >
> > > To resolve this, add a third refcount that just protects the mm_struct and
> > > the user_ns it points to, and which can be dropped with synchronous freeing
> > > from (almost) any context.
> > >
> > > Signed-off-by: Jann Horn <[email protected]>
> > > ---
> > > arch/x86/kernel/tboot.c | 2 ++
> > > drivers/firmware/efi/efi.c | 2 ++
> > > include/linux/mm_types.h | 13 +++++++++++--
> > > include/linux/sched/mm.h | 13 +++++++++++++
> > > kernel/fork.c | 14 ++++++++++----
> > > mm/init-mm.c | 2 ++
> > > 6 files changed, 40 insertions(+), 6 deletions(-)
> >
> > I think mmu notifiers and the stuff in drivers/infiniband/core/ can be
> > converted to this as well..
> >
> > Actually I kind of wonder if you should go the reverse and find the
> > few callers that care about the pgd and give them a new api with a
> > better name (mmget_pgd?).
>
> Yeah, that might make more sense... as long as I'm really sure about
> all the places I haven't changed. ^^
>
> I'll try to change it as you suggested for v2.

Actually, no - I think it ends up being around 30 mentions of the
"take reference without PGD" function and around 35 mentions of the
"take reference with PGD" function (assuming that all the weird
powerpc stuff I don't understand needs the mm_context to not yet be
destroyed). (A decent chunk of which are all the per-arch functions
for starting secondary processors.) So I don't think doing it the way
you suggested would really make the patch(es) smaller.

And I think that it is helpful for review purposes to have separate
patches for every converted site, and leave things as-is by default.
If the semantics change for every user that is *not* touched by the
patch, that makes it really easy for mistakes to slip through.

I could try to convert more callers though?

2020-11-03 03:23:45

by Jann Horn

[permalink] [raw]
Subject: Re: [RFC PATCH resend 3/6] mm: Add refcount for preserving mm_struct without pgd

On Tue, Nov 3, 2020 at 3:11 AM Jann Horn <[email protected]> wrote:
> On Sat, Oct 17, 2020 at 2:30 AM Jann Horn <[email protected]> wrote:
> > On Sat, Oct 17, 2020 at 1:21 AM Jason Gunthorpe <[email protected]> wrote:
> > > On Sat, Oct 17, 2020 at 01:09:12AM +0200, Jann Horn wrote:
> > > > Currently, mm_struct has two refcounts:
> > > >
> > > > - mm_users: preserves everything - the mm_struct, the page tables, the
> > > > memory mappings, and so on
> > > > - mm_count: preserves the mm_struct and pgd
> > > >
> > > > However, there are three types of users of mm_struct:
> > > >
> > > > 1. users that want page tables, memory mappings and so on
> > > > 2. users that want to preserve the pgd (for lazy TLB)
> > > > 3. users that just want to keep the mm_struct itself around (e.g. for
> > > > mmget_not_zero() or __ptrace_may_access())
> > > >
> > > > Dropping mm_count references can be messy because dropping mm_count to
> > > > zero deletes the pgd, which takes the pgd_lock on x86, meaning it doesn't
> > > > work from RCU callbacks (which run in IRQ context). In those cases,
> > > > mmdrop_async() must be used to punt the invocation of __mmdrop() to
> > > > workqueue context.
> > > >
> > > > That's fine when mmdrop_async() is a rare case, but the preceding patch
> > > > "ptrace: Keep mm around after exit_mm() for __ptrace_may_access()" makes it
> > > > the common case; we should probably avoid punting freeing to workqueue
> > > > context all the time if we can avoid it?
> > > >
> > > > To resolve this, add a third refcount that just protects the mm_struct and
> > > > the user_ns it points to, and which can be dropped with synchronous freeing
> > > > from (almost) any context.
> > > >
> > > > Signed-off-by: Jann Horn <[email protected]>
> > > > ---
> > > > arch/x86/kernel/tboot.c | 2 ++
> > > > drivers/firmware/efi/efi.c | 2 ++
> > > > include/linux/mm_types.h | 13 +++++++++++--
> > > > include/linux/sched/mm.h | 13 +++++++++++++
> > > > kernel/fork.c | 14 ++++++++++----
> > > > mm/init-mm.c | 2 ++
> > > > 6 files changed, 40 insertions(+), 6 deletions(-)
> > >
> > > I think mmu notifiers and the stuff in drivers/infiniband/core/ can be
> > > converted to this as well..
> > >
> > > Actually I kind of wonder if you should go the reverse and find the
> > > few callers that care about the pgd and give them a new api with a
> > > better name (mmget_pgd?).
> >
> > Yeah, that might make more sense... as long as I'm really sure about
> > all the places I haven't changed. ^^
> >
> > I'll try to change it as you suggested for v2.
>
> Actually, no - I think it ends up being around 30 mentions of the
> "take reference without PGD" function and around 35 mentions of the
> "take reference with PGD" function (assuming that all the weird
> powerpc stuff I don't understand needs the mm_context to not yet be
> destroyed). (A decent chunk of which are all the per-arch functions
> for starting secondary processors.) So I don't think doing it the way
> you suggested would really make the patch(es) smaller.
>
> And I think that it is helpful for review purposes to have separate
> patches for every converted site, and leave things as-is by default.
> If the semantics change for every user that is *not* touched by the
> patch, that makes it really easy for mistakes to slip through.
>
> I could try to convert more callers though?

But really, I would be happiest if I could just leave all the callers
where both refcounts work as-is, and let people more familiar with the
subsystems switch them over when that actually becomes necessary. Is
that not acceptable?

2020-11-03 13:23:19

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH resend 3/6] mm: Add refcount for preserving mm_struct without pgd

On Tue, Nov 03, 2020 at 04:19:11AM +0100, Jann Horn wrote:
> On Tue, Nov 3, 2020 at 3:11 AM Jann Horn <[email protected]> wrote:
> > On Sat, Oct 17, 2020 at 2:30 AM Jann Horn <[email protected]> wrote:
> > > On Sat, Oct 17, 2020 at 1:21 AM Jason Gunthorpe <[email protected]> wrote:
> > > > On Sat, Oct 17, 2020 at 01:09:12AM +0200, Jann Horn wrote:
> > > > > Currently, mm_struct has two refcounts:
> > > > >
> > > > > - mm_users: preserves everything - the mm_struct, the page tables, the
> > > > > memory mappings, and so on
> > > > > - mm_count: preserves the mm_struct and pgd
> > > > >
> > > > > However, there are three types of users of mm_struct:
> > > > >
> > > > > 1. users that want page tables, memory mappings and so on
> > > > > 2. users that want to preserve the pgd (for lazy TLB)
> > > > > 3. users that just want to keep the mm_struct itself around (e.g. for
> > > > > mmget_not_zero() or __ptrace_may_access())
> > > > >
> > > > > Dropping mm_count references can be messy because dropping mm_count to
> > > > > zero deletes the pgd, which takes the pgd_lock on x86, meaning it doesn't
> > > > > work from RCU callbacks (which run in IRQ context). In those cases,
> > > > > mmdrop_async() must be used to punt the invocation of __mmdrop() to
> > > > > workqueue context.
> > > > >
> > > > > That's fine when mmdrop_async() is a rare case, but the preceding patch
> > > > > "ptrace: Keep mm around after exit_mm() for __ptrace_may_access()" makes it
> > > > > the common case; we should probably avoid punting freeing to workqueue
> > > > > context all the time if we can avoid it?
> > > > >
> > > > > To resolve this, add a third refcount that just protects the mm_struct and
> > > > > the user_ns it points to, and which can be dropped with synchronous freeing
> > > > > from (almost) any context.
> > > > >
> > > > > Signed-off-by: Jann Horn <[email protected]>
> > > > > arch/x86/kernel/tboot.c | 2 ++
> > > > > drivers/firmware/efi/efi.c | 2 ++
> > > > > include/linux/mm_types.h | 13 +++++++++++--
> > > > > include/linux/sched/mm.h | 13 +++++++++++++
> > > > > kernel/fork.c | 14 ++++++++++----
> > > > > mm/init-mm.c | 2 ++
> > > > > 6 files changed, 40 insertions(+), 6 deletions(-)
> > > >
> > > > I think mmu notifiers and the stuff in drivers/infiniband/core/ can be
> > > > converted to this as well..
> > > >
> > > > Actually I kind of wonder if you should go the reverse and find the
> > > > few callers that care about the pgd and give them a new api with a
> > > > better name (mmget_pgd?).
> > >
> > > Yeah, that might make more sense... as long as I'm really sure about
> > > all the places I haven't changed. ^^
> > >
> > > I'll try to change it as you suggested for v2.
> >
> > Actually, no - I think it ends up being around 30 mentions of the
> > "take reference without PGD" function and around 35 mentions of the
> > "take reference with PGD" function (assuming that all the weird
> > powerpc stuff I don't understand needs the mm_context to not yet be
> > destroyed). (A decent chunk of which are all the per-arch functions
> > for starting secondary processors.) So I don't think doing it the way
> > you suggested would really make the patch(es) smaller.
> >
> > And I think that it is helpful for review purposes to have separate
> > patches for every converted site, and leave things as-is by default.
> > If the semantics change for every user that is *not* touched by the
> > patch, that makes it really easy for mistakes to slip through.
> >
> > I could try to convert more callers though?
>
> But really, I would be happiest if I could just leave all the callers
> where both refcounts work as-is, and let people more familiar with the
> subsystems switch them over when that actually becomes necessary. Is
> that not acceptable?

Either way can work, I liked the suggestion because it suggests an
good name for the ref: 'mmget_pgd' or somesuch

What I don't like is how nonsensical the names here are becoming:
mmget/mmgrab/mm_ref

Gives no impression at the callsite what is right/wrong

Names like this:
mmget_struct
mmget_pgd
mmget_tables

Make alot more sense to me..

I think this patch needs to do something about the naming..

Jason

2020-11-03 19:20:37

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC PATCH resend 3/6] mm: Add refcount for preserving mm_struct without pgd

On 11/3/20 5:21 AM, Jason Gunthorpe wrote:
> On Tue, Nov 03, 2020 at 04:19:11AM +0100, Jann Horn wrote:
>> On Tue, Nov 3, 2020 at 3:11 AM Jann Horn <[email protected]> wrote:
>>> On Sat, Oct 17, 2020 at 2:30 AM Jann Horn <[email protected]> wrote:
>>>> On Sat, Oct 17, 2020 at 1:21 AM Jason Gunthorpe <[email protected]> wrote:
>>>>> On Sat, Oct 17, 2020 at 01:09:12AM +0200, Jann Horn wrote:
>>>>>> Currently, mm_struct has two refcounts:
...
> Either way can work, I liked the suggestion because it suggests an
> good name for the ref: 'mmget_pgd' or somesuch
>
> What I don't like is how nonsensical the names here are becoming:
> mmget/mmgrab/mm_ref
>
> Gives no impression at the callsite what is right/wrong
>
> Names like this:
> mmget_struct
> mmget_pgd
> mmget_tables
>

What?! I had just resigned myself to a bimonthly exercise, re-memorizing
the mm_struct naming correlation between grab, drop, get, put, count,
and users. And now you want to make it directly understandable? :)

> Make alot more sense to me..
>
> I think this patch needs to do something about the naming..
>

A third counter also seems like the tipping point, to me.

thanks,
--
John Hubbard
NVIDIA