2016-12-16 08:54:02

by Vegard Nossum

Subject: [PATCH 1/4] mm: add new mmgrab() helper

Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might be
a worthwhile cleanup on its own.
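
As an illustration only (struct foo and its helpers below are hypothetical,
not taken from this diff), the intended pairing after this patch is mmgrab()
against the already-existing mmdrop():

	/* Sketch; assumes the mmgrab() helper added to <linux/sched.h> below. */
	struct foo {
		struct mm_struct *mm;	/* long-lived mm pointer held by foo */
	};

	static void foo_init(struct foo *f, struct mm_struct *mm)
	{
		f->mm = mm;
		mmgrab(mm);		/* pin the mm_struct itself (mm_count) */
	}

	static void foo_release(struct foo *f)
	{
		mmdrop(f->mm);		/* balances the mmgrab() in foo_init() */
		f->mm = NULL;
	}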

Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Vegard Nossum <[email protected]>
---
arch/alpha/kernel/smp.c | 2 +-
arch/arc/kernel/smp.c | 2 +-
arch/arm/kernel/smp.c | 2 +-
arch/arm64/kernel/smp.c | 2 +-
arch/blackfin/mach-common/smp.c | 2 +-
arch/hexagon/kernel/smp.c | 2 +-
arch/ia64/kernel/setup.c | 2 +-
arch/m32r/kernel/setup.c | 2 +-
arch/metag/kernel/smp.c | 2 +-
arch/mips/kernel/traps.c | 2 +-
arch/mn10300/kernel/smp.c | 2 +-
arch/parisc/kernel/smp.c | 2 +-
arch/powerpc/kernel/smp.c | 2 +-
arch/s390/kernel/processor.c | 2 +-
arch/score/kernel/traps.c | 2 +-
arch/sh/kernel/smp.c | 2 +-
arch/sparc/kernel/leon_smp.c | 2 +-
arch/sparc/kernel/smp_64.c | 2 +-
arch/sparc/kernel/sun4d_smp.c | 2 +-
arch/sparc/kernel/sun4m_smp.c | 2 +-
arch/sparc/kernel/traps_32.c | 2 +-
arch/sparc/kernel/traps_64.c | 2 +-
arch/tile/kernel/smpboot.c | 2 +-
arch/x86/kernel/cpu/common.c | 4 ++--
arch/xtensa/kernel/smp.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_process.c | 2 +-
drivers/gpu/drm/i915/i915_gem_userptr.c | 2 +-
drivers/infiniband/hw/hfi1/file_ops.c | 2 +-
fs/proc/base.c | 4 ++--
fs/userfaultfd.c | 2 +-
include/linux/sched.h | 5 +++++
kernel/exit.c | 2 +-
kernel/futex.c | 2 +-
kernel/sched/core.c | 4 ++--
mm/khugepaged.c | 2 +-
mm/ksm.c | 2 +-
mm/mmu_context.c | 2 +-
mm/mmu_notifier.c | 2 +-
mm/oom_kill.c | 4 ++--
virt/kvm/kvm_main.c | 2 +-
40 files changed, 48 insertions(+), 43 deletions(-)

diff --git a/arch/alpha/kernel/smp.c b/arch/alpha/kernel/smp.c
index 46bf263c3153..acb4b146a607 100644
--- a/arch/alpha/kernel/smp.c
+++ b/arch/alpha/kernel/smp.c
@@ -144,7 +144,7 @@ smp_callin(void)
alpha_mv.smp_callin();

/* All kernel threads share the same mm context. */
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;

/* inform the notifiers about the new cpu */
diff --git a/arch/arc/kernel/smp.c b/arch/arc/kernel/smp.c
index 88674d972c9d..9cbc7aba3ede 100644
--- a/arch/arc/kernel/smp.c
+++ b/arch/arc/kernel/smp.c
@@ -125,7 +125,7 @@ void start_kernel_secondary(void)
setup_processor();

atomic_inc(&mm->mm_users);
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 7dd14e8395e6..c6514ce0fcbc 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -371,7 +371,7 @@ asmlinkage void secondary_start_kernel(void)
* reference and switch to it.
*/
cpu = smp_processor_id();
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 8507703dabe4..61969ea29654 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -214,7 +214,7 @@ asmlinkage void secondary_start_kernel(void)
* All kernel threads share the same mm context; grab a
* reference and switch to it.
*/
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
current->active_mm = mm;

set_my_cpu_offset(per_cpu_offset(smp_processor_id()));
diff --git a/arch/blackfin/mach-common/smp.c b/arch/blackfin/mach-common/smp.c
index 23c4ef5f8bdc..bc5617ef7128 100644
--- a/arch/blackfin/mach-common/smp.c
+++ b/arch/blackfin/mach-common/smp.c
@@ -308,7 +308,7 @@ void secondary_start_kernel(void)

/* Attach the new idle task to the global mm. */
atomic_inc(&mm->mm_users);
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
current->active_mm = mm;

preempt_disable();
diff --git a/arch/hexagon/kernel/smp.c b/arch/hexagon/kernel/smp.c
index 983bae7d2665..c02a6455839e 100644
--- a/arch/hexagon/kernel/smp.c
+++ b/arch/hexagon/kernel/smp.c
@@ -162,7 +162,7 @@ void start_secondary(void)
);

/* Set the memory struct */
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;

cpu = smp_processor_id();
diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 7ec7acc844c2..ecbff47b01f1 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -992,7 +992,7 @@ cpu_init (void)
*/
ia64_setreg(_IA64_REG_CR_DCR, ( IA64_DCR_DP | IA64_DCR_DK | IA64_DCR_DX | IA64_DCR_DR
| IA64_DCR_DA | IA64_DCR_DD | IA64_DCR_LC));
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;
BUG_ON(current->mm);

diff --git a/arch/m32r/kernel/setup.c b/arch/m32r/kernel/setup.c
index 136c69f1fb8a..b18bc0bd6544 100644
--- a/arch/m32r/kernel/setup.c
+++ b/arch/m32r/kernel/setup.c
@@ -403,7 +403,7 @@ void __init cpu_init (void)
printk(KERN_INFO "Initializing CPU#%d\n", cpu_id);

/* Set up and load the per-CPU TSS and LDT */
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;
if (current->mm)
BUG();
diff --git a/arch/metag/kernel/smp.c b/arch/metag/kernel/smp.c
index bad13232de51..af9cff547a19 100644
--- a/arch/metag/kernel/smp.c
+++ b/arch/metag/kernel/smp.c
@@ -345,7 +345,7 @@ asmlinkage void secondary_start_kernel(void)
* reference and switch to it.
*/
atomic_inc(&mm->mm_users);
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
enter_lazy_tlb(mm, current);
diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index 3905003dfe2b..e50b0e0ca44c 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -2177,7 +2177,7 @@ void per_cpu_trap_init(bool is_boot_cpu)
if (!cpu_data[cpu].asid_cache)
cpu_data[cpu].asid_cache = asid_first_version(cpu);

- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;
BUG_ON(current->mm);
enter_lazy_tlb(&init_mm, current);
diff --git a/arch/mn10300/kernel/smp.c b/arch/mn10300/kernel/smp.c
index 426173c4b0b9..e65b5cc2fa67 100644
--- a/arch/mn10300/kernel/smp.c
+++ b/arch/mn10300/kernel/smp.c
@@ -589,7 +589,7 @@ static void __init smp_cpu_init(void)
}
printk(KERN_INFO "Initializing CPU#%d\n", cpu_id);

- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;
BUG_ON(current->mm);

diff --git a/arch/parisc/kernel/smp.c b/arch/parisc/kernel/smp.c
index 75dab2871346..67b452b41ff6 100644
--- a/arch/parisc/kernel/smp.c
+++ b/arch/parisc/kernel/smp.c
@@ -279,7 +279,7 @@ smp_cpu_init(int cpunum)
set_cpu_online(cpunum, true);

/* Initialise the idle task for this CPU */
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;
BUG_ON(current->mm);
enter_lazy_tlb(&init_mm, current);
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 9c6f3fd58059..42b82364c782 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -707,7 +707,7 @@ void start_secondary(void *unused)
unsigned int cpu = smp_processor_id();
int i, base;

- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;

smp_store_cpu_info(cpu);
diff --git a/arch/s390/kernel/processor.c b/arch/s390/kernel/processor.c
index 81d0808085e6..ec9bc100895c 100644
--- a/arch/s390/kernel/processor.c
+++ b/arch/s390/kernel/processor.c
@@ -73,7 +73,7 @@ void cpu_init(void)
get_cpu_id(id);
if (machine_has_cpu_mhz)
update_cpu_mhz(NULL);
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;
BUG_ON(current->mm);
enter_lazy_tlb(&init_mm, current);
diff --git a/arch/score/kernel/traps.c b/arch/score/kernel/traps.c
index 5cea1e750cec..6f6e5a39d147 100644
--- a/arch/score/kernel/traps.c
+++ b/arch/score/kernel/traps.c
@@ -336,7 +336,7 @@ void __init trap_init(void)
set_except_vector(18, handle_dbe);
flush_icache_range(DEBUG_VECTOR_BASE_ADDR, IRQ_VECTOR_BASE_ADDR);

- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;
cpu_cache_init();
}
diff --git a/arch/sh/kernel/smp.c b/arch/sh/kernel/smp.c
index 38e7860845db..ee379c699c08 100644
--- a/arch/sh/kernel/smp.c
+++ b/arch/sh/kernel/smp.c
@@ -178,7 +178,7 @@ asmlinkage void start_secondary(void)
struct mm_struct *mm = &init_mm;

enable_mmu();
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
atomic_inc(&mm->mm_users);
current->active_mm = mm;
#ifdef CONFIG_MMU
diff --git a/arch/sparc/kernel/leon_smp.c b/arch/sparc/kernel/leon_smp.c
index 71e16f2241c2..b99d33797e1d 100644
--- a/arch/sparc/kernel/leon_smp.c
+++ b/arch/sparc/kernel/leon_smp.c
@@ -93,7 +93,7 @@ void leon_cpu_pre_online(void *arg)
: "memory" /* paranoid */);

/* Attach to the address space of init_task. */
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;

while (!cpumask_test_cpu(cpuid, &smp_commenced_mask))
diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index 8182f7caf5b1..c1d2bed22961 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -122,7 +122,7 @@ void smp_callin(void)
current_thread_info()->new_child = 0;

/* Attach to the address space of init_task. */
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;

/* inform the notifiers about the new cpu */
diff --git a/arch/sparc/kernel/sun4d_smp.c b/arch/sparc/kernel/sun4d_smp.c
index 9d98e5002a09..7b55c50eabe5 100644
--- a/arch/sparc/kernel/sun4d_smp.c
+++ b/arch/sparc/kernel/sun4d_smp.c
@@ -93,7 +93,7 @@ void sun4d_cpu_pre_online(void *arg)
show_leds(cpuid);

/* Attach to the address space of init_task. */
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;

local_ops->cache_all();
diff --git a/arch/sparc/kernel/sun4m_smp.c b/arch/sparc/kernel/sun4m_smp.c
index 278c40abce82..633c4cf6fdb0 100644
--- a/arch/sparc/kernel/sun4m_smp.c
+++ b/arch/sparc/kernel/sun4m_smp.c
@@ -59,7 +59,7 @@ void sun4m_cpu_pre_online(void *arg)
: "memory" /* paranoid */);

/* Attach to the address space of init_task. */
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;

while (!cpumask_test_cpu(cpuid, &smp_commenced_mask))
diff --git a/arch/sparc/kernel/traps_32.c b/arch/sparc/kernel/traps_32.c
index 4f21df7d4f13..ecddac5a4c96 100644
--- a/arch/sparc/kernel/traps_32.c
+++ b/arch/sparc/kernel/traps_32.c
@@ -448,7 +448,7 @@ void trap_init(void)
thread_info_offsets_are_bolixed_pete();

/* Attach to the address space of init_task. */
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;

/* NOTE: Other cpus have this done as they are started
diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
index 4094a51b1970..0dbbe40012ef 100644
--- a/arch/sparc/kernel/traps_64.c
+++ b/arch/sparc/kernel/traps_64.c
@@ -2764,6 +2764,6 @@ void __init trap_init(void)
/* Attach to the address space of init_task. On SMP we
* do this in smp.c:smp_callin for other cpus.
*/
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;
}
diff --git a/arch/tile/kernel/smpboot.c b/arch/tile/kernel/smpboot.c
index 6c0abaacec33..53ce940a5016 100644
--- a/arch/tile/kernel/smpboot.c
+++ b/arch/tile/kernel/smpboot.c
@@ -160,7 +160,7 @@ static void start_secondary(void)
__this_cpu_write(current_asid, min_asid);

/* Set up this thread as another owner of the init_mm */
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
current->active_mm = &init_mm;
if (current->mm)
BUG();
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cc9e980c68ec..b580da4582e1 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1555,7 +1555,7 @@ void cpu_init(void)
for (i = 0; i <= IO_BITMAP_LONGS; i++)
t->io_bitmap[i] = ~0UL;

- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
me->active_mm = &init_mm;
BUG_ON(me->mm);
enter_lazy_tlb(&init_mm, me);
@@ -1606,7 +1606,7 @@ void cpu_init(void)
/*
* Set up and load the per-CPU TSS and LDT
*/
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
curr->active_mm = &init_mm;
BUG_ON(curr->mm);
enter_lazy_tlb(&init_mm, curr);
diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c
index fc4ad21a5ed4..9bf5cea3bae4 100644
--- a/arch/xtensa/kernel/smp.c
+++ b/arch/xtensa/kernel/smp.c
@@ -136,7 +136,7 @@ void secondary_start_kernel(void)
/* All kernel threads share the same mm context. */

atomic_inc(&mm->mm_users);
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
enter_lazy_tlb(mm, current);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index ef7c8de7060e..ca5f2aa7232d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -262,7 +262,7 @@ static void kfd_process_notifier_release(struct mmu_notifier *mn,
* and because the mmu_notifier_unregister function also drop
* mm_count we need to take an extra count here.
*/
- atomic_inc(&p->mm->mm_count);
+ mmgrab(p->mm);
mmu_notifier_unregister_no_release(&p->mmu_notifier, p->mm);
mmu_notifier_call_srcu(&p->rcu, &kfd_process_destroy_delayed);
}
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index c6f780f5abc9..f21ca404af79 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -341,7 +341,7 @@ i915_gem_userptr_init__mm_struct(struct drm_i915_gem_object *obj)
mm->i915 = to_i915(obj->base.dev);

mm->mm = current->mm;
- atomic_inc(&current->mm->mm_count);
+ mmgrab(current->mm);

mm->mn = NULL;

diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index bd786b7bd30b..2e1a6643a910 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -185,7 +185,7 @@ static int hfi1_file_open(struct inode *inode, struct file *fp)
if (fd) {
fd->rec_cpu_num = -1; /* no cpu affinity by default */
fd->mm = current->mm;
- atomic_inc(&fd->mm->mm_count);
+ mmgrab(fd->mm);
fp->private_data = fd;
} else {
fp->private_data = NULL;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ca651ac00660..0b8ccacae8b3 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -795,7 +795,7 @@ struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode)

if (!IS_ERR_OR_NULL(mm)) {
/* ensure this mm_struct can't be freed */
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
/* but do not pin its memory */
mmput(mm);
}
@@ -1093,7 +1093,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
if (p) {
if (atomic_read(&p->mm->mm_users) > 1) {
mm = p->mm;
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
}
task_unlock(p);
}
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 85959d8324df..ffa9c7cbc5fa 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1304,7 +1304,7 @@ static struct file *userfaultfd_file_create(int flags)
ctx->released = false;
ctx->mm = current->mm;
/* prevent the mm struct to be freed */
- atomic_inc(&ctx->mm->mm_count);
+ mmgrab(ctx->mm);

file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e9c009dc3a4a..31ae1f49eebb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2872,6 +2872,11 @@ static inline unsigned long sigsp(unsigned long sp, struct ksignal *ksig)
*/
extern struct mm_struct * mm_alloc(void);

+static inline void mmgrab(struct mm_struct *mm)
+{
+ atomic_inc(&mm->mm_count);
+}
+
/* mmdrop drops the mm and the page tables */
extern void __mmdrop(struct mm_struct *);
static inline void mmdrop(struct mm_struct *mm)
diff --git a/kernel/exit.c b/kernel/exit.c
index 3076f3089919..b12753840050 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -500,7 +500,7 @@ static void exit_mm(struct task_struct *tsk)
__set_task_state(tsk, TASK_RUNNING);
down_read(&mm->mmap_sem);
}
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
BUG_ON(mm != tsk->active_mm);
/* more a memory barrier than a real lock */
task_lock(tsk);
diff --git a/kernel/futex.c b/kernel/futex.c
index 2c4be467fecd..cbe6056c17c1 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -338,7 +338,7 @@ static inline bool should_fail_futex(bool fshared)

static inline void futex_get_mm(union futex_key *key)
{
- atomic_inc(&key->private.mm->mm_count);
+ mmgrab(key->private.mm);
/*
* Ensure futex_get_mm() implies a full barrier such that
* get_futex_key() implies a full barrier. This is relied upon
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 154fd689fe02..ee1fb0070544 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2877,7 +2877,7 @@ context_switch(struct rq *rq, struct task_struct *prev,

if (!mm) {
next->active_mm = oldmm;
- atomic_inc(&oldmm->mm_count);
+ mmgrab(oldmm);
enter_lazy_tlb(oldmm, next);
} else
switch_mm_irqs_off(oldmm, mm, next);
@@ -7667,7 +7667,7 @@ void __init sched_init(void)
/*
* The boot idle thread does lazy MMU switching as well:
*/
- atomic_inc(&init_mm.mm_count);
+ mmgrab(&init_mm);
enter_lazy_tlb(&init_mm, current);

/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 87e1a7ca3846..1343271a18f1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -420,7 +420,7 @@ int __khugepaged_enter(struct mm_struct *mm)
list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
spin_unlock(&khugepaged_mm_lock);

- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
if (wakeup)
wake_up_interruptible(&khugepaged_wait);

diff --git a/mm/ksm.c b/mm/ksm.c
index 9ae6011a41f8..5a49aad9d87b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1813,7 +1813,7 @@ int __ksm_enter(struct mm_struct *mm)
spin_unlock(&ksm_mmlist_lock);

set_bit(MMF_VM_MERGEABLE, &mm->flags);
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);

if (needs_wakeup)
wake_up_interruptible(&ksm_thread_wait);
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 6f4d27c5bb32..daf67bb02b4a 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -25,7 +25,7 @@ void use_mm(struct mm_struct *mm)
task_lock(tsk);
active_mm = tsk->active_mm;
if (active_mm != mm) {
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
tsk->active_mm = mm;
}
tsk->mm = mm;
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index f4259e496f83..32bc9f2ff7eb 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -275,7 +275,7 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
mm->mmu_notifier_mm = mmu_notifier_mm;
mmu_notifier_mm = NULL;
}
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);

/*
* Serialize the update against mmu_notifier_unregister. A
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..ead093c6f2a6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -660,7 +660,7 @@ static void mark_oom_victim(struct task_struct *tsk)

/* oom_mm is bound to the signal struct life time. */
if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
- atomic_inc(&tsk->signal->oom_mm->mm_count);
+ mmgrab(tsk->signal->oom_mm);

/*
* Make sure that the task is woken up from uninterruptible sleep
@@ -877,7 +877,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)

/* Get a reference to safely compare mm after task_unlock(victim) */
mm = victim->mm;
- atomic_inc(&mm->mm_count);
+ mmgrab(mm);
/*
* We should send SIGKILL before setting TIF_MEMDIE in order to prevent
* the OOM victim from depleting the memory reserves from the user
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7f9ee2929cfe..43914b981691 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -613,7 +613,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
return ERR_PTR(-ENOMEM);

spin_lock_init(&kvm->mmu_lock);
- atomic_inc(&current->mm->mm_count);
+ mmgrab(current->mm);
kvm->mm = current->mm;
kvm_eventfd_init(kvm);
mutex_init(&kvm->lock);
--
2.11.0.1.gaa10c3f


2016-12-16 09:02:06

by Michal Hocko

Subject: Re: [PATCH 4/4] [RFC!] mm: 'struct mm_struct' reference counting debugging

On Fri 16-12-16 09:22:02, Vegard Nossum wrote:
> Reference counting bugs are hard to debug by their nature since the actual
> manifestation of one can occur very far from where the error is introduced
> (e.g. a missing get() only manifests as a use-after-free when the reference
> count prematurely drops to 0, which could be arbitrarily long after where
> the get() should have happened if there are other users). I wrote this patch
> to try to track down a suspected 'mm_struct' reference counting bug.

I definitely agree that hunting these bugs is a royal PITA, no question
about that. I am just wondering whether this has been motivated by any
particular bug recently. I do not seem to remember any such issue for
quite some time.

> The basic idea is to keep track of all references, not just with a reference
> counter, but with an actual reference _list_. Whenever you get() or put() a
> reference, you also add or remove yourself, respectively, from the reference
> list. This really helps debugging because (for example) you always put a
> specific reference, meaning that if that reference was not yours to put, you
> will notice it immediately (rather than when the reference counter goes to 0
> and you still have an active reference).

But who is the owner of the reference? A function/task? It is not all
that uncommon to take an mm reference from one context and release it
from a different one. But I might be missing your point here.

> The main interface is in <linux/mm_ref_types.h> and <linux/mm_ref.h>, while
> the implementation lives in mm/mm_ref.c. Since 'struct mm_struct' has both
> ->mm_users and ->mm_count, we introduce helpers for both of them, but use
> the same data structure for each (struct mm_ref). The low-level rules (i.e.
> the ones we have to follow, but which nobody else should really have to
> care about since they use the higher-level interface) are:
>
> - after incrementing ->mm_count you also have to call get_mm_ref()
>
> - before decrementing ->mm_count you also have to call put_mm_ref()
>
> - after incrementing ->mm_users you also have to call get_mm_users_ref()
>
> - before decrementing ->mm_users you also have to call put_mm_users_ref()
>
> The rules that most of the rest of the kernel will care about are:
>
> - functions that acquire and return a mm_struct should take a
> 'struct mm_ref *' which it can pass on to mmget()/mmgrab()/etc.
>
> - functions that release an mm_struct passed as a parameter should also
> take a 'struct mm_ref *' which it can pass on to mmput()/mmdrop()/etc.
>
> - any function that temporarily acquires a mm_struct reference should
> use MM_REF() to define an on-stack reference and pass it on to
> mmget()/mmput()/mmgrab()/mmdrop()/etc.
>
> - any structure that holds an mm_struct pointer must also include a
> 'struct mm_ref' member; when the mm_struct pointer is modified you
> would typically also call mmget()/mmgrab()/mmput()/mmdrop() and they
> should be called with this mm_ref
>
> - you can convert (for example) an on-stack reference to an in-struct
> reference using move_mm_ref(). This is semantically equivalent to
> (atomically) taking the new reference and dropping the old one, but
> doesn't actually need to modify the reference count

This all sounds way too intrusive to me so I am not really sure this is
something we really want. A nice thing for debugging for sure but I am
somehow skeptical whether it is really worth it considering how many
of those ref. count bugs we've had.

[...]
--
Michal Hocko
SUSE Labs

2016-12-16 09:26:53

by Michal Hocko

Subject: Re: [PATCH 2/4] mm: add new mmget() helper

On Fri 16-12-16 09:22:00, Vegard Nossum wrote:
> Apart from adding the helper function itself, the rest of the kernel is
> converted mechanically using:
>
> git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_users);/mmget\(\1\);/'
> git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_users);/mmget\(\&\1\);/'
>
> This is needed for a later patch that hooks into the helper, but might be
> a worthwhile cleanup on its own.

Same here, a clarification comment would be really nice:

/**
* mmget: pins the address space
*
* Makes sure that the address space of the given mm struct doesn't go
* away. This doesn't protect from freeing parts of the address space
* though.
*
* Never use this function if the time the address space is pinned is
* not bounded.
*/

>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: Vegard Nossum <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
> arch/arc/kernel/smp.c | 2 +-
> arch/blackfin/mach-common/smp.c | 2 +-
> arch/frv/mm/mmu-context.c | 2 +-
> arch/metag/kernel/smp.c | 2 +-
> arch/sh/kernel/smp.c | 2 +-
> arch/xtensa/kernel/smp.c | 2 +-
> include/linux/sched.h | 5 +++++
> kernel/fork.c | 4 ++--
> mm/swapfile.c | 10 +++++-----
> virt/kvm/async_pf.c | 2 +-
> 10 files changed, 19 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arc/kernel/smp.c b/arch/arc/kernel/smp.c
> index 9cbc7aba3ede..eec70cb71db1 100644
> --- a/arch/arc/kernel/smp.c
> +++ b/arch/arc/kernel/smp.c
> @@ -124,7 +124,7 @@ void start_kernel_secondary(void)
> /* MMU, Caches, Vector Table, Interrupts etc */
> setup_processor();
>
> - atomic_inc(&mm->mm_users);
> + mmget(mm);
> mmgrab(mm);
> current->active_mm = mm;
> cpumask_set_cpu(cpu, mm_cpumask(mm));
> diff --git a/arch/blackfin/mach-common/smp.c b/arch/blackfin/mach-common/smp.c
> index bc5617ef7128..a2e6db2ce811 100644
> --- a/arch/blackfin/mach-common/smp.c
> +++ b/arch/blackfin/mach-common/smp.c
> @@ -307,7 +307,7 @@ void secondary_start_kernel(void)
> local_irq_disable();
>
> /* Attach the new idle task to the global mm. */
> - atomic_inc(&mm->mm_users);
> + mmget(mm);
> mmgrab(mm);
> current->active_mm = mm;
>
> diff --git a/arch/frv/mm/mmu-context.c b/arch/frv/mm/mmu-context.c
> index 81757d55a5b5..3473bde77f56 100644
> --- a/arch/frv/mm/mmu-context.c
> +++ b/arch/frv/mm/mmu-context.c
> @@ -188,7 +188,7 @@ int cxn_pin_by_pid(pid_t pid)
> task_lock(tsk);
> if (tsk->mm) {
> mm = tsk->mm;
> - atomic_inc(&mm->mm_users);
> + mmget(mm);
> ret = 0;
> }
> task_unlock(tsk);
> diff --git a/arch/metag/kernel/smp.c b/arch/metag/kernel/smp.c
> index af9cff547a19..c622293254e4 100644
> --- a/arch/metag/kernel/smp.c
> +++ b/arch/metag/kernel/smp.c
> @@ -344,7 +344,7 @@ asmlinkage void secondary_start_kernel(void)
> * All kernel threads share the same mm context; grab a
> * reference and switch to it.
> */
> - atomic_inc(&mm->mm_users);
> + mmget(mm);
> mmgrab(mm);
> current->active_mm = mm;
> cpumask_set_cpu(cpu, mm_cpumask(mm));
> diff --git a/arch/sh/kernel/smp.c b/arch/sh/kernel/smp.c
> index ee379c699c08..edc4769b047e 100644
> --- a/arch/sh/kernel/smp.c
> +++ b/arch/sh/kernel/smp.c
> @@ -179,7 +179,7 @@ asmlinkage void start_secondary(void)
>
> enable_mmu();
> mmgrab(mm);
> - atomic_inc(&mm->mm_users);
> + mmget(mm);
> current->active_mm = mm;
> #ifdef CONFIG_MMU
> enter_lazy_tlb(mm, current);
> diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c
> index 9bf5cea3bae4..fcea72019df7 100644
> --- a/arch/xtensa/kernel/smp.c
> +++ b/arch/xtensa/kernel/smp.c
> @@ -135,7 +135,7 @@ void secondary_start_kernel(void)
>
> /* All kernel threads share the same mm context. */
>
> - atomic_inc(&mm->mm_users);
> + mmget(mm);
> mmgrab(mm);
> current->active_mm = mm;
> cpumask_set_cpu(cpu, mm_cpumask(mm));
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 31ae1f49eebb..2ca3e15dad3b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2899,6 +2899,11 @@ static inline void mmdrop_async(struct mm_struct *mm)
> }
> }
>
> +static inline void mmget(struct mm_struct *mm)
> +{
> + atomic_inc(&mm->mm_users);
> +}
> +
> static inline bool mmget_not_zero(struct mm_struct *mm)
> {
> return atomic_inc_not_zero(&mm->mm_users);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 997ac1d584f7..f9c32dc6ccbc 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -989,7 +989,7 @@ struct mm_struct *get_task_mm(struct task_struct *task)
> if (task->flags & PF_KTHREAD)
> mm = NULL;
> else
> - atomic_inc(&mm->mm_users);
> + mmget(mm);
> }
> task_unlock(task);
> return mm;
> @@ -1177,7 +1177,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
> vmacache_flush(tsk);
>
> if (clone_flags & CLONE_VM) {
> - atomic_inc(&oldmm->mm_users);
> + mmget(oldmm);
> mm = oldmm;
> goto good_mm;
> }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f30438970cd1..cf73169ce153 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1402,7 +1402,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
> * that.
> */
> start_mm = &init_mm;
> - atomic_inc(&init_mm.mm_users);
> + mmget(&init_mm);
>
> /*
> * Keep on scanning until all entries have gone. Usually,
> @@ -1451,7 +1451,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
> if (atomic_read(&start_mm->mm_users) == 1) {
> mmput(start_mm);
> start_mm = &init_mm;
> - atomic_inc(&init_mm.mm_users);
> + mmget(&init_mm);
> }
>
> /*
> @@ -1488,8 +1488,8 @@ int try_to_unuse(unsigned int type, bool frontswap,
> struct mm_struct *prev_mm = start_mm;
> struct mm_struct *mm;
>
> - atomic_inc(&new_start_mm->mm_users);
> - atomic_inc(&prev_mm->mm_users);
> + mmget(new_start_mm);
> + mmget(prev_mm);
> spin_lock(&mmlist_lock);
> while (swap_count(*swap_map) && !retval &&
> (p = p->next) != &start_mm->mmlist) {
> @@ -1512,7 +1512,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
>
> if (set_start_mm && *swap_map < swcount) {
> mmput(new_start_mm);
> - atomic_inc(&mm->mm_users);
> + mmget(mm);
> new_start_mm = mm;
> set_start_mm = 0;
> }
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index efeceb0a222d..9ec9cef2b207 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -200,7 +200,7 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
> work->addr = hva;
> work->arch = *arch;
> work->mm = current->mm;
> - atomic_inc(&work->mm->mm_users);
> + mmget(work->mm);
> kvm_get_kvm(work->vcpu->kvm);
>
> /* this can't really happen otherwise gfn_to_pfn_async
> --
> 2.11.0.1.gaa10c3f
>

--
Michal Hocko
SUSE Labs

2016-12-16 09:28:02

by Michal Hocko

Subject: Re: [PATCH 3/4] mm: use mmget_not_zero() helper

On Fri 16-12-16 09:22:01, Vegard Nossum wrote:
> We already have the helper; we can convert the rest of the kernel
> mechanically using:
>
> git grep -l 'atomic_inc_not_zero.*mm_users' | xargs sed -i 's/atomic_inc_not_zero(&\(.*\)->mm_users)/mmget_not_zero\(\1\)/'
>
> This is needed for a later patch that hooks into the helper, but might be
> a worthwhile cleanup on its own.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: Vegard Nossum <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
> drivers/gpu/drm/i915/i915_gem_userptr.c | 2 +-
> drivers/iommu/intel-svm.c | 2 +-
> fs/proc/base.c | 4 ++--
> fs/proc/task_mmu.c | 4 ++--
> fs/proc/task_nommu.c | 2 +-
> kernel/events/uprobes.c | 2 +-
> mm/swapfile.c | 2 +-
> 7 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index f21ca404af79..e97f9ade99fc 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -514,7 +514,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
> flags |= FOLL_WRITE;
>
> ret = -EFAULT;
> - if (atomic_inc_not_zero(&mm->mm_users)) {
> + if (mmget_not_zero(mm)) {
> down_read(&mm->mmap_sem);
> while (pinned < npages) {
> ret = get_user_pages_remote
> diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
> index cb72e0011310..51f2b228723f 100644
> --- a/drivers/iommu/intel-svm.c
> +++ b/drivers/iommu/intel-svm.c
> @@ -579,7 +579,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
> if (!svm->mm)
> goto bad_req;
> /* If the mm is already defunct, don't handle faults. */
> - if (!atomic_inc_not_zero(&svm->mm->mm_users))
> + if (!mmget_not_zero(svm->mm))
> goto bad_req;
> down_read(&svm->mm->mmap_sem);
> vma = find_extend_vma(svm->mm, address);
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 0b8ccacae8b3..87fd5bf07578 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -842,7 +842,7 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
> return -ENOMEM;
>
> copied = 0;
> - if (!atomic_inc_not_zero(&mm->mm_users))
> + if (!mmget_not_zero(mm))
> goto free;
>
> /* Maybe we should limit FOLL_FORCE to actual ptrace users? */
> @@ -950,7 +950,7 @@ static ssize_t environ_read(struct file *file, char __user *buf,
> return -ENOMEM;
>
> ret = 0;
> - if (!atomic_inc_not_zero(&mm->mm_users))
> + if (!mmget_not_zero(mm))
> goto free;
>
> down_read(&mm->mmap_sem);
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 35b92d81692f..c71975293dc8 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -167,7 +167,7 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
> return ERR_PTR(-ESRCH);
>
> mm = priv->mm;
> - if (!mm || !atomic_inc_not_zero(&mm->mm_users))
> + if (!mm || !mmget_not_zero(mm))
> return NULL;
>
> down_read(&mm->mmap_sem);
> @@ -1352,7 +1352,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
> unsigned long end_vaddr;
> int ret = 0, copied = 0;
>
> - if (!mm || !atomic_inc_not_zero(&mm->mm_users))
> + if (!mm || !mmget_not_zero(mm))
> goto out;
>
> ret = -EINVAL;
> diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
> index 37175621e890..1ef97cfcf422 100644
> --- a/fs/proc/task_nommu.c
> +++ b/fs/proc/task_nommu.c
> @@ -219,7 +219,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
> return ERR_PTR(-ESRCH);
>
> mm = priv->mm;
> - if (!mm || !atomic_inc_not_zero(&mm->mm_users))
> + if (!mm || !mmget_not_zero(mm))
> return NULL;
>
> down_read(&mm->mmap_sem);
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index f9ec9add2164..bcf0f9d77d4d 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -741,7 +741,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
> continue;
> }
>
> - if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
> + if (!mmget_not_zero(vma->vm_mm))
> continue;
>
> info = prev;
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index cf73169ce153..8c92829326cb 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1494,7 +1494,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
> while (swap_count(*swap_map) && !retval &&
> (p = p->next) != &start_mm->mmlist) {
> mm = list_entry(p, struct mm_struct, mmlist);
> - if (!atomic_inc_not_zero(&mm->mm_users))
> + if (!mmget_not_zero(mm))
> continue;
> spin_unlock(&mmlist_lock);
> mmput(prev_mm);
> --
> 2.11.0.1.gaa10c3f
>

--
Michal Hocko
SUSE Labs

2016-12-16 09:37:00

by Vegard Nossum

Subject: [PATCH 3/4] mm: use mmget_not_zero() helper

We already have the helper; we can convert the rest of the kernel
mechanically using:

git grep -l 'atomic_inc_not_zero.*mm_users' | xargs sed -i 's/atomic_inc_not_zero(&\(.*\)->mm_users)/mmget_not_zero\(\1\)/'

This is needed for a later patch that hooks into the helper, but might be
a worthwhile cleanup on its own.
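
As an illustration (walk_mm() below is a hypothetical function, but it mirrors
the shape of the call sites converted by this patch):

	static int walk_mm(struct mm_struct *mm)
	{
		if (!mm || !mmget_not_zero(mm))	/* mm_users may already be zero */
			return -ESRCH;

		down_read(&mm->mmap_sem);
		/* ... inspect the address space ... */
		up_read(&mm->mmap_sem);

		mmput(mm);			/* balances mmget_not_zero() */
		return 0;
	}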

Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Vegard Nossum <[email protected]>
---
drivers/gpu/drm/i915/i915_gem_userptr.c | 2 +-
drivers/iommu/intel-svm.c | 2 +-
fs/proc/base.c | 4 ++--
fs/proc/task_mmu.c | 4 ++--
fs/proc/task_nommu.c | 2 +-
kernel/events/uprobes.c | 2 +-
mm/swapfile.c | 2 +-
7 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index f21ca404af79..e97f9ade99fc 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -514,7 +514,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
flags |= FOLL_WRITE;

ret = -EFAULT;
- if (atomic_inc_not_zero(&mm->mm_users)) {
+ if (mmget_not_zero(mm)) {
down_read(&mm->mmap_sem);
while (pinned < npages) {
ret = get_user_pages_remote
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index cb72e0011310..51f2b228723f 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -579,7 +579,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
if (!svm->mm)
goto bad_req;
/* If the mm is already defunct, don't handle faults. */
- if (!atomic_inc_not_zero(&svm->mm->mm_users))
+ if (!mmget_not_zero(svm->mm))
goto bad_req;
down_read(&svm->mm->mmap_sem);
vma = find_extend_vma(svm->mm, address);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0b8ccacae8b3..87fd5bf07578 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -842,7 +842,7 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
return -ENOMEM;

copied = 0;
- if (!atomic_inc_not_zero(&mm->mm_users))
+ if (!mmget_not_zero(mm))
goto free;

/* Maybe we should limit FOLL_FORCE to actual ptrace users? */
@@ -950,7 +950,7 @@ static ssize_t environ_read(struct file *file, char __user *buf,
return -ENOMEM;

ret = 0;
- if (!atomic_inc_not_zero(&mm->mm_users))
+ if (!mmget_not_zero(mm))
goto free;

down_read(&mm->mmap_sem);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 35b92d81692f..c71975293dc8 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -167,7 +167,7 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
return ERR_PTR(-ESRCH);

mm = priv->mm;
- if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+ if (!mm || !mmget_not_zero(mm))
return NULL;

down_read(&mm->mmap_sem);
@@ -1352,7 +1352,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
unsigned long end_vaddr;
int ret = 0, copied = 0;

- if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+ if (!mm || !mmget_not_zero(mm))
goto out;

ret = -EINVAL;
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 37175621e890..1ef97cfcf422 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -219,7 +219,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return ERR_PTR(-ESRCH);

mm = priv->mm;
- if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+ if (!mm || !mmget_not_zero(mm))
return NULL;

down_read(&mm->mmap_sem);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f9ec9add2164..bcf0f9d77d4d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -741,7 +741,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
continue;
}

- if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
+ if (!mmget_not_zero(vma->vm_mm))
continue;

info = prev;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cf73169ce153..8c92829326cb 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1494,7 +1494,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
while (swap_count(*swap_map) && !retval &&
(p = p->next) != &start_mm->mmlist) {
mm = list_entry(p, struct mm_struct, mmlist);
- if (!atomic_inc_not_zero(&mm->mm_users))
+ if (!mmget_not_zero(mm))
continue;
spin_unlock(&mmlist_lock);
mmput(prev_mm);
--
2.11.0.1.gaa10c3f

2016-12-16 09:44:39

by Vegard Nossum

Subject: Re: [PATCH 4/4] [RFC!] mm: 'struct mm_struct' reference counting debugging

On 12/16/2016 10:01 AM, Michal Hocko wrote:
> On Fri 16-12-16 09:22:02, Vegard Nossum wrote:
>> Reference counting bugs are hard to debug by their nature since the actual
>> manifestation of one can occur very far from where the error is introduced
>> (e.g. a missing get() only manifests as a use-after-free when the reference
>> count prematurely drops to 0, which could be arbitrarily long after where
>> the get() should have happened if there are other users). I wrote this patch
>> to try to track down a suspected 'mm_struct' reference counting bug.
>
> I definitely agree that hunting these bugs is a royal PITA, no question
> about that. I am just wondering whether this has been motivated by any
> particular bug recently. I do not seem to remember any such issue for
> quite some time.

Yes, I've been hitting a use-after-free with trinity that happens when
the OOM killer reaps a task. I can reproduce it reliably within a few
seconds, but with the amount of refs and syscalls going on I haven't
been able to figure out what's actually going wrong (to put things into
perspective, the refcounts go into the thousands before eventually
dropping down to 0, and trying to trace_printk() each get/put results in
several hundred megabytes of log files).

The UAF itself (sometimes a NULL pointer deref) is on a struct file
(sometimes in the page fault path, sometimes in clone(), sometimes in
execve()), and my initial debugging led me to believe it was actually a
problem with mm_struct getting freed prematurely (hence this patch). But
disappointingly this patch didn't turn up anything so I must reevaluate
my suspicion of an mm_struct leak.

I don't think it's a bug in the OOM reaper itself, but either of the
following two patches will fix the problem (without my understanding how or
why):

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..37b14b2e2af4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -485,7 +485,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
*/
mutex_lock(&oom_lock);

- if (!down_read_trylock(&mm->mmap_sem)) {
+ if (!down_write_trylock(&mm->mmap_sem)) {
ret = false;
goto unlock_oom;
}
@@ -496,7 +496,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
* and delayed __mmput doesn't matter that much
*/
if (!mmget_not_zero(mm)) {
- up_read(&mm->mmap_sem);
+ up_write(&mm->mmap_sem);
goto unlock_oom;
}

@@ -540,7 +540,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
K(get_mm_counter(mm, MM_ANONPAGES)),
K(get_mm_counter(mm, MM_FILEPAGES)),
K(get_mm_counter(mm, MM_SHMEMPAGES)));
- up_read(&mm->mmap_sem);
+ up_write(&mm->mmap_sem);

/*
* Drop our reference but make sure the mmput slow path is called from a

--OR--

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..559aec0acd21 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -508,6 +508,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
*/
set_bit(MMF_UNSTABLE, &mm->flags);

+#if 0
tlb_gather_mmu(&tlb, mm, 0, -1);
for (vma = mm->mmap ; vma; vma = vma->vm_next) {
if (is_vm_hugetlb_page(vma))
@@ -535,6 +536,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
&details);
}
tlb_finish_mmu(&tlb, 0, -1);
+#endif
pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB,
file-rss:%lukB, shmem-rss:%lukB\n",
task_pid_nr(tsk), tsk->comm,
K(get_mm_counter(mm, MM_ANONPAGES)),

Maybe it's just the fact that we're not releasing the memory and so some
other bit of code is not able to make enough progress to trigger the
bug, although curiously, if I just move the #if 0..#endif inside
tlb_gather_mmu()..tlb_finish_mmu() itself (so just calling tlb_*()
without doing the for-loop), it still reproduces the crash.
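
That is, roughly this shape instead (a reconstruction of what I tested, not a
verbatim hunk), keeping the tlb_*() calls but skipping the unmap loop:

	tlb_gather_mmu(&tlb, mm, 0, -1);
#if 0
	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
		/* unmap loop unchanged from the hunk quoted above */
	}
#endif
	tlb_finish_mmu(&tlb, 0, -1);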

Another clue, although it might just be a coincidence, is that it seems
the VMA/file in question is always a mapping for the exe file itself
(the reason I think this might be a coincidence is that the exe file
mapping is the first one and we usually traverse VMAs starting with this
one; that doesn't mean the other VMAs aren't affected by the same
problem, just that we never hit them).

I really wanted to figure out and fix the bug myself (it's a great way
to learn, after all) instead of just sending crash logs and letting
somebody else figure it out. But maybe I have to admit defeat on this one.

>> The basic idea is to keep track of all references, not just with a reference
>> counter, but with an actual reference _list_. Whenever you get() or put() a
>> reference, you also add or remove yourself, respectively, from the reference
>> list. This really helps debugging because (for example) you always put a
>> specific reference, meaning that if that reference was not yours to put, you
>> will notice it immediately (rather than when the reference counter goes to 0
>> and you still have an active reference).
>
> But who is the owner of the reference? A function/task? It is not all
> that uncommon to take an mm reference from one context and release it
> from a different one. But I might be missing your point here.

An owner is somebody who knows the pointer and increments the reference
counter for it.

You'll notice a bunch of functions just take a temporary on-stack
reference (e.g. struct mm_struct *mm = get_task_mm(tsk); ...
mmput(&mm)), in which case it's the function that owns the reference
until the mmput.

Some functions take a reference and stash it in some heap object; an
example from the patch could be 'struct vhost_dev' (which has a ->mm
field), and it does get_task_mm() in an init function and mmput() in a
cleanup function. In this case, it's the struct which is the owner of
the reference, for as long as ->mm points to something non-NULL. This
would be an example of taking the reference in one context and releasing
it in a different one. I guess the point is that we must always release
a _specific_ reference when we decrement a reference count. Yes, it's a
number, but that number does refer to a specific reference that was
taken at some point in the past (and we should know
which reference this is, otherwise we don't actually "have it").

We may not be used to thinking of reference counts as actual places of
reference, but that's what they are, fundamentally. This patch just makes
it very explicit what the owners are and where ownership transfers take
place.
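
To make that concrete, here is a minimal sketch of the on-stack case under the
proposed API (the exact signatures are assumptions here - MM_REF(), and
get_task_mm()/mmput() taking an extra 'struct mm_ref *' - since the RFC patch
itself isn't quoted in this mail):

	void frob_task_mm(struct task_struct *tsk)
	{
		MM_REF(ref);			/* on-stack owner of the reference */
		struct mm_struct *mm = get_task_mm(tsk, &ref);

		if (!mm)
			return;
		/* ... use mm ... */
		mmput(mm, &ref);		/* drop this specific reference */
	}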

>> The main interface is in <linux/mm_ref_types.h> and <linux/mm_ref.h>, while
>> the implementation lives in mm/mm_ref.c. Since 'struct mm_struct' has both
>> ->mm_users and ->mm_count, we introduce helpers for both of them, but use
>> the same data structure for each (struct mm_ref). The low-level rules (i.e.
>> the ones we have to follow, but which nobody else should really have to
>> care about since they use the higher-level interface) are:
[...]
>
> This all sounds way too intrusive to me so I am not really sure this is
> something we really want. A nice thing for debugging for sure but I am
> somehow skeptical whether it is really worth it considering how many
> of those ref. count bugs we've had.

Yeah, I agree it's intrusive. And it did start out as just a debugging
patch, but I figured after having done all the work I might as well
slap on a changelog and submit it to see what people think.

However, it may have some value as documentation of who is the owner of
each reference and where/when those owners change. Maybe I should just
extract that knowledge and add it in as comments instead.

Thanks for your comments!


Vegard

2016-12-16 09:46:35

by Michal Hocko

Subject: Re: [PATCH 1/4] mm: add new mmgrab() helper

On Fri 16-12-16 09:21:59, Vegard Nossum wrote:
> Apart from adding the helper function itself, the rest of the kernel is
> converted mechanically using:
>
> git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
> git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'
>
> This is needed for a later patch that hooks into the helper, but might be
> a worthwhile cleanup on its own.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Signed-off-by: Vegard Nossum <[email protected]>

Yes, this makes sense to me. I usually do not like wrappers around simple
atomic operations, but the API will be more symmetric (and cscope will
work nicer as a bonus).

While you are there, a comment explaining where mmgrab should be used
would be nice:

/**
* mmgrab: pins the mm_struct
*
* Make sure that the mm struct will not get freed even after the owner
* task exits. This doesn't guarantee that the associated address space
* will still exist later on and mmget_not_zero has to be used before
* accessing it.
*
* This is a preferred way to pin the mm struct for a longer/unbounded
* time.
*/

or something along those lines.

Anyway
Acked-by: Michal Hocko <[email protected]>

> ---
> arch/alpha/kernel/smp.c | 2 +-
> arch/arc/kernel/smp.c | 2 +-
> arch/arm/kernel/smp.c | 2 +-
> arch/arm64/kernel/smp.c | 2 +-
> arch/blackfin/mach-common/smp.c | 2 +-
> arch/hexagon/kernel/smp.c | 2 +-
> arch/ia64/kernel/setup.c | 2 +-
> arch/m32r/kernel/setup.c | 2 +-
> arch/metag/kernel/smp.c | 2 +-
> arch/mips/kernel/traps.c | 2 +-
> arch/mn10300/kernel/smp.c | 2 +-
> arch/parisc/kernel/smp.c | 2 +-
> arch/powerpc/kernel/smp.c | 2 +-
> arch/s390/kernel/processor.c | 2 +-
> arch/score/kernel/traps.c | 2 +-
> arch/sh/kernel/smp.c | 2 +-
> arch/sparc/kernel/leon_smp.c | 2 +-
> arch/sparc/kernel/smp_64.c | 2 +-
> arch/sparc/kernel/sun4d_smp.c | 2 +-
> arch/sparc/kernel/sun4m_smp.c | 2 +-
> arch/sparc/kernel/traps_32.c | 2 +-
> arch/sparc/kernel/traps_64.c | 2 +-
> arch/tile/kernel/smpboot.c | 2 +-
> arch/x86/kernel/cpu/common.c | 4 ++--
> arch/xtensa/kernel/smp.c | 2 +-
> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 2 +-
> drivers/gpu/drm/i915/i915_gem_userptr.c | 2 +-
> drivers/infiniband/hw/hfi1/file_ops.c | 2 +-
> fs/proc/base.c | 4 ++--
> fs/userfaultfd.c | 2 +-
> include/linux/sched.h | 5 +++++
> kernel/exit.c | 2 +-
> kernel/futex.c | 2 +-
> kernel/sched/core.c | 4 ++--
> mm/khugepaged.c | 2 +-
> mm/ksm.c | 2 +-
> mm/mmu_context.c | 2 +-
> mm/mmu_notifier.c | 2 +-
> mm/oom_kill.c | 4 ++--
> virt/kvm/kvm_main.c | 2 +-
> 40 files changed, 48 insertions(+), 43 deletions(-)
>
> diff --git a/arch/alpha/kernel/smp.c b/arch/alpha/kernel/smp.c
> index 46bf263c3153..acb4b146a607 100644
> --- a/arch/alpha/kernel/smp.c
> +++ b/arch/alpha/kernel/smp.c
> @@ -144,7 +144,7 @@ smp_callin(void)
> alpha_mv.smp_callin();
>
> /* All kernel threads share the same mm context. */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
>
> /* inform the notifiers about the new cpu */
> diff --git a/arch/arc/kernel/smp.c b/arch/arc/kernel/smp.c
> index 88674d972c9d..9cbc7aba3ede 100644
> --- a/arch/arc/kernel/smp.c
> +++ b/arch/arc/kernel/smp.c
> @@ -125,7 +125,7 @@ void start_kernel_secondary(void)
> setup_processor();
>
> atomic_inc(&mm->mm_users);
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> current->active_mm = mm;
> cpumask_set_cpu(cpu, mm_cpumask(mm));
>
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index 7dd14e8395e6..c6514ce0fcbc 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -371,7 +371,7 @@ asmlinkage void secondary_start_kernel(void)
> * reference and switch to it.
> */
> cpu = smp_processor_id();
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> current->active_mm = mm;
> cpumask_set_cpu(cpu, mm_cpumask(mm));
>
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 8507703dabe4..61969ea29654 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -214,7 +214,7 @@ asmlinkage void secondary_start_kernel(void)
> * All kernel threads share the same mm context; grab a
> * reference and switch to it.
> */
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> current->active_mm = mm;
>
> set_my_cpu_offset(per_cpu_offset(smp_processor_id()));
> diff --git a/arch/blackfin/mach-common/smp.c b/arch/blackfin/mach-common/smp.c
> index 23c4ef5f8bdc..bc5617ef7128 100644
> --- a/arch/blackfin/mach-common/smp.c
> +++ b/arch/blackfin/mach-common/smp.c
> @@ -308,7 +308,7 @@ void secondary_start_kernel(void)
>
> /* Attach the new idle task to the global mm. */
> atomic_inc(&mm->mm_users);
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> current->active_mm = mm;
>
> preempt_disable();
> diff --git a/arch/hexagon/kernel/smp.c b/arch/hexagon/kernel/smp.c
> index 983bae7d2665..c02a6455839e 100644
> --- a/arch/hexagon/kernel/smp.c
> +++ b/arch/hexagon/kernel/smp.c
> @@ -162,7 +162,7 @@ void start_secondary(void)
> );
>
> /* Set the memory struct */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
>
> cpu = smp_processor_id();
> diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
> index 7ec7acc844c2..ecbff47b01f1 100644
> --- a/arch/ia64/kernel/setup.c
> +++ b/arch/ia64/kernel/setup.c
> @@ -992,7 +992,7 @@ cpu_init (void)
> */
> ia64_setreg(_IA64_REG_CR_DCR, ( IA64_DCR_DP | IA64_DCR_DK | IA64_DCR_DX | IA64_DCR_DR
> | IA64_DCR_DA | IA64_DCR_DD | IA64_DCR_LC));
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
> BUG_ON(current->mm);
>
> diff --git a/arch/m32r/kernel/setup.c b/arch/m32r/kernel/setup.c
> index 136c69f1fb8a..b18bc0bd6544 100644
> --- a/arch/m32r/kernel/setup.c
> +++ b/arch/m32r/kernel/setup.c
> @@ -403,7 +403,7 @@ void __init cpu_init (void)
> printk(KERN_INFO "Initializing CPU#%d\n", cpu_id);
>
> /* Set up and load the per-CPU TSS and LDT */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
> if (current->mm)
> BUG();
> diff --git a/arch/metag/kernel/smp.c b/arch/metag/kernel/smp.c
> index bad13232de51..af9cff547a19 100644
> --- a/arch/metag/kernel/smp.c
> +++ b/arch/metag/kernel/smp.c
> @@ -345,7 +345,7 @@ asmlinkage void secondary_start_kernel(void)
> * reference and switch to it.
> */
> atomic_inc(&mm->mm_users);
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> current->active_mm = mm;
> cpumask_set_cpu(cpu, mm_cpumask(mm));
> enter_lazy_tlb(mm, current);
> diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
> index 3905003dfe2b..e50b0e0ca44c 100644
> --- a/arch/mips/kernel/traps.c
> +++ b/arch/mips/kernel/traps.c
> @@ -2177,7 +2177,7 @@ void per_cpu_trap_init(bool is_boot_cpu)
> if (!cpu_data[cpu].asid_cache)
> cpu_data[cpu].asid_cache = asid_first_version(cpu);
>
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
> BUG_ON(current->mm);
> enter_lazy_tlb(&init_mm, current);
> diff --git a/arch/mn10300/kernel/smp.c b/arch/mn10300/kernel/smp.c
> index 426173c4b0b9..e65b5cc2fa67 100644
> --- a/arch/mn10300/kernel/smp.c
> +++ b/arch/mn10300/kernel/smp.c
> @@ -589,7 +589,7 @@ static void __init smp_cpu_init(void)
> }
> printk(KERN_INFO "Initializing CPU#%d\n", cpu_id);
>
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
> BUG_ON(current->mm);
>
> diff --git a/arch/parisc/kernel/smp.c b/arch/parisc/kernel/smp.c
> index 75dab2871346..67b452b41ff6 100644
> --- a/arch/parisc/kernel/smp.c
> +++ b/arch/parisc/kernel/smp.c
> @@ -279,7 +279,7 @@ smp_cpu_init(int cpunum)
> set_cpu_online(cpunum, true);
>
> /* Initialise the idle task for this CPU */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
> BUG_ON(current->mm);
> enter_lazy_tlb(&init_mm, current);
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 9c6f3fd58059..42b82364c782 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -707,7 +707,7 @@ void start_secondary(void *unused)
> unsigned int cpu = smp_processor_id();
> int i, base;
>
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
>
> smp_store_cpu_info(cpu);
> diff --git a/arch/s390/kernel/processor.c b/arch/s390/kernel/processor.c
> index 81d0808085e6..ec9bc100895c 100644
> --- a/arch/s390/kernel/processor.c
> +++ b/arch/s390/kernel/processor.c
> @@ -73,7 +73,7 @@ void cpu_init(void)
> get_cpu_id(id);
> if (machine_has_cpu_mhz)
> update_cpu_mhz(NULL);
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
> BUG_ON(current->mm);
> enter_lazy_tlb(&init_mm, current);
> diff --git a/arch/score/kernel/traps.c b/arch/score/kernel/traps.c
> index 5cea1e750cec..6f6e5a39d147 100644
> --- a/arch/score/kernel/traps.c
> +++ b/arch/score/kernel/traps.c
> @@ -336,7 +336,7 @@ void __init trap_init(void)
> set_except_vector(18, handle_dbe);
> flush_icache_range(DEBUG_VECTOR_BASE_ADDR, IRQ_VECTOR_BASE_ADDR);
>
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
> cpu_cache_init();
> }
> diff --git a/arch/sh/kernel/smp.c b/arch/sh/kernel/smp.c
> index 38e7860845db..ee379c699c08 100644
> --- a/arch/sh/kernel/smp.c
> +++ b/arch/sh/kernel/smp.c
> @@ -178,7 +178,7 @@ asmlinkage void start_secondary(void)
> struct mm_struct *mm = &init_mm;
>
> enable_mmu();
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> atomic_inc(&mm->mm_users);
> current->active_mm = mm;
> #ifdef CONFIG_MMU
> diff --git a/arch/sparc/kernel/leon_smp.c b/arch/sparc/kernel/leon_smp.c
> index 71e16f2241c2..b99d33797e1d 100644
> --- a/arch/sparc/kernel/leon_smp.c
> +++ b/arch/sparc/kernel/leon_smp.c
> @@ -93,7 +93,7 @@ void leon_cpu_pre_online(void *arg)
> : "memory" /* paranoid */);
>
> /* Attach to the address space of init_task. */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
>
> while (!cpumask_test_cpu(cpuid, &smp_commenced_mask))
> diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
> index 8182f7caf5b1..c1d2bed22961 100644
> --- a/arch/sparc/kernel/smp_64.c
> +++ b/arch/sparc/kernel/smp_64.c
> @@ -122,7 +122,7 @@ void smp_callin(void)
> current_thread_info()->new_child = 0;
>
> /* Attach to the address space of init_task. */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
>
> /* inform the notifiers about the new cpu */
> diff --git a/arch/sparc/kernel/sun4d_smp.c b/arch/sparc/kernel/sun4d_smp.c
> index 9d98e5002a09..7b55c50eabe5 100644
> --- a/arch/sparc/kernel/sun4d_smp.c
> +++ b/arch/sparc/kernel/sun4d_smp.c
> @@ -93,7 +93,7 @@ void sun4d_cpu_pre_online(void *arg)
> show_leds(cpuid);
>
> /* Attach to the address space of init_task. */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
>
> local_ops->cache_all();
> diff --git a/arch/sparc/kernel/sun4m_smp.c b/arch/sparc/kernel/sun4m_smp.c
> index 278c40abce82..633c4cf6fdb0 100644
> --- a/arch/sparc/kernel/sun4m_smp.c
> +++ b/arch/sparc/kernel/sun4m_smp.c
> @@ -59,7 +59,7 @@ void sun4m_cpu_pre_online(void *arg)
> : "memory" /* paranoid */);
>
> /* Attach to the address space of init_task. */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
>
> while (!cpumask_test_cpu(cpuid, &smp_commenced_mask))
> diff --git a/arch/sparc/kernel/traps_32.c b/arch/sparc/kernel/traps_32.c
> index 4f21df7d4f13..ecddac5a4c96 100644
> --- a/arch/sparc/kernel/traps_32.c
> +++ b/arch/sparc/kernel/traps_32.c
> @@ -448,7 +448,7 @@ void trap_init(void)
> thread_info_offsets_are_bolixed_pete();
>
> /* Attach to the address space of init_task. */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
>
> /* NOTE: Other cpus have this done as they are started
> diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
> index 4094a51b1970..0dbbe40012ef 100644
> --- a/arch/sparc/kernel/traps_64.c
> +++ b/arch/sparc/kernel/traps_64.c
> @@ -2764,6 +2764,6 @@ void __init trap_init(void)
> /* Attach to the address space of init_task. On SMP we
> * do this in smp.c:smp_callin for other cpus.
> */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
> }
> diff --git a/arch/tile/kernel/smpboot.c b/arch/tile/kernel/smpboot.c
> index 6c0abaacec33..53ce940a5016 100644
> --- a/arch/tile/kernel/smpboot.c
> +++ b/arch/tile/kernel/smpboot.c
> @@ -160,7 +160,7 @@ static void start_secondary(void)
> __this_cpu_write(current_asid, min_asid);
>
> /* Set up this thread as another owner of the init_mm */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> current->active_mm = &init_mm;
> if (current->mm)
> BUG();
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index cc9e980c68ec..b580da4582e1 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1555,7 +1555,7 @@ void cpu_init(void)
> for (i = 0; i <= IO_BITMAP_LONGS; i++)
> t->io_bitmap[i] = ~0UL;
>
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> me->active_mm = &init_mm;
> BUG_ON(me->mm);
> enter_lazy_tlb(&init_mm, me);
> @@ -1606,7 +1606,7 @@ void cpu_init(void)
> /*
> * Set up and load the per-CPU TSS and LDT
> */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> curr->active_mm = &init_mm;
> BUG_ON(curr->mm);
> enter_lazy_tlb(&init_mm, curr);
> diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c
> index fc4ad21a5ed4..9bf5cea3bae4 100644
> --- a/arch/xtensa/kernel/smp.c
> +++ b/arch/xtensa/kernel/smp.c
> @@ -136,7 +136,7 @@ void secondary_start_kernel(void)
> /* All kernel threads share the same mm context. */
>
> atomic_inc(&mm->mm_users);
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> current->active_mm = mm;
> cpumask_set_cpu(cpu, mm_cpumask(mm));
> enter_lazy_tlb(mm, current);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index ef7c8de7060e..ca5f2aa7232d 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -262,7 +262,7 @@ static void kfd_process_notifier_release(struct mmu_notifier *mn,
> * and because the mmu_notifier_unregister function also drop
> * mm_count we need to take an extra count here.
> */
> - atomic_inc(&p->mm->mm_count);
> + mmgrab(p->mm);
> mmu_notifier_unregister_no_release(&p->mmu_notifier, p->mm);
> mmu_notifier_call_srcu(&p->rcu, &kfd_process_destroy_delayed);
> }
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index c6f780f5abc9..f21ca404af79 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -341,7 +341,7 @@ i915_gem_userptr_init__mm_struct(struct drm_i915_gem_object *obj)
> mm->i915 = to_i915(obj->base.dev);
>
> mm->mm = current->mm;
> - atomic_inc(&current->mm->mm_count);
> + mmgrab(current->mm);
>
> mm->mn = NULL;
>
> diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
> index bd786b7bd30b..2e1a6643a910 100644
> --- a/drivers/infiniband/hw/hfi1/file_ops.c
> +++ b/drivers/infiniband/hw/hfi1/file_ops.c
> @@ -185,7 +185,7 @@ static int hfi1_file_open(struct inode *inode, struct file *fp)
> if (fd) {
> fd->rec_cpu_num = -1; /* no cpu affinity by default */
> fd->mm = current->mm;
> - atomic_inc(&fd->mm->mm_count);
> + mmgrab(fd->mm);
> fp->private_data = fd;
> } else {
> fp->private_data = NULL;
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ca651ac00660..0b8ccacae8b3 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -795,7 +795,7 @@ struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode)
>
> if (!IS_ERR_OR_NULL(mm)) {
> /* ensure this mm_struct can't be freed */
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> /* but do not pin its memory */
> mmput(mm);
> }
> @@ -1093,7 +1093,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> if (p) {
> if (atomic_read(&p->mm->mm_users) > 1) {
> mm = p->mm;
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> }
> task_unlock(p);
> }
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 85959d8324df..ffa9c7cbc5fa 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1304,7 +1304,7 @@ static struct file *userfaultfd_file_create(int flags)
> ctx->released = false;
> ctx->mm = current->mm;
> /* prevent the mm struct to be freed */
> - atomic_inc(&ctx->mm->mm_count);
> + mmgrab(ctx->mm);
>
> file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
> O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e9c009dc3a4a..31ae1f49eebb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2872,6 +2872,11 @@ static inline unsigned long sigsp(unsigned long sp, struct ksignal *ksig)
> */
> extern struct mm_struct * mm_alloc(void);
>
> +static inline void mmgrab(struct mm_struct *mm)
> +{
> + atomic_inc(&mm->mm_count);
> +}
> +
> /* mmdrop drops the mm and the page tables */
> extern void __mmdrop(struct mm_struct *);
> static inline void mmdrop(struct mm_struct *mm)
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 3076f3089919..b12753840050 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -500,7 +500,7 @@ static void exit_mm(struct task_struct *tsk)
> __set_task_state(tsk, TASK_RUNNING);
> down_read(&mm->mmap_sem);
> }
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> BUG_ON(mm != tsk->active_mm);
> /* more a memory barrier than a real lock */
> task_lock(tsk);
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 2c4be467fecd..cbe6056c17c1 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -338,7 +338,7 @@ static inline bool should_fail_futex(bool fshared)
>
> static inline void futex_get_mm(union futex_key *key)
> {
> - atomic_inc(&key->private.mm->mm_count);
> + mmgrab(key->private.mm);
> /*
> * Ensure futex_get_mm() implies a full barrier such that
> * get_futex_key() implies a full barrier. This is relied upon
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 154fd689fe02..ee1fb0070544 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2877,7 +2877,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>
> if (!mm) {
> next->active_mm = oldmm;
> - atomic_inc(&oldmm->mm_count);
> + mmgrab(oldmm);
> enter_lazy_tlb(oldmm, next);
> } else
> switch_mm_irqs_off(oldmm, mm, next);
> @@ -7667,7 +7667,7 @@ void __init sched_init(void)
> /*
> * The boot idle thread does lazy MMU switching as well:
> */
> - atomic_inc(&init_mm.mm_count);
> + mmgrab(&init_mm);
> enter_lazy_tlb(&init_mm, current);
>
> /*
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 87e1a7ca3846..1343271a18f1 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -420,7 +420,7 @@ int __khugepaged_enter(struct mm_struct *mm)
> list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
> spin_unlock(&khugepaged_mm_lock);
>
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> if (wakeup)
> wake_up_interruptible(&khugepaged_wait);
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 9ae6011a41f8..5a49aad9d87b 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1813,7 +1813,7 @@ int __ksm_enter(struct mm_struct *mm)
> spin_unlock(&ksm_mmlist_lock);
>
> set_bit(MMF_VM_MERGEABLE, &mm->flags);
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
>
> if (needs_wakeup)
> wake_up_interruptible(&ksm_thread_wait);
> diff --git a/mm/mmu_context.c b/mm/mmu_context.c
> index 6f4d27c5bb32..daf67bb02b4a 100644
> --- a/mm/mmu_context.c
> +++ b/mm/mmu_context.c
> @@ -25,7 +25,7 @@ void use_mm(struct mm_struct *mm)
> task_lock(tsk);
> active_mm = tsk->active_mm;
> if (active_mm != mm) {
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> tsk->active_mm = mm;
> }
> tsk->mm = mm;
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index f4259e496f83..32bc9f2ff7eb 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -275,7 +275,7 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
> mm->mmu_notifier_mm = mmu_notifier_mm;
> mmu_notifier_mm = NULL;
> }
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
>
> /*
> * Serialize the update against mmu_notifier_unregister. A
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ec9f11d4f094..ead093c6f2a6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -660,7 +660,7 @@ static void mark_oom_victim(struct task_struct *tsk)
>
> /* oom_mm is bound to the signal struct life time. */
> if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
> - atomic_inc(&tsk->signal->oom_mm->mm_count);
> + mmgrab(tsk->signal->oom_mm);
>
> /*
> * Make sure that the task is woken up from uninterruptible sleep
> @@ -877,7 +877,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>
> /* Get a reference to safely compare mm after task_unlock(victim) */
> mm = victim->mm;
> - atomic_inc(&mm->mm_count);
> + mmgrab(mm);
> /*
> * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
> * the OOM victim from depleting the memory reserves from the user
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7f9ee2929cfe..43914b981691 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -613,7 +613,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> return ERR_PTR(-ENOMEM);
>
> spin_lock_init(&kvm->mmu_lock);
> - atomic_inc(&current->mm->mm_count);
> + mmgrab(current->mm);
> kvm->mm = current->mm;
> kvm_eventfd_init(kvm);
> mutex_init(&kvm->lock);
> --
> 2.11.0.1.gaa10c3f
>

--
Michal Hocko
SUSE Labs

2016-12-16 09:46:56

by Vegard Nossum

[permalink] [raw]
Subject: [PATCH 2/4] mm: add new mmget() helper

Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_users);/mmget\(\1\);/'
git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_users);/mmget\(\&\1\);/'
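
For illustration, the net effect of those two expressions on a call site is
simply this (a representative sketch, not a hunk copied from any one file):

        /* before the conversion */
        atomic_inc(&mm->mm_users);
        atomic_inc(&init_mm.mm_users);

        /* after the conversion */
        mmget(mm);
        mmget(&init_mm);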

This is needed for a later patch that hooks into the helper, but might be
a worthwhile cleanup on its own.

Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Vegard Nossum <[email protected]>
---
arch/arc/kernel/smp.c | 2 +-
arch/blackfin/mach-common/smp.c | 2 +-
arch/frv/mm/mmu-context.c | 2 +-
arch/metag/kernel/smp.c | 2 +-
arch/sh/kernel/smp.c | 2 +-
arch/xtensa/kernel/smp.c | 2 +-
include/linux/sched.h | 5 +++++
kernel/fork.c | 4 ++--
mm/swapfile.c | 10 +++++-----
virt/kvm/async_pf.c | 2 +-
10 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/arch/arc/kernel/smp.c b/arch/arc/kernel/smp.c
index 9cbc7aba3ede..eec70cb71db1 100644
--- a/arch/arc/kernel/smp.c
+++ b/arch/arc/kernel/smp.c
@@ -124,7 +124,7 @@ void start_kernel_secondary(void)
/* MMU, Caches, Vector Table, Interrupts etc */
setup_processor();

- atomic_inc(&mm->mm_users);
+ mmget(mm);
mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/arch/blackfin/mach-common/smp.c b/arch/blackfin/mach-common/smp.c
index bc5617ef7128..a2e6db2ce811 100644
--- a/arch/blackfin/mach-common/smp.c
+++ b/arch/blackfin/mach-common/smp.c
@@ -307,7 +307,7 @@ void secondary_start_kernel(void)
local_irq_disable();

/* Attach the new idle task to the global mm. */
- atomic_inc(&mm->mm_users);
+ mmget(mm);
mmgrab(mm);
current->active_mm = mm;

diff --git a/arch/frv/mm/mmu-context.c b/arch/frv/mm/mmu-context.c
index 81757d55a5b5..3473bde77f56 100644
--- a/arch/frv/mm/mmu-context.c
+++ b/arch/frv/mm/mmu-context.c
@@ -188,7 +188,7 @@ int cxn_pin_by_pid(pid_t pid)
task_lock(tsk);
if (tsk->mm) {
mm = tsk->mm;
- atomic_inc(&mm->mm_users);
+ mmget(mm);
ret = 0;
}
task_unlock(tsk);
diff --git a/arch/metag/kernel/smp.c b/arch/metag/kernel/smp.c
index af9cff547a19..c622293254e4 100644
--- a/arch/metag/kernel/smp.c
+++ b/arch/metag/kernel/smp.c
@@ -344,7 +344,7 @@ asmlinkage void secondary_start_kernel(void)
* All kernel threads share the same mm context; grab a
* reference and switch to it.
*/
- atomic_inc(&mm->mm_users);
+ mmget(mm);
mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/arch/sh/kernel/smp.c b/arch/sh/kernel/smp.c
index ee379c699c08..edc4769b047e 100644
--- a/arch/sh/kernel/smp.c
+++ b/arch/sh/kernel/smp.c
@@ -179,7 +179,7 @@ asmlinkage void start_secondary(void)

enable_mmu();
mmgrab(mm);
- atomic_inc(&mm->mm_users);
+ mmget(mm);
current->active_mm = mm;
#ifdef CONFIG_MMU
enter_lazy_tlb(mm, current);
diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c
index 9bf5cea3bae4..fcea72019df7 100644
--- a/arch/xtensa/kernel/smp.c
+++ b/arch/xtensa/kernel/smp.c
@@ -135,7 +135,7 @@ void secondary_start_kernel(void)

/* All kernel threads share the same mm context. */

- atomic_inc(&mm->mm_users);
+ mmget(mm);
mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 31ae1f49eebb..2ca3e15dad3b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2899,6 +2899,11 @@ static inline void mmdrop_async(struct mm_struct *mm)
}
}

+static inline void mmget(struct mm_struct *mm)
+{
+ atomic_inc(&mm->mm_users);
+}
+
static inline bool mmget_not_zero(struct mm_struct *mm)
{
return atomic_inc_not_zero(&mm->mm_users);
diff --git a/kernel/fork.c b/kernel/fork.c
index 997ac1d584f7..f9c32dc6ccbc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -989,7 +989,7 @@ struct mm_struct *get_task_mm(struct task_struct *task)
if (task->flags & PF_KTHREAD)
mm = NULL;
else
- atomic_inc(&mm->mm_users);
+ mmget(mm);
}
task_unlock(task);
return mm;
@@ -1177,7 +1177,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
vmacache_flush(tsk);

if (clone_flags & CLONE_VM) {
- atomic_inc(&oldmm->mm_users);
+ mmget(oldmm);
mm = oldmm;
goto good_mm;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f30438970cd1..cf73169ce153 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1402,7 +1402,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
* that.
*/
start_mm = &init_mm;
- atomic_inc(&init_mm.mm_users);
+ mmget(&init_mm);

/*
* Keep on scanning until all entries have gone. Usually,
@@ -1451,7 +1451,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
if (atomic_read(&start_mm->mm_users) == 1) {
mmput(start_mm);
start_mm = &init_mm;
- atomic_inc(&init_mm.mm_users);
+ mmget(&init_mm);
}

/*
@@ -1488,8 +1488,8 @@ int try_to_unuse(unsigned int type, bool frontswap,
struct mm_struct *prev_mm = start_mm;
struct mm_struct *mm;

- atomic_inc(&new_start_mm->mm_users);
- atomic_inc(&prev_mm->mm_users);
+ mmget(new_start_mm);
+ mmget(prev_mm);
spin_lock(&mmlist_lock);
while (swap_count(*swap_map) && !retval &&
(p = p->next) != &start_mm->mmlist) {
@@ -1512,7 +1512,7 @@ int try_to_unuse(unsigned int type, bool frontswap,

if (set_start_mm && *swap_map < swcount) {
mmput(new_start_mm);
- atomic_inc(&mm->mm_users);
+ mmget(mm);
new_start_mm = mm;
set_start_mm = 0;
}
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index efeceb0a222d..9ec9cef2b207 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -200,7 +200,7 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
work->addr = hva;
work->arch = *arch;
work->mm = current->mm;
- atomic_inc(&work->mm->mm_users);
+ mmget(work->mm);
kvm_get_kvm(work->vcpu->kvm);

/* this can't really happen otherwise gfn_to_pfn_async
--
2.11.0.1.gaa10c3f

2016-12-16 09:56:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: add new mmgrab() helper

On Fri, Dec 16, 2016 at 09:21:59AM +0100, Vegard Nossum wrote:
> Apart from adding the helper function itself, the rest of the kernel is
> converted mechanically using:
>
> git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
> git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'
>
> This is needed for a later patch that hooks into the helper, but might be
> a worthwhile cleanup on its own.

Given the desire to replace all refcounting with a specific refcount
type, this seems to make sense.

FYI: http://www.openwall.com/lists/kernel-hardening/2016/12/07/8

But I must say mmget() vs mmgrab() is a wee bit confusing.

2016-12-16 09:59:49

by Vegard Nossum

[permalink] [raw]
Subject: [PATCH 4/4] [RFC!] mm: 'struct mm_struct' reference counting debugging

Reference counting bugs are hard to debug by their nature since the actual
manifestation of one can occur very far from where the error is introduced
(e.g. a missing get() only manifests as a use-after-free when the reference
count prematurely drops to 0, which could be arbitrarily long after the point
where the get() should have happened if there are other users). I wrote this patch
to try to track down a suspected 'mm_struct' reference counting bug.

The basic idea is to keep track of all references, not just with a reference
counter, but with an actual reference _list_. Whenever you get() or put() a
reference, you also add or remove yourself, respectively, from the reference
list. This really helps debugging because (for example) you always put a
specific reference, meaning that if that reference was not yours to put, you
will notice it immediately (rather than when the reference counter goes to 0
and you still have an active reference).
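
Roughly, the low-level tracking amounts to something like this (a simplified
sketch, not the actual mm/mm_ref.c, which is not quoted in this mail; it only
assumes the struct mm_ref fields and the mm_refs_lock spinlock introduced
further down):

        static void _get_mm_ref(struct mm_struct *mm, struct list_head *list,
                                struct mm_ref *ref)
        {
                spin_lock(&mm->mm_refs_lock);
                /* a double get() on the same mm_ref is caught immediately */
                WARN_ON(ref->state == MM_REF_ACTIVE);
                ref->state = MM_REF_ACTIVE;
                ref->pid = current->pid;
                /* the real code presumably also records a stack trace in ->trace */
                list_add(&ref->list_entry, list);
                spin_unlock(&mm->mm_refs_lock);
        }

        static void _put_mm_ref(struct mm_struct *mm, struct list_head *list,
                                struct mm_ref *ref)
        {
                spin_lock(&mm->mm_refs_lock);
                /* putting a reference that was never taken is caught right here */
                WARN_ON(ref->state != MM_REF_ACTIVE);
                ref->state = MM_REF_INACTIVE;
                list_del(&ref->list_entry);
                spin_unlock(&mm->mm_refs_lock);
        }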

The main interface is in <linux/mm_ref_types.h> and <linux/mm_ref.h>, while
the implementation lives in mm/mm_ref.c. Since 'struct mm_struct' has both
->mm_users and ->mm_count, we introduce helpers for both of them, but use
the same data structure for each (struct mm_ref). The low-level rules (i.e.
the ones we have to follow, but which nobody else should really have to
care about since everybody else uses the higher-level interface) are:

- after incrementing ->mm_count you also have to call get_mm_ref()

- before decrementing ->mm_count you also have to call put_mm_ref()

- after incrementing ->mm_users you also have to call get_mm_users_ref()

- before decrementing ->mm_users you also have to call put_mm_users_ref()

The rules that most of the rest of the kernel will care about are:

- functions that acquire and return a mm_struct should take a
'struct mm_ref *' which they can pass on to mmget()/mmgrab()/etc.

- functions that release an mm_struct passed as a parameter should also
take a 'struct mm_ref *' which they can pass on to mmput()/mmdrop()/etc.

- any function that temporarily acquires a mm_struct reference should
use MM_REF() to define an on-stack reference and pass it on to
mmget()/mmput()/mmgrab()/mmdrop()/etc.

- any structure that holds an mm_struct pointer must also include a
'struct mm_ref' member; when the mm_struct pointer is modified you
would typically also call mmget()/mmgrab()/mmput()/mmdrop() and they
should be called with this mm_ref

- you can convert (for example) an on-stack reference to an in-struct
reference using move_mm_ref(). This is semantically equivalent to
(atomically) taking the new reference and dropping the old one, but
doesn't actually need to modify the reference count
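
Putting the on-stack rule into practice looks roughly like this (the function
name and its error handling are made up for illustration; MM_REF(), the extra
mm_ref argument and the helpers are the ones added by this patch):

        static int frob_task_mm(struct task_struct *task)
        {
                MM_REF(mm_ref);         /* on-stack reference, initially inactive */
                struct mm_struct *mm;

                mm = get_task_mm(task, &mm_ref);  /* bumps ->mm_users, records mm_ref */
                if (!mm)
                        return -ESRCH;

                /* ... use mm, take mmap_sem, etc. ... */

                mmput(mm, &mm_ref);     /* drops exactly the reference recorded above */
                return 0;
        }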

I don't really have any delusions about getting this into mainline
(_especially_ not without a CONFIG_MM_REF toggle and zero impact in the =n
case), but I'm posting it in case somebody would find it useful and maybe
to start a discussion about whether this is something that can be usefully
generalized to other core data structures with complicated
reference/ownership models.

The patch really does make it very explicit who holds every reference
taken and where references are implicitly transferred, for example in
finish_task_switch() where the ownership of the reference to 'oldmm' is
implicitly transferred from 'prev->mm' to 'rq->prev_mm', or in
flush_old_exec() where the ownership of the 'bprm->mm' is implicitly
transferred from 'bprm' to 'current->mm'. These ones are a bit subtle
because there is no explicit get()/put() in the code.
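
With the reference list this hand-over has to be spelled out; in the
flush_old_exec() hunk below it becomes a single explicit call (quoted here
out of context):

        move_mm_ref(bprm->mm, &bprm->mm_ref, &current->mm_ref);
        bprm->mm = NULL;        /* We're using it now */

which records which mm_ref gives up the reference and which one takes it
over, without touching ->mm_count or ->mm_users.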

There are some users which haven't been converted by this patch (and
many more that haven't been tested) -- x86-64 defconfig should work out of
the box, though. The conversion for the rest of the kernel should be
mostly straightforward (the main challenge was fork/exec).

Thanks-to: Rik van Riel <[email protected]>
Thanks-to: Matthew Wilcox <[email protected]>
Thanks-to: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Vegard Nossum <[email protected]>
---
arch/x86/kernel/cpu/common.c | 2 +-
drivers/gpu/drm/i915/i915_gem_userptr.c | 25 +++--
drivers/vhost/vhost.c | 7 +-
drivers/vhost/vhost.h | 1 +
fs/exec.c | 20 +++-
fs/proc/array.c | 15 +--
fs/proc/base.c | 97 ++++++++++++-------
fs/proc/internal.h | 4 +-
fs/proc/task_mmu.c | 50 +++++++---
include/linux/binfmts.h | 1 +
include/linux/init_task.h | 1 +
include/linux/kvm_host.h | 3 +
include/linux/mm_ref.h | 48 ++++++++++
include/linux/mm_ref_types.h | 41 ++++++++
include/linux/mm_types.h | 8 ++
include/linux/mmu_notifier.h | 8 +-
include/linux/sched.h | 42 +++++---
kernel/cpuset.c | 22 +++--
kernel/events/core.c | 5 +-
kernel/exit.c | 5 +-
kernel/fork.c | 51 ++++++----
kernel/futex.c | 124 +++++++++++++-----------
kernel/sched/core.c | 14 ++-
kernel/sched/sched.h | 1 +
kernel/sys.c | 5 +-
kernel/trace/trace_output.c | 5 +-
kernel/tsacct.c | 5 +-
mm/Makefile | 2 +-
mm/init-mm.c | 6 ++
mm/memory.c | 5 +-
mm/mempolicy.c | 5 +-
mm/migrate.c | 5 +-
mm/mm_ref.c | 163 ++++++++++++++++++++++++++++++++
mm/mmu_context.c | 9 +-
mm/mmu_notifier.c | 20 ++--
mm/oom_kill.c | 12 ++-
mm/process_vm_access.c | 5 +-
mm/swapfile.c | 29 +++---
mm/util.c | 5 +-
virt/kvm/async_pf.c | 9 +-
virt/kvm/kvm_main.c | 16 +++-
41 files changed, 668 insertions(+), 233 deletions(-)
create mode 100644 include/linux/mm_ref.h
create mode 100644 include/linux/mm_ref_types.h
create mode 100644 mm/mm_ref.c

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b580da4582e1..edf16f695130 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1555,7 +1555,7 @@ void cpu_init(void)
for (i = 0; i <= IO_BITMAP_LONGS; i++)
t->io_bitmap[i] = ~0UL;

- mmgrab(&init_mm);
+ mmgrab(&init_mm, &me->mm_ref);
me->active_mm = &init_mm;
BUG_ON(me->mm);
enter_lazy_tlb(&init_mm, me);
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index e97f9ade99fc..498d311e1a80 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -34,8 +34,10 @@

struct i915_mm_struct {
struct mm_struct *mm;
+ struct mm_ref mm_ref;
struct drm_i915_private *i915;
struct i915_mmu_notifier *mn;
+ struct mm_ref mn_ref;
struct hlist_node node;
struct kref kref;
struct work_struct work;
@@ -159,7 +161,7 @@ static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
};

static struct i915_mmu_notifier *
-i915_mmu_notifier_create(struct mm_struct *mm)
+i915_mmu_notifier_create(struct mm_struct *mm, struct mm_ref *mm_ref)
{
struct i915_mmu_notifier *mn;
int ret;
@@ -178,7 +180,7 @@ i915_mmu_notifier_create(struct mm_struct *mm)
}

/* Protected by mmap_sem (write-lock) */
- ret = __mmu_notifier_register(&mn->mn, mm);
+ ret = __mmu_notifier_register(&mn->mn, mm, mm_ref);
if (ret) {
destroy_workqueue(mn->wq);
kfree(mn);
@@ -217,7 +219,7 @@ i915_mmu_notifier_find(struct i915_mm_struct *mm)
down_write(&mm->mm->mmap_sem);
mutex_lock(&mm->i915->mm_lock);
if ((mn = mm->mn) == NULL) {
- mn = i915_mmu_notifier_create(mm->mm);
+ mn = i915_mmu_notifier_create(mm->mm, &mm->mn_ref);
if (!IS_ERR(mn))
mm->mn = mn;
}
@@ -260,12 +262,12 @@ i915_gem_userptr_init__mmu_notifier(struct drm_i915_gem_object *obj,

static void
i915_mmu_notifier_free(struct i915_mmu_notifier *mn,
- struct mm_struct *mm)
+ struct mm_struct *mm, struct mm_ref *mm_ref)
{
if (mn == NULL)
return;

- mmu_notifier_unregister(&mn->mn, mm);
+ mmu_notifier_unregister(&mn->mn, mm, mm_ref);
destroy_workqueue(mn->wq);
kfree(mn);
}
@@ -341,9 +343,11 @@ i915_gem_userptr_init__mm_struct(struct drm_i915_gem_object *obj)
mm->i915 = to_i915(obj->base.dev);

mm->mm = current->mm;
- mmgrab(current->mm);
+ INIT_MM_REF(&mm->mm_ref);
+ mmgrab(current->mm, &mm->mm_ref);

mm->mn = NULL;
+ INIT_MM_REF(&mm->mn_ref);

/* Protected by dev_priv->mm_lock */
hash_add(dev_priv->mm_structs,
@@ -361,8 +365,8 @@ static void
__i915_mm_struct_free__worker(struct work_struct *work)
{
struct i915_mm_struct *mm = container_of(work, typeof(*mm), work);
- i915_mmu_notifier_free(mm->mn, mm->mm);
- mmdrop(mm->mm);
+ i915_mmu_notifier_free(mm->mn, mm->mm, &mm->mn_ref);
+ mmdrop(mm->mm, &mm->mm_ref);
kfree(mm);
}

@@ -508,13 +512,14 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
pvec = drm_malloc_gfp(npages, sizeof(struct page *), GFP_TEMPORARY);
if (pvec != NULL) {
struct mm_struct *mm = obj->userptr.mm->mm;
+ MM_REF(mm_ref);
unsigned int flags = 0;

if (!obj->userptr.read_only)
flags |= FOLL_WRITE;

ret = -EFAULT;
- if (mmget_not_zero(mm)) {
+ if (mmget_not_zero(mm, &mm_ref)) {
down_read(&mm->mmap_sem);
while (pinned < npages) {
ret = get_user_pages_remote
@@ -529,7 +534,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
pinned += ret;
}
up_read(&mm->mmap_sem);
- mmput(mm);
+ mmput(mm, &mm_ref);
}
}

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c6f2d89c0e97..4470abf94fe8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -407,6 +407,7 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->umem = NULL;
dev->iotlb = NULL;
dev->mm = NULL;
+ INIT_MM_REF(&dev->mm_ref);
dev->worker = NULL;
init_llist_head(&dev->work_list);
init_waitqueue_head(&dev->wait);
@@ -483,7 +484,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
}

/* No owner, become one */
- dev->mm = get_task_mm(current);
+ dev->mm = get_task_mm(current, &dev->mm_ref);
worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
if (IS_ERR(worker)) {
err = PTR_ERR(worker);
@@ -507,7 +508,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
dev->worker = NULL;
err_worker:
if (dev->mm)
- mmput(dev->mm);
+ mmput(dev->mm, &dev->mm_ref);
dev->mm = NULL;
err_mm:
return err;
@@ -639,7 +640,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
dev->worker = NULL;
}
if (dev->mm)
- mmput(dev->mm);
+ mmput(dev->mm, &dev->mm_ref);
dev->mm = NULL;
}
EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 78f3c5fc02e4..64fdcfa9cf67 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -151,6 +151,7 @@ struct vhost_msg_node {

struct vhost_dev {
struct mm_struct *mm;
+ struct mm_ref mm_ref;
struct mutex mutex;
struct vhost_virtqueue **vqs;
int nvqs;
diff --git a/fs/exec.c b/fs/exec.c
index 4e497b9ee71e..13afedb2821d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -380,7 +380,7 @@ static int bprm_mm_init(struct linux_binprm *bprm)
int err;
struct mm_struct *mm = NULL;

- bprm->mm = mm = mm_alloc();
+ bprm->mm = mm = mm_alloc(&bprm->mm_ref);
err = -ENOMEM;
if (!mm)
goto err;
@@ -394,7 +394,7 @@ static int bprm_mm_init(struct linux_binprm *bprm)
err:
if (mm) {
bprm->mm = NULL;
- mmdrop(mm);
+ mmdrop(mm, &bprm->mm_ref);
}

return err;
@@ -996,6 +996,8 @@ static int exec_mmap(struct mm_struct *mm)
{
struct task_struct *tsk;
struct mm_struct *old_mm, *active_mm;
+ MM_REF(old_mm_ref);
+ MM_REF(active_mm_ref);

/* Notify parent that we're no longer interested in the old VM */
tsk = current;
@@ -1015,9 +1017,14 @@ static int exec_mmap(struct mm_struct *mm)
up_read(&old_mm->mmap_sem);
return -EINTR;
}
+
+ move_mm_users_ref(old_mm, &current->mm_ref, &old_mm_ref);
}
task_lock(tsk);
+
active_mm = tsk->active_mm;
+ if (!old_mm)
+ move_mm_ref(active_mm, &tsk->mm_ref, &active_mm_ref);
tsk->mm = mm;
tsk->active_mm = mm;
activate_mm(active_mm, mm);
@@ -1029,10 +1036,10 @@ static int exec_mmap(struct mm_struct *mm)
BUG_ON(active_mm != old_mm);
setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
mm_update_next_owner(old_mm);
- mmput(old_mm);
+ mmput(old_mm, &old_mm_ref);
return 0;
}
- mmdrop(active_mm);
+ mmdrop(active_mm, &active_mm_ref);
return 0;
}

@@ -1258,6 +1265,7 @@ int flush_old_exec(struct linux_binprm * bprm)
if (retval)
goto out;

+ move_mm_ref(bprm->mm, &bprm->mm_ref, &current->mm_ref);
bprm->mm = NULL; /* We're using it now */

set_fs(USER_DS);
@@ -1674,6 +1682,8 @@ static int do_execveat_common(int fd, struct filename *filename,
if (!bprm)
goto out_files;

+ INIT_MM_REF(&bprm->mm_ref);
+
retval = prepare_bprm_creds(bprm);
if (retval)
goto out_free;
@@ -1760,7 +1770,7 @@ static int do_execveat_common(int fd, struct filename *filename,
out:
if (bprm->mm) {
acct_arg_size(bprm, 0);
- mmput(bprm->mm);
+ mmput(bprm->mm, &bprm->mm_ref);
}

out_unmark:
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 81818adb8e9e..3e02be82c2f4 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -367,14 +367,15 @@ static void task_cpus_allowed(struct seq_file *m, struct task_struct *task)
int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
- struct mm_struct *mm = get_task_mm(task);
+ MM_REF(mm_ref);
+ struct mm_struct *mm = get_task_mm(task, &mm_ref);

task_name(m, task);
task_state(m, ns, pid, task);

if (mm) {
task_mem(m, mm);
- mmput(mm);
+ mmput(mm, &mm_ref);
}
task_sig(m, task);
task_cap(m, task);
@@ -397,6 +398,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
int num_threads = 0;
int permitted;
struct mm_struct *mm;
+ MM_REF(mm_ref);
unsigned long long start_time;
unsigned long cmin_flt = 0, cmaj_flt = 0;
unsigned long min_flt = 0, maj_flt = 0;
@@ -409,7 +411,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
state = *get_task_state(task);
vsize = eip = esp = 0;
permitted = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS | PTRACE_MODE_NOAUDIT);
- mm = get_task_mm(task);
+ mm = get_task_mm(task, &mm_ref);
if (mm) {
vsize = task_vsize(mm);
/*
@@ -562,7 +564,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

seq_putc(m, '\n');
if (mm)
- mmput(mm);
+ mmput(mm, &mm_ref);
return 0;
}

@@ -582,11 +584,12 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
unsigned long size = 0, resident = 0, shared = 0, text = 0, data = 0;
- struct mm_struct *mm = get_task_mm(task);
+ MM_REF(mm_ref);
+ struct mm_struct *mm = get_task_mm(task, &mm_ref);

if (mm) {
size = task_statm(mm, &shared, &text, &data, &resident);
- mmput(mm);
+ mmput(mm, &mm_ref);
}
/*
* For quick read, open code by putting numbers directly
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 87fd5bf07578..9c8bbfc0ab45 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -201,6 +201,7 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
{
struct task_struct *tsk;
struct mm_struct *mm;
+ MM_REF(mm_ref);
char *page;
unsigned long count = _count;
unsigned long arg_start, arg_end, env_start, env_end;
@@ -214,7 +215,7 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
tsk = get_proc_task(file_inode(file));
if (!tsk)
return -ESRCH;
- mm = get_task_mm(tsk);
+ mm = get_task_mm(tsk, &mm_ref);
put_task_struct(tsk);
if (!mm)
return 0;
@@ -389,7 +390,7 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
out_free_page:
free_page((unsigned long)page);
out_mmput:
- mmput(mm);
+ mmput(mm, &mm_ref);
if (rv > 0)
*pos += rv;
return rv;
@@ -784,34 +785,50 @@ static const struct file_operations proc_single_file_operations = {
};


-struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode)
+struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode, struct mm_ref *mm_ref)
{
struct task_struct *task = get_proc_task(inode);
struct mm_struct *mm = ERR_PTR(-ESRCH);
+ MM_REF(tmp_ref);

if (task) {
- mm = mm_access(task, mode | PTRACE_MODE_FSCREDS);
+ mm = mm_access(task, mode | PTRACE_MODE_FSCREDS, &tmp_ref);
put_task_struct(task);

if (!IS_ERR_OR_NULL(mm)) {
/* ensure this mm_struct can't be freed */
- mmgrab(mm);
+ mmgrab(mm, mm_ref);
/* but do not pin its memory */
- mmput(mm);
+ mmput(mm, &tmp_ref);
}
}

return mm;
}

+struct mem_private {
+ struct mm_struct *mm;
+ struct mm_ref mm_ref;
+};
+
static int __mem_open(struct inode *inode, struct file *file, unsigned int mode)
{
- struct mm_struct *mm = proc_mem_open(inode, mode);
+ struct mem_private *priv;
+ struct mm_struct *mm;

- if (IS_ERR(mm))
+ priv = kmalloc(sizeof(struct mem_private), GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ INIT_MM_REF(&priv->mm_ref);
+ mm = proc_mem_open(inode, mode, &priv->mm_ref);
+ if (IS_ERR(mm)) {
+ kfree(priv);
return PTR_ERR(mm);
+ }

- file->private_data = mm;
+ priv->mm = mm;
+ file->private_data = priv;
return 0;
}

@@ -828,7 +845,9 @@ static int mem_open(struct inode *inode, struct file *file)
static ssize_t mem_rw(struct file *file, char __user *buf,
size_t count, loff_t *ppos, int write)
{
- struct mm_struct *mm = file->private_data;
+ struct mem_private *priv = file->private_data;
+ struct mm_struct *mm = priv->mm;
+ MM_REF(mm_ref);
unsigned long addr = *ppos;
ssize_t copied;
char *page;
@@ -842,7 +861,7 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
return -ENOMEM;

copied = 0;
- if (!mmget_not_zero(mm))
+ if (!mmget_not_zero(mm, &mm_ref))
goto free;

/* Maybe we should limit FOLL_FORCE to actual ptrace users? */
@@ -877,7 +896,7 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
}
*ppos = addr;

- mmput(mm);
+ mmput(mm, &mm_ref);
free:
free_page((unsigned long) page);
return copied;
@@ -913,9 +932,11 @@ loff_t mem_lseek(struct file *file, loff_t offset, int orig)

static int mem_release(struct inode *inode, struct file *file)
{
- struct mm_struct *mm = file->private_data;
+ struct mem_private *priv = file->private_data;
+ struct mm_struct *mm = priv->mm;
if (mm)
- mmdrop(mm);
+ mmdrop(mm, &priv->mm_ref);
+ kfree(priv);
return 0;
}

@@ -935,10 +956,12 @@ static int environ_open(struct inode *inode, struct file *file)
static ssize_t environ_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
+ struct mem_private *priv = file->private_data;
char *page;
unsigned long src = *ppos;
int ret = 0;
- struct mm_struct *mm = file->private_data;
+ struct mm_struct *mm = priv->mm;
+ MM_REF(mm_ref);
unsigned long env_start, env_end;

/* Ensure the process spawned far enough to have an environment. */
@@ -950,7 +973,7 @@ static ssize_t environ_read(struct file *file, char __user *buf,
return -ENOMEM;

ret = 0;
- if (!mmget_not_zero(mm))
+ if (!mmget_not_zero(mm, &mm_ref))
goto free;

down_read(&mm->mmap_sem);
@@ -988,7 +1011,7 @@ static ssize_t environ_read(struct file *file, char __user *buf,
count -= retval;
}
*ppos = src;
- mmput(mm);
+ mmput(mm, &mm_ref);

free:
free_page((unsigned long) page);
@@ -1010,7 +1033,8 @@ static int auxv_open(struct inode *inode, struct file *file)
static ssize_t auxv_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
- struct mm_struct *mm = file->private_data;
+ struct mem_private *priv = file->private_data;
+ struct mm_struct *mm = priv->mm;
unsigned int nwords = 0;

if (!mm)
@@ -1053,6 +1077,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
{
static DEFINE_MUTEX(oom_adj_mutex);
struct mm_struct *mm = NULL;
+ MM_REF(mm_ref);
struct task_struct *task;
int err = 0;

@@ -1093,7 +1118,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
if (p) {
if (atomic_read(&p->mm->mm_users) > 1) {
mm = p->mm;
- mmgrab(mm);
+ mmgrab(mm, &mm_ref);
}
task_unlock(p);
}
@@ -1129,7 +1154,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
task_unlock(p);
}
rcu_read_unlock();
- mmdrop(mm);
+ mmdrop(mm, &mm_ref);
}
err_unlock:
mutex_unlock(&oom_adj_mutex);
@@ -1875,6 +1900,7 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
unsigned long vm_start, vm_end;
bool exact_vma_exists = false;
struct mm_struct *mm = NULL;
+ MM_REF(mm_ref);
struct task_struct *task;
const struct cred *cred;
struct inode *inode;
@@ -1888,7 +1914,7 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
if (!task)
goto out_notask;

- mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
+ mm = mm_access(task, PTRACE_MODE_READ_FSCREDS, &mm_ref);
if (IS_ERR_OR_NULL(mm))
goto out;

@@ -1898,7 +1924,7 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
up_read(&mm->mmap_sem);
}

- mmput(mm);
+ mmput(mm, &mm_ref);

if (exact_vma_exists) {
if (task_dumpable(task)) {
@@ -1933,6 +1959,7 @@ static int map_files_get_link(struct dentry *dentry, struct path *path)
struct vm_area_struct *vma;
struct task_struct *task;
struct mm_struct *mm;
+ MM_REF(mm_ref);
int rc;

rc = -ENOENT;
@@ -1940,7 +1967,7 @@ static int map_files_get_link(struct dentry *dentry, struct path *path)
if (!task)
goto out;

- mm = get_task_mm(task);
+ mm = get_task_mm(task, &mm_ref);
put_task_struct(task);
if (!mm)
goto out;
@@ -1960,7 +1987,7 @@ static int map_files_get_link(struct dentry *dentry, struct path *path)
up_read(&mm->mmap_sem);

out_mmput:
- mmput(mm);
+ mmput(mm, &mm_ref);
out:
return rc;
}
@@ -2034,6 +2061,7 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
struct task_struct *task;
int result;
struct mm_struct *mm;
+ MM_REF(mm_ref);

result = -ENOENT;
task = get_proc_task(dir);
@@ -2048,7 +2076,7 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
if (dname_to_vma_addr(dentry, &vm_start, &vm_end))
goto out_put_task;

- mm = get_task_mm(task);
+ mm = get_task_mm(task, &mm_ref);
if (!mm)
goto out_put_task;

@@ -2063,7 +2091,7 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,

out_no_vma:
up_read(&mm->mmap_sem);
- mmput(mm);
+ mmput(mm, &mm_ref);
out_put_task:
put_task_struct(task);
out:
@@ -2082,6 +2110,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
struct vm_area_struct *vma;
struct task_struct *task;
struct mm_struct *mm;
+ MM_REF(mm_ref);
unsigned long nr_files, pos, i;
struct flex_array *fa = NULL;
struct map_files_info info;
@@ -2101,7 +2130,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
if (!dir_emit_dots(file, ctx))
goto out_put_task;

- mm = get_task_mm(task);
+ mm = get_task_mm(task, &mm_ref);
if (!mm)
goto out_put_task;
down_read(&mm->mmap_sem);
@@ -2132,7 +2161,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
if (fa)
flex_array_free(fa);
up_read(&mm->mmap_sem);
- mmput(mm);
+ mmput(mm, &mm_ref);
goto out_put_task;
}
for (i = 0, vma = mm->mmap, pos = 2; vma;
@@ -2164,7 +2193,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
}
if (fa)
flex_array_free(fa);
- mmput(mm);
+ mmput(mm, &mm_ref);

out_put_task:
put_task_struct(task);
@@ -2567,6 +2596,7 @@ static ssize_t proc_coredump_filter_read(struct file *file, char __user *buf,
{
struct task_struct *task = get_proc_task(file_inode(file));
struct mm_struct *mm;
+ MM_REF(mm_ref);
char buffer[PROC_NUMBUF];
size_t len;
int ret;
@@ -2575,12 +2605,12 @@ static ssize_t proc_coredump_filter_read(struct file *file, char __user *buf,
return -ESRCH;

ret = 0;
- mm = get_task_mm(task);
+ mm = get_task_mm(task, &mm_ref);
if (mm) {
len = snprintf(buffer, sizeof(buffer), "%08lx\n",
((mm->flags & MMF_DUMP_FILTER_MASK) >>
MMF_DUMP_FILTER_SHIFT));
- mmput(mm);
+ mmput(mm, &mm_ref);
ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
}

@@ -2596,6 +2626,7 @@ static ssize_t proc_coredump_filter_write(struct file *file,
{
struct task_struct *task;
struct mm_struct *mm;
+ MM_REF(mm_ref);
unsigned int val;
int ret;
int i;
@@ -2610,7 +2641,7 @@ static ssize_t proc_coredump_filter_write(struct file *file,
if (!task)
goto out_no_task;

- mm = get_task_mm(task);
+ mm = get_task_mm(task, &mm_ref);
if (!mm)
goto out_no_mm;
ret = 0;
@@ -2622,7 +2653,7 @@ static ssize_t proc_coredump_filter_write(struct file *file,
clear_bit(i + MMF_DUMP_FILTER_SHIFT, &mm->flags);
}

- mmput(mm);
+ mmput(mm, &mm_ref);
out_no_mm:
put_task_struct(task);
out_no_task:
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 5378441ec1b7..9aed2e391b15 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -280,6 +280,8 @@ struct proc_maps_private {
struct inode *inode;
struct task_struct *task;
struct mm_struct *mm;
+ struct mm_ref mm_open_ref;
+ struct mm_ref mm_start_ref;
#ifdef CONFIG_MMU
struct vm_area_struct *tail_vma;
#endif
@@ -288,7 +290,7 @@ struct proc_maps_private {
#endif
};

-struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode);
+struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode, struct mm_ref *mm_ref);

extern const struct file_operations proc_pid_maps_operations;
extern const struct file_operations proc_tid_maps_operations;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c71975293dc8..06ed5d67dd84 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -133,7 +133,7 @@ static void vma_stop(struct proc_maps_private *priv)

release_task_mempolicy(priv);
up_read(&mm->mmap_sem);
- mmput(mm);
+ mmput(mm, &priv->mm_start_ref);
}

static struct vm_area_struct *
@@ -167,7 +167,7 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
return ERR_PTR(-ESRCH);

mm = priv->mm;
- if (!mm || !mmget_not_zero(mm))
+ if (!mm || !mmget_not_zero(mm, &priv->mm_start_ref))
return NULL;

down_read(&mm->mmap_sem);
@@ -232,7 +232,9 @@ static int proc_maps_open(struct inode *inode, struct file *file,
return -ENOMEM;

priv->inode = inode;
- priv->mm = proc_mem_open(inode, PTRACE_MODE_READ);
+ INIT_MM_REF(&priv->mm_open_ref);
+ INIT_MM_REF(&priv->mm_start_ref);
+ priv->mm = proc_mem_open(inode, PTRACE_MODE_READ, &priv->mm_open_ref);
if (IS_ERR(priv->mm)) {
int err = PTR_ERR(priv->mm);

@@ -249,7 +251,7 @@ static int proc_map_release(struct inode *inode, struct file *file)
struct proc_maps_private *priv = seq->private;

if (priv->mm)
- mmdrop(priv->mm);
+ mmdrop(priv->mm, &priv->mm_open_ref);

return seq_release_private(inode, file);
}
@@ -997,6 +999,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
struct task_struct *task;
char buffer[PROC_NUMBUF];
struct mm_struct *mm;
+ MM_REF(mm_ref);
struct vm_area_struct *vma;
enum clear_refs_types type;
int itype;
@@ -1017,7 +1020,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
task = get_proc_task(file_inode(file));
if (!task)
return -ESRCH;
- mm = get_task_mm(task);
+ mm = get_task_mm(task, &mm_ref);
if (mm) {
struct clear_refs_private cp = {
.type = type,
@@ -1069,7 +1072,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
flush_tlb_mm(mm);
up_read(&mm->mmap_sem);
out_mm:
- mmput(mm);
+ mmput(mm, &mm_ref);
}
put_task_struct(task);

@@ -1340,10 +1343,17 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
* determine which areas of memory are actually mapped and llseek to
* skip over unmapped regions.
*/
+struct pagemap_private {
+ struct mm_struct *mm;
+ struct mm_ref mm_ref;
+};
+
static ssize_t pagemap_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
- struct mm_struct *mm = file->private_data;
+ struct pagemap_private *priv = file->private_data;
+ struct mm_struct *mm = priv->mm;
+ MM_REF(mm_ref);
struct pagemapread pm;
struct mm_walk pagemap_walk = {};
unsigned long src;
@@ -1352,7 +1362,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
unsigned long end_vaddr;
int ret = 0, copied = 0;

- if (!mm || !mmget_not_zero(mm))
+ if (!mm || !mmget_not_zero(mm, &mm_ref))
goto out;

ret = -EINVAL;
@@ -1427,28 +1437,40 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
out_free:
kfree(pm.buffer);
out_mm:
- mmput(mm);
+ mmput(mm, &mm_ref);
out:
return ret;
}

static int pagemap_open(struct inode *inode, struct file *file)
{
+ struct pagemap_private *priv;
struct mm_struct *mm;

- mm = proc_mem_open(inode, PTRACE_MODE_READ);
- if (IS_ERR(mm))
+ priv = kmalloc(sizeof(struct pagemap_private), GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ mm = proc_mem_open(inode, PTRACE_MODE_READ, &priv->mm_ref);
+ if (IS_ERR(mm)) {
+ kfree(priv);
return PTR_ERR(mm);
- file->private_data = mm;
+ }
+
+ priv->mm = mm;
+ file->private_data = priv;
return 0;
}

static int pagemap_release(struct inode *inode, struct file *file)
{
- struct mm_struct *mm = file->private_data;
+ struct pagemap_private *priv = file->private_data;
+ struct mm_struct *mm = priv->mm;

if (mm)
- mmdrop(mm);
+ mmdrop(mm, &priv->mm_ref);
+
+ kfree(priv);
return 0;
}

diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 1303b570b18c..8bee41838bd5 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -21,6 +21,7 @@ struct linux_binprm {
struct page *page[MAX_ARG_PAGES];
#endif
struct mm_struct *mm;
+ struct mm_ref mm_ref;
unsigned long p; /* current top of mem */
unsigned int
cred_prepared:1,/* true if creds already prepared (multiple
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 325f649d77ff..02c9ecf243d1 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -211,6 +211,7 @@ extern struct task_group root_task_group;
.cpus_allowed = CPU_MASK_ALL, \
.nr_cpus_allowed= NR_CPUS, \
.mm = NULL, \
+ .mm_ref = MM_REF_INIT(tsk.mm_ref), \
.active_mm = &init_mm, \
.restart_block = { \
.fn = do_no_restart_syscall, \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01c0b9cc3915..635d4a84f03b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -174,6 +174,7 @@ struct kvm_async_pf {
struct list_head queue;
struct kvm_vcpu *vcpu;
struct mm_struct *mm;
+ struct mm_ref mm_ref;
gva_t gva;
unsigned long addr;
struct kvm_arch_async_pf arch;
@@ -376,6 +377,7 @@ struct kvm {
spinlock_t mmu_lock;
struct mutex slots_lock;
struct mm_struct *mm; /* userspace tied to this vm */
+ struct mm_ref mm_ref;
struct kvm_memslots *memslots[KVM_ADDRESS_SPACE_NUM];
struct srcu_struct srcu;
struct srcu_struct irq_srcu;
@@ -424,6 +426,7 @@ struct kvm {

#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
struct mmu_notifier mmu_notifier;
+ struct mm_ref mmu_notifier_ref;
unsigned long mmu_notifier_seq;
long mmu_notifier_count;
#endif
diff --git a/include/linux/mm_ref.h b/include/linux/mm_ref.h
new file mode 100644
index 000000000000..0de29bd64542
--- /dev/null
+++ b/include/linux/mm_ref.h
@@ -0,0 +1,48 @@
+#ifndef LINUX_MM_REF_H
+#define LINUX_MM_REF_H
+
+#include <linux/mm_types.h>
+#include <linux/mm_ref_types.h>
+
+struct mm_struct;
+
+extern void INIT_MM_REF(struct mm_ref *ref);
+
+extern void _get_mm_ref(struct mm_struct *mm, struct list_head *list,
+ struct mm_ref *ref);
+extern void _put_mm_ref(struct mm_struct *mm, struct list_head *list,
+ struct mm_ref *ref);
+extern void _move_mm_ref(struct mm_struct *mm, struct list_head *list,
+ struct mm_ref *old_ref, struct mm_ref *new_ref);
+
+static inline void get_mm_ref(struct mm_struct *mm, struct mm_ref *ref)
+{
+ _get_mm_ref(mm, &mm->mm_count_list, ref);
+}
+
+static inline void put_mm_ref(struct mm_struct *mm, struct mm_ref *ref)
+{
+ _put_mm_ref(mm, &mm->mm_count_list, ref);
+}
+
+static inline void move_mm_ref(struct mm_struct *mm, struct mm_ref *old_ref, struct mm_ref *new_ref)
+{
+ _move_mm_ref(mm, &mm->mm_count_list, old_ref, new_ref);
+}
+
+static inline void get_mm_users_ref(struct mm_struct *mm, struct mm_ref *ref)
+{
+ _get_mm_ref(mm, &mm->mm_users_list, ref);
+}
+
+static inline void put_mm_users_ref(struct mm_struct *mm, struct mm_ref *ref)
+{
+ _put_mm_ref(mm, &mm->mm_users_list, ref);
+}
+
+static inline void move_mm_users_ref(struct mm_struct *mm, struct mm_ref *old_ref, struct mm_ref *new_ref)
+{
+ _move_mm_ref(mm, &mm->mm_users_list, old_ref, new_ref);
+}
+
+#endif
diff --git a/include/linux/mm_ref_types.h b/include/linux/mm_ref_types.h
new file mode 100644
index 000000000000..5c45995688bd
--- /dev/null
+++ b/include/linux/mm_ref_types.h
@@ -0,0 +1,41 @@
+#ifndef LINUX_MM_REF_TYPES_H
+#define LINUX_MM_REF_TYPES_H
+
+#include <linux/list.h>
+#include <linux/stacktrace.h>
+
+#define NR_MM_REF_STACK_ENTRIES 10
+
+enum mm_ref_state {
+ /*
+ * Pick 0 as uninitialized so we have a chance at catching
+ * uninitialized references by noticing that they are zero.
+ *
+ * The rest are random 32-bit integers.
+ */
+ MM_REF_UNINITIALIZED = 0,
+ MM_REF_INITIALIZED = 0x28076894UL,
+ MM_REF_ACTIVE = 0xdaf46189UL,
+ MM_REF_INACTIVE = 0xf5358bafUL,
+};
+
+struct mm_ref {
+ /*
+ * See ->mm_users_list/->mm_count_list in struct mm_struct.
+ * Access is protected by ->mm_refs_lock.
+ */
+ struct list_head list_entry;
+
+ enum mm_ref_state state;
+ int pid;
+ struct stack_trace trace;
+ unsigned long trace_entries[NR_MM_REF_STACK_ENTRIES];
+};
+
+#define MM_REF_INIT(name) \
+ { LIST_HEAD_INIT(name.list_entry), MM_REF_INITIALIZED, }
+
+#define MM_REF(name) \
+ struct mm_ref name = MM_REF_INIT(name)
+
+#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4a8acedf4b7d..520cde63305d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
#include <linux/uprobes.h>
#include <linux/page-flags-layout.h>
#include <linux/workqueue.h>
+#include <linux/mm_ref_types.h>
#include <asm/page.h>
#include <asm/mmu.h>

@@ -407,8 +408,14 @@ struct mm_struct {
unsigned long task_size; /* size of task vm space */
unsigned long highest_vm_end; /* highest vma end address */
pgd_t * pgd;
+
+ spinlock_t mm_refs_lock; /* Protects mm_users_list and mm_count_list */
atomic_t mm_users; /* How many users with user space? */
+ struct list_head mm_users_list;
+ struct mm_ref mm_users_ref;
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
+ struct list_head mm_count_list;
+
atomic_long_t nr_ptes; /* PTE page table pages */
#if CONFIG_PGTABLE_LEVELS > 2
atomic_long_t nr_pmds; /* PMD page table pages */
@@ -516,6 +523,7 @@ struct mm_struct {
atomic_long_t hugetlb_usage;
#endif
struct work_struct async_put_work;
+ struct mm_ref async_put_ref;
};

static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index a1a210d59961..e67867bec2d1 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -201,13 +201,13 @@ static inline int mm_has_notifiers(struct mm_struct *mm)
}

extern int mmu_notifier_register(struct mmu_notifier *mn,
- struct mm_struct *mm);
+ struct mm_struct *mm, struct mm_ref *mm_ref);
extern int __mmu_notifier_register(struct mmu_notifier *mn,
- struct mm_struct *mm);
+ struct mm_struct *mm, struct mm_ref *mm_ref);
extern void mmu_notifier_unregister(struct mmu_notifier *mn,
- struct mm_struct *mm);
+ struct mm_struct *mm, struct mm_ref *mm_ref);
extern void mmu_notifier_unregister_no_release(struct mmu_notifier *mn,
- struct mm_struct *mm);
+ struct mm_struct *mm, struct mm_ref *ref);
extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
extern void __mmu_notifier_release(struct mm_struct *mm);
extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2ca3e15dad3b..293c64a15dfa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -25,6 +25,7 @@ struct sched_param {
#include <linux/errno.h>
#include <linux/nodemask.h>
#include <linux/mm_types.h>
+#include <linux/mm_ref.h>
#include <linux/preempt.h>

#include <asm/page.h>
@@ -808,6 +809,7 @@ struct signal_struct {
* Only settable by CAP_SYS_RESOURCE. */
struct mm_struct *oom_mm; /* recorded mm when the thread group got
* killed by the oom killer */
+ struct mm_ref oom_mm_ref;

struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
@@ -1546,7 +1548,16 @@ struct task_struct {
struct rb_node pushable_dl_tasks;
#endif

+ /*
+ * ->mm and ->active_mm share the mm_ref. Not ideal IMHO, but that's
+ * how it's done. For kernel threads, ->mm == NULL, and for user
+ * threads, ->mm == ->active_mm, so we only need one reference.
+ *
+ * See <Documentation/vm/active_mm.txt> for more information.
+ */
struct mm_struct *mm, *active_mm;
+ struct mm_ref mm_ref;
+
/* per-thread vma caching */
u32 vmacache_seqnum;
struct vm_area_struct *vmacache[VMACACHE_SIZE];
@@ -2639,6 +2650,7 @@ extern union thread_union init_thread_union;
extern struct task_struct init_task;

extern struct mm_struct init_mm;
+extern struct mm_ref init_mm_ref;

extern struct pid_namespace init_pid_ns;

@@ -2870,17 +2882,19 @@ static inline unsigned long sigsp(unsigned long sp, struct ksignal *ksig)
/*
* Routines for handling mm_structs
*/
-extern struct mm_struct * mm_alloc(void);
+extern struct mm_struct * mm_alloc(struct mm_ref *ref);

-static inline void mmgrab(struct mm_struct *mm)
+static inline void mmgrab(struct mm_struct *mm, struct mm_ref *ref)
{
atomic_inc(&mm->mm_count);
+ get_mm_ref(mm, ref);
}

/* mmdrop drops the mm and the page tables */
extern void __mmdrop(struct mm_struct *);
-static inline void mmdrop(struct mm_struct *mm)
+static inline void mmdrop(struct mm_struct *mm, struct mm_ref *ref)
{
+ put_mm_ref(mm, ref);
if (unlikely(atomic_dec_and_test(&mm->mm_count)))
__mmdrop(mm);
}
@@ -2891,41 +2905,47 @@ static inline void mmdrop_async_fn(struct work_struct *work)
__mmdrop(mm);
}

-static inline void mmdrop_async(struct mm_struct *mm)
+static inline void mmdrop_async(struct mm_struct *mm, struct mm_ref *ref)
{
+ put_mm_ref(mm, ref);
if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
schedule_work(&mm->async_put_work);
}
}

-static inline void mmget(struct mm_struct *mm)
+static inline void mmget(struct mm_struct *mm, struct mm_ref *ref)
{
atomic_inc(&mm->mm_users);
+ get_mm_users_ref(mm, ref);
}

-static inline bool mmget_not_zero(struct mm_struct *mm)
+static inline bool mmget_not_zero(struct mm_struct *mm, struct mm_ref *ref)
{
- return atomic_inc_not_zero(&mm->mm_users);
+ bool not_zero = atomic_inc_not_zero(&mm->mm_users);
+ if (not_zero)
+ get_mm_users_ref(mm, ref);
+
+ return not_zero;
}

/* mmput gets rid of the mappings and all user-space */
-extern void mmput(struct mm_struct *);
+extern void mmput(struct mm_struct *, struct mm_ref *);
#ifdef CONFIG_MMU
/* same as above but performs the slow path from the async context. Can
* be called from the atomic context as well
*/
-extern void mmput_async(struct mm_struct *);
+extern void mmput_async(struct mm_struct *, struct mm_ref *ref);
#endif

/* Grab a reference to a task's mm, if it is not already going away */
-extern struct mm_struct *get_task_mm(struct task_struct *task);
+extern struct mm_struct *get_task_mm(struct task_struct *task, struct mm_ref *mm_ref);
/*
* Grab a reference to a task's mm, if it is not already going away
* and ptrace_may_access with the mode parameter passed to it
* succeeds.
*/
-extern struct mm_struct *mm_access(struct task_struct *task, unsigned int mode);
+extern struct mm_struct *mm_access(struct task_struct *task, unsigned int mode, struct mm_ref *mm_ref);
/* Remove the current tasks stale references to the old mm_struct */
extern void mm_release(struct task_struct *, struct mm_struct *);

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 29f815d2ef7e..66c5778f4052 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -994,6 +994,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
struct cpuset_migrate_mm_work {
struct work_struct work;
struct mm_struct *mm;
+ struct mm_ref mm_ref;
nodemask_t from;
nodemask_t to;
};
@@ -1005,24 +1006,25 @@ static void cpuset_migrate_mm_workfn(struct work_struct *work)

/* on a wq worker, no need to worry about %current's mems_allowed */
do_migrate_pages(mwork->mm, &mwork->from, &mwork->to, MPOL_MF_MOVE_ALL);
- mmput(mwork->mm);
+ mmput(mwork->mm, &mwork->mm_ref);
kfree(mwork);
}

static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
- const nodemask_t *to)
+ const nodemask_t *to, struct mm_ref *mm_ref)
{
struct cpuset_migrate_mm_work *mwork;

mwork = kzalloc(sizeof(*mwork), GFP_KERNEL);
if (mwork) {
mwork->mm = mm;
+ move_mm_users_ref(mm, mm_ref, &mwork->mm_ref);
mwork->from = *from;
mwork->to = *to;
INIT_WORK(&mwork->work, cpuset_migrate_mm_workfn);
queue_work(cpuset_migrate_mm_wq, &mwork->work);
} else {
- mmput(mm);
+ mmput(mm, mm_ref);
}
}

@@ -1107,11 +1109,12 @@ static void update_tasks_nodemask(struct cpuset *cs)
css_task_iter_start(&cs->css, &it);
while ((task = css_task_iter_next(&it))) {
struct mm_struct *mm;
+ MM_REF(mm_ref);
bool migrate;

cpuset_change_task_nodemask(task, &newmems);

- mm = get_task_mm(task);
+ mm = get_task_mm(task, &mm_ref);
if (!mm)
continue;

@@ -1119,9 +1122,9 @@ static void update_tasks_nodemask(struct cpuset *cs)

mpol_rebind_mm(mm, &cs->mems_allowed);
if (migrate)
- cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
+ cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems, &mm_ref);
else
- mmput(mm);
+ mmput(mm, &mm_ref);
}
css_task_iter_end(&it);

@@ -1556,7 +1559,8 @@ static void cpuset_attach(struct cgroup_taskset *tset)
*/
cpuset_attach_nodemask_to = cs->effective_mems;
cgroup_taskset_for_each_leader(leader, css, tset) {
- struct mm_struct *mm = get_task_mm(leader);
+ MM_REF(mm_ref);
+ struct mm_struct *mm = get_task_mm(leader, &mm_ref);

if (mm) {
mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
@@ -1571,9 +1575,9 @@ static void cpuset_attach(struct cgroup_taskset *tset)
*/
if (is_memory_migrate(cs))
cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
- &cpuset_attach_nodemask_to);
+ &cpuset_attach_nodemask_to, &mm_ref);
else
- mmput(mm);
+ mmput(mm, &mm_ref);
}
}

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 02c8421f8c01..2909d6db3b7a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7965,6 +7965,7 @@ static void perf_event_addr_filters_apply(struct perf_event *event)
struct task_struct *task = READ_ONCE(event->ctx->task);
struct perf_addr_filter *filter;
struct mm_struct *mm = NULL;
+ MM_REF(mm_ref);
unsigned int count = 0;
unsigned long flags;

@@ -7975,7 +7976,7 @@ static void perf_event_addr_filters_apply(struct perf_event *event)
if (task == TASK_TOMBSTONE)
return;

- mm = get_task_mm(event->ctx->task);
+ mm = get_task_mm(event->ctx->task, &mm_ref);
if (!mm)
goto restart;

@@ -8001,7 +8002,7 @@ static void perf_event_addr_filters_apply(struct perf_event *event)

up_read(&mm->mmap_sem);

- mmput(mm);
+ mmput(mm, &mm_ref);

restart:
perf_event_stop(event, 1);
diff --git a/kernel/exit.c b/kernel/exit.c
index b12753840050..d367ef9bcfe6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -462,6 +462,7 @@ void mm_update_next_owner(struct mm_struct *mm)
static void exit_mm(struct task_struct *tsk)
{
struct mm_struct *mm = tsk->mm;
+ MM_REF(mm_ref);
struct core_state *core_state;

mm_release(tsk, mm);
@@ -500,7 +501,7 @@ static void exit_mm(struct task_struct *tsk)
__set_task_state(tsk, TASK_RUNNING);
down_read(&mm->mmap_sem);
}
- mmgrab(mm);
+ mmgrab(mm, &mm_ref);
BUG_ON(mm != tsk->active_mm);
/* more a memory barrier than a real lock */
task_lock(tsk);
@@ -509,7 +510,7 @@ static void exit_mm(struct task_struct *tsk)
enter_lazy_tlb(mm, current);
task_unlock(tsk);
mm_update_next_owner(mm);
- mmput(mm);
+ mmput(mm, &mm_ref);
if (test_thread_flag(TIF_MEMDIE))
exit_oom_victim();
}
diff --git a/kernel/fork.c b/kernel/fork.c
index f9c32dc6ccbc..a431a52375d7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -367,7 +367,7 @@ static inline void free_signal_struct(struct signal_struct *sig)
* pgd_dtor so postpone it to the async context
*/
if (sig->oom_mm)
- mmdrop_async(sig->oom_mm);
+ mmdrop_async(sig->oom_mm, &sig->oom_mm_ref);
kmem_cache_free(signal_cachep, sig);
}

@@ -745,13 +745,22 @@ static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
#endif
}

-static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
+static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, struct mm_ref *mm_ref)
{
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
mm->vmacache_seqnum = 0;
+
atomic_set(&mm->mm_users, 1);
+ INIT_LIST_HEAD(&mm->mm_users_list);
+ INIT_MM_REF(&mm->mm_users_ref);
+
atomic_set(&mm->mm_count, 1);
+ INIT_LIST_HEAD(&mm->mm_count_list);
+
+ get_mm_ref(mm, mm_ref);
+ get_mm_users_ref(mm, &mm->mm_users_ref);
+
init_rwsem(&mm->mmap_sem);
INIT_LIST_HEAD(&mm->mmlist);
mm->core_state = NULL;
@@ -821,7 +830,7 @@ static void check_mm(struct mm_struct *mm)
/*
* Allocate and initialize an mm_struct.
*/
-struct mm_struct *mm_alloc(void)
+struct mm_struct *mm_alloc(struct mm_ref *ref)
{
struct mm_struct *mm;

@@ -830,7 +839,7 @@ struct mm_struct *mm_alloc(void)
return NULL;

memset(mm, 0, sizeof(*mm));
- return mm_init(mm, current);
+ return mm_init(mm, current, ref);
}

/*
@@ -868,16 +877,17 @@ static inline void __mmput(struct mm_struct *mm)
if (mm->binfmt)
module_put(mm->binfmt->module);
set_bit(MMF_OOM_SKIP, &mm->flags);
- mmdrop(mm);
+ mmdrop(mm, &mm->mm_users_ref);
}

/*
* Decrement the use count and release all resources for an mm.
*/
-void mmput(struct mm_struct *mm)
+void mmput(struct mm_struct *mm, struct mm_ref *ref)
{
might_sleep();

+ put_mm_users_ref(mm, ref);
if (atomic_dec_and_test(&mm->mm_users))
__mmput(mm);
}
@@ -890,8 +900,9 @@ static void mmput_async_fn(struct work_struct *work)
__mmput(mm);
}

-void mmput_async(struct mm_struct *mm)
+void mmput_async(struct mm_struct *mm, struct mm_ref *ref)
{
+ put_mm_users_ref(mm, ref);
if (atomic_dec_and_test(&mm->mm_users)) {
INIT_WORK(&mm->async_put_work, mmput_async_fn);
schedule_work(&mm->async_put_work);
@@ -979,7 +990,7 @@ EXPORT_SYMBOL(get_task_exe_file);
* bumping up the use count. User must release the mm via mmput()
* after use. Typically used by /proc and ptrace.
*/
-struct mm_struct *get_task_mm(struct task_struct *task)
+struct mm_struct *get_task_mm(struct task_struct *task, struct mm_ref *mm_ref)
{
struct mm_struct *mm;

@@ -989,14 +1000,14 @@ struct mm_struct *get_task_mm(struct task_struct *task)
if (task->flags & PF_KTHREAD)
mm = NULL;
else
- mmget(mm);
+ mmget(mm, mm_ref);
}
task_unlock(task);
return mm;
}
EXPORT_SYMBOL_GPL(get_task_mm);

-struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
+struct mm_struct *mm_access(struct task_struct *task, unsigned int mode, struct mm_ref *mm_ref)
{
struct mm_struct *mm;
int err;
@@ -1005,10 +1016,10 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
if (err)
return ERR_PTR(err);

- mm = get_task_mm(task);
+ mm = get_task_mm(task, mm_ref);
if (mm && mm != current->mm &&
!ptrace_may_access(task, mode)) {
- mmput(mm);
+ mmput(mm, mm_ref);
mm = ERR_PTR(-EACCES);
}
mutex_unlock(&task->signal->cred_guard_mutex);
@@ -1115,7 +1126,7 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
* Allocate a new mm structure and copy contents from the
* mm structure of the passed in task structure.
*/
-static struct mm_struct *dup_mm(struct task_struct *tsk)
+static struct mm_struct *dup_mm(struct task_struct *tsk, struct mm_ref *ref)
{
struct mm_struct *mm, *oldmm = current->mm;
int err;
@@ -1126,7 +1137,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)

memcpy(mm, oldmm, sizeof(*mm));

- if (!mm_init(mm, tsk))
+ if (!mm_init(mm, tsk, ref))
goto fail_nomem;

err = dup_mmap(mm, oldmm);
@@ -1144,7 +1155,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
free_pt:
/* don't put binfmt in mmput, we haven't got module yet */
mm->binfmt = NULL;
- mmput(mm);
+ mmput(mm, ref);

fail_nomem:
return NULL;
@@ -1163,6 +1174,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)

tsk->mm = NULL;
tsk->active_mm = NULL;
+ INIT_MM_REF(&tsk->mm_ref);

/*
* Are we cloning a kernel thread?
@@ -1177,13 +1189,13 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
vmacache_flush(tsk);

if (clone_flags & CLONE_VM) {
- mmget(oldmm);
+ mmget(oldmm, &tsk->mm_ref);
mm = oldmm;
goto good_mm;
}

retval = -ENOMEM;
- mm = dup_mm(tsk);
+ mm = dup_mm(tsk, &tsk->mm_ref);
if (!mm)
goto fail_nomem;

@@ -1360,6 +1372,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_score_adj = current->signal->oom_score_adj;
sig->oom_score_adj_min = current->signal->oom_score_adj_min;

+ sig->oom_mm = NULL;
+ INIT_MM_REF(&sig->oom_mm_ref);
+
sig->has_child_subreaper = current->signal->has_child_subreaper ||
current->signal->is_child_subreaper;

@@ -1839,7 +1854,7 @@ static __latent_entropy struct task_struct *copy_process(
exit_task_namespaces(p);
bad_fork_cleanup_mm:
if (p->mm)
- mmput(p->mm);
+ mmput(p->mm, &p->mm_ref);
bad_fork_cleanup_signal:
if (!(clone_flags & CLONE_THREAD))
free_signal_struct(p->signal);
diff --git a/kernel/futex.c b/kernel/futex.c
index cbe6056c17c1..3a279ee2166b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -240,6 +240,7 @@ struct futex_q {
struct task_struct *task;
spinlock_t *lock_ptr;
union futex_key key;
+ struct mm_ref mm_ref;
struct futex_pi_state *pi_state;
struct rt_mutex_waiter *rt_waiter;
union futex_key *requeue_pi_key;
@@ -249,6 +250,7 @@ struct futex_q {
static const struct futex_q futex_q_init = {
/* list gets initialized in queue_me()*/
.key = FUTEX_KEY_INIT,
+ /* .mm_ref must be initialized for each futex_q */
.bitset = FUTEX_BITSET_MATCH_ANY
};

@@ -336,9 +338,9 @@ static inline bool should_fail_futex(bool fshared)
}
#endif /* CONFIG_FAIL_FUTEX */

-static inline void futex_get_mm(union futex_key *key)
+static inline void futex_get_mm(union futex_key *key, struct mm_ref *ref)
{
- mmgrab(key->private.mm);
+ mmgrab(key->private.mm, ref);
/*
* Ensure futex_get_mm() implies a full barrier such that
* get_futex_key() implies a full barrier. This is relied upon
@@ -417,7 +419,7 @@ static inline int match_futex(union futex_key *key1, union futex_key *key2)
* Can be called while holding spinlocks.
*
*/
-static void get_futex_key_refs(union futex_key *key)
+static void get_futex_key_refs(union futex_key *key, struct mm_ref *ref)
{
if (!key->both.ptr)
return;
@@ -437,7 +439,7 @@ static void get_futex_key_refs(union futex_key *key)
ihold(key->shared.inode); /* implies smp_mb(); (B) */
break;
case FUT_OFF_MMSHARED:
- futex_get_mm(key); /* implies smp_mb(); (B) */
+ futex_get_mm(key, ref); /* implies smp_mb(); (B) */
break;
default:
/*
@@ -455,7 +457,7 @@ static void get_futex_key_refs(union futex_key *key)
* a no-op for private futexes, see comment in the get
* counterpart.
*/
-static void drop_futex_key_refs(union futex_key *key)
+static void drop_futex_key_refs(union futex_key *key, struct mm_ref *ref)
{
if (!key->both.ptr) {
/* If we're here then we tried to put a key we failed to get */
@@ -471,7 +473,7 @@ static void drop_futex_key_refs(union futex_key *key)
iput(key->shared.inode);
break;
case FUT_OFF_MMSHARED:
- mmdrop(key->private.mm);
+ mmdrop(key->private.mm, ref);
break;
}
}
@@ -495,7 +497,7 @@ static void drop_futex_key_refs(union futex_key *key)
* lock_page() might sleep, the caller should not hold a spinlock.
*/
static int
-get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
+get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw, struct mm_ref *mm_ref)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
@@ -527,7 +529,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
if (!fshared) {
key->private.mm = mm;
key->private.address = address;
- get_futex_key_refs(key); /* implies smp_mb(); (B) */
+ get_futex_key_refs(key, mm_ref); /* implies smp_mb(); (B) */
return 0;
}

@@ -630,7 +632,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
key->private.mm = mm;
key->private.address = address;

- get_futex_key_refs(key); /* implies smp_mb(); (B) */
+ get_futex_key_refs(key, mm_ref); /* implies smp_mb(); (B) */

} else {
struct inode *inode;
@@ -701,9 +703,9 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
return err;
}

-static inline void put_futex_key(union futex_key *key)
+static inline void put_futex_key(union futex_key *key, struct mm_ref *mm_ref)
{
- drop_futex_key_refs(key);
+ drop_futex_key_refs(key, mm_ref);
}

/**
@@ -1414,13 +1416,14 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
+ MM_REF(mm_ref);
int ret;
WAKE_Q(wake_q);

if (!bitset)
return -EINVAL;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_READ);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_READ, &mm_ref);
if (unlikely(ret != 0))
goto out;

@@ -1452,7 +1455,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
spin_unlock(&hb->lock);
wake_up_q(&wake_q);
out_put_key:
- put_futex_key(&key);
+ put_futex_key(&key, &mm_ref);
out:
return ret;
}
@@ -1466,16 +1469,18 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
int nr_wake, int nr_wake2, int op)
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
+ MM_REF(mm_ref1);
+ MM_REF(mm_ref2);
struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
int ret, op_ret;
WAKE_Q(wake_q);

retry:
- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
+ ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ, &mm_ref1);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
+ ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE, &mm_ref2);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1510,8 +1515,8 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
if (!(flags & FLAGS_SHARED))
goto retry_private;

- put_futex_key(&key2);
- put_futex_key(&key1);
+ put_futex_key(&key2, &mm_ref2);
+ put_futex_key(&key1, &mm_ref1);
goto retry;
}

@@ -1547,9 +1552,9 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
double_unlock_hb(hb1, hb2);
wake_up_q(&wake_q);
out_put_keys:
- put_futex_key(&key2);
+ put_futex_key(&key2, &mm_ref2);
out_put_key1:
- put_futex_key(&key1);
+ put_futex_key(&key1, &mm_ref1);
out:
return ret;
}
@@ -1563,7 +1568,7 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
*/
static inline
void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
- struct futex_hash_bucket *hb2, union futex_key *key2)
+ struct futex_hash_bucket *hb2, union futex_key *key2, struct mm_ref *mm_ref2)
{

/*
@@ -1577,7 +1582,7 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
plist_add(&q->list, &hb2->chain);
q->lock_ptr = &hb2->lock;
}
- get_futex_key_refs(key2);
+ get_futex_key_refs(key2, mm_ref2);
q->key = *key2;
}

@@ -1597,9 +1602,9 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
*/
static inline
void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
- struct futex_hash_bucket *hb)
+ struct futex_hash_bucket *hb, struct mm_ref *ref)
{
- get_futex_key_refs(key);
+ get_futex_key_refs(key, ref);
q->key = *key;

__unqueue_futex(q);
@@ -1636,7 +1641,8 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
struct futex_hash_bucket *hb1,
struct futex_hash_bucket *hb2,
union futex_key *key1, union futex_key *key2,
- struct futex_pi_state **ps, int set_waiters)
+ struct futex_pi_state **ps, int set_waiters,
+ struct mm_ref *mm_ref2)
{
struct futex_q *top_waiter = NULL;
u32 curval;
@@ -1675,7 +1681,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
set_waiters);
if (ret == 1) {
- requeue_pi_wake_futex(top_waiter, key2, hb2);
+ requeue_pi_wake_futex(top_waiter, key2, hb2, mm_ref2);
return vpid;
}
return ret;
@@ -1704,6 +1710,8 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
u32 *cmpval, int requeue_pi)
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
+ MM_REF(mm_ref1);
+ MM_REF(mm_ref2);
int drop_count = 0, task_count = 0, ret;
struct futex_pi_state *pi_state = NULL;
struct futex_hash_bucket *hb1, *hb2;
@@ -1739,11 +1747,11 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
}

retry:
- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
+ ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ, &mm_ref1);
if (unlikely(ret != 0))
goto out;
ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2,
- requeue_pi ? VERIFY_WRITE : VERIFY_READ);
+ requeue_pi ? VERIFY_WRITE : VERIFY_READ, &mm_ref2);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1779,8 +1787,8 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
if (!(flags & FLAGS_SHARED))
goto retry_private;

- put_futex_key(&key2);
- put_futex_key(&key1);
+ put_futex_key(&key2, &mm_ref2);
+ put_futex_key(&key1, &mm_ref1);
goto retry;
}
if (curval != *cmpval) {
@@ -1797,7 +1805,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
* faults rather in the requeue loop below.
*/
ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
- &key2, &pi_state, nr_requeue);
+ &key2, &pi_state, nr_requeue, &mm_ref2);

/*
* At this point the top_waiter has either taken uaddr2 or is
@@ -1836,8 +1844,8 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
case -EFAULT:
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
- put_futex_key(&key2);
- put_futex_key(&key1);
+ put_futex_key(&key2, &mm_ref2);
+ put_futex_key(&key1, &mm_ref1);
ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
@@ -1851,8 +1859,8 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
*/
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
- put_futex_key(&key2);
- put_futex_key(&key1);
+ put_futex_key(&key2, &mm_ref2);
+ put_futex_key(&key1, &mm_ref1);
cond_resched();
goto retry;
default:
@@ -1921,7 +1929,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
* value. It will drop the refcount after
* doing so.
*/
- requeue_pi_wake_futex(this, &key2, hb2);
+ requeue_pi_wake_futex(this, &key2, hb2, &mm_ref2);
drop_count++;
continue;
} else if (ret) {
@@ -1942,7 +1950,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
break;
}
}
- requeue_futex(this, hb1, hb2, &key2);
+ requeue_futex(this, hb1, hb2, &key2, &mm_ref2);
drop_count++;
}

@@ -1965,12 +1973,12 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
* hold the references to key1.
*/
while (--drop_count >= 0)
- drop_futex_key_refs(&key1);
+ drop_futex_key_refs(&key1, &mm_ref1);

out_put_keys:
- put_futex_key(&key2);
+ put_futex_key(&key2, &mm_ref2);
out_put_key1:
- put_futex_key(&key1);
+ put_futex_key(&key1, &mm_ref1);
out:
return ret ? ret : task_count;
}
@@ -2091,7 +2099,7 @@ static int unqueue_me(struct futex_q *q)
ret = 1;
}

- drop_futex_key_refs(&q->key);
+ drop_futex_key_refs(&q->key, &q->mm_ref);
return ret;
}

@@ -2365,7 +2373,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
* while the syscall executes.
*/
retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, VERIFY_READ);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, VERIFY_READ, &q->mm_ref);
if (unlikely(ret != 0))
return ret;

@@ -2384,7 +2392,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
if (!(flags & FLAGS_SHARED))
goto retry_private;

- put_futex_key(&q->key);
+ put_futex_key(&q->key, &q->mm_ref);
goto retry;
}

@@ -2395,7 +2403,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,

out:
if (ret)
- put_futex_key(&q->key);
+ put_futex_key(&q->key, &q->mm_ref);
return ret;
}

@@ -2408,6 +2416,8 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct futex_q q = futex_q_init;
int ret;

+ INIT_MM_REF(&q.mm_ref);
+
if (!bitset)
return -EINVAL;
q.bitset = bitset;
@@ -2507,6 +2517,8 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
struct futex_q q = futex_q_init;
int res, ret;

+ INIT_MM_REF(&q.mm_ref);
+
if (refill_pi_state_cache())
return -ENOMEM;

@@ -2519,7 +2531,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
}

retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE, &q.mm_ref);
if (unlikely(ret != 0))
goto out;

@@ -2547,7 +2559,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
* - The user space value changed.
*/
queue_unlock(hb);
- put_futex_key(&q.key);
+ put_futex_key(&q.key, &q.mm_ref);
cond_resched();
goto retry;
default:
@@ -2601,7 +2613,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
queue_unlock(hb);

out_put_key:
- put_futex_key(&q.key);
+ put_futex_key(&q.key, &q.mm_ref);
out:
if (to)
destroy_hrtimer_on_stack(&to->timer);
@@ -2617,7 +2629,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
if (!(flags & FLAGS_SHARED))
goto retry_private;

- put_futex_key(&q.key);
+ put_futex_key(&q.key, &q.mm_ref);
goto retry;
}

@@ -2630,6 +2642,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
{
u32 uninitialized_var(curval), uval, vpid = task_pid_vnr(current);
union futex_key key = FUTEX_KEY_INIT;
+ MM_REF(mm_ref);
struct futex_hash_bucket *hb;
struct futex_q *match;
int ret;
@@ -2643,7 +2656,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
if ((uval & FUTEX_TID_MASK) != vpid)
return -EPERM;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE, &mm_ref);
if (ret)
return ret;

@@ -2676,7 +2689,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
*/
if (ret == -EAGAIN) {
spin_unlock(&hb->lock);
- put_futex_key(&key);
+ put_futex_key(&key, &mm_ref);
goto retry;
}
/*
@@ -2704,12 +2717,12 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
out_unlock:
spin_unlock(&hb->lock);
out_putkey:
- put_futex_key(&key);
+ put_futex_key(&key, &mm_ref);
return ret;

pi_faulted:
spin_unlock(&hb->lock);
- put_futex_key(&key);
+ put_futex_key(&key, &mm_ref);

ret = fault_in_user_writeable(uaddr);
if (!ret)
@@ -2816,9 +2829,12 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
struct rt_mutex *pi_mutex = NULL;
struct futex_hash_bucket *hb;
union futex_key key2 = FUTEX_KEY_INIT;
+ MM_REF(mm_ref2);
struct futex_q q = futex_q_init;
int res, ret;

+ INIT_MM_REF(&q.mm_ref);
+
if (uaddr == uaddr2)
return -EINVAL;

@@ -2844,7 +2860,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
RB_CLEAR_NODE(&rt_waiter.tree_entry);
rt_waiter.task = NULL;

- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
+ ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE, &mm_ref2);
if (unlikely(ret != 0))
goto out;

@@ -2951,9 +2967,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
}

out_put_keys:
- put_futex_key(&q.key);
+ put_futex_key(&q.key, &q.mm_ref);
out_key2:
- put_futex_key(&key2);
+ put_futex_key(&key2, &mm_ref2);

out:
if (to) {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee1fb0070544..460c57d0d9af 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2771,7 +2771,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)

fire_sched_in_preempt_notifiers(current);
if (mm)
- mmdrop(mm);
+ mmdrop(mm, &rq->prev_mm_ref);
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
@@ -2877,12 +2877,14 @@ context_switch(struct rq *rq, struct task_struct *prev,

if (!mm) {
next->active_mm = oldmm;
- mmgrab(oldmm);
+ mmgrab(oldmm, &next->mm_ref);
enter_lazy_tlb(oldmm, next);
} else
switch_mm_irqs_off(oldmm, mm, next);

if (!prev->mm) {
+ if (oldmm)
+ move_mm_ref(oldmm, &prev->mm_ref, &rq->prev_mm_ref);
prev->active_mm = NULL;
rq->prev_mm = oldmm;
}
@@ -5472,7 +5474,7 @@ void idle_task_exit(void)
switch_mm_irqs_off(mm, &init_mm, current);
finish_arch_post_lock_switch();
}
- mmdrop(mm);
+ mmdrop(mm, &current->mm_ref);
}

/*
@@ -7640,6 +7642,10 @@ void __init sched_init(void)
rq->balance_callback = NULL;
rq->active_balance = 0;
rq->next_balance = jiffies;
+
+ BUG_ON(rq->prev_mm != NULL);
+ INIT_MM_REF(&rq->prev_mm_ref);
+
rq->push_cpu = 0;
rq->cpu = i;
rq->online = 0;
@@ -7667,7 +7673,7 @@ void __init sched_init(void)
/*
* The boot idle thread does lazy MMU switching as well:
*/
- mmgrab(&init_mm);
+ mmgrab(&init_mm, &init_mm_ref);
enter_lazy_tlb(&init_mm, current);

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 055f935d4421..98680abb882a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -636,6 +636,7 @@ struct rq {
struct task_struct *curr, *idle, *stop;
unsigned long next_balance;
struct mm_struct *prev_mm;
+ struct mm_ref prev_mm_ref;

unsigned int clock_skip_update;
u64 clock;
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be418157..01a5bd227a53 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1603,11 +1603,12 @@ static void k_getrusage(struct task_struct *p, int who, struct rusage *r)
cputime_to_timeval(stime, &r->ru_stime);

if (who != RUSAGE_CHILDREN) {
- struct mm_struct *mm = get_task_mm(p);
+ MM_REF(mm_ref);
+ struct mm_struct *mm = get_task_mm(p, &mm_ref);

if (mm) {
setmax_mm_hiwater_rss(&maxrss, mm);
- mmput(mm);
+ mmput(mm, &mm_ref);
}
}
r->ru_maxrss = maxrss * (PAGE_SIZE / 1024); /* convert pages to KBs */
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 3fc20422c166..decd72ec58e1 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1046,6 +1046,7 @@ static enum print_line_t trace_user_stack_print(struct trace_iterator *iter,
struct userstack_entry *field;
struct trace_seq *s = &iter->seq;
struct mm_struct *mm = NULL;
+ MM_REF(mm_ref);
unsigned int i;

trace_assign_type(field, iter->ent);
@@ -1061,7 +1062,7 @@ static enum print_line_t trace_user_stack_print(struct trace_iterator *iter,
rcu_read_lock();
task = find_task_by_vpid(field->tgid);
if (task)
- mm = get_task_mm(task);
+ mm = get_task_mm(task, &mm_ref);
rcu_read_unlock();
}

@@ -1084,7 +1085,7 @@ static enum print_line_t trace_user_stack_print(struct trace_iterator *iter,
}

if (mm)
- mmput(mm);
+ mmput(mm, &mm_ref);

return trace_handle_return(s);
}
diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index f8e26ab963ed..58595a3dca3f 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -92,18 +92,19 @@ void bacct_add_tsk(struct user_namespace *user_ns,
void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
{
struct mm_struct *mm;
+ MM_REF(mm_ref);

/* convert pages-nsec/1024 to Mbyte-usec, see __acct_update_integrals */
stats->coremem = p->acct_rss_mem1 * PAGE_SIZE;
do_div(stats->coremem, 1000 * KB);
stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE;
do_div(stats->virtmem, 1000 * KB);
- mm = get_task_mm(p);
+ mm = get_task_mm(p, &mm_ref);
if (mm) {
/* adjust to KB unit */
stats->hiwater_rss = get_mm_hiwater_rss(mm) * PAGE_SIZE / KB;
stats->hiwater_vm = get_mm_hiwater_vm(mm) * PAGE_SIZE / KB;
- mmput(mm);
+ mmput(mm, &mm_ref);
}
stats->read_char = p->ioac.rchar & KB_MASK;
stats->write_char = p->ioac.wchar & KB_MASK;
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a9f76b..1d6acdf0a4a7 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,7 +37,7 @@ obj-y := filemap.o mempool.o oom_kill.o \
mm_init.o mmu_context.o percpu.o slab_common.o \
compaction.o vmacache.o \
interval_tree.o list_lru.o workingset.o \
- debug.o $(mmu-y)
+ debug.o mm_ref.o $(mmu-y)

obj-y += init-mm.o

diff --git a/mm/init-mm.c b/mm/init-mm.c
index a56a851908d2..deb315a4c240 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -16,10 +16,16 @@
struct mm_struct init_mm = {
.mm_rb = RB_ROOT,
.pgd = swapper_pg_dir,
+ .mm_refs_lock = __SPIN_LOCK_UNLOCKED(init_mm.mm_refs_lock),
.mm_users = ATOMIC_INIT(2),
+ .mm_users_list = LIST_HEAD_INIT(init_mm.mm_users_list),
+ .mm_users_ref = MM_REF_INIT(init_mm.mm_users_ref),
.mm_count = ATOMIC_INIT(1),
+ .mm_count_list = LIST_HEAD_INIT(init_mm.mm_count_list),
.mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
INIT_MM_CONTEXT(init_mm)
};
+
+MM_REF(init_mm_ref);
diff --git a/mm/memory.c b/mm/memory.c
index e18c57bdc75c..3be253b54c04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3954,15 +3954,16 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr,
void *buf, int len, unsigned int gup_flags)
{
struct mm_struct *mm;
+ MM_REF(mm_ref);
int ret;

- mm = get_task_mm(tsk);
+ mm = get_task_mm(tsk, &mm_ref);
if (!mm)
return 0;

ret = __access_remote_vm(tsk, mm, addr, buf, len, gup_flags);

- mmput(mm);
+ mmput(mm, &mm_ref);

return ret;
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0b859af06b87..4790274af596 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1374,6 +1374,7 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
{
const struct cred *cred = current_cred(), *tcred;
struct mm_struct *mm = NULL;
+ MM_REF(mm_ref);
struct task_struct *task;
nodemask_t task_nodes;
int err;
@@ -1439,7 +1440,7 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
if (err)
goto out_put;

- mm = get_task_mm(task);
+ mm = get_task_mm(task, &mm_ref);
put_task_struct(task);

if (!mm) {
@@ -1450,7 +1451,7 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
err = do_migrate_pages(mm, old, new,
capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);

- mmput(mm);
+ mmput(mm, &mm_ref);
out:
NODEMASK_SCRATCH_FREE(scratch);

diff --git a/mm/migrate.c b/mm/migrate.c
index 99250aee1ac1..593942dabbc1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1659,6 +1659,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
const struct cred *cred = current_cred(), *tcred;
struct task_struct *task;
struct mm_struct *mm;
+ MM_REF(mm_ref);
int err;
nodemask_t task_nodes;

@@ -1699,7 +1700,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
goto out;

task_nodes = cpuset_mems_allowed(task);
- mm = get_task_mm(task);
+ mm = get_task_mm(task, &mm_ref);
put_task_struct(task);

if (!mm)
@@ -1711,7 +1712,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
else
err = do_pages_stat(mm, nr_pages, pages, status);

- mmput(mm);
+ mmput(mm, &mm_ref);
return err;

out:
diff --git a/mm/mm_ref.c b/mm/mm_ref.c
new file mode 100644
index 000000000000..cf14334aec58
--- /dev/null
+++ b/mm/mm_ref.c
@@ -0,0 +1,163 @@
+#include <linux/list.h>
+#include <linux/mm_ref.h>
+#include <linux/mm_types.h>
+#include <linux/sched.h>
+#include <linux/stacktrace.h>
+
+static void _mm_ref_save_trace(struct mm_ref *ref)
+{
+ ref->pid = current->pid;
+
+ /* Save stack trace */
+ ref->trace.nr_entries = 0;
+ ref->trace.entries = ref->trace_entries;
+ ref->trace.max_entries = NR_MM_REF_STACK_ENTRIES;
+ ref->trace.skip = 1;
+ save_stack_trace(&ref->trace);
+}
+
+void INIT_MM_REF(struct mm_ref *ref)
+{
+ _mm_ref_save_trace(ref);
+ INIT_LIST_HEAD(&ref->list_entry);
+ ref->state = MM_REF_INITIALIZED;
+}
+
+static void dump_refs_list(const char *label, struct list_head *list)
+{
+ struct mm_ref *ref;
+
+ if (list_empty(list)) {
+ printk(KERN_ERR "%s: no refs\n", label);
+ return;
+ }
+
+ printk(KERN_ERR "%s:\n", label);
+ list_for_each_entry(ref, list, list_entry) {
+ printk(KERN_ERR " - %p %x acquired by %d at:%s\n",
+ ref, ref->state,
+ ref->pid,
+ ref->state == MM_REF_ACTIVE ? "" : " (bogus)");
+ if (ref->state == MM_REF_ACTIVE)
+ print_stack_trace(&ref->trace, 2);
+ }
+}
+
+static void dump_refs(struct mm_struct *mm)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&mm->mm_refs_lock, flags);
+ printk(KERN_ERR "mm_users = %u\n", atomic_read(&mm->mm_users));
+ dump_refs_list("mm_users_list", &mm->mm_users_list);
+ printk(KERN_ERR "mm_count = %u\n", atomic_read(&mm->mm_count));
+ dump_refs_list("mm_count_list", &mm->mm_count_list);
+ spin_unlock_irqrestore(&mm->mm_refs_lock, flags);
+}
+
+static bool _mm_ref_expect_inactive(struct mm_struct *mm, struct mm_ref *ref)
+{
+ if (ref->state == MM_REF_INITIALIZED || ref->state == MM_REF_INACTIVE)
+ return true;
+
+ if (ref->state == MM_REF_ACTIVE) {
+ printk(KERN_ERR "trying to overwrite active ref %p to mm %p\n", ref, mm);
+ printk(KERN_ERR "previous ref taken by %d at:\n", ref->pid);
+ print_stack_trace(&ref->trace, 0);
+ } else {
+ printk(KERN_ERR "trying to overwrite ref %p in unknown state %x to mm %p\n",
+ ref, ref->state, mm);
+ }
+
+ return false;
+}
+
+static bool _mm_ref_expect_active(struct mm_struct *mm, struct mm_ref *ref)
+{
+ if (ref->state == MM_REF_ACTIVE)
+ return true;
+
+ if (ref->state == MM_REF_INITIALIZED || ref->state == MM_REF_INACTIVE) {
+ printk(KERN_ERR "trying to put inactive ref %p to mm %p\n", ref, mm);
+ if (ref->state == MM_REF_INITIALIZED)
+ printk(KERN_ERR "ref initialized by %d at:\n", ref->pid);
+ else
+ printk(KERN_ERR "previous ref dropped by %d at:\n", ref->pid);
+ print_stack_trace(&ref->trace, 0);
+ } else {
+ printk(KERN_ERR "trying to put ref %p in unknown state %x to mm %p\n",
+ ref, ref->state, mm);
+ }
+
+ return false;
+}
+
+void _get_mm_ref(struct mm_struct *mm, struct list_head *list, struct mm_ref *ref)
+{
+ unsigned long flags;
+
+ if (!_mm_ref_expect_inactive(mm, ref)) {
+ dump_refs(mm);
+ BUG();
+ }
+
+ _mm_ref_save_trace(ref);
+
+ spin_lock_irqsave(&mm->mm_refs_lock, flags);
+ list_add_tail(&ref->list_entry, list);
+ spin_unlock_irqrestore(&mm->mm_refs_lock, flags);
+
+ ref->state = MM_REF_ACTIVE;
+}
+
+void _put_mm_ref(struct mm_struct *mm, struct list_head *list, struct mm_ref *ref)
+{
+ unsigned long flags;
+
+ if (!_mm_ref_expect_active(mm, ref)) {
+ dump_refs(mm);
+ BUG();
+ }
+
+ _mm_ref_save_trace(ref);
+
+ spin_lock_irqsave(&mm->mm_refs_lock, flags);
+ BUG_ON(list_empty(&ref->list_entry));
+ list_del_init(&ref->list_entry);
+ spin_unlock_irqrestore(&mm->mm_refs_lock, flags);
+
+ ref->state = MM_REF_INACTIVE;
+}
+
+/*
+ * TODO: we have a choice here whether to ignore mm == NULL or
+ * treat it as an error.
+ * TODO: there's also a question about whether old_ref == new_ref
+ * is an error or not.
+ */
+void _move_mm_ref(struct mm_struct *mm, struct list_head *list,
+ struct mm_ref *old_ref, struct mm_ref *new_ref)
+{
+ unsigned long flags;
+
+ if (!_mm_ref_expect_active(mm, old_ref)) {
+ dump_refs(mm);
+ BUG();
+ }
+ if (!_mm_ref_expect_inactive(mm, new_ref)) {
+ dump_refs(mm);
+ BUG();
+ }
+
+ _mm_ref_save_trace(old_ref);
+ _mm_ref_save_trace(new_ref);
+
+ spin_lock_irqsave(&mm->mm_refs_lock, flags);
+ BUG_ON(list_empty(&old_ref->list_entry));
+ list_del_init(&old_ref->list_entry);
+ list_add_tail(&new_ref->list_entry, list);
+ spin_unlock_irqrestore(&mm->mm_refs_lock, flags);
+
+ old_ref->state = MM_REF_INACTIVE;
+ new_ref->state = MM_REF_ACTIVE;
+}
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index daf67bb02b4a..3e28db145982 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -20,12 +20,14 @@
void use_mm(struct mm_struct *mm)
{
struct mm_struct *active_mm;
+ struct mm_ref active_mm_ref;
struct task_struct *tsk = current;

task_lock(tsk);
active_mm = tsk->active_mm;
if (active_mm != mm) {
- mmgrab(mm);
+ move_mm_ref(mm, &tsk->mm_ref, &active_mm_ref);
+ mmgrab(mm, &tsk->mm_ref);
tsk->active_mm = mm;
}
tsk->mm = mm;
@@ -35,8 +37,9 @@ void use_mm(struct mm_struct *mm)
finish_arch_post_lock_switch();
#endif

- if (active_mm != mm)
- mmdrop(active_mm);
+ if (active_mm != mm) {
+ mmdrop(active_mm, &active_mm_ref);
+ }
}
EXPORT_SYMBOL_GPL(use_mm);

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 32bc9f2ff7eb..8187d46c8d05 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -244,7 +244,7 @@ EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);

static int do_mmu_notifier_register(struct mmu_notifier *mn,
struct mm_struct *mm,
- int take_mmap_sem)
+ int take_mmap_sem, struct mm_ref *mm_ref)
{
struct mmu_notifier_mm *mmu_notifier_mm;
int ret;
@@ -275,7 +275,7 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
mm->mmu_notifier_mm = mmu_notifier_mm;
mmu_notifier_mm = NULL;
}
- mmgrab(mm);
+ mmgrab(mm, mm_ref);

/*
* Serialize the update against mmu_notifier_unregister. A
@@ -312,9 +312,9 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
* after exit_mmap. ->release will always be called before exit_mmap
* frees the pages.
*/
-int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm, struct mm_ref *mm_ref)
{
- return do_mmu_notifier_register(mn, mm, 1);
+ return do_mmu_notifier_register(mn, mm, 1, mm_ref);
}
EXPORT_SYMBOL_GPL(mmu_notifier_register);

@@ -322,9 +322,9 @@ EXPORT_SYMBOL_GPL(mmu_notifier_register);
* Same as mmu_notifier_register but here the caller must hold the
* mmap_sem in write mode.
*/
-int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm, struct mm_ref *mm_ref)
{
- return do_mmu_notifier_register(mn, mm, 0);
+ return do_mmu_notifier_register(mn, mm, 0, mm_ref);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_register);

@@ -346,7 +346,7 @@ void __mmu_notifier_mm_destroy(struct mm_struct *mm)
* and only after mmu_notifier_unregister returned we're guaranteed
* that ->release or any other method can't run anymore.
*/
-void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm, struct mm_ref *mm_ref)
{
BUG_ON(atomic_read(&mm->mm_count) <= 0);

@@ -383,7 +383,7 @@ void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)

BUG_ON(atomic_read(&mm->mm_count) <= 0);

- mmdrop(mm);
+ mmdrop(mm, mm_ref);
}
EXPORT_SYMBOL_GPL(mmu_notifier_unregister);

@@ -391,7 +391,7 @@ EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
* Same as mmu_notifier_unregister but no callback and no srcu synchronization.
*/
void mmu_notifier_unregister_no_release(struct mmu_notifier *mn,
- struct mm_struct *mm)
+ struct mm_struct *mm, struct mm_ref *mm_ref)
{
spin_lock(&mm->mmu_notifier_mm->lock);
/*
@@ -402,7 +402,7 @@ void mmu_notifier_unregister_no_release(struct mmu_notifier *mn,
spin_unlock(&mm->mmu_notifier_mm->lock);

BUG_ON(atomic_read(&mm->mm_count) <= 0);
- mmdrop(mm);
+ mmdrop(mm, mm_ref);
}
EXPORT_SYMBOL_GPL(mmu_notifier_unregister_no_release);

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ead093c6f2a6..0aa0b364ec0e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -463,6 +463,7 @@ static DEFINE_SPINLOCK(oom_reaper_lock);

static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
{
+ MM_REF(mm_ref);
struct mmu_gather tlb;
struct vm_area_struct *vma;
struct zap_details details = {.check_swap_entries = true,
@@ -495,7 +496,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
* that the mmput_async is called only when we have reaped something
* and delayed __mmput doesn't matter that much
*/
- if (!mmget_not_zero(mm)) {
+ if (!mmget_not_zero(mm, &mm_ref)) {
up_read(&mm->mmap_sem);
goto unlock_oom;
}
@@ -547,7 +548,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
* different context because we shouldn't risk we get stuck there and
* put the oom_reaper out of the way.
*/
- mmput_async(mm);
+ mmput_async(mm, &mm_ref);
unlock_oom:
mutex_unlock(&oom_lock);
return ret;
@@ -660,7 +661,7 @@ static void mark_oom_victim(struct task_struct *tsk)

/* oom_mm is bound to the signal struct life time. */
if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
- mmgrab(tsk->signal->oom_mm);
+ mmgrab(tsk->signal->oom_mm, &tsk->signal->oom_mm_ref);

/*
* Make sure that the task is woken up from uninterruptible sleep
@@ -812,6 +813,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
struct task_struct *child;
struct task_struct *t;
struct mm_struct *mm;
+ MM_REF(mm_ref);
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
@@ -877,7 +879,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)

/* Get a reference to safely compare mm after task_unlock(victim) */
mm = victim->mm;
- mmgrab(mm);
+ mmgrab(mm, &mm_ref);
/*
* We should send SIGKILL before setting TIF_MEMDIE in order to prevent
* the OOM victim from depleting the memory reserves from the user
@@ -928,7 +930,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
if (can_oom_reap)
wake_oom_reaper(victim);

- mmdrop(mm);
+ mmdrop(mm, &mm_ref);
put_task_struct(victim);
}
#undef K
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index be8dc8d1edb9..8eef73c5ed81 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -155,6 +155,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
struct page *pp_stack[PVM_MAX_PP_ARRAY_COUNT];
struct page **process_pages = pp_stack;
struct mm_struct *mm;
+ MM_REF(mm_ref);
unsigned long i;
ssize_t rc = 0;
unsigned long nr_pages = 0;
@@ -202,7 +203,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
goto free_proc_pages;
}

- mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+ mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS, &mm_ref);
if (!mm || IS_ERR(mm)) {
rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
/*
@@ -228,7 +229,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
if (total_len)
rc = total_len;

- mmput(mm);
+ mmput(mm, &mm_ref);

put_task_struct:
put_task_struct(task);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8c92829326cb..781122d8be77 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1376,6 +1376,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
{
struct swap_info_struct *si = swap_info[type];
struct mm_struct *start_mm;
+ MM_REF(start_mm_ref);
volatile unsigned char *swap_map; /* swap_map is accessed without
* locking. Mark it as volatile
* to prevent compiler doing
@@ -1402,7 +1403,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
* that.
*/
start_mm = &init_mm;
- mmget(&init_mm);
+ mmget(&init_mm, &start_mm_ref);

/*
* Keep on scanning until all entries have gone. Usually,
@@ -1449,9 +1450,9 @@ int try_to_unuse(unsigned int type, bool frontswap,
* Don't hold on to start_mm if it looks like exiting.
*/
if (atomic_read(&start_mm->mm_users) == 1) {
- mmput(start_mm);
+ mmput(start_mm, &start_mm_ref);
start_mm = &init_mm;
- mmget(&init_mm);
+ mmget(&init_mm, &start_mm_ref);
}

/*
@@ -1485,19 +1486,22 @@ int try_to_unuse(unsigned int type, bool frontswap,
int set_start_mm = (*swap_map >= swcount);
struct list_head *p = &start_mm->mmlist;
struct mm_struct *new_start_mm = start_mm;
+ MM_REF(new_start_mm_ref);
struct mm_struct *prev_mm = start_mm;
+ MM_REF(prev_mm_ref);
struct mm_struct *mm;
+ MM_REF(mm_ref);

- mmget(new_start_mm);
- mmget(prev_mm);
+ mmget(new_start_mm, &new_start_mm_ref);
+ mmget(prev_mm, &prev_mm_ref);
spin_lock(&mmlist_lock);
while (swap_count(*swap_map) && !retval &&
(p = p->next) != &start_mm->mmlist) {
mm = list_entry(p, struct mm_struct, mmlist);
- if (!mmget_not_zero(mm))
+ if (!mmget_not_zero(mm, &mm_ref))
continue;
spin_unlock(&mmlist_lock);
- mmput(prev_mm);
+ mmput(prev_mm, &prev_mm_ref);
prev_mm = mm;

cond_resched();
@@ -1511,17 +1515,18 @@ int try_to_unuse(unsigned int type, bool frontswap,
retval = unuse_mm(mm, entry, page);

if (set_start_mm && *swap_map < swcount) {
- mmput(new_start_mm);
- mmget(mm);
+ mmput(new_start_mm, &new_start_mm_ref);
+ mmget(mm, &mm_ref);
new_start_mm = mm;
set_start_mm = 0;
}
spin_lock(&mmlist_lock);
}
spin_unlock(&mmlist_lock);
- mmput(prev_mm);
- mmput(start_mm);
+ mmput(prev_mm, &prev_mm_ref);
+ mmput(start_mm, &start_mm_ref);
start_mm = new_start_mm;
+ move_mm_users_ref(start_mm, &new_start_mm_ref, &start_mm_ref);
}
if (retval) {
unlock_page(page);
@@ -1590,7 +1595,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
}
}

- mmput(start_mm);
+ mmput(start_mm, &start_mm_ref);
return retval;
}

diff --git a/mm/util.c b/mm/util.c
index 1a41553db866..9bace6820707 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -607,7 +607,8 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen)
{
int res = 0;
unsigned int len;
- struct mm_struct *mm = get_task_mm(task);
+ MM_REF(mm_ref);
+ struct mm_struct *mm = get_task_mm(task, &mm_ref);
unsigned long arg_start, arg_end, env_start, env_end;
if (!mm)
goto out;
@@ -647,7 +648,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen)
}
}
out_mm:
- mmput(mm);
+ mmput(mm, &mm_ref);
out:
return res;
}
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 9ec9cef2b207..972084e84bd6 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -108,7 +108,7 @@ static void async_pf_execute(struct work_struct *work)
if (swait_active(&vcpu->wq))
swake_up(&vcpu->wq);

- mmput(mm);
+ mmput(mm, &apf->mm_ref);
kvm_put_kvm(vcpu->kvm);
}

@@ -135,7 +135,7 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
flush_work(&work->work);
#else
if (cancel_work_sync(&work->work)) {
- mmput(work->mm);
+ mmput(work->mm, &work->mm_ref);
kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
kmem_cache_free(async_pf_cache, work);
}
@@ -200,7 +200,8 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
work->addr = hva;
work->arch = *arch;
work->mm = current->mm;
- mmget(work->mm);
+ INIT_MM_REF(&work->mm_ref);
+ mmget(work->mm, &work->mm_ref);
kvm_get_kvm(work->vcpu->kvm);

/* this can't really happen otherwise gfn_to_pfn_async
@@ -218,7 +219,7 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
return 1;
retry_sync:
kvm_put_kvm(work->vcpu->kvm);
- mmput(work->mm);
+ mmput(work->mm, &work->mm_ref);
kmem_cache_free(async_pf_cache, work);
return 0;
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 43914b981691..d608457033d5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -482,7 +482,8 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
static int kvm_init_mmu_notifier(struct kvm *kvm)
{
kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
- return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
+ return mmu_notifier_register(&kvm->mmu_notifier, current->mm,
+ &kvm->mmu_notifier_ref);
}

#else /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
@@ -608,12 +609,13 @@ static struct kvm *kvm_create_vm(unsigned long type)
{
int r, i;
struct kvm *kvm = kvm_arch_alloc_vm();
+ MM_REF(mm_ref);

if (!kvm)
return ERR_PTR(-ENOMEM);

spin_lock_init(&kvm->mmu_lock);
- mmgrab(current->mm);
+ mmgrab(current->mm, &kvm->mm_ref);
kvm->mm = current->mm;
kvm_eventfd_init(kvm);
mutex_init(&kvm->lock);
@@ -654,6 +656,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
goto out_err;
}

+ INIT_MM_REF(&kvm->mmu_notifier_ref);
r = kvm_init_mmu_notifier(kvm);
if (r)
goto out_err;
@@ -677,8 +680,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
kfree(kvm->buses[i]);
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
kvm_free_memslots(kvm, kvm->memslots[i]);
+ move_mm_ref(kvm->mm, &kvm->mm_ref, &mm_ref);
kvm_arch_free_vm(kvm);
- mmdrop(current->mm);
+ mmdrop(current->mm, &mm_ref);
return ERR_PTR(r);
}

@@ -713,6 +717,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
{
int i;
struct mm_struct *mm = kvm->mm;
+ MM_REF(mm_ref);

kvm_destroy_vm_debugfs(kvm);
kvm_arch_sync_events(kvm);
@@ -724,7 +729,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm_io_bus_destroy(kvm->buses[i]);
kvm_coalesced_mmio_free(kvm);
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
- mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
+ mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm, &kvm->mmu_notifier_ref);
#else
kvm_arch_flush_shadow_all(kvm);
#endif
@@ -734,10 +739,11 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm_free_memslots(kvm, kvm->memslots[i]);
cleanup_srcu_struct(&kvm->irq_srcu);
cleanup_srcu_struct(&kvm->srcu);
+ move_mm_ref(mm, &kvm->mm_ref, &mm_ref);
kvm_arch_free_vm(kvm);
preempt_notifier_dec();
hardware_disable_all();
- mmdrop(mm);
+ mmdrop(mm, &mm_ref);
}

void kvm_get_kvm(struct kvm *kvm)
--
2.11.0.1.gaa10c3f

2016-12-16 10:11:40

by Michal Hocko

[permalink] [raw]
Subject: crash during oom reaper (was: Re: [PATCH 4/4] [RFC!] mm: 'struct mm_struct' reference counting debugging)

On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
[...]
> I don't think it's a bug in the OOM reaper itself, but either of the
> following two patches will fix the problem (without my understanding how or
> why):
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ec9f11d4f094..37b14b2e2af4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -485,7 +485,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk,
> struct mm_struct *mm)
> */
> mutex_lock(&oom_lock);
>
> - if (!down_read_trylock(&mm->mmap_sem)) {
> + if (!down_write_trylock(&mm->mmap_sem)) {

__oom_reap_task_mm is basically the same thing as MADV_DONTNEED and that
doesn't require the exclusive mmap_sem. So this looks correct to me.
[...]

> --OR--
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ec9f11d4f094..559aec0acd21 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -508,6 +508,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk,
> struct mm_struct *mm)
> */
> set_bit(MMF_UNSTABLE, &mm->flags);
>
> +#if 0
> tlb_gather_mmu(&tlb, mm, 0, -1);
> for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> if (is_vm_hugetlb_page(vma))
> @@ -535,6 +536,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk,
> struct mm_struct *mm)
> &details);
> }
> tlb_finish_mmu(&tlb, 0, -1);
> +#endif

same here, nothing different from the madvise... Well, except for the
MMF_UNSTABLE part which will force any page fault on this mm to SEGV.

> pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB,
> file-rss:%lukB, shmem-rss:%lukB\n",
> task_pid_nr(tsk), tsk->comm,
> K(get_mm_counter(mm, MM_ANONPAGES)),
>
> Maybe it's just the fact that we're not releasing the memory and so some
> other bit of code is not able to make enough progress to trigger the
> bug, although curiously, if I just move the #if 0..#endif inside
> tlb_gather_mmu()..tlb_finish_mmu() itself (so just calling tlb_*()
> without doing the for-loop), it still reproduces the crash.

What is the actual crash?

> Another clue, although it might just be a coincidence, is that it seems
> the VMA/file in question is always a mapping for the exe file itself
> (the reason I think this might be a coincidence is that the exe file
> mapping is the first one and we usually traverse VMAs starting with this
> one, that doesn't mean the other VMAs aren't affected by the same
> problem, just that we never hit them).

You can experiment a bit and exclude PROT_EXEC vmas...
--
Michal Hocko
SUSE Labs

2016-12-16 10:19:28

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: add new mmgrab() helper

On Fri, Dec 16, 2016 at 10:56:24AM +0100, Peter Zijlstra wrote:
> But I must say mmget() vs mmgrab() is a wee bit confusing.

mm_count vs mm_users is not very clear either. :)

--
Kirill A. Shutemov

2016-12-16 10:43:04

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: add new mmgrab() helper

On Fri 16-12-16 11:20:40, Vegard Nossum wrote:
> On 12/16/2016 11:19 AM, Kirill A. Shutemov wrote:
> > On Fri, Dec 16, 2016 at 10:56:24AM +0100, Peter Zijlstra wrote:
> > > But I must say mmget() vs mmgrab() is a wee bit confusing.
> >
> > mm_count vs mm_users is not very clear either. :)
> >
>
> I was about to say, I'm not sure it's much better than mmput() vs
> mmdrop() or mm_users vs mm_count either, although the way I rationalised
> it was the 3 vs 4 letters:
>
> mmget() -- mmgrab()
> mmput() -- mmdrop()
> ^^^ 3 ^^^^ 4

get -> put
grab -> drop

makes sense to me... No idea about a much more clear name.
--
Michal Hocko
SUSE Labs

2016-12-16 10:46:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: crash during oom reaper (was: Re: [PATCH 4/4] [RFC!] mm: 'struct mm_struct' reference counting debugging)

On Fri, Dec 16, 2016 at 11:11:13AM +0100, Michal Hocko wrote:
> On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
> [...]
> > I don't think it's a bug in the OOM reaper itself, but either of the
> > following two patches will fix the problem (without my understanding how or
> > why):
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index ec9f11d4f094..37b14b2e2af4 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -485,7 +485,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk,
> > struct mm_struct *mm)
> > */
> > mutex_lock(&oom_lock);
> >
> > - if (!down_read_trylock(&mm->mmap_sem)) {
> > + if (!down_write_trylock(&mm->mmap_sem)) {
>
> __oom_reap_task_mm is basically the same thing as MADV_DONTNEED and that
> doesn't require the exclusive mmap_sem. So this looks correct to me.

BTW, shouldn't we filter out all VM_SPECIAL VMAs there? Or VM_PFNMAP at
least.

MADV_DONTNEED doesn't touch VM_PFNMAP, but I don't see anything matching
on __oom_reap_task_mm() side.
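
Something along these lines, maybe (untested sketch, helper name made up;
going from the check madvise_dontneed() does):

	static bool can_reap_vma(struct vm_area_struct *vma)
	{
		/* mirror MADV_DONTNEED: skip locked, hugetlb and PFN mappings */
		return !(vma->vm_flags & (VM_LOCKED | VM_HUGETLB | VM_PFNMAP));
	}

and the reaper loop would then just do "if (!can_reap_vma(vma)) continue;"
before touching the vma.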

Other difference is that you use unmap_page_range() which doesn't touch
mmu_notifiers. MADV_DONTNEED goes via zap_page_range(), which invalidates
the range. Not sure if it can make any difference here.
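
To spell out the difference (rough sketch only, from memory):
zap_page_range() brackets the unmap with the notifier calls, roughly

	mmu_notifier_invalidate_range_start(mm, start, end);
	unmap_single_vma(&tlb, vma, start, end, details);
	mmu_notifier_invalidate_range_end(mm, start, end);

while __oom_reap_task_mm() calls unmap_page_range() directly, so a
registered notifier (KVM, for instance) would presumably never hear about
the teardown.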

--
Kirill A. Shutemov

2016-12-16 10:52:24

by Vegard Nossum

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: add new mmgrab() helper

On 12/16/2016 11:19 AM, Kirill A. Shutemov wrote:
> On Fri, Dec 16, 2016 at 10:56:24AM +0100, Peter Zijlstra wrote:
>> But I must say mmget() vs mmgrab() is a wee bit confusing.
>
> mm_count vs mm_users is not very clear either. :)
>

I was about to say, I'm not sure it's much better than mmput() vs
mmdrop() or mm_users vs mm_count either, although the way I rationalised
it was the 3 vs 4 letters:

mmget() -- mmgrab()
mmput() -- mmdrop()
^^^ 3 ^^^^ 4


Vegard

2016-12-16 11:16:34

by Vegard Nossum

[permalink] [raw]
Subject: Re: [PATCH 1/4] mm: add new mmgrab() helper

On 12/16/2016 10:56 AM, Peter Zijlstra wrote:
> On Fri, Dec 16, 2016 at 09:21:59AM +0100, Vegard Nossum wrote:
>> Apart from adding the helper function itself, the rest of the kernel is
>> converted mechanically using:
>>
>> git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
>> git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'
>>
>> This is needed for a later patch that hooks into the helper, but might be
>> a worthwhile cleanup on its own.
>
> Given the desire to replace all refcounting with a specific refcount
> type, this seems to make sense.
>
> FYI: http://www.openwall.com/lists/kernel-hardening/2016/12/07/8

If we're going that way eventually (replacing all reference counting
things with a generic interface), I wonder if we shouldn't consider a
generic mechanism for reference counting debugging too.

We could wrap all the 'type *' + 'type_ref' occurrences in a struct, so
that without debugging it boils down to just a pointer (like we have now):

struct ref {
	void *ptr;
#ifdef CONFIG_REF_DEBUG
	/* list_entry, pid, stacktrace, etc. */
#endif
};

Instead of calling refcount_inc() in most of the kernel code, that would
be considered a low-level detail and you'd have the main interface be
something like:

void ref_acquire(refcount_t *count, struct ref *old, struct ref *new)
{
	refcount_inc(count);
	new->ptr = old->ptr;
#ifdef CONFIG_REF_DEBUG
	/* extra code for debugging case */
#endif
}

So if you had old code that did (for example):

struct task_struct {
struct mm_struct *mm;
...
};

int proc_pid_cmdline_read(struct task_struct *task)
{
struct mm_struct *mm;

task_lock(task);
mm = task->mm;
atomic_inc(&mm->mm_users);
task_unlock(task);

...

mmput(mm);
}

you'd instead have:

struct task_struct {
struct ref mm;
};

int proc_pid_cmdline_read(struct task_struct *task)
{
REF(mm);

task_lock(task);
ref_acquire(&mm->mm_users, &task->mm, &mm);
task_unlock(task);

...

ref_release(&mm->mm_users, &mm);
}

Of course you'd define a 'struct ref' per type using a macro or
something to keep it type safe (maybe even wrap the counter itself in
there, e.g. mm_users in the example above, so you wouldn't have to pass
it explicitly).

Functions that don't touch reference counts (because the caller holds
one) can just take a plain pointer as usual.

In the example above, you could also have ref_release() set mm->ptr =
NULL, as the pointer should not be considered usable after it has been
released anyway, for added safety/debuggability.
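
To make that a bit more concrete, a rough sketch of the per-type wrapper
plus the poisoning release (all names are invented here, and the
CONFIG_REF_DEBUG fields are omitted):

#define DEFINE_REF_TYPE(type)					\
	struct type##_ref {					\
		struct type *ptr;				\
	}

DEFINE_REF_TYPE(mm_struct);	/* declares struct mm_struct_ref */

static inline void mm_ref_acquire(struct mm_struct_ref *dst,
				  const struct mm_struct_ref *src)
{
	atomic_inc(&src->ptr->mm_users);	/* refcount_inc() once refcount_t lands */
	dst->ptr = src->ptr;
}

static inline void mm_ref_release(struct mm_struct_ref *ref)
{
	mmput(ref->ptr);
	ref->ptr = NULL;	/* poison: not usable after release */
}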

Best of both worlds?


Vegard

2016-12-16 11:50:11

by Michal Hocko

[permalink] [raw]
Subject: Re: crash during oom reaper

On Fri 16-12-16 13:44:38, Kirill A. Shutemov wrote:
> On Fri, Dec 16, 2016 at 11:11:13AM +0100, Michal Hocko wrote:
> > On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
> > [...]
> > > I don't think it's a bug in the OOM reaper itself, but either of the
> > > following two patches will fix the problem (without my understand how or
> > > why):
> > >
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index ec9f11d4f094..37b14b2e2af4 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -485,7 +485,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk,
> > > struct mm_struct *mm)
> > > */
> > > mutex_lock(&oom_lock);
> > >
> > > - if (!down_read_trylock(&mm->mmap_sem)) {
> > > + if (!down_write_trylock(&mm->mmap_sem)) {
> >
> > __oom_reap_task_mm is basically the same thing as MADV_DONTNEED and that
> > doesn't require the exlusive mmap_sem. So this looks correct to me.
>
> BTW, shouldn't we filter out all VM_SPECIAL VMAs there? Or VM_PFNMAP at
> least.
>
> MADV_DONTNEED doesn't touch VM_PFNMAP, but I don't see anything matching
> on __oom_reap_task_mm() side.

I guess you are right and we should match the MADV_DONTNEED behavior
here. Care to send a patch?

> Another difference is that you use unmap_page_range(), which doesn't touch
> mmu_notifiers. MADV_DONTNEED goes via zap_page_range(), which invalidates
> the range. Not sure if it can make any difference here.

Which mmu notifier would care about this? I am not really familiar with
those users so I might miss something easily.

--
Michal Hocko
SUSE Labs

2016-12-16 12:13:09

by Michal Hocko

[permalink] [raw]
Subject: Re: crash during oom reaper

On Fri 16-12-16 12:42:43, Michal Hocko wrote:
> On Fri 16-12-16 13:44:38, Kirill A. Shutemov wrote:
> > On Fri, Dec 16, 2016 at 11:11:13AM +0100, Michal Hocko wrote:
> > > On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
> > > [...]
> > > > I don't think it's a bug in the OOM reaper itself, but either of the
> > > > following two patches will fix the problem (without my understand how or
> > > > why):
> > > >
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index ec9f11d4f094..37b14b2e2af4 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -485,7 +485,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk,
> > > > struct mm_struct *mm)
> > > > */
> > > > mutex_lock(&oom_lock);
> > > >
> > > > - if (!down_read_trylock(&mm->mmap_sem)) {
> > > > + if (!down_write_trylock(&mm->mmap_sem)) {
> > >
> > > __oom_reap_task_mm is basically the same thing as MADV_DONTNEED and that
> > > doesn't require the exlusive mmap_sem. So this looks correct to me.
> >
> > BTW, shouldn't we filter out all VM_SPECIAL VMAs there? Or VM_PFNMAP at
> > least.
> >
> > MADV_DONTNEED doesn't touch VM_PFNMAP, but I don't see anything matching
> > on __oom_reap_task_mm() side.
>
> I guess you are right and we should match the MADV_DONTNEED behavior
> here. Care to send a patch?
>
> > Other difference is that you use unmap_page_range() witch doesn't touch
> > mmu_notifiers. MADV_DONTNEED goes via zap_page_range(), which invalidates
> > the range. Not sure if it can make any difference here.
>
> Which mmu notifier would care about this? I am not really familiar with
> those users so I might miss something easily.

Just forgot to add: unlike with MADV_DONTNEED, there is nobody who should
observe the address space of the oom-killed (and reaped) task, so why
should notifiers matter in the first place?
--
Michal Hocko
SUSE Labs

2016-12-16 12:46:21

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: crash during oom reaper

On Fri, Dec 16, 2016 at 12:42:43PM +0100, Michal Hocko wrote:
> On Fri 16-12-16 13:44:38, Kirill A. Shutemov wrote:
> > On Fri, Dec 16, 2016 at 11:11:13AM +0100, Michal Hocko wrote:
> > > On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
> > > [...]
> > > > I don't think it's a bug in the OOM reaper itself, but either of the
> > > > following two patches will fix the problem (without my understand how or
> > > > why):
> > > >
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index ec9f11d4f094..37b14b2e2af4 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -485,7 +485,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk,
> > > > struct mm_struct *mm)
> > > > */
> > > > mutex_lock(&oom_lock);
> > > >
> > > > - if (!down_read_trylock(&mm->mmap_sem)) {
> > > > + if (!down_write_trylock(&mm->mmap_sem)) {
> > >
> > > __oom_reap_task_mm is basically the same thing as MADV_DONTNEED and that
> > > doesn't require the exlusive mmap_sem. So this looks correct to me.
> >
> > BTW, shouldn't we filter out all VM_SPECIAL VMAs there? Or VM_PFNMAP at
> > least.
> >
> > MADV_DONTNEED doesn't touch VM_PFNMAP, but I don't see anything matching
> > on __oom_reap_task_mm() side.
>
> I guess you are right and we should match the MADV_DONTNEED behavior
> here. Care to send a patch?

Below. Testing required.

> > Other difference is that you use unmap_page_range() witch doesn't touch
> > mmu_notifiers. MADV_DONTNEED goes via zap_page_range(), which invalidates
> > the range. Not sure if it can make any difference here.
>
> Which mmu notifier would care about this? I am not really familiar with
> those users so I might miss something easily.

No idea either.

Is there any reason not to use zap_page_range here too?

A few more notes:

I probably miss something, but why do we need details->ignore_dirty?
It is only applied to non-anon pages, but since we filter out shared
mappings, how can we have pte_dirty() for !PageAnon()?

check_swap_entries is also sloppy: the behavior doesn't match the comment,
since details == NULL makes it check swap entries. I removed it and restored
the details->check_mapping test as we had before.

After the change no user of zap_page_range() wants non-NULL details, so
I've dropped the argument.

If it looks okay, I'll split it into several patches with proper commit
messages.
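
In other words, the swap-entry check in zap_pte_range() changes roughly
like this (see the mm/memory.c hunk below):

/*
 * before:	if (details && !details->check_swap_entries)
 *			continue;
 *		details == NULL already meant "process swap entries",
 *		despite what the comment said
 *
 * after:	if (details)
 *			continue;
 *		plain zaps (details == NULL) always process swap entries;
 *		callers that pass details (unmap_mapping_range-style) never do
 */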

-----8<-----

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index ec1f0dedb948..59ac93714fa4 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -687,7 +687,7 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
/* Find vma in the parent mm */
vma = find_vma(gmap->mm, vmaddr);
size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
- zap_page_range(vma, vmaddr, size, NULL);
+ zap_page_range(vma, vmaddr, size);
}
up_read(&gmap->mm->mmap_sem);
}
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index e4f800999b32..4bfb31e79d5d 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -796,7 +796,7 @@ static noinline int zap_bt_entries_mapping(struct mm_struct *mm,
return -EINVAL;

len = min(vma->vm_end, end) - addr;
- zap_page_range(vma, addr, len, NULL);
+ zap_page_range(vma, addr, len);
trace_mpx_unmap_zap(addr, addr+len);

vma = vma->vm_next;
diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 3c71b982bf2a..d97f6725cf8c 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -629,7 +629,7 @@ static int binder_update_page_range(struct binder_proc *proc, int allocate,
page = &proc->pages[(page_addr - proc->buffer) / PAGE_SIZE];
if (vma)
zap_page_range(vma, (uintptr_t)page_addr +
- proc->user_buffer_offset, PAGE_SIZE, NULL);
+ proc->user_buffer_offset, PAGE_SIZE);
err_vm_insert_page_failed:
unmap_kernel_range((unsigned long)page_addr, PAGE_SIZE);
err_map_kernel_failed:
diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
index b653451843c8..0fb0e28ace70 100644
--- a/drivers/staging/android/ion/ion.c
+++ b/drivers/staging/android/ion/ion.c
@@ -865,8 +865,7 @@ static void ion_buffer_sync_for_device(struct ion_buffer *buffer,
list_for_each_entry(vma_list, &buffer->vmas, list) {
struct vm_area_struct *vma = vma_list->vma;

- zap_page_range(vma, vma->vm_start, vma->vm_end - vma->vm_start,
- NULL);
+ zap_page_range(vma, vma->vm_start, vma->vm_end - vma->vm_start);
}
mutex_unlock(&buffer->lock);
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4424784ac374..92dcada8caaf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1148,8 +1148,6 @@ struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
pgoff_t first_index; /* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
- bool ignore_dirty; /* Ignore dirty pages */
- bool check_swap_entries; /* Check also swap entries */
};

struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -1160,7 +1158,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
void zap_page_range(struct vm_area_struct *vma, unsigned long address,
- unsigned long size, struct zap_details *);
+ unsigned long size);
void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long start, unsigned long end);

diff --git a/mm/internal.h b/mm/internal.h
index 44d68895a9b9..5c355855e4ad 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,10 +41,9 @@ int do_swap_page(struct vm_fault *vmf);
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);

-void unmap_page_range(struct mmu_gather *tlb,
- struct vm_area_struct *vma,
- unsigned long addr, unsigned long end,
- struct zap_details *details);
+long madvise_dontneed(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end);

extern int __do_page_cache_readahead(struct address_space *mapping,
struct file *filp, pgoff_t offset, unsigned long nr_to_read,
diff --git a/mm/madvise.c b/mm/madvise.c
index 0e3828eae9f8..8c9f19b62b4a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -468,7 +468,7 @@ static long madvise_free(struct vm_area_struct *vma,
* An interface that causes the system to free clean pages and flush
* dirty pages is already available as msync(MS_INVALIDATE).
*/
-static long madvise_dontneed(struct vm_area_struct *vma,
+long madvise_dontneed(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{
@@ -476,7 +476,7 @@ static long madvise_dontneed(struct vm_area_struct *vma,
if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
return -EINVAL;

- zap_page_range(vma, start, end - start, NULL);
+ zap_page_range(vma, start, end - start);
return 0;
}

diff --git a/mm/memory.c b/mm/memory.c
index 455c3e628d52..f8836232a492 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1155,12 +1155,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,

if (!PageAnon(page)) {
if (pte_dirty(ptent)) {
- /*
- * oom_reaper cannot tear down dirty
- * pages
- */
- if (unlikely(details && details->ignore_dirty))
- continue;
force_flush = 1;
set_page_dirty(page);
}
@@ -1179,8 +1173,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
}
continue;
}
- /* only check swap_entries if explicitly asked for in details */
- if (unlikely(details && !details->check_swap_entries))
+ /* If details->check_mapping, we leave swap entries. */
+ if (unlikely(details))
continue;

entry = pte_to_swp_entry(ptent);
@@ -1277,7 +1271,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
return addr;
}

-void unmap_page_range(struct mmu_gather *tlb,
+static void unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
struct zap_details *details)
@@ -1381,7 +1375,7 @@ void unmap_vmas(struct mmu_gather *tlb,
* Caller must protect the VMA list
*/
void zap_page_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long size, struct zap_details *details)
+ unsigned long size)
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
@@ -1392,7 +1386,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
update_hiwater_rss(mm);
mmu_notifier_invalidate_range_start(mm, start, end);
for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
- unmap_single_vma(&tlb, vma, start, end, details);
+ unmap_single_vma(&tlb, vma, start, end, NULL);
mmu_notifier_invalidate_range_end(mm, start, end);
tlb_finish_mmu(&tlb, start, end);
}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ec9f11d4f094..f6451eacb0aa 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -465,8 +465,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
{
struct mmu_gather tlb;
struct vm_area_struct *vma;
- struct zap_details details = {.check_swap_entries = true,
- .ignore_dirty = true};
bool ret = true;

/*
@@ -481,7 +479,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
* out_of_memory
* select_bad_process
* # no TIF_MEMDIE task selects new victim
- * unmap_page_range # frees some memory
+ * madv_dontneed # frees some memory
*/
mutex_lock(&oom_lock);

@@ -510,16 +508,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)

tlb_gather_mmu(&tlb, mm, 0, -1);
for (vma = mm->mmap ; vma; vma = vma->vm_next) {
- if (is_vm_hugetlb_page(vma))
- continue;
-
- /*
- * mlocked VMAs require explicit munlocking before unmap.
- * Let's keep it simple here and skip such VMAs.
- */
- if (vma->vm_flags & VM_LOCKED)
- continue;
-
/*
* Only anonymous pages have a good chance to be dropped
* without additional steps which we cannot afford as we
@@ -531,8 +519,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
* count elevated without a good reason.
*/
if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
- unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
- &details);
+ madvise_dontneed(vma, &vma, vma->vm_start, vma->vm_end);
}
tlb_finish_mmu(&tlb, 0, -1);
pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
--
Kirill A. Shutemov

2016-12-16 12:57:01

by Michal Hocko

[permalink] [raw]
Subject: Re: crash during oom reaper

On Fri 16-12-16 15:35:55, Kirill A. Shutemov wrote:
> On Fri, Dec 16, 2016 at 12:42:43PM +0100, Michal Hocko wrote:
> > On Fri 16-12-16 13:44:38, Kirill A. Shutemov wrote:
> > > On Fri, Dec 16, 2016 at 11:11:13AM +0100, Michal Hocko wrote:
> > > > On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
> > > > [...]
> > > > > I don't think it's a bug in the OOM reaper itself, but either of the
> > > > > following two patches will fix the problem (without my understand how or
> > > > > why):
> > > > >
> > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > index ec9f11d4f094..37b14b2e2af4 100644
> > > > > --- a/mm/oom_kill.c
> > > > > +++ b/mm/oom_kill.c
> > > > > @@ -485,7 +485,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk,
> > > > > struct mm_struct *mm)
> > > > > */
> > > > > mutex_lock(&oom_lock);
> > > > >
> > > > > - if (!down_read_trylock(&mm->mmap_sem)) {
> > > > > + if (!down_write_trylock(&mm->mmap_sem)) {
> > > >
> > > > __oom_reap_task_mm is basically the same thing as MADV_DONTNEED and that
> > > > doesn't require the exlusive mmap_sem. So this looks correct to me.
> > >
> > > BTW, shouldn't we filter out all VM_SPECIAL VMAs there? Or VM_PFNMAP at
> > > least.
> > >
> > > MADV_DONTNEED doesn't touch VM_PFNMAP, but I don't see anything matching
> > > on __oom_reap_task_mm() side.
> >
> > I guess you are right and we should match the MADV_DONTNEED behavior
> > here. Care to send a patch?
>
> Below. Testing required.
>
> > > Other difference is that you use unmap_page_range() witch doesn't touch
> > > mmu_notifiers. MADV_DONTNEED goes via zap_page_range(), which invalidates
> > > the range. Not sure if it can make any difference here.
> >
> > Which mmu notifier would care about this? I am not really familiar with
> > those users so I might miss something easily.
>
> No idea either.
>
> Is there any reason not to use zap_page_range here too?

Yes, zap_page_range is much heavier and performs operations which might
block, AFAIR, which I would really like to avoid.

> Few more notes:
>
> I probably miss something, but why do we need details->ignore_dirty?
>
> It is only applied to non-anon pages, but since we filter out shared
> mappings, how can we have pte_dirty() for !PageAnon()?

Why couldn't we have dirty pages in private file mappings? The
underlying page might still be in the page cache, right?

> check_swap_entries is also sloppy: the behavior doesn't match the comment:
> details == NULL makes it check swap entries. I removed it and restore
> details->check_mapping test as we had before.

The reason is unmap_mapping_range, which didn't use to check swap entries,
so I wanted to make it opt-in, AFAIR.

> @@ -531,8 +519,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> * count elevated without a good reason.
> */
> if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> - unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> - &details);
> + madvise_dontneed(vma, &vma, vma->vm_start, vma->vm_end);

I would rather keep unmap_page_range because it is the bare minimum
we have to do. Currently we are doing

if (is_vm_hugetlb_page(vma))
continue;

so I would rather do something like
if (!can_vma_madv_dontneed(vma))
continue;
instead.
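
Something along these lines (untested sketch; the flag set just mirrors
the filter at the top of madvise_dontneed()):

static bool can_vma_madv_dontneed(struct vm_area_struct *vma)
{
	/* skip the same VMAs that MADV_DONTNEED refuses to touch */
	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
}

and the __oom_reap_task_mm() loop would keep unmap_page_range() as it is
today:

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (!can_vma_madv_dontneed(vma))
			continue;
		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
			unmap_page_range(&tlb, vma, vma->vm_start,
					 vma->vm_end, &details);
	}
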
--
Michal Hocko
SUSE Labs

2016-12-16 13:07:54

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: crash during oom reaper

On Fri, Dec 16, 2016 at 01:56:50PM +0100, Michal Hocko wrote:
> On Fri 16-12-16 15:35:55, Kirill A. Shutemov wrote:
> > On Fri, Dec 16, 2016 at 12:42:43PM +0100, Michal Hocko wrote:
> > > On Fri 16-12-16 13:44:38, Kirill A. Shutemov wrote:
> > > > On Fri, Dec 16, 2016 at 11:11:13AM +0100, Michal Hocko wrote:
> > > > > On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
> > > > > [...]
> > > > > > I don't think it's a bug in the OOM reaper itself, but either of the
> > > > > > following two patches will fix the problem (without my understand how or
> > > > > > why):
> > > > > >
> > > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > > index ec9f11d4f094..37b14b2e2af4 100644
> > > > > > --- a/mm/oom_kill.c
> > > > > > +++ b/mm/oom_kill.c
> > > > > > @@ -485,7 +485,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk,
> > > > > > struct mm_struct *mm)
> > > > > > */
> > > > > > mutex_lock(&oom_lock);
> > > > > >
> > > > > > - if (!down_read_trylock(&mm->mmap_sem)) {
> > > > > > + if (!down_write_trylock(&mm->mmap_sem)) {
> > > > >
> > > > > __oom_reap_task_mm is basically the same thing as MADV_DONTNEED and that
> > > > > doesn't require the exlusive mmap_sem. So this looks correct to me.
> > > >
> > > > BTW, shouldn't we filter out all VM_SPECIAL VMAs there? Or VM_PFNMAP at
> > > > least.
> > > >
> > > > MADV_DONTNEED doesn't touch VM_PFNMAP, but I don't see anything matching
> > > > on __oom_reap_task_mm() side.
> > >
> > > I guess you are right and we should match the MADV_DONTNEED behavior
> > > here. Care to send a patch?
> >
> > Below. Testing required.
> >
> > > > Other difference is that you use unmap_page_range() witch doesn't touch
> > > > mmu_notifiers. MADV_DONTNEED goes via zap_page_range(), which invalidates
> > > > the range. Not sure if it can make any difference here.
> > >
> > > Which mmu notifier would care about this? I am not really familiar with
> > > those users so I might miss something easily.
> >
> > No idea either.
> >
> > Is there any reason not to use zap_page_range here too?
>
> Yes, zap_page_range is much more heavy and performs operations which
> might lock AFAIR which I really would like to prevent from.

What exactly can block there? I don't see anything with that potential.

> > Few more notes:
> >
> > I propably miss something, but why do we need details->ignore_dirty?
> >
> > It only appiled for non-anon pages, but since we filter out shared
> > mappings, how can we have pte_dirty() for !PageAnon()?
>
> Why couldn't we have dirty pages on the private file mappings? The
> underlying page might be still in the page cache, right?

The check is about a dirty PTE, not a dirty page.

> > check_swap_entries is also sloppy: the behavior doesn't match the comment:
> > details == NULL makes it check swap entries. I removed it and restore
> > details->check_mapping test as we had before.
>
> the reason is unmap_mapping_range which didn't use to check swap entries
> so I wanted to have it opt in AFAIR.

details == NULL would give you that in both cases.

> > @@ -531,8 +519,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> > * count elevated without a good reason.
> > */
> > if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> > - unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> > - &details);
> > + madvise_dontneed(vma, &vma, vma->vm_start, vma->vm_end);
>
> I would rather keep the unmap_page_range because it is the bare minimum
> we have to do. Currently we are doing
>
> if (is_vm_hugetlb_page(vma))
> continue;
>
> so I would rather do something like
> if (!can_vma_madv_dontneed(vma))
> continue;
> instead.

We can do that.
But let's first understand why the code should differ from madvise_dontneed().
It's not obvious to me.

--
Kirill A. Shutemov

2016-12-16 13:14:31

by Michal Hocko

[permalink] [raw]
Subject: Re: crash during oom reaper

On Fri 16-12-16 16:07:30, Kirill A. Shutemov wrote:
> On Fri, Dec 16, 2016 at 01:56:50PM +0100, Michal Hocko wrote:
> > On Fri 16-12-16 15:35:55, Kirill A. Shutemov wrote:
> > > On Fri, Dec 16, 2016 at 12:42:43PM +0100, Michal Hocko wrote:
> > > > On Fri 16-12-16 13:44:38, Kirill A. Shutemov wrote:
> > > > > On Fri, Dec 16, 2016 at 11:11:13AM +0100, Michal Hocko wrote:
> > > > > > On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
> > > > > > [...]
> > > > > > > I don't think it's a bug in the OOM reaper itself, but either of the
> > > > > > > following two patches will fix the problem (without my understand how or
> > > > > > > why):
> > > > > > >
> > > > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > > > index ec9f11d4f094..37b14b2e2af4 100644
> > > > > > > --- a/mm/oom_kill.c
> > > > > > > +++ b/mm/oom_kill.c
> > > > > > > @@ -485,7 +485,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk,
> > > > > > > struct mm_struct *mm)
> > > > > > > */
> > > > > > > mutex_lock(&oom_lock);
> > > > > > >
> > > > > > > - if (!down_read_trylock(&mm->mmap_sem)) {
> > > > > > > + if (!down_write_trylock(&mm->mmap_sem)) {
> > > > > >
> > > > > > __oom_reap_task_mm is basically the same thing as MADV_DONTNEED and that
> > > > > > doesn't require the exlusive mmap_sem. So this looks correct to me.
> > > > >
> > > > > BTW, shouldn't we filter out all VM_SPECIAL VMAs there? Or VM_PFNMAP at
> > > > > least.
> > > > >
> > > > > MADV_DONTNEED doesn't touch VM_PFNMAP, but I don't see anything matching
> > > > > on __oom_reap_task_mm() side.
> > > >
> > > > I guess you are right and we should match the MADV_DONTNEED behavior
> > > > here. Care to send a patch?
> > >
> > > Below. Testing required.
> > >
> > > > > Other difference is that you use unmap_page_range() witch doesn't touch
> > > > > mmu_notifiers. MADV_DONTNEED goes via zap_page_range(), which invalidates
> > > > > the range. Not sure if it can make any difference here.
> > > >
> > > > Which mmu notifier would care about this? I am not really familiar with
> > > > those users so I might miss something easily.
> > >
> > > No idea either.
> > >
> > > Is there any reason not to use zap_page_range here too?
> >
> > Yes, zap_page_range is much more heavy and performs operations which
> > might lock AFAIR which I really would like to prevent from.
>
> What exactly can block there? I don't see anything with that potential.

I would have to remember all the details. This is mostly off-topic for
this particular thread, so I think it would be better if you could send a
full patch separately and we can discuss it there.
--
Michal Hocko
SUSE Labs

2016-12-16 13:15:11

by Vegard Nossum

[permalink] [raw]
Subject: Re: crash during oom reaper

On 12/16/2016 11:11 AM, Michal Hocko wrote:
> On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
> [...]
>> I don't think it's a bug in the OOM reaper itself, but either of the
>> following two patches will fix the problem (without my understand how or
>> why):
>>
> What is the actual crash?

Annoyingly it doesn't seem to reproduce with the very latest
linus/master, so maybe it's been fixed recently after all and I missed it.

I've started a bisect to see what fixed it. Just in case, I added 4
different crashes I saw with various kernels. I think there may have
been a few others too (I remember seeing one in a page fault path), but
these were the most frequent ones.


Vegard

--

Manifestation 1:

Out of memory: Kill process 1650 (trinity-main) score 90 or sacrifice child
Killed process 1724 (trinity-c14) total-vm:37280kB, anon-rss:236kB,
file-rss:112kB, shmem-rss:112kB
BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
IP: [<ffffffff8126b1c0>] copy_process.part.41+0x2150/0x5580
PGD c001067 PUD c000067
PMD 0
Oops: 0002 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
CPU: 28 PID: 1650 Comm: trinity-main Not tainted 4.9.0-rc6+ #317
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff88000f9bc440 task.stack: ffff88000c778000
RIP: 0010:[<ffffffff8126b1c0>] [<ffffffff8126b1c0>]
copy_process.part.41+0x2150/0x5580
RSP: 0018:ffff88000c77fc18 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88000fa11c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffff88000f2a33b0
RBP: ffff88000c77fdb0 R08: ffff88000c77f900 R09: 0000000000000002
R10: 00000000cb9401ca R11: 00000000c6eda739 R12: ffff88000f894d00
R13: ffff88000c7c4700 R14: ffff88000fa11c50 R15: ffff88000f2a3200
FS: 00007fb7d2a24700(0000) GS:ffff880011b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000001e8 CR3: 000000001010d000 CR4: 00000000000406e0
Stack:
0000000000000046 0000000001200011 ffffed0001f129ac ffff88000f894d60
0000000000000000 0000000000000000 ffff88000f894d08 ffff88000f894da0
ffff88000c7a8620 ffff88000f020318 ffff88000fa11c18 ffff88000f894e40
Call Trace:
[<ffffffff81269070>] ? __cleanup_sighand+0x50/0x50
[<ffffffff81fd552e>] ? memzero_explicit+0xe/0x10
[<ffffffff822cb592>] ? urandom_read+0x232/0x4d0
[<ffffffff8126e974>] _do_fork+0x1a4/0xa40
[<ffffffff8126e7d0>] ? fork_idle+0x180/0x180
[<ffffffff81002dba>] ? syscall_trace_enter+0x3aa/0xd40
[<ffffffff815179ea>] ? __context_tracking_exit.part.4+0x9a/0x1e0
[<ffffffff81002a10>] ? exit_to_usermode_loop+0x150/0x150
[<ffffffff8201df57>] ? check_preemption_disabled+0x37/0x1e0
[<ffffffff8126f2e7>] SyS_clone+0x37/0x50
[<ffffffff83caea50>] ? ptregs_sys_rt_sigreturn+0x10/0x10
[<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0
[<ffffffff83cae974>] entry_SYSCALL64_slow_path+0x25/0x25
Code: be 00 00 00 00 00 fc ff df 48 c1 e8 03 80 3c 30 00 74 08 4c 89 f7
e8 d0 7d 3c 00 f6 43 51 08 74 11 e8 45 fa 1d 00 48 8b 44 24 20 <f0> ff
88 e8 01 00 00 e8 34 fa 1d 00 48 8b 44 24 70 48 83 c0 60
RIP [<ffffffff8126b1c0>] copy_process.part.41+0x2150/0x5580
RSP <ffff88000c77fc18>
CR2: 00000000000001e8
---[ end trace b8f81ad60c106e75 ]---

Manifestation 2:

Killed process 1775 (trinity-c21) total-vm:37404kB, anon-rss:232kB,
file-rss:420kB, shmem-rss:116kB
oom_reaper: reaped process 1775 (trinity-c21), now anon-rss:0kB,
file-rss:0kB, shmem-rss:116kB
==================================================================
BUG: KASAN: use-after-free in p9_client_read+0x8f0/0x960 at addr
ffff880010284d00
Read of size 8 by task trinity-main/1649
CPU: 3 PID: 1649 Comm: trinity-main Not tainted 4.9.0+ #318
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
ffff8800068a7770 ffffffff82012301 ffff88001100f600 ffff880010284d00
ffff880010284d60 ffff880010284d00 ffff8800068a7798 ffffffff8165872c
ffff8800068a7828 ffff880010284d00 ffff88001100f600 ffff8800068a7818
Call Trace:
[<ffffffff82012301>] dump_stack+0x83/0xb2
[<ffffffff8165872c>] kasan_object_err+0x1c/0x70
[<ffffffff816589c5>] kasan_report_error+0x1f5/0x4e0
[<ffffffff81657d92>] ? kasan_slab_alloc+0x12/0x20
[<ffffffff82079357>] ? check_preemption_disabled+0x37/0x1e0
[<ffffffff81658e4e>] __asan_report_load8_noabort+0x3e/0x40
[<ffffffff82079300>] ? assoc_array_gc+0x1310/0x1330
[<ffffffff83b84c30>] ? p9_client_read+0x8f0/0x960
[<ffffffff83b84c30>] p9_client_read+0x8f0/0x960
[<ffffffff8207953c>] ? __this_cpu_preempt_check+0x1c/0x20
[<ffffffff81685388>] ? memcg_check_events+0x28/0x460
[<ffffffff83b84340>] ? p9_client_unlinkat+0x100/0x100
[<ffffffff8168f147>] ? mem_cgroup_commit_charge+0xb7/0x11b0
[<ffffffff815357f4>] ? __add_to_page_cache_locked+0x3a4/0x520
[<ffffffff815a6bdb>] ? __inc_node_state+0x6b/0xe0
[<ffffffff813727c7>] ? do_raw_spin_unlock+0x137/0x210
[<ffffffff820587d0>] ? iov_iter_bvec+0x30/0x120
[<ffffffff81b7d6ac>] v9fs_fid_readpage+0x15c/0x390
[<ffffffff81b7d550>] ? v9fs_write_end+0x410/0x410
[<ffffffff81571205>] ? __lru_cache_add+0x145/0x1f0
[<ffffffff815713d5>] ? lru_cache_add+0x15/0x20
[<ffffffff8153761b>] ? add_to_page_cache_lru+0x13b/0x280
[<ffffffff815374e0>] ? add_to_page_cache_locked+0x40/0x40
[<ffffffff8153786d>] ? __page_cache_alloc+0x10d/0x290
[<ffffffff81b7d91f>] v9fs_vfs_readpage+0x3f/0x50
[<ffffffff8153b188>] filemap_fault+0xbe8/0x1140
[<ffffffff8153af1d>] ? filemap_fault+0x97d/0x1140
[<ffffffff815d2eb6>] __do_fault+0x206/0x410
[<ffffffff815d2cb0>] ? do_page_mkwrite+0x320/0x320
[<ffffffff815ddc4c>] ? handle_mm_fault+0x1cc/0x2a60
[<ffffffff815df76f>] handle_mm_fault+0x1cef/0x2a60
[<ffffffff815ddbb2>] ? handle_mm_fault+0x132/0x2a60
[<ffffffff815dda80>] ? __pmd_alloc+0x370/0x370
[<ffffffff81302550>] ? dl_bw_of+0x80/0x80
[<ffffffff8123b5f0>] ? __do_page_fault+0x220/0x9f0
[<ffffffff815f1820>] ? find_vma+0x30/0x150
[<ffffffff8123b822>] __do_page_fault+0x452/0x9f0
[<ffffffff8123c075>] trace_do_page_fault+0x1e5/0x3a0
[<ffffffff8122e497>] do_async_page_fault+0x27/0xa0
[<ffffffff83d86c58>] async_page_fault+0x28/0x30
Object at ffff880010284d00, in cache kmalloc-96 size: 96
Allocated:
PID = 1649
[<ffffffff811db686>] save_stack_trace+0x16/0x20
[<ffffffff81657566>] save_stack+0x46/0xd0
[<ffffffff81657d4d>] kasan_kmalloc+0xad/0xe0
[<ffffffff81653532>] kmem_cache_alloc_trace+0x152/0x2c0
[<ffffffff83b7cc58>] p9_fid_create+0x58/0x3a0
[<ffffffff83b83a3d>] p9_client_walk+0xbd/0x7a0
[<ffffffff81b7df1c>] v9fs_file_open+0x38c/0x740
[<ffffffff816ab927>] do_dentry_open+0x5c7/0xc50
[<ffffffff816af4c5>] vfs_open+0x105/0x220
[<ffffffff816e07a0>] path_openat+0x8f0/0x2920
[<ffffffff816e539e>] do_filp_open+0x18e/0x250
[<ffffffff816c4403>] do_open_execat+0xe3/0x4c0
[<ffffffff816cad41>] do_execveat_common.isra.36+0x671/0x1d00
[<ffffffff816cce52>] SyS_execve+0x42/0x50
[<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0
[<ffffffff83d85b74>] return_from_SYSCALL_64+0x0/0x6a
Freed:
PID = 1280
[<ffffffff811db686>] save_stack_trace+0x16/0x20
[<ffffffff81657566>] save_stack+0x46/0xd0
[<ffffffff81657c61>] kasan_slab_free+0x71/0xb0
[<ffffffff8165506c>] kfree+0xfc/0x230
[<ffffffff83b7cb42>] p9_fid_destroy+0x1c2/0x280
[<ffffffff83b838bd>] p9_client_clunk+0xdd/0x1a0
[<ffffffff81b80694>] v9fs_dir_release+0x44/0x60
[<ffffffff816b9d67>] __fput+0x287/0x710
[<ffffffff816ba239>] delayed_fput+0x49/0x70
[<ffffffff812c6600>] process_one_work+0x8b0/0x14c0
[<ffffffff812c72fb>] worker_thread+0xeb/0x1210
[<ffffffff812da5a4>] kthread+0x244/0x2d0
[<ffffffff83d85d25>] ret_from_fork+0x25/0x30
Memory state around the buggy address:
ffff880010284c00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff880010284c80: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>ffff880010284d00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff880010284d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff880010284e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
==================================================================

Manifestation 3:

Out of memory: Kill process 1650 (trinity-main) score 91 or sacrifice child
Killed process 1731 (trinity-main) total-vm:37140kB, anon-rss:192kB,
file-rss:0kB, shmem-rss:0kB
==================================================================
BUG: KASAN: use-after-free in unlink_file_vma+0xa5/0xb0 at addr
ffff880006689db0
Read of size 8 by task trinity-main/1731
CPU: 5 PID: 1731 Comm: trinity-main Not tainted 4.9.0 #314
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
ffff880000aaf7f8 ffffffff81fb1ab1 ffff8800110ed500 ffff880006689c00
ffff880006689db8 ffff880000aaf998 ffff880000aaf820 ffffffff8162c5ac
ffff880000aaf8b0 ffff880006689c00Out of memory: Kill process 1650
(trinity-main) score 91 or sacrifice child
Killed process 1650 (trinity-main) total-vm:37140kB, anon-rss:192kB,
file-rss:140kB, shmem-rss:18632kB
oom_reaper: reaped process 1650 (trinity-main), now anon-rss:0kB,
file-rss:0kB, shmem-rss:18632kB
ffff8800110ed500 ffff880000aaf8a0
Call Trace:
[<ffffffff81fb1ab1>] dump_stack+0x83/0xb2
[<ffffffff8162c5ac>] kasan_object_err+0x1c/0x70
[<ffffffff8162c845>] kasan_report_error+0x1f5/0x4e0
[<ffffffff815afd80>] ? vm_normal_page_pmd+0x240/0x240
[<ffffffff8162ccce>] __asan_report_load8_noabort+0x3e/0x40
[<ffffffff815c4305>] ? unlink_file_vma+0xa5/0xb0
[<ffffffff815c4305>] unlink_file_vma+0xa5/0xb0
[<ffffffff815ad610>] free_pgtables+0x80/0x350
[<ffffffff815ca662>] exit_mmap+0x212/0x3d0
[<ffffffff815ca450>] ? SyS_munmap+0xa0/0xa0
[<ffffffff8130b2a5>] ? __might_sleep+0x95/0x1a0
[<ffffffff812684a0>] mmput+0x90/0x1c0
[<ffffffff8127e3dd>] do_exit+0x71d/0x2930
[<ffffffff815adeb0>] ? vm_normal_page+0x200/0x200
[<ffffffff8127dcc0>] ? mm_update_next_owner+0x710/0x710
[<ffffffff815b614f>] ? handle_mm_fault+0xcbf/0x2a60
[<ffffffff81298403>] ? __dequeue_signal+0x133/0x470
[<ffffffff81280768>] do_group_exit+0x108/0x330
[<ffffffff812a1d83>] get_signal+0x613/0x1390
[<ffffffff8135e912>] ? __lock_acquire.isra.32+0xc2/0x1a30
[<ffffffff811b1b9f>] do_signal+0x7f/0x18f0
[<ffffffff81237354>] ? __do_page_fault+0x474/0x9f0
[<ffffffff811b1b20>] ? setup_sigcontext+0x7d0/0x7d0
[<ffffffff8123719c>] ? __do_page_fault+0x2bc/0x9f0
[<ffffffff82017d27>] ? check_preemption_disabled+0x37/0x1e0
[<ffffffff81004ee8>] ? prepare_exit_to_usermode+0xb8/0xd0
[<ffffffff81237b94>] ? trace_do_page_fault+0x1f4/0x3a0
[<ffffffff81004ee8>] ? prepare_exit_to_usermode+0xb8/0xd0
[<ffffffff81229fa7>] ? do_async_page_fault+0x27/0xa0
[<ffffffff83c993d8>] ? async_page_fault+0x28/0x30
[<ffffffff81004ee8>] ? prepare_exit_to_usermode+0xb8/0xd0
[<ffffffff81002975>] exit_to_usermode_loop+0xb5/0x150
[<ffffffff81004ee8>] ? prepare_exit_to_usermode+0xb8/0xd0
[<ffffffff8100506e>] syscall_return_slowpath+0x16e/0x1a0
[<ffffffff83c98495>] ret_from_fork+0x15/0x30
Object at ffff880006689c00, in cache filp size: 440
Allocated:
PID = 1650
[<ffffffff811d77d6>] save_stack_trace+0x16/0x20
[<ffffffff8162b3e6>] save_stack+0x46/0xd0
[<ffffffff8162bbcd>] kasan_kmalloc+0xad/0xe0
[<ffffffff8162bc12>] kasan_slab_alloc+0x12/0x20
[<ffffffff81627195>] kmem_cache_alloc+0xf5/0x2c0
[<ffffffff8168aaa1>] get_empty_filp+0x91/0x3e0
[<ffffffff816affd2>] path_openat+0xb2/0x2920
[<ffffffff816b540e>] do_filp_open+0x18e/0x250
[<ffffffff816946f3>] do_open_execat+0xe3/0x4c0
[<ffffffff8169adb1>] do_execveat_common.isra.36+0x671/0x1d00
[<ffffffff8169cec2>] SyS_execve+0x42/0x50
[<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0
[<ffffffff83c982f4>] return_from_SYSCALL_64+0x0/0x6a
Freed:
PID = 2
[<ffffffff811d77d6>] save_stack_trace+0x16/0x20
[<ffffffff8162b3e6>] save_stack+0x46/0xd0
[<ffffffff8162bae1>] kasan_slab_free+0x71/0xb0
[<ffffffff81627ddf>] kmem_cache_free+0xaf/0x2a0
[<ffffffff8168a1b5>] file_free_rcu+0x65/0xa0
[<ffffffff8139bd87>] rcu_process_callbacks+0x9b7/0x10e0
[<ffffffff83c9ae01>] __do_softirq+0x1c1/0x5ba
Memory state around the buggy address:
ffff880006689c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff880006689d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff880006689d80: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
^
ffff880006689e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Manifestation 4:

Killed process 1650 (trinity-main) total-vm:37140kB, anon-rss:192kB,
file-rss:144kB, shmem-rss:18632kB
oom_reaper: reaped process 1650 (trinity-main), now anon-rss:0kB,
file-rss:0kB, shmem-rss:18632kB
==================================================================
BUG: KASAN: use-after-free in unlink_file_vma+0xa5/0xb0 at addr
ffff880006b523b0
Read of size 8 by task kworker/3:1/1344
CPU: 3 PID: 1344 Comm: kworker/3:1 Not tainted 4.9.0 #314
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
Workqueue: events mmput_async_fn
ffff88000d3bf978 ffffffff81fb1ab1 ffff8800110ed500 ffff880006b52200
ffff880006b523b8 ffff88000d3bfb18 ffff88000d3bf9a0/home/vegard/lin
ffffffff8162c5acux/init-trinity.
sh: line 127: 1 ffff88000d3bfa30650 Killed ffff880006b52200
/sbi ffff8800110ed500n/capsh --user=n ffff88000d3bfa20obody --caps= --
-c '/home/vegarCall Trace:
d/trinity/trinit [<ffffffff81fb1ab1>] dump_stack+0x83/0xb2
y -qq -C32 --ena [<ffffffff8162c5ac>] kasan_object_err+0x1c/0x70
ble-fds=pseudo - [<ffffffff8162c845>] kasan_report_error+0x1f5/0x4e0
cexecve,64 -V /p [<ffffffff815afd80>] ? vm_normal_page_pmd+0x240/0x240
roc/self/mem'
[<ffffffff8162ccce>] __asan_report_load8_noabort+0x3e/0x40
[<ffffffff815c4305>] ? unlink_file_vma+0xa5/0xb0
[<ffffffff815c4305>] unlink_file_vma+0xa5/0xb0
[<ffffffff815ad610>] free_pgtables+0x80/0x350
[<ffffffff815ca662>] exit_mmap+0x212/0x3d0
[<ffffffff815ca450>] ? SyS_munmap+0xa0/0xa0
[<ffffffff812bf478>] ? process_one_work+0x6e8/0x13a0
+ true [<ffffffff812682d1>] mmput_async_fn+0x61/0x1a0
[<ffffffff812bf54f>] process_one_work+0x7bf/0x13a0
[<ffffffff812bf478>] ? process_one_work+0x6e8/0x13a0
[<ffffffff812ebf84>] ? finish_task_switch+0x184/0x660
[<ffffffff812bed90>] ? __cancel_work+0x220/0x220
[<ffffffff812c021b>] worker_thread+0xeb/0x1150
[<ffffffff83c88471>] ? __schedule+0x461/0x17c0
[<ffffffff812d29f4>] kthread+0x244/0x2d0
[<ffffffff812c0130>] ? process_one_work+0x13a0/0x13a0

+ true
[<ffffffff812d27b0>] ? __kthread_create_on_node+0x380/0x380
[<ffffffff812d27b0>] ? __kthread_create_on_node+0x380/0x380
[<ffffffff812d27b0>] ? __kthread_create_on_node+0x380/0x380
[<ffffffff83c984a5>] ret_from_fork+0x25/0x30
Object at ffff880006b52200, in cache filp size: 440
Allocated:
PID = 1650
[<ffffffff811d77d6>] save_stack_trace+0x16/0x20
[<ffffffff8162b3e6>] save_stack+0x46/0xd0
[<ffffffff8162bbcd>] kasan_kmalloc+0xad/0xe0
[<ffffffff8162bc12>] kasan_slab_alloc+0x12/0x20
[<ffffffff81627195>] kmem_cache_alloc+0xf5/0x2c0
[<ffffffff8168aaa1>] get_empty_filp+0x91/0x3e0
[<ffffffff816affd2>] path_openat+0xb2/0x2920
[<ffffffff816b540e>] do_filp_open+0x18e/0x250
[<ffffffff816946f3>] do_open_execat+0xe3/0x4c0
[<ffffffff8169adb1>] do_execveat_common.isra.36+0x671/0x1d00
[<ffffffff8169cec2>] SyS_execve+0x42/0x50
[<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0
[<ffffffff83c982f4>] return_from_SYSCALL_64+0x0/0x6a
Freed:
PID = 0
[<ffffffff811d77d6>] save_stack_trace+0x16/0x20
[<ffffffff8162b3e6>] save_stack+0x46/0xd0
[<ffffffff8162bae1>] kasan_slab_free+0x71/0xb0
[<ffffffff81627ddf>] kmem_cache_free+0xaf/0x2a0
[<ffffffff8168a1b5>] file_free_rcu+0x65/0xa0
[<ffffffff8139bd87>] rcu_process_callbacks+0x9b7/0x10e0
[<ffffffff83c9ae01>] __do_softirq+0x1c1/0x5ba
Memory state around the buggy address:
ffff880006b52280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff880006b52300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff880006b52380: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
^
ffff880006b52400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff880006b52480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================

2016-12-16 14:01:23

by Michal Hocko

[permalink] [raw]
Subject: Re: crash during oom reaper

On Fri 16-12-16 14:14:17, Vegard Nossum wrote:
[...]
> Out of memory: Kill process 1650 (trinity-main) score 90 or sacrifice child
> Killed process 1724 (trinity-c14) total-vm:37280kB, anon-rss:236kB,
> file-rss:112kB, shmem-rss:112kB
> BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
> IP: [<ffffffff8126b1c0>] copy_process.part.41+0x2150/0x5580
> PGD c001067 PUD c000067
> PMD 0
> Oops: 0002 [#1] PREEMPT SMP KASAN
> Dumping ftrace buffer:
> (ftrace buffer empty)
> CPU: 28 PID: 1650 Comm: trinity-main Not tainted 4.9.0-rc6+ #317

Hmm, so this was the oom victim initially but we have decided to kill
its child 1724 instead.

> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> Ubuntu-1.8.2-1ubuntu1 04/01/2014
> task: ffff88000f9bc440 task.stack: ffff88000c778000
> RIP: 0010:[<ffffffff8126b1c0>] [<ffffffff8126b1c0>]
> copy_process.part.41+0x2150/0x5580

Could you match this to the kernel source please?

> RSP: 0018:ffff88000c77fc18 EFLAGS: 00010297
> RAX: 0000000000000000 RBX: ffff88000fa11c00 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffff88000f2a33b0
> RBP: ffff88000c77fdb0 R08: ffff88000c77f900 R09: 0000000000000002
> R10: 00000000cb9401ca R11: 00000000c6eda739 R12: ffff88000f894d00
> R13: ffff88000c7c4700 R14: ffff88000fa11c50 R15: ffff88000f2a3200
> FS: 00007fb7d2a24700(0000) GS:ffff880011b00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000001e8 CR3: 000000001010d000 CR4: 00000000000406e0
> Stack:
> 0000000000000046 0000000001200011 ffffed0001f129ac ffff88000f894d60
> 0000000000000000 0000000000000000 ffff88000f894d08 ffff88000f894da0
> ffff88000c7a8620 ffff88000f020318 ffff88000fa11c18 ffff88000f894e40
> Call Trace:
> [<ffffffff81269070>] ? __cleanup_sighand+0x50/0x50
> [<ffffffff81fd552e>] ? memzero_explicit+0xe/0x10
> [<ffffffff822cb592>] ? urandom_read+0x232/0x4d0
> [<ffffffff8126e974>] _do_fork+0x1a4/0xa40

and here we are copying pid 1650 into its child. I am wondering
whether that might be the killed child. But the child becomes visible
to the oom killer only very late during the fork.

> [<ffffffff8126e7d0>] ? fork_idle+0x180/0x180
> [<ffffffff81002dba>] ? syscall_trace_enter+0x3aa/0xd40
> [<ffffffff815179ea>] ? __context_tracking_exit.part.4+0x9a/0x1e0
> [<ffffffff81002a10>] ? exit_to_usermode_loop+0x150/0x150
> [<ffffffff8201df57>] ? check_preemption_disabled+0x37/0x1e0
> [<ffffffff8126f2e7>] SyS_clone+0x37/0x50
> [<ffffffff83caea50>] ? ptregs_sys_rt_sigreturn+0x10/0x10
> [<ffffffff8100524f>] do_syscall_64+0x1af/0x4d0
> [<ffffffff83cae974>] entry_SYSCALL64_slow_path+0x25/0x25
> Code: be 00 00 00 00 00 fc ff df 48 c1 e8 03 80 3c 30 00 74 08 4c 89 f7 e8
> d0 7d 3c 00 f6 43 51 08 74 11 e8 45 fa 1d 00 48 8b 44 24 20 <f0> ff 88 e8 01
> 00 00 e8 34 fa 1d 00 48 8b 44 24 70 48 83 c0 60
> RIP [<ffffffff8126b1c0>] copy_process.part.41+0x2150/0x5580
> RSP <ffff88000c77fc18>
> CR2: 00000000000001e8
> ---[ end trace b8f81ad60c106e75 ]---
>
> Manifestation 2:
>
> Killed process 1775 (trinity-c21) total-vm:37404kB, anon-rss:232kB,
> file-rss:420kB, shmem-rss:116kB
> oom_reaper: reaped process 1775 (trinity-c21), now anon-rss:0kB,
> file-rss:0kB, shmem-rss:116kB
> ==================================================================
> BUG: KASAN: use-after-free in p9_client_read+0x8f0/0x960 at addr
> ffff880010284d00
> Read of size 8 by task trinity-main/1649
> CPU: 3 PID: 1649 Comm: trinity-main Not tainted 4.9.0+ #318
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> Ubuntu-1.8.2-1ubuntu1 04/01/2014
> ffff8800068a7770 ffffffff82012301 ffff88001100f600 ffff880010284d00
> ffff880010284d60 ffff880010284d00 ffff8800068a7798 ffffffff8165872c
> ffff8800068a7828 ffff880010284d00 ffff88001100f600 ffff8800068a7818
> Call Trace:
> [<ffffffff82012301>] dump_stack+0x83/0xb2
> [<ffffffff8165872c>] kasan_object_err+0x1c/0x70
> [<ffffffff816589c5>] kasan_report_error+0x1f5/0x4e0
> [<ffffffff81657d92>] ? kasan_slab_alloc+0x12/0x20
> [<ffffffff82079357>] ? check_preemption_disabled+0x37/0x1e0
> [<ffffffff81658e4e>] __asan_report_load8_noabort+0x3e/0x40
> [<ffffffff82079300>] ? assoc_array_gc+0x1310/0x1330
> [<ffffffff83b84c30>] ? p9_client_read+0x8f0/0x960
> [<ffffffff83b84c30>] p9_client_read+0x8f0/0x960

No idea how we would end up with a use-after-free here. Even if I unmapped
the page, the read code should be able to cope with that. This smells
like a p9 issue to me.

[...]
> Manifestation 3:
>
> Out of memory: Kill process 1650 (trinity-main) score 91 or sacrifice child
> Killed process 1731 (trinity-main) total-vm:37140kB, anon-rss:192kB,
> file-rss:0kB, shmem-rss:0kB
> ==================================================================
> BUG: KASAN: use-after-free in unlink_file_vma+0xa5/0xb0 at addr
> ffff880006689db0
> Read of size 8 by task trinity-main/1731
> CPU: 5 PID: 1731 Comm: trinity-main Not tainted 4.9.0 #314
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> Ubuntu-1.8.2-1ubuntu1 04/01/2014
> ffff880000aaf7f8 ffffffff81fb1ab1 ffff8800110ed500 ffff880006689c00
> ffff880006689db8 ffff880000aaf998 ffff880000aaf820 ffffffff8162c5ac
> ffff880000aaf8b0 ffff880006689c00Out of memory: Kill process 1650
> (trinity-main) score 91 or sacrifice child
> Killed process 1650 (trinity-main) total-vm:37140kB, anon-rss:192kB,
> file-rss:140kB, shmem-rss:18632kB
> oom_reaper: reaped process 1650 (trinity-main), now anon-rss:0kB,
> file-rss:0kB, shmem-rss:18632kB
> ffff8800110ed500 ffff880000aaf8a0
> Call Trace:
> [<ffffffff81fb1ab1>] dump_stack+0x83/0xb2
> [<ffffffff8162c5ac>] kasan_object_err+0x1c/0x70
> [<ffffffff8162c845>] kasan_report_error+0x1f5/0x4e0
> [<ffffffff815afd80>] ? vm_normal_page_pmd+0x240/0x240
> [<ffffffff8162ccce>] __asan_report_load8_noabort+0x3e/0x40
> [<ffffffff815c4305>] ? unlink_file_vma+0xa5/0xb0
> [<ffffffff815c4305>] unlink_file_vma+0xa5/0xb0

Hmm, the oom reaper doesn't touch vma->vm_file, so I do not see how it
could be related to the activity of the reaper. I will have a closer
look.
--
Michal Hocko
SUSE Labs

2016-12-16 14:04:50

by Vegard Nossum

[permalink] [raw]
Subject: Re: crash during oom reaper

On 12/16/2016 02:14 PM, Vegard Nossum wrote:
> On 12/16/2016 11:11 AM, Michal Hocko wrote:
>> On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
>> [...]
>>> I don't think it's a bug in the OOM reaper itself, but either of the
>>> following two patches will fix the problem (without my understand how or
>>> why):
>>>
>> What is the atual crash?
>
> Annoyingly it doesn't seem to reproduce with the very latest
> linus/master, so maybe it's been fixed recently after all and I missed it.
>
> I've started a bisect to see what fixed it. Just in case, I added 4
> different crashes I saw with various kernels. I think there may have
> been a few others too (I remember seeing one in a page fault path), but
> these were the most frequent ones.

The bisect points to:

commit 6b94780e45c17b83e3e75f8aaca5a328db583c74
Author: Vincent Guittot <[email protected]>
Date: Thu Dec 8 17:56:54 2016 +0100

sched/core: Use load_avg for selecting idlest group

as fixing the crash, which seems odd to me. The only bit of the changelog
that sticks out to me is:

"""
For use case like hackbench, this enable the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.
"""

Reverting it from linus/master doesn't reintroduce the crash, but the
commit just before (6b94780e4^) does crash, so I'm not sure what's going
on. Maybe the crash is just really sensitive to scheduling decisions or
something.


Vegard

2016-12-16 14:26:13

by Vegard Nossum

[permalink] [raw]
Subject: Re: crash during oom reaper

On 12/16/2016 03:00 PM, Michal Hocko wrote:
> On Fri 16-12-16 14:14:17, Vegard Nossum wrote:
> [...]
>> Out of memory: Kill process 1650 (trinity-main) score 90 or sacrifice child
>> Killed process 1724 (trinity-c14) total-vm:37280kB, anon-rss:236kB,
>> file-rss:112kB, shmem-rss:112kB
>> BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
>> IP: [<ffffffff8126b1c0>] copy_process.part.41+0x2150/0x5580
>> PGD c001067 PUD c000067
>> PMD 0
>> Oops: 0002 [#1] PREEMPT SMP KASAN
>> Dumping ftrace buffer:
>> (ftrace buffer empty)
>> CPU: 28 PID: 1650 Comm: trinity-main Not tainted 4.9.0-rc6+ #317
>
> Hmm, so this was the oom victim initially but we have decided to kill
> its child 1724 instead.
>
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>> task: ffff88000f9bc440 task.stack: ffff88000c778000
>> RIP: 0010:[<ffffffff8126b1c0>] [<ffffffff8126b1c0>]
>> copy_process.part.41+0x2150/0x5580
>
> Could you match this to the kernel source please?

kernel/fork.c:629 dup_mmap()

It's atomic_dec(&inode->i_writecount), which matches up with
file_inode(file) == NULL:

(gdb) p &((struct inode *)0)->i_writecount
$1 = (atomic_t *) 0x1e8 <irq_stack_union+488>

>> Killed process 1775 (trinity-c21) total-vm:37404kB, anon-rss:232kB,
>> file-rss:420kB, shmem-rss:116kB
>> oom_reaper: reaped process 1775 (trinity-c21), now anon-rss:0kB,
>> file-rss:0kB, shmem-rss:116kB
>> ==================================================================
>> BUG: KASAN: use-after-free in p9_client_read+0x8f0/0x960 at addr
>> ffff880010284d00
>> Read of size 8 by task trinity-main/1649
>> CPU: 3 PID: 1649 Comm: trinity-main Not tainted 4.9.0+ #318
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>> ffff8800068a7770 ffffffff82012301 ffff88001100f600 ffff880010284d00
>> ffff880010284d60 ffff880010284d00 ffff8800068a7798 ffffffff8165872c
>> ffff8800068a7828 ffff880010284d00 ffff88001100f600 ffff8800068a7818
>> Call Trace:
>> [<ffffffff82012301>] dump_stack+0x83/0xb2
>> [<ffffffff8165872c>] kasan_object_err+0x1c/0x70
>> [<ffffffff816589c5>] kasan_report_error+0x1f5/0x4e0
>> [<ffffffff81657d92>] ? kasan_slab_alloc+0x12/0x20
>> [<ffffffff82079357>] ? check_preemption_disabled+0x37/0x1e0
>> [<ffffffff81658e4e>] __asan_report_load8_noabort+0x3e/0x40
>> [<ffffffff82079300>] ? assoc_array_gc+0x1310/0x1330
>> [<ffffffff83b84c30>] ? p9_client_read+0x8f0/0x960
>> [<ffffffff83b84c30>] p9_client_read+0x8f0/0x960
>
> no idea how we would end up with use after here. Even if I unmapped the
> page then the read code should be able to cope with that. This smells
> like a p9 issue to me.

This is fid->clnt dereference at the top of p9_client_read().

Ah, yes, this is the one coming from a page fault:

p9_client_read
v9fs_fid_readpage
v9fs_vfs_readpage
handle_mm_fault
__do_page_fault

the bad fid pointer is filp->private_data.

Hm, so I guess the file itself was NOT freed prematurely (as otherwise
we'd probably have seen a KASAN report for the filp->private_data
dereference), but the ->private_data itself was.

Maybe the whole thing is fundamentally a 9p bug and the OOM killer just
happens to trigger it.


Vegard

2016-12-16 14:41:15

by Michal Hocko

[permalink] [raw]
Subject: Re: crash during oom reaper

On Fri 16-12-16 15:25:27, Vegard Nossum wrote:
> On 12/16/2016 03:00 PM, Michal Hocko wrote:
> > On Fri 16-12-16 14:14:17, Vegard Nossum wrote:
> > [...]
> > > Out of memory: Kill process 1650 (trinity-main) score 90 or sacrifice child
> > > Killed process 1724 (trinity-c14) total-vm:37280kB, anon-rss:236kB,
> > > file-rss:112kB, shmem-rss:112kB
> > > BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
> > > IP: [<ffffffff8126b1c0>] copy_process.part.41+0x2150/0x5580
> > > PGD c001067 PUD c000067
> > > PMD 0
> > > Oops: 0002 [#1] PREEMPT SMP KASAN
> > > Dumping ftrace buffer:
> > > (ftrace buffer empty)
> > > CPU: 28 PID: 1650 Comm: trinity-main Not tainted 4.9.0-rc6+ #317
> >
> > Hmm, so this was the oom victim initially but we have decided to kill
> > its child 1724 instead.
> >
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > > Ubuntu-1.8.2-1ubuntu1 04/01/2014
> > > task: ffff88000f9bc440 task.stack: ffff88000c778000
> > > RIP: 0010:[<ffffffff8126b1c0>] [<ffffffff8126b1c0>]
> > > copy_process.part.41+0x2150/0x5580
> >
> > Could you match this to the kernel source please?
>
> kernel/fork.c:629 dup_mmap()

Ok, so this is before the child is made visible so the oom reaper
couldn't have seen it.

> it's atomic_dec(&inode->i_writecount), it matches up with
> file_inode(file) == NULL:
>
> (gdb) p &((struct inode *)0)->i_writecount
> $1 = (atomic_t *) 0x1e8 <irq_stack_union+488>

Is this a p9 inode?

>
> > > Killed process 1775 (trinity-c21) total-vm:37404kB, anon-rss:232kB,
> > > file-rss:420kB, shmem-rss:116kB
> > > oom_reaper: reaped process 1775 (trinity-c21), now anon-rss:0kB,
> > > file-rss:0kB, shmem-rss:116kB
> > > ==================================================================
> > > BUG: KASAN: use-after-free in p9_client_read+0x8f0/0x960 at addr
> > > ffff880010284d00
> > > Read of size 8 by task trinity-main/1649
> > > CPU: 3 PID: 1649 Comm: trinity-main Not tainted 4.9.0+ #318
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > > Ubuntu-1.8.2-1ubuntu1 04/01/2014
> > > ffff8800068a7770 ffffffff82012301 ffff88001100f600 ffff880010284d00
> > > ffff880010284d60 ffff880010284d00 ffff8800068a7798 ffffffff8165872c
> > > ffff8800068a7828 ffff880010284d00 ffff88001100f600 ffff8800068a7818
> > > Call Trace:
> > > [<ffffffff82012301>] dump_stack+0x83/0xb2
> > > [<ffffffff8165872c>] kasan_object_err+0x1c/0x70
> > > [<ffffffff816589c5>] kasan_report_error+0x1f5/0x4e0
> > > [<ffffffff81657d92>] ? kasan_slab_alloc+0x12/0x20
> > > [<ffffffff82079357>] ? check_preemption_disabled+0x37/0x1e0
> > > [<ffffffff81658e4e>] __asan_report_load8_noabort+0x3e/0x40
> > > [<ffffffff82079300>] ? assoc_array_gc+0x1310/0x1330
> > > [<ffffffff83b84c30>] ? p9_client_read+0x8f0/0x960
> > > [<ffffffff83b84c30>] p9_client_read+0x8f0/0x960
> >
> > no idea how we would end up with use after here. Even if I unmapped the
> > page then the read code should be able to cope with that. This smells
> > like a p9 issue to me.
>
> This is fid->clnt dereference at the top of p9_client_read().
>
> Ah, yes, this is the one coming from a page fault:
>
> p9_client_read
> v9fs_fid_readpage
> v9fs_vfs_readpage
> handle_mm_fault
> __do_page_fault
>
> the bad fid pointer is filp->private_data.
>
> Hm, so I guess the file itself was NOT freed prematurely (as otherwise
> we'd probably have seen a KASAN report for the filp->private_data
> dereference), but the ->private_data itself was.
>
> Maybe the whole thing is fundamentally a 9p bug and the OOM killer just
> happens to trigger it.

It smells like that.
--
Michal Hocko
SUSE Labs

2016-12-16 14:53:58

by Vegard Nossum

[permalink] [raw]
Subject: Re: crash during oom reaper

On 12/16/2016 03:32 PM, Michal Hocko wrote:
> On Fri 16-12-16 15:25:27, Vegard Nossum wrote:
>> On 12/16/2016 03:00 PM, Michal Hocko wrote:
>>> On Fri 16-12-16 14:14:17, Vegard Nossum wrote:
>>> [...]
>>>> Out of memory: Kill process 1650 (trinity-main) score 90 or sacrifice child
>>>> Killed process 1724 (trinity-c14) total-vm:37280kB, anon-rss:236kB,
>>>> file-rss:112kB, shmem-rss:112kB
>>>> BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
>>>> IP: [<ffffffff8126b1c0>] copy_process.part.41+0x2150/0x5580
>>>> PGD c001067 PUD c000067
>>>> PMD 0
>>>> Oops: 0002 [#1] PREEMPT SMP KASAN
>>>> Dumping ftrace buffer:
>>>> (ftrace buffer empty)
>>>> CPU: 28 PID: 1650 Comm: trinity-main Not tainted 4.9.0-rc6+ #317
>>>
>>> Hmm, so this was the oom victim initially but we have decided to kill
>>> its child 1724 instead.
>>>
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>> task: ffff88000f9bc440 task.stack: ffff88000c778000
>>>> RIP: 0010:[<ffffffff8126b1c0>] [<ffffffff8126b1c0>]
>>>> copy_process.part.41+0x2150/0x5580
>>>
>>> Could you match this to the kernel source please?
>>
>> kernel/fork.c:629 dup_mmap()
>
> Ok, so this is before the child is made visible so the oom reaper
> couldn't have seen it.
>
>> it's atomic_dec(&inode->i_writecount), it matches up with
>> file_inode(file) == NULL:
>>
>> (gdb) p &((struct inode *)0)->i_writecount
>> $1 = (atomic_t *) 0x1e8 <irq_stack_union+488>
>
> is this a p9 inode?

When I looked at this before, it always crashed in this spot for the very
first VMA in the mm (which happens to be the executable, which lives on a
9p root fs).
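
For context, a rough sketch of the file-handling block in dup_mmap()
(from memory of kernel/fork.c around this version, so treat the details
as approximate):

	file = tmp->vm_file;
	if (file) {
		struct inode *inode = file_inode(file);
		struct address_space *mapping = file->f_mapping;

		get_file(file);
		if (tmp->vm_flags & VM_DENYWRITE)
			atomic_dec(&inode->i_writecount);	/* NULL inode => fault at 0x1e8 */
		i_mmap_lock_write(mapping);
		/* ... link the new VMA into the file's rmap tree ... */
		i_mmap_unlock_write(mapping);
	}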

I added a trace_printk() to dup_mmap() to print inode->i_sb->s_type, and
the last thing I see before a new crash in the same place is:

trinity--9280 28.... 136345090us : copy_process.part.41: ffffffff8485ec40
---------------------------------
CPU: 0 PID: 9302 Comm: trinity-c0 Not tainted 4.9.0-rc8+ #332
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff880000070000 task.stack: ffff8800099e0000
RIP: 0010:[<ffffffff8126c7c9>] [<ffffffff8126c7c9>]
copy_process.part.41+0x22c9/0x55b0

As you can see, the addresses match:

(gdb) p &v9fs_fs_type
$1 = (struct file_system_type *) 0xffffffff8485ec40 <v9fs_fs_type>

So I think we can safely say that yes, it's a p9 inode.
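
The debugging patch itself is not part of this thread; one plausible
form of the instrumentation, purely as an illustration, would be
something like this inside the dup_mmap() loop:

	/* hypothetical reconstruction of the trace_printk() described above */
	if (file)
		trace_printk("%p\n", file_inode(file)->i_sb->s_type);

The printed pointer is then compared against &v9fs_fs_type in gdb as
shown above.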


Vegard

2016-12-18 13:49:50

by Tetsuo Handa

[permalink] [raw]
Subject: Re: crash during oom reaper

On 2016/12/16 22:14, Michal Hocko wrote:
> On Fri 16-12-16 16:07:30, Kirill A. Shutemov wrote:
>> On Fri, Dec 16, 2016 at 01:56:50PM +0100, Michal Hocko wrote:
>>> On Fri 16-12-16 15:35:55, Kirill A. Shutemov wrote:
>>>> On Fri, Dec 16, 2016 at 12:42:43PM +0100, Michal Hocko wrote:
>>>>> On Fri 16-12-16 13:44:38, Kirill A. Shutemov wrote:
>>>>>> On Fri, Dec 16, 2016 at 11:11:13AM +0100, Michal Hocko wrote:
>>>>>>> On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
>>>>>>> [...]
>>>>>>>> I don't think it's a bug in the OOM reaper itself, but either of the
>>>>>>>> following two patches will fix the problem (without my understanding how or
>>>>>>>> why):
>>>>>>>>
>>>>>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>>>>>>> index ec9f11d4f094..37b14b2e2af4 100644
>>>>>>>> --- a/mm/oom_kill.c
>>>>>>>> +++ b/mm/oom_kill.c
>>>>>>>> @@ -485,7 +485,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk,
>>>>>>>> struct mm_struct *mm)
>>>>>>>> */
>>>>>>>> mutex_lock(&oom_lock);
>>>>>>>>
>>>>>>>> - if (!down_read_trylock(&mm->mmap_sem)) {
>>>>>>>> + if (!down_write_trylock(&mm->mmap_sem)) {
>>>>>>>
>>>>>>> __oom_reap_task_mm is basically the same thing as MADV_DONTNEED and that
>>>>>>> doesn't require the exclusive mmap_sem. So this looks correct to me.
>>>>>>
>>>>>> BTW, shouldn't we filter out all VM_SPECIAL VMAs there? Or VM_PFNMAP at
>>>>>> least.
>>>>>>
>>>>>> MADV_DONTNEED doesn't touch VM_PFNMAP, but I don't see anything matching
>>>>>> on the __oom_reap_task_mm() side.
>>>>>
>>>>> I guess you are right and we should match the MADV_DONTNEED behavior
>>>>> here. Care to send a patch?
>>>>
>>>> Below. Testing required.
>>>>
>>>>>> Another difference is that you use unmap_page_range(), which doesn't touch
>>>>>> mmu_notifiers. MADV_DONTNEED goes via zap_page_range(), which invalidates
>>>>>> the range. Not sure if it can make any difference here.
>>>>>
>>>>> Which mmu notifier would care about this? I am not really familiar with
>>>>> those users so I might miss something easily.
>>>>
>>>> No idea either.
>>>>
>>>> Is there any reason not to use zap_page_range here too?
>>>
>>> Yes, zap_page_range is much heavier and performs operations which
>>> might block AFAIR, which I really would like to avoid.
>>
>> What exactly can block there? I don't see anything with that potential.
>
> I would have to remember all the details. This is mostly off-topic for
> this particular thread so I think it would be better if you could send a
> full patch separately and we can discuss it there?
>

zap_page_range() calls mmu_notifier_invalidate_range_start().
mmu_notifier_invalidate_range_start() calls __mmu_notifier_invalidate_range_start().
__mmu_notifier_invalidate_range_start() calls srcu_read_lock()/srcu_read_unlock().
This means that zap_page_range() might sleep.

I don't know what an individual notifier will do, but for example:

static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
	.invalidate_range_start = i915_gem_userptr_mn_invalidate_range_start,
};

i915_gem_userptr_mn_invalidate_range_start() calls flush_workqueue(),
which means that we can hit an OOM livelock if a work item involves memory allocation.
Some of the other notifiers call mutex_lock()/mutex_unlock().

Even if none of the currently in-tree notifier users block on memory
allocation, I think it is not guaranteed that future changes/users won't
block on memory allocation.
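
To make the chain above concrete, a simplified sketch of
__mmu_notifier_invalidate_range_start() as it looks in mm/mmu_notifier.c
around this time (details approximate): the registered callbacks run
under SRCU, and it is entirely up to each callback whether it blocks.

void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
					   unsigned long start, unsigned long end)
{
	struct mmu_notifier *mn;
	int id;

	id = srcu_read_lock(&srcu);
	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
		/* e.g. i915_gem_userptr_mn_invalidate_range_start(), which
		 * can end up in flush_workqueue() */
		if (mn->ops->invalidate_range_start)
			mn->ops->invalidate_range_start(mn, mm, start, end);
	}
	srcu_read_unlock(&srcu, id);
}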

2016-12-18 16:06:13

by Michal Hocko

[permalink] [raw]
Subject: Re: crash during oom reaper

On Sun 18-12-16 22:47:07, Tetsuo Handa wrote:
> On 2016/12/16 22:14, Michal Hocko wrote:
[...]
> > I would have to remember all the details. This is mostly off-topic for
> > this particular thread so I think it would be better if you could send a
> > full patch separately and we can discuss it there?
> >
>
> zap_page_range() calls mmu_notifier_invalidate_range_start().
> mmu_notifier_invalidate_range_start() calls __mmu_notifier_invalidate_range_start().
> __mmu_notifier_invalidate_range_start() calls srcu_read_lock()/srcu_read_unlock().
> This means that zap_page_range() might sleep.
>
> I don't know what an individual notifier will do, but for example:
>
> static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
> 	.invalidate_range_start = i915_gem_userptr_mn_invalidate_range_start,
> };
>
> i915_gem_userptr_mn_invalidate_range_start() calls flush_workqueue(),
> which means that we can hit an OOM livelock if a work item involves memory allocation.
> Some of the other notifiers call mutex_lock()/mutex_unlock().
>
> Even if none of the currently in-tree notifier users block on memory
> allocation, I think it is not guaranteed that future changes/users won't
> block on memory allocation.

Kirill has sent this as a separate patchset [1]. Could you follow up on
it there, please?

[1] http://lkml.kernel.org/r/[email protected]

--
Michal Hocko
SUSE Labs
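
As a point of reference for the VM_PFNMAP question raised earlier in the
thread: matching the MADV_DONTNEED behaviour means skipping the same VMAs
that madvise_dontneed() refuses to touch. A sketch of what such a filter
in __oom_reap_task_mm() could look like, as an assumption about the shape
of the fix rather than a quote from the patchset linked above:

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/* mirror the check at the top of madvise_dontneed()
		 * in mm/madvise.c; the actual patchset may differ */
		if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB | VM_PFNMAP))
			continue;

		/* ... existing unmapping of anonymous private memory via
		 * unmap_page_range() ... */
	}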