From: Matthew Wilcox <willy@infradead.org>
To: linux-mm@kvack.org
Cc: kernel-hardening@lists.openwall.com, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov"
Subject: [RFC] Warn the user when they could overflow mapcount
Date: Wed, 7 Feb 2018 18:11:12 -0800
Message-ID: <20180208021112.GB14918@bombadil.infradead.org>

Kirill and I were talking about trying to overflow page->_mapcount
the other day and realised that the default settings of pid_max and
max_map_count prevent it [1].  But there isn't even documentation to
warn a sysadmin that raising either of them opens the system up to a
sufficiently-determined attacker.

I'm not sufficiently wise in the ways of the MM to understand exactly
what goes wrong if we do wrap mapcount.  Kirill says: rmap depends on
mapcount to decide when the page is no longer mapped.  If it sees
page_mapcount() == 0 due to 32-bit wrap, we are screwed; data
corruption, etc.  That seems pretty bad.

So here's a patch which adds documentation to the two sysctls that a
sysadmin could use to shoot themselves in the foot, and adds a warning
if they change either of them to a dangerous value.  It's possible to
get into a dangerous situation without triggering this warning (already
have the file mapped a lot of times, then lower pid_max, then raise
max_map_count, then map the file a lot more times), but it's unlikely
to happen.

Comments?

[1] page->_mapcount counts the number of times that a page is mapped
to userspace; max_map_count restricts the number of mappings a process
can create, and pid_max restricts the number of processes that can
exist.  So a page's mapcount can never be larger than
pid_max * max_map_count.

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 412314eebda6..ec90cd633e99 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -718,6 +718,8 @@ pid_max:
 
 PID allocation wrap value.  When the kernel's next PID value
 reaches this value, it wraps back to a minimum PID value.
 PIDs of value pid_max or larger are not allocated.
+Increasing this value without decreasing vm.max_map_count may
+allow a hostile user to corrupt kernel memory.
 
 ==============================================================
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index ff234d229cbb..0ab306ea8f80 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -379,7 +379,8 @@ While most applications need less than a thousand maps, certain
 programs, particularly malloc debuggers, may consume lots of them,
 e.g., up to one or two maps per allocation.
 
-The default value is 65536.
+The default value is 65530.  Increasing this value without decreasing
+pid_max may allow a hostile user to corrupt kernel memory.
 
 =============================================================
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 173d2484f6e3..ebc301b21589 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -123,8 +123,6 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define MAPCOUNT_ELF_CORE_MARGIN	(5)
 #define DEFAULT_MAX_MAP_COUNT	(USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
 
-extern int sysctl_max_map_count;
-
 extern unsigned long sysctl_user_reserve_kbytes;
 extern unsigned long sysctl_admin_reserve_kbytes;
 
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 7633d55d9a24..7bb10c1b3be3 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -4,6 +4,8 @@
 
 #include <linux/rculist.h>
 
+extern int pid_max;
+
 enum pid_type
 {
 	PIDTYPE_PID,
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 992bc9948232..c939f403ad08 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -235,5 +235,9 @@ static inline void setup_sysctl_set(struct ctl_table_set *p,
 
 int sysctl_max_threads(struct ctl_table *table, int write,
 		       void __user *buffer, size_t *lenp, loff_t *ppos);
+int sysctl_pid_max(struct ctl_table *table, int write,
+		       void __user *buffer, size_t *lenp, loff_t *ppos);
+int sysctl_max_map_count(struct ctl_table *table, int write,
+		       void __user *buffer, size_t *lenp, loff_t *ppos);
 
 #endif /* _LINUX_SYSCTL_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index 5d30c87e3c42..9e230ae214c9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -61,6 +61,27 @@ int pid_max = PID_MAX_DEFAULT;
 int pid_max_min = RESERVED_PIDS + 1;
 int pid_max_max = PID_MAX_LIMIT;
 
+extern int max_map_count;
+
+int sysctl_pid_max(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int ret;
+
+	t = *table;
+	t.data = &pid_max;
+	t.extra1 = &pid_max_min;
+	t.extra2 = &pid_max_max;
+
+	ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (ret || !write)
+		return ret;
+
+	if (max_map_count && pid_max > INT_MAX / max_map_count)
+		pr_warn("pid_max is dangerously large\n");
+	return 0;
+}
+
 /*
  * PID-map pages start out as NULL, they get allocated upon
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0b53eef7d34b..e24becc39020 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -308,7 +308,6 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
-extern int pid_max;
 static int zero = 0;
 static struct ctl_table pid_ns_ctl_table[] = {
 	{
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2fb4e27c636a..a137acc0971f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -105,8 +105,6 @@ extern int core_uses_pid;
 extern char core_pattern[];
 extern unsigned int core_pipe_limit;
 #endif
-extern int pid_max;
-extern int pid_max_min, pid_max_max;
 extern int percpu_pagelist_fraction;
 extern int latencytop_enabled;
 extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
@@ -784,12 +782,10 @@ static struct ctl_table kern_table[] = {
 #endif
 	{
 		.procname	= "pid_max",
-		.data		= &pid_max,
-		.maxlen		= sizeof (int),
+		.data		= NULL,
+		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &pid_max_min,
-		.extra2		= &pid_max_max,
+		.proc_handler	= sysctl_pid_max,
 	},
 	{
 		.procname	= "panic_on_oops",
@@ -1454,11 +1450,10 @@ static struct ctl_table vm_table[] = {
 #ifdef CONFIG_MMU
 	{
 		.procname	= "max_map_count",
-		.data		= &sysctl_max_map_count,
-		.maxlen		= sizeof(sysctl_max_map_count),
+		.data		= NULL,
+		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &zero,
+		.proc_handler	= sysctl_max_map_count,
 	},
 #else
 	{
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2a6d0325a761..3e9d08a1416a 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -663,8 +663,6 @@ extern unsigned long tracing_thresh;
 
 /* PID filtering */
 
-extern int pid_max;
-
 bool trace_find_filtered_pid(struct trace_pid_list *filtered_pids,
 			     pid_t search_pid);
 bool trace_ignore_this_task(struct trace_pid_list *filtered_pids,
diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..23b014958eb9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -79,6 +79,7 @@ static inline void set_page_refcounted(struct page *page)
 }
 
 extern unsigned long highest_memmap_pfn;
+extern int max_map_count;
 
 /*
  * Maximum number of reclaim retries without progress before the OOM
diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922ea1a1..5b66a4a48192 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -147,7 +147,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	*prev = vma;
 
 	if (start != vma->vm_start) {
-		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
+		if (unlikely(mm->map_count >= max_map_count)) {
 			error = -ENOMEM;
 			goto out;
 		}
@@ -164,7 +164,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	}
 
 	if (end != vma->vm_end) {
-		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
+		if (unlikely(mm->map_count >= max_map_count)) {
 			error = -ENOMEM;
 			goto out;
 		}
diff --git a/mm/mmap.c b/mm/mmap.c
index 9efdc021ad22..9016dae43fee 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1355,7 +1355,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		return -EOVERFLOW;
 
 	/* Too many mappings? */
-	if (mm->map_count > sysctl_max_map_count)
+	if (mm->map_count > max_map_count)
 		return -ENOMEM;
 
 	/* Obtain the address to map to. we verify (or select) it and ensure
@@ -2546,7 +2546,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 /*
- * __split_vma() bypasses sysctl_max_map_count checking.  We use this where it
+ * __split_vma() bypasses max_map_count checking.  We use this where it
  * has already been checked or doesn't make sense to fail.
  */
 int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -2621,7 +2621,7 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	      unsigned long addr, int new_below)
 {
-	if (mm->map_count >= sysctl_max_map_count)
+	if (mm->map_count >= max_map_count)
 		return -ENOMEM;
 
 	return __split_vma(mm, vma, addr, new_below);
@@ -2672,7 +2672,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	 * not exceed its limit; but let map_count go just above
 	 * its limit temporarily, to help free resources as expected.
 	 */
-	if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
+	if (end < vma->vm_end && mm->map_count >= max_map_count)
 		return -ENOMEM;
 
 	error = __split_vma(mm, vma, start, 0);
@@ -2917,7 +2917,7 @@ static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long
 	if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
-	if (mm->map_count > sysctl_max_map_count)
+	if (mm->map_count > max_map_count)
 		return -ENOMEM;
 
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
@@ -3532,6 +3532,30 @@ void mm_drop_all_locks(struct mm_struct *mm)
 	mutex_unlock(&mm_all_locks_mutex);
 }
 
+int max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
+
+int sysctl_max_map_count(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int ret;
+	int min = 0;
+	int max = INT_MAX;
+
+	t = *table;
+	t.data = &max_map_count;
+	t.extra1 = &min;
+	t.extra2 = &max;
+
+	ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (ret || !write)
+		return ret;
+
+	if (max_map_count > INT_MAX / pid_max)
+		pr_warn("max_map_count is dangerously large\n");
+	return 0;
+}
+
 /*
  * initialise the percpu counter for VM
diff --git a/mm/mremap.c b/mm/mremap.c
index 049470aa1e3e..fdb1d71ab2cc 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -281,7 +281,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	 * We'd prefer to avoid failure later on in do_munmap:
 	 * which may split one vma into three before unmapping.
 	 */
-	if (mm->map_count >= sysctl_max_map_count - 3)
+	if (mm->map_count >= max_map_count - 3)
 		return -ENOMEM;
 
 	/*
diff --git a/mm/nommu.c b/mm/nommu.c
index 4b9864b17cb0..4cd9d4b9f473 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1487,7 +1487,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (vma->vm_file)
 		return -ENOMEM;
 
-	if (mm->map_count >= sysctl_max_map_count)
+	if (mm->map_count >= max_map_count)
 		return -ENOMEM;
 
 	region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c7dd9c86e353..9a2edf3925be 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -740,16 +740,14 @@ static inline void rmv_page_order(struct page *page)
 
 /*
  * This function checks whether a page is free && is the buddy
- * we can do coalesce a page and its buddy if
+ * we can coalesce a page and its buddy if
  * (a) the buddy is not in a hole (check before calling!) &&
 * (b) the buddy is in the buddy system &&
 * (c) a page and its buddy have the same order &&
 * (d) a page and its buddy are in the same zone.
 *
- * For recording whether a page is in the buddy system, we set ->_mapcount
- * PAGE_BUDDY_MAPCOUNT_VALUE.
- * Setting, clearing, and testing _mapcount PAGE_BUDDY_MAPCOUNT_VALUE is
- * serialized by zone->lock.
+ * For recording whether a page is in the buddy system, we set PG_buddy.
+ * Setting, clearing, and testing PG_buddy is serialized by zone->lock.
 *
 * For recording page's order, we use page_private(page).
 */
@@ -794,9 +792,8 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 * as necessary, plus some accounting needed to play nicely with other
 * parts of the VM system.
 * At each level, we keep a list of pages, which are heads of continuous
- * free pages of length of (1 << order) and marked with _mapcount
- * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
- * field.
+ * free pages of length of (1 << order) and marked with PageBuddy().
+ * Page's order is recorded in page_private(page) field.
 * So when we are allocating or freeing one, we can derive the state of the
 * other.  That is, if we allocate a small block, and both were
 * free, the remainder of the region must be split into blocks.
diff --git a/mm/util.c b/mm/util.c
index c1250501364f..2ac777548694 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -538,7 +538,6 @@ EXPORT_SYMBOL_GPL(__page_mapcount);
 int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
 int sysctl_overcommit_ratio __read_mostly = 50;
 unsigned long sysctl_overcommit_kbytes __read_mostly;
-int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
 unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
 unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
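
For concreteness, the bound in [1] can be checked with a few lines of
userspace C.  This is only an illustrative sketch, not kernel code:
mapcount_can_overflow() is a made-up helper, and the constants mirror
PID_MAX_DEFAULT and PID_MAX_LIMIT from include/linux/threads.h and
DEFAULT_MAX_MAP_COUNT from include/linux/mm.h.

#include <limits.h>
#include <stdio.h>

#define PID_MAX_DEFAULT		32768			/* default pid_max */
#define PID_MAX_LIMIT		(4 * 1024 * 1024)	/* pid_max ceiling */
#define DEFAULT_MAX_MAP_COUNT	(USHRT_MAX - 5)		/* 65530 */

/* True if pid_max * max_map_count could exceed INT_MAX, i.e. if a
 * sufficiently-determined attacker could wrap page->_mapcount. */
static int mapcount_can_overflow(int pid_max, int max_map_count)
{
	return max_map_count && pid_max > INT_MAX / max_map_count;
}

int main(void)
{
	/* Defaults: 32768 * 65530 = 2,147,287,040, just under
	 * INT_MAX (2,147,483,647), so the defaults are safe. */
	printf("defaults:              %s\n",
	       mapcount_can_overflow(PID_MAX_DEFAULT,
				     DEFAULT_MAX_MAP_COUNT) ?
	       "dangerous" : "safe");

	/* Raising pid_max to its ceiling without lowering max_map_count
	 * overshoots INT_MAX by a factor of more than a hundred:
	 * 4,194,304 * 65,530 is roughly 2.7e11. */
	printf("pid_max=PID_MAX_LIMIT: %s\n",
	       mapcount_can_overflow(PID_MAX_LIMIT,
				     DEFAULT_MAX_MAP_COUNT) ?
	       "dangerous" : "safe");
	return 0;
}

Note the division-based test: computing pid_max * max_map_count
directly could itself overflow an int, which is why both this sketch
and the handlers in the patch compare against INT_MAX divided by one
of the two factors instead.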