From: sathyanarayanan.kuppuswamy@linux.intel.com
To: akpm@linux-foundation.org, urezki@gmail.com
Cc: dave.hansen@intel.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    sathyanarayanan.kuppuswamy@linux.intel.com
Subject: [PATCH v1 1/1] mm/vmalloc.c: Fix percpu free VM area search criteria
Date: Mon, 29 Jul 2019 16:21:39 -0700
Message-Id: <20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com>

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Recent changes to the vmalloc code in commit 68ad4a330433 ("mm/vmalloc.c: keep
track of free blocks for vmap allocation") can cause spurious percpu allocation
failures. These, in turn, can result in panic()s in the slub code. One such
panic was reported by Dave Hansen at https://lkml.org/lkml/2019/6/19/939.
Another related panic observed is:

 RIP: 0033:0x7f46f7441b9b
 Call Trace:
  dump_stack+0x61/0x80
  pcpu_alloc.cold.30+0x22/0x4f
  mem_cgroup_css_alloc+0x110/0x650
  cgroup_apply_control_enable+0x133/0x330
  cgroup_mkdir+0x41b/0x500
  kernfs_iop_mkdir+0x5a/0x90
  vfs_mkdir+0x102/0x1b0
  do_mkdirat+0x7d/0xf0
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The vmalloc memory manager divides the entire VMALLOC space (VMALLOC_START to
VMALLOC_END) into multiple VM areas (struct vmap_area), and it mainly uses two
lists (vmap_area_list & free_vmap_area_list) to track the used and free VM
areas in VMALLOC space. The pcpu_get_vm_areas(offsets[], sizes[], nr_vms,
align) function is used to allocate congruent VM areas for the percpu memory
allocator. In order not to conflict with regular VMALLOC users,
pcpu_get_vm_areas() allocates VM areas near the end of the VMALLOC space, so
the search for a free vm_area satisfying the given requirement starts near
VMALLOC_END and moves downwards towards VMALLOC_START.

Prior to commit 68ad4a330433, the search for a free vm_area in
pcpu_get_vm_areas() involved the following two main steps:

Step 1: Find an aligned "base" address near VMALLOC_END.
        va = free vm area near VMALLOC_END

Step 2: Loop through the requested vm_areas and check:

  Step 2.1: if (base < VMALLOC_START)
              1. fail with error
  Step 2.2: // end is offsets[area] + sizes[area]
            if (base + end > va->vm_end)
              1. Move the base downwards and repeat Step 2
  Step 2.3: if (base + start < va->vm_start)
              1. Move to the previous free vm_area node, find an aligned
                 base address and repeat Step 2

But commit 68ad4a330433 removed Step 2.2 and modified Step 2.3 as below:

  Step 2.3: if (base + start < va->vm_start || base + end > va->vm_end)
              1. Move to the previous free vm_area node, find an aligned
                 base address and repeat Step 2

This change is the root cause of the spurious percpu memory allocation
failures. For example, consider a case where a relatively large free vm_area
(~30 TB) is skipped during the search because it does not pass the
base + end > va->vm_end boundary check, even though moving the base downwards
within that area would have produced a fit. Skipping such large free vm_areas
can leave no suitable free vm_area between VMALLOC_START and VMALLOC_END,
which in turn leads to allocation failures.

So modify the search algorithm to reinstate Step 2.2.

Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 mm/vmalloc.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4fa8d84599b0..1faa45a38c08 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3269,10 +3269,20 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		if (va == NULL)
 			goto overflow;
 
+		/*
+		 * If required width exceeds current VA block, move
+		 * base downwards and then recheck.
+		 */
+		if (base + end > va->va_end) {
+			base = pvm_determine_end_from_reverse(&va, align) - end;
+			term_area = area;
+			continue;
+		}
+
 		/*
 		 * If this VA does not fit, move base downwards and recheck.
 		 */
-		if (base + start < va->va_start || base + end > va->va_end) {
+		if (base + start < va->va_start) {
 			va = node_to_va(rb_prev(&va->rb_node));
 			base = pvm_determine_end_from_reverse(&va, align) - end;
 			term_area = area;
-- 
2.21.0