From: Feng Tang
To: Andrew Morton, Michal Hocko, David Rientjes, Tejun Heo, Zefan Li,
    Johannes Weiner, Mel Gorman, Vlastimil Babka, linux-mm@kvack.org,
    cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Feng Tang
Subject: [PATCH v3] mm/page_alloc: detect allocation forbidden by cpuset and bail out early
Date: Tue, 14 Sep 2021 11:40:28 +0800
Message-Id: <1631590828-25565-1-git-send-email-feng.tang@intel.com>
X-Mailer: git-send-email 2.7.4
X-Mailing-List: linux-kernel@vger.kernel.org

There was a report that starting Ubuntu in docker while using cpuset
to bind it to movable nodes (a
node that only has a movable zone, like a node for hotplug or a
Persistent Memory node in normal usage) will fail due to memory
allocation failure, after which the OOM killer is invoked and many
other innocent processes get killed. It can be reproduced with the
command:

  $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"

(node 4 is a movable node)

  runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
  CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
  Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
  Call Trace:
   dump_stack+0x6b/0x88
   dump_header+0x4a/0x1e2
   oom_kill_process.cold+0xb/0x10
   out_of_memory.part.0+0xaf/0x230
   out_of_memory+0x3d/0x80
   __alloc_pages_slowpath.constprop.0+0x954/0xa20
   __alloc_pages_nodemask+0x2d3/0x300
   pipe_write+0x322/0x590
   new_sync_write+0x196/0x1b0
   vfs_write+0x1c3/0x1f0
   ksys_write+0xa7/0xe0
   do_syscall_64+0x52/0xd0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Mem-Info:
  active_anon:392832 inactive_anon:182 isolated_anon:0
   active_file:68130 inactive_file:151527 isolated_file:0
   unevictable:2701 dirty:0 writeback:7
   slab_reclaimable:51418 slab_unreclaimable:116300
   mapped:45825 shmem:735 pagetables:2540 bounce:0
   free:159849484 free_pcp:73 free_cma:0
  Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
  Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0 0
  Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB

  oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
  Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
  oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
  oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The reason is that, in this case, the target cpuset nodes only have a
movable zone, while creating an OS in docker sometimes needs to
allocate memory from non-movable zones (dma/dma32/normal), e.g. for
GFP_HIGHUSER requests. The cpuset limit forbids such allocations, so
out-of-memory killing is invoked even though both normal nodes and
movable nodes have plenty of free memory.

The OOM killer cannot help to resolve the situation as there is no
usable memory for the request in the cpuset scope. The only reasonable
measure to take is to fail the allocation right away and have the
caller deal with it.
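
As an illustration of why such a request can never succeed, the zone
lookup can be modeled in userspace C. This is only a sketch under
simplifying assumptions, not kernel code: enum zone_idx, struct
node_model and request_can_succeed() are hypothetical names, and the
allowed nodes are bits in an unsigned long rather than a nodemask_t.
A request may only use zones up to its highest allowed zone index, and
a GFP_HIGHUSER request without __GFP_MOVABLE may not use the movable
zone.

```c
#include <stdbool.h>

/* Simplified zone ordering, lowest to highest (illustrative only). */
enum zone_idx { Z_DMA, Z_DMA32, Z_NORMAL, Z_MOVABLE, Z_COUNT };

/* Hypothetical model of a NUMA node: which zones it has populated. */
struct node_model {
	bool has_zone[Z_COUNT];
};

/*
 * Return true if any node permitted by the cpuset (bit n of
 * mems_allowed set => node n allowed) has a populated zone at or
 * below highest_idx, i.e. a zone the request is allowed to use.
 */
static bool request_can_succeed(const struct node_model *nodes, int nr_nodes,
				unsigned long mems_allowed,
				enum zone_idx highest_idx)
{
	for (int n = 0; n < nr_nodes; n++) {
		if (!(mems_allowed & (1UL << n)))
			continue;	/* node is outside the cpuset */
		for (int z = 0; z <= (int)highest_idx; z++)
			if (nodes[n].has_zone[z])
				return true;
	}
	return false;	/* no amount of reclaim or OOM killing helps */
}
```

With node 1 modeled as a movable-only node and mems_allowed restricted
to it, request_can_succeed(nodes, 2, 1UL << 1, Z_NORMAL) is false
regardless of how much memory is free, while the same call with
Z_MOVABLE is true.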

So add a check for cases like this in the slowpath of allocation, and
bail out early by returning NULL for the allocation.

As page allocation is one of the hottest paths in the kernel, this
check would hurt all users with a sane cpuset configuration. So add a
static branch check and detect the abnormal config in the cpuset memory
binding setup, so that the cost of the extra check in page allocation
is not paid by everyone.

[thanks to Michal Hocko and David Rientjes for suggesting not handling
 it inside the OOM code, adding the cpuset check, and refining comments]

Suggested-by: Michal Hocko
Signed-off-by: Feng Tang
---
Changelog:

  v3:
  * refine the movable_only_nodes() and the nodemask check in
    cpuset code (Michal Hocko)
  * fix a compiling problem (0day test robot)

  v2:
  * add a static branch detection in cpuset code to reduce
    the overhead in allocation hotpath (Michal Hocko)

  v1 (since RFC):
  * move the handling from oom code to page allocation
    path (Michal/David)

 include/linux/cpuset.h | 17 +++++++++++++++++
 include/linux/mmzone.h | 16 ++++++++++++++++
 kernel/cgroup/cpuset.c | 15 +++++++++++++++
 mm/page_alloc.c        | 13 +++++++++++++
 4 files changed, 61 insertions(+)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d2b9c41..d58e047 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -34,6 +34,8 @@
  */
 extern struct static_key_false cpusets_pre_enable_key;
 extern struct static_key_false cpusets_enabled_key;
+extern struct static_key_false cpusets_insane_config_key;
+
 static inline bool cpusets_enabled(void)
 {
 	return static_branch_unlikely(&cpusets_enabled_key);
@@ -51,6 +53,19 @@ static inline void cpuset_dec(void)
 	static_branch_dec_cpuslocked(&cpusets_pre_enable_key);
 }
 
+/*
+ * This will get enabled whenever a cpuset configuration is considered
+ * unsupportable in general. E.g. movable only node which cannot satisfy
+ * any non movable allocations (see update_nodemask). Page allocator
+ * needs to make additional checks for those configurations and this
+ * check is meant to guard those checks without any overhead for sane
+ * configurations.
+ */
+static inline bool cpusets_insane_config(void)
+{
+	return static_branch_unlikely(&cpusets_insane_config_key);
+}
+
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
@@ -167,6 +182,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 
 static inline bool cpusets_enabled(void) { return false; }
 
+static inline bool cpusets_insane_config(void) { return false; }
+
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d..a455333 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1220,6 +1220,22 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 #define for_each_zone_zonelist(zone, z, zlist, highidx) \
 	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
 
+/* Whether the 'nodes' are all movable nodes */
+static inline bool movable_only_nodes(nodemask_t *nodes)
+{
+	struct zonelist *zonelist;
+	struct zoneref *z;
+
+	if (nodes_empty(*nodes))
+		return false;
+
+	zonelist =
+	    &NODE_DATA(first_node(*nodes))->node_zonelists[ZONELIST_FALLBACK];
+	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
+	return (!z->zone) ? true : false;
+}
+
+
 #ifdef CONFIG_SPARSEMEM
 #include
 #endif
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index df1ccf4..7fa633e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -69,6 +69,13 @@
 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
 
+/*
+ * There could be abnormal cpuset configurations for cpu or memory
+ * node binding, add this key to provide a quick low-cost judgement
+ * of the situation.
+ */
+DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key);
+
 /* See "Frequency meter" comments, below. */
 
 struct fmeter {
@@ -1868,6 +1875,14 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	if (retval < 0)
 		goto done;
 
+	if (!cpusets_insane_config() &&
+		movable_only_nodes(&trialcs->mems_allowed)) {
+		static_branch_enable(&cpusets_insane_config_key);
+		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)! "
+			"Cpuset allocations might fail even with a lot of memory available.\n",
+			nodemask_pr_args(&trialcs->mems_allowed));
+	}
+
 	spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
 	spin_unlock_irq(&callback_lock);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c..a7e0854 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4914,6 +4914,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
+	/*
+	 * Check for insane configurations where the cpuset doesn't contain
+	 * any suitable zone to satisfy the request - e.g. non-movable
+	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
+	 */
+	if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
+		struct zoneref *z = first_zones_zonelist(ac->zonelist,
+					ac->highest_zoneidx,
+					&cpuset_current_mems_allowed);
+		if (!z->zone)
+			goto nopage;
+	}
+
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
 
-- 
2.7.4
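
The two-stage design of this patch (detect the problematic binding
once at cpuset setup, then consult a cheap flag in the allocation slow
path) can be modeled in userspace C. This is a rough sketch under
stated assumptions, not the kernel implementation: a plain bool stands
in for the static key (the kernel patches the branch in place instead
of testing a variable), node sets are bits in an unsigned long, and
movable_only(), update_mems() and should_bail() are hypothetical names
chosen for illustration.

```c
#include <stdbool.h>

/* Stand-in for cpusets_insane_config_key; starts disabled. */
static bool insane_config;

/*
 * True if 'mems' is non-empty and every node in it is a movable-only
 * node. Bit n of movable_only_nodes set => node n has only a movable
 * zone (assumed representation).
 */
static bool movable_only(unsigned long mems, unsigned long movable_only_nodes)
{
	if (!mems)
		return false;
	return (mems & ~movable_only_nodes) == 0;
}

/* Configuration-time path: runs rarely, so it can afford the scan. */
static void update_mems(unsigned long *cs_mems, unsigned long new_mems,
			unsigned long movable_only_nodes)
{
	/* One-way switch: the patch only ever enables the key. */
	if (!insane_config && movable_only(new_mems, movable_only_nodes))
		insane_config = true;
	*cs_mems = new_mems;
}

/*
 * Slow-path check: returns true if the allocation should fail right
 * away. needs_unmovable_zone models a request whose highest usable
 * zone is below the movable zone.
 */
static bool should_bail(bool hardwall, bool needs_unmovable_zone,
			unsigned long cur_mems,
			unsigned long movable_only_nodes)
{
	if (!insane_config || !hardwall)
		return false;	/* sane systems stop here */
	return needs_unmovable_zone &&
	       movable_only(cur_mems, movable_only_nodes);
}
```

The point of the split is that systems which never bind a cpuset to
movable-only nodes never reach the second test in should_bail(), which
mirrors how the static branch keeps the extra zonelist walk out of the
common path.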