From: Ben Hutchings
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
CC: akpm@linux-foundation.org, Vlastimil Babka, Michal Hocko, Mel Gorman, Joonsoo Kim, David Rientjes, Linus Torvalds
Date: Sun, 11 Nov 2018 19:49:05 +0000
Subject: [PATCH 3.16 126/366] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset

3.16.61-rc1 review patch.  If anyone has any objections, please let me know.

------------------

From: Vlastimil Babka

commit 7810e6781e0fcbca78b91cf65053f895bf59e85f upstream.

In __alloc_pages_slowpath() we reset the zonelist and preferred_zoneref for allocations that can ignore memory policies. The zonelist is obtained from the current CPU's node. This is a problem for __GFP_THISNODE allocations that want to allocate on a different node, e.g. because the allocating thread has been migrated to a CPU on another node.

This has been observed to break SLAB in our 4.4-based kernel, because SLAB there relies on __GFP_THISNODE working as intended. If a slab page is put on the wrong node's list, further list manipulations may corrupt the list: page_to_nid() is used to determine which node's list_lock should be taken, so we may take the wrong lock and race.

The current SLAB implementation seems to be immune by luck, thanks to commit 511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on arbitrary node"), but there may be other callers assuming that __GFP_THISNODE works as promised.

We can fix it by simply removing the zonelist reset completely. There is actually no reason to reset it, because memory policies and cpusets don't affect the zonelist choice in the first place. This was different when commit 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK") introduced the code, as mempolicies then provided their own restricted zonelists.

We might consider this for 4.17, although I don't know if anything is currently broken. SLAB is currently not affected, but in kernels older than 4.7 that don't yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on arbitrary node") it is. That's at least 4.4 LTS; older ones I'll have to check. So stable backports should be more important, but they will have to be reviewed carefully, as the code went through many changes.
BTW I think that the ac->preferred_zoneref reset is also currently useless if we don't first reset ac->nodemask from a mempolicy to NULL (which we probably should do for OOM victims etc.?), but I would leave that for a separate patch.

Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
Acked-by: Mel Gorman
Cc: Michal Hocko
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
[bwh: Backported to 3.16: Resetting the zonelist may still be useful here, so keep doing it if __GFP_THISNODE is not used.]
Signed-off-by: Ben Hutchings
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2594,7 +2594,8 @@ rebalance:
 		 * the allocation is high priority and these type of
 		 * allocations are system rather than user orientated
 		 */
-		zonelist = node_zonelist(numa_node_id(), gfp_mask);
+		if (!(gfp_mask & __GFP_THISNODE))
+			zonelist = node_zonelist(numa_node_id(), gfp_mask);
 
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,