From: wangyong
To: gregkh@linuxfoundation.org
Cc: jaewon31.kim@samsung.com, linux-kernel@vger.kernel.org,
	mhocko@kernel.org, stable@vger.kernel.org, wang.yong12@zte.com.cn,
	yongw.pur@gmail.com, Joonsoo Kim, Andrew Morton, Johannes Weiner,
	Minchan Kim, Mel Gorman, Linus Torvalds
Subject: [PATCH 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx
Date: Fri, 16 Sep 2022 10:05:47 -0700
Message-Id: <1663347949-20389-2-git-send-email-wang.yong12@zte.com.cn>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1663347949-20389-1-git-send-email-wang.yong12@zte.com.cn>
References: <1663347949-20389-1-git-send-email-wang.yong12@zte.com.cn>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Joonsoo Kim

Patch series "integrate classzone_idx and high_zoneidx", v5.

This patchset is a follow-up to the problem reported and discussed two
years ago [1, 2].  The problem this patchset solves is related to the
classzone_idx on NUMA systems.  It causes a problem when the lowmem
reserve protection exists for some zones on a node that do not exist on
other nodes.

This problem was reported two years ago, and, at that time, the solution
got general agreement [2].  But it was not upstreamed.

[1]: http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
[2]: http://lkml.kernel.org/r/1525408246-14768-1-git-send-email-iamjoonsoo.kim@lge.com

This patch (of 2):

Currently, we use classzone_idx to calculate lowmem reserve protection
for an allocation request.  This classzone_idx causes a problem on NUMA
systems when the lowmem reserve protection exists for some zones on a
node that do not exist on other nodes.

Before further explanation, I should first clarify how the classzone_idx
and the high_zoneidx are computed.

- ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
  represents the index of the highest zone the allocation can use

- classzone_idx was supposed to be the index of the highest zone on the
  local node that the allocation can use, that is actually available in
  the system

Consider the following example.  Node 0 has 4 populated zones,
DMA/DMA32/NORMAL/MOVABLE.  Node 1 has 1 populated zone, NORMAL.  Some
zones, such as MOVABLE, don't exist on node 1, and this makes the
following difference.

Assume that there is an allocation request whose gfp_zone(gfp_mask) is
the zone MOVABLE.  Then, its high_zoneidx is 3.  If this allocation is
initiated on node 0, its classzone_idx is 3 since the actually
available/usable zone on the local node (node 0) is MOVABLE.
If this allocation is initiated on node 1, its classzone_idx is 2 since
the actually available/usable zone on the local node (node 1) is NORMAL.

You can see that the classzone_idx of the allocation requests differs
according to their starting node, even if their high_zoneidx is the same.

Think more about these two allocation requests.  If they are processed
locally, there is no problem.  However, if the allocation initiated on
node 1 is processed remotely, in this example at the NORMAL zone on
node 0 due to memory shortage, a problem occurs.  Their different
classzone_idx leads to a different lowmem reserve and then a different
min watermark.  See the following example.

root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo
Node 0, zone      DMA
  per-node stats
...
  pages free     3965
        min      5
        low      8
        high     11
        spanned  4095
        present  3998
        managed  3977
        protection: (0, 2961, 4928, 5440)
...
Node 0, zone    DMA32
  pages free     757955
        min      1129
        low      1887
        high     2645
        spanned  1044480
        present  782303
        managed  758116
        protection: (0, 0, 1967, 2479)
...
Node 0, zone   Normal
  pages free     459806
        min      750
        low      1253
        high     1756
        spanned  524288
        present  524288
        managed  503620
        protection: (0, 0, 0, 4096)
...
Node 0, zone  Movable
  pages free     130759
        min      195
        low      326
        high     457
        spanned  1966079
        present  131072
        managed  131072
        protection: (0, 0, 0, 0)
...
Node 1, zone      DMA
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 1006, 1006)
Node 1, zone    DMA32
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 1006, 1006)
Node 1, zone   Normal
  per-node stats
...
  pages free     233277
        min      383
        low      640
        high     897
        spanned  262144
        present  262144
        managed  257744
        protection: (0, 0, 0, 0)
...
Node 1, zone  Movable
  pages free     0
        min      0
        low      0
        high     0
        spanned  262144
        present  0
        managed  0
        protection: (0, 0, 0, 0)

- static min watermark for the NORMAL zone on node 0 is 750.

- lowmem reserve for the request with classzone idx 3 at the NORMAL on
  node 0 is 4096.
- lowmem reserve for the request with classzone idx 2 at the NORMAL on
  node 0 is 0.

So, the overall min watermark is:

  allocation initiated on node 0 (classzone_idx 3): 750 + 4096 = 4846
  allocation initiated on node 1 (classzone_idx 2): 750 + 0 = 750

An allocation initiated on node 1 will take precedence over an
allocation initiated on node 0 because the min watermark of the former
is lower than that of the latter.  So, an allocation initiated on node 1
could succeed on node 0 when an allocation initiated on node 0 could
not, and this could cause too many numa_miss allocations.  Then,
performance could be degraded.

Recently, there was a regression report about this problem on the CMA
patches, since CMA memory is placed in ZONE_MOVABLE by those patches.  I
checked that the problem disappears with this fix, which uses
high_zoneidx for classzone_idx.

http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop

Using high_zoneidx for classzone_idx is more consistent than the
previous approach because the system's memory layout has no effect on
it.  With this patch, both classzone_idx values in the above example
will be 3, so both requests will have the same min watermark.

  allocation initiated on node 0: 750 + 4096 = 4846
  allocation initiated on node 1: 750 + 4096 = 4846

One could wonder if there is a side effect where an allocation initiated
on node 1 will face a higher bar when it is handled locally, since
classzone_idx could be higher than before.  This will not happen,
because a zone without managed pages doesn't contribute to
lowmem_reserve at all.
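
The watermark arithmetic above can be sketched in a few lines of C.  This
is a simplified illustration, not the kernel's code: struct zone_sketch,
node0_normal and effective_min_wmark() are hypothetical names introduced
here, and the real watermark check applies further adjustments that are
left out.  The numbers are taken from the zoneinfo output for the NORMAL
zone on node 0.

```c
#include <assert.h>

/* Zone indices for the four-zone layout used in the example. */
enum { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, NR_ZONES };

/* Hypothetical, simplified stand-in for struct zone. */
struct zone_sketch {
	long min_wmark;                /* "min" in /proc/zoneinfo */
	long lowmem_reserve[NR_ZONES]; /* "protection" in /proc/zoneinfo */
};

/* NORMAL zone on node 0, values copied from the zoneinfo output. */
static const struct zone_sketch node0_normal = {
	.min_wmark = 750,
	.lowmem_reserve = { 0, 0, 0, 4096 },
};

/*
 * Bar that free pages must clear for a request with the given
 * classzone_idx: static min watermark plus the lowmem reserve
 * selected by that index.
 */
static long effective_min_wmark(const struct zone_sketch *z, int classzone_idx)
{
	assert(classzone_idx >= 0 && classzone_idx < NR_ZONES);
	return z->min_wmark + z->lowmem_reserve[classzone_idx];
}
```

Calling effective_min_wmark(&node0_normal, ZONE_MOVABLE) gives 4846, and
effective_min_wmark(&node0_normal, ZONE_NORMAL) gives 750.  Before this
patch the node 1 request used index 2; with high_zoneidx as the
classzone_idx, both requests use index 3.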

Reported-by: Ye Xiaolong
Signed-off-by: Joonsoo Kim
Signed-off-by: Andrew Morton
Tested-by: Ye Xiaolong
Reviewed-by: Baoquan He
Acked-by: Vlastimil Babka
Acked-by: David Rientjes
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Mel Gorman
Link: http://lkml.kernel.org/r/1587095923-7515-1-git-send-email-iamjoonsoo.kim@lge.com
Link: http://lkml.kernel.org/r/1587095923-7515-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds
---
 mm/internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3a2e973..922a173 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -123,7 +123,7 @@ struct alloc_context {
 	bool spread_dirty_pages;
 };
 
-#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)
+#define ac_classzone_idx(ac) (ac->high_zoneidx)
 
 /*
  * Locate the struct page for both the matching buddy in our
-- 
2.7.4