Subject: Re: [FIX PATCH 2/2] mm/page_alloc: Use accumulated load when building node fallback list
To: Bharata B Rao, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: akpm@linux-foundation.org, kamezawa.hiroyu@jp.fujitsu.com, lee.schermerhorn@hp.com, mgorman@suse.de, Krupa.Ramakrishnan@amd.com, Sadagopan.Srinivasan@amd.com
References: <20210830121603.1081-1-bharata@amd.com> <20210830121603.1081-3-bharata@amd.com>
From: Anshuman Khandual
Date: Fri, 3 Sep 2021 09:50:09 +0530
In-Reply-To: <20210830121603.1081-3-bharata@amd.com>
On 8/30/21 5:46 PM, Bharata B Rao wrote:
> From: Krupa Ramakrishnan
>
> In build_zonelists(), when the fallback list is built for the nodes,
> the node load gets reinitialized during each iteration. This results
> in nodes with the same distance occupying the same slot in different
> node fallback lists rather than appearing in the intended round-robin
> manner, so one node gets picked for allocation more often than other
> nodes at the same distance.
>
> As an example, consider a 4-node system with the following distance
> matrix.
>
> Node   0    1    2    3
> ------------------------
>  0    10   12   32   32
>  1    12   10   32   32
>  2    32   32   10   12
>  3    32   32   12   10
>
> For this case, the node fallback list gets built like this:
>
> Node   Fallback list
> ---------------------
>  0     0 1 2 3
>  1     1 0 3 2
>  2     2 3 0 1
>  3     3 2 0 1 <-- Unexpected fallback order
>
> In the fallback lists for nodes 2 and 3, nodes 0 and 1 appear in the
> same order, which results in more allocations getting satisfied from
> node 0 compared to node 1.
>
> The effect of this on remote memory bandwidth, as seen by the stream
> benchmark, is shown below:
>
> Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
>         (numactl -m 0,1 ./stream_lowOverhead ... --cores )
> Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
>         (numactl -m 2,3 ./stream_lowOverhead ... --cores )
>
> ----------------------------------------
>            BANDWIDTH (MB/s)
> TEST       Case 1       Case 2
> ----------------------------------------
> COPY       57479.6      110791.8
> SCALE      55372.9      105685.9
> ADD        50460.6       96734.2
> TRIADD     50397.6       97119.1
> ----------------------------------------
>
> The bandwidth drop in Case 1 occurs because most of the allocations
> get satisfied by node 0, as it appears first in the fallback order
> for both nodes 2 and 3.
>
> This can be fixed by accumulating the node load in build_zonelists()
> rather than reinitializing it during each iteration. With this, nodes
> with the same distance rightly get assigned in round-robin manner. In
> fact, this is how it worked originally, until commit f0c0b2b808f2
> ("change zonelist order: zonelist order selection logic") dropped the
> load accumulation and resorted to initializing the load during each
> iteration. While zonelist ordering was removed by commit c9bff3eebc09
> ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the
> node load accumulation in build_zonelists() remained. So essentially,
> this patch reverts to the accumulated node load logic.
>
> After this fix, the fallback order gets built like this:
>
> Node   Fallback list
> ---------------------
>  0     0 1 2 3
>  1     1 0 3 2
>  2     2 3 0 1
>  3     3 2 1 0 <-- Note the change here
>
> The bandwidth in Case 1 improves and matches Case 2, as shown below.
>
> ----------------------------------------
>            BANDWIDTH (MB/s)
> TEST       Case 1       Case 2
> ----------------------------------------
> COPY       110438.9     110107.2
> SCALE      105930.5     105817.5
> ADD         97005.1      96159.8
> TRIADD      97441.5      96757.1
> ----------------------------------------
>
> The correctness of the fallback list generation has been verified for
> the above node configuration, where node 3 starts as a memory-less
> node and comes up online only during memory hotplug.
>
> [bharata@amd.com: Added changelog, review, test validation]
>
> Fixes: f0c0b2b808f2 ("change zonelist order: zonelist order selection logic")
> Signed-off-by: Krupa Ramakrishnan
> Co-developed-by: Sadagopan Srinivasan
> Signed-off-by: Sadagopan Srinivasan
> Signed-off-by: Bharata B Rao
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 22f7ad6ec11c..47f4d160971e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6268,7 +6268,7 @@ static void build_zonelists(pg_data_t *pgdat)
>  		 */
>  		if (node_distance(local_node, node) !=
>  		    node_distance(local_node, prev_node))
> -			node_load[node] = load;
> +			node_load[node] += load;
>
>  		node_order[nr_nodes++] = node;
>  		prev_node = node;
>

Reviewed-by: Anshuman Khandual
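
FWIW, the round-robin effect of accumulating the load is easy to see
outside the kernel. The sketch below is my own simplified userspace
model, not the kernel sources: SCALE, next_best() and build_list() are
made-up stand-ins for the kernel's MAX_NODE_LOAD * MAX_NUMNODES
scaling, find_next_best_node() and build_zonelists(), and the CPU
penalty is dropped since it is uniform for this topology. It builds
the fallback lists for the quoted distance matrix with both variants:

/*
 * fallback_sim.c - userspace sketch of build_zonelists() fallback
 * ordering. Build with: cc -O2 -o fallback_sim fallback_sim.c
 */
#include <stdio.h>
#include <string.h>
#include <limits.h>

#define NR_NODES 4
#define SCALE    1000	/* keeps distance dominant over node_load */

/* Distance matrix from the changelog. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 12, 32, 32 },
	{ 12, 10, 32, 32 },
	{ 32, 32, 10, 12 },
	{ 32, 32, 12, 10 },
};

/* Persists across nodes, like the static node_load[] in mm/page_alloc.c. */
static int node_load[NR_NODES];

/* Pick the cheapest unused node, mimicking find_next_best_node(). */
static int next_best(int local, const int *used)
{
	int best = -1, min_val = INT_MAX;

	for (int n = 0; n < NR_NODES; n++) {
		if (used[n])
			continue;
		/* Distance dominates; "n < local" and node_load break ties. */
		int val = (distance[local][n] + (n < local)) * SCALE
			  + node_load[n];
		if (val < min_val) {
			min_val = val;
			best = n;
		}
	}
	return best;
}

/* Build one node's fallback list, mimicking build_zonelists(). */
static void build_list(int local, int accumulate)
{
	int used[NR_NODES] = { 0 };
	int load = NR_NODES, prev = local, node;

	printf("%d ->", local);
	while ((node = next_best(local, used)) >= 0) {
		/* Penalize the first node of each new distance group. */
		if (distance[local][node] != distance[local][prev]) {
			if (accumulate)
				node_load[node] += load;	/* patched */
			else
				node_load[node] = load;		/* current */
		}
		used[node] = 1;
		printf(" %d", node);
		prev = node;
		load--;
	}
	printf("\n");
}

int main(void)
{
	for (int accumulate = 0; accumulate <= 1; accumulate++) {
		printf("%s node_load:\n",
		       accumulate ? "accumulated" : "reinitialized");
		memset(node_load, 0, sizeof(node_load));
		for (int n = 0; n < NR_NODES; n++)
			build_list(n, accumulate);
	}
	return 0;
}

With this model, the "=" variant ends with 3 -> 3 2 0 1 while the "+="
variant ends with 3 -> 3 2 1 0, matching the two fallback tables in
the changelog.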